Pandas, matplotlib and Ipython – all you need for great data anaylsis

I’ve been playing around a little with the stats-centric programming language R, mostly to get a better handle on the Bioconductor differential gene expression analysis packages edgeR and DEseq. The language is designed to allow for easy manipulation of tabular data as well as providing access to a rich library of statistical packages and graphing tools. One thing I learned was that often it was more efficient (at least for me) to spend a little time pre-formatting the data using python/perl before even getting started with R. The charting features of R are pretty cool, but once again I missed my familiar Python environment)-:

But, as always there is a Pythonic way, I came across a great book called Python for Data Analysis by Wes McKinney. The book introduced me to the pandas library, which contains R-like tools for tabular data manipulations and analyses. The book also introduced the Ipython development environment; basically a souped up feature rich but light weight “python IDLE”. The best features of Ipython for me are the logging capabilities, advanced history and de-bugging utilities – very cool! Ipython has been designed to work well with the matplotlib, thus allowing production and manipulation of nice looking graphs and charts within an interactive python environment. I’m still very much learning this (and I’m a hack programmer), but here is some fun data crunching based on a USDA food database wrangled into into Json format by Ashley Williams.

Ok, I think it would look better on one chart. Here is the top 50 protein containing foods with their fat content as part of one graph.

Note that I chopped the y-axis labels off, doh, no problem just modify them in real time using the slider editor!

Not too bad at all for a newbee. Hopefully by next post I will feel a little more comfortable to share some hard learned lessons on data manipulation using pandas.

I’m selling my Car!

Aside

car

FOR SALE: 2000 Isuzu Trooper LS 4×4, 3.5 L V6

GET READY FOR ROCHESTERS FAMOUS SNOWY WINTERS!

In good condition and very well maintained (Buy with confidence; always professionally serviced, full service history recorded, including major on-schedule maintenance (timing belt, transmission clean, brake servicing)).

Priced for a quick sale @ only $5400 obo

Bluebook price listed @ $6100. Sad sale )-: returning to Australia :-)
Air conditioning, On demand 4×4 , Cruise-control
Power: steering, windows, seats, mirrors, sunroof
Heated: Seats and mirrors
Privacy windows, large tinted sunroof and running boards
98.1K miles, Clean NY title (I own the car, no bank loans etc)
CONTACT DAVE TO ORGANIZE A TEST DRIVE TODAY!

dwheele9 at z.rochester.edu

My Review of Python for Data Analysis

Data processing heaven

By dave_au from Rochester on 7/7/2012

 

4out of 5

Pros: Well-written, Easy to understand, Helpful examples

Best Uses: Intermediate

I’ve only made it through the first few chapters, and the book is not complete, but I have been getting regular updates which is great. So far though it is well written and as far as I can tell there are very few technical errors.

It starts with a “Whetting your appetite” chapter where the author takes you on a journey through the possibilities of data analysis in Python; its a lot of fun.

The chapter on getting the most our of IPython has made this book worth the purchase price alone, its my new editor and I love it.

Highly recommended for Bioinformaticians or biologists who have to use R, but would prefer to stick with their beloved Python as much as possible.

(legalese)

Why perl for Science should die!

I think its fair to say the most popular programming/scripting language for Bioinformatics is Perl. My language of choice Python would certainly be coming in second. I reallly really don’t like perl, and I think it is a poor language for Science in general, here’s why!

1. Its hard to read

This is the biggest issue for me. Well commented Perl is actually pretty fun to read, its a punchy language so you don’t have to do too much scrolling to work out what is going on. However, the key point of that last comment is the word “comment”, poorly commented Perl scripts are a nightmare to read. And frankly, most of the scripts that I have read fall into this latter category. Turnover of people in Science is pretty high, so the idea using a language that makes it really hard for other people to just pick up where the other person left of is such a silly idea. The other key thing is that Science should be about transparency, and Perls silly syntax just adds another layer dust that hides mistakes and errors in logic. This is latter point is especially important when bioinformaticians are working with biologists who don’t know a computer language, essentially no oversight or review because the language is so god dam hard to understand unless you know the syntax.