Setting up VIM as a Python IDE

If you are looking for a lightweight but feature-rich editor you can’t go past VIM! Sure, those keyboard shortcuts are difficult to remember and it’s frustratingly hard trying to stop reaching for that bloody mouse (the mouse will actually work in this set-up BTW). However, what I have found is the best way to learn VIM (I don’t think you ever really learn vim) is to make it useful enough that you want to start using it; this helps build up the muscle memory required to become efficient [enough]. The ultimate payback is that with practice you can become really, really efficient relative to a user stuck with a mouse and a complex GUI. The mouse is so 1980s anyway 😛


So this post is about making VIM useful! The steps are based around Ubuntu/Debian Linux but should be adaptable for OSX and other NIX distros. Windows users, I have no idea what to tell you (-;

Of course nothing I ever do is original, and that especially goes for this post, which borrows heavily from an excellent (but old) PyCon Asia talk by Martin Brochhaus. The .vimrc and much of the instructions are taken from the talk, so feel free to watch it on YouTube. I’ve also tried to update some of the info and stick to the minimum to get you coding away ASAP.

Section 1: Basic VIM install

These steps will clone the current version of VIM, build it from source and enable additional features. If you already have a copy installed (i.e. vim starts when you type ‘vim’) I would first run ‘apt-get remove vim’ (or your equivalent). First, dependencies and some tools.

sudo apt-get update
sudo apt-get install mercurial curl python-pip flake8
sudo apt-get build-dep vim

#install python packages
sudo pip install ipdb
sudo pip install jedi
sudo pip install flake8

As we are going to use a bin folder in our home directory we need to put this location into our path so that our new version of vim will run from the command line.

#add a home bin dir to your path
nano ~/.bashrc

#add the following to the end of the file
if [ -d "$HOME/bin" ]; then
    PATH="$HOME/bin:$PATH"
fi

#save and exit

Now we make the ~/bin dir as well as an ~/opt dir for the vim install. Eventually we will use a symlink from ~/bin (now in our path) to the ~/opt folders in our home directory; the -p flag says ‘don’t complain if this dir is already there’. The final command reloads your .bashrc without you having to log out. Cool.

mkdir -p ~/bin
mkdir -p ~/opt
source ~/.bashrc

Now to install VIM.

cd ~/Desktop
hg clone vim
cd vim/src
./configure --enable-pythoninterp --with-features=huge --prefix=$HOME/opt/vim
make install
cd ~/bin
ln -s ~/opt/vim/bin/vim

#the following should return /home/YOUR_USERNAME/bin/vim
which vim


Section 2: The vimrc settings file

Now we need a “.vimrc file”. This file will contain our customisations (the dot says “I’m a hidden file”). You can get mine from here (download to your desktop or copy the contents into a blank txt file). It is worth looking at this file in a text editor as it contains some information on the key short-cuts and the plugins we are going to use (BTW in .vimrc files a double quote “ starts a comment (like #), and lines beginning with “” are commented-out code you can uncomment to enable something). The next bit of code simply copies the file to your home dir and then gets a custom colour scheme (mostly just to show how to do it).

mv ~/Desktop/vimrc.txt ~/.vimrc
mkdir -p ~/.vim/colors
cd ~/.vim/colors
wget -O wombat256mod.vim

Section 3: Getting set-up for the plugins with pathogen

This next part is important. To manage plugins we will use a bit of kit called pathogen. The plugins can then be installed (mostly by a git clone) right into “~/.vim/bundle/” (giving a folder structure like .vim/bundle/plugin-name) and pathogen will handle everything for us – awesome! Most plugin developers set up their folders to work nicely with pathogen to make life easy.

mkdir -p ~/.vim/autoload ~/.vim/bundle
curl -so ~/.vim/autoload/pathogen.vim

Section 4: Installing plugins

The first plugin we will use is called powerline; it adds features and a better-looking status line.

#install powerline plugin
cd ~/.vim/bundle
git clone git://

The next plugin allows code folding, to make it easier to look through long blocks of code. It’s simple to use: just type f to fold a section of code or F to fold everything. Type again to reverse the folding – sweet as.

# install folding plugin: use f to fold a block or F to fold all
mkdir -p ~/.vim/ftplugin
wget -O ~/.vim/ftplugin/python_editing.vim

The next one installs ctrlp, which allows nice fuzzy file searches from directly inside vim.

#install ctrlp
cd ~/.vim/bundle
git clone

Install the jedi plugin, which allows awesome auto-completion of commands and imports using [Ctrl][Space].

#install jedi plugin
git clone --recursive

Syntastic is another awesome plugin; it does syntax checking and will check your code for compliance with PEP8.

#install syntastic
git clone

Forget tabbing into the terminal to stage/commit/merge etc – just install git support with fugitive.

# install git support
cd ~/.vim/bundle
git clone git://
vim -u NONE -c "helptags vim-fugitive/doc" -c q

So that is it! Your editor should look something like this:


Well, that’s the install. The next post will be the shortcuts I use all the time, plus some that I want to learn, as a bookmark for myself. Otherwise check out the links to the plugins for detailed descriptions of what they do. But now you should be up and running with a pretty nice-looking IDE!

SAM / BAM files explained?


Let’s face it, there aren’t too many HTS pipelines that don’t touch a SAM/BAM file somewhere along the way. The name SAM comes from the acronym Sequence Alignment/Map format, and SAM files are really nothing more than tab-delimited text files describing mapping information for short-read data. Of course, with the millions of reads generated by next-gen sequencers these days, the files become really, really big. That is where BAM files come in: these contain the same information as SAM files but are encoded in a condensed, computer-readable binary format to save disk space.

If you open up a SAM file in a text editor (or use “head” in bash), the first part of the file typically contains lines beginning with the at (@) symbol, which represent the “header”. The header contains general information about the alignment, such as the reference name, reference length, MD5 checksum, the program used for the alignment, etc. Every line after the header represents a single read, with 11 mandatory tab-separated fields of information. Let’s see how this works with a really simple example.
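Those 11 fields are easy to pull apart with plain Python. A minimal sketch, using a made-up read line for illustration:

```python
# Split one (made-up) SAM read line into its 11 mandatory fields.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ",
              "CIGAR", "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

line = "read1\t83\tadh\t101\t42\t50M\t=\t201\t150\tACGTACGT\tIIIIIIII"
fields = dict(zip(SAM_FIELDS, line.rstrip("\n").split("\t")))

print(fields["FLAG"], fields["POS"], fields["CIGAR"])  # -> 83 101 50M
# Any optional tag fields (column 12 onwards) would follow the same split.
```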

What I’ve done is map four ‘pairs’ of reads to the alcohol dehydrogenase gene (my favourite gene) using bowtie2. The figure below shows the stranded nature of the read pairs as well as where they map on the gene.


I have set this up so that the reads fall into these three groupings.

  1. Two pairs of high quality reads that concordantly map
  2. A read pair where only one read is high quality
  3. An orphan read (the R1 read is only 1 nt long after trimming)

The next figure describes each of the mapping scenarios using colour coding. The red set of reads behaves as we would expect for paired-end data.


Where the reads map and the different read scenarios

For the alignment I’m going to use bowtie2; all of the sequences are here if you want to do this yourself.

#index the reference and specify the prefix as reference
bowtie2-build reference.fa reference

#map the reads
bowtie2 reference -1 test_R1.fq -2 test_R2.fq > adh.sam

####bowtie output###
Warning: skipping mate #1 of read 'HWI-7001326F:39:C7N3UANXX:1:1101:10000:62296/1' because length (1) <= # seed mismatches (0)
Warning: skipping mate #1 of read 'HWI-7001326F:39:C7N3UANXX:1:1101:10000:62296/1' because it was < 2 characters long
4 reads; of these:
  4 (100.00%) were paired; of these:
    1 (25.00%) aligned concordantly 0 times
    3 (75.00%) aligned concordantly exactly 1 time
    0 (0.00%) aligned concordantly >1 times
    1 pairs aligned concordantly 0 times; of these:
      0 (0.00%) aligned discordantly 1 time
    1 pairs aligned 0 times concordantly or discordantly; of these:
      2 mates make up the pairs; of these:
        1 (50.00%) aligned 0 times
        1 (50.00%) aligned exactly 1 time
        0 (0.00%) aligned >1 times
87.50% overall alignment rate

The “adh.sam” file can be opened with any text editor, but for most of this post I’m going to use the spreadsheet version just to easily work with all the columns. Note the header lines indicated with the “@”. In this case the @SQ line relates to the reference and @PG relates to the program parameters. Each header line has fields in the format tag:value. For example, the @SQ line has SN (SN = reference sequence name) and LN (LN = reference sequence length). A full set of these headers can be found in the SAM specs.

@HD	VN:1.0	SO:unsorted
@SQ	SN:gi|568815594:c99321442-99306387	LN:1890
@PG	ID:bowtie2	PN:bowtie2	VN:2.1.0
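That tag:value structure is easy to parse yourself. A quick sketch in Python, using the header lines above:

```python
# Parse @ header lines into (record type, {tag: value}) pairs.
header_lines = [
    "@HD\tVN:1.0\tSO:unsorted",
    "@SQ\tSN:gi|568815594:c99321442-99306387\tLN:1890",
    "@PG\tID:bowtie2\tPN:bowtie2\tVN:2.1.0",
]

parsed = []
for line in header_lines:
    record_type, *rest = line.split("\t")
    # split on the first ":" only -- values (like the SN here) may contain colons
    tags = dict(field.split(":", 1) for field in rest)
    parsed.append((record_type, tags))

# e.g. the reference length from the @SQ line:
print(dict(parsed)["@SQ"]["LN"])  # -> 1890
```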

Below the header each read is represented by a line; it is worth noting that even read “62296/1”, which shouldn’t map since it is only 1 nt long, also gets a line. Two other things to note: firstly, the reads can be ordered by reference coordinate, which is the default Tophat behaviour; alternatively, they can be listed in pairs. Samtools can switch between the two orderings using the “sort” command. Why does this matter? Some software needs to know how the reads are sorted; for example, htseq-count expects reads ordered in pairs, because mates can map a long way apart and hunting for a read’s mate in a coordinate-sorted file increases memory requirements.

OK, the first four columns. The QNAME (read name) and RNAME (reference name) are pretty self-explanatory. The FLAG uses information-dense bitwise encoding to describe the alignment: the strand, whether the pair mapped, whether this is the first or second read of the pair, and so on. If, like me, you find the numbers cryptic, you’ll be thankful the Broad has a nifty website that translates them into plain English. For example, 83 represents “read paired”, “read mapped in proper pair”, “read reverse strand”, “first in pair”. The 163 is “read paired”, “read mapped in proper pair”, “mate reverse strand” and “second in pair”. I think that makes sense. For reference, the other two numbers are 137 (read paired, mate unmapped, second in pair) and 69 (read paired, read unmapped, first in pair). Very cool. The final field here, POS, corresponds to the alignment coordinate. We can confirm the reads are ordered R1 then R2 based on the mapping coordinates (see the figure at the top of the page).
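The FLAG is just a sum of powers of two, so you can do the same translation as the Broad’s website yourself by testing bits (bit names from the SAM spec):

```python
# Decode a SAM FLAG by testing each bit (names from the SAM spec).
FLAG_BITS = [
    (0x1,  "read paired"),
    (0x2,  "read mapped in proper pair"),
    (0x4,  "read unmapped"),
    (0x8,  "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
]

def decode_flag(flag):
    return [name for bit, name in FLAG_BITS if flag & bit]

print(decode_flag(83))
# -> ['read paired', 'read mapped in proper pair', 'read reverse strand', 'first in pair']
```

Running it on the other flags from the table (163, 137, 69) reproduces the descriptions above.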



The next fields worth looking at are MAPQ and the CIGAR string. The mapping quality is a log-transformed probability of mapping to the wrong position. Interestingly, the low-quality read (fastq quality “$”) has the same mapping score as the high-quality reads. I gather this is because the read can really only go one place when there is a 100% match and the reference is as short as this one. The CIGAR string here is “50M”: “M” for “alignment match” and 50 for the length of the alignment. A “*” indicates that no information is available. PNEXT is the position of the mate pair, and TLEN is the observed template length, in other words the length of the sequenced fragment represented by the pair.
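Concretely, MAPQ = -10*log10(P(mapping position is wrong)), so converting back to a probability is a one-liner:

```python
def mapq_to_error_prob(mapq):
    """Invert MAPQ = -10*log10(P) to get the probability the position is wrong."""
    return 10 ** (-mapq / 10)

print(mapq_to_error_prob(10))  # -> 0.1, i.e. a 10% chance the position is wrong
print(mapq_to_error_prob(42))  # roughly 6e-05 for a top-scoring unique hit
```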


Finally, we have sets of tags that provide other information on the alignment. Things like the number of ambiguous bases or gaps etc. Checkout the documentation of your aligner to find out what each tag defines if you are interested.




New Zealanders and their sheep – part 2

Ok, based on the graphs in the last post NZ is slowly being cow-a-fyed, so what’s driving this trend? Well, google will tell you that it’s …

WARNING: this data is dodgy, but I’m really just using it to demonstrate how cool pandas is. So I found some information on milk and lamb meat prices; we’ll load them up as dataframes and work out the percent change since 1994 like we did before. We’ll try out the datetime functionality of pandas, which is really quite nice. But first, let’s import our table from the last post and make the year the index so we can easily merge the new data.

import pandas as pd
per_decline = pd.DataFrame(pd.read_csv('percent_decline.csv'))
cols = per_decline.columns.values
cols[0] = 'Year'
per_decline.columns = cols
per_decline.index = per_decline['Year']
per_decline = per_decline.ix[:,1:] #all rows, skip first column (date is now the index)

	Total beef cattle	Total dairy cattle	Total sheep	Total deer	Total pigs	Total horses
1994	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000
2002	-11.025827	 34.444950	-20.002034	 33.858009	-19.100637	 11.811093
2003	 -8.344764	 32.882482	-20.041908	 37.229441	-10.766476	 18.504488
2004	-11.895128	 34.207998	-20.609926	 42.707754	 -8.072078	 13.376472
2005	-12.366101	 32.506699	-19.379727	 38.499840	-19.230733	-100.000000

Now we are going to create a table from the dodgy lamb price data, this table is in a slightly different format so we will have to use the groupby method to wrangle it into the shape we need.

lamb_data = pd.DataFrame(pd.read_excel('lamb_usd.xlsx',sheetname='lamb'))
	Month	Price	Change
0	 Apr 1994	 130.00	 -
1	 May 1994	 126.59	 -2.62 %
2	 Jun 1994	 127.03	 0.35 %
3	 Jul 1994	 126.11	 -0.72 %
4	 Aug 1994	 119.62	 -5.15 %

Now to use datetime to make an index based on the month data.

lamb_data.index = pd.to_datetime(lamb_data['Month'])
lamb_data=lamb_data.ix[:,1:2] #just grab the price
1994-04-02	 130.00
1994-05-02	 126.59
1994-06-02	 127.03
1994-07-02	 126.11
1994-08-02	 119.62

Pandas did a good job of converting the date format into a datetime index. As you’ll see in a second, this datetime object has some extra functionality that makes dealing with dates a breeze. Although this new data has the date and price information we need, it’s divided into monthly amounts. As you can see from the commented-out code, I initially made a mistake and summed these values, but really we want the mean to get the average yearly price. I left the mistaken code there as it shows how easy it would have been to get the sum using groupby.

#wrong! lamb_prices = lamb_data.groupby(lamb_data.index.year)['Price'].sum()
lamb_prices = lamb_data.groupby(lamb_data.index.year)['Price'].mean()
lamb_prices = pd.DataFrame(lamb_prices[:-1]) #get rid of 2014
1994	 124.010000
1995	 113.242500
1996	 145.461667
1997	 150.282500
1998	 116.013333

We pass the year index to groupby and get it to do its magic on the price column (our only column in this case, but you get the idea), then just call the mean method to return the mean price per year. The datetime object made specifying the year easy. Now we’ll write a quick function to calculate the percent change since 1994.

def percent(start, data):
    '''calculate percent change relative to the start (1994) value, better than previous attempt )-:'''
    return 100 * (data - start) / start

lamb_change = percent(lamb_prices.Price[1994],lamb_prices)
1994	 0.000000
1995	 -8.682768
1996	 17.298336
1997	 21.185791
1998	 -6.448405

Great! Now just add that column to our original dataframe. Notice how only the intersection of the dates is used – very handy (i.e. it drops 1995–2001 from the lamb price data as these dates are not in our stock number table)!

per_decline['Lambprice'] = lamb_change
	Total beef cattle	Total dairy cattle	Total sheep	Total deer	Total pigs	Total horses	Lambprice
1994	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000
2002	-11.025827	 34.444950	-20.002034	 33.858009	-19.100637	 11.811093	 17.768056
2003	 -8.344764	 32.882482	-20.041908	 37.229441	-10.766476	 18.504488	 28.869984
per_decline.index = per_decline.index.astype(int)
plt.title('Percent change in stock in NZ since 1994')
plt.xlabel('Percent change since 1994')
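That align-on-index behaviour is worth a tiny demo of its own. A sketch with made-up numbers (assuming pandas is installed):

```python
import pandas as pd

# Assigning a Series into a DataFrame aligns on the index: years missing
# from the frame are silently dropped, and years missing from the Series
# would come through as NaN.
stock = pd.DataFrame({"sheep": [0.0, -20.0, -20.6]}, index=[1994, 2002, 2004])
lamb = pd.Series({1994: 0.0, 1995: -8.7, 2002: 17.8, 2004: 33.7})

stock["lamb_price"] = lamb
print(stock)  # 1995 from the Series never appears; the three shared years line up
```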



The next series of code and graphs adds in milk and lamb prices to try to see why farmers are moving from ovines to bovines!

milk_data = pd.DataFrame(pd.read_excel('milk_prices_usd.xlsx',sheetname='milk'))
	year	thousand head	pounds	mill lbs	price_cwt
1989	 1989	 10046	 14323	 143893	 13.56
1990	 1990	 9993	 14782	 147721	 13.68
1991	 1991	 9826	 15031	 147697	 12.24
1992	 1992	 9688	 15570	 150847	 13.09
1993	 1993	 9581	 15722	 150636	 12.80
#get rid of the info we don't need
milk_data = pd.DataFrame(milk_data.ix[5:,:])
milk_change = percent(milk_data.price_cwt[1994],milk_data)
per_decline['milk_price'] = milk_change
	Total beef cattle	Total dairy cattle	Total sheep	Total deer	Total pigs	Total horses	Lambprice	milk_price
1994	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000
2002	-11.025827	 34.444950	-20.002034	 33.858009	-19.100637	 11.811093	 17.768056	 -6.630686
2003	 -8.344764	 32.882482	-20.041908	 37.229441	-10.766476	 18.504488	 28.869984	 -3.469545
2004	-11.895128	 34.207998	-20.609926	 42.707754	 -8.072078	 13.376472	 33.670672	 23.747109
2005	-12.366101	 32.506699	-19.379727	 38.499840	-19.230733	-100.000000	 29.762385	 16.653816

These graphs are a little busy; let’s just concentrate on the important stuff.

animals=['Total dairy cattle','Total sheep','Lambprice','milk_price']





Pandas data analysis – New Zealanders and their sheep!


As an Aussie I cop a bit of flak living in New Zealand. It helps that since I’m from the south of Australia I follow Aussie Rules footy, not this rugby shenanigans, so I don’t bother too much with the constant focus on how much better the All Blacks are relative to our poor Wallabies (Australia has won as many World Cups as New Zealand – take that!).


Kiwi’s have bloody good gum boots; they have to do more than keep the mud off! – inside joke (-;

That said, as a bit of fun I thought I would do a post on the Pandas python module using data from StatsNZ. It has to be about sheep, right?! The actual spreadsheet I downloaded from StatsNZ and used in this post is here.

I’ll be using pandas, ipython, and matplotlib to create some graphs showing the decline of sheep in NZ. In the next post I’ll try to work out why (yes, I know you could find this out in 3 seconds with google, but that’s no fun).

First we will read the file using pandas and make a dataframe. Pandas can import all kinds of file formats; with Excel you need to include a sheet name (that little tab at the bottom of the Excel sheet).

import pandas as pd
import matplotlib.pyplot as plt
#sheetname required for xlsx
data_file = pd.read_excel('livestock1.xlsx',sheetname='livestock')

#make a dataframe
data = pd.DataFrame(data_file)
                                                 1994          2002      2003  \
Total beef cattle                             5047848       4491281   4626617
Calves born alive to beef heifers/cows        1262522       1083485   1079334
Total dairy cattle                            3839184       5161589   5101603
Calves born alive to dairy heifers/cows       2455975       3225238   3115897
Total sheep                              4.946605e+07  3.957184e+07  39552113
                                             2010      2011      2012
Total beef cattle                         3948520   3846414   3734412
Calves born alive to beef heifers/cows     901258    901375    827749
Total dairy cattle                        5915452   6174503   6445681
Calves born alive to dairy heifers/cows   3640914   3884257   3879543
Total sheep                              32562612  31132329  31262715

So we made a dataframe. The head method acts like bash head in that it shows the start of the frame rather than the whole thing. Currently the years are columns and the stock types are rows; let’s flip the table, which is super easy!

 #we really want the dates as the index
data = data.T
     Total beef cattle Calves born alive to beef heifers/cows  \
1994           5047848                                1262522
2002           4491281                                1083485
2003           4626617                                1079334
2004           4447400                                1013893
2005           4423626                                1018730

Now that we have the years as rows (actually an index), it will be much easier to do our plotting. But the column names are overly informative, as in long – let’s shorten them.

#the column names are pretty long, lets fix that now
Out[13]: Index([u'Total beef cattle', u'Calves born alive to beef heifers/cows', u'Total dairy cattle', u'Calves born alive to dairy heifers/cows', u'Total sheep', u'Total lambs marked and/or tailed', u'Total deer', u'Fawns weaned', u'Total pigs', u'Piglets weaned', u'Total horses'], dtype=object)
sub_data = data.loc[:,['Total beef cattle','Total dairy cattle','Total sheep','Total deer',
'Total pigs','Total horses']]
sub_data = sub_data.replace('..',0)#replace .. with 0
sub_data = sub_data.astype('float')
      Total beef cattle  Total dairy cattle  Total sheep  Total deer  \
1994            5047848             3839184     49466054     1231109
2002            4491281             5161589     39571837     1647938
2003            4626617             5101603     39552113     1689444
2004            4447400             5152492     39271137     1756888
2005            4423626             5087176     39879668     1705084
2006            4439136             5169557     40081594     1586918
2007            4393617             5260850     38460477     1396023
2008            4136872             5578440     34087864     1223324
2009            4100718             5860776     32383589     1145858
2010            3948520             5915452     32562612     1122695
2011            3846414             6174503     31132329     1088533
2012            3734412             6445681     31262715     1060694

Great! Now, to make the plot easier to look at, let’s divide the dataframe by 1 million.

#now divide by a million
sub_data = sub_data.divide(1000000)
#first plot everything
plt.ylabel('Total stock (millions)')
plt.title('NZ farm stock')

Yes, that is correct: back in the ’90s there were 50 million sheep in NZ, not bad for a country with a population of ~3 million people. Baa. But their numbers have been in serious decline since then, replaced by their bigger brothers the cows.
Poor sheepies!

Let’s face it, NZ is all cows and sheep, so let’s just look at that data.

#let's just plot cattle and sheep – the first three columns
cow_sheep = sub_data.ix[:,[0,1,2]]
cow_sheep.plot(label=True,title="Numbers of cattle and sheep in NZ")
plt.ylabel('Total stock (millions)')

The data has meat cows and milk cows as separate columns; it’s easy to combine them.

#a milk cow and meat cow are still cattle
cow_sheep['Total cows'] = cow_sheep.ix[:,0] +  cow_sheep.ix[:,1]
cow_sheep.ix[:,2:].plot(legend=True,title="Total cattle and sheep in NZ")
plt.ylabel('Total stock (millions)')

So the country is cowafying, mostly dairy replacing beef cattle. Although the increase looks slight, it’s basically 25% up! That’s a lot of grass, cow shit, and methane!

Now let’s look at the change in the numbers of each stock type since 1994 (the start of our data). We do this by passing all the data, as well as the data minus 1994, to a function that handles the calculation. Pandas handles this all in the back end and passes the data in the frame through the function – nice.

def percent(dataframe, d1):
    '''calculate percent change relative to the first row (1994)'''
    return 100 * (d1 - dataframe.ix[0, :]) / dataframe.ix[0, :]

#pass the entire data frame to this function
percent_data = percent(sub_data,sub_data.ix[0:,:])

      Total beef cattle  Total dairy cattle  Total sheep  Total deer  \
1994           0.000000            0.000000     0.000000    0.000000   
2002         -11.025827           34.444950   -20.002034   33.858009   
2003          -8.344764           32.882482   -20.041908   37.229441   
2004         -11.895128           34.207998   -20.609926   42.707754   
2005         -12.366101           32.506699   -19.379727   38.499840   

      Total pigs  Total horses  
1994    0.000000      0.000000  
2002  -19.100637     11.811093  
2003  -10.766476     18.504488  
2004   -8.072078     13.376472  
2005  -19.230733   -100.000000  
#the years are plotted as floats, it's easy to convert them!
#figure 4
plt.title('Percent change in stock in NZ since 1994')
plt.xlabel('Percent change since 1994')

I really want the graph the other way around, so let’s re-index. Also, let’s get rid of the nags – they look funny because the frame was missing some data. More proof that Phar Lap was an Australian horse.

horseless = sub_data.ix[:,:-1]
horseless_per = percent(horseless,horseless.ix[0:,:])
#flip the axis
horseless_per = horseless_per.reindex( index=data.index[ ::-1 ] )
plt.title('Percent change in stock in NZ since 1994')
plt.xlabel('Percent change since 1994')


Finally, it’s silly having the 1994 data in there as it’s all zero, but it was a nice sanity check to make sure the % function was working correctly. Let’s get rid of it now by just plotting part of the frame.

plt.title('Percent change in stock in NZ since 1994')
plt.xlabel('Percent change since 1994')

The ‘barh’ is just a horizontal bar graph.
So it looks like there was a venison craze in the early 00’s, but mostly it’s just more and more dairy.

#save the file

Ok, so although there are still a LOT of sheep in NZ, it really is the country of the cow now. What might be driving that? Let’s look at some commodity prices in the next post!

How to use DESeq2 to analyse RNAseq data


News: my colleagues at NZGL have developed an open-source R-based GUI for generating plots using cuffdiff data. One for DESeq2 will be available soon! Feel free to check it out and get back to us with any suggestions. It only requires R-studio.

There is only one thing better than DESeq and that’s DESeq2! The updated version is out and I’m keen to give it a whirl. As with my old DESeq post, once again I am really just following the excellent DESeq2 manual – thanks again to the authors for the great documentation!

Just a quick warning that I haven’t tested this workflow extensively, so let me know if things don’t work. Also, DESeq2 is new and some of the functions might change, so if you have problems make sure you check what version you are using versus what I used (see sessionInfo below).

The files I used for this are found here if you want to give them a go or follow along. I’ve already got a blow-by-blow run-through of how to use DESeq, and much of that applies to the new package, so here I’ll just concentrate on some DESeq2-specific stuff as well as all the graphs. I’ve included some side-by-side comparisons between DESeq and DESeq2.

Installing is easy:


One important change is that you can now create the count table directly from raw htseq-count output, and I’ll show you how to do that below. Remember, htseq-count creates a single file for each replicate of a condition (based on a SAM alignment file), so in my case with two conditions (control and treatment) and 3 reps each, that makes a total of 6 files. I called these files treated1.txt, treated2.txt, treated3.txt, untreated1.txt, untreated2.txt, untreated3.txt so that I can use grep to import them (explained below). Previously I would use a script to merge them all together; now DESeq2 allows you to import these files directly.
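For the record, that old merge step is only a few lines of Python/pandas anyway. This is just a sketch of the idea using the file names above (a real script would also drop htseq-count’s __no_feature etc. summary rows):

```python
import pandas as pd
from functools import reduce

def merge_htseq_counts(files):
    """Join per-sample htseq-count outputs (gene<TAB>count) into one table,
    naming each count column after its file."""
    tables = [pd.read_csv(f, sep="\t", header=None,
                          names=["gene", f.rsplit(".", 1)[0]],
                          index_col="gene")
              for f in files]
    # inner-join on the gene index so every sample covers the same genes
    return reduce(lambda left, right: left.join(right, how="inner"), tables)

# e.g. counts = merge_htseq_counts(["treated1.txt", "treated2.txt", "treated3.txt",
#                                   "untreated1.txt", "untreated2.txt", "untreated3.txt"])
```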

Below I set the directory where the files are located and use grep to catch all these files based on the string match “treated” that they all share (be careful it doesn’t catch anything else); this is stored in sampleFiles. If you like, you could just directly specify the files using sampleFiles<-c(“treated1.txt”,..etc..,”untreated3.txt”). Importantly, we need to set up sampleCondition in the same order as the file names in sampleFiles (so it knows what each file represents). Finally, we make a dataframe that becomes a deseq table. Note my “#” lines are just non-functional comment lines that I am using to show the output from the screen!

#use grep to search for the 'treated' part of filename to collect files
# sampleFiles
#[1] 'treated1.txt'   'treated2.txt' 'treated3.txt'  'untreated1.txt'
#[5] 'untreated2.txt' 'untreated3.txt'

sampleTable<-data.frame(sampleName=sampleFiles, fileName=sampleFiles, condition=sampleCondition)
#     sampleName       fileName condition
#1   treated1.txt   treated1.txt   treated
#2   treated2.txt   treated2.txt   treated
#3   treated3.txt   treated3.txt   treated
#4 untreated1.txt untreated1.txt untreated
#5 untreated2.txt untreated2.txt untreated
#6 untreated3.txt untreated3.txt untreated

ddsHTSeq<-DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory, design=~condition)
#class: DESeqDataSet
#dim: 7921 6
#assays(1): counts
#rownames(7921): seq_1 seq_2 ... seq_7920 seq_7921
#rowData metadata column names(0):
#colnames(6): treated1.txt treated2.txt ... untreated2.txt
#  untreated3.txt
#colData names(1): condition
colData(ddsHTSeq)$condition<-factor(colData(ddsHTSeq)$condition, levels=c('untreated','treated'))

The levels in colData are important because they are used in the log calculations; it makes sense to set untreated or control first so that the direction of the log fold changes doesn’t confuse everyone (typically we do comparisons to the control)! Now for the guts of the DESeq2 analysis.

#DataFrame with 6 rows and 6 columns
#          baseMean log2FoldChange      lfcSE      stat       pvalue
#seq_3146  997.5419      0.7894523 0.08297687  9.514125 1.832488e-21
#seq_1802  746.3972      0.5685789 0.08533961  6.662544 2.691282e-11
#seq_2146  406.1395      0.9424543 0.14108613  6.679993 2.389544e-11
#seq_7548  466.5453      0.6036683 0.10178158  5.931017 3.010637e-09
#seq_3240 1569.6556      0.6132326 0.11145966  5.501835 3.758596e-08
#seq_958   149.6504      0.7398193 0.14154162  5.226868 1.724055e-07
#                 padj
#seq_3146 1.299050e-17
#seq_1802 6.359498e-08
#seq_2146 6.359498e-08
#seq_7548 5.335601e-06
#seq_3240 5.328937e-05
#seq_958  2.036971e-04

Looking good – time for some plots. BTW I’m using the same dataset I used for the original DESeq blog post (links to data and that blog at the top of the page).


MAPlot of DESeq1 (left) and DESeq2 (right) for the same data

As expected for this dataset there are not many differentially expressed genes (red). There certainly is a difference in the level of scatter between DESeq and DESeq2 on this dataset. Also note the good reduction in scatter for low-count reads (left-hand side of the graph) in DESeq2 versus the original version. DESeq tends to be a conservative approach – I like that – and with that in mind the update uses a test called Cook’s distance to remove outliers from the analysis. Cook’s distance looks at how much each sample contributes to a gene’s overall fold change, and samples that cause extreme effects are removed. To be specific, a gene will not be analysed for differential expression if one of its samples is considered an outlier. The idea is that we only want to see DE genes that show a consistent pattern. The drawback of this approach is a loss of power, as some genes that are truly DE will be removed before the statistical tests are performed.
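To get a feel for the idea, here is a toy Cook’s distance for a mean-only model with made-up counts – an illustration of the leave-one-out intuition only, not DESeq2’s GLM-based calculation:

```python
import numpy as np

def cooks_distance_mean_model(y):
    """Toy Cook's distance for a mean-only model: how far the fitted value
    moves when each sample is left out, scaled by the residual variance."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    fit = y.mean()
    s2 = ((y - fit) ** 2).sum() / (n - 1)   # residual variance (p = 1 parameter)
    loo = (y.sum() - y) / (n - 1)           # leave-one-out means
    return n * (fit - loo) ** 2 / s2

counts = [10, 12, 9, 11, 10, 95]            # one wildly inconsistent replicate
d = cooks_distance_mean_model(counts)
print(d.argmax())  # -> 5: the outlier dominates the distances
```

A gene with a sample like that last replicate is exactly the kind DESeq2 would flag and skip.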

We can save the table, and also print out some information on what the columns mean.

#DataFrame with 6 rows and 2 columns
#                       type

#baseMean       intermediate
#log2FoldChange      results
#lfcSE               results
#stat                results
#pvalue              results
#padj                results

#baseMean                        the base mean over all rows
#log2FoldChange log2 fold change (MAP): condition treated vs untreated
#lfcSE           standard error: condition treated vs untreated
#stat            Wald statistic: condition treated vs untreated
#pvalue          Wald test p-value: condition treated vs untreated
#padj            BH adjusted p-values
#write the table to a csv file

BTW take this with a pinch of salt because it's only a simple sample dataset, but the difference in gene counts is that DESeq only found a single differentially expressed gene (at padj 0.1), whilst DESeq2 called this same gene plus 23 others. Also, reducing the multiple testing cut-off to 0.05 only removes 3 genes from the DESeq2 list.

Now we want to transform the raw, discretely distributed counts so that we can do clustering. (Note: when you expect a large treatment effect you should actually set blind=FALSE; see the DESeq2 vignette.)

rld<- rlogTransformation(dds, blind=TRUE)
vsd<-varianceStabilizingTransformation(dds, blind=TRUE)

Here we choose blind so that the initial conditions setting does not influence the outcome, ie we want to see if the conditions cluster based purely on the individual datasets, in an unbiased way. According to the documentation, the rlogTransformation method that converts counts to log2 values is apparently better than the older variance stabilizing transformation when the size factors vary by large amounts.

The code and plot below shows the [nice] effect of the transformation.

par(mai=ifelse(1:4 <= 2, par('mai'), 0))
px     <- counts(dds)[,1] / sizeFactors(dds)[1]
ord    <- order(px)
ord    <- ord[px[ord]<150]
ord    <- ord[seq(1, length(ord), length=50)]
last   <- ord[length(ord)]
vstcol <- c('blue', 'black')
matplot(px[ord], cbind(assay(vsd)[, 1], log2(px))[ord, ], type='l', lty=1, col=vstcol, xlab='n', ylab='f(n)')
legend('bottomright', legend = c(expression('variance stabilizing transformation'), expression(log[2](n/s[1]))), fill=vstcol)
Graph showing variance stabilizing transformation for sample 1 (blue) and of the transformation f (n) = log2 (n/s1 ) (black)


The x axis is the square root of variance over the mean for all samples, so this will naturally include variance due to the treatment. The goal here is to flatten the curve so that there is consistent variance across the read counts, and that is what we got.
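You can see the same principle with a quick Python simulation (purely illustrative, using Poisson noise rather than the negative binomial model DESeq2 assumes): the raw standard deviation grows with the mean, while after a square-root style transform it is roughly constant across the dynamic range.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical genes with increasing mean counts and Poisson noise
means = np.array([5, 50, 500, 5000])
counts = rng.poisson(means, size=(10000, 4))

raw_sd = counts.std(axis=0)                  # grows roughly like sqrt(mean)
vst_sd = (2 * np.sqrt(counts)).std(axis=0)   # roughly flat across the range
```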

notAllZero <- (rowSums(counts(dds))>0)
meanSdPlot(log2(counts(dds,normalized=TRUE)[notAllZero,] + 1), ylim = c(0,2.5))
meanSdPlot(assay(rld[notAllZero,]), ylim = c(0,2.5))
meanSdPlot(assay(vsd[notAllZero,]), ylim = c(0,2.5))


This interesting plot shows the standard deviation across all samples against the mean counts using three different methods of transformation. With this data you can see that the shifted logarithm method (left) seems to do pretty badly for low count genes, with both regularized log (center) and DESeq's variance stabilisation (right) doing a much better job across the entire dynamic range of counts.

For some reason, everyone loves a good heat map!

select <- order(rowMeans(counts(dds,normalized=TRUE)), decreasing=TRUE)[1:30]
hmcol <- colorRampPalette(brewer.pal(9, 'GnBu'))(100)
heatmap.2(counts(dds,normalized=TRUE)[select,], col = hmcol,
          Rowv = FALSE, Colv = FALSE, scale='none',
          dendrogram='none', trace='none', margin=c(10,6))
heatmap.2(assay(rld)[select,], col = hmcol,
          Rowv = FALSE, Colv = FALSE, scale='none',
          dendrogram='none', trace='none', margin=c(10, 6))
heatmap.2(assay(vsd)[select,], col = hmcol,
          Rowv = FALSE, Colv = FALSE, scale='none',
          dendrogram='none', trace='none', margin=c(10, 6))


The above shows heatmaps for the 30 most highly expressed genes (not necessarily those with the biggest fold change). The data are raw counts (left), regularized log transformation (center) and variance stabilizing transformation (right), and you can clearly see the effect the transformations have by shrinking the variance, so that we don't get the squish effect shown in the left hand graph.

Now we calculate sample to sample distances so we can make a dendrogram to look at the clustering of samples.

distsRL <- dist(t(assay(rld)))
mat<- as.matrix(distsRL)
rownames(mat) <- colnames(mat) <- with(colData(dds),
    paste(condition, sampleFiles, sep=' : '))

#updated in latest vignette (See comment by Michael Love)
#this line was incorrect
#heatmap.2(mat, trace='none', col = rev(hmcol), margin=c(16, 16))
#From the Apr 2015 vignette
hc <- hclust(distsRL)
heatmap.2(mat, Rowv=as.dendrogram(hc),
          symm=TRUE, trace='none',
          col = rev(hmcol), margin=c(13, 13))


Although this result looks terrible, as we would expect samples to cluster by treatment, in this case I'm actually happy with this result. Why? Well, this was actually a control experiment to show that the slightly different (and unavoidable) experimental setup for the different samples wasn't responsible for the observed expression differences, so seeing that there is little treatment effect makes me happy. Remember, always try to do what Fisher tells us: replicate, randomise, block.

Similarly, the PCA.

print(plotPCA(rld, intgroup=c('condition')))

I hope yours looks better!

Previously we talked about the Cook's distance treatment of outliers, where a gene is thrown away if one of its samples is deemed to be an outlier. You may not want this to happen, so in DESeq2 we can take a different approach by replacing the outlier value with a value estimated from the distribution using the trimmed mean approach. DESeq2 recommends you only do this if you have several replicates per treatment, and indeed it automatically uses this feature if you have 7 or more replicates in your datatable.

ddsClean <- replaceOutliersWithTrimmedMean(dds)
ddsClean <- DESeq(ddsClean)
tab <- table(initial = results(dds)$padj < .1,
             cleaned = results(ddsClean)$padj < .1)
resClean <- results(ddsClean)

In my case it didn’t really make much difference.
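For intuition, here is a minimal Python sketch of a trimmed mean (illustrative only — DESeq2's replaceOutliersWithTrimmedMean works per gene across samples and accounts for size factors):

```python
def trimmed_mean(values, prop=0.2):
    """Average after dropping the top and bottom `prop` fraction of values."""
    vals = sorted(values)
    k = int(len(vals) * prop)            # number of values to drop at each end
    kept = vals[k:len(vals) - k] if k else vals
    return sum(kept) / float(len(kept))

# a hypothetical gene: the 100 is an outlier that a plain mean would chase
clean = trimmed_mean([1, 2, 3, 4, 100], prop=0.2)
```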

The dispersion plot shows how the estimates are shrunk from the gene-wise values (black dots) toward the fitted estimates, with the final values used in testing being the blue dots.

Dispersion plots: DESeq (left) and DESeq2 (right)

and compared between DESeq1 and DESeq2:

Dispersion plots: DESeq1 (left) and DESeq2 (right)

Now we use independent filtering to remove tests that have little chance of passing, reducing the number of tests we have to perform and thus the effects of multiple testing correction (false discovery). You can see how many genes are rejected based on different values of alpha (FDR).

#filtering threshold
#     10%
plot(attr(res,'filterNumRej'), type='b', ylab='number of rejections')


W <- res$stat
maxCooks <- apply(assays(dds)[['cooks']], 1, max)
idx <- !is.na(W)
plot(rank(W[idx]), maxCooks[idx], xlab='rank of Wald statistic',
     ylab="maximum Cook's distance per gene",
     ylim=c(0,5), cex=.4, col=rgb(0,0,0,.3))
m <- ncol(dds)
p <- 3
abline(h=qf(.99, p, m - p))

Plot of the maximum Cook’s distance per gene over the rank of the Wald statistics for the condition.

Here's more about independent filtering. What it shows is that genes with very low counts are unlikely to have a significant p-value due to excessive dispersion at the left side of the dynamic range of counts. The y-axis here is -log10, so bigger numbers are smaller p-values (better).

plot(res$baseMean+1, -log10(res$pvalue),
     log='x', xlab='mean of normalized counts',
     cex=.4, col=rgb(0,0,0,.3))


All those dots on the left hand side of the graph represent failed tests due to very low count values, so we can simply get rid of them to reduce our chance of making a type I error.

And again, you can see that only a few small (or no) p-values are discarded by the filtering. NOTE: You might only see blue lines [I’ve broken something??]

use <- res$baseMean > attr(res,'filterThreshold')
h1 <- hist(res$pvalue[!use], breaks=0:50/50, plot=FALSE)
h2 <- hist(res$pvalue[use], breaks=0:50/50, plot=FALSE)
colori <- c('do not pass'='khaki', 'pass'='powderblue')
barplot(height = rbind(h1$counts, h2$counts), beside = FALSE,
col = colori, space = 0, main = '', ylab='frequency')
text(x = c(0, length(h1$counts)), y = 0, label = paste(c(0,1)),
adj = c(0.5,1.7), xpd=NA)
legend('topright', fill=rev(colori), legend=rev(names(colori)))

The graph on the left ranks the p-values from smallest to biggest (x-axis) and plots them. The black line is the actual p-values (remember only about 23 genes had a p-value lower than 0.05). The red line has a slope equal to the false discovery rate (0.1) divided by the number of tests. The idea here is that the FDR is controlled at 0.1 for all tests that occur to the left of the right-most intersection of the black and red lines.
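For the curious, the Benjamini-Hochberg step-up rule that red line represents can be sketched in a few lines of Python (mirroring what R's p.adjust(method='BH') does; this is an illustration, not the DESeq2 code):

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(pvalues, dtype=float)
    n = len(p)
    order = np.argsort(p)
    # multiply each sorted p-value by n / rank ...
    scaled = p[order] * n / np.arange(1, n + 1)
    # ... then enforce monotonicity from the largest p-value down, cap at 1
    adj = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    out = np.empty(n)
    out[order] = adj
    return out

adjusted = bh_adjust([0.01, 0.02, 0.03, 0.5])
```

A test is rejected at FDR 0.1 when its adjusted p-value falls below 0.1, which matches the "left of the intersection" rule above.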

The code for the right hand plot above.

resFilt <- res[use & !is.na(res$pvalue),]
orderInPlot <- order(resFilt$pvalue)
showInPlot <- (resFilt$pvalue[orderInPlot] <= 0.08)
alpha <- 0.1

plot(seq(along=which(showInPlot)), resFilt$pvalue[orderInPlot][showInPlot],
pch='.', xlab = expression(rank(p[i])), ylab=expression(p[i]))
abline(a=0, b=alpha/length(resFilt$pvalue), col='red3', lwd=2)
sessionInfo()

R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)


attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] gplots_2.12.1 RColorBrewer_1.0-5 BiocInstaller_1.12.0
[4] DESeq2_1.2.8 RcppArmadillo_0.3.920.3 Rcpp_0.10.6
[7] GenomicRanges_1.14.3 XVector_0.2.0 IRanges_1.20.6
[10] BiocGenerics_0.8.0

loaded via a namespace (and not attached):
[1] affy_1.40.0 affyio_1.30.0 annotate_1.40.0
[4] AnnotationDbi_1.24.0 Biobase_2.22.0 bitops_1.0-6
[7] caTools_1.16 DBI_0.2-7 DESeq_1.14.0
[10] gdata_2.13.2 genefilter_1.44.0 geneplotter_1.40.0
[13] grid_3.0.2 gtools_3.1.1 KernSmooth_2.23-10
[16] lattice_0.20-24 limma_3.18.5 locfit_1.5-9.1
[19] preprocessCore_1.24.0 RSQLite_0.11.4 splines_3.0.2
[22] stats4_3.0.2 survival_2.37-4 tools_3.0.2
[25] vsn_3.30.0 XML_3.98-1.1 xtable_1.7-1
[28] zlibb

[Updated July '14: to fix errors with the distance matrix plot, Cook's distance, and the Benjamini-Hochberg multiple testing adjustment procedure (props to Stefan for pointing them out)]

Basic data plotting with ipython and pandas

I’ve posted before on how much I love plotting in python using pandas, matplotlib, and ipython. But I thought I would do a blog post containing some important ‘bookmarks’ on how to do some essential plotting tasks. Just as a simple example of how pandas could be used in a biological sense, how about we make a dataframe from a blast table output by NCBI BLAST. The example data is just a blast search using transcripts as a query with output option 8, which creates a tabular output that is nice for parsing using scripts. Note, if you want to use my table go for it, it’s here as a tab-delimited Excel spreadsheet; for this example I just copied it into a text document called ‘blast_output.txt’ (I can’t upload txt files at wordpress!)

1) Basic table manipulations and very simple excel-like maths functions.

Normally I probably would pre-process the data using a simple python script, but here I’ll do it all in pandas.

from pandas import Series,DataFrame
import matplotlib.pyplot as plt
import pandas as pd
blast_data = pd.read_table('blast_output.txt', header=0)
blast_data = DataFrame(blast_data)
blast_data.head()

#          QUERY  NUMHITS                                            HITDESC  \
#0  Sequence_7116        1  ref|XP_004521769.1| PREDICTED: M-phase phospho...   
#1  Sequence_7117        1  ref|XP_002059188.1| GJ16167 [Drosophila virili...   
#2  Sequence_7118        1  ref|XP_002008154.1| GI11977 [Drosophila mojave...   
#3  Sequence_7119        1  ref|XP_001955873.1| GF24906 [Drosophila ananas...   
#4  Sequence_7120        1  ref|XP_004520750.1| PREDICTED: proteoglycan 4-...   
#0       1   2.000000e-44       158      0.702290        0.793893  0.511719   
#1       1  1.000000e-175       502      0.766667        0.886667  0.853081   
#2       3   1.000000e-46       132      0.680412        0.762887  0.126928   
#3       1   2.000000e-62       211      0.452830        0.577358  0.716599   
#4       2   0.000000e+00       734      0.514886        0.643608  0.577610   

#     HITCOV  
#0  0.796178  
#1  0.974026  
#2  0.572414  
#3  0.933824  
#4  0.650243

So now we have the blast table stored as a pandas data frame, it might be a good time to show how to slice and dice this frame so that it shows the information we are interested in. Firstly, column and index headings are easy to access using (in this example) “blast_data.columns” and “blast_data.index”. Grabbing a column is simple:

blast_data['QUERY'][:6]
0     Sequence_7116
1     Sequence_7117
2     Sequence_7118
3     Sequence_7119
4     Sequence_7120
5     Sequence_7121

To slice sections of the table we would use “blast_data.ix”. As this is a two-dimensional table, sometimes it’s a little tricky to work out how “ix” works. But it’s quite simple really: it’s just the two dimensions separated by a comma, with the row range first and the columns second, remembering that as this is python, counting starts at 0! So for some easy examples, let’s see how to get the first row and all columns (1), the first 10 rows and 3 columns (2), and the first 10 rows and last 2 columns (3). As you can see it’s just normal python list slicing but in 2 dimensions.

#first row and all columns (1)
blast_data.ix[0:0,:] #index 0 to 0 ie first row and ":" columns ie all of them
 QUERY  NUMHITS                                            HITDESC  \
0  Sequence_7116        1  ref|XP_004521769.1| PREDICTED: M-phase phospho...   
0       1  2.000000e-44       158       0.70229        0.793893  0.511719   
0  0.796178  

#first 10 rows and 3 columns (2) 
 blast_data.ix[0:10,0:3] #standard python slice
            QUERY  NUMHITS                                            HITDESC
0   Sequence_7116        1  ref|XP_004521769.1| PREDICTED: M-phase phospho...
1   Sequence_7117        1  ref|XP_002059188.1| GJ16167 [Drosophila virili...
2   Sequence_7118        1  ref|XP_002008154.1| GI11977 [Drosophila mojave...
3   Sequence_7119        1  ref|XP_001955873.1| GF24906 [Drosophila ananas...
4   Sequence_7120        1  ref|XP_004520750.1| PREDICTED: proteoglycan 4-...
5   Sequence_7121        1  ref|XP_004537614.1| PREDICTED: protein suppres...
6   Sequence_7122        1  ref|XP_002073069.1| GK13353 [Drosophila willis...
7   Sequence_7123        1  ref|XP_001981946.1| GG11315 [Drosophila erecta...
8   Sequence_7124        1  ref|XP_002052580.1| GJ17621 [Drosophila virili...
9   Sequence_7125        1  ref|XP_002007909.1| GI13202 [Drosophila mojave...
10  Sequence_7126        1  ref|XP_004520332.1| PREDICTED: nucleolar prote...

#first 10 rows and last 2 columns (3)
blast_data.ix[0:10,-2:] #standard python again, '-2' is 2 from end
0   0.511719  0.796178
1   0.853081  0.974026
2   0.126928  0.572414
3   0.716599  0.933824
4   0.577610  0.650243
5   0.312649  0.376129
6   0.615196  0.996016
7   0.934222  1.000000
8   0.840226  1.000000
9   0.157009  0.342857
10  0.837484  0.960961
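A heads-up if you are following along with a recent pandas: .ix has since been removed. The positional equivalent is .iloc, with one trap: .ix treated integer ranges as inclusive labels (0:10 gave 11 rows), while .iloc uses normal end-exclusive python slicing. A small sketch with made-up data standing in for blast_data:

```python
import pandas as pd

# hypothetical mini blast table
df = pd.DataFrame({'QUERY': ['Sequence_7116', 'Sequence_7117', 'Sequence_7118'],
                   'NUMHITS': [1, 1, 1],
                   'EVAL': [2e-44, 1e-175, 1e-46]})

first_row = df.iloc[0:1, :]    # first row, all columns
sub       = df.iloc[0:2, 0:2]  # first two rows, first two columns
last_cols = df.iloc[:, -2:]    # all rows, last two columns
```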

The great thing about the pandas dataframe is you can do excel-like maths using the table. For example, let’s just look at E-values < 1E-100.

good_hits = blast_data[blast_data['EVAL']<1e-100]
len(good_hits)
#output: 576

           QUERY  NUMHITS                                            HITDESC  \
1  Sequence_7117        1  ref|XP_002059188.1| GJ16167 [Drosophila virili...   
4  Sequence_7120        1  ref|XP_004520750.1| PREDICTED: proteoglycan 4-...   
5  Sequence_7121        1  ref|XP_004537614.1| PREDICTED: protein suppres...   
6  Sequence_7122        1  ref|XP_002073069.1| GK13353 [Drosophila willis...   
7  Sequence_7123        1  ref|XP_001981946.1| GG11315 [Drosophila erecta...   

1       1  1.000000e-175       502      0.766667        0.886667  0.853081   
4       2   0.000000e+00       734      0.514886        0.643608  0.577610   
5       2  1.000000e-108       390      0.438655        0.605042  0.312649   
6       1  1.000000e-112       342      0.648221        0.833992  0.615196   
7       1   0.000000e+00      1845      0.513501        0.672667  0.934222   

1  0.974026  
4  0.650243  
5  0.376129  
6  0.996016  
7  1.000000
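Boolean filters can also be combined; here is a self-contained sketch with made-up values (note the parentheses around each condition, which pandas requires when combining with & and |):

```python
import pandas as pd

# hypothetical mini blast table
df = pd.DataFrame({'QUERY': ['s1', 's2', 's3'],
                   'EVAL': [1e-200, 1e-50, 0.0],
                   'NUMHITS': [1, 2, 1]})

# keep strong hits (EVAL < 1e-100) that also had a single hit
good = df[(df['EVAL'] < 1e-100) & (df['NUMHITS'] == 1)]
```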

We can even collect the best hits by sorting on the EVAL column, e.g. with good_hits.sort('EVAL') (sort_values in newer pandas).

            QUERY  NUMHITS  \
501  Sequence_7617        1   
400  Sequence_7516        1   
754  Sequence_7870        1   
398  Sequence_7514        1   
755  Sequence_7871        1   

                                               HITDESC  NUMHSP  EVAL  \
501  ref|XP_004525475.1| PREDICTED: kelch domain-co...       1     0   
400  ref|XP_001362065.2| GA19724 [Drosophila pseudo...       1     0   
754  ref|XP_004524178.1| PREDICTED: mitogen-activat...       1     0   
398  ref|XP_004523028.1| PREDICTED: sphingosine kin...       1     0   
755  ref|XP_004536256.1| PREDICTED: V-type proton A...       1     0   

501       691            74        0.863248  0.537193  0.902724  
400       842            61        0.744382  0.904181  0.973574  
754      2147            73        0.837891  0.934356  0.965003  
398       636            53        0.680445  0.797900  0.993399  
755       897            84        0.897143  0.379302  0.949367

Some *very* simple math might be to convert the fractional identity into a whole-number percentage.

good_hits['PERCENTIDENT'] = good_hits['PERCENTIDENT']*100
good_hits['PERCENTIDENT'] = good_hits['PERCENTIDENT'].astype(int)

Of course you could do a lot more complex excel type maths than that, but hopefully you get the idea!

2) Histograms – everyone loves a good histogram.

For this histogram how about we collect information from a blast search and parse out the organism, so that we can look at a histogram of ‘species richness’ in the best BLAST hits. In this format column 3 contains a hit descriptor that also lists the organism name in square brackets (ie [Nasonia vitripennis]). So I want to parse out that information, collect all the totals, and plot the most frequent 10 (for example). We really only need the genus part, so we discard the species epithet, which conveniently groups together all those contaminating Drosophila genomes in the database (-;

Now we are really interested in the column “HITDESC”. What we want is just the frequency distribution of the species in the square brackets, all the other info is not required for a histogram.

desc = blast_data['HITDESC']
0     ref|XP_004521769.1| PREDICTED: M-phase phospho...
1     ref|XP_002059188.1| GJ16167 [Drosophila virili...
2     ref|XP_002008154.1| GI11977 [Drosophila mojave...
3     ref|XP_001955873.1| GF24906 [Drosophila ananas.

Although the names and descriptions are incomplete, this is just a display issue; the data is still there. First we will write a quick filtering function to grab the genus name, then use a list comprehension to make a collection of names, and finally we will use another function to count each occurrence of the names in the list.

def species(desc):
    #a function to pull out the genus name from [genus species]
    line = desc.split(' [')[1] #get 'genus species...'
    line = line.split(' ')[0] #trim off the species part
    return line

def counter(seq):
    """make a freq dict with species as key"""
    seq_dict = {}
    for n in seq:
        if n in seq_dict:
            seq_dict[n] += 1
        else:
            seq_dict[n] = 1
    return seq_dict

desc_list = [species(des) for des in desc] #make a list of names

spec_dict = counter(desc_list)
#lambda magic to get a sorted list from an unsorted dictionary
most_freq8 = sorted(spec_dict, key=lambda x: spec_dict[x], reverse=True)[:8]
#if you wanted to check this works
#chk = sorted(spec_dict.values())
#chk[-8:] #show the last eight values, and cross-reference them against the dictionary using the names from most_freq8 as keys

Nearly there. You’ll probably notice our most_freq8 list is no good for a histogram because we have lost our frequency information. We can get it back using some functional programming magic (BTW for an excellent run-through of functional programming in python check out the great how-to here).

most_freq_dict = dict(((spec,spec_dict[spec]) for spec in most_freq8)) 
{'Acyrthosiphon': 7,
 'Aedes': 8,
 'Anopheles': 6,
 'Ceratitis': 501,
 'Drosophila': 388,
 'Glossina': 7,
 'Musca': 7,
 'Sarcophaga': 8}

Briefly, a dictionary can be created from a tuple of key-value pairs, so all we do here is use a generator expression to combine the “spec_dict” frequency counts with the keys from the most frequent list to create a new dictionary. As it turns out, dictionaries can be converted directly to pandas dataframes; however, this will create a frame with all the species as column headings. We could flip the axes using the DataFrame.transpose function, but that is messy. Instead we will use a list comprehension again to reformat the data (yes, I know I probably could have done this a few steps ago to avoid all this data mushing).

hist_data = DataFrame([[key,most_freq_dict[key]] for key in most_freq_dict])
               0    1
0          Musca    7
1          Aedes    8
2     Drosophila  388
3      Anopheles    6
4      Ceratitis  501
5  Acyrthosiphon    7
6     Sarcophaga    8
7       Glossina    7
#need the index to be the species to get them to display

Great, sort of! Really we want the species names to be the index so they are displayed in the histogram. We could just re-index from here, but let’s re-make the table, directly specifying that we want the index to be the keys from the dictionary.

hist_data = DataFrame([most_freq_dict[key] for key in most_freq_dict], columns=['Frequency'], index=most_freq_dict.keys())
hist_data #the values and the index both come from most_freq_dict, so the order matches
Musca            7
Aedes            8
Drosophila     388
Anopheles        6
Ceratitis      501
Acyrthosiphon    7
Sarcophaga       8
Glossina         7

We can make a histogram directly from our dataframe, sweet.

fig = plt.figure()
axes = fig.add_subplot(1,1,1)
axes.set_title("Bla bla")
hist_data.plot(kind='bar', ax=axes)


Finally, we really want the dataframe sorted by size to make the histogram look a little nicer.

hist_data = hist_data.sort('Frequency', ascending=False) #not in place
fig = plt.figure()
axes = fig.add_subplot(1,1,1)
axes.set_title("Bla bla")
hist_data.plot(kind='bar', ax=axes)


Bash for Biologists: A gateway to the terminal!

What is bash?

Bash programming is a form of scripting that originates from the early days when a computer was a keyboard with a flashing cursor on a monochrome screen!

Credit: Wtshymanski @ wikipedia

Bash acts as a command line interpreter, taking input from a user (via the keyboard) and using it to make the computer act in a certain way. Many of these commands are small scripts that do a multitude of tasks, from text processing to network communications. Although most of us now use a graphical user interface (GUI) to communicate with our computers, a basic understanding of bash programming can dramatically improve our productivity in certain situations. This is especially true when dealing with the large files generated by most bioinformatic applications.

Where do I get this bash program?

Well, technically speaking Bash is not really a program, it’s more a command line interpreter and a scripting language. That said, it is available by default on any NIX system (Unix-like, ie ubuntu or fedora), on Macintosh (OSX), and on windows using a free program called cygwin (there are others as well).

Enough already!

First thing, don’t copy and paste! Why? Typing these commands is part of the hard wiring required to make them stick! Also, “with great power comes great responsibility”, and bash has the power to wipe your computer – no trash recovery – completely gone forever (except for that backup – right?)! By typing commands it gives you another opportunity to think about the command before you execute it and typing is more likely to trigger a big red flag in dangerous situations! Think twice, type once.

Ok, Ok, in the following tutorial I have attempted to provide command line solutions to some common problems faced by life scientists who have to deal with files generated by next generation sequencing technologies. I’ll probably do a few of these as with the explanations they get quite long. Of course, this is not an attempt to teach you to be a terminal master, that can only come from practice, but hopefully it will give you a taste of the power hiding behind that drab terminal exterior. It is also not designed to teach you how to program a computer, but I’m hoping this might be the gateway drug you need to get started.

The solutions in this tutorial were tested on a linux machine (Ubuntu), but should mostly work on Mac and windows through cygwin.

Problem 1. I have this huge file of DNA sequences in FASTA format (>seq1\nATGTAA) and I want to know how many sequences are in there?

Solution. This is a common problem: if your text editor manages to open the file without freezing, you are still left with the problem of counting sequences. To our rescue comes grep, a little application that searches for strings. In our case, as this is a fasta file, we can search for and count instances of the “>” symbol to find how many sequences are in the file. And yes, this is a small file and you probably could have opened it on a 10 year old computer, but use your imagination on this one. Very important: put the > in single quotes (‘>’)!!

grep '>' fasta_example.fas


>fid|47273465|locus|VBIADEClo150191_2639|
>fid|47286894|locus|VBIADEClo157809_4509|
>fid|47397394|locus|VBIAERSal180824_4606|
>fid|20665099|locus|VBIAciCap40988_0842|
>fid|20666571|locus|VBIAciCap40988_1572|
>fid|20668069|locus|VBIAciCap40988_2310|
>fid|46816701|locus|VBIAciSp155132_0285|
>fid|46817921|locus|VBIAciSp155132_0889|
>fid|38082798|locus|VBIAciSp155501_0265|
>fid|38086805|locus|VBIAciSp155501_2252|
>fid|38088330|locus|VBIAciSp155501_3002|
>fid|38088628|locus|VBIAciSp155501_3149|
>fid|38090581|locus|VBIAciSp155501_4107|
>fid|85865228|locus|VBIActMas228228_1098|
>fid|101184055|locus|VBIActMas238937_2668|

So grep found all the lines with the ‘>’, but that’s not exactly what we wanted: how do we count the lines? Now is a good time to introduce the manual entry (using the man command) available for all bash commands. The manual entry for grep is rather long; however, it contains the information we need (highlighted in red below).

man grep


GREP(1)                                                                GREP(1)

NAME
       grep, egrep, fgrep, rgrep – print lines matching a pattern


SYNOPSIS
       grep [OPTIONS] PATTERN [FILE…]
       grep [OPTIONS] [-e PATTERN | -f FILE] [FILE…]

…………….shortened for clarity…………………

  General Output Control

       -c, --count
              Suppress normal output; instead print a count of matching lines
              for each input file. With the -v, --invert-match option (see
              below), count non-matching lines. (-c is specified by POSIX.)

—————-end of output—————-

It looks like the ‘-c’ flag (or switch) will cause grep to count the lines that match the search string rather than just print them out – perfect! Before we try it out it’s worth taking a second to look at another important part of the manual that specifies how to use grep. BTW you can use either -c or --count; the double minus sign just specifies the longer format of the flag that is often a little more human readable.


The square brackets are not used as part of the command, they merely specify that whatever is in the brackets is OPTIONAL. So the minimum command is “grep pattern”, with the possibility of adding an OPTION flag/switch and a FILE to search. This time let’s add our optional flag and repeat the search; note that the command order fits the structure described in the manual (grep -> option -> pattern -> file).

grep -c '>' fasta_example.fasta

15
That’s more like it. But reading through the manual was a little painful; we know we want to count something, so why don’t we use grep to search through the manual entry for grep, to find a line that uses the string ‘count’. Circular enough for you? To do that we need to introduce what are called pipes (“|”); they are the bar symbol on the keyboard (SHIFT backslash on my computer). First the command, then the explanation.

man grep | grep 'count'


       -c, --count
              Suppress normal output; instead print a count of matching lines
              below), count non-matching lines. (-c is specified by POSIX.)

       -m NUM, --max-count=NUM
              --count option is also used, grep does not output a count

       Large repetition counts in the {n,m} construct may cause grep to use

Sweet as Bro! Pipes are a redirection; that is, rather than sending the output of a program to the screen (called “stdout”) or a file etc, we send it to another program. In this case we redirect the output from the man program to the grep program (that the manual entry is about grep is irrelevant to your stupid computer) so that grep can look for the string “count”.

A quick aside: To ‘single quote’, “double quote”, or not to quote, that is a good question! This will be covered a little later, but for now it is important to use single quotes whenever you want to specify a string of letter(s). We have already met one situation where this can cause problems, for example: grep > filename acts very differently to grep ‘>’ filename, as you will see shortly.

Since we are talking about pipes now is a good time to talk about another type of redirection. The > symbol means redirect the output from a command to a file, which it will overwrite if it already exists (>> will append the output onto the end of the file). So

grep -c '>' fasta_example1.fasta > tmp.txt

Will output nothing to the screen; rather, it will put the number 15 in a file called tmp.txt. You’ll see in a moment how useful this can be when dealing with sequence files. So what do you think grep > fasta_example1.fasta (without the single quotes around the > symbol) might do to your wonderful sequence file (hint: it won’t end well, so if you want to try it use some throwaway file)? Well, time to get back to some biology!
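If you ever want to double-check grep's answer, the same count is easy to reproduce in Python (a sketch of my own; pass it whatever file you are working on):

```python
def count_fasta_records(path):
    """Count FASTA records the same way `grep -c '>'` does."""
    with open(path) as fh:
        return sum(1 for line in fh if line.startswith('>'))
```

Calling count_fasta_records('fasta_example1.fasta') should agree with the grep -c result.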

Problem 2. I have this huge file of DNA sequences in FASTA format (>seq1\nATGTAA) and I want to split it in half (as best I can).

Solution 2. Let’s combine what we have learnt with a couple of new commands to solve this problem. Firstly, let’s just count the sequences to find out how many each file will have to include for the files to be split in two.

grep -c '>' fasta_example1.fasta

15
Well that’s not an even number, so let’s just put 7 in the first file and 8 in the second file. But first, two new commands: head and tail.

head fasta_example.fasta

—————-shortened output—————-


tail fasta_example.fasta

—————-shortened output—————-


Head by default outputs the first 10 lines of a file, whilst tail outputs the last 10 lines. You can change this default behaviour using the -n flag. Let’s start by finding out what the 8th sequence in the file is.

grep '>' fasta_example1.fasta | head -n 8


>fid|47273465|locus|VBIADEClo150191_2639|
>fid|47286894|locus|VBIADEClo157809_4509|
>fid|47397394|locus|VBIAERSal180824_4606|
>fid|20665099|locus|VBIAciCap40988_0842|
>fid|20666571|locus|VBIAciCap40988_1572|
>fid|20668069|locus|VBIAciCap40988_2310|
>fid|46816701|locus|VBIAciSp155132_0285|
>fid|46817921|locus|VBIAciSp155132_0889|


Normally the first part of this command just prints each line that contains the ‘>’ FASTA symbol, but here we piped the output from grep to head, using the -n flag to show only the first 8 lines. So what? Well, remember that your file might have 1000s of sequences; finding out which one is the 501st would be a little more tricky if you lived in a world without grep, head and tail. So now we know that “>fid|46817921|locus|VBIAciSp155132_0889|” is the first sequence that needs to go in our second file. Once again, after looking at the manual page for grep we see that the -n flag reports the line number of the pattern match.

grep -n '>fid|46817921|locus|VBIAciSp155132_0889|' fasta_example1.fasta

102:>fid|46817921|locus|VBIAciSp155132_0889|
So the second file needs to start at line 102. There are multiple ways of doing this. We could use another command, wc -l filename, which counts the number of lines in a file, then subtract 101 from that number and feed the result to tail, basically printing out the last x lines. However, rather than introducing a new command (too late), let's look at the man page for tail and see if we can do what we want that way.

man tail

—————-output shortened—————-

-n, --lines=K

             output the last K lines, instead of the last 10; or use -n +K to

             output lines starting with the Kth

Great, so all we need to do is pass tail the line number we want it to start from using the -n switch and the '+' symbol. Putting this all together:

head -n 101 fasta_example1.fasta > first_half.fasta
tail -n +102 fasta_example1.fasta > second_half.fasta

Let's quickly check to see if it worked.

grep '>' first_half.fasta

>fid|47273465|locus|VBIADEClo150191_2639|
>fid|47286894|locus|VBIADEClo157809_4509|
>fid|47397394|locus|VBIAERSal180824_4606|
>fid|20665099|locus|VBIAciCap40988_0842|
>fid|20666571|locus|VBIAciCap40988_1572|
>fid|20668069|locus|VBIAciCap40988_2310|
>fid|46816701|locus|VBIAciSp155132_0285|

grep '>' second_half.fasta

>fid|46817921|locus|VBIAciSp155132_0889|
>fid|38082798|locus|VBIAciSp155501_0265|
>fid|38086805|locus|VBIAciSp155501_2252|
>fid|38088330|locus|VBIAciSp155501_3002|
>fid|38088628|locus|VBIAciSp155501_3149|
>fid|38090581|locus|VBIAciSp155501_4107|
>fid|85865228|locus|VBIActMas228228_1098|
>fid|101184055|locus|VBIActMas238937_2668|

Perfect! 7 sequences in the first file and 8 in the second. As I said previously, this will work just as well with 100,000,000 sequences as it did in this small example!
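If your file is too big to eyeball, the same split can be scripted. Here is a minimal Python sketch (the function name and the toy records are mine, not from any package) that splits a list of FASTA lines in two by sequence count, exactly as head and tail did above:

```python
def split_fasta(lines):
    """Split FASTA lines into two halves by sequence count.

    With an odd number of sequences the first half gets one fewer,
    matching the 7/8 split above.
    """
    # indices of the lines where each sequence starts
    starts = [i for i, line in enumerate(lines) if line.startswith(">")]
    cut = starts[len(starts) // 2]  # first line of the second half
    return lines[:cut], lines[cut:]

# toy example: three one-line sequences -> 1 in the first half, 2 in the second
first, second = split_fasta([">s1", "ATG", ">s2", "CCC", ">s3", "GGG"])
```

Unlike the head/tail approach, this doesn't need a separate grep -n step to find the cut line first.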

How to use DESeq to analyse RNAseq data


Several packages have been written to measure gene expression differences using RNAseq data. In this post I'm going to do a run-through of the popular R-based DESeq package. Basically, I'm just following along with the package's vignette, which is really well written and informative! For the test data I'm using 2 treatments, with 3 biological reps, each containing 750K reads (yes, it's small).

A. Counting reads

DESeq uses a table of raw counts as the basis of DE calls. Transformed counts, such as FPKM values, will not work with this package. The top row of the table can be a descriptor; these can be imported into R to act as column labels. Gene names should be column 1 (remember R counting starts from 1 rather than 0), and each column after that should be the raw counts for one biological replicate of one treatment. Note that if technical reps are performed, these should be merged by summing the two runs (they will be normalised by read count later).

Here I used BWA to align the reads to my reference and generate SAM files that are then processed by htseq-count. Note that you have to run this on each sample, so in this example I will generate 6 SAM files, one for each replicate of each treatment. Also note that I used the intersection-strict mode for HTSeq.

bwa aln sarcop_pep_rnaseqfiltered.fa Sample_ST72-1/ST72-1.subset.fastq > ST72-1.subset.sai

bwa samse sarcop_pep_rnaseqfiltered.fa ST72-1.subset.sai Sample_ST72-1/ST72-1.subset.fastq > ST72-1.subset.sam

htseq-count -m intersection-strict --stranded=no ST72-1.subset.sam sarcop_pep_rnaseqfiltered.gtf > ST72-1.subset.counts

In the above, ST72-1.subset.counts is a table of counts that can be merged with the other count data before being imported into DESeq as an R data table. Here is a python script for merging; see the header of the script for usage, if that helps. For a fully detailed explanation of HTSeq, see its documentation.
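If you don't want to use the script, the core merging logic is simple enough to sketch in a few lines of Python (sample names here are hypothetical; I assume each htseq-count output has already been parsed into a gene-to-count dict):

```python
def merge_counts(tables):
    """Merge per-sample {gene: count} dicts into tab-separated rows.

    `tables` maps sample name -> {gene: raw count}; a gene absent from
    a sample is reported as 0.
    """
    samples = sorted(tables)
    genes = sorted(set().union(*(set(t) for t in tables.values())))
    rows = ["gene\t" + "\t".join(samples)]  # header row of column labels
    for g in genes:
        rows.append(g + "\t" + "\t".join(str(tables[s].get(g, 0)) for s in samples))
    return rows
```

The header row and leading gene column match the table layout described above, so the result can be read straight into R with read.delim.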

B. R DESeq package 

Enter the R environment from a terminal opened in the directory where your data is.



>library( DESeq )

>countsTable <- read.delim("merged_counts.txt", header=TRUE)

>rownames(countsTable) <- countsTable$gene

>countsTable <- countsTable[,-1]

This imports the DESeq library and reads the tab-delimited file created by the python script into a data table. The next two lines create row names from the gene column created by the script; if you use some other method to make the counts, you need to change the name of the gene column to whatever it is. Finally, we get rid of the column used to make the row names. My table looks like this:

> head( countsTable)

                  UN72_1 UN72_2 UN72_3 ST72_1 ST72_2 ST72_3

GWD8ZKM01ESPR2          0      0      0      0      0      0

g10671t00001            2      1      1      0      0      0

g16789t00001            0      0      0      0      0      0

Now we need to add the treatment information so that DESeq knows what data each column represents, order is obviously important. 

>conds <- factor( c( "untreated", "untreated", "untreated", "treated", "treated", "treated" ) )

> cds <- newCountDataSet( countsTable, conds )

This instantiates a CountDataSet, which is the central data structure in DESeq. The next step adjusts for different library sizes among the samples using something called "size factors". DESeq only needs relative library sizes, so if (on average) the counts for non-DE genes in one library are twice those of a second library, its size factor should be twice as large.

 >cds <- estimateSizeFactors( cds )

 To see these size factors, do this:

 >sizeFactors( cds )

UN72_1    UN72_2    UN72_3    ST72_1    ST72_2    ST72_3

0.9805609 1.0420015 0.9164864 1.1083583 0.9599370 1.1462168

The way this works is that you take the geometric mean of each gene's counts across samples and use this as that gene's reference value. Then, for each sample, you divide each gene's count by its reference value to create a list of ratios (quotients, i.e. results of the division) for that sample. The size factor for a sample is the median of its list; as there is a list per sample, this generates one size factor per sample. The counts are normalised by dividing each column by its size factor. Let's look before and after:
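That description is easier to follow in code. Here is a small Python sketch of the median-of-ratios idea (my own illustration of the calculation, not DESeq's actual implementation, which handles edge cases differently):

```python
import math

def size_factors(samples):
    """samples: list of per-sample count lists, one entry per gene.

    Reference for each gene = geometric mean of its counts across samples
    (genes with any zero count are skipped); each sample's size factor is
    the median of its count/reference ratios.
    """
    n_genes = len(samples[0])
    # per-gene geometric means; None marks genes we skip
    ref = []
    for g in range(n_genes):
        vals = [s[g] for s in samples]
        if all(v > 0 for v in vals):
            ref.append(math.exp(sum(math.log(v) for v in vals) / len(vals)))
        else:
            ref.append(None)
    factors = []
    for s in samples:
        ratios = sorted(s[g] / ref[g] for g in range(n_genes) if ref[g] is not None)
        mid = len(ratios) // 2
        median = ratios[mid] if len(ratios) % 2 else 0.5 * (ratios[mid - 1] + ratios[mid])
        factors.append(median)
    return factors
```

With two samples where one has exactly twice the counts of the other, the factors come out around 0.71 and 1.41 (their product is 1), illustrating that DESeq estimates relative, not absolute, library sizes.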

 > head(counts(cds))

                  UN72_1 UN72_2 UN72_3 ST72_1 ST72_2 ST72_3

GWD8ZKM01ESPR2          0      0      0      0      0      0

g10671t00001            2      1      1      0      0      0

> head(counts(cds,normalized=TRUE))

                    UN72_1    UN72_2   UN72_3 ST72_1 ST72_2 ST72_3

GWD8ZKM01ESPR2     0.000000 0.0000000 0.000000      0      0      0

g10671t00001       2.039649 0.9596916 1.091124      0      0      0

g16789t00001       0.000000 0.0000000 0.000000      0      0      0

The next part is key to how DESeq calculates significance in DE. The inference in DESeq relies on an estimation of the typical relationship between the data's variance and their mean, or, equivalently, between the data's dispersion and their mean. The dispersion can be understood as the square of the coefficient of biological variation. The variance between counts is the sum of two components: firstly, the level of variation between replicates of each treatment, and secondly, an uncertainty measure based on the concentration of the counts. This latter component takes into account the level of noise predicted for lowly expressed genes based on a Poisson distribution (see the DESeq manual for more detail).

>cds <- estimateDispersions( cds ) 

Now we write a little function that will plot this.

>plotDispEsts <- function( cds ){
    plot(rowMeans( counts( cds, normalized=TRUE ) ),
         fitInfo(cds)$perGeneDispEsts, pch = '.', log="xy",
         ylab="dispersion", xlab="mean of normalized counts")
    xg = 10^seq( -.5, 5, length.out=300 )
    lines( xg, fitInfo(cds)$dispFun( xg ), col="red" )
}

> jpeg("DispEsts_DESeq.jpg")

> plotDispEsts(cds)

> dev.off()


The above commands define a plotting function, then pass cds to that function and save the result in a jpg file.

The empirical (real) dispersion values (black dots) and fitted values (red line)

Remember that the level of dispersion is related to the biological variation seen in each treatment. Note this is a double log graph (ahhh, my brain hurts), but look carefully at what is happening on the Y axis: as read count increases, dispersion decreases, as we would expect. However, a lot of those black dot values below the red fitted line are probably underestimates of dispersion, given the small sample sizes used (only 3 replicate values measured in this example). So, to be conservative, DESeq moves all the values below the red line up to the fitted value, BUT keeps all the empirical values above the red line, even if some of these are overestimates of dispersion; as I said, it is being conservative. Why does dispersion matter? The more dispersion, or biological variation, the bigger the difference in counts between treatments has to be before it becomes significant!
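The conservative rule is literally a per-gene maximum of the empirical and fitted values; a toy Python sketch of the idea (not DESeq's internals):

```python
def conservative_dispersions(empirical, fitted):
    """Raise empirical dispersion estimates below the fitted curve up to
    the curve, but keep estimates that already sit above it."""
    return [max(e, f) for e, f in zip(empirical, fitted)]

# the underestimate 0.1 is raised to the fitted 0.3; the 0.5 is kept
adjusted = conservative_dispersions([0.1, 0.5], [0.3, 0.3])
```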

Finally we are ready to call DE values!

>res = nbinomTest( cds, "untreated", "treated" )


                 id  baseMean baseMeanA baseMeanB foldChange log2FoldChange

1     GWD8ZKM01ESPR2 0.0000000 0.0000000         0        NaN            NaN

2       g10671t00001 0.6817440 1.3634880         0          0           -Inf

3       g16789t00001 0.0000000 0.0000000         0        NaN            NaN

      pval padj

1        NA   NA

2 0.2083517    1

3        NA   NA

Now let's plot the log2 fold changes (untreated versus treated) against the mean normalised counts, colouring red those genes that are significant at a 10% FDR.


> plotMA(res)



Note how lowly expressed genes need a bigger log2 fold change to be coloured red (not so obvious in this data due to the low number of reads used in mapping).

Let's explore this a little more by comparing the normalised counts between the three replicates of just the untreated sample.

>ncu = counts( cds, normalized=TRUE )[ , conditions(cds)=="untreated" ]


>plotMA(data.frame(baseMean = rowMeans(ncu),
        log2FoldChange = log2( ncu[,2] / ncu[,1] )),
        col = "black")



Wow, look at all the apparent DE within a single treatment! We should call a gene DE only if the change between treated and untreated samples is stronger than what we see between replicates; hence, the dividing line between red and black in the first figure mimics that in the second.

And now a histogram of p values to look at their distribution.

> jpeg("hist_pval.jpg")

> hist(res$pval, breaks=100, col="skyblue", border="slateblue", main="")

> dev.off()



The pileup of very low p-values is the DE genes; the pileups at the right-hand side of the distribution are from very low count genes that are given discrete values.

Now let's create some subset tables based around an FDR cut-off (in this case 10%).

>resSig = res[ res$padj < 0.1, ]


Before we make any mistakes, lets save the data

>write.csv( res, file="72hour_unstgVstg_all.csv" )

>write.csv( resSig, file="72hour_unstgVstg_only_0.1FDR.csv" )

Now that we are done with the basic analysis, we can do some more sophisticated filtering of the data. The main goal is to remove tests that have little chance of producing a meaningful result, with the aim of keeping the FDR in check. This results in an increased power to detect significant results whilst keeping the same level of type I error (the false positive rate).

>rs = rowSums ( counts ( cds))

>theta = 0.4

>use = (rs > quantile(rs, probs=theta))




>table(use)

FALSE  TRUE
30145 36029

Basically, here we are summing the counts across all conditions for each gene, then removing the genes in the lowest 40% quantile (i.e. filtering the genes flagged FALSE above out of the original normalised cds table).
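In plain Python the same filter looks like this (a rough sketch of rs > quantile(rs, probs=theta); note R's quantile() interpolates between order statistics, which this deliberately skips):

```python
def keep_above_quantile(row_sums, theta=0.4):
    """Flag genes whose total count across all samples is above the
    theta quantile of all the totals."""
    # nearest-rank quantile of the sorted totals
    cutoff = sorted(row_sums)[int(theta * (len(row_sums) - 1))]
    return [rs > cutoff for rs in row_sums]
```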

>cdsFilt = cds[ use, ]

Now just repeat the tests on the filtered data

>resFilt = nbinomTest( cdsFilt, "untreated", "treated" )

Now lets look at the difference this makes to the FDR.

>tabFull <- table(res$padj < 0.1)

>tabFilt <- table(resFilt$padj < 0.1)

> tabFull

FALSE  TRUE
35266   763

> tabFilt

FALSE  TRUE
35266   763

In this case it didn't seem to make much difference, because most of the dropped values were NA! Just to visualise what is going on here, let's plot the unadjusted p-values of the ~40% quantile that we filtered.


> plot(rank(rs)/length(rs), -log10(res$pval), pch=16, cex=0.45)



Note that very few p-values are less than 0.003 (this corresponds to about 2.5 on the -log10 scale), and this is demonstrated when we plot the p-values of each dataset.

> h1 = hist(res$pval, breaks=50, plot=FALSE)

> h2 = hist(resFilt$pval, breaks=50, plot=FALSE)

> colori <- c('do not pass'="khaki", 'pass'="powderblue")

> barplot(height = rbind(h1$counts, h2$counts), beside = FALSE, col = colori,
+ space = 0, main = "", ylab="frequency")

> text(x = c(0, length(h1$counts)), y = 0, label = paste(c(0,1)), adj = c(0.5,1.7), xpd=NA)

> legend("topright", fill=rev(colori), legend=rev(names(colori)))

> dev.copy(png,'hist_filthist.png')

> dev.off()



Now we are going to process the data further. Since some counts can be zero, in some applications it makes sense to add pseudocounts (i.e. y = log2(n + n0), where n0 is a small 'chosen' constant). The question is what value of small number you add to the zero counts to make them non-zero (but keep them very small). DESeq uses an alternative approach.
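For comparison, the naive pseudocount transform is a one-liner; a sketch with n0 = 1 (the usual arbitrary choice):

```python
import math

def shifted_log2(counts, n0=1):
    """The shifted logarithm y = log2(n + n0): zeros stay finite, but the
    choice of n0 strongly affects low counts, which is the problem the
    variance stabilising transformation is designed to avoid."""
    return [math.log2(n + n0) for n in counts]

# zeros map to 0.0 rather than -Inf
transformed = shifted_log2([0, 1, 3])
```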

>cdsBlind = estimateDispersions( cds, method="blind" )

>vsd = varianceStabilizingTransformation( cdsBlind )

Now we will plot the per-gene standard deviation (taken across samples) against the rank of the mean: first using the shifted logarithm transformation (with 1 added to all values to remove zeros; notAllZero simply selects rows where at least one number is non-zero), then using DESeq's variance stabilising transformation. While we are meant to see a constant standard deviation along the whole dynamic range for the DESeq method, with the shifted logarithm giving a highly elevated standard deviation in the lower count range, in fact they both look pretty similar for this small dataset.




>notAllZero = (rowSums(counts(cds))>0)

>meanSdPlot(log2(counts(cds)[notAllZero, ] + 1), ylim = c(0,2.5))

>meanSdPlot(vsd[notAllZero, ], ylim = c(0,2.5))

Now to deal with moderate counts that generate highly variable log fold change values, which in some cases become infinite. This will mess up any kind of clustering we may want to perform to identify common expression patterns. We can compare the ordinary log-ratios (lfc) against the moderated log-ratios (mod_lfc) using a scatterplot, with the genes binned by their log10 mean to colour them according to expression strength.

>mod_lfc = (rowMeans( exprs(vsd)[, conditions(cds)=="treated", drop=FALSE] ) - rowMeans( exprs(vsd)[, conditions(cds)=="untreated", drop=FALSE] ))

> lfc = res$log2FoldChange

> table(lfc[!is.finite(lfc)], useNA="always")

-Inf   Inf   NaN  <NA>

5085  4900 30145     0

Many of these are infinite (Inf, division of a finite value by 0) or not a number (NaN, resulting from division of 0 by 0).

> logdecade = 1 + round( log10( 1+rowMeans(counts(cdsBlind, normalized=TRUE)) ) )

> lfccol = colorRampPalette( c( "gray", "blue" ) )(6)[logdecade]

> ymax = 4.5

> plot( pmax(-ymax, pmin(ymax, lfc)), mod_lfc,
    xlab = "ordinary log-ratio", ylab = "moderated log-ratio",
    cex=0.45, asp=1, col = lfccol,
    pch = ifelse(lfc<(-ymax), 60, ifelse(lfc>ymax, 62, 16)))

> abline( a=0, b=1, col="red3")

Here grey to blue represents weakly to strongly expressed genes. For highly expressed genes there is general agreement (along the diagonal), but for lowly expressed genes the moderated log-ratio is shrunk towards 0 compared to the ordinary log-ratio (the grey smear flattens out along the horizontal at 0 on the y-axis).


Now let's look at heat maps of counts to see any patterns, using the transformed log data.



> cdsBlind = estimateDispersions(cds, method="blind")

> vsd = varianceStabilizingTransformation(cdsBlind)

>select = order(rowMeans(counts(cds)), decreasing=TRUE)[1:30]

>hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100)

>heatmap.2(exprs(vsd)[select,], col = hmcol, trace="none", margin=c(10, 6))


and for a laugh the untransformed data.


Now for a heat map of the sample-to-sample distances.

>dists = dist( t( exprs(vsd) ) )

>mat = as.matrix( dists )

> rownames(mat) = colnames(mat) = with(pData(cdsBlind), paste(condition))

> heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13))


No shocks here again. Of note is that we used the method "blind", so the variance stabilising transformation is not informed about the design, to prevent bias in the outcome. In comparison, the pooled method (the default for estimating the dispersion) "means that one dispersion is computed for each gene, which is now an average over all cells (weighted by the number of samples for each cell), where the term cell denotes any of the four combinations of factor levels of the design."

Now a good olde PCA, pretty cool really.

>print(plotPCA(vsd, intgroup=c("condition")))


This type of plot is useful for visualising the overall effect of experimental covariates and batch effects, so you might expect known conditions to group, i.e. treatment and any type of library prep effect (paired-end reads versus single-end data). Good to see that even with this subset of data the biggest component is treatment (PC1).


And finally, the sessionInfo() for the record.

> sessionInfo()

R version 2.15.3 (2013-03-01)

Platform: x86_64-pc-linux-gnu (64-bit)



[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8


[7] LC_PAPER=C                 LC_NAME=C

[9] LC_ADDRESS=C               LC_TELEPHONE=C


attached base packages:

[1] grid      stats     graphics  grDevices utils     datasets  methods

[8] base


other attached packages:

[1] arrayQualityMetrics_3.14.0 gplots_2.10.1

[3] KernSmooth_2.23-10         caTools_1.14

[5] gdata_2.12.0.2             gtools_2.7.1

[7] RColorBrewer_1.0-5         vsn_3.26.0

[9] BiocInstaller_1.8.3        DESeq_1.10.1

[11] lattice_0.20-15            locfit_1.5-9

[13] Biobase_2.18.0             BiocGenerics_0.4.0


loaded via a namespace (and not attached):

[1] affy_1.36.1           affyio_1.26.0         affyPLM_1.34.0

[4] annotate_1.36.0       AnnotationDbi_1.20.7  beadarray_2.8.1

[7] BeadDataPackR_1.10.0  Biostrings_2.26.3     bitops_1.0-5

[10] Cairo_1.5-2           cluster_1.14.4        colorspace_1.2-1

[13] DBI_0.2-5             gcrma_2.30.0          genefilter_1.40.0

[16] geneplotter_1.36.0    Hmisc_3.10-1          hwriter_1.3

[19] IRanges_1.16.6        latticeExtra_0.6-24   limma_3.14.4

[22] parallel_2.15.3       plyr_1.8              preprocessCore_1.20.0

[25] reshape2_1.2.2        RSQLite_0.11.2        setRNG_2011.11-2

[28] splines_2.15.3        stats4_2.15.3         stringr_0.6.2

[31] survival_2.37-4       SVGAnnotation_0.93-1  tcltk_2.15.3

[34] tools_2.15.3          XML_3.6-2             xtable_1.7-1

[37] zlibbioc_1.4.0


Illumina quality control

Several tools are available for investigating the quality of your sequencing data, and these should be used before any downstream analysis; after all, crap in, crap out.

The first tool, written in Java and called FastQC, has a simple GUI and can also be used on the command line to generate a number of useful graphs describing several important features of your sequence data. These include (among others) per base quality and read length distributions, as well as detection of sequence duplications and kmer content. Handily, the authors also provide some good and bad Illumina data that you can use to test out the software.

Running is easy, either use the GUI or just type: ./fastqc seq_file_name

The --help flag will list the other options. Format detection is automatic and worked for me, so probably the only option you might consider is '-t' to set the number of cores to use.


As you can see from the html report produced by FastQC, a summary is provided indicating whether your data meets certain quality expectations (these seem to be very strict).

Here I have run the good (left) and bad (right) test data-sets and shown them side by side for comparison. Edit: These data-sets come from the SolexaQA package that I’ll talk about in another blog.


These box whisker plots of quality per base show how Illumina data degrades at the 3′ end of the read. The median value is the red line, the box (yellow) is the inter-quartile range (25-75%), the upper/lower whiskers represent the 10% and 90% points, and the blue line is the mean quality. Good quality scores (green), reasonable (orange) and poor (red) are indicated by the background shading. A warning will be issued if any base is less than 10, or if the median for any base is less than 25.


These plots show the quality score distribution over all sequences. The FastQC manual says that even in good data a small proportion of sequences may have low quality scores, often because they were poorly imaged.


Nothing dramatic here; these are the sequence content percentages of each nucleotide across the reads. The lines should be horizontal, reflecting the underlying GC%.


The sequence duplication level will reveal any problems during library preparation. Note that any duplicates that occur more than 10 times are lumped into the 10+ category, so if you see a pile-up there you have problems. Note also that some duplication may occur if there is very high coverage.


The per base N content should be less than 3%.


The GC content distribution of the data (red) against a theoretical modelled distribution (blue). A GC distribution shifted away from the modelled normal distribution indicates a systematic bias that is independent of base position, probably the result of contamination in the library or some kind of biased subset of sequences.


The line should be horizontal in a random library; GC bias that changes at different bases can indicate an over-represented sequence in your library.


In most cases the sequence length distribution should be uninteresting (at least for Illumina data), but if any kind of trimming has occurred it will be revealed here.

Using PAML to detect pairwise dNdS ratios

Caveat: Many of the intricacies of Molecular evolution are fuzzy to me!

Intro: Due to the triplet code, changes in the third codon position are mostly silent (synonymous), in that they do not result in a change to the encoded amino acid. In contrast, changes that occur in first and second codon positions nearly always result in an amino acid substitution (non-synonymous). Non-synonymous changes are rare, as they mostly cause deleterious mutations that are removed from the population through purifying selection. However, in cases where positive selection is acting on a protein sequence, accumulation of the changes that lead to an adaptive shift in protein function tends to result in an excess of non-synonymous mutations. This means the ratio of non-silent to silent changes, as a proportion of each kind of site in the sequence, can be used as a measure of adaptive evolution. More specifically, a dNdS ratio of greater than 1 is used to indicate that a protein is or has been subject to positive selection. Here I'll describe some of the important parameters when using arguably the most popular method for detecting selection based on dNdS in coding genes, that being the PAML package.

Preliminaries: The predominant program that we will use is called codeml, which (for the most part) takes an alignment of homologous coding regions. The most important aspect of the alignment is that the codon positions are correctly aligned. Codeml makes calculations based on the nucleotide positions in the coding triplet, so the open reading frame in the DNA needs to be maintained if the alignment contains indels or missing bases. The best way to guarantee this outcome is to constrain the DNA alignment based on a protein alignment (i.e. any gap '-' in the protein should be inserted as '---' in the DNA so as not to disrupt the reading frame), and in a sense any phylogeny based on coding DNA sequences should always be based on such an alignment (evolution acts at the level of the protein, after all). A tool for doing this is called transalign, but be warned that (1) your protein and DNA sequences need to be completely complementary (DNA must equal protein, no partial or extra triplets etc), (2) they need to share the same names, and (3) they must be in the same order.
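The gap-threading step that transalign performs can be sketched in a few lines of Python (illustrative only; the real tool also validates the names, ordering and lengths listed above):

```python
def thread_dna(gapped_protein, dna):
    """Impose a gapped protein alignment onto its ungapped coding DNA:
    each amino acid pulls the next codon, each '-' becomes '---', so the
    reading frame is never disrupted."""
    out, pos = [], 0
    for aa in gapped_protein:
        if aa == "-":
            out.append("---")
        else:
            out.append(dna[pos:pos + 3])
            pos += 3
    return "".join(out)

# a one-residue gap in the protein becomes a codon-sized gap in the DNA
aligned_dna = thread_dna("M-K", "ATGAAA")
```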

Pairwise dNdS analysis: Codeml reads a control file that contains the run parameters, typically this file is called codeml.ctl, the basic file for doing pairwise dNdS is below.

seqfile = seqfile.txt   * sequence data filename
outfile = results.txt   * main result file name
treefile = treefile.txt *tree file
noisy = 9      * 0,1,2,3,9: how much rubbish on the screen
verbose = 1      * 1:detailed output
runmode = -2     * -2:pairwise
seqtype = 1      * 1:codons
CodonFreq = 0      * 0:equal, 1:F1X4, 2:F3X4, 3:F61
model = 0      *
NSsites = 0      *
icode = 0      * 0:universal code
fix_kappa = 1      * 1:kappa fixed, 0:kappa to be estimated
kappa = 1      * initial or fixed kappa
fix_omega = 0      * 1:omega fixed, 0:omega to be estimated
omega = 0.5    * initial omega value

I've highlighted some important parameters in red. "seqfile" and "outfile" are self-explanatory, except to say that the infile needs to be in a "typical" format; I use fasta because it is what I use(-;. The "runmode" is set to -2 to indicate that you are doing pairwise dNdS. "CodonFreq" and "kappa" incorporate parameters into the model that account for biologically relevant aspects of protein evolution. CodonFreq controls the codon frequency parameter of the model: setting it to 0 implies that all codons are equally likely, 1 assigns the values based on the average nucleotide frequencies (F1X4), 2 assigns them based on the average nucleotide frequencies at each codon position (F3X4), and 3 allows the parameters to be estimated freely (F61). All codons are generally not equally likely, due in part to differing tRNA availability, and ignoring these processes can lead to biases in the estimated dNdS. Kappa is the transition to transversion ratio. The chemical properties of DNA mutation create a bias toward transitions, and as transitions at third codon positions are more likely to be silent than transversions, ignoring this parameter nearly always results in an overestimation of dS and thus a corresponding underestimation of dNdS. With fix_kappa = 1 and kappa = 1, the transition bias is ignored, while fix_kappa = 0 allows kappa to be estimated from the data. With reference to these parameters, counting models such as NG86 (Nei and Gojobori 1986) can be mirrored by setting kappa = 1 and CodonFreq = 0. Alternative settings for these values come at a cost, in that they increase the number of parameters that need to be estimated; however, their incorporation into the model makes for a more biologically relevant estimation of dNdS.
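As a reminder of what kappa is measuring: classifying a substitution as a transition or transversion just means checking whether the two bases belong to the same chemical class. A tiny Python sketch:

```python
PURINES = {"A", "G"}  # the pyrimidines are C and T

def is_transition(base1, base2):
    """True for transitions (A<->G, C<->T), False for transversions."""
    if base1 == base2:
        raise ValueError("not a substitution")
    # same class (both purines or both pyrimidines) -> transition
    return (base1 in PURINES) == (base2 in PURINES)

# kappa > 1 reflects an excess of changes like A->G over changes like A->C
```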

Final notes: It should be noted that these estimates of dNdS are not very sensitive for detecting positive selection. This is for two main reasons: firstly, many amino acids are structurally important for the protein's core function and are thus nearly invariant, meaning positive selection acts on a relatively small number of codons; secondly, selection tends to be episodic (it occurs in spurts). This means it is difficult to detect positive selection by just taking the average dNdS across the entire protein from a single pairwise comparison. To overcome this, methods have been developed that detect changes in dNdS across a phylogeny derived from a multiple sequence alignment, and I'll focus on these methods in the next edition!

I've been very slack with references for this post because I wanted to focus on the core considerations of using codeml; however, most of the above is based on the excellent coverage of this topic in: Maximum likelihood methods for detecting adaptive protein evolution

Bielawski, JP and Yang, Z (2005) Maximum likelihood methods for detecting adaptive protein evolution. In: Nielsen, R, (ed.) Statistical Methods in Molecular Evolution. (103 – 124). Springer-Verlag: New York.