SAM/BAM files explained


Let’s face it, there aren’t many HTS pipelines that don’t touch a SAM/BAM file somewhere along the way. The name SAM comes from the acronym Sequence Alignment/Map format, and SAM files are really nothing more than tab-delimited text files describing mapping information for short-read data. Of course, with the millions of reads generated by next-gen sequencers these days, the files become really, really big. That is where BAM files come in: they contain the same information as SAM files but are encoded in a condensed binary (machine-readable) format to save disk space.

If you open up a SAM file in a text editor (or use “head” in bash), the first part of the file typically contains lines beginning with the at (@) symbol, which represent the “header”. The header contains general information about the alignment, such as the reference name, reference length, MD5 checksum, the program used for the alignment, and so on. Every line after the header represents a single read, with 11 mandatory tab-separated fields of information. Let’s see how this works with a really simple example.

What I’ve done is map four ‘pairs’ of reads to the alcohol dehydrogenase gene (my favourite gene) using bowtie2. The figure below shows the stranded nature of the read pairs as well as where they map on the gene.


I have set this up so that the reads fall into these three groupings.

  1. Two pairs of high quality reads that concordantly map
  2. A read pair where only one read is high quality
  3. An orphan read (the R1 read is only 1 nt long, e.g. after trimming)

The next figure describes each of the mapping scenarios using the colour encoding. The red set of reads behaves as we would expect for paired-end data.

Where the reads map and the different read scenarios

For the alignment I’m going to use bowtie2; all of the sequences are here if you want to try this yourself.

#index the reference and specify the prefix as reference
bowtie2-build reference.fa reference

#map the reads
bowtie2 -x reference -1 test_R1.fq -2 test_R2.fq > adh.sam

####bowtie output####
Warning: skipping mate #1 of read 'HWI-7001326F:39:C7N3UANXX:1:1101:10000:62296/1' because length (1) <= # seed mismatches (0)
Warning: skipping mate #1 of read 'HWI-7001326F:39:C7N3UANXX:1:1101:10000:62296/1' because it was < 2 characters long
4 reads; of these:
  4 (100.00%) were paired; of these:
    1 (25.00%) aligned concordantly 0 times
    3 (75.00%) aligned concordantly exactly 1 time
    0 (0.00%) aligned concordantly >1 times
    1 pairs aligned concordantly 0 times; of these:
      0 (0.00%) aligned discordantly 1 time
    1 pairs aligned 0 times concordantly or discordantly; of these:
      2 mates make up the pairs; of these:
        1 (50.00%) aligned 0 times
        1 (50.00%) aligned exactly 1 time
        0 (0.00%) aligned >1 times
87.50% overall alignment rate

The “adh.sam” file can be opened with any text editor, but for most of this post I’m going to use a spreadsheet version, just to work easily with all the columns. Note the header lines indicated with the “@”. In this case the @SQ line relates to the reference and @PG to the program parameters. Each header line has fields in the format tag:value. For example, the @SQ line has SN (reference sequence name) and LN (reference sequence length). A full set of these header tags can be found in the SAM spec.

@HD	VN:1.0	SO:unsorted
@SQ	SN:gi|568815594:c99321442-99306387	LN:1890
@PG	ID:bowtie2	PN:bowtie2	VN:2.1.0
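The tag:value layout above is easy to pull apart programmatically. Here’s a minimal Python sketch (my own illustration, not part of any SAM tooling) that splits a header line into its record type and a dict of tags:

```python
def parse_header_line(line):
    """Split a SAM header line (e.g. '@SQ<tab>SN:ref<tab>LN:1890') into record type and tags."""
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0]                            # e.g. '@SQ'
    tags = dict(f.split(":", 1) for f in fields[1:])   # maxsplit=1: values may contain ':'
    return record_type, tags

record_type, tags = parse_header_line("@SQ\tSN:gi|568815594:c99321442-99306387\tLN:1890")
# record_type is '@SQ'; tags['LN'] is '1890'
```

Note the maxsplit of 1 in the tag split: the SN value above contains colons of its own.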

Below the header each read is represented by a line; it is worth noting that even read “62296/1”, which shouldn’t map since it is only 1 nt long, still gets a line. Two other things to note. Firstly, the reads can be ordered by reference coordinate, which is the default Tophat behaviour. Alternatively, they can be listed in pairs (i.e. sorted by read name). Samtools can switch between the two orderings using the “sort” command. Why does this matter? Some software needs to know how the reads are sorted. For example, htseq-count expects reads ordered in pairs, because in a coordinate-sorted file the two mates of a pair can sit far apart, and finding them in the file increases memory requirements.
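To make the two orderings concrete, here’s a toy Python sketch (hypothetical read names and positions, not the adh data): coordinate order sorts on reference and position, while “pairs” order sorts on read name so mates sit next to each other.

```python
# toy SAM-like records: (QNAME, RNAME, POS) for two read pairs
records = [
    ("read_B", "adh", 350),
    ("read_A", "adh", 100),
    ("read_B", "adh", 120),
    ("read_A", "adh", 400),
]

# coordinate order: mates of the same pair can end up far apart in the file
by_coord = sorted(records, key=lambda r: (r[1], r[2]))

# name order: each pair's mates are adjacent (what htseq-count wants)
by_name = sorted(records, key=lambda r: r[0])
```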

OK, the first four columns. The QNAME (read name) and RNAME (reference name) are pretty self-explanatory. The FLAG uses information-dense bit-wise encoding containing information on the alignment, such as the strand, whether the pair mapped, and whether this is the first or second read of the pair. If you’re dumb like me you’ll be thankful that the Broad has a nifty website that translates these numbers into plain English. For example, 83 represents “read paired”, “read mapped in proper pair”, “read reverse strand”, “first in pair”. The 163 is “read paired”, “read mapped in proper pair”, “mate reverse strand” and “second in pair”. I think that makes sense. For reference, the other two numbers are 137 (read paired, mate unmapped, second in pair) and 69 (read paired, read unmapped, first in pair). Very cool. The final column here, POS, corresponds to the alignment coordinates. We can confirm the reads are ordered R1 then R2 based on the mapping coordinates (see the figure at the top of the page).
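If you’d rather not rely on a website, the bit-wise decoding is only a few lines of Python (a sketch using the standard flag bits from the SAM specification):

```python
# the low SAM FLAG bits, as defined in the SAM specification
FLAG_BITS = {
    1: "read paired",
    2: "read mapped in proper pair",
    4: "read unmapped",
    8: "mate unmapped",
    16: "read reverse strand",
    32: "mate reverse strand",
    64: "first in pair",
    128: "second in pair",
}

def decode_flag(flag):
    """Return the plain-English meaning of each bit set in a SAM FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

decode_flag(83)  # ['read paired', 'read mapped in proper pair', 'read reverse strand', 'first in pair']
```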



The next fields worth looking at are MAPQ and the CIGAR string. The mapping quality is a log-transformed probability of mapping to the wrong position. Interestingly, the low-quality read (fastq quality “$”) has the same mapping score as the other high-quality reads. I gather this is because the read really can only go one place when there is a 100% match and the reference is as short as this. The CIGAR string here is “50M”: the 50 indicates the length of the alignment and the “M” means “alignment match”. The “*” indicates that no information is available. PNEXT is the position of the mate and TLEN is the observed template length; in other words, the length of the sequenced fragment represented by the pair.
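Both fields are easy to work with in code. As a sketch (my own helpers, not part of any aligner): MAPQ is defined as -10*log10(P(wrong position)), so it inverts back to a probability, and a CIGAR string is just a run of length/operation pairs.

```python
import re

def mapq_to_error_prob(mapq):
    """Invert MAPQ = -10*log10(P) back to the probability of a wrong mapping position."""
    return 10 ** (-mapq / 10)

def parse_cigar(cigar):
    """Break a CIGAR string like '50M' or '5S45M' into (length, operation) tuples."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

mapq_to_error_prob(42)  # ~6.3e-05, i.e. about a 1-in-16000 chance the position is wrong
parse_cigar("50M")      # [(50, 'M')]
```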


Finally, we have sets of optional tags that provide other information on the alignment, such as the number of ambiguous bases or gaps. Check out the documentation of your aligner to find out what each tag means if you are interested.





Make an Android app in minutes with App Inventor 2


App Inventor 2 (AI2) is a really cool bit of software from Google and [now] MIT. It contains two GUI tools that make creating Android apps a “relatively” easy task. The first tool is a designer that helps create your screen layouts, pretty much like the layout designers that come with the Android SDK. The really cool and unique part is the “blocks” designer, which helps you create code by moving object blocks around (rather than typing Java). These GUIs dramatically speed up development, while also providing an excellent framework to teach people how to code in a fun and accessible way (i.e. no language syntax).

The following is a quick run-through of a very simple conference app that I’m using for a demo at our teaching symposium. The goal is to do this demo in 10 minutes, so really this is bare-bones stuff! If you want more, I highly recommend the book App Inventor, edition 2, which covers AI2 in much more detail through a series of fun (and useful) projects.

It’s worth pointing out now that AI2 has some additional cool features. Firstly, it’s all in the cloud, so you don’t need any software except Chrome or Firefox (ha ha, IE is NOT supported). Secondly, if you are on WiFi, by installing a simple application on your phone you can test the app instantly by scanning a QR code from the computer screen. If you have ever created an app with Eclipse and used the emulator for testing, you will be delighted by how much simpler and quicker this process is. It’s super cool – really!

The conference app is just a listview of names and session times that are clickable. If an item is selected, a photo and brief bio pop up for that session. No fancy formatting at this stage, anyway.

1. Go to the AI2 page and login with your google account

The first screen will recommend you install some additional applications to allow easy testing on your phone. By far the simplest is to choose option one, as long as your computer and phone share a wifi network. For the rest of this guide I’ll assume this is how you will test your app.


2. On the menu go to “Projects” – “Start new project” and call it something informative

After a few seconds the designer window will appear. The designer page allows you to lay out all of the elements you require for your app. The left-hand “palette” contains elements for layout as well as user-interaction widgets. These elements can be dragged onto the middle view pane or, if they are an invisible element (a database, for example), to the bottom of the viewer screen. The “components” pane allows you to customise each element, and finally the properties pane lets you modify size, layout, and text aspects.


3. Re-title

Just to see how simple things can get, let’s give the app a more informative title. All we need to do is change the “title text” in the properties pane.

4. Add the list-view

OK, the meat and potatoes of the app: the listview will be populated with speaker names and times that are clickable. Just drag the listview icon from the palette window to the viewer.


 5. Adjust the placement and re-sizing to suit your tastes

We can now adjust the listview properties, such as the width and height so that this element fills the available screen space.

6. Go to the blocks page and start creating the logic for your app

Now the nuts and bolts (and the cool part). We have to populate our listview with the speaker names when the app is opened, as well as specify some behaviour when the user clicks on the element in the listview. For this work we will need to use the code blocks on the left hand side of the “blocks” screen.


7. Initialise by populating the list-view with strings once the screen is loaded

Clicking on the “Screen1” icon will bring up all the blocks associated with it. We want to drag the “when Screen1 initializes” block to the work area. This “block” of code will execute when the app opens. Notice the gap in the middle; this is where we place the “do” part. In this case, what we want to “do” is populate the listview with text for the speaker names.


To do this we click on the listview block and select the “listview elements” block; notice how this block fits nicely inside the previous block. Notice also the open, jigsaw-like parts at the end of this new block; this is where we add our new list that will hold the speaker string elements. I hope you notice this “feels” like coding, minus all the typing.


Clicking on the list block allows us to make a new list; we click this into our listview, and now we populate the list with the items that will be displayed.


So it should look like this (below). The little blue gear lets us increase the number of slots for elements in our list. For this app we will have four speakers for this session, so just drag the item block into the “list” block twice to create two more spaces.


To actually populate the list we need some text. So we grab the empty text fields and drag them to the open slots on the list.


By typing between the quotes we can add our custom text. By right-clicking on this first text element we can duplicate it for the remaining three spaces (or we could just drag three more empty elements across).


We modify each element to be a speaker name. OK, now when the app opens, the listview will be populated by the string elements found in the list. Next we need to specify some behaviour for when the user clicks on a speaker’s name.


At this stage our basic app should be “working”, in that the listview should display, although nothing will happen when we click on each item. This is where the simple testing functions of AI2 come to the fore. Just click on “build” and then “provide QR code for apk”.


Once the QR code pops up, just open the App Inventor 2 companion app and scan the code. It’s bloody brilliant!


Now we create an action for when an element (speaker) in the list is selected by the user. As this is a listview action, we click back on the listview icon and grab the “afterPicking” block; that’s logical, don’t you think?

One thing you might notice about these blocks, if you have coded before, is that they are very much like a normal object that we would specify using code syntax; it’s just GUI based.

Our after-picking behaviour will be to open a new screen that contains a photo of the speaker and a brief bio, so we need to set up a new screen to contain this new information. I’m being really lazy here and just going to hard-wire the page so that only one of the listview items actually works as expected (mine, of course). However, in real life you would set up some logic that asks for the index of the item that was clicked, and uses it to send the user to the correct screen.
So we have set up the method to open the new screen (above); now we need the screen itself. We’ll create the “DaveScreen” (as specified by another purple text element block) that will contain my mock bio and a great photo. To do this, just click the “add screen” button at the top of the screen and rename it.

Back in the designer we can populate this new screen with a photo and some text. Using a vertical alignment is simple: just drag it onto the window and specify its attributes in the right-hand panel. Now drag an image icon onto the panel, and by clicking on “image1” in the components pane we can upload an image and resize it using the properties. You really can’t get much easier than that, can you!


Wow, what a good-looking bloke! Finally, we create a label and populate it with text in the blocks screen. We could do this in the properties section of the designer page, but since I’m feeling guilty for hard-coding the listview, I thought I would use the blocks to set the text.


That’s it. The app should provide a list of speakers, and when you click on my name (any name actually, due to the hard-wiring) the new page should open with my bio and a photo.

You can download the files associated with the app from here. These include the APKs that can be installed on any Android phone, as well as the *.aia files that can be imported into AI2 if you want to play with the code. I’ve included a slightly improved (although still broken) version that replaces the hard-wired listview actions with calls to specific speaker pages (which don’t exist, but could easily be made based on the information above).

Install Arch Linux on virtualbox – the nuts and bolts (pt1)


Installing Arch is incredibly satisfying (maybe that’s code for frustrating), as it really does introduce you to the flexibility offered by the modular nature of Linux. And with virtualisation software like virtualbox, we don’t have to worry about turning our computer into a fancy doorstop while we madly google a solution to that frozen black screen.

With the excellent Arch beginners’ guide installation wiki, things really are not that difficult. However, there certainly are some gotchas (hair pulling), especially when installing inside a VM. Hopefully this step-by-step guide will help someone out.

Step 1. Download the latest Arch ISO; remember to use a torrent to save the Arch servers’ bandwidth (and it’s a good FU to the MPAA).

Step 2. While you’re waiting, let’s set up the VM. Open up virtualbox and click on ‘new’.


Typing Arch into the name box should populate the other options.


Set the memory to something sensible (1 GB); remember you need to save some for the host, so stay away from the red section of the sliding bar.


We want to create a virtual hard drive now (the 8gb default is probably a little too frugal).


Select the top option on the next screen (VDI), then allow it to be dynamically allocated. Even though the drive will allocate space dynamically, we still need to set the maximum size. At 8 GB (the default on my machine) we would probably run out of space pretty quickly in real-world desktop use, so let’s make it something sensible. An alternative would be to do a minimal base install and then hook up a shared folder on the host and keep data there (more on that later).


Press “Create” and now we can set up the options on our new virtual machine. On the [general][advanced] tab, set up the bi-directional clipboard and file sharing.















Also tick the “show at top of screen” option or you’ll be driven crazy by the drop-up menu destroying your life! On the system settings, change the processors to suit your system. Display settings: might as well enable 3D. Now, in the storage settings, we can virtually stick our downloaded ISO into an imaginary CD drive.


Click the ‘add CD’ button and point the file option to your freshly downloaded Arch ISO. This will allow the VM to boot from the ISO rather than, well, nothing! Finally, let’s set up a shared folder with the host. Select the folder path and click the auto-mount checkbox.


Now we are ready. Click “start” and we should get a boot screen; press enter (boot into Arch) and after a few seconds we should have a flashing cursor.


Bingo! Now I’m pretty much following along with the beginners guide.

Step 3 – Terminal action!

Firstly, I’m assuming you are using a standard keyboard layout (if not, follow the guide). Secondly, let’s check that the internet is working. A great thing about using a VM is that we don’t have to fiddle with wireless (it can be a pain), as the host just provides the link as if it were wired (remember, for this to work the host must be connected to the internet). Let’s check by pinging google; the “-c 3” flag just says to ping three times, and if you forget it, just [ctrl][c] to kill ping.

ping -c 3 google.com

If you get a “name or service not known” error, check the host connection (then check the beginners’ guide if that doesn’t work)! Otherwise you’ll exchange some packets with google, verifying everything is working dandy.

Now we use the “lsblk” command to check the name of the disk we are going to partition; in my case it’s “sda” (this is important). Note that I can tell because it’s 20 GB in size (remember, we set that before).


So, assuming your drive is sda, we’ll use fdisk to make our new partitions.

 fdisk /dev/sda 

fdisk will present us with some options. **NOTE** if at any time you want to back out just hit “q”!

Here I’m typing “n” for a new partition, “p” for primary, “1” for the 1st partition, then accepting the default for the start sector. Finally, “+10G” (this is all one word, it just slipped off the screen in the screenshot) specifies that this first partition takes up half our drive; we could make it smaller (“+6G”?) if you wanted more space for home. Now the home partition. Same as before really, but this time we specify “2” for the second partition and accept the defaults for the start and end sectors to take up the remainder of the drive.


Now we can preview our new partitions with “p” and, if we are happy, write the table with “w”. Note that the partitions are called “/dev/sda1” and “/dev/sda2”; make note of this for later. I’ve got plenty of RAM, so I’m not going to worry about a swap partition (-:.


Next we format our new partitions as ext4, then mount the root partition, create a mount point for home, and mount the other partition there (remember to change sdX depending on your information above).

#format the drives
mkfs.ext4 /dev/sda1
mkfs.ext4 /dev/sda2
mount /dev/sda1 /mnt
mkdir /mnt/home
mount /dev/sda2 /mnt/home

Finally lets install the base system.

pacstrap -i /mnt base base-devel

Accept the defaults and “y” to install. Take note of some of the things being installed here; it really is the base system (i.e. base programs like “which”, all those little bash utilities you just take for granted).

Now we generate the file system table (fstab).

genfstab -U -p /mnt >> /mnt/etc/fstab
#now check the table
nano /mnt/etc/fstab

This is what my fstab looks like (the $ signs at the end mean there is more text off-screen; if you scroll to the right you should see the numbers 1 for the “/” partition and 2 for the “/home” partition).


Now it’s time to set up the base system, i.e. set the locale and make the internet persistent. For this we will be entering a change root (chroot), which is a special environment (read about it here). Then we will change two files that contain information on our location and character set; the guide suggests we stick to UTF-8. In my case I’m in New Zealand, so I’ll uncomment that line in the file (note you can search in nano using [ctrl][w]). You can enable multiple languages here by uncommenting them, which would be handy if you are working on a system with multiple users, or you like to curse in foreign languages!

arch-chroot /mnt /bin/bash
nano /etc/locale.gen
##in the nano editor##
#en_NG ISO-8859-1
en_NZ.UTF-8 UTF-8
#en_NZ ISO-8859-1

Save this file with [ctrl][o] and exit with [ctrl][x], then run “locale-gen” to generate the locale; it should report your chosen setting.

locale-gen


Now that we have enabled the language, we need to set the system-wide default in a new file called /etc/locale.conf. In my case the command is below, but if you were in the USA you would probably use “echo LANG=en_US.UTF-8 > /etc/locale.conf”. All we are doing is echoing some text to a new file. Then we just export that setting (substitute your locale).

echo LANG=en_NZ.UTF-8 > /etc/locale.conf
export LANG=en_NZ.UTF-8

The timezone and sub-zone files are in folders of the form “/usr/share/zoneinfo/Zone/SubZone”, with “Zone” and “SubZone” replaced by your region. For example, if you lived in Rochester NY (US Eastern time), the folder you would point to would be “/usr/share/zoneinfo/US/Eastern”. You can see below how I use “ls” to list the directories, then I create a symlink to “/etc/localtime” (remember, you probably have a sub-zone, so your directory path will be one level longer, as shown for the Rochester NY example).


ln -s /usr/share/zoneinfo/NZ /etc/localtime

Now we set the hardware clock to UTC (Coordinated Universal Time).

hwclock --systohc --utc

Then we set our hostname (the name that will be seen on a network) to whatever we like (one word). First we create a file called “/etc/hostname” containing the name, then we use nano to edit a second file (as shown). Note that “DavesArch” is a [tab] after the “localhost” string on the relevant lines.

echo DavesArch > /etc/hostname
nano /etc/hosts
##in the nano editor!##
# /etc/hosts: static lookup table for host names

127.0.0.1 localhost.localdomain localhost DavesArch
::1 localhost.localdomain localhost DavesArch
# End of file

We are nearly there! Let’s set up the network before configuring grub and setting a root password. For the network we will use netctl. We can copy an example file from “/etc/netctl/examples” to the new filename “/etc/netctl/my_network”; the one to use is pretty obvious, since we have a virtual wired connection via our host.

cp /etc/netctl/examples/ethernet-dhcp /etc/netctl/my_network

Next we edit this file, replacing the interface name on the “Interface=eth0” line with the one shown by the “ip a” command (see below).


Now just use nano to edit the file we just created and replace “eth0” with our interface (in my case “enp0s3”) from the ip a output.


netctl enable my_network

We will quickly set a root password with the “passwd” command; make it hard to guess and easy to remember (ha ha).

passwd


Now to setup the grub bootloader.

pacman -S grub
grub-install --target=i386-pc --recheck /dev/sda
grub-mkconfig -o /boot/grub/grub.cfg


umount /mnt/home
umount /mnt/

We use “shutdown -h now” here (rather than rebooting) so that we can eject the virtual ISO before we start the VM again; otherwise we will be back to where we started.

shutdown -h now


This time select the ISO and click the small minus-sign icon at the bottom of this window to delete the ISO drive (otherwise we will boot back into the live CD).

Once our ISO is removed we can hit the start button again on Virtualbox, and fingers crossed we’ll boot into arch!

What now? Unless you are a terminal jedi, we’re going to have to install some GUI tools. For me this was the most challenging part of the install, especially with the added complication of working in a VM. Since we are already up to 1600 words I think I’ll take a break, but to get your desktop GUI up and running, click here!


Pandas data analysis – New Zealanders and their sheep!


As an Aussie I cop a little bit of flack living in New Zealand. It helps that since I’m from the South of Australia I follow Aussie Rules footy, not this rugby shenanigans, so I don’t bother too much with the constant focus on how much better the All Blacks are relative to our poor Wallabies (Australia has won as many world cups as New Zealand – Take that!).


Kiwi’s have bloody good gum boots; they have to do more than keep the mud off! – inside joke (-;

That said, as a bit of fun I thought I would do a post on the pandas python module using data from StatsNZ. It has to be about sheep, right! The actual spreadsheet I downloaded from StatsNZ and used here is here.

I’ll be using pandas, ipython, and matplotlib to create some graphs showing the decline of sheep in NZ. In the next post I’ll try to work out why (yes, I know you could find this out in 3 seconds with google, but that’s no fun).

First we will read the file using pandas and make a dataframe. Pandas can import all kinds of file formats; with excel you need to include a sheet name (that little tab at the bottom of the excel sheet).

import pandas as pd
import matplotlib.pyplot as plt
#sheetname required for xlsx
data_file = pd.read_excel('livestock1.xlsx',sheetname='livestock')

#make a dataframe and peek at the first few rows
data = pd.DataFrame(data_file)
data.head()
                                                 1994          2002      2003  \
Total beef cattle                             5047848       4491281   4626617
Calves born alive to beef heifers/cows        1262522       1083485   1079334
Total dairy cattle                            3839184       5161589   5101603
Calves born alive to dairy heifers/cows       2455975       3225238   3115897
Total sheep                              4.946605e+07  3.957184e+07  39552113
                                             2010      2011      2012
Total beef cattle                         3948520   3846414   3734412
Calves born alive to beef heifers/cows     901258    901375    827749
Total dairy cattle                        5915452   6174503   6445681
Calves born alive to dairy heifers/cows   3640914   3884257   3879543
Total sheep                              32562612  31132329  31262715

So we made a dataframe; the head method acts like bash head in that it shows the start of the frame rather than the whole thing. Currently the years are columns and the stock types are rows. Let’s flip the table, which is super easy!

 #we really want the dates as the index
data = data.T
     Total beef cattle Calves born alive to beef heifers/cows  \
1994           5047848                                1262522
2002           4491281                                1083485
2003           4626617                                1079334
2004           4447400                                1013893
2005           4423626                                1018730

Now that we have the years as rows, actually an index, it will be much easier to do our plotting. But we don’t need every column, so let’s pull out a subset, clean up the missing values, and convert to floats.

#the column names are pretty long, lets fix that now
Out[13]: Index([u'Total beef cattle', u'Calves born alive to beef heifers/cows', u'Total dairy cattle', u'Calves born alive to dairy heifers/cows', u'Total sheep', u'Total lambs marked and/or tailed', u'Total deer', u'Fawns weaned', u'Total pigs', u'Piglets weaned', u'Total horses'], dtype=object)
sub_data = data.loc[:,['Total beef cattle','Total dairy cattle','Total sheep','Total deer',
'Total pigs','Total horses']]
sub_data = sub_data.replace('..',0)#replace .. with 0
sub_data = sub_data.astype('float')
      Total beef cattle  Total dairy cattle  Total sheep  Total deer  \
1994            5047848             3839184     49466054     1231109
2002            4491281             5161589     39571837     1647938
2003            4626617             5101603     39552113     1689444
2004            4447400             5152492     39271137     1756888
2005            4423626             5087176     39879668     1705084
2006            4439136             5169557     40081594     1586918
2007            4393617             5260850     38460477     1396023
2008            4136872             5578440     34087864     1223324
2009            4100718             5860776     32383589     1145858
2010            3948520             5915452     32562612     1122695
2011            3846414             6174503     31132329     1088533
2012            3734412             6445681     31262715     1060694
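The replace('..',0) step above matters because StatsNZ marks missing values with “..”, which forces the whole column to be read as text. A tiny self-contained demo of the same clean-up (toy numbers, not the livestock frame):

```python
import pandas as pd

# a column that came in as strings because of the '..' placeholders
s = pd.Series(["100", "..", "250"])

# swap the placeholder for 0, then convert the whole column to floats
cleaned = s.replace("..", 0).astype("float")
# cleaned is now [100.0, 0.0, 250.0] and ready for arithmetic
```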

Great! Now, to make the plot easier to read, let’s divide the dataframe by 1 million.

#now divide by a million
sub_data = sub_data.divide(1000000)
#first plot everything
sub_data.plot()
plt.ylabel('Total stock (millions)')
plt.title('NZ farm stock')

Yes, that is correct: back in the 90s there were 50 million sheep in NZ, not bad for a country with a population of ~3 million people. Baa. But their numbers have been in serious decline since then, replaced by their bigger brothers, the cows.
Poor sheepies!

Let’s face it, NZ is all cows and sheep, so let’s just look at that data.

#lets just plot cows and sheep, that being the first 3 columns
cow_sheep = sub_data.ix[:,[0,1,2]]
cow_sheep.plot(legend=True,title="Numbers of cattle and sheep in NZ")
plt.ylabel('Total stock (millions)')

The data has meat cows and milk cows as separate columns; it’s easy to combine them.

#a milk cow and meat cow are still cattle
cow_sheep['Total cows'] = cow_sheep.ix[:,0] +  cow_sheep.ix[:,1]
cow_sheep.ix[:,2:].plot(legend=True,title="Total cattle and sheep in NZ")
plt.ylabel('Total stock (millions)')
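Adding two columns element-wise like this is plain pandas arithmetic; here is a minimal standalone version of the same trick (toy numbers, not the livestock frame):

```python
import pandas as pd

# toy frame: cattle numbers in millions for two years
df = pd.DataFrame({"beef": [5.0, 3.7], "dairy": [3.8, 6.4]}, index=[1994, 2012])

# a new column built from the row-wise sum of the other two
df["Total cows"] = df["beef"] + df["dairy"]
```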

So the country is cowifying, mostly dairy replacing beef cattle. Although the increase in total cattle looks slight, dairy numbers alone are up nearly 70% on 1994. That’s a lot of grass, cow shit, and methane!

Now let’s look at the change in the numbers of each stock type since 1994 (the start of our data). We do this by passing the data frame to a small function that compares each year against the 1994 row. Pandas broadcasts the calculation across the whole frame in the back end, nice.

def percent(dataframe,d1):
    '''calculate percent change relative to the first row (1994)'''
    a = 100*((dataframe.ix[0,:]-d1)/dataframe.ix[0,:])
    return 0-a

#pass the entire data frame to this function
percent_data = percent(sub_data,sub_data.ix[0:,:])
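For what it’s worth, the same baseline comparison can be written directly with DataFrame arithmetic, since pandas broadcasts a row across the frame. A sketch with just the sheep numbers (in millions):

```python
import pandas as pd

# total sheep in millions for three of the years in the data
df = pd.DataFrame({"Total sheep": [49.466054, 39.571837, 31.262715]},
                  index=[1994, 2002, 2012])

# subtracting df.iloc[0] broadcasts the 1994 baseline row across every year
pct = 100 * (df - df.iloc[0]) / df.iloc[0]
# pct for 2002 is about -20.0, matching the percent() output
```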

      Total beef cattle  Total dairy cattle  Total sheep  Total deer  \
1994           0.000000            0.000000     0.000000    0.000000   
2002         -11.025827           34.444950   -20.002034   33.858009   
2003          -8.344764           32.882482   -20.041908   37.229441   
2004         -11.895128           34.207998   -20.609926   42.707754   
2005         -12.366101           32.506699   -19.379727   38.499840   

      Total pigs  Total horses  
1994    0.000000      0.000000  
2002  -19.100637     11.811093  
2003  -10.766476     18.504488  
2004   -8.072078     13.376472  
2005  -19.230733   -100.000000  
#figure 4
percent_data.plot(kind='barh')
plt.title('Percent change in stock in NZ since 1994')
plt.xlabel('Percent change since 1994')

I really want the graph the other way around, so let’s re-index. Also, let’s get rid of the nags; they look funny because the frame was missing some data. More proof that Phar Lap was an Australian horse.

horseless = sub_data.iloc[:, :-1]
horseless_per = percent(horseless, horseless.iloc[0:, :])
#flip the axis
horseless_per = horseless_per.reindex(index=data.index[::-1])
plt.title('Percent change in stock in NZ since 1994')
plt.xlabel('Percent change since 1994')


Finally, it's silly keeping the 1994 data since it's all zero, but it was a nice sanity check that the % function was working correctly. Let's get rid of it now by just plotting part of the frame.

plt.title('Percent change in stock in NZ since 1994')
plt.xlabel('Percent change since 1994')

The 'barh' option is just a horizontal bar graph.
So it looks like there was a venison craze in the early 00s, but mostly just more and more dairy.

#save the file

Ok, so although there are still a LOT of sheep in NZ, it really is the country of the cow now. What might be driving that? Let's look at some commodity prices in the next post!

How to use DESeq2 to analyse RNAseq data


News: My colleagues at NZGL have developed an open source R based GUI for generating plots using cuffdiff data. One for DESeq2 will be available soon! Feel free to check it out and get back to us with any suggestions. It only requires RStudio.

There is only one thing better than DESeq and that's DESeq2! The updated version is out and I'm keen to give it a whirl. Like with my old DESeq post, once again I am really just following the excellent DESeq2 manual, thanks again to the authors for the great documentation!

Just a quick warning that I haven't tested this workflow extensively, so let me know if things don't work. Also, DESeq2 is new, so some of the functions might have changed; if you have problems make sure you check what version you are using versus what I used (see sessionInfo below).

The files I used for this are found here if you want to give them a go or follow along. I've already got a blow-by-blow run through on how to use DESeq, and much of that applies to the new package, so here I'll just concentrate on some DESeq2-specific stuff as well as all the graphs. I've included some side by side comparisons between DESeq and DESeq2.

Installing is easy via Bioconductor (this was the biocLite route at the time):

source("http://bioconductor.org/biocLite.R")
biocLite("DESeq2")

One important change is that now you can directly create the count table using raw htseq-count output, and I'll show you how to do that below. Remember htseq-count creates a single file for each replicate of a condition (based on a SAM alignment file), so in my case with two conditions (control and treatment) and 3 reps each, that makes a total of 6 files. I called these files treated1.txt, treated2.txt, treated3.txt, untreated1.txt, untreated2.txt and untreated3.txt so that I can use grep to import them (explained below). Previously I would use a script to merge them all together; now DESeq2 allows you to import these files directly.

Below I set the directory where the files are located and use grep to catch all the files based on the string match "treated" that they all share (be careful it doesn't catch anything else); these are stored in sampleFiles. If you like you could just directly specify the files using sampleFiles<-c("treated1.txt",..etc.."untreated3.txt"). Importantly, we need to set up sampleCondition in the same order as the file names in sampleFiles (so DESeq2 knows what each file represents). Finally we make a data frame that becomes a DESeq table. Note my "#" lines are just non-functional comment lines that I am using to show the output from the screen!
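For readers who think in Python, here is a rough sketch of what that file-collection step does (the directory listing is simulated; in R it is a one-liner with grep on list.files):

```python
# simulated directory listing; "notes.txt" shows why the match must be checked
listing = ["treated1.txt", "treated2.txt", "treated3.txt",
           "untreated1.txt", "untreated2.txt", "untreated3.txt",
           "notes.txt"]

# keep only the count files, in a stable (sorted) order
sample_files = sorted(f for f in listing if "treated" in f)

# derive the condition for each file, in the same order as sample_files
conditions = ["untreated" if f.startswith("untreated") else "treated"
              for f in sample_files]
```

The point of deriving the conditions from the sorted file names is exactly the one made above: the condition vector must line up element-for-element with the file vector.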

#use grep to search for the 'treated' part of filename to collect files
sampleFiles <- grep('treated', list.files(directory), value=TRUE)
# sampleFiles
#[1] 'treated1.txt'   'treated2.txt' 'treated3.txt'  'untreated1.txt'
#[5] 'untreated2.txt' 'untreated3.txt'
sampleCondition <- c('treated','treated','treated','untreated','untreated','untreated')

sampleTable<-data.frame(sampleName=sampleFiles, fileName=sampleFiles, condition=sampleCondition)
#     sampleName       fileName condition
#1   treated1.txt   treated1.txt   treated
#2   treated2.txt   treated2.txt   treated
#3   treated3.txt   treated3.txt   treated
#4 untreated1.txt untreated1.txt untreated
#5 untreated2.txt untreated2.txt untreated
#6 untreated3.txt untreated3.txt untreated

ddsHTSeq<-DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory, design=~condition)
#class: DESeqDataSet
#dim: 7921 6
#assays(1): counts
#rownames(7921): seq_1 seq_2 ... seq_7920 seq_7921
#rowData metadata column names(0):
#colnames(6): treated1.txt treated2.txt ... untreated2.txt
#  untreated3.txt
#colData names(1): condition
colData(ddsHTSeq)$condition<-factor(colData(ddsHTSeq)$condition, levels=c('untreated','treated'))

The levels in colData are important because they are used in the log calculations; it makes sense to set untreated or control first so that the direction of the log fold changes doesn't confuse everyone (typically we do comparisons to the control)! Now for the guts of the DESeq2 analysis.

dds <- DESeq(ddsHTSeq)
res <- results(dds)
res <- res[order(res$padj),]
head(res)
#DataFrame with 6 rows and 6 columns
#          baseMean log2FoldChange      lfcSE      stat       pvalue
#seq_3146  997.5419      0.7894523 0.08297687  9.514125 1.832488e-21
#seq_1802  746.3972      0.5685789 0.08533961  6.662544 2.691282e-11
#seq_2146  406.1395      0.9424543 0.14108613  6.679993 2.389544e-11
#seq_7548  466.5453      0.6036683 0.10178158  5.931017 3.010637e-09
#seq_3240 1569.6556      0.6132326 0.11145966  5.501835 3.758596e-08
#seq_958   149.6504      0.7398193 0.14154162  5.226868 1.724055e-07
#                 padj
#seq_3146 1.299050e-17
#seq_1802 6.359498e-08
#seq_2146 6.359498e-08
#seq_7548 5.335601e-06
#seq_3240 5.328937e-05
#seq_958  2.036971e-04

Looking good, time for some plots. BTW I’m using the same dataset I used for the original DESeq blog post (links to data and that blog at top of page).

MAPlot of DESeq1 (left) and DESeq2 (right) for the same data

As expected for this dataset there are not many differentially expressed genes (red). There certainly is a difference in the level of scatter between DESeq and DESeq2 on this dataset. Also note the good reduction in scatter for low-count reads (left-hand side of the graph) in DESeq2 versus the original version. DESeq tends to be a conservative approach, which I like, and with that in mind the update uses a test called Cook's distance to remove outliers from the analysis. Cook's distance looks at how much each sample contributes to a gene's overall fold change, and samples that cause extreme effects are flagged. To be specific, a gene will not be analysed for differential expression if one of its samples is considered an outlier. The idea is that we only want to see DE genes that show a consistent pattern. The drawback of this approach is a loss of power, as some genes that are truly DE will be removed before the statistical tests are performed.

We can save the table, and also print out some information on what the columns mean.

mcols(res, use.names=TRUE)
#DataFrame with 6 rows and 2 columns
#                       type

#baseMean       intermediate
#log2FoldChange      results
#lfcSE               results
#stat                results
#pvalue              results
#padj                results

#baseMean                        the base mean over all rows
#log2FoldChange log2 fold change (MAP): condition treated vs untreated
#lfcSE           standard error: condition treated vs untreated
#stat            Wald statistic: condition treated vs untreated
#pvalue          Wald test p-value: condition treated vs untreated
#padj            BH adjusted p-values
#write the table to a csv file (the file name is up to you)
write.csv(, file='deseq2_results.csv')

BTW take this with a pinch of salt because it's only a simple sample dataset, but the difference in gene counts is that DESeq only found a single differentially expressed gene (at padj 0.1), whilst DESeq2 called this same gene plus 23 others. Also, reducing the multiple testing correction cut-off to 0.05 only removes 3 genes from the list with DESeq2.

Now we want to transform the raw, discretely distributed counts so that we can do clustering. (Note: when you expect a large treatment effect you should actually set blind=FALSE; see the DESeq2 vignette.)

rld<- rlogTransformation(dds, blind=TRUE)
vsd<-varianceStabilizingTransformation(dds, blind=TRUE)

Here we choose blind so that the condition settings do not influence the outcome, i.e. we want to see if the samples cluster based purely on the individual datasets, in an unbiased way. According to the documentation, the rlogTransformation method that converts counts to log2 values is apparently better than the old varianceStabilizingTransformation method when the size factors vary by large amounts.
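A quick toy example (made-up counts) of why a log-type transformation helps before clustering: on the raw scale a high-count gene dominates with a huge standard deviation across replicates, while after log2(n + 1) the low- and high-count genes have comparable spread:

```python
import numpy as np

# made-up replicate counts for a low- and a high-expression gene
gene_low = np.array([8.0, 10.0, 12.0])
gene_high = np.array([800.0, 1000.0, 1200.0])

sd_raw_low, sd_raw_high = gene_low.std(), gene_high.std()
sd_log_low = np.log2(gene_low + 1).std()
sd_log_high = np.log2(gene_high + 1).std()
# raw sds differ by ~100x; the log2 sds are of the same order of magnitude
```

The rlog and VST methods are more sophisticated than a shifted log (particularly at low counts, as the meanSdPlot comparison below shows), but this is the basic idea.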

The code and plot below shows the [nice] effect of the transformation.

par(mai=ifelse(1:4 <= 2, par('mai'), 0))
px     <- counts(dds)[,1] / sizeFactors(dds)[1]
ord    <- order(px)
ord    <- ord[px[ord]<150]
ord    <- ord[seq(1, length(ord), length=50)]
last   <- ord[length(ord)]
vstcol <- c('blue', 'black')
matplot(px[ord], cbind(assay(vsd)[, 1], log2(px))[ord, ], type='l', lty=1, col=vstcol, xlab='n', ylab='f(n)')
legend('bottomright', legend = c(expression('variance stabilizing transformation'), expression(log[2](n/s[1]))), fill=vstcol)
Graph showing the variance stabilizing transformation for sample 1 (blue) and the transformation f(n) = log2(n/s1) (black)

The y axis is the square root of the variance over the mean for all samples, so this will naturally include variance due to the treatment. The goal here is to flatten the curve so that there is consistent variance across the read counts, and that is what we got.

notAllZero <- (rowSums(counts(dds))>0)
meanSdPlot(log2(counts(dds,normalized=TRUE)[notAllZero,] + 1), ylim = c(0,2.5))
meanSdPlot(assay(rld[notAllZero,]), ylim = c(0,2.5))
meanSdPlot(assay(vsd[notAllZero,]), ylim = c(0,2.5))


This interesting plot shows the standard deviation across all samples against the mean counts using three different methods of transformation. With this data you can see that the shifted logarithm method (left) does pretty badly for low-count genes, with both the regularized log (centre) and DESeq's variance stabilizing transformation (right) doing a much better job across the entire dynamic range of counts.

For some reason, everyone loves a good heat map!

select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:30]
hmcol<- colorRampPalette(brewer.pal(9, 'GnBu'))(100)
heatmap.2(counts(dds,normalized=TRUE)[select,], col=hmcol,
          Rowv=FALSE, Colv=FALSE, scale='none',
          dendrogram='none', trace='none', margin=c(10, 6))
heatmap.2(assay(rld)[select,], col=hmcol,
          Rowv=FALSE, Colv=FALSE, scale='none',
          dendrogram='none', trace='none', margin=c(10, 6))
heatmap.2(assay(vsd)[select,], col=hmcol,
          Rowv=FALSE, Colv=FALSE, scale='none',
          dendrogram='none', trace='none', margin=c(10, 6))


The above shows heatmaps for the 30 most highly expressed genes (not necessarily those with the biggest fold change). The data are raw counts (left), regularized log transformed (centre) and variance stabilizing transformed (right), and you can clearly see the effect the transformation has in shrinking the variance, so we don't get the squashed effect shown in the left-hand graph.

Now we calculate sample to sample distances so we can make a dendrogram to look at the clustering of samples.

distsRL <- dist(t(assay(rld)))
mat<- as.matrix(distsRL)
rownames(mat) <- colnames(mat) <- with(colData(dds),
                                       paste(condition, sampleFiles, sep=' : '))

#updated in latest vignette (See comment by Michael Love)
#this line was incorrect
#heatmap.2(mat, trace='none', col = rev(hmcol), margin=c(16, 16))
#From the Apr 2015 vignette
hc <- hclust(distsRL)
heatmap.2(mat, Rowv=as.dendrogram(hc),
          symm=TRUE, trace='none',
          col=rev(hmcol), margin=c(13, 13))


Although this result looks terrible, as we would expect samples to cluster by treatment, in this case I'm actually happy with this result. Why? Well, this was actually a control experiment to show that the slightly different (and unavoidable) experimental setup for the different samples wasn't responsible for the observed expression differences, so seeing little treatment effect makes me happy. Remember, always try to do what Fisher tells us: replicate, randomise, block.

Similarly, the PCA.

print(plotPCA(rld, intgroup=c('condition')))

I hope yours looks better!

Previously we talked about the Cook's distance treatment of outliers, where a gene is thrown away if one of its samples is deemed an outlier. You may not want this to happen, so in DESeq2 we can take a different approach by replacing the outlier value with an estimated value as predicted by the distribution, using the trimmed mean approach. DESeq2 recommends you only do this if you have several replicates per treatment, and indeed it automatically uses this feature if you have 7 or more replicates in your data table.

ddsClean <- replaceOutliersWithTrimmedMean(dds)
ddsClean <- DESeq(ddsClean)
tab <- table(initial = results(dds)$padj < .1,
cleaned = results(ddsClean)$padj < .1)
resClean <- results(ddsClean)

In my case it didn’t really make much difference.
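The trimmed mean used in that replacement is easy to sketch (toy counts; the 20% trim fraction here is an assumption for illustration, not necessarily what DESeq2 uses): drop a fraction of the most extreme values from each end before averaging, so a single wild replicate no longer drags the estimate.

```python
def trimmed_mean(values, trim=0.2):
    """Mean after dropping a `trim` fraction of values from each end."""
    v = sorted(values)
    k = int(len(v) * trim)  # how many values to drop per side
    kept = v[k:len(v) - k] if k > 0 else v
    return sum(kept) / len(kept)

counts = [98, 101, 99, 102, 100, 950]   # one wild outlier replicate
plain = sum(counts) / len(counts)       # dragged way up by the outlier
robust = trimmed_mean(counts)           # outlier dropped before averaging
```

Here the plain mean lands above 240 while the trimmed mean stays near 100, which is why the replacement value is a sensible stand-in for the flagged sample.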

The dispersion plot shows how the estimates are shrunk from the gene-wise values (black dots) toward the fitted estimates, with the final values used in testing being the blue dots.

Dispersion plots: DESeq and DESeq2 (right)

And comparing DESeq1 and DESeq2:

Dispersion plots: DESeq1 (left) and DESeq2 (right)

Now independent filtering is used to remove tests that have little chance of passing, reducing the number of tests we have to perform and thus the penalty from multiple testing correction (false discovery). You can see how many genes are rejected based on different values of alpha (FDR).

#filtering threshold
attr(res, 'filterThreshold')
#     10%
plot(attr(res,'filterNumRej'), type='b', ylab='number of rejections')


W <- res$stat
maxCooks <- apply(assays(dds)[['cooks']], 1, max)
idx <- !
plot(rank(W[idx]), maxCooks[idx], xlab='rank of Wald statistic',
     ylab="maximum Cook's distance per gene",
     ylim=c(0,5), cex=.4, col=rgb(0,0,0,.3))
m <- ncol(dds)
p <- 3
abline(h=qf(.99, p, m - p))

Plot of the maximum Cook’s distance per gene over the rank of the Wald statistics for the condition.

Here is more about independent filtering. What it shows is that genes with very low counts are unlikely to have a significant p-value due to excessive dispersion at the left side of the dynamic range of counts. The y-axis here is -log10, so bigger numbers mean smaller p-values (better).

plot(res$baseMean+1, -log10(res$pvalue),
log='x', xlab='mean of normalized counts',
cex=.4, col=rgb(0,0,0,.3))


All those dots on the left-hand side of the graph represent failed tests due to very low count values, so we can get rid of them up front to reduce the number of tests and hence the multiple testing penalty.

And again, you can see that only a few small (or no) p-values are discarded by the filtering. NOTE: You might only see blue lines [I’ve broken something??]

use <- res$baseMean > attr(res,'filterThreshold')
h1 <- hist(res$pvalue[!use], breaks=0:50/50, plot=FALSE)
h2 <- hist(res$pvalue[use], breaks=0:50/50, plot=FALSE)
colori <- c('do not pass'='khaki', 'pass'='powderblue')
barplot(height = rbind(h1$counts, h2$counts), beside = FALSE,
col = colori, space = 0, main = '', ylab='frequency')
text(x = c(0, length(h1$counts)), y = 0, label = paste(c(0,1)),
adj = c(0.5,1.7), xpd=NA)
legend('topright', fill=rev(colori), legend=rev(names(colori)))

The graph on the left ranks the p-values from smallest to biggest (x-axis) and plots them. The black line shows the actual p-values (remember, only about 23 genes had a p-value lower than 0.05). The red line has a slope equal to the false discovery rate (0.1) divided by the number of tests. The idea is that the FDR is controlled at the 0.1 level for all tests that fall to the left of the right-most intersection of the black and red lines.
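That rule is the Benjamini-Hochberg procedure, and it can be sketched directly (hypothetical p-values): sort the p-values, find the largest rank i with p(i) <= alpha * i / m, and reject every test up to that rank:

```python
def bh_reject(pvalues, alpha=0.1):
    """Benjamini-Hochberg: indices of tests rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose sorted p-value sits under the alpha*rank/m line
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            k = rank
    return set(order[:k])

pvals = [0.001, 0.008, 0.039, 0.041, 0.52, 0.9]
rejected = bh_reject(pvals, alpha=0.1)  # the four smallest p-values pass
```

Note that 0.041 is rejected even though it exceeds its own line value at rank 3; BH rejects everything up to the largest rank that passes, which is exactly the "left of the right-most intersection" rule in the plot.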

The code for the right hand plot above.

resFilt <- res[use & !$pvalue),]
orderInPlot <- order(resFilt$pvalue)
showInPlot <- (resFilt$pvalue[orderInPlot] <= 0.08)
alpha <- 0.1

plot(seq(along=which(showInPlot)), resFilt$pvalue[orderInPlot][showInPlot],
pch='.', xlab = expression(rank(p[i])), ylab=expression(p[i]))
abline(a=0, b=alpha/length(resFilt$pvalue), col='red3', lwd=2)
sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)


attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] gplots_2.12.1 RColorBrewer_1.0-5 BiocInstaller_1.12.0
[4] DESeq2_1.2.8 RcppArmadillo_0.3.920.3 Rcpp_0.10.6
[7] GenomicRanges_1.14.3 XVector_0.2.0 IRanges_1.20.6
[10] BiocGenerics_0.8.0

loaded via a namespace (and not attached):
[1] affy_1.40.0 affyio_1.30.0 annotate_1.40.0
[4] AnnotationDbi_1.24.0 Biobase_2.22.0 bitops_1.0-6
[7] caTools_1.16 DBI_0.2-7 DESeq_1.14.0
[10] gdata_2.13.2 genefilter_1.44.0 geneplotter_1.40.0
[13] grid_3.0.2 gtools_3.1.1 KernSmooth_2.23-10
[16] lattice_0.20-24 limma_3.18.5 locfit_1.5-9.1
[19] preprocessCore_1.24.0 RSQLite_0.11.4 splines_3.0.2
[22] stats4_3.0.2 survival_2.37-4 tools_3.0.2
[25] vsn_3.30.0 XML_3.98-1.1 xtable_1.7-1
[28] zlibb

[Updated July '14: to fix errors with the distance matrix plot, Cook's distance, and the Benjamini-Hochberg multiple testing adjustment procedure (props to Stefan for pointing them out)]

Setting up a Lubuntu virtual machine with virtual box [and get guest additions working]


BTW – Learn more about your linux system; Try my how-to on Arch Linux via virtualbox page

Disclaimer – this is totally safe, but please backup your computers before trying it, I take no responsibility for anything that might happen

Updated 2015 and tested – let me know if something is broken

If you haven't tried Linux then you are really missing out. Although I run Windows in a virtual machine, I'm working my way to weaning myself off that dark beast (my colleagues doing the same would certainly hasten the split).

Although I use Xubuntu on my work machine, Lubuntu is another favourite of mine due to its small footprint and minimalist approach. Both are great because they don’t have that abomination that is Unity now shipping with Ubuntu.

For a Python class that I am developing I wanted to put together a how-to for installing Lubuntu in a virtual machine using VirtualBox. At work I use VMware Workstation to run Windows as a VM and it works pretty well. However, I had trouble using the free player, mostly due to our wonderful proxy creating problems with the download and install of VMware Tools (there were display problems too for some reason)! Anyway, long story short, I settled on VirtualBox (damn you Oracle for stealing our America's Cup).

In this example I’m installing a linux vm on a windows 7 host.

(Note to self: and 8080 proxy)

  1. Get virtual box from here and install it on your host system.

  2. Get Lubuntu from here; if you can, use the torrent to save the foundation some $

  3. Set-up your new virtual machine by starting virtual box.


First set up the name and the OS type. The name can be anything, but it's best to call it "lubuntu" (no quotes). Type is "Linux" and the version is either Ubuntu (64 bit) or Ubuntu (32 bit). Click OK or Next. [Note: most modern machines are 64 bit; use that unless you have something old, so you can access more RAM.]

Give as much RAM as you can, but don't forget to leave some for your host; 2 GB is a good start on my 8 GB system (in real life I might give it 3-4 GB). This can be changed later, so leave it as the default if worst comes to worst.




STOP! Although dynamic resizing is probably a good way to save disk space, I want speed! But for this course just use dynamic sizing, i.e. not what I have shown below!


This is only a demo; normally I would make the drive considerably bigger (>20 GB) to allow for software installs, but you can get away with 10 GB as a minimum. You can share folders with your host machine once the system is up and running, and that is where I would store all my important data. As an aside, I normally don't back up my VMs, and if you do the same then, yeah, don't store anything important on the virtual machine!



Click on the headings to access the different settings; it's worth looking, as the defaults are likely to make your machine run sub-optimally. OR JUST LEAVE IT FOR NOW.


Accept the defaults if this is all too confusing!


Hovering the mouse over the warning lets you know what is wrong; here it says I need to select another option if I use dual-core processors, and it also says it will fix it for me, so I didn't have to do anything. Display issues are a problem for me when using VMs, so I've given a bit of RAM to the video to hopefully minimise the chance of problems. Accept the defaults if this is too confusing.


Let's set up the shared folders with the host machine now; make it mount on startup so it acts as a removable drive once the VM is running.



Now the machine is ready to go, but first we need to "insert" the ISO image into the virtual CD drive. Click on the "Storage" settings option and select the little disk+ icon as shown below, then when prompted navigate to the ISO using "choose disk".


Click on the newly created drive and select “live CD” option.


With the ISO file sitting in the virtual drive its time to boot up our new computer!


Here we have the option to try out the system first using the live CD. This does not install anything, but lets us know if we're likely to run into compatibility issues. It's only a VM, so I'm just going to install.


Be patient here as it might not look like it is doing anything; there might even be some error messages. Sit tight and hopefully you'll see the screen below!

Hopefully the image below will load, or a new Lubuntu desktop will appear. If the desktop loads, double click the "install" icon on the desktop and you should see the image below (and next time follow instructions, i.e. click the live CD option (-; ).


Non-open codecs like mp3 etc. can be installed here, and you can download updates on the fly during the install. These options are up to you, but if you don't have Internet leave the "download updates while installing" box unchecked.


If this were a normal install (dual boot, for example) I would start to sweat at this point, while double-checking my backup drives (you have one, right?)! If it goes wrong here you can end up with a blinking cursor where once you had a computer full of data and software (backup backup backup). But... the great advantage of VMs is that it doesn't really matter! If things go wrong, it's only a virtual computer after all; delete it and just start again (that said: backup backup backup). Also, as you will be running the VM inside a host, there is no need to worry about the hassle of setting up dual booting (though it's not actually that hard to do). So let's go ahead and format our virtual drive.


Fu*k the NSA, encrypt if you can! Click continue when asked unless you want to set up more complex swap and partition options.

Good to see NZ just made it onto the map! Choose the US keyboard (yes, this is for you know who) unless you got your computer from some strange dodgy foreign online store.



The parts that matter here are your username and password (no spaces in either). Choosing NOT to log in automatically is certainly recommended from a security standpoint.


If this were a real install from a live CD you would need to eject the CD, otherwise it will just boot back into the live image again (see below).


We need to "eject" the Lubuntu startup/install disk from the virtual CD, otherwise the VM will just boot from that again [and that's not what we want]. So go to the VB settings, Storage, then right-click "remove disk from virtual drive". Now you should be good to press enter and reboot.





Hopefully the display issues are minor. Next we will install guest tools that should make things run more smoothly.

Neat trick: when you have your VM up and running, make it full screen then use the switch-desktop shortcut [ctl][alt][arrow] to change between the host and the VM. This is so much easier than minimising the VM all the time; you end up with two computers for the price of one!



Dealing with the wonderful lovely helpful joyful PROXY!

Skip this if you are not behind a proxy (nearly everyone). Open a terminal using [ctl][alt][t]. The part below is just for the proxy (insert your own username and password in place of the placeholders); ignore it if you are on a normal network.

#the fu*king proxy, ignore if you live in a normal universe 
#make a new file, copy some text into it, then send this to where apt looks for settings
echo 'Acquire::http::Proxy "http://username:password@your.proxy.address:8080/";' > ~/Desktop/99proxy
sudo cp ~/Desktop/99proxy /etc/apt/apt.conf.d

But the fun doesn’t stop there, when you get home you have to disable this option

sudo nano /etc/apt/apt.conf.d/99proxy

#to save click [ctl][o] then [ctl][x]

When nano opens comment out ie add a # to the start of the “Acquire” line.

#Acquire::http::Proxy "http://username:password@your.proxy.address:8080/";

You have to undo this when you get back to the work proxy environment again, for the love of the flying spaghetti monster!!!

Now to get copy paste working, full screen, shared folders, etc

We are going to install guest additions, which helps copy and paste between the host and the Linux system work, and also fixes display issues, shared folders, etc. VB comes with its own package for this, but the open-source community has its own version; we'll use that. Open a terminal and type (copy and paste is unlikely to work yet):

sudo apt-get install dkms build-essential linux-headers-generic virtualbox-guest-x11

Hit "Y" when prompted and let it download off the Internet.

Now reboot

sudo reboot

If the above worked you should be able to go to full screen mode [view][switch to full screen]. If that works, yipee, you're done (but see the optimisation steps below)!

———————————————-skip if things look OK——————————————-

If you are still having problems you can try this alt way

From the VirtualBox menu at the top or bottom of the screen (if the menu is hidden just put your mouse at the top or bottom and it should appear), choose "Devices" and then the "Insert Guest Additions" option (see the bottom of the Devices drop-down).


A box should pop up; click open or OK. If it says something like "problem mounting", GOOGLE IS YOUR FRIEND; I googled this problem and found something that looks like a solution.

Now! Open a terminal window using [CTL][ALT][T] and type:

cd /media/[username]/VBOX*

ie if your username is ‘fred’ it will be:

cd /media/fred/VBOX*

Type "ls" and make sure you can see a file called "". If so, then use sudo to run it:

sudo ./

Now reboot and hopefully any display issues you may have are fixed!

*************************Final optimisation*************************

Look and feel

1) USE FULL SCREEN – it drives me crazy how people leave the VM as a window; you go to click something and all of a sudden you're back in some sucky OS like Windows! Just make the VM full screen and use [right ctl][F] to graciously switch between the host and the VM.

By default the settings "options bar" is hidden (click the pin if you want to lock it in place); move the mouse to the bottom middle of the screen and it should unhide itself. Then click "view", "switch to full screen".

2) Move the settings bar to the top of the screen so you don't keep accidentally bringing it up when you go to the bottom of the screen. Go to the settings bar again, click "machine", "settings"; in the general section click "advanced" and "show at top of screen" (note this will only take effect after you shut down the VM, then close and reopen VirtualBox).


1) Copy and paste: options bar, "device" - "shared clipboard" - "bidirectional"; do the same with "drag and drop", also in the "device" section.

2) Shared folders with the host. Open a terminal and type (replace dwheeler with your username):

sudo adduser dwheeler vboxsf

This will only work after you reboot the guest OS!

The shared folder will be in "/media/XXXX"; if you type "/media/" into the file manager it should show up. Once you are there, click "bookmarks" and add the folder for quick reference.

Last update Feb 2015

How to use DESeq to analyse RNAseq data


Several packages have been written to measure gene expression differences using RNAseq data. In this post I'm going to do a run-through of the popular R based DESeq package. Basically, I'm just following along with the package's vignette, which is really well written and informative! For the test data I'm using 2 treatments, with 3 biological reps each containing 750K reads (yes, it's small).

A. Counting reads

DESeq uses a table of raw counts as the basis of DE calls. Transformed counts, such as FPKM values, will not work with this package. The top row of the table can be a descriptor; these can be imported into R to act as column labels. Gene names should be column 1 (remember R counting starts from 1 rather than 0), and each column after that should be raw counts for one biological replicate of one treatment. Note that if technical reps are performed, these should be merged by summing the two runs (they will be normalised by library size later).

Here I used BWA to align the reads to my reference and generate SAM files that are then processed by htseq-count. Note that you have to run this on each sample, so in this current example I will generate 6 SAM files, one for each combination of treatment and replicate. Also, see that I used the intersection-strict mode for htseq-count.

>bwa aln sarcop_pep_rnaseqfiltered.fa Sample_ST72-1/ST72-1.subset.fastq > ST72-1.subset.sai

>bwa samse sarcop_pep_rnaseqfiltered.fa ST72-1.subset.sai Sample_ST72-1/ST72-1.subset.fastq > ST72-1.subset.sam

>htseq-count -m intersection-strict --stranded=no ST72-1.subset.sam sarcop_pep_rnaseqfiltered.gtf > ST72-1.subset.counts

In the above, ST72-1.subset.counts is a table of counts that can be merged with the other count data before being imported into DESeq as an R data table. Here is a python script for merging; see the header of the script for usage, if that helps. For a fully detailed explanation of HTSeq, see the htseq-count documentation.
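In case that script link ever dies, here is a minimal sketch of what such a merge does (sample and gene names are made up): each htseq-count output is a two-column gene/count table, and the merged table gets one column per sample, with any gene missing from a sample filled with zero:

```python
def merge_counts(samples):
    """samples: {sample_name: {gene: count}} -> header and rows of a merged table."""
    names = sorted(samples)
    genes = sorted({g for counts in samples.values() for g in counts})
    header = ["gene"] + names
    rows = [[g] + [samples[n].get(g, 0) for n in names] for g in genes]
    return header, rows

# toy example with two samples and partially overlapping gene lists
header, rows = merge_counts({
    "ST72_1": {"gene_a": 5, "gene_b": 0},
    "UN72_1": {"gene_a": 2, "gene_c": 7},
})
```

In practice htseq-count reports every gene in the GTF for every sample, so the zero-filling is mostly a safety net; the important part is that every sample ends up as one column against a shared gene list.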

B. R DESeq package 

Enter the R environment from a terminal window opened where your data is, and load the library.

>library(DESeq)

>countsTable <- read.delim("merged_counts.txt", header=TRUE)

>rownames(countsTable) <- countsTable$gene

>countsTable <- countsTable[,-1]

This imports the DESeq library and reads the tab-delimited file created by the python script into a data table. The next two lines create row names from the gene column created by the script; if you used some other method to make the counts you need to change "gene" to whatever your gene column is called. Finally, we get rid of the column used to make the row names. My table looks like this:

> head( countsTable)

                  UN72_1 UN72_2 UN72_3 ST72_1 ST72_2 ST72_3

GWD8ZKM01ESPR2          0      0      0      0      0      0

g10671t00001            2      1      1      0      0      0

g16789t00001            0      0      0      0      0      0

Now we need to add the treatment information so that DESeq knows what data each column represents; order is obviously important.

>conds <- factor( c( "untreated", "untreated", "untreated", "treated", "treated", "treated" ) )

> cds <- newCountDataSet( countsTable, conds )

This instantiates a CountDataSet, which is the central data structure in DESeq. The next step adjusts for different library sizes among the treatments, via something called "size factors". DESeq only needs relative library sizes, so if (on average) the counts for non-DE genes in one library are twice those of a second library, its size factor should be twice as large.

 >cds <- estimateSizeFactors( cds )

 To see these size factors, do this:

 >sizeFactors( cds )

UN72_1    UN72_2    UN72_3    ST72_1    ST72_2    ST72_3

0.9805609 1.0420015 0.9164864 1.1083583 0.9599370 1.1462168

 Apparently, the way this works is that for each gene you take the geometric mean of its counts across all samples and use this as the gene's reference value. Then you divide each sample's count for that gene by the reference value, giving a list of ratios (quotients, i.e. the result of the division) for each sample. A sample's size factor is the median of its list of ratios, so there is one size factor per sample. The counts are normalised by dividing each column by its size factor; here, lets look before and after:

 > head(counts(cds))

                  UN72_1 UN72_2 UN72_3 ST72_1 ST72_2 ST72_3

GWD8ZKM01ESPR2          0      0      0      0      0      0

g10671t00001            2      1      1      0      0      0

> head(counts(cds,normalized=TRUE))

                    UN72_1    UN72_2   UN72_3 ST72_1 ST72_2 ST72_3

GWD8ZKM01ESPR2     0.000000 0.0000000 0.000000      0      0      0

g10671t00001       2.039649 0.9596916 1.091124      0      0      0

g16789t00001       0.000000 0.0000000 0.000000      0      0      0
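The median-of-ratios size-factor calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not DESeq's actual implementation; the count matrix is made up (4 genes, with the second sample "sequenced" twice as deeply):

```python
import numpy as np

# Hypothetical counts: rows are genes, columns are samples.
counts = np.array([
    [10, 20],
    [20, 40],
    [30, 60],
    [40, 80],
], dtype=float)

# Reference for each gene: the geometric mean of its counts across samples
# (the real method skips genes containing a zero; there are none here).
ref = np.exp(np.log(counts).mean(axis=1))

# A sample's size factor is the median of its gene-wise ratios to the reference.
size_factors = np.median(counts / ref[:, None], axis=0)

# Normalising divides each column by its size factor.
normalized = counts / size_factors
```

With these toy counts the second sample's size factor comes out twice the first's, and the normalised columns agree.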

The next part is key to how DESeq calculates significance in DE. Inference in DESeq relies on an estimate of the typical relationship between the data's variance (or dispersion) and their mean. The dispersion can be understood as the square of the coefficient of biological variation. The variance of a count is the sum of two components: the level of variation between replicates of each treatment, and an uncertainty term that depends on the size of the counts. This latter term accounts for the noise expected for lowly expressed genes under a Poisson distribution (see the DESeq manual for more detail).
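Written out, that two-component variance is roughly a Poisson term plus a dispersion term; a rough sketch with hypothetical numbers (DESeq's actual fitting is much more involved):

```python
# For a gene with expected normalised count mu and dispersion alpha
# (alpha = squared coefficient of biological variation, made-up values here):
mu = 100.0
alpha = 0.1   # i.e. about 32% biological coefficient of variation

# Total variance = shot noise (Poisson term) + biological variation term.
variance = mu + alpha * mu ** 2
```

For low counts the Poisson term dominates; for high counts the alpha*mu^2 term does, which is why dispersion matters most for well-expressed genes.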

>cds <- estimateDispersions( cds ) 

Now we write a little function that will plot this.

>plotDispEsts <- function( cds ){
    plot(rowMeans( counts( cds, normalized=TRUE ) ),
         fitInfo(cds)$perGeneDispEsts, pch = '.', log="xy",
         ylab="dispersion", xlab="mean of normalized counts")
    xg = 10^seq( -.5, 5, length.out=300 )
    lines( xg, fitInfo(cds)$dispFun( xg ), col="red" )
}

> jpeg("DispEsts_DESeq.jpg")

> plotDispEsts(cds)

> dev.off()


The above commands define a plotting function, pass cds to that function, and save the result in a jpg file.

The empirical (real) dispersion values (black dots) and fitted values (red line)


Remember that the level of dispersion is related to the biological variation seen in each treatment. Note this is a double log graph, ahhh my brain hurts, but look carefully at what is happening on the Y axis: as read count increases, dispersion decreases, as we would expect. However, a lot of the black dot values below the red fitted line are probably underestimates of dispersion, given the small sample sizes used (only 3 replicate values measured in this example). So, to be conservative, DESeq moves all the values below the red line up to the fitted value, BUT keeps all the empirical values above the red line, even if some of these are overestimates of dispersion; as I said, it is being conservative. Why does dispersion matter? The more dispersion, or biological variation, the bigger the difference in counts between treatments needed before the difference becomes significant!
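The conservative rule just described boils down to an element-wise maximum of the empirical and fitted values. A minimal sketch with hypothetical dispersion estimates (not the real fitting procedure):

```python
import numpy as np

# Per-gene empirical dispersion estimates (made-up values).
per_gene_disp = np.array([0.05, 0.40, 0.10, 0.90])

# Dispersion predicted by the fitted trend at each gene's mean (made-up).
fitted_disp = np.array([0.20, 0.20, 0.20, 0.20])

# Values below the fitted line are raised to it; values above are kept.
final_disp = np.maximum(per_gene_disp, fitted_disp)
```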

Finally we are ready to call DE values!

>res = nbinomTest( cds, "untreated", "treated" )


                 id  baseMean baseMeanA baseMeanB foldChange log2FoldChange

1     GWD8ZKM01ESPR2 0.0000000 0.0000000         0        NaN            NaN

2       g10671t00001 0.6817440 1.3634880         0          0           -Inf

3       g16789t00001 0.0000000 0.0000000         0        NaN            NaN

      pval padj

1        NA   NA

2 0.2083517    1

3        NA   NA

Now lets plot the log2 fold changes (untreated versus treated) against the mean normalised counts, colouring in red those genes that are significant at 10% FDR.


> plotMA(res)



Note how lower expressed genes need a bigger log2 fold change value to be made red (not so obvious in this data due to the low number of reads used in mapping).

Lets explore this a little more by comparing the normalized counts between all three replicates of just the untreated sample.

>ncu = counts( cds, normalized=TRUE )[ , conditions(cds)=="untreated" ]


>plotMA(data.frame(baseMean = rowMeans(ncu), log2FoldChange = log2( ncu[,2] / ncu[,1] )),
        col = "black")



Wow, look at all the DE in a single treatment! Thus we should call a gene DE only if the change between treated and untreated samples is stronger than what we see between replicates; hence the dividing line between red and black in the first figure should mimic that in the second.

And now a histogram of p values to look at their distribution.

> jpeg("hist_pval.jpg")

> hist(res$pval, breaks=100, col="skyblue", border="slateblue", main="")

> dev.off()



The pileup of very low p-values is the DE genes; the pileups at the right-hand side of the distribution are from very low count genes that are given discrete values.

Now lets create some subset tables based around FDR cut offs (in this case 10%)

>resSig = res[ res$padj < 0.1, ]


Before we make any mistakes, lets save the data

>write.csv( res, file="72hour_unstgVstg_all.csv" )

>write.csv( resSig, file="72hour_unstgVstg_only_0.1FDR.csv" )

We are done with the basic analysis, but we can do some more sophisticated filtering of the data. The main goal is to remove tests that have little chance of producing a meaningful result, with the aim of keeping the FDR in check. This increases the power to detect significant results while keeping the type I error rate (false positive rate) the same.
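The R commands below do exactly this; as a language-neutral sketch, the same quantile filter on a hypothetical count matrix looks like:

```python
import numpy as np

# Hypothetical raw counts: rows are genes, columns are samples.
counts = np.array([
    [0, 1, 0],        # near-empty gene: little chance of a meaningful test
    [2, 0, 1],
    [50, 60, 55],
    [200, 180, 220],
])

rs = counts.sum(axis=1)             # total count per gene across all samples
theta = 0.4                         # drop the lowest 40% quantile
use = rs > np.quantile(rs, theta)   # boolean mask of genes to keep
filtered = counts[use]              # these rows go on to the DE test
```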

>rs = rowSums ( counts ( cds))

>theta = 0.4

>use = (rs > quantile(rs, probs=theta))




>table(use)

FALSE  TRUE
30145 36029

Basically, here we are summing the counts across all conditions for each gene, then removing the genes in the lowest 40% quantile (filtering the low-count genes out of the original cds table).

>cdsFilt = cds[ use, ]

Now just repeat the tests on the filtered data

>resFilt = nbinomTest( cdsFilt, "untreated", "treated" )

Now lets look at the difference this makes to the FDR.

> tabAll <- table(res$padj < 0.1)

> tabFilt <- table(resFilt$padj < 0.1)

> tabAll

FALSE  TRUE
35266   763

> tabFilt

FALSE  TRUE
35266   763

In this case it didn't seem to make much difference, because most of the dropped values were NA! Just to visualise what is going on here, lets plot the unadjusted p-values against the rank used for the ~40% quantile filter.


> plot(rank(rs)/length(rs), -log10(res$pval), pch=16, cex=0.45)



Note that very few p values are less than 0.003 (this corresponds to about 2.5 on the −log10 scale), as is also demonstrated when we plot the p-values of each dataset.

> h1 = hist(res$pval,breaks=50,plot=FALSE)

> h2 = hist(resFilt$pval,breaks=50,plot=FALSE)

>colori <- c('do not pass'="khaki", 'pass'="powderblue")

> barplot(height = rbind(h1$counts, h2$counts), beside = FALSE, col = colori,
+         space = 0, main = "", ylab="frequency")

> text(x = c(0, length(h1$counts)), y = 0, label = paste(c(0,1)), adj = c(0.5,1.7), xpd=NA)

> legend(“topright”, fill=rev(colori), legend=rev(names(colori)))

> dev.copy(png,’hist_filthist.png’)



Now we are going to process the data further. Since some counts can be zero, in some applications it makes sense to add pseudocounts (i.e. y = log2(n + n0), where n0 is a small 'chosen' constant). The question is what value to add to the zero counts to make them non-zero (but keep them very small). DESeq uses an alternative approach.
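For reference, the shifted-log option just mentioned is a one-liner; a sketch using a hypothetical pseudocount of n0 = 1 on made-up counts:

```python
import numpy as np

counts = np.array([0.0, 1.0, 10.0, 1000.0])  # hypothetical counts

# Shifted logarithm: y = log2(n + n0); with n0 = 1 zero counts stay finite.
n0 = 1.0
shifted_log = np.log2(counts + n0)
```

A zero count maps to 0 rather than −infinity, at the cost of compressing the low-count end of the scale (which is exactly the region the variance stabilising transformation handles better).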

>cdsBlind = estimateDispersions( cds, method="blind" )

>vsd = varianceStabilizingTransformation( cdsBlind )

Now we will plot the per-gene standard deviation (taken across samples) against the rank of the mean, first using the shifted logarithm transformation (with 1 added to all values to remove zeros; notAllZero selects rows where at least one count is non-zero), then using DESeq's variance stabilising transformation. We would expect a roughly constant standard deviation along the whole dynamic range for the DESeq method, with the shifted logarithm giving a highly elevated standard deviation in the lower count range; in fact they both look pretty similar for this small dataset.




>notAllZero = (rowSums(counts(cds))>0)

>meanSdPlot(log2(counts(cds)[notAllZero, ] + 1), ylim = c(0,2.5))

>meanSdPlot(vsd[notAllZero, ], ylim = c(0,2.5))

Now to deal with moderate counts, which generate highly variable log fold change values that in some cases become infinite. This will mess up any kind of clustering we may want to perform to identify common expression patterns. We can compare the ordinary log-ratios (lfc) against the moderated log-ratios (mod_lfc) using a scatterplot, with the genes binned by their log10 mean so we can colour them according to expression strength.

>mod_lfc = (rowMeans( exprs(vsd)[, conditions(cds)=="treated", drop=FALSE] ) - rowMeans( exprs(vsd)[, conditions(cds)=="untreated", drop=FALSE] ))

> lfc = res$log2FoldChange

> table(lfc[!is.finite(lfc)], useNA="always")

-Inf   Inf   NaN  <NA>

5085  4900 30145     0

Many of these are infinite (Inf, division of a finite value by 0) or not a number (NaN, resulting from division of 0 by 0).

> logdecade = 1 + round( log10( 1+rowMeans(counts(cdsBlind, normalized=TRUE)) ) )

> lfccol = colorRampPalette( c( "gray", "blue" ) )(6)[logdecade]

> ymax = 4.5

> plot( pmax(-ymax, pmin(ymax, lfc)), mod_lfc,

xlab = "ordinary log-ratio", ylab = "moderated log-ratio",

cex=0.45, asp=1, col = lfccol,

pch = ifelse(lfc<(-ymax), 60, ifelse(lfc>ymax, 62, 16)))

> abline( a=0, b=1, col="red3")

Here grey to blue represents weakly to strongly expressed genes. For highly expressed genes there is general agreement (along the diagonal), but for lowly expressed genes the moderated log-ratio is shrunk towards 0 compared to the ordinary log-ratio (the grey smear flattens out along the horizontal, at 0 on the y-axis).


Now to look at heat maps of counts to see any patterns using the transformed  log data.



> cdsBlind = estimateDispersions(cds, method="blind")

> vsd = varianceStabilizingTransformation(cdsBlind)

>select = order(rowMeans(counts(cds)), decreasing=TRUE)[1:30]

>hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100)

>heatmap.2(exprs(vsd)[select,], col = hmcol, trace="none", margin=c(10, 6))


and for a laugh the untransformed data.


Now for a heat map of the sample-to-sample distances.

>dists = dist( t( exprs(vsd) ) )

>mat = as.matrix( dists )

> rownames(mat) = colnames(mat) = with(pData(cdsBlind), paste(condition))

> heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13))


No shocks here again. Of note is that we used the method "blind", so the variance stabilising transformation is not informed about the design, which prevents bias in the outcome. In comparison, the pooled method (the default for estimating the dispersion) "means that one dispersion is computed for each gene, which is now an average over all cells (weighted by the number of samples for each cell), where the term cell denotes any of the four combinations of factor levels of the design."

Now a good olde PCA, pretty cool really.

>print(plotPCA(vsd, intgroup=c("condition")))


This type of plot is useful for visualising the overall effect of experimental covariates and batch effects, so you might expect known conditions to group, i.e. treatment and any type of library prep effect (paired end reads versus single end data). Good to see that even with this subset of data the biggest component is treatment (PC1).


R version 2.15.3 (2013-03-01)

Platform: x86_64-pc-linux-gnu (64-bit)



[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8


[7] LC_PAPER=C                 LC_NAME=C

[9] LC_ADDRESS=C               LC_TELEPHONE=C


attached base packages:

[1] grid      stats     graphics  grDevices utils     datasets  methods

[8] base


other attached packages:

[1] arrayQualityMetrics_3.14.0 gplots_2.10.1

[3] KernSmooth_2.23-10         caTools_1.14

[5] gdata_2.12.0.2             gtools_2.7.1

[7] RColorBrewer_1.0-5         vsn_3.26.0

[9] BiocInstaller_1.8.3        DESeq_1.10.1

[11] lattice_0.20-15            locfit_1.5-9

[13] Biobase_2.18.0             BiocGenerics_0.4.0


loaded via a namespace (and not attached):

[1] affy_1.36.1           affyio_1.26.0         affyPLM_1.34.0

[4] annotate_1.36.0       AnnotationDbi_1.20.7  beadarray_2.8.1

[7] BeadDataPackR_1.10.0  Biostrings_2.26.3     bitops_1.0-5

[10] Cairo_1.5-2           cluster_1.14.4        colorspace_1.2-1

[13] DBI_0.2-5             gcrma_2.30.0          genefilter_1.40.0

[16] geneplotter_1.36.0    Hmisc_3.10-1          hwriter_1.3

[19] IRanges_1.16.6        latticeExtra_0.6-24   limma_3.14.4

[22] parallel_2.15.3       plyr_1.8              preprocessCore_1.20.0

[25] reshape2_1.2.2        RSQLite_0.11.2        setRNG_2011.11-2

[28] splines_2.15.3        stats4_2.15.3         stringr_0.6.2

[31] survival_2.37-4       SVGAnnotation_0.93-1  tcltk_2.15.3

[34] tools_2.15.3          XML_3.6-2             xtable_1.7-1

[37] zlibbioc_1.4.0


VIM shortcuts

Basic navigation

2h - 2 steps LEFT
H - home (top screen)
10j - 10 steps DOWN
J - join this line to the line below
{ - UP a section 
} - DOWN a section
o O - open line BELOW open line ABOVE
b - beginning of current word
w - beginning of next word
e - end of current word
$ - end of line 
^ - start of line
H - TOP of screen
L - BOTTOM of screen
v - select text
V - select line
v3j - select 3 lines DOWN

Modifying text

i - insert
a - append
dh - DELETE left
dl - DELETE right
dd - DELETE line
dw - delete word
D - DELETE to end of line
x - DEL
X - BACKSPACE
<< - INDENT line LEFT (use visual for block)
>> - INDENT line RIGHT (use visual for block)
p - paste after
P - paste before
3p - paste 3 copies
yy - yank whole line
yw - yank word
y$ - yank to end of line
4yy - yank four lines
Y - yank whole line
J - join current line to next

Save Exit and help

K - HELP (bash, req select word)
ZZ- save and quit
ZQ - quit without saving
z - save


[ - enter command mode
, - leader key
v - visual
: - enter command mode
i - insert mode
a - append mode
:tabnew - open new tab
,n - left tab
,m - right tab
/ - search # press enter, then n to repeat, N repeat up 
:s/old/new/g - globally change


F2 - paste code blocks from external (press again to end)
n - remove highlight last search
,e - quit
,E - qa! quit all windows
ctl j - move to lower window
ctl k - move to upper window
ctl h - move to left window etc
,s - sort selected
f - fold code block (again to undo)
F - fold all (again to undo)
ctl space - auto complete
shift p - filesearch


:Gstatus - status / stage by "-"
:Gcommit - message 'wq' save

Cool combinations

cH, dH - copy/delete to top of screen
d3+ - delete line + 2 lines below
y3+ - copy line + 2 lines below
d/test - delete up to the next occurrence of 'test'

Insert breakpoint and ipdb PDB bit…

Setting up VIM as a python IDE

If you are looking for a lightweight but feature rich editor you can’t go past VIM! Sure, those keyboard shortcuts are difficult to remember and it’s frustratingly hard trying to stop reaching for that bloody mouse (the mouse will actually work in this set-up BTW). However, what I have found is that the best way to learn VIM (I don’t think you ever really learn vim) is to make it useful enough that you want to start using it; this helps build up the muscle memory required to become efficient [enough]. The ultimate payback is that with practise you can become really, really efficient relative to a user stuck with a mouse and a complex GUI. The mouse is so 1980’s anyway 😛


So this post is about making VIM useful! The steps are based around Ubuntu/Debian Linux but should be adaptable for OSX and other NIX distros. Windows users, like yourselves I have no idea (-;?

Of course nothing I ever do is original and this especially goes for this post which borrows heavily from an excellent (but old) PyCon Asia talk by Martin Brochhaus. The .vimrc and much of the instructions are taken from the talk so feel free to watch it on youtube. I’ve also tried to update some of the info and stick to the minimum to get you coding away ASAP.

Section 1: Basic VIM install

These steps will clone the current VIM source and build it with additional features enabled. If you already have a copy installed (i.e. vim starts when you type ‘vim’) I would first run ‘apt-get remove vim’ (or your equivalent). First, dependencies and some tools.

sudo apt-get update
sudo apt-get install mercurial curl python-pip flake8
sudo apt-get build-dep vim

#install python packages
sudo pip install ipdb
sudo pip install jedi
sudo pip install flake8

As we are going to use a bin folder in our home directory we need to put this location into our path so that our new version of vim will run from the command line.

#add a home bin dir to your path
nano ~/.bashrc

#add the following to the end of the file
if [ -d $HOME/bin ]; then
    PATH="$HOME/bin:$PATH"
fi

#save and exit

Now we make the ~/bin dir as well as an ~/opt dir for the vim install. Eventually we will use a symlink from ~/bin (now in our path) to the ~/opt folders in our home directory; the -p flag says ‘don’t complain if this dir is already there’. The final command reloads your bashrc without you having to log out. Cool.

mkdir -p ~/bin
mkdir -p ~/opt
source ~/.bashrc

Now to install VIM.

cd ~/Desktop
hg clone vim
cd vim/src
./configure --enable-pythoninterp --with-features=huge --prefix=$HOME/opt/vim
make install
cd ~/bin
ln -s ~/opt/vim/bin/vim

#the following should return /home/YOUR_USERNAME/bin/vim
which vim


Section 2: The vimrc settings file

Now we need a “.vimrc” file. This file will contain our customisations (the leading dot marks it as hidden). You can get mine from here (download it to your desktop or copy the contents into a blank text file). It is worth looking at this file in a text editor as it contains some information on the key shortcuts and the plugins we are going to use (BTW, in .vimrc files a double quote " starts a comment (like # in bash), and a doubled quote "" marks commented-out code that can be uncommented to enable something). The next bit of code simply copies the file to your home dir and then gets a custom colour scheme (mostly just to show how to do it).

mv ~/Desktop/vimrc.txt ~/.vimrc
mkdir -p ~/.vim/colors
cd ~/.vim/colors
wget -O wombat256mod.vim

Section 3: Getting set-up for the plugins with pathogen

This next part is important. To manage plugins we will use a bit of kit called pathogen. The plugins can then be installed (mostly by a git clone) right into the “~/.vim/bundle/” (this will result in a folder structure like this: .vim/bundle/plugin-name) and pathogen will handle everything for us – awesome! Most plugin developers set-up their folders to work nicely with pathogen to make life easy.

mkdir -p ~/.vim/autoload ~/.vim/bundle
curl -so ~/.vim/autoload/pathogen.vim

Section 4: Installing plugins

The first plugin we will use is called powerline and it adds features and makes a better looking status line.

#install powerline plugin
cd ~/.vim/bundle
git clone git://

The next plugin allows code folding, to make it easier to look through long blocks of code. It’s simple to use: just type f to collapse a section of code or F to collapse it all. Type again to reverse the folding – sweet as.

# install folding plugin: use f to fold a block, F to fold all
mkdir -p ~/.vim/ftplugin
wget -O ~/.vim/ftplugin/python_editing.vim

The next one installs ctrlp which allows nice fuzzy file searches from directly inside vim.

#install ctrlp
cd ~/.vim/bundle
git clone

Install the jedi plugin which allows awesome auto-completion of commands and imports using [ctl][space].

#install jedi plugin
git clone --recursive

Syntastic is another awesome plugin; it does syntax checking and will check your code for compliance with PEP8.

#install syntastic
git clone

Forget tabbing into the terminal to stage/commit/merge etc just install git support with fugitive.

# install git support
cd ~/.vim/bundle
git clone git://
vim -u NONE -c "helptags vim-fugitive/doc" -c q

So that is it! Your editor should look something like this:


Well that’s the install. Next post will be the shortcuts I use all the time and some that I want to learn as a bookmark for myself. Otherwise checkout the links to the plugins for detailed descriptions of what they do. But now you should be up and running with a pretty nice looking IDE!

Install Arch Linux on virtualbox – the nuts and bolts (pt2)

**This continues on from the previous post that can be found here**

Now that we have our base system installed, it’s time to add some tools that will give us a nice GUI desktop. But first off we will set up sudo so we can stop being root, which means creating a user. This first command creates a home directory using the -m flag, adds the user to the administrator group (wheel) with the -G flag, and sets bash as the login shell with -s. Enter a password after the passwd command. Obviously substitute your own username and password.

useradd -m -G wheel -s /bin/bash dwheeler
passwd dwheeler

Next we set up sudo. Sudo has a special editor called “visudo”; we should always use this to modify the config file. After typing the visudo command, scroll down to the line that contains “root ALL=(ALL) ALL”, and underneath it add your username and the “ALL=(ALL) ALL” part. Note that visudo uses VI, which can be a little tricky for the uninitiated. Once you are at the line where you want to insert your username, type “i” to insert; once you have finished, press [esc], then colon “:” and “wq” to save and exit (i.e. “:wq”).

root  ALL=(ALL) ALL
dwheeler  ALL=(ALL) ALL

Type “reboot” and now login as yourself. Now to install some graphics tools.

sudo pacman -S xorg-server xorg-xinit xorg-server-utils mesa
sudo pacman -S xorg-twm xorg-xclock xterm

Because we are using virtualbox, we need to install some helper tools that will allow the graphics to work properly. I got some good ideas from this post; it might be worth checking it out for a second way of doing this.

sudo pacman -S virtualbox-guest-utils
sudo nano /etc/modules-load.d/virtualbox.conf

After nano opens, add these lines (the guest kernel modules) to the virtualbox.conf file:

vboxguest
vboxsf
vboxvideo
Then get this to load.

sudo systemctl enable vboxservice.service

Reboot using “sudo reboot”, once you are back into the environment type “startx”, and some very basic windows should open confirming that x is working! Type exit in these windows to return to the terminal.


Finally, we will install the desktop environment. This consists of two parts, first the display manager that will log us in and kick off the desktop, and then the desktop environment itself. You can choose from a bunch of different environments, from fancy feature rich to bare-bones, see the arch desktop page for the options. I’m all for saving resources so I’m installing the lightweight LXDE. Installing the lxde group using pacman also installs the display manager (called lxdm).

pacman -S lxde

Accept the defaults. Now we need to get the display manager to load automatically at boot, and set the default desktop (note you can install multiple desktop environments; lxdm will give you the option to boot into them instead of the default should you wish).

sudo systemctl enable lxdm
sudo nano /etc/lxdm/lxdm.conf

In nano, uncomment the session line for the desktop environment you want as the default (in my case the LXDE one).

Reboot, login, and Bob’s your uncle!

Note: If full screen doesn’t work, try typing “sudo depmod -a” in a terminal and then reboot again


Last job is just to change the permissions on the shared folder so that we can access the host (replace username with your username).

sudo usermod -a -G vboxsf username


New Zealanders and their sheep – part 2

Ok, based on the graphs in the last post NZ is slowly being cow-a-fyed, so what’s driving this trend? Well, google will tell you that it’s …

WARNING: This data is dodgy, but I’m really just using it to demonstrate how cool pandas is. So I found some information on milk and lamb meat prices; we’ll load them up as dataframes and work out the percent change since 1994 like we did before. We’ll try out the datetime functionality of pandas, which is really quite nice. But first, just to import our table from the last post and make the year the index so we can easily merge the new data.

import pandas as pd
per_decline = pd.DataFrame(pd.read_csv('percent_decline.csv'))
cols = per_decline.columns.values
cols[0] = 'Year'
per_decline.columns = cols
per_decline.index = per_decline['Year']
per_decline = per_decline.ix[:,1:] #all rows, skip first column (date is now the index)

	Total beef cattle	Total dairy cattle	Total sheep	Total deer	Total pigs	Total horses
1994	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000
2002	-11.025827	 34.444950	-20.002034	 33.858009	-19.100637	 11.811093
2003	 -8.344764	 32.882482	-20.041908	 37.229441	-10.766476	 18.504488
2004	-11.895128	 34.207998	-20.609926	 42.707754	 -8.072078	 13.376472
2005	-12.366101	 32.506699	-19.379727	 38.499840	-19.230733	-100.000000

Now we are going to create a table from the dodgy lamb price data, this table is in a slightly different format so we will have to use the groupby method to wrangle it into the shape we need.

lamb_data = pd.DataFrame(pd.read_excel('lamb_usd.xlsx',sheetname='lamb'))
	Month	Price	Change
0	 Apr 1994	 130.00	 -
1	 May 1994	 126.59	 -2.62 %
2	 Jun 1994	 127.03	 0.35 %
3	 Jul 1994	 126.11	 -0.72 %
4	 Aug 1994	 119.62	 -5.15 %

Now to use datetime to make an index based on the month data.

lamb_data.index = pd.to_datetime(lamb_data['Month'])
lamb_data=lamb_data.ix[:,1:2] #just grab the price
1994-04-02	 130.00
1994-05-02	 126.59
1994-06-02	 127.03
1994-07-02	 126.11
1994-08-02	 119.62

Pandas did a good job of converting the date format into a datetime index. As you’ll see in a second, this datetime object has some extra functionality that makes dealing with dates a breeze. Although this new data has the date and price information we need, it’s divided into monthly amounts. As you can see from the commented-out code, initially I made a mistake and summed these values, but really we want the mean to get the average yearly price. I left the mistake code there as it shows how easy it would have been to get the sum using groupby.

#wrong! lamb_prices = lamb_data.groupby(lamb_data.index.year)['Price'].sum()
lamb_prices = lamb_data.groupby(lamb_data.index.year)['Price'].mean()
lamb_prices = pd.DataFrame(lamb_prices[:-1]) #get rid of 2014
1994	 124.010000
1995	 113.242500
1996	 145.461667
1997	 150.282500
1998	 116.013333

We pass the year index to groupby and get it to do its magic on the price column (our only column in this case, but you get the idea), we then just call the mean method to return the mean price per year. The datetime object made specifying the year easy. Now we are going to write a quick function to calculate the percent change since 1994.
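(To see the groupby-by-year trick in isolation, here is a toy example with made-up monthly prices; the numbers are hypothetical, not the lamb data.)

```python
import pandas as pd

# Toy monthly price series on a DatetimeIndex.
idx = pd.to_datetime(['1994-04-01', '1994-05-01', '1995-01-01', '1995-02-01'])
prices = pd.DataFrame({'Price': [100.0, 110.0, 120.0, 140.0]}, index=idx)

# A DatetimeIndex exposes .year, so collapsing monthly rows
# into yearly means is a one-liner.
yearly = prices.groupby(prices.index.year)['Price'].mean()
```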

def percent(start,data):
    '''calculate percent change relative to first column (1994), better than previous attempt )-:'''
    ans = 100*((start - data)/start)
    return 0-ans

lamb_change = percent(lamb_prices.Price[1994],lamb_prices.Price)
1994	 0.000000
1995	 -8.682768
1996	 17.298336
1997	 21.185791
1998	 -6.448405

Great! Now just add that column to our original dataframe. Notice how only the intersection of the dates is used, very handy (i.e. it drops 1995–2001 from the lamb price data, as these dates are not in our stock number table)!
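This alignment behaviour is easy to demonstrate on a toy example (hypothetical values):

```python
import pandas as pd

# A small "stock" table indexed by year, and a price Series
# covering extra years that the stock table doesn't have.
stock = pd.DataFrame({'sheep': [-20.0, -19.4]}, index=[2002, 2005])
lamb = pd.Series({1995: -8.7, 2002: 17.8, 2005: 29.8, 2010: 40.0})

# Assigning a Series as a new column aligns on the index:
# 1995 and 2010 are silently dropped, matching years are kept.
stock['Lambprice'] = lamb
```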

per_decline['Lambprice'] = lamb_change
	Total beef cattle	Total dairy cattle	Total sheep	Total deer	Total pigs	Total horses	Lambprice
1994	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000
2002	-11.025827	 34.444950	-20.002034	 33.858009	-19.100637	 11.811093	 17.768056
2003	 -8.344764	 32.882482	-20.041908	 37.229441	-10.766476	 18.504488	 28.869984
import matplotlib.pyplot as plt

per_decline.index = per_decline.index.astype(int) #lamb2
per_decline.plot()  # plot each column against year
plt.title('Percent change in stock in NZ since 1994')
plt.ylabel('Percent change since 1994')
plt.show()



The next series of code and graphs adds in milk and lamb prices to try and see why farmers are moving from ovines to bovines!

milk_data = pd.DataFrame(pd.read_excel('milk_prices_usd.xlsx',sheetname='milk'))
	year	thousand head	pounds	mill lbs	price_cwt
1989	 1989	 10046	 14323	 143893	 13.56
1990	 1990	 9993	 14782	 147721	 13.68
1991	 1991	 9826	 15031	 147697	 12.24
1992	 1992	 9688	 15570	 150847	 13.09
1993	 1993	 9581	 15722	 150636	 12.80
#get rid of the info we don't need
milk_data = pd.DataFrame(milk_data.ix[5:,:])
milk_change = percent(milk_data.price_cwt[1994],milk_data.price_cwt)
per_decline['milk_price'] = milk_change
	Total beef cattle	Total dairy cattle	Total sheep	Total deer	Total pigs	Total horses	Lambprice	milk_price
1994	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000
2002	-11.025827	 34.444950	-20.002034	 33.858009	-19.100637	 11.811093	 17.768056	 -6.630686
2003	 -8.344764	 32.882482	-20.041908	 37.229441	-10.766476	 18.504488	 28.869984	 -3.469545
2004	-11.895128	 34.207998	-20.609926	 42.707754	 -8.072078	 13.376472	 33.670672	 23.747109
2005	-12.366101	 32.506699	-19.379727	 38.499840	-19.230733	-100.000000	 29.762385	 16.653816

These graphs are a little busy, lets just concentrate on the important stuff.

animals=['Total dairy cattle','Total sheep','Lambprice','milk_price']