Illumina quality control

Several tools are available for investigating the quality of your sequencing data, and these should be used before any downstream analysis. After all, crap in, crap out.

The first tool, written in Java and called FastQC, has a simple GUI or can be run from the command line to generate a number of useful graphs describing several important features of your sequence data. These include (among others) per base quality and read length distributions, as well as detection of sequence duplication and k-mer content. Handily, the authors also provide some good and bad Illumina data that you can use to test out the software.

Running it is easy: either use the GUI or just type: ./fastqc seq_file_name

The --help flag lists the other options. Format detection is automatic and worked for me, so probably the only option you might consider is '-t' to set the number of threads to use.
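If you have a pile of files to check, it can be handy to build the command in a script rather than typing it each time. Here is a minimal sketch in Python, assuming fastqc is on your PATH. The function names are my own, and the only FastQC options used are -t (threads) and -o (output directory, which as far as I know must already exist).

```python
import shutil
import subprocess
from typing import List, Optional


def build_fastqc_command(files: List[str], threads: int = 1,
                         outdir: Optional[str] = None,
                         executable: str = "fastqc") -> List[str]:
    """Assemble a FastQC command line as an argument list (hypothetical helper)."""
    if not files:
        raise ValueError("at least one sequence file is required")
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    cmd = [executable, "-t", str(threads)]
    if outdir is not None:
        # Assumption: the output directory has been created beforehand.
        cmd += ["-o", outdir]
    return cmd + list(files)


def run_fastqc(files: List[str], threads: int = 1,
               outdir: Optional[str] = None) -> None:
    """Run FastQC and raise if it is missing or exits with an error."""
    cmd = build_fastqc_command(files, threads, outdir)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("fastqc was not found on PATH")
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only prints the command; call run_fastqc(...) to actually execute it.
    print(" ".join(build_fastqc_command(["seq_file_name"], threads=2)))
```

Keeping the command building separate from the running makes it easy to check what will be executed before letting it loose on real data.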


As you can see in the HTML report produced by FastQC, a summary indicates whether your data meet certain quality expectations (these seem to be very strict).

Here I have run the good (left) and bad (right) test data sets and shown them side by side for comparison. Edit: these data sets come from the SolexaQA package that I’ll talk about in another post.


These box-and-whisker plots of quality per base show how Illumina data degrades towards the 3′ end of the read. The red line is the median, the yellow box is the interquartile range (25-75%), the upper and lower whiskers mark the 10% and 90% points, and the blue line is the mean quality. Background shading indicates good (green), reasonable (orange) and poor (red) quality scores. A warning is issued if the lower quartile for any base is less than 10, or if the median for any base is less than 25.
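To make the plot less of a black box, here is a rough sketch of how the per base numbers could be computed from the quality lines of a FASTQ file. This is not FastQC's own code. It assumes Phred+33 encoding, uses a simple nearest-rank percentile, and applies the warning thresholds as I understand them from the FastQC documentation (lower quartile below 10, or median below 25). All function names are made up for illustration.

```python
import math
from typing import Dict, List


def percentile(sorted_values: List[int], fraction: float) -> int:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(fraction * len(sorted_values)))
    return sorted_values[rank - 1]


def per_base_summary(quality_lines: List[str],
                     offset: int = 33) -> List[Dict[str, float]]:
    """Summarise quality at each read position across all reads."""
    longest = max((len(q) for q in quality_lines), default=0)
    summary = []
    for pos in range(longest):
        # Only reads long enough to cover this position contribute.
        scores = sorted(ord(q[pos]) - offset
                        for q in quality_lines if len(q) > pos)
        summary.append({
            "p10": percentile(scores, 0.10),
            "lower_quartile": percentile(scores, 0.25),
            "median": percentile(scores, 0.50),
            "upper_quartile": percentile(scores, 0.75),
            "p90": percentile(scores, 0.90),
            "mean": sum(scores) / len(scores),
        })
    return summary


def positions_with_warning(summary: List[Dict[str, float]]) -> List[int]:
    """1-based positions where the assumed warning thresholds are crossed."""
    return [i + 1 for i, stats in enumerate(summary)
            if stats["lower_quartile"] < 10 or stats["median"] < 25]


if __name__ == "__main__":
    reads = ["II#", "II#", "II5", "II5"]  # 'I' = Q40, '5' = Q20, '#' = Q2
    print(positions_with_warning(per_base_summary(reads)))
```

In the toy example the first two positions are all Q40, while the last position drags the lower quartile down to Q2, so only that position gets flagged.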


These plots show the quality score distribution over all sequences. The FastQC manual says that even in good data a small proportion of sequences may have low quality scores, often because they were poorly imaged.
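The same idea per read rather than per position: average each read's quality and count how many reads land on each score. A small sketch, again assuming Phred+33 encoding and using names of my own invention. Truncating the mean to an integer is my simplification, not necessarily what FastQC does.

```python
from collections import Counter
from typing import Iterable


def mean_quality(quality_line: str, offset: int = 33) -> float:
    """Mean Phred score of one read's quality string."""
    if not quality_line:
        raise ValueError("empty quality string")
    return sum(ord(ch) - offset for ch in quality_line) / len(quality_line)


def quality_distribution(quality_lines: Iterable[str],
                         offset: int = 33) -> Counter:
    """Count reads per (truncated) mean quality score, skipping empty reads."""
    return Counter(int(mean_quality(q, offset)) for q in quality_lines if q)


if __name__ == "__main__":
    for score, reads in sorted(quality_distribution(["II", "##", "I5", "II"]).items()):
        print(score, reads)
```

A healthy run should pile up at the high end, with only a thin tail of poorly imaged reads at low scores.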


Nothing dramatic here. These plots show the proportion of each nucleotide at each position across the reads; the lines should be roughly horizontal, reflecting the underlying GC%.


The sequence duplication level will reveal any problems during library preparation. Note that any sequences that occur more than 10 times are lumped into the 10+ category, so if you see a pile-up there you have problems. Note also that some duplication may occur if there is very high coverage.
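The binning behind this plot is simple enough to sketch. The snippet below counts how many distinct sequences occur once, twice and so on, lumping everything at ten or more copies into a single 10+ bin. It is a simplified illustration with a made-up function name, and it counts every read exactly, which may not be how FastQC samples its input.

```python
from collections import Counter
from typing import Dict, Iterable


def duplication_levels(sequences: Iterable[str], cap: int = 10) -> Dict[str, int]:
    """Number of distinct sequences seen at each duplication level."""
    copies = Counter(sequences)
    labels = [str(n) for n in range(1, cap)] + [f"{cap}+"]
    levels = {label: 0 for label in labels}
    for count in copies.values():
        # Everything at or above the cap shares the final bin.
        label = str(count) if count < cap else f"{cap}+"
        levels[label] += 1
    return levels


if __name__ == "__main__":
    reads = ["ACGT"] * 12 + ["TTGA"] * 2 + ["GGCA"]
    print(duplication_levels(reads))
```

In the toy data one sequence turns up 12 times and lands in the 10+ bin, which is exactly the kind of pile-up to worry about.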


The per base N content should be less than 3%.
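For the curious, the N content at each position is just the share of reads with an uncalled base there. A quick sketch with hypothetical function names. The 3% default is simply the figure quoted above, not something I have checked against FastQC's own thresholds.

```python
from typing import List


def per_base_n_percent(sequences: List[str]) -> List[float]:
    """Percentage of N calls at each read position."""
    longest = max((len(s) for s in sequences), default=0)
    percentages = []
    for pos in range(longest):
        # Shorter reads simply do not contribute to later positions.
        bases = [s[pos] for s in sequences if len(s) > pos]
        n_calls = sum(1 for base in bases if base in "Nn")
        percentages.append(100.0 * n_calls / len(bases))
    return percentages


def positions_over_threshold(percentages: List[float],
                             threshold: float = 3.0) -> List[int]:
    """1-based positions whose N content is at or above the threshold."""
    return [i + 1 for i, pct in enumerate(percentages) if pct >= threshold]


if __name__ == "__main__":
    reads = ["ACGN", "NCGT", "ACGT", "ACG"]
    pct = per_base_n_percent(reads)
    print(pct, positions_over_threshold(pct))
```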


This plot shows the GC content distribution of the data (red) against a theoretical modeled distribution (blue). A GC distribution shifted away from the modeled normal distribution indicates a systematic bias that is independent of base position, probably the result of contamination in the library or some kind of biased subset of sequences.
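The red curve is essentially a histogram of GC% per read. Here is a small sketch of that observed side only. The theoretical curve is not reproduced, and the function names are my own. Bases other than G and C (including N) count towards the read length, which is a simplification.

```python
from collections import Counter
from typing import Iterable


def gc_percent(sequence: str) -> float:
    """Percentage of G and C bases in one read."""
    if not sequence:
        raise ValueError("empty sequence")
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return 100.0 * gc / len(sequence)


def gc_distribution(sequences: Iterable[str]) -> Counter:
    """Count reads per GC% (rounded to a whole percent), skipping empty reads."""
    return Counter(round(gc_percent(s)) for s in sequences if s)


if __name__ == "__main__":
    reads = ["ATGC", "ATAT", "GGCC", "ATGC"]
    for gc, count in sorted(gc_distribution(reads).items()):
        print(gc, count)
```

With real data you would expect a single roughly bell-shaped hump; a second hump or a shifted peak is the cue to go looking for contamination.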


In a random library the line should be horizontal. GC bias that changes between base positions can indicate an over-represented sequence in your library.


In most cases the sequence length distribution should be uninteresting (at least for Illumina data), but if any kind of trimming has occurred this should be revealed here.


Developer mode and Ubuntu on a Chromebook without dual booting

I really love my Chromebook, at least when I get to wrestle it off the wife. But every now and then I miss being able to jump into the terminal and start tapping away. I don’t really want to get rid of ChromeOS, though: it’s great the way it is, and dual booting seems to defeat the purpose of having a quick start “always on” laptop.

So, long story short, I wasn’t ready to go down the install-Ubuntu-on-your-Chromebook path. That was until I came across a great post on Google+ which describes a bunch of scripts called crouton that allow you to run Ubuntu inside of ChromeOS! Switching between the two systems is as easy as a keystroke, and it is instant.

Instructions for ARM based Samsung chromebooks (the $250 ones)

Before you start: back up everything! If you live in the cloud this is not a problem, as Google is taking care of it for you, but this process WILL WIPE YOUR DRIVE (you can restore ChromeOS if you like, no harm no foul, but any data on your SSD will be GONE).

Also, developer mode is less secure, as Google is no longer watching your back. Don’t be stupid with your newfound power!

Identity crisis, Linux and ChromeOS side by side!

1. Power off and enter developer mode by tapping the power button while holding down ESC and REFRESH. This should bring up a scary “your OS is damaged” screen. Now hit Ctrl+D and, following the on-screen instructions, press ENTER to switch to developer mode (this takes some time).

2. Once you are back at the login screen, type in your Google credentials and restore your system. For me, everything seemed to run a lot quicker than with the walled garden version, but maybe I was dreaming.

3. Once you are up and running with ChromeOS, open a browser and download crouton. Use the file manager to check that it has landed in your Downloads directory.

4. Now open a crosh window by typing Ctrl+Alt+T (note, this won’t work if you use the VT2 terminal, i.e. Ctrl+Alt+Forward!). At the crosh prompt type “shell” to, well, enter the shell.

5. Follow the instructions to set up a sudo password!

6. Now: “sudo sh -e ~/Downloads/crouton -t xfce”. Note that this will install the xfce desktop (a lightweight desktop that comes with Xubuntu; really, it’s all you need). See the crouton page for more information on other options (Unity, if you’re that way inclined-:).

7. After the system downloads and installs, type “sudo startxfce4” at the prompt to enter your new Linux environment!

So far this is great. To switch between stock ChromeOS and xfce just use Ctrl+Alt+Shift+Back and Ctrl+Alt+Shift+Forward. How easy is that!

So what now? Well, apt-get works, so why not install nano or, in my case, gimp, since I needed to edit some of the photos in this post. Life is good (-:

Issues: yes! There is a little bit of instability, but I’ve found that a quick switch between OSes seems to calm things down.