Oft repeated commands and work flows [work in progress]

This is just a dumping ground for commands I perform all the time so that I have somewhere to look them up!

Convert tophat bam file into a sorted sam file for htseq-count


#if used tophat should be already sorted
samtools view -h tophat_sorted.bam > tophat_sorted.sam

#if need to sort before conver (output tophat_sorted.bam)
samtools sort tophat.bam tophat_sorted

Bash

wc -l seq_file.fas #count lines
grep -c '>' seq_file.fas #count seqs in fasta file
ls -l #permissions and file size
ls -lh #above with human format for sizes
chmod a+wx #add read write permissions to all
chmod 755 #read and execute for user/world
chmod -R 755 #change all files in folder
tail -n 400 #last 400 lines
tail -n +400 #from 400 to end
history | grep "command" #search history for commands
#sed prints to screen, redirect or us -i for inplace
#print a specific line (100) using sed -n just shows that line
sed -n -e '100d' test.txt
#delete a specfic line (line 1000) from a file and backup original
sed -i.bak -e '1000d' file.txt
#delete lines
sed -i.bak -e '1d,10d
#removes line 10 to 20 INCLUSIVE ie 11 lines
sed -i -e '10,20d' test.txt
#remove line 10, lines 15-20, line 100
sed -i -e '10d;15,20d;100d' test.txt
#convert fastq to fasta file!(effecient)[p is print]
sed -n '1~4 s/^@/>/p;2~4p' seq.fq > seq.fas
#get the md5 checksum of a folder
find FOLDER/ -type f -exec md5sum {} \; | sort -k 34 | md5sum > md5.txt

Htseq-count -s flag turn off stranded union is default

htseq-count -s no -m union tophat_sorted.sam > sample_counts.txt

Vim

v #visual
p #paste
# find word at hash
yy #yank word or line
d #delete line
wq #save and close
ma #mark a, use 'a' to goto this section
mz # mark z as above
{} #select paragraph
{d} #cut block</pre>

Python specific

pip install --upgrade packagename
#virtual env
virtualenv venv
source venv/bin/activate
deactivate venv

print "%f"%(np.mean(seq_length))
#prints 1019.662143
print "%.2f"%(np.mean(seq_length))
#prints 1019.66
print "%.1f"%(np.mean(seq_length))
#prints 1019.7 note rounding up
print "%d"%(np.mean(seq_length))
#prints 1019

#use with open('') as for automatic file closing
with open('infile.txt','r') as f:
    f.read()

Git

WordPress (see text for tags)

use the following tags
in sqaure_brackets code language="css" and end with code in sqr bracktets
import sys

test = sys.argv[1]

for name in test:
    print name
plot(xmir205,ycbx1,'go',fig_fn205(xmir205'),'-k')
plot(xmir205,ycbx1,'go',fig_fn205(xmir205),'-k')

R

browseVignettes()

Matplotlib
Change the font family.

font = {'family' : 'normal',
        'weight' : 'bold',
        'size'   : 22}

matplotlib.rc('font', **font)

Basic data plotting with ipython and pandas

I’ve posted before on how much I love plotting in python using pandas, matplotlib, and ipython. But I thought I would do a blog post that would contain some important ‘bookmarks’ on how to do some essential plotting tasks. Just as a simple example of how pandas could be used in a biological sense, how about we make a dataframe from a blast table outputted from NCBI BLAST. The example data is just a blast search using transcripts as a qry with output option 8, which creates a tabular output that is nice for parsing using scripts. Note, if you want to use my table go for it, its here as an tab-delimited Excel spreadsheet, for this example I just copied it into a text document called ‘blast_output.txt’ – I can’t upload txt files at wordpress)-:)

1) Basic table manipulations and very simple excel-like maths functions.

Normally I probably would pre-process the data using a simple python script, but here I’ll do it all in pandas.

from pandas import Series,DataFrame
import matplotlib.pyplot as plt
import pandas as pd
blast_data = pd.read_table('blast_output.txt',header = 0)
blast_data = DataFrame(blast_data)
blast_data.head()

#          QUERY  NUMHITS                                            HITDESC  \
#0  Sequence_7116        1  ref|XP_004521769.1| PREDICTED: M-phase phospho...   
#1  Sequence_7117        1  ref|XP_002059188.1| GJ16167 [Drosophila virili...   
#2  Sequence_7118        1  ref|XP_002008154.1| GI11977 [Drosophila mojave...   
#3  Sequence_7119        1  ref|XP_001955873.1| GF24906 [Drosophila ananas...   
#4  Sequence_7120        1  ref|XP_004520750.1| PREDICTED: proteoglycan 4-...   
#
#   NUMHSP           EVAL  BITSCORE  PERCENTIDENT  PERCENTCONSERV  QUERYCOV  \
#0       1   2.000000e-44       158      0.702290        0.793893  0.511719   
#1       1  1.000000e-175       502      0.766667        0.886667  0.853081   
#2       3   1.000000e-46       132      0.680412        0.762887  0.126928   
#3       1   2.000000e-62       211      0.452830        0.577358  0.716599   
#4       2   0.000000e+00       734      0.514886        0.643608  0.577610   

#     HITCOV  
#0  0.796178  
#1  0.974026  
#2  0.572414  
#3  0.933824  
#4  0.650243

So now we have the blast table stored as a pandas data frame, it might be a good time to show how to splice and dice this frame so that it shows information that we are interested in. Firstly, column and index headings are easy to access using (in this example) “blast_data.columns” and “blast_data.index”. To grab a column is simple:

blast_data['QUERY']
##output##
0     Sequence_7116
1     Sequence_7117
2     Sequence_7118
3     Sequence_7119
4     Sequence_7120
5     Sequence_7121
......

To slice section of the table we would use “blast_data.ix”. As this is a two demensional table sometimes its a little tricky to workout how “ix” works. But its quite simple really, its just the two dimensions separated by a comma, with the row range first and columns second, remembering that as this is python counting starts at 0! So for some easy examples, lets see how to get the first line and all rows (1), first 10 rows and 3 columns (2), first 10 rows and last 2 columns (3). As you can see its just normal python list splicing but in 2 dimensions.

#first line and all rows (1)
blast_data.ix[0:0,:] #index 0 to 0 ie first row and ":" columns ie all of them
###output###
 QUERY  NUMHITS                                            HITDESC  \
0  Sequence_7116        1  ref|XP_004521769.1| PREDICTED: M-phase phospho...   
   NUMHSP          EVAL  BITSCORE  PERCENTIDENT  PERCENTCONSERV  QUERYCOV  \
0       1  2.000000e-44       158       0.70229        0.793893  0.511719   
     HITCOV  
0  0.796178  

#first 10 rows and 3 columns (2) 
 blast_data.ix[0:10,0:3] #standard python slice
###output####
            QUERY  NUMHITS                                            HITDESC
0   Sequence_7116        1  ref|XP_004521769.1| PREDICTED: M-phase phospho...
1   Sequence_7117        1  ref|XP_002059188.1| GJ16167 [Drosophila virili...
2   Sequence_7118        1  ref|XP_002008154.1| GI11977 [Drosophila mojave...
3   Sequence_7119        1  ref|XP_001955873.1| GF24906 [Drosophila ananas...
4   Sequence_7120        1  ref|XP_004520750.1| PREDICTED: proteoglycan 4-...
5   Sequence_7121        1  ref|XP_004537614.1| PREDICTED: protein suppres...
6   Sequence_7122        1  ref|XP_002073069.1| GK13353 [Drosophila willis...
7   Sequence_7123        1  ref|XP_001981946.1| GG11315 [Drosophila erecta...
8   Sequence_7124        1  ref|XP_002052580.1| GJ17621 [Drosophila virili...
9   Sequence_7125        1  ref|XP_002007909.1| GI13202 [Drosophila mojave...
10  Sequence_7126        1  ref|XP_004520332.1| PREDICTED: nucleolar prote...

#first 10 lines and last 2 rows (3)
blast_data.ix[0:10,-2:] #standard python again, '-2' is 2 from end
###output### 
    QUERYCOV    HITCOV
0   0.511719  0.796178
1   0.853081  0.974026
2   0.126928  0.572414
3   0.716599  0.933824
4   0.577610  0.650243
5   0.312649  0.376129
6   0.615196  0.996016
7   0.934222  1.000000
8   0.840226  1.000000
9   0.157009  0.342857
10  0.837484  0.960961

The great thing about the pandas dataframe is you can do excel like maths work using the table, for example firstly lets just look at E-values < 1E-100.

good_hits = blast_data[blast_data['EVAL']<1e-100]
len(good_hits)
#output 576
good_hits.head()

###output####
           QUERY  NUMHITS                                            HITDESC  \
1  Sequence_7117        1  ref|XP_002059188.1| GJ16167 [Drosophila virili...   
4  Sequence_7120        1  ref|XP_004520750.1| PREDICTED: proteoglycan 4-...   
5  Sequence_7121        1  ref|XP_004537614.1| PREDICTED: protein suppres...   
6  Sequence_7122        1  ref|XP_002073069.1| GK13353 [Drosophila willis...   
7  Sequence_7123        1  ref|XP_001981946.1| GG11315 [Drosophila erecta...   

   NUMHSP           EVAL  BITSCORE  PERCENTIDENT  PERCENTCONSERV  QUERYCOV  \
1       1  1.000000e-175       502      0.766667        0.886667  0.853081   
4       2   0.000000e+00       734      0.514886        0.643608  0.577610   
5       2  1.000000e-108       390      0.438655        0.605042  0.312649   
6       1  1.000000e-112       342      0.648221        0.833992  0.615196   
7       1   0.000000e+00      1845      0.513501        0.672667  0.934222   

     HITCOV  
1  0.974026  
4  0.650243  
5  0.376129  
6  0.996016  
7  1.000000

We can even collect the best hits by sorting by the EVAL column.

best_hits.sort('EVAL',inplace=True)
best_hits.head()
###output##
            QUERY  NUMHITS  \
501  Sequence_7617        1   
400  Sequence_7516        1   
754  Sequence_7870        1   
398  Sequence_7514        1   
755  Sequence_7871        1   

                                               HITDESC  NUMHSP  EVAL  \
501  ref|XP_004525475.1| PREDICTED: kelch domain-co...       1     0   
400  ref|XP_001362065.2| GA19724 [Drosophila pseudo...       1     0   
754  ref|XP_004524178.1| PREDICTED: mitogen-activat...       1     0   
398  ref|XP_004523028.1| PREDICTED: sphingosine kin...       1     0   
755  ref|XP_004536256.1| PREDICTED: V-type proton A...       1     0   

     BITSCORE  PERCENTIDENT  PERCENTCONSERV  QUERYCOV    HITCOV  
501       691            74        0.863248  0.537193  0.902724  
400       842            61        0.744382  0.904181  0.973574  
754      2147            73        0.837891  0.934356  0.965003  
398       636            53        0.680445  0.797900  0.993399  
755       897            84        0.897143  0.379302  0.949367

Some *very* simple math might be to convert the percent identity into a percentage value

good_hits['PERCENTIDENT'] = good_hits['PERCENTIDENT']*100
good_hits['PERCENTIDENT'] = good_hits['PERCENTIDENT'].astype(int)

Of course you could do a lot more complex excel type maths than that, but hopefully you get the idea!

2) Histograms – everyone loves a good histogram.

For this histogram how about we collect information from a blast search and parse out the species, so that we can look at histogram of ‘species richness’ in the best BLAST hits. In this format column 3 contains a hit descriptor that also lists the species name in square brackets (ie [Nasonia vitripennis]). So I want to parse out that information, collect all the totals, and plot the most frequent 10 (for example).  We really only need the species part, so we discard the genus information to avoid problems with all those contaminating Drosophila genomes in the database (-;

Now we are really interested in the column “HITDESC”. What we want is just the frequency distribution of the species in the square brackets, all the other info is not required for a histogram.

desc = blast_data['HITDESC']
desc
###output###
0     ref|XP_004521769.1| PREDICTED: M-phase phospho...
1     ref|XP_002059188.1| GJ16167 [Drosophila virili...
2     ref|XP_002008154.1| GI11977 [Drosophila mojave...
3     ref|XP_001955873.1| GF24906 [Drosophila ananas.
......

Although names are descriptions are incomplete this is just a display issue, the data is still there. First we will write a quick filtering function to grab the species name, then use list comprehension to make a collection of names, finally, we will use another function to count each occurrence of the names in the list.

def species(desc):
    #a function to pull out the species name from [species genus]
    line = line.split(' [')[1] #get species genus
    line = line.split(' ')[0] #trim off genus part
    return line

def counter(seq):
    """make a freq dict with species as key"""
    seq_dict = {}
    for n in seq:
        if n in seq_dict:
            seq_dict[n] += 1
        else:
            seq_dict[n] = 1
    return seq_dict

desc_list = [species(des) for des in desc] #make a list of names
###output###
['Ceratitis',
 'Ceratitis',
 .......
 'Ceratitis',
 'Oikopleura']

desc_dict = counter(desc_list)
#lambda magic to get a sorted list from an unsorted dictionary
most_freq8 = sorted(spec_dict,key=lambda x:spec_dict[x],reverse=True)[:8]
#if you wanted to check this works
#chk = spec_dict.values()
#chk.sort()
#chk[-8:]#show last ten values, crossref these against diction usin#g key as most freq species from most_freq8

Nearly there, you’ll probably notice our most_freq8 list is no good for a histogram because we have lost our frequency information. We can get it back using some functional programing magic (BTW for an excellent run through of functional programing in python check out the great how-to-here here).

most_freq_dict = dict(((spec,spec_dict[spec]) for spec in most_freq8)) 
most_freq_dict
###output###
{'Acyrthosiphon': 7,
 'Aedes': 8,
 'Anopheles': 6,
 'Ceratitis': 501,
 'Drosophila': 388,
 'Glossina': 7,
 'Musca': 7,
 'Sarcophaga': 8}

Briefly, a dictionary can be created from tuple of key value pairs, so all we do here is use list comprehension to combine the “spec_dict” frequency counts with the keys from the most frequent list to create a new dictionary. As it turns out, dictionaries can be converted directly to pandas dataframes, however, this will create a frame with all the species as column headings. We could flip the axes using DataFrame.transpose fucntion, but that is messy. Instead we will use list comprehension again to reformat the data (Yes I know I probably could have done this a few steps ago to avoid all this data mushing).

hist_data = DataFrame([[key,most_freq_dict[key]] for key in most_freq_dict])
hist_data
###output###
               0    1
0          Musca    7
1          Aedes    8
2     Drosophila  388
3      Anopheles    6
4      Ceratitis  501
5  Acyrthosiphon    7
6     Sarcophaga    8
7       Glossina    7
#need the index to be the species to get them to display

Great, sort of! Really, we want the species names to be the index so they are displayed in the histogram, we could just re-index from here, but lets re-make the table directly specifying that we want the index to be the key from the dictionary

hist_data = DataFrame([most_freq_dict[key] for key in most_freq_dict],columns = ['Frequency'],index=most_freq_dict.keys())
hist_data #we use the same dictionary(most_freq_dict) for columns and index, so this will work!
###Output### 
                 Frequency
Musca            7
Aedes            8
Drosophila     388
Anopheles        6
Ceratitis      501
Acyrthosiphon    7
Sarcophaga       8
Glossina         7

We can make a histogram directly from out dataframe, sweet.

fig = plt.figure()
axes = fig.add_subplot(1,1,1)
axes.set_ylabel("Species")
axes.set_xlabel("Frequency")
axes.set_title("Bla bla")
hist_data.hist(kind='barh',ax=axes,legend=False)

basic_pandas_hist

Finally, we really want the data_frame sorted by size to make the histogram look a little nicer.

hist_data = hist_data.sort('Frequency',ascending=False) #not in place
axes = fig.add_subplot(1,1,1)
axes.set_ylabel("Species")
axes.set_xlabel("Frequency")
axes.set_title("Bla bla")
hist_data.hist(kind='barh',ax=axes,legend=False)

basic_pandas_hist2

Bash for Biologists: A gateway to the terminal!

What is bash

Bash programming is a form of scripting that originates from the early days when a computer was a keyboard with a flashing cursor on a monochrome screen!

Credit: Wtshymanski @ wikipedia

Bash acts as a command line interpreter, taking input from a user (via the keyboard) and using it to make the computer to act in a certain way. Many of these commands are small scripts that do a multitude of tasks, from text processing to network communications. Although most of us now use a graphic user interface (GUI) to communicate with our computers, a basic understanding of bash programing can dramatically improve our productivity in certain situations. This is especially with dealing with the large files generated by most bioinformatic applications.

Where do I get this bash program

Well technically speaking Bash is not really a program, its more a command line interpreter and a scripting language. That said it is available by default on any NIX (Unix-like system ie ubuntu or fedora) Macintosh (OSX) and windows using a free program called cygwin (there are others as well.

Enough already!

First thing, don’t copy and paste! Why? Typing these commands is part of the hard wiring required to make them stick! Also, “with great power comes great responsibility”, and bash has the power to wipe your computer – no trash recovery – completely gone forever (except for that backup – right?)! By typing commands it gives you another opportunity to think about the command before you execute it and typing is more likely to trigger a big red flag in dangerous situations! Think twice, type once.

Ok, Ok, in the following tutorial I have attempted to provide command line solutions to some common problems faced by life scientists who have to deal with files generated by next generation sequencing technologies. I’ll probably do a few of these as with the explanations they get quite long. Of course, this is not an attempt to teach you to be a terminal master, that can only come from practice, but hopefully it will give you a taste of the power hiding behind that drab terminal exterior. It is also not designed to teach you how to program a computer, but I’m hoping this might be the gateway drug you need to get started.

The solutions in this tutorial were tested on a linux machine (Ubuntu), but should mostly work on Mac and windows through cygwin.

Problem 1. I have this huge file of DNA sequences in FASTA format (>seq1\nATGTAA) and I want to know how many sequences are in there?

Solution. This is a common problem, if your text editor manages to open the file without freezing you are still left with the problem of counting sequences. To our rescue is grep, grep is a little application that searches for strings. In our case as this is a fasta file we can search and count instances of the “>” to let us know how many sequences are in the file. And yes, this is a small file and you probably could have opened it on a 10 year old computer, but use your imagination on this one. Very important put the > in single quotes(‘>’)!!

grep ‘>’ fasta_example.fas

—————-output—————-

>fid|47273465|locus|VBIADEClo150191_2639|   >fid|47286894|locus|VBIADEClo157809_4509|   >fid|47397394|locus|VBIAERSal180824_4606|   >fid|20665099|locus|VBIAciCap40988_0842|   >fid|20666571|locus|VBIAciCap40988_1572|   >fid|20668069|locus|VBIAciCap40988_2310|   >fid|46816701|locus|VBIAciSp155132_0285|   >fid|46817921|locus|VBIAciSp155132_0889|   >fid|38082798|locus|VBIAciSp155501_0265|   >fid|38086805|locus|VBIAciSp155501_2252|   >fid|38088330|locus|VBIAciSp155501_3002|  >fid|38088628|locus|VBIAciSp155501_3149|  >fid|38090581|locus|VBIAciSp155501_4107|  >fid|85865228|locus|VBIActMas228228_1098|  >fid|101184055|locus|VBIActMas238937_2668|

So grep found all the lines with the ‘>’, but thats not exactly what we wanted. How do we count the lines. Now is a good time to introduce the manual entry (using the man command) available for all bash commands. This manual entry for grep is rather long, however it contains the information we need (highlighted in red below).

man grep

—————-output—————-

GREP(1)                                                                GREP(1)NAME

      grep, egrep, fgrep, rgrep – print lines matching a pattern

SYNOPSIS

      grep [OPTIONS] PATTERN [FILE…]

      grep [OPTIONS] [-e PATTERN | -f FILE] [FILE…]

…………….shortened for clarity…………………

  General Output Control

      -c, –count

             Suppress normal output; instead print a count of matching  lines

             for  each  input  file.  With the -v, –invert-match option (see

             below), count non-matching lines.  (-c is specified by POSIX.)

—————-end of output—————-

It looks like ‘-c’ flag (or switch) will cause grep to count the lines that match the search string rather than just print them out – perfect! Before we try it out its worth taking a second to look at another important part of the manual that specifies how to use grep. BTW you can use either -c or –count, the double minus sign just specifies the longer format of the flag that is often a little more human readable.

grep [OPTIONS] PATTERN [FILE...]

The square brackets are not used as part of the command, they merely specify that whatever is in the brackets is OPTIONAL. So the minimum command is “grep pattern”, with the possibility of adding a OPTION flag/switch and a FILE to search.This time let’s add our optional flag and repeat the search, note that the command order fits the structure described in the manual (grep ->option-> pattern ->file) .

grep -c ‘>’ fasta_example.fasta

—————-output—————-

15

Thats more like it. But reading through the manual was a little painful, we know we want to count something, so why don’t we use grep to search through the manual entry for grep, to find a line that uses the string ‘count’. Circular enough for you?To do that we need to introduce what are called pipes (“|”), they are the bar symbol on the keyboard (SHIFT backslash on my computer). First the command, then the explanation.

man grep ‘count’ | grep ‘count’

—————-output—————-

 -c, –count

             Suppress normal output; instead print a count of matching  lines

             below), count non-matching lines.  (-c is specified by POSIX.)

      -m NUM, –max-count=NUM

             –count  option  is  also  used,  grep  does  not output a count

      Large repetition counts in the {n,m} construct may cause  grep  to  use

Sweet as Bro! Pipes are a redirection, that is rather than send the output of a program to the screen (called “stdout”) or a file etc, send it to another program. In this case we redirect the output from the man program to the grep program (that the manual entry is about grep is irrelevant to your stupid computer) so that grep can look for the string “count”.

A quick aside: To ‘single quote’, “double quote”, or Not to quote, that is a good question! This will be covered a little later, but for now it is important to use single quotes whenever you want to specify a string of letter(s). We have already used one situation where this can cause problems, for example: grep > filename acts very differently to grep ‘>’ filename. As you will learn now.

Since we are talking about pipes now is a good time to talk about another type of redirection. The > symbol means redirect the output from a command to a file, which it will overwrite if it already exists (>> will append the output onto the end of the file). So

grep -c ‘>’ fasta_example1.fasta > tmp.txt

Will output nothing to the screen, rather it will put the number 15 in a file called tmp.txt. You’ll see in a moment how useful this can be when dealing with sequence files. So what do you think the grep > fasta_example1.fasta (without the single quotes around the > symbol) might do to your wonderful sequence file (Hint, it won’t end well, so if you want to try it use some throw away file)? Well time to get back to some biology!

Problem 2. I have this huge file of DNA sequences in FASTA format (>seq1\nATGTAA) and I want to split it in half (as best I can).

Solution 2. Lets combine what we have learnt with a couple of new commands to solve this problem. Firstly, lets just count the sequences to find out how many each file will have to include for the files to be split in two.

grep -c ‘>’ fasta_example1.fasta

———-output—————-

15

Well thats not an even number, so lets just put 7 in the first file and 8 in the second file. But first, two new commands, head and tail.

head fasta_example.fasta

—————-shortened output—————-

>fid|47273465|locus|VBIADEClo150191_2639| MIKISEDYIEKRFDNELLRIEAWGKNSLRIRSFVDQNFVDENYALNEKPKLNKDDITINKNEDGSAIIKNGKIKAVLDHRDRITFYNDKNEILLKEYIRLRAVKHDDGGEDVGTIEITKDFNSTLKLKSREY

tail fasta_example.fasta

—————-shortened output—————-

RSGLYEIHQRFRAYPGERIYGLGQHTHGRLDHKGLVLDLVQRNAEVSIPFYLSSRGYGFLWNNPAVGRVEFAENGTRWVADAAPGLDYWVTAGKTPAEILEHYADAVGHAPVLPEWATGFWQCKLRYLGQEELLDVAREYRRRELPLSVIVTDFFHWTAMGDYRFDPEEYPDPEAMMRELDELGVKLMVSIWPTISPLSENYDAMAEQGLLVGADQGVEFHQDIHDKKMPRSLPVAFYDP

Head by default outputs the first 10 lines of a file, whilst tail outputs the last 10 lines. You can change this default behaviour using the -n flag. Lets start by finding out what the 7th sequence is in the file.

grep ‘>’ fasta_example1.fasta | head -n 8

—————-output—————-

>fid|47273465|locus|VBIADEClo150191_2639|  >fid|47286894|locus|VBIADEClo157809_4509|   >fid|47397394|locus|VBIAERSal180824_4606|   >fid|20665099|locus|VBIAciCap40988_0842|   >fid|20666571|locus|VBIAciCap40988_1572|   >fid|20668069|locus|VBIAciCap40988_2310|  >fid|46816701|locus|VBIAciSp155132_0285|

>fid|46817921|locus|VBIAciSp155132_0889|

Normally the first part of this command just prints each line that contains the ‘>’ FASTA symbol, but here we piped the output from grep to head, using the -n flag to show only the first 8 lines. So what? Well remember that you file might have 1000s of sequences, finding out which one is the 501’st might be a little more tricky if you lived in a world without grep, head and tail. So now we know that “>fid|46817921|locus|VBIAciSp155132_0889|” is the first sequence that needs to go in our second file. Now once again after looking at the manual page for grep we see that the -n flag reports the line number of the pattern match.

grep -n ‘>fid|46817921|locus|VBIAciSp155132_0889|’ fasta_example1.fasta

—————-output—————-

102:>fid|46817921|locus|VBIAciSp155132_0889|

There are multiple ways of doing this. We could use another command wc -l filename which counts the number of lines in a file, then subtract 102 from this number and feed that to tail, basically to print out the last x lines. However, rather than introducing a new command (too late) lets look at the man page for tail and see if we can do what we want that way.

man tail

—————-output shortened—————-

-n, –lines=K

             output the last K lines, instead of the last 10; or use -n +K to

             output lines starting with the Kth

Great, so all we need to do is pass the line number we want tail to start from with the switch -n and the “+” symbol. Putting this all together.

head -n 101 fasta_example1.fasta > first_half.fasta
tail -n +102 fasta_example1.fasta > second_hald.fasta
grep '>' first_half.fasta

Lets quickly check to see if it worked.

>fid|47273465|locus|VBIADEClo150191_2639|  >fid|47286894|locus|VBIADEClo157809_4509|   >fid|47397394|locus|VBIAERSal180824_4606|   >fid|20665099|locus|VBIAciCap40988_0842|   >fid|20666571|locus|VBIAciCap40988_1572|   >fid|20668069|locus|VBIAciCap40988_2310|   >fid|46816701|locus|VBIAciSp155132_0285|

grep ‘>’ second_half.fasta

>fid|46817921|locus|VBIAciSp155132_0889|   >fid|38082798|locus|VBIAciSp155501_0265|   >fid|38086805|locus|VBIAciSp155501_2252|   >fid|38088330|locus|VBIAciSp155501_3002|  >fid|38088628|locus|VBIAciSp155501_3149|  >fid|38090581|locus|VBIAciSp155501_4107|  >fid|85865228|locus|VBIActMas228228_1098|  >fid|101184055|locus|VBIActMas238937_2668|

Perfect! 7 sequences in the first file, 8 in the second. As I said previously this will work just as well with 100000000 sequences as it did in this small example!

How to use DESeq to analyse RNAseq data

Featured

Several packages have been written to measure gene expression differences using RNAseq data. In this post I’m going to do a run through of the popular R based DESeq package. Basically, I’m just following along with the packages Vignette which is really well written and informative! For the test data I’m using 2 treatments, with 3 biological reps each containing 750K reads (yes its small).

A. Counting reads

DEseq uses a table of raw counts as the basis of DE calls. Any transformed counts, such as FPKM counts will not work with this package. The top row of the table can be a descriptor, these can be imported into R to act as column labels. Gene names should be column 1 (remember R counting starts from 1 rather than 0), each column after that should be a raw count for each biological replicate and for each treatment. Note that if technical reps are performed, these should be merged based on the sum of the two runs (they will be normalised by read count latter).

Here I used BWA to align the reads to my reference and generate SAM files that are then processed by HTseq count. Note that you have to run this on each sample, so in this current example I will generate 6 sam files, 1 for each treatment and condition. Also, see that I used the intersection-strict mode for HTseq.

>bwa aln sarcop_pep_rnaseqfiltered.fa Sample_ST72-1/ST72-1.subset.fastq > ST72-1.subset.sai

>bwa samse sarcop_pep_rnaseqfiltered.fa ST72-1.subset.sai Sample_ST72-1/ST72-1.subset.fastq > ST72-1.subset.sam

>htseq-count -m intersection-strict –stranded=no ST72-1.subset.sam sarcop_pep_rnaseqfiltered.gtf > ST72-1.subset.counts

In the above ST72-1.subset.counts is a table of counts that can be merged with the other count data before being imported into DEseq as a R data table. Here is a python script for merging, see the header of the script for usage, if that helps. For full detailed explanation of HTseq see (http://www-huber.embl.de/users/anders/HTSeq/doc/install.html#installation-on-linux) and this seems to be the default way of doing it (instructions :http://www-huber.embl.de/users/anders/HTSeq/doc/count.html).

B. R DESeq package 

Enter R environment from a terminal window open where your data is.

>R

>library(“DESeq”)

>countsTable <- read.delim(“merged_counts.txt”,header=TRUE)

>rownames(countsTable) <- countsTable$gene

>countsTable <- countsTable[,-1]

 This imports the DESeq library, reads the tab delineated file created by the python script into a data table. The next two lines create row names from the gene column created by the script, if you use some other method to make the counts you need to change the name of the gene column to whatever it is. Finally, we get rid of the column used to make the row names. My table looks like this

> head( countsTable)

                  UN72_1 UN72_2 UN72_3 ST72_1 ST72_2 ST72_3

GWD8ZKM01ESPR2          0      0      0      0      0      0

g10671t00001            2      1      1      0      0      0

g16789t00001            0      0      0      0      0      0

Now we need to add the treatment information so that DESeq knows what data each column represents, order is obviously important. 

>conds <- factor( c( “untreated”, “untreated”, “untreated”, “treated”,”treated” ,”treated”  ) )

> cds <- newCountDataSet( countsTable, conds )

This has created a instantiate a CountDataSet, which is the central data structure in the DESeq. The next step will adjust for different library sizes among the treatments, something called “size factors”. DESeq only needs relative library sizes, so if (on average) the counts for non-DE genes in one library are twice that of a second library the size factor should be twice as large.

 >cds <- estimateSizeFactors( cds )

 To see these size factors, do this:

 >sizeFactors( cds )

UN72_1    UN72_2    UN72_3    ST72_1    ST72_2    ST72_3

0.9805609 1.0420015 0.9164864 1.1083583 0.9599370 1.1462168

 Apparently, the way this works (http://www.biostars.org/p/12273/) is that you take the genometric mean of each condition for a gene and use this as its reference value. Then you divide the gene expression value for each conditions by its reference value to create a list of values (quotient ie result of the division) for each condition. The size factor is the median value of this list, as there is a list per condition it will generate a size factor value for each condition. The counts are normalised by dividing each column by its size factor, here lets look before and after:

 > head(counts(cds))

                  UN72_1 UN72_2 UN72_3 ST72_1 ST72_2 ST72_3

GWD8ZKM01ESPR2          0      0      0      0      0      0

g10671t00001            2      1      1      0      0      0

> head(counts(cds,normalized=TRUE))

                    UN72_1    UN72_2   UN72_3 ST72_1 ST72_2 ST72_3

GWD8ZKM01ESPR2     0.000000 0.0000000 0.000000      0      0      0

g10671t00001       2.039649 0.9596916 1.091124      0      0      0

g16789t00001       0.000000 0.0000000 0.000000      0      0      0

The next part is a key to how DESeq calculates significance in DE. The package uses the relationship between the data’s variance (or dispersion) and its mean. The inference in DESeq relies on an estimation of the typical relationship between the data’s variance and their mean, or, equivalently, between the data’s dispersion and their mean. The dispersion can be understood as the square of the coefficient of biological variation. The variance between counts is the sum of two factors, firstly, the level of variation between replicates of each treatment, and secondly an uncertainty measure based on the concentration of the counts. This latter factor takes into consideration the level of noise predicted for low expressed genes based on a poisson distribution (see the DEseq manual for more detail).

>cds <- estimateDispersions( cds ) 

Now we write a little function that will plot this.

>plotDispEsts <- function( cds ){

plot(rowMeans( counts( cds, normalized=TRUE ) ), fitInfo(cds)

$perGeneDispEsts, pch = ‘.’, log=”xy”, ylab=”dispersion”,

xlab=”mean of normalized counts”)

xg = 10^seq( -.5, 5, length.out=300 )

lines( xg, fitInfo(cds)$dispFun( xg ), col=”red” )

}

>plotDispEsts(cds)

> jpeg(“DispEsts_DESeq.jpg”)

> plotDispEsts(cds)

> dev.off()

The Above commands make a plotting function, the pass cds to that function and save the result in a jpg file.

The empirical (real) dispersion values (black dots) and fitted values (red line)

The empirical (real) dispersion values (black dots) and fitted values (red line)

Remember that the level of dispersion is related to the biological variation seen in each treatment. Note this is a double log graph, ahhh my brain hurts, but look carefully at what is happening on the Y axis, as read count increases dispersion decreases, as we would expect. However, a lot of those black dot values below the red fitted line are probably underestimates of dispersion based on the small samples sizes used (only 3 replicates values measured in this example). So, to be conservative DESeq moves all these values below the red line up to the fitted value, BUT keeps all the empirical values above the red line, even if some of these are over estimates of dispersion, as I said it is being conservative. Why does dispersion matter? The more dispersion, or biological variation, the bigger the difference between counts between treatments is required before it differences become significant!

Finally we are ready to call DE values!

>res = nbinomTest( cds, “untreated”, “treated” )

>head(res)

                 id  baseMean baseMeanA baseMeanB foldChange log2FoldChange

1     GWD8ZKM01ESPR2 0.0000000 0.0000000         0        NaN            NaN

2       g10671t00001 0.6817440 1.3634880         0          0           -Inf

3       g16789t00001 0.0000000 0.0000000         0        NaN            NaN

      pval padj

1        NA   NA

2 0.2083517    1

3        NA   NA

Now lets plot the log2 fold changes (untreated versus treated) against the mean normalised counts, colouring in red those genes that are significant at 10% FDR.

jpeg(“plotMA.jpg”)

> plotMA(res)

> dev.off()

plotMA

Note how lower lower expressed genes need a bigger log2 fold change value to be made red (not so obvious in this data due to low number of reads used in mapping).

Lets explore this a little more by comparing the normalized counts between all three replicates of just the untreated sample.

>ncu = counts( cds, normalized=TRUE )[ , conditions(cds)==”untreated” ]

>jpeg(“MA_untreated_only.jpg”)

>plotMA(data.frame(baseMean = rowMeans(ncu),log2FoldChange = log2( ncu[,2] / ncu[,1] )),

col = “black”)

>dev.off()

 MA_untreated_only

Wow, look at all the DE in a single treatment,  thus we should call a gene DE only if the change between treated and untreated samples is stronger than what we see between replicates, and hence, the dividing line between red and black in the first figure mimics that for the second.

And now a histogram of p values to look at their distribution.

hist(res$pval, breaks=100, col=”skyblue”, border=”slateblue”, main=””)

> jpeg(“hist_pval.jpg”)

> hist(res$pval, breaks=100, col=”skyblue”, border=”slateblue”, main=””)

> dev.off()

 hist_pval

The pileup of very low p-values are the DE genes, the pileups at the right hand side of the distribution are from very low count genes that are given discrete values.

Now lets create some subset tables based around FDR cut offs (in this case 10%)

>resSig = res[ res$padj < 0.1, ]

>head(resSig)

Before we make any mistakes, lets save the data

>write.csv( res, file=”72hour_unstgVstg_all.csv” )

>write.csv( resSig, file=”72hour_unstgVstg_only_0.01FDR.csv” )

We are done with the basic analysis, we can do some more sophisticated filtering of the data. The main goal is to remove tests that give little chance of producing a meaningful result, with the aim of keeping the FDR in check. This results in an increased power to detect significant results whilst keeping the same level of type I error the same (false positive rate).

>rs = rowSums ( counts ( cds))

>theta = 0.4

>use = (rs > quantile(rs, probs=theta))

>table(use)

table(use)

FALSE  TRUE

30145 36029

Basically here we are summing the counts across all conditions for each gene, then remove the genes with the lowest 40 % quantile (basically filtering out those that are false above from the original normalised cds table)

>cdsFilt = cds[ use, ]

Now just repeat the tests on the filtered data

>resFilt = nbinomTest( cdsFilt, “untreated”, “treated” )

Now lets look at the difference this makes to the FDR.

tabAll <-table(res$padj<0.1)

tabFilt <-table(resFilt$padj<0.1)

> tabFull

FALSE  TRUE

35266   763

> tabFilt

FALSE  TRUE

35266   763

In this case it didn’t seem to make much difference because most of the dropped values were NA! Just to visualise what is going on here lets plot the un-ajusted pvals of the ~40% quantile that we filtered.

>jpeg(“pval_bot40.jpg”)

> plot(rank(rs)/length(rs), -log10(res$pval), pch=16, cex=0.45)

> dev.off()

 pval_bot40

Note that very few p value are less than 0.003 (this corresponds to about 2.5 on the − log10 scale), and this is demonstrated when we plot the pvals of each dataset.

> h1 = hist(res$pval,breaks=50,plot=FALSE)

> h2 = hist(resFilt$pval,breaks=50,plot=FALSE)

>colori <- c(‘do not pass’=”khaki”, ‘pass’=”powderblue”)

> barplot(height = rbind(h1$counts, h2$counts), beside = FALSE, col = colori,

+ space = 0, main = “”, ylab=”frequency”)

> text(x = c(0, length(h1$counts)), y = 0, label = paste(c(0,1)), adj = c(0.5,1.7), xpd=NA)

> legend(“topright”, fill=rev(colori), legend=rev(names(colori)))

> dev.copy(png,’hist_filthist.png’)

> dev.off()

hist_filthist

Now we are going to process the data further, since some counts and be zero in some applications in makes sense to add psuedocounts (ie y = log2 (n + n0 ) where n0 is a small ‘chosen’ constant. The question is what is the value of the small number you add to the zero counts to make them non-zero (but keep them very small). DEseq uses a alternative approach.

>cdsBlind = estimateDispersions( cds, method=”blind” )

>vsd = varianceStabilizingTransformation( cdsBlind )

Now we will plot Per-gene standard deviation (taken across samples), against the rank of the mean, the standard deviation of the transformed data, across samples, against the mean, first using the shifted logarithm transformation (notAllZero is basically one of the numbers in the row is NOT ZERO), with 1 added to all values to remove zeros), then using DESeq’s variance stabilising transformation. While we are meant to see a constant standard deviation for the DESeq method along the whole dynamic range, with the shifted logarithm giving a highly elevated standard deviation in the lower count range for the other method, in fact they both look pretty similar for this small dataset.

>biocLite(“vsn”)

>library(“vsn”)

>par(mfrow=c(1,2))

>notAllZero = (rowSums(counts(cds))>0)

>meanSdPlot(log2(counts(cds)[notAllZero, ] + 1), ylim = c(0,2.5))

>meanSdPlot(vsd[notAllZero, ], ylim = c(0,2.5))

Now to deal with moderate counts that generate highly variable log fold change values and in some cases they become infinite. This will mess up any kind of clustering we may want to perform to identify expression common patterns. We can compare their ordinary log-ratios (lfc) against the moderated log-ratios (mod_lfc) using a scatterplot, with the genes bined by their log10 mean to colour them according to expression strength.

>mod_lfc = (rowMeans( exprs(vsd)[, conditions(cds)==”treated”, drop=FALSE] ) – rowMeans( exprs(vsd)[, conditions(cds)==”untreated”, drop=FALSE] ))

> lfc = res$log2FoldChange

> table(lfc[!is.finite(lfc)], useNA=”always”)

-Inf   Inf   NaN  <NA>

5085  4900 30145     0

Many of the latter are infinite (Inf,division of a finite value by 0) or not a number (NaN, resulting from division of 0 by 0).

> logdecade = 1 + round( log10( 1+rowMeans(counts(cdsBlind, normalized=TRUE)) ) )

> lfccol = colorRampPalette( c( “gray”, “blue” ) )(6)[logdecade]

> ymax = 4.5

> plot( pmax(-ymax, pmin(ymax, lfc)), mod_lfc,

xlab = “ordinary log-ratio”, ylab = “moderated log-ratio”,

cex=0.45, asp=1, col = lfccol,

pch = ifelse(lfc<(-ymax), 60, ifelse(lfc>ymax, 62, 16)))

> abline( a=0, b=1, col=”red3″)

Here grey to blue represents weakly to strongly expressed genes, for highly expressed genes there is general agreement (along diag) but for lowly expressed genes the moderated log-ratio is shrunk towards 0, compared to the ordinary log-ratio (the grey smear flattens out along the horizontal (0 value on the y-axis).

smoothing

Now to look at heat maps of counts to see any patterns using the transformed  log data.

>library(“gplots”)

>library(“RColorBrewer”)

> cdsBlind = estimateDispersions(cds,method=”blind”)

> vsd = varianceStabilizingTransformation(cdsBlind)

>select = order(rowMeans(counts(cds)), decreasing=TRUE)[1:30]

>hmcol = colorRampPalette(brewer.pal(9, “GnBu”))(100)

>heatmap.2(exprs(vsd)[select,], col = hmcol, trace=”none”, margin=c(10, 6))

 heat_map_transformed

and for a laugh the untransformed data.

 heat_map_untransformed

Now for a

>dists = dist( t( exprs(vsd) ) )

>mat = as.matrix( dists )

> rownames(mat) = colnames(mat) = with(pData(cdsBlind), paste(condition))

> heatmap.2(mat, trace=”none”, col = rev(hmcol), margin=c(13, 13))

dend

No shocks here again. Of note is that we used the method “blind”, thus the variance stabilizing transformation is not informed about the design to prevent bias in the outcome. In comparison, the pooled method (the default for estimating the dispersion) “means that one dispersion is computed for each gene, which is now an average over all cells (weighted by the number of samples for each cells), where the term cell denotes any of the four combinations of factor levels of the design.”

Now a good olde PCA, pretty cool really.

>print(plotPCA(vsd, intgroup=c(“condition”)))

 pca

This type of plot is useful for visualizing the overall effect of experimental covariates and batch effect, so you might expect known conditions to group, ie treatment and any time of library prep effects (paired end reads versus single end data). Good to see that even with this subset of data biggest component is treatment (pc1).

>sessionInfo()

R version 2.15.3 (2013-03-01)

Platform: x86_64-pc-linux-gnu (64-bit)

locale:

[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C

[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8

[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8

[7] LC_PAPER=C                 LC_NAME=C

[9] LC_ADDRESS=C               LC_TELEPHONE=C

[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:

[1] grid      stats     graphics  grDevices utils     datasets  methods

[8] base

 

other attached packages:

[1] arrayQualityMetrics_3.14.0 gplots_2.10.1

[3] KernSmooth_2.23-10         caTools_1.14

[5] gdata_2.12.0.2             gtools_2.7.1

[7] RColorBrewer_1.0-5         vsn_3.26.0

[9] BiocInstaller_1.8.3        DESeq_1.10.1

[11] lattice_0.20-15            locfit_1.5-9

[13] Biobase_2.18.0             BiocGenerics_0.4.0

 

loaded via a namespace (and not attached):

[1] affy_1.36.1           affyio_1.26.0         affyPLM_1.34.0

[4] annotate_1.36.0       AnnotationDbi_1.20.7  beadarray_2.8.1

[7] BeadDataPackR_1.10.0  Biostrings_2.26.3     bitops_1.0-5

[10] Cairo_1.5-2           cluster_1.14.4        colorspace_1.2-1

[13] DBI_0.2-5             gcrma_2.30.0          genefilter_1.40.0

[16] geneplotter_1.36.0    Hmisc_3.10-1          hwriter_1.3

[19] IRanges_1.16.6        latticeExtra_0.6-24   limma_3.14.4

[22] parallel_2.15.3       plyr_1.8              preprocessCore_1.20.0

[25] reshape2_1.2.2        RSQLite_0.11.2        setRNG_2011.11-2

[28] splines_2.15.3        stats4_2.15.3         stringr_0.6.2

[31] survival_2.37-4       SVGAnnotation_0.93-1  tcltk_2.15.3

[34] tools_2.15.3          XML_3.6-2             xtable_1.7-1

[37] zlibbioc_1.4.0

>

Illumina quality control

Several tools are available for investigating the quality of your sequencing data and these should be used before any downstream analysis, after all, crap in carp out.

The first tool, written in Java and called fastqc, has a simple GUI or can be used in the command line to generate a number of useful graphs describing several important features of your sequence data. These include (among others) per base quality and read length distributions, as well as detection of sequence duplication’s and kmer content. Handily the authors also provide some good and bad Illumina data that you can use to test out the software.

Running is easy, either use the GUI or just type: ./fastqc seq_file_name

The –help flag will provide the other options, format detection is automatic and worked for me so probably the only option you might consider is ‘-t’ to set the number of cores to use.

Image

As you can see by the html report that is produced by fastqc a summary is provided indicating if your data meet a certain quality expectations (these seem to be very strict).

Here I have run the good (left) and bad (right) test data-sets and shown them side by side for comparison. Edit: These data-sets come from the SolexaQA package that I’ll talk about in another blog.

fastqc_stats

These box whisker plots of quality per base show how illumina data degrades at the 3′ end of the read. Here the median value as red line, box (yellow) the inter-quantile range (35-75%), upper/lower whiskers represent 10% and 90% points, while the blue line is the mean quality. Good quality scores (green), reasonable (orange), and (poor) are indicated by background shading. A warning will be issued if any base is less than 10, or if the median for any base is less than 25.

fastQC_qualityscore_dist

These plots show the quality score distribution over all sequences. The fastQC manual says that even in good data a small proportion of sequences may have low quality scores, often because they were poorly imaged.

fastqc_sequencecontent

Nothing dramatic here, these are the sequence content of each nucleotide across the reads, the lines should be horizontal reflecting the underlying GC%.

fastQC_seq_dupliction

The sequence duplication level will reveal any problems during library preparation. Note that any duplicates that occur greater than 10 times are lumped into the 10+ category so if you see a pile-up there you have problems. Note that some duplication may occur if their is very high coverage.

fastqc_perbaseN

The per base N content should be less than 3%.

fastQC_GC_overseqdist

The GC content distribution of the data (red) against a theoretical modeled distribution (blue). A shifted GC distribution away from the modeled normal distribution indicates a systematic bias that is independent of base position, probably the result of contamination in the library or some kind of bias subset of sequences.

fastQC_GCcontent

The line should be horizontal in a random library, GC bias which changes in different bases can indicate over-represented sequence in your library.

FASTqcseq_len_dist

In most cases the sequence length distribution should be uninteresting (at least for illumina data), but if any kind of trimming has occurred this should be revealed here.

Developer mode and Ubuntu on a chromebook without duel booting

I really love my chromebook, at least when I get to wrestle it off the wife. But every now and then I miss the fact that I can’t jump into the terminal and start tapping away. But I don’t really want to get rid of chomeOS, its great the way it is and duel booting seems to defeat the purpose of having a quick start “always on” laptop.

So long story short, I wasn’t ready to go down the install ubuntu on your chromebook path. That was until I came across a great post on google+ (https://plus.google.com/u/0/112449749826562830126/posts/ZS9WaegrZYH) which describes a bunch of scripts called crouton that allow you to run ubuntu inside of chromeOS! Switching between the two systems is as easy as a key stroke, and it is instant.

Instructions for ARM based Samsung chromebooks (the $250 ones)

1. Backup everything! Well if you live in the cloud this is not a problem as google is taking care of this for you, but this process WIPE YOUR DRIVE (You can restore chromeOS if you like, no harm no foul, but any data on your SSD will be GONE).

Also, developer mode is less secure as Google is no longer watching your back, don’t be stupid with your new found power!

Identity crisis, Linux and ChromeOS side by side!

Identity crisis, Linux and ChromeOS side by side!

1. Power off and enter developer mode by tapping the power button while holding down ESC and REFRESH, this should bring up a scary “your OS is damaged” screen, now hit CTRL D, and following the on screen instructions press ENTER to switch to developer mode (this takes some time).

2. Once you are back to the login screen, type in your google credentials and restore you system, for me it seemed that everything was running a lot quicker than with the walled garden version, maybe I was dreaming.

3. Once you are up and running with your ChromeOS, open a browser and download crouton, use the file manager to check that it has downloaded to your download directory.

4. Now open a shell window by typing CTL ALT T (note, this won’t work if you use the VT2 terminal ie CTL ALT Forward!), at the chronos shell type “shell” to, well, enter the shell

5. Follow the instructions to setup a sudo password!

6. Now: “sudo sh -e ~/Downloads/crouton -t xfce”. Note this will install xfce desktop (a light weight desktop that comes with xubuntu, really, its all you need), see the crouton page for more information on other options (Unity, if your that way inclined\-:).

8. After the system downloads and installs, at the prompt type “sudo startxfce4” to enter your new linux environment!

So far this is great, to switch between stock chromeOS and xfce just use Ctrl+Alt+Shift+Back and Ctrl+Alt+Shift+Forward, how easy is that!

So what now, well apt-get works, why not install nano, or in my case gimp, since I needed to edit some of the photos in this post. Life is good(-:

Screenshot - 030213 - 19:22:01

Issues: yes! There is a little bit of instability, but I’ve found a quick switch between OS’s seems to calm things down.

Using PAML to detect pairwise dNdS ratios

Caveat: Many of the intricacies of Molecular evolution are fuzzy to me!

Intro: Due to the triplet code, changes in third codon position are mostly silent (synonymous), in that they do not result in a change to the encoded amino acid. In contrast, changes that occur in first and second codon positions nearly always result in an amino acid substitution (non-synonymous). Non-synonymous changes are rare as they mostly cause deleterious mutations that are removed from the population through purifying selection. However, in cases where positive selection is acting on a protein sequence, accumulation of changes that lead to the adaptive shift in protein function tend to result in an excess of Non-synonymous mutations. This means the ratio of non-silent to silent changes, as a proportion of each kind of site in the sequence, can be used as a measure of adaptive evolution. More specifically, a dNdS ratio of greater than 1 is used to indicate that a protein is/has been subject to positive selection. Here I’ll describe some of the important parameters when using arguably the most popular method for detecting selection based on dNdS in coding genes, that being the PAML package.

Preliminaries: The predominant program that we will use is called codeml, which (for the most part) takes an alignment of homologous coding regions. The most important aspect of the alignment is that the codon positions are correctly aligned. Codeml make calculations based on the nucleotide positions in the coding triplet, so the the open reading frame in the DNA needs to be maintained if the alignment contains indels or missing bases. The best way to guarantee this outcome is to constrain the DNA alignment based on a protein alignment (i.e. any gaps ‘-’ should be inserted as ‘—’ as not to disrupt reading frame) and in a sense any phylogeny based on coding DNA sequences should always be based such an alignment (evolution acts at the level of the protein after all). A tool for doing this is called transalign, but be warned that 1) your protein and DNA sequences need to be completely complementary (DNA must = protein, no partial or extra triplets etc) (2), they need to share the same names (3), they must be in the same order.

Pairwise dNdS analysis: Codeml reads a control file that contains the run parameters, typically this file is called codeml.ctl, the basic file for doing pairwise dNdS is below.

seqfile = seqfile.txt   * sequence data filename
outfile = results.txt   * main result file name
treefile = treefile.txt *tree file
noisy = 9      * 0,1,2,3,9: how much rubbish on the screen
verbose = 1      * 1:detailed output
runmode = -2     * -2:pairwise
seqtype = 1      * 1:codons
CodonFreq = 0      * 0:equal, 1:F1X4, 2:F3X4, 3:F61
model = 0      *
NSsites = 0      *
icode = 0      * 0:universal code
fix_kappa = 1      * 1:kappa fixed, 0:kappa to be estimated
kappa = 1      * initial or fixed kappa
fix_omega = 0      * 1:omega fixed, 0:omega to be estimated
omega = 0.5    * initial omega value

I’ve highlighted some important parameters in red, “seqfile” and “outfile” are self explanatory, except to say that the infile needs to be in a “typical” format, I use fasta because it is what I use(-;. The “runmode” is set to -2 to indicate that you are doing pairwise dNdS. “CodonFreq” and “kappa” incorporate parameters into the model that account for biological relevant aspects of protein evolution. CodonFreq controls the codon frequency parameter of the model, setting it to 0 implies that all codons are equally likely, 1 assigns the value based on the average nucleotide frequencies, 2 assigns the values based on the average frequencies of nucleotides in the codon, 3 allow the parameters to be estimated freely. All codons are generally not equally likely due to differing tRNA availability, ignoring these processes can lead to biases in the estimated dNdS. Kappa is the transition to transversion ratio. The chemical properties of DNA mutation create a bias toward transitions, and as transitions at third codon positions are more likely to be silent than transversions, ignoring this parameter nearly always results in an overestimation of dS and thus a correspondingly underestimation of dNdS. If the kappa value is set to 1, the transition bias is ignored, while setting it to 0 allows it to be estimated from the data. With these reference to these parameters, counting models such as NG86 (Nei and Gojobori (1986)) can be mirrored by setting K=1 and codonFreq = 0. Alternative settings for these values comes at a cost in that it increase the number of parameters that need to be estimated, however, their incorporation into the model makes for a more biologically relevant estimation of dNdS.

Final notes: It should be noted that these estimates of dNdS are not very sensitive to detecting positive selection. This is for two main reasons, firstly many amino acids are structurally important for the proteins core function and are thus correspondingly nearly invariably, this results in positive selection acting on a relative small number of codons. Secondly, selection tends to be episodic (occurs in spurts). This means it is difficult to detect positive selection by just taking the average dNdS across the entire protein from a single pairwise comparison. To overcome this methods have been developed that detect changes in dNdS across a phylogeny derived from multiple sequence alignment, and I’ll focus on these methods in the next addition!

I’ve been very slack with references for this post because I wonted to focus on the core considerations of using codeml, however, most of the above is based on reading the excelling coverage of this topic in: Maximum likelihood methods for detecting adaptive protein evolution

Bielawski, JP and Yang, Z (2005) Maximum likelihood methods for detecting adaptive protein evolution. In: Nielsen, R, (ed.) Statistical Methods in Molecular Evolution. (103 – 124). Springer-Verlag: New York.

A week with Samsung’s and Google’s new Chromebook.

In one line. I’m sold – it will only get better – shelving plans for an ultrabook First impressions. Firstly, its small, thin and light! Certainly smaller than I was expecting, think larger netbook rather than ultrabook/notebook. That said the keyboard is full size and the screen at 11.6 inches is perfectly reasonable for most tasks. This looks pretty nice, however, it does feel a little cheap due to its all plastic body. Does it feel like a $250 netbook? No way, I doubt the casual observer would think you aren’t using anything but the latest and greatest little ultrabook. So far the keyboard has been great, the keys respond well so touch typing is a breeze. The track pad is also above average based on my experience, in fact its the first laptop I have owned where I haven’t automatically gone for the USB mouse. It should be noted that the keyboard lacks a dedicated cap’s lock and delete keys, however, easy short-cuts are available in their place ([alt] [backspace]=delete, [alt][search]=caps). As for additional keys, the dedicated back and forward browser keys, as well as brightness and sound keys are a nice touch. All in all using this little machine is a pleasant experience, and I doubt the casual user will have much to complain about with regard to typing, using the touch pad, and the screen. I have noticed some negative reaction to the screen over the twittersphere regarding black white contrast, however, I certainly don’t have anything to complain about and the matte finish on this should work well for outside use. Battery life for me has been easily 6-7 hours with normal light internet use (its rated for 5-6 hours with heavier use). Overall, very happy with the look and feel.

ChromeOS. Well what can I say, its pretty much like using the chrome web browser. Startup times are ridiculously quick, pretty much instant for coming out of sleep and less than 10 seconds for a full reboot. It does have a desktop and a launch bar at the bottom of the screen, but most of the time you’ll find yourself working in tabs within the browser window. Perhaps to highlight this best, outside the typical chrome browser settings, there are only a small handful of settings that actually relate to ChromeOS itself (available by clicking the clock on the bottom right corner). Basically, what this means is that setup is nearly zero work, in fact after you provide your google details all your browser history and settings are automatically synced onto the new machine. Everything is backed up to the cloud too, so performing a factory reset is cost less, you are back up and running in a matter of minutes. As long as you trust google, this means no more backups!

Google apps and the cloud. This machine is designed to work well with google’s ecosystem and they give you 100GB free storage for two years, as I’m already a google user for my music, most docs, and email, obviously for me it just works. Importantly, offline access is available via the google drive app, as is email and many other apps in the offline category of the chrome webstore. For some reason I had problems getting offline Docs to work, in retrospect I probably didn’t allow enough time for the online files to sync to the machine (the main offline settings are available at the bottom of the cog options on your google drive homepage). Well anyway, now I can happily edit docs away from wifi and they will automatically synced with Drive once I am back online. Apart from google apps, I also

Offline drive setup can be found under the cog

installed a simple calculator app, angry birds (works well), the weatherbug app, a light image editing (sorry no links, I haven’t really tried these so can’t really recommend them, but their all in the popular downloads section of the web apps store). Coding is bit more of a challenge, there’s a nice little python shell app that is really only good for testing, perhaps a better long term solution will be with a cloud based IDE like cloud9IDE , and I’m sure more solution will be coming online as the popularity of cloud computing grows. Perhaps the best app I have discovered is Chrome Remote Desktop (CRD), this allows you to access all your computers through the browser using pin-codes. Despite the awesomeness of CRD, the big drawback for me was that there is no way of adding a Linux machine to your “computers”, so my plans to run my work computer as a code slave for my chromebook went up in smoke.

Chrome Remote Desktop is brilliant

Hopefully, future updates will add better Linux support for this awesome app (you can access a Linux computer but only through the “share” function that requires you or someone else to be sitting at the slave computer). That said, if you want to give granny this computer helping her out in real time via the CRD will save your life and sanity.

Issues. Besides the ones related to using google docs (please please integrate docs with google scholar as a citation manager), perhaps the biggest is this constant page reloading that seems to occur when you have “too many” tabs open (I’m talking 10ish). The reloading slows everything down and tends to cause docs and especially music (via google music) to hang for a frustrating number of seconds. Hopefully this “feature” will work better in future chromeOS updates.

Bottom line. This thing would be awesome for the road worrier or as a general use house laptop, especially if they were already part of the google docs ecosystem. With an external monitor it might even be perfect starter for your granny. With the SSD its pretty robust, fast to start up, has a small footprint make it a computer that I’ll happily drop into my backpack for that I need a browser now moment.

Arch linux on an eeepc 900

This is a summary of my attempt to install the Arch linux distro on my eeepc 900. I had some problems with the login manager and the desktop, but appart from that the base install was not that difficult if you follow the beginners guide. Arch is a hard core distro in that you have to set everything up yourself, but by doing that you get a better understanding of what is going on under the hood.

My little old eeepc 900 running on Arch

Download the Arch iso, I used an external CD-drive to install but alternatively a USB thumb drive could be used. To make it easier, initially I used gparted (http://gparted.sourceforge.net/) GUI to set the / to the 4gb SSD and /home to the 15gb SSD, I formatted both as ext4, which is not recommended because it will destroy your ssd, but apparently ext4 is SSD aware, I just need to work out how to turn journaling off.

Boot to the CD, chose “Boot Arch Linux”. This first section is almost word for word from the beginers guide, and that has a lot more detail, I have skipped the parts that are not relevant to the eeepc.

There was no wireless at startup, on the eeepc it is identified as “wlan0” and the interphase can be checked with, then bring it up and scan

#iwconfig
#ip link set wlan0 up
#iwlist wlan0 scan

Look for the ESSID:”yourwirelessnetworkname”

Backup the original file then modify and set it up for wpa encription.
# mv /etc/wpa_supplicant/wpa_supplicant.conf /etc/wpa_supplicant/wpa_supplicant.conf.original
# wpa_passphrase linksys "my_secret_passkey" > /etc/wpa_supplicant/wpa_supplicant.conf
#wpa_supplicant -B -Dwext -i wlan0 -c /etc/wpa_supplicant/wpa_supplicant.conf

Wait a few seconds for it to associate, now check and requiest an ip address, then ping google to check
#iwconfig wlan0
# dhcpcd wlan0
#ping -c 3 http://www.google.com

Now the hard drive
# fdisk -l

If you formatted your drives like I did (using gpartted) the output should indicate that you have sda1 which is your base and home is sdb1, since we used gparted I can skip all the ugly manual partitioning, so we just need to format them as ext4 (bad idea, see below).

# mkfs.ext4 /dev/sda1
# mkfs.ext4 /dev/sdb1

Remember that this is a bad idea, using ext2 would be safer, but I figure I can work out how to turn of journaling. Save your SSD, don’t make a swap partition!

Mount the base and the home and install the base and devel packages, then setup fstab
# mount /dev/sda1 /mnt
# mkdir /mnt/home && mount /dev/sdb1 /mnt/home
# pacstrap /mnt base base-devel
# genfstab -p /mnt >> /mnt/etc/fstab
#cat nano /mnt/etc/fstab

Modify the mount flags for drive to avoid unessary writes, more infomation at https://wiki.archlinux.org/index.php/Solid_State_Drives

Now install the graphics environment

pacman -S xorg-server xorg-xinit xorg-utils xorg-server-utils xterm

Now let’s enter the new system, start configuration by creating a hostname, or a name for your computer
#arch-chroot /mnt
#nano /etc/hostname

And add “daves-eeepc” or something like that, no quotes! Save [ctl][o] an exit [ctl][x], now modify the /etc/hosts and add the new name
#nano /etc/hosts
add the name as below
127.0.0.1   localhost.localdomain   localhost daves-eeepc
::1         localhost.localdomain   localhost daves-eeepc

Now configure the time, first find out the region and local time and settings
# ls /usr/share/zoneinfo/
America and then for the subdomain
# ls /usr/share/zoneinfo/America
I’m New_York
Now create a sim link to local time directory
# ln -s /usr/share/zoneinfo/America/New_York /etc/localtime
Now set up local settings and uncomment the en_US.UTF-8 UTF-8
setting.
# nano /etc/locale.gen
Then generate the file
# locale-gen

Setup system wide preferences by adding “LANG=en_US.UTF-8” to (dont inc quotes) the file shown using nano.
#nano /etc/locale.conf
Then export the settings and set the clock to utc
# export LANG=en_US.UTF-8
# hwclock --systohc --utc

Now set the network, check the /etc/rc.conf file make sure “network” is between the brackets, also add “net-auto-wireless” for our wireless.
DAEMONS=(syslog-ng network net-auto-wireless netfs crond)

Since we are wireless, we need to do a little work!
#exit
# pacstrap /mnt wireless_tools netcfg
# pacstrap /mnt wpa_supplicant wpa_actiond
# pacstrap /mnt zd1211-firmware
# arch-chroot /mnt

Set the interface in /etc/conf.d/netcfg to
WIRELESS_INTERFACE="wlan0"

Finally create the ramdisk environment, and install the grub boot loader on the root drive, and get rid of any error messages
# mkinitcpio -p linux
# pacman -S grub-bios
# grub-install --target=i386-pc --recheck /dev/sda
# mkdir -p /boot/grub/locale
# cp /usr/share/locale/en\@quot/LC_MESSAGES/grub.mo /boot/grub/locale/en.mo

Now set up grub and create a root password
# pacman -S os-prober
# grub-mkconfig -o /boot/grub/grub.cfg
# passwd

Nearly done with the base install, finally lets reboot

# exit
# umount /mnt/{boot,home,}
# reboot

Once we are back refresh pacman and lets add you as a user and install sudo

# pacman -Syy
# pacman -Syu
# adduser

Add the following, blank is just hit return to accept defaults, add contact details when asked if you like. If you make a mistake just delete and try again (# userdel -r [username])

Login name for new user []: dave
User ID ('UID') [ defaults to next available ]:
Initial group [ users ]:
Additional groups (comma separated) []: audio,games,lp,optical,power,scanner,storage,video
Home directory [ /home/dave ]:
Shell [ /bin/bash ]:
Expiry date (YYYY-MM-DD) []:

Now install and setup sudo, we need to edit the sudoer file with a special editor that has vi underneath, only use this editor as it has failsafes.

#pacman -S sudo
#visudo

Scroll through the file, find the “root  ALL=(ALL) ALL” and on the next line add yourself (without quotes) “dave   ALL=(ALL) ALL”
Get tab complete working

#pacman -S bash-completion

Now install the X-server for graphics

#pacman -S xorg-server xorg-xinit xorg-utils xorg-server-utils xterm
#pacman -S xf86-video-intel xf86-input-synaptics

This is where I ran into trouble following the online guides, we are going to install the XDM login manager and the LXDE desktop both are light and work well on the eeepc.

Firstly, login as a user then

#sudo pacman -S xorg-xdm
#cp /etc/skel/.xsession /etc/skel/.xinitrc ~

These are the launch scripts, xsession calls xinitrc, change the permissions on them
#chmod 744 ~/.xinitrc ~/.xsession

Now we need to increase the run level, in /ect/inittab edit the top and bottom lines by
removing the quotes:
…..
#id:3:initdefault:
…..
id:5:initdefault
…. Unhash (may already be) the line that refers to xdm which we will be using as our desktom manager
x:5:respans:/usr/sbin/xdm -nodeamon

Now install dbus

#sudo pacman -S dbus
#sudo nano /etc/rc.conf

add dbus to the array
DAEMONS=(syslog-ng dbus network net-auto-wireless netfs crond)

Now install lxde and configure open box

#pacman -S lxde
#mkdir -p ~/.config/openbox
#cp /etc/xdg/openbox/menu.xml ~/.config/openbox
#cp /etc/xdg/openbox/rc.xml /etc/xdg/openbox/autostart ~/.config/openbox
# pacman -S gamin
# pacman -S leafpad obconf epdfview

finally add a exec command to the .xinitrc you created during the XDM, add at the bottom (no quotes) “exec startlxde”

That should be it! Reboot and hopefully you will have a simple graphics login page and a functioning (if not light) desktop,

Sources
http://8thstring.blogspot.com/2011/11/arch-linux-on-eee-pc-900-chronicle.html
https://wiki.archlinux.org/index.php/Beginners%27_Guide
https://wiki.archlinux.org/index.php/XDM
https://wiki.archlinux.org/index.php/LXDE

Pandas, matplotlib and Ipython – all you need for great data anaylsis

I’ve been playing around a little with the stats-centric programming language R, mostly to get a better handle on the Bioconductor differential gene expression analysis packages edgeR and DEseq. The language is designed to allow for easy manipulation of tabular data as well as providing access to a rich library of statistical packages and graphing tools. One thing I learned was that often it was more efficient (at least for me) to spend a little time pre-formatting the data using python/perl before even getting started with R. The charting features of R are pretty cool, but once again I missed my familiar Python environment)-:

But, as always there is a Pythonic way, I came across a great book called Python for Data Analysis by Wes McKinney. The book introduced me to the pandas library, which contains R-like tools for tabular data manipulations and analyses. The book also introduced the Ipython development environment; basically a souped up feature rich but light weight “python IDLE”. The best features of Ipython for me are the logging capabilities, advanced history and de-bugging utilities – very cool! Ipython has been designed to work well with the matplotlib, thus allowing production and manipulation of nice looking graphs and charts within an interactive python environment. I’m still very much learning this (and I’m a hack programmer), but here is some fun data crunching based on a USDA food database wrangled into into Json format by Ashley Williams.

Ok, I think it would look better on one chart. Here is the top 50 protein containing foods with their fat content as part of one graph.

Note that I chopped the y-axis labels off, doh, no problem just modify them in real time using the slider editor!

Not too bad at all for a newbee. Hopefully by next post I will feel a little more comfortable to share some hard learned lessons on data manipulation using pandas.