How to use DESeq to analyse RNAseq data

Featured

Several packages have been written to measure gene expression differences using RNAseq data. In this post I’m going to do a run through of the popular R based DESeq package. Basically, I’m just following along with the packages Vignette which is really well written and informative! For the test data I’m using 2 treatments, with 3 biological reps each containing 750K reads (yes its small).

A. Counting reads

DEseq uses a table of raw counts as the basis of DE calls. Any transformed counts, such as FPKM counts will not work with this package. The top row of the table can be a descriptor, these can be imported into R to act as column labels. Gene names should be column 1 (remember R counting starts from 1 rather than 0), each column after that should be a raw count for each biological replicate and for each treatment. Note that if technical reps are performed, these should be merged based on the sum of the two runs (they will be normalised by read count latter).

Here I used BWA to align the reads to my reference and generate SAM files that are then processed by HTseq count. Note that you have to run this on each sample, so in this current example I will generate 6 sam files, 1 for each treatment and condition. Also, see that I used the intersection-strict mode for HTseq.

>bwa aln sarcop_pep_rnaseqfiltered.fa Sample_ST72-1/ST72-1.subset.fastq > ST72-1.subset.sai

>bwa samse sarcop_pep_rnaseqfiltered.fa ST72-1.subset.sai Sample_ST72-1/ST72-1.subset.fastq > ST72-1.subset.sam

>htseq-count -m intersection-strict –stranded=no ST72-1.subset.sam sarcop_pep_rnaseqfiltered.gtf > ST72-1.subset.counts

In the above ST72-1.subset.counts is a table of counts that can be merged with the other count data before being imported into DEseq as a R data table. Here is a python script for merging, see the header of the script for usage, if that helps. For full detailed explanation of HTseq see (http://www-huber.embl.de/users/anders/HTSeq/doc/install.html#installation-on-linux) and this seems to be the default way of doing it (instructions :http://www-huber.embl.de/users/anders/HTSeq/doc/count.html).

B. R DESeq package 

Enter R environment from a terminal window open where your data is.

>R

>library(“DESeq”)

>countsTable <- read.delim(“merged_counts.txt”,header=TRUE)

>rownames(countsTable) <- countsTable$gene

>countsTable <- countsTable[,-1]

 This imports the DESeq library, reads the tab delineated file created by the python script into a data table. The next two lines create row names from the gene column created by the script, if you use some other method to make the counts you need to change the name of the gene column to whatever it is. Finally, we get rid of the column used to make the row names. My table looks like this

> head( countsTable)

                  UN72_1 UN72_2 UN72_3 ST72_1 ST72_2 ST72_3

GWD8ZKM01ESPR2          0      0      0      0      0      0

g10671t00001            2      1      1      0      0      0

g16789t00001            0      0      0      0      0      0

Now we need to add the treatment information so that DESeq knows what data each column represents, order is obviously important. 

>conds <- factor( c( “untreated”, “untreated”, “untreated”, “treated”,”treated” ,”treated”  ) )

> cds <- newCountDataSet( countsTable, conds )

This has created a instantiate a CountDataSet, which is the central data structure in the DESeq. The next step will adjust for different library sizes among the treatments, something called “size factors”. DESeq only needs relative library sizes, so if (on average) the counts for non-DE genes in one library are twice that of a second library the size factor should be twice as large.

 >cds <- estimateSizeFactors( cds )

 To see these size factors, do this:

 >sizeFactors( cds )

UN72_1    UN72_2    UN72_3    ST72_1    ST72_2    ST72_3

0.9805609 1.0420015 0.9164864 1.1083583 0.9599370 1.1462168

 Apparently, the way this works (http://www.biostars.org/p/12273/) is that you take the genometric mean of each condition for a gene and use this as its reference value. Then you divide the gene expression value for each conditions by its reference value to create a list of values (quotient ie result of the division) for each condition. The size factor is the median value of this list, as there is a list per condition it will generate a size factor value for each condition. The counts are normalised by dividing each column by its size factor, here lets look before and after:

 > head(counts(cds))

                  UN72_1 UN72_2 UN72_3 ST72_1 ST72_2 ST72_3

GWD8ZKM01ESPR2          0      0      0      0      0      0

g10671t00001            2      1      1      0      0      0

> head(counts(cds,normalized=TRUE))

                    UN72_1    UN72_2   UN72_3 ST72_1 ST72_2 ST72_3

GWD8ZKM01ESPR2     0.000000 0.0000000 0.000000      0      0      0

g10671t00001       2.039649 0.9596916 1.091124      0      0      0

g16789t00001       0.000000 0.0000000 0.000000      0      0      0

The next part is a key to how DESeq calculates significance in DE. The package uses the relationship between the data’s variance (or dispersion) and its mean. The dispersion is the square of the coefficient of biological variation. The inference in DESeq relies on an estimation of the typical relationship between the data’s variance and their mean, or, equivalently, between the data’s dispersion and their mean. The dispersion can be understood as the square of the coefficient of biological variation. The variance between counts is the sum of two factors, firstly, the level of variation between replicates of each treatment, and secondly an uncertainty measure based on the concentration of the counts. This latter factor takes into consideration the level of noise predicted for low expressed genes based on a poisson distribution (see the DEseq manual for more detail).

>cds <- estimateDispersions( cds ) 

Now we write a little function that will plot this.

>plotDispEsts <- function( cds ){

plot(rowMeans( counts( cds, normalized=TRUE ) ), fitInfo(cds)

$perGeneDispEsts, pch = ‘.’, log=”xy”, ylab=”dispersion”,

xlab=”mean of normalized counts”)

xg = 10^seq( -.5, 5, length.out=300 )

lines( xg, fitInfo(cds)$dispFun( xg ), col=”red” )

}

>plotDispEsts(cds)

> jpeg(“DispEsts_DESeq.jpg”)

> plotDispEsts(cds)

> dev.off()

The Above commands make a plotting function, the pass cds to that function and save the result in a jpg file.

The empirical (real) dispersion values (black dots) and fitted values (red line)

The empirical (real) dispersion values (black dots) and fitted values (red line)

Remember that the level of dispersion is related to the biological variation seen in each treatment. Note this is a double log graph, ahhh my brain hurts, but look carefully at what is happening on the Y axis, as read count increases dispersion decreases, as we would expect. However, a lot of those black dot values below the red fitted line are probably underestimates of dispersion based on the small samples sizes used (only 3 replicates values measured in this example). So, to be conservative DESeq moves all these values below the red line up to the fitted value, BUT keeps all the empirical values above the red line, even if some of these are over estimates of dispersion, as I said it is being conservative. Why does dispersion matter? The more dispersion, or biological variation, the bigger the difference between counts between treatments is required before it differences become significant!

Finally we are ready to call DE values!

>res = nbinomTest( cds, “untreated”, “treated” )

>head(res)

                 id  baseMean baseMeanA baseMeanB foldChange log2FoldChange

1     GWD8ZKM01ESPR2 0.0000000 0.0000000         0        NaN            NaN

2       g10671t00001 0.6817440 1.3634880         0          0           -Inf

3       g16789t00001 0.0000000 0.0000000         0        NaN            NaN

      pval padj

1        NA   NA

2 0.2083517    1

3        NA   NA

Now lets plot the log2 fold changes (untreated versus treated) against the mean normalised counts, colouring in red those genes that are significant at 10% FDR.

jpeg(“plotMA.jpg”)

> plotMA(res)

> dev.off()

plotMA

Note how lower lower expressed genes need a bigger log2 fold change value to be made red (not so obvious in this data due to low number of reads used in mapping).

Lets explore this a little more by comparing the normalized counts between all three replicates of just the untreated sample.

>ncu = counts( cds, normalized=TRUE )[ , conditions(cds)=="untreated" ]

>jpeg(“MA_untreated_only.jpg”)

>plotMA(data.frame(baseMean = rowMeans(ncu),log2FoldChange = log2( ncu[,2] / ncu[,1] )),

col = “black”)

>dev.off()

 MA_untreated_only

Wow, look at all the DE in a single treatment,  thus we should call a gene DE only if the change between treated and untreated samples is stronger than what we see between replicates, and hence, the dividing line between red and black in the first figure mimics that for the second.

And now a histogram of p values to look at their distribution.

hist(res$pval, breaks=100, col=”skyblue”, border=”slateblue”, main=”")

> jpeg(“hist_pval.jpg”)

> hist(res$pval, breaks=100, col=”skyblue”, border=”slateblue”, main=”")

> dev.off()

 hist_pval

The pileup of very low p-values are the DE genes, the pileups at the right hand side of the distribution are from very low count genes that are given discrete values.

Now lets create some subset tables based around FDR cut offs (in this case 10%)

>resSig = res[ res$padj < 0.1, ]

>head(resSig)

Before we make any mistakes, lets save the data

>write.csv( res, file=”72hour_unstgVstg_all.csv” )

>write.csv( resSig, file=”72hour_unstgVstg_only_0.01FDR.csv” )

We are done with the basic analysis, we can do some more sophisticated filtering of the data. The main goal is to remove tests that give little chance of producing a meaningful result, with the aim of keeping the FDR in check. This results in an increased power to detect significant results whilst keeping the same level of type I error the same (false positive rate).

>rs = rowSums ( counts ( cds))

>theta = 0.4

>use = (rs > quantile(rs, probs=theta))

>table(use)

table(use)

FALSE  TRUE

30145 36029

Basically here we are summing the counts across all conditions for each gene, then remove the genes with the lowest 40 % quantile (basically filtering out those that are false above from the original normalised cds table)

>cdsFilt = cds[ use, ]

Now just repeat the tests on the filtered data

>resFilt = nbinomTest( cdsFilt, “untreated”, “treated” )

Now lets look at the difference this makes to the FDR.

tabAll <-Table(res$padj<0.1)

tabFilt <-Table(resFilt$padj<0.1)

> tabFull

FALSE  TRUE

35266   763

> tabFilt

FALSE  TRUE

35266   763

In this case it didn’t seem to make much difference because most of the dropped values were NA! Just to visualise what is going on here lets plot the un-ajusted pvals of the ~40% quantile that we filtered.

>jpeg(“pval_bot40.jpg”)

> plot(rank(rs)/length(rs), -log10(res$pval), pch=16, cex=0.45)

> dev.off()

 pval_bot40

Note that very few p value are less than 0.003 (this corresponds to about 2.5 on the − log10 scale), and this is demonstrated when we plot the pvals of each dataset.

> h1 = hist(res$pval,breaks=50,plot=FALSE)

> h2 = hist(resFilt$pval,breaks=50,plot=FALSE)

> barplot(height = rbind(h1$counts, h2$counts), beside = FALSE, col = colori,

+ space = 0, main = “”, ylab=”frequency”)

> text(x = c(0, length(h1$counts)), y = 0, label = paste(c(0,1)), adj = c(0.5,1.7), xpd=NA)

> legend(“topright”, fill=rev(colori), legend=rev(names(colori)))

> dev.copy(png,’hist_filthist.png’)

> dev.off()

hist_filthist

Now we are going to process the data further, since some counts and be zero in some applications in makes sense to add psuedocounts (ie y = log2 (n + n0 ) where n0 is a small ‘chosen’ constant. The question is what is the value of the small number you add to the zero counts to make them non-zero (but keep them very small). DEseq uses a alternative approach.

>cdsBlind = estimateDispersions( cds, method=”blind” )

>vsd = varianceStabilizingTransformation( cdsBlind )

Now we will plot Per-gene standard deviation (taken across samples), against the rank of the mean, the standard deviation of the transformed data, across samples, against the mean, first using the shifted logarithm transformation (notAllZero is basically one of the numbers in the row is NOT ZERO), with 1 added to all values to remove zeros), then using DESeq’s variance stabilising transformation. While we are meant to see a constant standard deviation for the DESeq method along the whole dynamic range, with the shifted logarithm giving a highly elevated standard deviation in the lower count range for the other method, in fact they both look pretty similar for this small dataset.

>biocLite(“vsn”)

>library(“vsn”)

>par(mfrow=c(1,2))

>notAllZero = (rowSums(counts(cds))>0)

>meanSdPlot(log2(counts(cds)[notAllZero, ] + 1), ylim = c(0,2.5))

>meanSdPlot(vsd[notAllZero, ], ylim = c(0,2.5))

Now to deal with moderate counts that generate highly variable log fold change values and in some cases they become infinite. This will mess up any kind of clustering we may want to perform to identify expression common patterns. We can compare their ordinary log-ratios (lfc) against the moderated log-ratios (mod_lfc) using a scatterplot, with the genes bined by their log10 mean to colour them according to expression strength.

>mod_lfc = (rowMeans( exprs(vsd)[, conditions(cds)=="treated", drop=FALSE] ) – rowMeans( exprs(vsd)[, conditions(cds)=="untreated", drop=FALSE] ))

> lfc = res$log2FoldChange

> table(lfc[!is.finite(lfc)], useNA=”always”)

-Inf   Inf   NaN  <NA>

5085  4900 30145     0

Many of the latter are infinite (Inf,division of a finite value by 0) or not a number (NaN, resulting from division of 0 by 0).

> logdecade = 1 + round( log10( 1+rowMeans(counts(cdsBlind, normalized=TRUE)) ) )

> lfccol = colorRampPalette( c( “gray”, “blue” ) )(6)[logdecade]

> ymax = 4.5

> plot( pmax(-ymax, pmin(ymax, lfc)), mod_lfc,

xlab = “ordinary log-ratio”, ylab = “moderated log-ratio”,

cex=0.45, asp=1, col = lfccol,

pch = ifelse(lfc<(-ymax), 60, ifelse(lfc>ymax, 62, 16)))

> abline( a=0, b=1, col=”red3″)

Here grey to blue represents weakly to strongly expressed genes, for highly expressed genes there is general agreement (along diag) but for lowly expressed genes the moderated log-ratio is shrunk towards 0, compared to the ordinary log-ratio (the grey smear flattens out along the horizontal (0 value on the y-axis).

smoothing

Now to look at heat maps of counts to see any patterns using the transformed  log data.

>library(“gplots”)

>library(“RColorBrewer”)

> cdsBlind = estimateDispersions(cds,method=”blind”)

> vsd = varianceStabilizingTransformation(cdsBlind)

>select = order(rowMeans(counts(cds)), decreasing=TRUE)[1:30]

>hmcol = colorRampPalette(brewer.pal(9, “GnBu”))(100)

>heatmap.2(exprs(vsd)[select,], col = hmcol, trace=”none”, margin=c(10, 6))

 heat_map_transformed

and for a laugh the untransformed data.

 heat_map_untransformed

Now for a

>dists = dist( t( exprs(vsd) ) )

>mat = as.matrix( dists )

> rownames(mat) = colnames(mat) = with(pData(cdsBlind), paste(condition))

> heatmap.2(mat, trace=”none”, col = rev(hmcol), margin=c(13, 13))

dend

No shocks here again. Of note is that we used the method “blind”, thus the variance stabilizing transformation is not informed about the design to prevent bias in the outcome. In comparison, the pooled method (the default for estimating the dispersion) “means that one dispersion is computed for each gene, which is now an average over all cells (weighted by the number of samples for each cells), where the term cell denotes any of the four combinations of factor levels of the design.”

Now a good olde PCA, pretty cool really.

>print(plotPCA(vsd, intgroup=c(“condition”)))

 pca

This type of plot is useful for visualizing the overall effect of experimental covariates and batch effect, so you might expect known conditions to group, ie treatment and any time of library prep effects (paired end reads versus single end data). Good to see that even with this subset of data biggest component is treatment (pc1).

>sessionInfo()

R version 2.15.3 (2013-03-01)

Platform: x86_64-pc-linux-gnu (64-bit)

locale:

[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C

[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8

[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8

[7] LC_PAPER=C                 LC_NAME=C

[9] LC_ADDRESS=C               LC_TELEPHONE=C

[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:

[1] grid      stats     graphics  grDevices utils     datasets  methods

[8] base

 

other attached packages:

[1] arrayQualityMetrics_3.14.0 gplots_2.10.1

[3] KernSmooth_2.23-10         caTools_1.14

[5] gdata_2.12.0.2             gtools_2.7.1

[7] RColorBrewer_1.0-5         vsn_3.26.0

[9] BiocInstaller_1.8.3        DESeq_1.10.1

[11] lattice_0.20-15            locfit_1.5-9

[13] Biobase_2.18.0             BiocGenerics_0.4.0

 

loaded via a namespace (and not attached):

[1] affy_1.36.1           affyio_1.26.0         affyPLM_1.34.0

[4] annotate_1.36.0       AnnotationDbi_1.20.7  beadarray_2.8.1

[7] BeadDataPackR_1.10.0  Biostrings_2.26.3     bitops_1.0-5

[10] Cairo_1.5-2           cluster_1.14.4        colorspace_1.2-1

[13] DBI_0.2-5             gcrma_2.30.0          genefilter_1.40.0

[16] geneplotter_1.36.0    Hmisc_3.10-1          hwriter_1.3

[19] IRanges_1.16.6        latticeExtra_0.6-24   limma_3.14.4

[22] parallel_2.15.3       plyr_1.8              preprocessCore_1.20.0

[25] reshape2_1.2.2        RSQLite_0.11.2        setRNG_2011.11-2

[28] splines_2.15.3        stats4_2.15.3         stringr_0.6.2

[31] survival_2.37-4       SVGAnnotation_0.93-1  tcltk_2.15.3

[34] tools_2.15.3          XML_3.6-2             xtable_1.7-1

[37] zlibbioc_1.4.0

>

Illumina quality control

Several tools are available for investigating the quality of your sequencing data and these should be used before any downstream analysis, after all, crap in carp out.

The first tool, written in Java and called fastqc, has a simple GUI or can be used in the command line to generate a number of useful graphs describing several important features of your sequence data. These include (among others) per base quality and read length distributions, as well as detection of sequence duplication’s and kmer content. Handily the authors also provide some good and bad Illumina data that you can use to test out the software.

Running is easy, either use the GUI or just type: ./fastqc seq_file_name

The –help flag will provide the other options, format detection is automatic and worked for me so probably the only option you might consider is ‘-t’ to set the number of cores to use.

Image

As you can see by the html report that is produced by fastqc a summary is provided indicating if your data meet a certain quality expectations (these seem to be very strict).

Here I have run the good (left) and bad (right) test data-sets and shown them side by side for comparison. Edit: These data-sets come from the SolexaQA package that I’ll talk about in another blog.

fastqc_stats

These box whisker plots of quality per base show how illumina data degrades at the 3′ end of the read. Here the median value as red line, box (yellow) the inter-quantile range (35-75%), upper/lower whiskers represent 10% and 90% points, while the blue line is the mean quality. Good quality scores (green), reasonable (orange), and (poor) are indicated by background shading. A warning will be issued if any base is less than 10, or if the median for any base is less than 25.

fastQC_qualityscore_dist

These plots show the quality score distribution over all sequences. The fastQC manual says that even in good data a small proportion of sequences may have low quality scores, often because they were poorly imaged.

fastqc_sequencecontent

Nothing dramatic here, these are the sequence content of each nucleotide across the reads, the lines should be horizontal reflecting the underlying GC%.

fastQC_seq_dupliction

The sequence duplication level will reveal any problems during library preparation. Note that any duplicates that occur greater than 10 times are lumped into the 10+ category so if you see a pile-up there you have problems. Note that some duplication may occur if their is very high coverage.

fastqc_perbaseN

The per base N content should be less than 3%.

fastQC_GC_overseqdist

The GC content distribution of the data (red) against a theoretical modeled distribution (blue). A shifted GC distribution away from the modeled normal distribution indicates a systematic bias that is independent of base position, probably the result of contamination in the library or some kind of bias subset of sequences.

fastQC_GCcontent

The line should be horizontal in a random library, GC bias which changes in different bases can indicate over-represented sequence in your library.

FASTqcseq_len_dist

In most cases the sequence length distribution should be uninteresting (at least for illumina data), but if any kind of trimming has occurred this should be revealed here.

Developer mode and Ubuntu on a chromebook without duel booting

I really love my chromebook, at least when I get to wrestle it off the wife. But every now and then I miss the fact that I can’t jump into the terminal and start tapping away. But I don’t really want to get rid of chomeOS, its great the way it is and duel booting seems to defeat the purpose of having a quick start “always on” laptop.

So long story short, I wasn’t ready to go down the install ubuntu on your chromebook path. That was until I came across a great post on google+ (https://plus.google.com/u/0/112449749826562830126/posts/ZS9WaegrZYH) which describes a bunch of scripts called crouton that allow you to run ubuntu inside of chromeOS! Switching between the two systems is as easy as a key stroke, and it is instant.

Instructions for ARM based Samsung chromebooks (the $250 ones)

1. Backup everything! Well if you live in the cloud this is not a problem as google is taking care of this for you, but this process WIPE YOUR DRIVE (You can restore chromeOS if you like, no harm no foul, but any data on your SSD will be GONE).

Also, developer mode is less secure as Google is no longer watching your back, don’t be stupid with your new found power!

Identity crisis, Linux and ChromeOS side by side!

Identity crisis, Linux and ChromeOS side by side!

1. Power off and enter developer mode by tapping the power button while holding down ESC and REFRESH, this should bring up a scary “your OS is damaged” screen, now hit CTRL D, and following the on screen instructions press ENTER to switch to developer mode (this takes some time).

2. Once you are back to the login screen, type in your google credentials and restore you system, for me it seemed that everything was running a lot quicker than with the walled garden version, maybe I was dreaming.

3. Once you are up and running with your ChromeOS, open a browser and download crouton, use the file manager to check that it has downloaded to your download directory.

4. Now open a shell window by typing CTL ALT T (note, this won’t work if you use the VT2 terminal ie CTL ALT Forward!), at the chronos shell type “shell” to, well, enter the shell

5. Follow the instructions to setup a sudo password!

6. Now: “sudo sh -e ~/Downloads/crouton -t xfce”. Note this will install xfce desktop (a light weight desktop that comes with xubuntu, really, its all you need), see the crouton page for more information on other options (Unity, if your that way inclined\-:).

8. After the system downloads and installs, at the prompt type “sudo startxfce4″ to enter your new linux environment!

So far this is great, to switch between stock chromeOS and xfce just use Ctrl+Alt+Shift+Back and Ctrl+Alt+Shift+Forward, how easy is that!

So what now, well apt-get works, why not install nano, or in my case gimp, since I needed to edit some of the photos in this post. Life is good(-:

Screenshot - 030213 - 19:22:01

Issues: yes! There is a little bit of instability, but I’ve found a quick switch between OS’s seems to calm things down.

Our move to NZ

Aside

Well for those that don’t know Sarah and I have left snowy Rochester and moved to Palmerston North (Palmy as the locals say) New Zealand for a faculty Job at Massey University. The Uni paid for our flights and two weeks accom here, so that was very nice of them. They wanted us to use a moving company, but since we had only a small amount of stuff we just shipped it via US postal service. Well 3 bags and 12 boxes later we are in NZ, everything made it bar some broken wine glasses. I’ve eaten about a meat pie a day!

Besides the typical crap associated with moving countries things have been going well. House hunting on foot was hard work, but we managed to find a nice old place in the north end of town. We also managed to [partially - couch, tv, bed, BBQ] furnish the house, in fact we found some good deals and even own a proper TV for the first time ever!  Here’s the house, it’s a big old thing, probably going to be cold as hell in winter, not sure about firing up the coal powered stove to heat the house! We put it all on our Australian credit cards, lets just say the exchange rate has been good to us(-:

The old coal stove

The old coal stove

The front of our new renatl in palmy

The front of our new rental in palmy

The back yard, notice the Hills hoist clothes line!

The back yard, notice the Hills hoist clothes line!

NZ has been great so far, the people are super friendly, even though they pick I’m an Aussie right away, I guess as long as they are winning the rugby (especially against us) they are a happy bunch. It’s funny getting ribbed all the time – shoe is on the other foot now after spending 5 years ribbing Yanks on their many backward ways(-; Although I’m not a big fan of rugby, they had this thing called super 7s on TV that is quite watchable, NZ v Australia this afternoon!

The fresh air and beautiful scenery have inspired Sarah and I to work hard on getting out and about, so we both rewarded ourselves for a great packing job by buying a couple of new bikes. The bikes are nothing great, but certainly not wall-mart things! I’ve also signed up for a fly fishing course as the river running through town is a good source of brown trout.

A new bike for a new  country

A new bike for a new country

My old bike stuck in Rochester snow

The old bike in the old country – stuck in Rochester snow

Wonderful weather in Rochester

Wonderful weather in Rochester – bbq and table compliments of previous owners – recycling (beer bottle for reference)!

Prawns on the barbi

Prawns on the barbie - not that great really, stupid tourism Australia propaganda

We got the new Barbie of NZ’s version of craigs list, only cost $100 with a partially full gas bottle, they were nice sellers and threw in some clothes draws, a bedside table, cutlery, chairs; as they were on their way to live in Melbourne – True recycling!

Land of the long white cloud

Land of the long white cloud

Water spins the other way!

Water spins the other way!

The local beer, made up on the hill, prenouced Twoee - "can I have a a handle (glass) of Twooee?"

The local beer, made up on the hill, pronounced ”two-eee” as in “can I have a handle (pint) of two-ee please”

Anyway, so far we have been really happy with the move. I will re-evaluate after I start my hefty teaching load at work!

10 things I will miss about America

  1. Friends!
  2. The optimism – the people believe in themselves and their country
  3. Science – Good funding, amazing universities, great resources, easy access to conferences and collaborates.
  4. Our house and the best house mate ever (Nirish)
  5. Kansas thunderstorms – in retrospect a cellar would have made this more fun
  6. Great beer at great prices – $8 a six pack of craft beer, awesome!
  7. Interstate – a good way to get around, even if you miss actually seeing any of the country
  8. Resources – natural and otherwise they were super abundant, truly the lucky country
  9. Buffalo wings and southern BBQ – yes its crap, but tasty crap.
  10. Internet – Google, amazon, Hullu, no download limits!

Finally 10 things I don’t miss about the USA

  1. Tipping and tax not included in prices – If its expected why make us do it each time, just add it to the bill or pay people properly, its insulting!
  2. Driving on the Right hand side – it was second nature by the end, but always felt “off”
  3. Religious nutters – Religion, turning decent  people stupid for millennia.
  4. The weather – Rochester’s weather sucked 330 days of the year (I like snow otherwise this would be a bigger number)
  5. People not understanding me – I’m speaking Fu!!en English!
  6. Crap food and the worst wine ever (sweet sweet sweet – yuck)
  7. dilapidated or missing infrastructure (schools, roads, side walks, bridges, public transport)
  8. Pennies – who uses 1 cents anymore (in NZ 10 cents is the smallest coin)?
  9. Broken government services – We used to say if we can get it done in 3 visits we are doing well (social security, DMV, even the University)
  10. Poverty – it was everywhere, not just the down and outs, but hard working people driven into the ground by low wages and poor social support (no health insurance)

Using PAML to detect pairwise dNdS ratios

Caveat: Many of the intricacies of Molecular evolution are fuzzy to me!

Intro: Due to the triplet code, changes in third codon position are mostly silent (synonymous), in that they do not result in a change to the encoded amino acid. In contrast, changes that occur in first and second codon positions nearly always result in an amino acid substitution (non-synonymous). Non-synonymous changes are rare as they mostly cause deleterious mutations that are removed from the population through purifying selection. However, in cases where positive selection is acting on a protein sequence, accumulation of changes that lead to the adaptive shift in protein function tend to result in an excess of Non-synonymous mutations. This means the ratio of non-silent to silent changes, as a proportion of each kind of site in the sequence, can be used as a measure of adaptive evolution. More specifically, a dNdS ratio of greater than 1 is used to indicate that a protein is/has been subject to positive selection. Here I’ll describe some of the important parameters when using arguably the most popular method for detecting selection based on dNdS in coding genes, that being the PAML package.

Preliminaries: The predominant program that we will use is called codeml, which (for the most part) takes an alignment of homologous coding regions. The most important aspect of the alignment is that the codon positions are correctly aligned. Codeml make calculations based on the nucleotide positions in the coding triplet, so the the open reading frame in the DNA needs to be maintained if the alignment contains indels or missing bases. The best way to guarantee this outcome is to constrain the DNA alignment based on a protein alignment (i.e. any gaps ‘-’ should be inserted as ‘—’ as not to disrupt reading frame) and in a sense any phylogeny based on coding DNA sequences should always be based such an alignment (evolution acts at the level of the protein after all). A tool for doing this is called transalign, but be warned that 1) your protein and DNA sequences need to be completely complementary (DNA must = protein, no partial or extra triplets etc) (2), they need to share the same names (3), they must be in the same order.

Pairwise dNdS analysis: Codeml reads a control file that contains the run parameters, typically this file is called codeml.ctl, the basic file for doing pairwise dNdS is below.

seqfile = seqfile.txt   * sequence data filename
outfile = results.txt   * main result file name
treefile = treefile.txt *tree file
noisy = 9      * 0,1,2,3,9: how much rubbish on the screen
verbose = 1      * 1:detailed output
runmode = -2     * -2:pairwise
seqtype = 1      * 1:codons
CodonFreq = 0      * 0:equal, 1:F1X4, 2:F3X4, 3:F61
model = 0      *
NSsites = 0      *
icode = 0      * 0:universal code
fix_kappa = 1      * 1:kappa fixed, 0:kappa to be estimated
kappa = 1      * initial or fixed kappa
fix_omega = 0      * 1:omega fixed, 0:omega to be estimated
omega = 0.5    * initial omega value

I’ve highlighted some important parameters in red, “seqfile” and “outfile” are self explanatory, except to say that the infile needs to be in a “typical” format, I use fasta because it is what I use(-;. The “runmode” is set to -2 to indicate that you are doing pairwise dNdS. “CodonFreq” and “kappa” incorporate parameters into the model that account for biological relevant aspects of protein evolution. CodonFreq controls the codon frequency parameter of the model, setting it to 0 implies that all codons are equally likely, 1 assigns the value based on the average nucleotide frequencies, 2 assigns the values based on the average frequencies of nucleotides in the codon, 3 allow the parameters to be estimated freely. All codons are generally not equally likely due to differing tRNA availability, ignoring these processes can lead to biases in the estimated dNdS. Kappa is the transition to transversion ratio. The chemical properties of DNA mutation create a bias toward transitions, and as transitions at third codon positions are more likely to be silent than transversions, ignoring this parameter nearly always results in an overestimation of dS and thus a correspondingly underestimation of dNdS. If the kappa value is set to 1, the transition bias is ignored, while setting it to 0 allows it to be estimated from the data. With these reference to these parameters, counting models such as NG86 (Nei and Gojobori (1986)) can be mirrored by setting K=1 and codonFreq = 0. Alternative settings for these values comes at a cost in that it increase the number of parameters that need to be estimated, however, their incorporation into the model makes for a more biologically relevant estimation of dNdS.

Final notes: It should be noted that these estimates of dNdS are not very sensitive to detecting positive selection. This is for two main reasons, firstly many amino acids are structurally important for the proteins core function and are thus correspondingly nearly invariably, this results in positive selection acting on a relative small number of codons. Secondly, selection tends to be episodic (occurs in spurts). This means it is difficult to detect positive selection by just taking the average dNdS across the entire protein from a single pairwise comparison. To overcome this methods have been developed that detect changes in dNdS across a phylogeny derived from multiple sequence alignment, and I’ll focus on these methods in the next addition!

I’ve been very slack with references for this post because I wonted to focus on the core considerations of using codeml, however, most of the above is based on reading the excelling coverage of this topic in: Maximum likelihood methods for detecting adaptive protein evolution

Bielawski, JP and Yang, Z (2005) Maximum likelihood methods for detecting adaptive protein evolution. In: Nielsen, R, (ed.) Statistical Methods in Molecular Evolution. (103 – 124). Springer-Verlag: New York.

A week with Samsung’s and Google’s new Chromebook.

In one line. I’m sold – it will only get better – shelving plans for an ultrabook First impressions. Firstly, its small, thin and light! Certainly smaller than I was expecting, think larger netbook rather than ultrabook/notebook. That said the keyboard is full size and the screen at 11.6 inches is perfectly reasonable for most tasks. This looks pretty nice, however, it does feel a little cheap due to its all plastic body. Does it feel like a $250 netbook? No way, I doubt the casual observer would think you aren’t using anything but the latest and greatest little ultrabook. So far the keyboard has been great, the keys respond well so touch typing is a breeze. The track pad is also above average based on my experience, in fact its the first laptop I have owned where I haven’t automatically gone for the USB mouse. It should be noted that the keyboard lacks a dedicated cap’s lock and delete keys, however, easy short-cuts are available in their place ([alt] [backspace]=delete, [alt][search]=caps). As for additional keys, the dedicated back and forward browser keys, as well as brightness and sound keys are a nice touch. All in all using this little machine is a pleasant experience, and I doubt the casual user will have much to complain about with regard to typing, using the touch pad, and the screen. I have noticed some negative reaction to the screen over the twittersphere regarding black white contrast, however, I certainly don’t have anything to complain about and the matte finish on this should work well for outside use. Battery life for me has been easily 6-7 hours with normal light internet use (its rated for 5-6 hours with heavier use). Overall, very happy with the look and feel.

ChromeOS. Well what can I say, its pretty much like using the chrome web browser. Startup times are ridiculously quick, pretty much instant for coming out of sleep and less than 10 seconds for a full reboot. It does have a desktop and a launch bar at the bottom of the screen, but most of the time you’ll find yourself working in tabs within the browser window. Perhaps to highlight this best, outside the typical chrome browser settings, there are only a small handful of settings that actually relate to ChromeOS itself (available by clicking the clock on the bottom right corner). Basically, what this means is that setup is nearly zero work, in fact after you provide your google details all your browser history and settings are automatically synced onto the new machine. Everything is backed up to the cloud too, so performing a factory reset is cost less, you are back up and running in a matter of minutes. As long as you trust google, this means no more backups!

Google apps and the cloud. This machine is designed to work well with google’s ecosystem and they give you 100GB free storage for two years, as I’m already a google user for my music, most docs, and email, obviously for me it just works. Importantly, offline access is available via the google drive app, as is email and many other apps in the offline category of the chrome webstore. For some reason I had problems getting offline Docs to work, in retrospect I probably didn’t allow enough time for the online files to sync to the machine (the main offline settings are available at the bottom of the cog options on your google drive homepage). Well anyway, now I can happily edit docs away from wifi and they will automatically synced with Drive once I am back online. Apart from google apps, I also

Offline drive setup can be found under the cog

installed a simple calculator app, angry birds (works well), the weatherbug app, a light image editing (sorry no links, I haven’t really tried these so can’t really recommend them, but their all in the popular downloads section of the web apps store). Coding is bit more of a challenge, there’s a nice little python shell app that is really only good for testing, perhaps a better long term solution will be with a cloud based IDE like cloud9IDE , and I’m sure more solution will be coming online as the popularity of cloud computing grows. Perhaps the best app I have discovered is Chrome Remote Desktop (CRD), this allows you to access all your computers through the browser using pin-codes. Despite the awesomeness of CRD, the big drawback for me was that there is no way of adding a Linux machine to your “computers”, so my plans to run my work computer as a code slave for my chromebook went up in smoke.

Chrome Remote Desktop is brilliant

Hopefully, future updates will add better Linux support for this awesome app (you can access a Linux computer but only through the “share” function that requires you or someone else to be sitting at the slave computer). That said, if you want to give granny this computer helping her out in real time via the CRD will save your life and sanity.

Issues. Besides the ones related to using google docs (please please integrate docs with google scholar as a citation manager), perhaps the biggest is this constant page reloading that seems to occur when you have “too many” tabs open (I’m talking 10ish). The reloading slows everything down and tends to cause docs and especially music (via google music) to hang for a frustrating number of seconds. Hopefully this “feature” will work better in future chromeOS updates.

Bottom line. This thing would be awesome for the road worrier or as a general use house laptop, especially if they were already part of the google docs ecosystem. With an external monitor it might even be perfect starter for your granny. With the SSD its pretty robust, fast to start up, has a small footprint make it a computer that I’ll happily drop into my backpack for that I need a browser now moment.

Arch linux on an eeepc 900

This is a summary of my attempt to install the Arch linux distro on my eeepc 900. I had some problems with the login manager and the desktop, but appart from that the base install was not that difficult if you follow the beginners guide. Arch is a hard core distro in that you have to set everything up yourself, but by doing that you get a better understanding of what is going on under the hood.

My little old eeepc 900 running on Arch

Download the Arch iso, I used an external CD-drive to install but alternatively a USB thumb drive could be used. To make it easier, initially I used gparted (http://gparted.sourceforge.net/) GUI to set the / to the 4gb SSD and /home to the 15gb SSD, I formatted both as ext4, which is not recommended because it will destroy your ssd, but apparently ext4 is SSD aware, I just need to work out how to turn journaling off.

Boot to the CD, chose “Boot Arch Linux”. This first section is almost word for word from the beginers guide, and that has a lot more detail, I have skipped the parts that are not relevant to the eeepc.

There was no wireless at startup, on the eeepc it is identified as “wlan0” and the interphase can be checked with, then bring it up and scan

#iwconfig
#ip link set wlan0 up
#iwlist wlan0 scan

Look for the ESSID:”yourwirelessnetworkname”

Backup the original file then modify and set it up for wpa encription.
# mv /etc/wpa_supplicant/wpa_supplicant.conf /etc/wpa_supplicant/wpa_supplicant.conf.original
# wpa_passphrase linksys "my_secret_passkey" > /etc/wpa_supplicant/wpa_supplicant.conf
#wpa_supplicant -B -Dwext -i wlan0 -c /etc/wpa_supplicant/wpa_supplicant.conf

Wait a few seconds for it to associate, now check and requiest an ip address, then ping google to check
#iwconfig wlan0
# dhcpcd wlan0
#ping -c 3 http://www.google.com

Now the hard drive
# fdisk -l

If you formatted your drives like I did (using gpartted) the output should indicate that you have sda1 which is your base and home is sdb1, since we used gparted I can skip all the ugly manual partitioning, so we just need to format them as ext4 (bad idea, see below).

# mkfs.ext4 /dev/sda1
# mkfs.ext4 /dev/sdb1

Remember that this is a bad idea, using ext2 would be safer, but I figure I can work out how to turn of journaling. Save your SSD, don’t make a swap partition!

Mount the base and the home and install the base and devel packages, then setup fstab
# mount /dev/sda1 /mnt
# mkdir /mnt/home && mount /dev/sdb1 /mnt/home
# pacstrap /mnt base base-devel
# genfstab -p /mnt >> /mnt/etc/fstab
#cat nano /mnt/etc/fstab

Modify the mount flags for drive to avoid unessary writes, more infomation at https://wiki.archlinux.org/index.php/Solid_State_Drives

Now install the graphics environment

pacman -S xorg-server xorg-xinit xorg-utils xorg-server-utils xterm

Now let’s enter the new system, start configuration by creating a hostname, or a name for your computer
#arch-chroot /mnt
#nano /etc/hostname

And add “daves-eeepc” or something like that, no quotes! Save [ctl][o] an exit [ctl][x], now modify the /etc/hosts and add the new name
#nano /etc/hosts
add the name as below
127.0.0.1   localhost.localdomain   localhost daves-eeepc
::1         localhost.localdomain   localhost daves-eeepc

Now configure the time, first find out the region and local time and settings
# ls /usr/share/zoneinfo/
America and then for the subdomain
# ls /usr/share/zoneinfo/America
I’m New_York
Now create a sim link to local time directory
# ln -s /usr/share/zoneinfo/America/New_York /etc/localtime
Now set up local settings and uncomment the en_US.UTF-8 UTF-8
setting.
# nano /etc/locale.gen
Then generate the file
# locale-gen

Setup system wide preferences by adding “LANG=en_US.UTF-8” to (dont inc quotes) the file shown using nano.
#nano /etc/locale.conf
Then export the settings and set the clock to utc
# export LANG=en_US.UTF-8
# hwclock --systohc --utc

Now set the network, check the /etc/rc.conf file make sure “network” is between the brackets, also add “net-auto-wireless” for our wireless.
DAEMONS=(syslog-ng network net-auto-wireless netfs crond)

Since we are wireless, we need to do a little work!
#exit
# pacstrap /mnt wireless_tools netcfg
# pacstrap /mnt wpa_supplicant wpa_actiond
# pacstrap /mnt zd1211-firmware
# arch-chroot /mnt

Set the interface in /etc/conf.d/netcfg to
WIRELESS_INTERFACE="wlan0"

Finally create the ramdisk environment, and install the grub boot loader on the root drive, and get rid of any error messages
# mkinitcpio -p linux
# pacman -S grub-bios
# grub-install --target=i386-pc --recheck /dev/sda
# mkdir -p /boot/grub/locale
# cp /usr/share/locale/en\@quot/LC_MESSAGES/grub.mo /boot/grub/locale/en.mo

Now set up grub and create a root password
# pacman -S os-prober
# grub-mkconfig -o /boot/grub/grub.cfg
# passwd

Nearly done with the base install, finally lets reboot

# exit
# umount /mnt/{boot,home,}
# reboot

Once we are back refresh pacman and lets add you as a user and install sudo

# pacman -Syy
# pacman -Syu
# adduser

Add the following, blank is just hit return to accept defaults, add contact details when asked if you like. If you make a mistake just delete and try again (# userdel -r [username])

Login name for new user []: dave
User ID ('UID') [ defaults to next available ]:
Initial group [ users ]:
Additional groups (comma separated) []: audio,games,lp,optical,power,scanner,storage,video
Home directory [ /home/dave ]:
Shell [ /bin/bash ]:
Expiry date (YYYY-MM-DD) []:

Now install and setup sudo, we need to edit the sudoer file with a special editor that has vi underneath, only use this editor as it has failsafes.

#pacman -S sudo
#visudo

Scroll through the file, find the “root  ALL=(ALL) ALL” and on the next line add yourself (without quotes) “dave   ALL=(ALL) ALL”
Get tab complete working

#pacman -S bash-completion

Now install the X-server for graphics

#pacman -S xorg-server xorg-xinit xorg-utils xorg-server-utils xterm
#pacman -S xf86-video-intel xf86-input-synaptics

This is where I ran into trouble following the online guides, we are going to install the XDM login manager and the LXDE desktop both are light and work well on the eeepc.

Firstly, login as a user then

#sudo pacman -S xorg-xdm
#cp /etc/skel/.xsession /etc/skel/.xinitrc ~

These are the launch scripts, xsession calls xinitrc, change the permissions on them
#chmod 744 ~/.xinitrc ~/.xsession

Now we need to increase the run level, in /ect/inittab edit the top and bottom lines by
removing the quotes:
…..
#id:3:initdefault:
…..
id:5:initdefault
…. Unhash (may already be) the line that refers to xdm which we will be using as our desktom manager
x:5:respans:/usr/sbin/xdm -nodeamon

Now install dbus

#sudo pacman -S dbus
#sudo nano /etc/rc.conf

add dbus to the array
DAEMONS=(syslog-ng dbus network net-auto-wireless netfs crond)

Now install lxde and configure open box

#pacman -S lxde
#mkdir -p ~/.config/openbox
#cp /etc/xdg/openbox/menu.xml ~/.config/openbox
#cp /etc/xdg/openbox/rc.xml /etc/xdg/openbox/autostart ~/.config/openbox
# pacman -S gamin
# pacman -S leafpad obconf epdfview

finally add a exec command to the .xinitrc you created during the XDM, add at the bottom (no quotes) “exec startlxde”

That should be it! Reboot and hopefully you will have a simple graphics login page and a functioning (if not light) desktop,

Sources
http://8thstring.blogspot.com/2011/11/arch-linux-on-eee-pc-900-chronicle.html
https://wiki.archlinux.org/index.php/Beginners%27_Guide
https://wiki.archlinux.org/index.php/XDM
https://wiki.archlinux.org/index.php/LXDE