How to use DESeq to analyse RNAseq data

Several packages have been written to measure gene expression differences using RNAseq data. In this post I’m going to run through the popular R-based DESeq package. Basically, I’m just following along with the package’s vignette, which is really well written and informative! For the test data I’m using 2 treatments with 3 biological reps each, each rep containing 750K reads (yes, it’s small).

A. Counting reads

DESeq uses a table of raw counts as the basis of DE calls. Transformed counts, such as FPKM values, will not work with this package. The top row of the table can be a header; these descriptors can be imported into R to act as column labels. Gene names should be column 1 (remember R counting starts from 1 rather than 0), and each column after that should be a raw count for each biological replicate of each treatment. Note that if technical reps are performed, these should be merged by summing the two runs (they will be normalised by read count later).
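For concreteness, the merged file DESeq expects looks something like this (tab-separated; the gene and sample names here are invented):

gene      untreated1  untreated2  treated1  treated2
gene0001  23          19          102       88
gene0002  0           1           0         2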

Here I used BWA to align the reads to my reference and generate SAM files that are then processed by htseq-count. Note that you have to run this on each sample, so in this example I will generate 6 SAM files, one for each replicate of each treatment. Also note that I used the intersection-strict mode for HTSeq.

>bwa aln sarcop_pep_rnaseqfiltered.fa Sample_ST72-1/ST72-1.subset.fastq > ST72-1.subset.sai

>bwa samse sarcop_pep_rnaseqfiltered.fa ST72-1.subset.sai Sample_ST72-1/ST72-1.subset.fastq > ST72-1.subset.sam

>htseq-count -m intersection-strict --stranded=no ST72-1.subset.sam sarcop_pep_rnaseqfiltered.gtf > ST72-1.subset.counts

In the above, ST72-1.subset.counts is a table of counts that can be merged with the other count data before being imported into DESeq as an R data table. Here is a python script for merging; see the header of the script for usage, if that helps. For a fully detailed explanation of HTSeq see http://www-huber.embl.de/users/anders/HTSeq/doc/install.html#installation-on-linux, and htseq-count seems to be the default way of doing the counting (instructions: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html).
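If you would rather do the merge in R instead of python, here is a minimal sketch (my addition, not from the vignette). It assumes each htseq-count output is a two-column tab-separated file of gene and count, that every file lists the same genes in the same order (true for a fixed GTF), and the file names below are invented:

files <- c("UN72_1.counts", "UN72_2.counts", "UN72_3.counts", "ST72_1.counts", "ST72_2.counts", "ST72_3.counts")
tabs <- lapply(files, read.delim, header=FALSE, col.names=c("gene", "count"))
countsTable <- data.frame(gene = tabs[[1]]$gene, sapply(tabs, function(x) x$count))
colnames(countsTable)[-1] <- sub(".counts", "", files, fixed=TRUE)
# htseq-count appends summary rows (no_feature, ambiguous, etc.; newer versions prefix them with "__"), drop them:
countsTable <- countsTable[ !grepl("^(__|no_feature|ambiguous|too_low_aQual|not_aligned|alignment_not_unique)", countsTable$gene), ]
write.table(countsTable, "merged_counts.txt", sep="\t", row.names=FALSE, quote=FALSE)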

B. R DESeq package 

Enter the R environment from a terminal window opened in the directory where your data is.

>R

>library("DESeq")

>countsTable <- read.delim("merged_counts.txt", header=TRUE)

>rownames(countsTable) <- countsTable$gene

>countsTable <- countsTable[,-1]

This imports the DESeq library and reads the tab-delimited file created by the python script into a data table. The next two lines create row names from the gene column created by the script; if you used some other method to make the counts you will need to change the name of the gene column to whatever it is. Finally, we get rid of the column used to make the row names. My table looks like this:

> head( countsTable)

                  UN72_1 UN72_2 UN72_3 ST72_1 ST72_2 ST72_3

GWD8ZKM01ESPR2          0      0      0      0      0      0

g10671t00001            2      1      1      0      0      0

g16789t00001            0      0      0      0      0      0

Now we need to add the treatment information so that DESeq knows which treatment each column represents; order is obviously important.

>conds <- factor( c( "untreated", "untreated", "untreated", "treated", "treated", "treated" ) )

> cds <- newCountDataSet( countsTable, conds )

This instantiates a CountDataSet, the central data structure in DESeq. The next step will adjust for different library sizes among the samples via something called “size factors”. DESeq only needs relative library sizes, so if (on average) the counts for non-DE genes in one library are twice those of a second library, its size factor should be twice as large.

 >cds <- estimateSizeFactors( cds )

 To see these size factors, do this:

 >sizeFactors( cds )

UN72_1    UN72_2    UN72_3    ST72_1    ST72_2    ST72_3

0.9805609 1.0420015 0.9164864 1.1083583 0.9599370 1.1462168

Apparently, the way this works (http://www.biostars.org/p/12273/) is that for each gene you take the geometric mean of its counts across all samples and use this as its reference value. Then, for each sample, you divide each gene’s count by that reference value, giving a list of ratios (quotients, i.e. results of the division) per sample. The size factor for a sample is the median of its list of ratios, so you get one size factor per sample. The counts are normalised by dividing each column by its size factor; let’s look before and after:

 > head(counts(cds))

                  UN72_1 UN72_2 UN72_3 ST72_1 ST72_2 ST72_3

GWD8ZKM01ESPR2          0      0      0      0      0      0

g10671t00001            2      1      1      0      0      0

> head(counts(cds,normalized=TRUE))

                    UN72_1    UN72_2   UN72_3 ST72_1 ST72_2 ST72_3

GWD8ZKM01ESPR2     0.000000 0.0000000 0.000000      0      0      0

g10671t00001       2.039649 0.9596916 1.091124      0      0      0

g16789t00001       0.000000 0.0000000 0.000000      0      0      0
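To demystify those size factors, here is a minimal sketch of the median-of-ratios calculation DESeq performs under the hood (my reconstruction, not code from the vignette; it assumes countsTable is the raw table loaded above):

logGeoMeans <- rowMeans(log(countsTable))   # log geometric mean of each gene across all samples
apply(countsTable, 2, function(cnts)        # one size factor per sample:
    exp(median((log(cnts) - logGeoMeans)[is.finite(logGeoMeans)])))

Genes with a zero count in any sample get a -Inf log geometric mean and are skipped by the is.finite() check; the result should closely match sizeFactors(cds).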

The next part is key to how DESeq calculates significance in DE. Inference in DESeq relies on an estimation of the typical relationship between the data’s variance and their mean, or, equivalently, between the data’s dispersion and their mean. The dispersion can be understood as the square of the coefficient of biological variation. The variance of the counts is the sum of two components: the variation between replicates of each treatment, and an uncertainty component based on the counts themselves. This latter component reflects the noise predicted for weakly expressed genes by a Poisson distribution (see the DESeq manual for more detail).

>cds <- estimateDispersions( cds ) 
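For intuition, the variance model DESeq fits for each gene is negative binomial: variance = mean + dispersion * mean^2, where the first term is the Poisson (shot noise) component and the second the biological variation. A toy calculation (made-up numbers, not values from this dataset):

mu <- 100          # mean normalised count for some hypothetical gene
alpha <- 0.1       # its dispersion, i.e. squared coefficient of biological variation
mu + alpha * mu^2  # NB variance = 1100; biological variation dominates at high counts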

Now we write a little function that will plot this.

>plotDispEsts <- function( cds ){
    plot(rowMeans( counts( cds, normalized=TRUE ) ),
        fitInfo(cds)$perGeneDispEsts,
        pch = '.', log="xy", ylab="dispersion",
        xlab="mean of normalized counts")
    xg = 10^seq( -.5, 5, length.out=300 )
    lines( xg, fitInfo(cds)$dispFun( xg ), col="red" )
}

>plotDispEsts(cds)

> jpeg(“DispEsts_DESeq.jpg”)

> plotDispEsts(cds)

> dev.off()

The above commands define a plotting function, pass cds to that function, and save the result to a JPEG file.

[Figure: the empirical (real) dispersion values (black dots) and fitted values (red line)]

Remember that the level of dispersion is related to the biological variation seen in each treatment. Note this is a double log graph (ahhh, my brain hurts), but look carefully at what is happening on the Y axis: as read count increases, dispersion decreases, as we would expect. However, a lot of those black dot values below the red fitted line are probably underestimates of dispersion, given the small sample sizes used (only 3 replicate values per treatment in this example). So, to be conservative, DESeq moves all the values below the red line up to the fitted value, BUT keeps all the empirical values above the red line, even if some of these are overestimates of dispersion; as I said, it is being conservative. Why does dispersion matter? The more dispersion, or biological variation, the bigger the difference in counts between treatments has to be before that difference becomes significant!

Finally we are ready to call DE values!

>res = nbinomTest( cds, “untreated”, “treated” )

>head(res)

                 id  baseMean baseMeanA baseMeanB foldChange log2FoldChange

1     GWD8ZKM01ESPR2 0.0000000 0.0000000         0        NaN            NaN

2       g10671t00001 0.6817440 1.3634880         0          0           -Inf

3       g16789t00001 0.0000000 0.0000000         0        NaN            NaN

      pval padj

1        NA   NA

2 0.2083517    1

3        NA   NA

Now let’s plot the log2 fold changes (untreated versus treated) against the mean normalised counts, colouring in red those genes that are significant at a 10% FDR.

> jpeg("plotMA.jpg")

> plotMA(res)

> dev.off()

[Figure: MA plot of log2 fold changes versus mean normalised counts, significant genes in red]

Note how lower-expressed genes need a bigger log2 fold change value to be coloured red (not so obvious in this data due to the low number of reads used in the mapping).

Let’s explore this a little more by comparing the normalised counts among the three replicates of just the untreated samples.

>ncu = counts( cds, normalized=TRUE )[ , conditions(cds)=="untreated" ]

>jpeg("MA_untreated_only.jpg")

>plotMA(data.frame(baseMean = rowMeans(ncu), log2FoldChange = log2( ncu[,2] / ncu[,1] )), col = "black")

>dev.off()

[Figure: MA plot of the untreated replicates only]

Wow, look at all the apparent DE within a single treatment! We should only call a gene DE if the change between treated and untreated samples is stronger than what we see between replicates; hence the dividing line between red and black in the first figure mimics the spread seen in the second.

And now a histogram of p values to look at their distribution.

> hist(res$pval, breaks=100, col="skyblue", border="slateblue", main="")

> jpeg("hist_pval.jpg")

> hist(res$pval, breaks=100, col="skyblue", border="slateblue", main="")

> dev.off()

[Figure: histogram of p-values]

The pileup of very low p-values are the DE genes; the pileups at the right-hand side of the distribution come from very low count genes that are given discrete p-values.

Now let’s create a subset table based on an FDR cut-off (in this case 10%).

>resSig = res[ res$padj < 0.1, ]

>head(resSig)
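One gotcha worth flagging (my addition, not from the vignette): in R, rows where padj is NA survive this subset as all-NA rows, so it can be safer to drop them explicitly:

resSig = res[ !is.na(res$padj) & res$padj < 0.1, ]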

Before we make any mistakes, let’s save the data:

>write.csv( res, file="72hour_unstgVstg_all.csv" )

>write.csv( resSig, file="72hour_unstgVstg_only_0.1FDR.csv" )

We are done with the basic analysis, but we can do some more sophisticated filtering of the data. The main goal is to remove tests that have little chance of producing a meaningful result, with the aim of keeping the FDR in check. This increases the power to detect significant results while keeping the type I error (false positive) rate the same.

>rs = rowSums ( counts ( cds))

>theta = 0.4

>use = (rs > quantile(rs, probs=theta))

>table(use)


FALSE  TRUE

30145 36029

Basically, here we are summing the counts across all samples for each gene, then removing the genes below the 40% quantile of those row sums (i.e. filtering out the genes flagged FALSE above from the original cds table).

>cdsFilt = cds[ use, ]

Now just repeat the tests on the filtered data

>resFilt = nbinomTest( cdsFilt, "untreated", "treated" )

Now lets look at the difference this makes to the FDR.

>tabAll <- table(res$padj < 0.1)

>tabFilt <- table(resFilt$padj < 0.1)

> tabAll

FALSE  TRUE

35266   763

> tabFilt

FALSE  TRUE

35266   763

In this case it didn’t seem to make much difference, because most of the dropped values were NA! Just to visualise what is going on here, let’s plot the unadjusted p-vals against the rank of the row sums (the lowest ~40% quantile is what we filtered).
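A quick way to check that claim (my addition) is to count how many tests never received an adjusted p-value in each table:

sum(is.na(res$padj))      # NA rows in the unfiltered results
sum(is.na(resFilt$padj))  # NA rows after filtering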

>jpeg("pval_bot40.jpg")

> plot(rank(rs)/length(rs), -log10(res$pval), pch=16, cex=0.45)

> dev.off()

[Figure: -log10 p-values against the rank of the row sums]

Note that very few p-values are less than 0.003 (this corresponds to about 2.5 on the -log10 scale), and this is demonstrated when we plot the p-vals of each dataset.

> h1 = hist(res$pval, breaks=50, plot=FALSE)

> h2 = hist(resFilt$pval, breaks=50, plot=FALSE)

> colori <- c('do not pass'="khaki", 'pass'="powderblue")

> barplot(height = rbind(h1$counts, h2$counts), beside = FALSE, col = colori,
+ space = 0, main = "", ylab="frequency")

> text(x = c(0, length(h1$counts)), y = 0, label = paste(c(0,1)), adj = c(0.5,1.7), xpd=NA)

> legend("topright", fill=rev(colori), legend=rev(names(colori)))

> dev.copy(png, 'hist_filthist.png')

> dev.off()

[Figure: stacked histogram of p-values, filtered versus unfiltered]

Now we are going to process the data further. Since some counts can be zero, in some applications it makes sense to add pseudocounts, i.e. y = log2(n + n0), where n0 is a small ‘chosen’ constant. The question is what value of n0 to add to the zero counts to make them non-zero (but keep them very small). DESeq uses an alternative approach: a variance stabilising transformation.

>cdsBlind = estimateDispersions( cds, method="blind" )

>vsd = varianceStabilizingTransformation( cdsBlind )

Now we will plot the per-gene standard deviation (taken across samples) against the rank of the mean, first using the shifted logarithm transformation (with 1 added to all values to remove zeros; notAllZero below just keeps rows where at least one count is non-zero), then using DESeq’s variance stabilising transformation. We are meant to see a roughly constant standard deviation along the whole dynamic range for the DESeq method, with the shifted logarithm giving a highly elevated standard deviation in the lower count range; in fact they both look pretty similar for this small dataset.

>biocLite("vsn")

>library("vsn")

>par(mfrow=c(1,2))

>notAllZero = (rowSums(counts(cds))>0)

>meanSdPlot(log2(counts(cds)[notAllZero, ] + 1), ylim = c(0,2.5))

>meanSdPlot(vsd[notAllZero, ], ylim = c(0,2.5))

Now to deal with moderate counts, which generate highly variable log fold change values that in some cases become infinite. This will mess up any kind of clustering we may want to perform to identify common expression patterns. We can compare the ordinary log-ratios (lfc) against the moderated log-ratios (mod_lfc) using a scatterplot, with the genes binned by their log10 mean count to colour them according to expression strength.

>mod_lfc = (rowMeans( exprs(vsd)[, conditions(cds)=="treated", drop=FALSE] ) - rowMeans( exprs(vsd)[, conditions(cds)=="untreated", drop=FALSE] ))

> lfc = res$log2FoldChange

> table(lfc[!is.finite(lfc)], useNA="always")

-Inf   Inf   NaN  <NA>

5085  4900 30145     0

Many of the latter are infinite (Inf, division of a finite value by 0) or not a number (NaN, resulting from division of 0 by 0).
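You can see where each of these comes from directly in R:

> log2( 5 / 0 )   # Inf: counts in one condition, zero in the other
[1] Inf
> log2( 0 / 5 )   # -Inf: the reverse
[1] -Inf
> log2( 0 / 0 )   # NaN: zero counts in both conditions
[1] NaN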

> logdecade = 1 + round( log10( 1 + rowMeans(counts(cdsBlind, normalized=TRUE)) ) )

> lfccol = colorRampPalette( c( "gray", "blue" ) )(6)[logdecade]

> ymax = 4.5

> plot( pmax(-ymax, pmin(ymax, lfc)), mod_lfc,
+ xlab = "ordinary log-ratio", ylab = "moderated log-ratio",
+ cex=0.45, asp=1, col = lfccol,
+ pch = ifelse(lfc < (-ymax), 60, ifelse(lfc > ymax, 62, 16)))

> abline( a=0, b=1, col="red3")

Here grey to blue represents weakly to strongly expressed genes. For highly expressed genes there is general agreement (points along the diagonal), but for lowly expressed genes the moderated log-ratio is shrunk towards 0 compared to the ordinary log-ratio (the grey smear flattens out along the horizontal, i.e. 0 on the y-axis).

[Figure: ordinary versus moderated log-ratios]

Now to look at heat maps of the counts to see any patterns, using the transformed log data.

>library("gplots")

>library("RColorBrewer")

> cdsBlind = estimateDispersions( cds, method="blind" )

> vsd = varianceStabilizingTransformation( cdsBlind )

>select = order(rowMeans(counts(cds)), decreasing=TRUE)[1:30]

>hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100)

>heatmap.2(exprs(vsd)[select,], col = hmcol, trace="none", margin=c(10, 6))

[Figure: heat map of the 30 most highly expressed genes, transformed data]

and for a laugh the untransformed data.

[Figure: heat map of the same genes, untransformed data]

Now for a heat map of the sample-to-sample distances, to see how the samples cluster.

>dists = dist( t( exprs(vsd) ) )

>mat = as.matrix( dists )

> rownames(mat) = colnames(mat) = with(pData(cdsBlind), paste(condition))

> heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13))

[Figure: heat map and dendrogram of sample-to-sample distances]

No shocks here again. Of note is that we used the method “blind”, so the variance stabilising transformation is not informed about the design, which prevents it biasing the outcome. In comparison, the pooled method (the default for estimating the dispersion) “means that one dispersion is computed for each gene, which is now an average over all cells (weighted by the number of samples for each cell), where the term cell denotes any of the four combinations of factor levels of the design.”
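For comparison, the design-aware estimate is just the default call (“pooled” is the default method when you have replicates, so naming it explicitly is optional):

cdsPooled = estimateDispersions( cds, method="pooled" )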

Now a good olde PCA, pretty cool really.

>print(plotPCA(vsd, intgroup=c("condition")))

[Figure: PCA of the samples]

This type of plot is useful for visualising the overall effect of experimental covariates and batch effects, so you might expect samples to group by known conditions, i.e. treatment, and by any type of library prep effect (paired-end versus single-end data). Good to see that even with this subset of the data the biggest component (PC1) is treatment.
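A side note (my addition): plotPCA only uses the most variable genes by default; if I remember the DESeq signature correctly you can control this with the ntop argument (500 is the default, so this call should be equivalent to the one above):

print(plotPCA(vsd, intgroup=c("condition"), ntop=500))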

>sessionInfo()

R version 2.15.3 (2013-03-01)

Platform: x86_64-pc-linux-gnu (64-bit)

locale:

[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C

[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8

[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8

[7] LC_PAPER=C                 LC_NAME=C

[9] LC_ADDRESS=C               LC_TELEPHONE=C

[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:

[1] grid      stats     graphics  grDevices utils     datasets  methods

[8] base

 

other attached packages:

[1] arrayQualityMetrics_3.14.0 gplots_2.10.1

[3] KernSmooth_2.23-10         caTools_1.14

[5] gdata_2.12.0.2             gtools_2.7.1

[7] RColorBrewer_1.0-5         vsn_3.26.0

[9] BiocInstaller_1.8.3        DESeq_1.10.1

[11] lattice_0.20-15            locfit_1.5-9

[13] Biobase_2.18.0             BiocGenerics_0.4.0

 

loaded via a namespace (and not attached):

[1] affy_1.36.1           affyio_1.26.0         affyPLM_1.34.0

[4] annotate_1.36.0       AnnotationDbi_1.20.7  beadarray_2.8.1

[7] BeadDataPackR_1.10.0  Biostrings_2.26.3     bitops_1.0-5

[10] Cairo_1.5-2           cluster_1.14.4        colorspace_1.2-1

[13] DBI_0.2-5             gcrma_2.30.0          genefilter_1.40.0

[16] geneplotter_1.36.0    Hmisc_3.10-1          hwriter_1.3

[19] IRanges_1.16.6        latticeExtra_0.6-24   limma_3.14.4

[22] parallel_2.15.3       plyr_1.8              preprocessCore_1.20.0

[25] reshape2_1.2.2        RSQLite_0.11.2        setRNG_2011.11-2

[28] splines_2.15.3        stats4_2.15.3         stringr_0.6.2

[31] survival_2.37-4       SVGAnnotation_0.93-1  tcltk_2.15.3

[34] tools_2.15.3          XML_3.6-2             xtable_1.7-1

[37] zlibbioc_1.4.0

>


70 thoughts on “How to use DESeq to analyse RNAseq data”

  1. Thank you for the tutorial.
    It is a great walk through tutorial for a beginner like me.
    Would you explain about the ‘htseq-count’ result files?
    In the first column (ID), some rows have ‘CUFF.XX’ and some others have ‘NM_XXXXXX’.
    What’s the difference between them, and how can I convert the IDs to RefSeq or something like it?

    Thanks!

    ps. Sorry for my poor English!!!

    • Hi there. I’m not really sure I understand your question. The ID column should just be the gene identifier; these come from the GFF file you input into htseq-count. The format of the IDs depends on who made the GFF file and how the genes were derived, but NM_* is the RefSeq accession format for mRNA sequences (see http://www.ncbi.nlm.nih.gov/projects/RefSeq/key.html). The CUFF.* IDs probably come from Cufflinks gene annotations based on read mappings, but I’m guessing there.

      So yeah, take a look at the contents of the GFF file, that will have details about the genes (exon/introns etc), htseq-count uses this information to assign counts to specific genes based on the sequence coordinates of the genes that is recorded in the gff file.

      Sorry if I missed your question. Cheers
      Dave

      • Sorry for my vague question…
        I meant I had three different ID’s in the first column.
        In the first ~40,000 rows, they have CUFF.***,
        and in the next ~6,000 rows they have NM_****** followed by NR_******.

        I know how to find gene names for NM or NR.
        How can I, then, find gene names for CUFF.****?

      • I don’t really know. I’m guessing they are Cufflinks (http://cufflinks.cbcb.umd.edu/) annotations based on RNAseq data? If that is the case then they probably won’t have any online IDs like NM or NR because they have not been submitted to NCBI (I’m guessing). I think you’ll have to find out from the place you got the GFF file.

        Sorry, that’s not much help!

        Cheers


  2. Hi,

    Just started reading through the information. This is going to be very useful, thanks very much. I just noticed however, for those who do not code in Python, line 30 in the merge_tables.py script (provided above) must be changed to “sys.exit()” instead of “sys.quit()”.

    Thanks,

    Jom

  3. Hi Dave,

    Glad to find an article that doesn’t skip on the details! I was wondering about one issue though.

    The RNA fragments may be of average size, say, 200bp but we may only sequence 50bp. Shouldn’t the reads be shifted (up or down, based on strand) before we count it in region of interest?

    • Hi ausinha, the RNA concentration is what we are trying to measure. We sequence 50 bp of cDNA made from that RNA. We align the 50bp read back to our genome and count how many reads map to each loci (gene). We can’t move the read, we align the reads based on DNA homology and let them tell us which gene is being expressed.

  4. Great post. I was checking constantly this blog and I am impressed!
    Very useful information particularly the last part 🙂 I care
    for such information much. I was looking for this particular info for a long
    time. Thank you and good luck.

  5. hi
    nice post..

    could anyone please let me know how to extract gene name list from a general DESeq output file and how to go for next step that goseq or DAVID??

    Any suggestion would be appreciated…Thanks.

    • The easy way would be to just output the table to csv format from R and open it in Excel, then copy the first column. With DAVID etc. I just sort by padj, copy all the IDs and use that as the background (DAVID should recognise them if you work with a model species), then copy the IDs that meet my cut-off (0.1 FDR or something) and use them as my gene list. If you’re working with an uncharacterised genome you’re probably going to have to use BLAST or something like that to assign gene identifiers that will be in the DAVID database. My two cents is that you might as well just BLAST against mouse, human, rat, Xenopus, etc., i.e. the main model species, because that is where the functional information comes from; for this analysis you gain nothing from having a hit to a ‘conserved uncharacterized protein’.

  6. Hello, thank you very much for this post.

    I have a question related “varianceStabilizingTransformation” and differentially expressed genes.

    My experimental set up is straightforward: It is a treated and untreated comparison with triplicates for each condition (six samples in total). I have done the DESeq analysis to the point of generating a table containing means, log2 transformations and p-values (the result of nbinomTest). I would now like to select for the most significantly expressed genes by padj (say the top 100) and plot them in a heat map. I am not so sound on the statistics behind this, but from reading this I gather I should do some sort of variance transformation instead of just plotting the mean count value per condition on a heat map. However when I do this I notice that there is much less differential expression between my “top-hit” genes, so much that the heatmap is all one color. Am I going about this the wrong way?

    Thank you.

    • Hi Carly, sorry for a slow reply. The first thing to keep in mind is that some of the biggest fold changes can come from low-count genes (so the most significant might be all the same colour), so it would be worth looking carefully at your raw data and examining the relationship between counts, p-values and fold change in your top 100 genes.

      There’s a good thread here that talks about this, and the take-home message at the end is that this has been improved in DESeq2, so it might be worth trying that out if you get nowhere (http://comments.gmane.org/gmane.science.biology.informatics.conductor/47986)

  7. Can I say how awesome your tutorial is?! One quick question, my experience of using DESeq is generally OK, except it is too stringent. Namely, I obtained only 300 genes with FDR less than 5% for a control vs. disease RNA-Seq experiment. Is there a way to relax the statistics in DESeq? Thanks!

    • “relax the statistics” – Fisher is turning over in his grave! (-:

      300 is pretty good, at least for my work on insects. The simplest thing to do is just relax the 5% FDR cut-off, make it 10%, that seems to be the ‘norm’.

      This line controls that (the 0.1 is 10%):
      resSig = res[ res$padj < 0.1, ]

  8. Hi Dave,

    Thanks for this really useful post. I have some extra questions that maybe you can help me to answer.

    I’m trying to identify the significant up-regulated and down-regulated genes in 6 different moments of the life cycle of “my” animal.

    I have 11 libraries (2 replicates/stage, except for one stage that i couldn’t sequence a replicate). What i’m doing is to compare one stage with the stage immediately after using DESeq.

    This is my script:

    [CODE]#Load data
    Cele_SPvsEB = read.csv (file.choose(), header=TRUE, row.names=1)
    CeleDesign <- data.frame(
    row.names = colnames(Cele_SPvsEB),
    condition = c("SP", "SP", "EB", "EB"))

    #to load conditions w/out pData file: conditions
    conds = factor(c("SP", "SP", "EB", "EB"))

    #Make cds file
    cds <-newCountDataSet(Cele_SPvsEB, conds )

    #Est size factor = normalize for library size
    cds<-estimateSizeFactors(cds)

    #estimate dispersion from the expected
    cds <- estimateDispersions(cds, method=c("pooled"), fitType=c("local"))
    str(fitInfo(cds))

    #plot dispersion
    plotDispEsts <- function( cds )
    {
    plot(
    rowMeans( counts( cds, normalized=TRUE ) ),
    fitInfo(cds)$perGeneDispEsts,
    pch = '.', log="xy", ylab="dispersion", xlab="mean of normalized counts" )
    xg <- 10^seq( -.5, 5, length.out=300 )
    lines( xg, fitInfo(cds)$dispFun( xg ), col="red" )
    }
    plotDispEsts( cds )

    #call differential expression
    res <- nbinomTest( cds, "SP", "EB" )

    #there is no expression for a gene accross all samples… need to remove those genes
    whichIsNA<-which(is.na(res[,8]))
    res<-res[-whichIsNA,]

    #plot volcano
    plotDE <- function( res )
    plot(
    res$baseMean,
    res$log2FoldChange,
    log="x", pch=20, cex=.3,
    col = ifelse( res$pval < .05, "red", "black" ) )

    plotDE( res )

    #filter for upregulated and downregulated genes

    resSig <- res[ res$pval < .05, ]
    up 2)
    down <- subset(resSig, foldChange < 2)

    #list most sig genes
    head( resSig[ order(res$pval), ] )
    [/CODE]
    By using this script i found around 600 genes in both up-regulated and down-regulated categories for this example (i.e. SP vs EB).

    But, if i change my filtering to:

    [CODE]resSig <- res[ res$padj < .1, ]
    up 2)
    down <- subset(resSig, foldChange < 2)[/CODE]

    Then i have 0 genes up or down-regulated.

    If i just keep going only filtering by pval < 0.05, is that enough to consider my genes significantly differentially expressed?

    Do i have any error in my script?

    Any suggestions?

    Thanks!!!

    P.D.: If you are interested i can send you my graphs and one of my .csv files.

    • Just looked quickly, is the “up 2)” in
      resSig <- res[ res$pval < .05, ]
      up 2)
      down <- subset(resSig, foldChange < 2)

      a typo? Also, it seems your filtering down as anything less than 2, dont you want foldChange<0 for down (or -2 fold change as a cut off if you wanted).

      To rule out a code problem just download the table
      write.csv( res, file=”table.csv” )
      and sort by fold change in a excell-like program, that will at least give you an idea that you're not doing anything wrong in that wonderful language that is R.

      • Sorry, yes, was a typo.
        The correct code is:

        resSig <- res[ res$pval < .05, ]
        up 2)
        down <- subset(resSig, foldChange < 2)

      • Hi Dave,

        Alicia said:
        ”I’m trying to identify the significant up-regulated and down-regulated genes in 6 different moments of the life cycle of “my” animal.”

        And in the package vignette it says:
        “Furthermore, it is important that each column stems from an independent biological replicate.”

        Is it okay to use the DESeq package if the columns of the counts table are a kind of time course? Alicia’s problem might be that there are some dependencies in the matrix columns?

      • That’s a good point, Joel. The strict definition of a biological replicate is that they come from different individuals subject to the same conditions. However, I guess one could argue (and I’ve seen publications with this) that separate leaves from a single individual represent biological replicates, or even different cell populations depending on your spatial perspective; these latter cases certainly are not technical replicates? The problem here seems to be “Then i have 0 genes up or down-regulated”, and my immediate expectation would be that under-sampling biological variation would actually lead to type I error, rather than type II, which ‘seems’ to be the problem here. But that is a good point about dependencies in the matrix, and I don’t have a good answer!

  9. Hello Dave Wheeler,

    Thanks for a nice tutorial for beginners like me. I am dealing with miRNA-seq data and have the same doubt as Alicia mentioned above.
    1) Filtering by resSig <- res[ res$padj < .1, ]
    I am not able to get any significant miRNAs in my sample, while by using
    resSig <- res[ res$pval < 0.05, ]
    i am getting about 41 DE miRNAs (up and down)…

    What should I do? Can I stick with only the p-value; does that make sense? It is the same question Alicia asked: “if i just keep going only filtering by pval < 0.05, is that enough to consider my genes significantly differentially expressed?”

    2) After this analysis, should I go further with filtering by overall counts? Would plotting heat maps and PCA plots give good results? Does this make any sense? I am asking because I am not very expert in statistics.

    Thanks and looking forward for your helpful reply

    • Hi Rupesh, it sounds like you are getting some significant results based on p-value, but none of these survive the multiple-testing cut-off. You could try resSig <- res[ res$padj < .2, ] to reduce the stringency. Also, look at the padj values in your 41 genes; they might be close to the 0.1 cut-off (or not). Remember that if you do enough tests, eventually you are going to get a result by chance alone; that is why we do multiple-testing corrections. If all your loci fail this test then you can only have low confidence that the result is real. It might be worth checking your mapping results and making sure lots of reads were not thrown away, etc. Or collect more data to see if you can improve the power to detect changes in expression.

  10. Thank you sir for your advice. I checked my adjusted p-values and they are always either 1 or NA. 😦 Somebody suggested that I perform some independent filtering with ‘genefilter’, or that I use DESeq2 (because independent filtering is automated there by the genefilter package) instead of DESeq….

    find detail here http://www.biostars.org/p/90384/

    Now i am looking forward for your suggestion

  11. I am having a ton of trouble getting my cds created. I am following the steps found in the vignette and other people’s write-ups on DESeq, with my data seemingly printing okay in the R terminal. However, I keep getting an error at the cds creation step. My table looks like this:
    > head(countTable)
    A B
    20ALPHA-HSD 0 0
    A1BG 0 0
    A2M 0 0
    A2ML1 0 21
    A4GNT 0 0
    AAAS 0 1
    > conds <- factor( c("highfert", "lowfert") )
    > conds
    [1] highfert lowfert
    Levels: highfert lowfert
    > cds <- newCountDataSet( countTable, conds )
    Error in if (any(round(countData) != countData)) stop("The countData is not integer.") :
    missing value where TRUE/FALSE needed

    Through some forum responses I have done:

    which(round(countTable) != countTable)
    and that tells me "integer(0)"

    I have a lot of 0s for read counts in many of my genes across the two samples, I have even replaced the 0s in those columns with 1s just to try and get the cds creation working.

    Any ideas on how I can fix this? I am fairly new to bioinformatics and I have been struggling to get this working for a bit now. I thought all the other steps in the R window seemed fine leading up to this but no matter what I try I can't get the cds creation step working.

    Thanks,
    Chris

      • Thank you for the sample data as it worked just fine. I realized with some help from your table and the sapply information I got from a forum that 2 of my cells had no number in them across the 10,000+ rows so I had to fill those numbers in and the cds started working. Now onto the actual analysis, hoping to not bother you with more questions.

        Thank you very much for the input, it was very helpful.

  12. Hi,

    This is my code and it gives me this error any idea?
    thanks in advance

    > class(res)
    [1] “data.frame”
    > str(res)
    ‘data.frame’: 110190 obs. of 8 variables:
    $ id : chr “GRMZM2G382890” “GRMZM2G019250” “GRMZM2G019251” “GRMZM2G481529” …
    $ baseMean : num 1.64 0 344.32 41.9 0 …
    $ baseMeanA : num 2.95 0 330.21 37.84 0 …
    $ baseMeanB : num 0.319 0 358.434 45.963 0 …
    $ foldChange : num 0.108 NaN 1.085 1.215 NaN …
    $ log2FoldChange: num -3.211 NaN 0.118 0.281 NaN …
    $ pval : num 0.539 NA 0.693 0.675 NA …
    $ padj : num 0.995 NA 1 1 NA …

    > plotMA(res)
    Error in .local(object, …) :
    When called with a data.frame, plotMA expects the data frame to have 3 columns, two numeric ones for mean and log fold change, and a logical one for significance.

    • You’ve probably solved it by now (sorry, very busy ATM), but I would try the counts table I uploaded for the previous question and just look to see how they are different when consumed into a data.frame.

  13. Thanks a lot for this tutorial, it works great. I would also be very interested in your Deseq2 tutorial. Is it available?

  14. Pingback: How to use DESeq2 to analyse RNAseq data | Welcome to the Dave Wheeler lab page and blog

  15. Hi Dave, regarding the DESeq manual (http://bioconductor.org/packages/release/bioc/html/DESeq.html) I have some naive questions:
    1. Must I generate the ‘metadata’, which is like an index listing the information on samples, conditions (treated, untreated) and libtype (single, pair-ended)? I ask because in your case on this page, you didn’t generate it but things still worked.
    2. I am also confused by the gene IDs generated by HTSeq, like “FBgn0000014” in the manual PDF or “g10671t00001” in your case. They are not like the regular gene symbols we see, like “MYF5” or “E2F1”. Could I put these IDs (“FBgn…”, “g106…”) into the downstream analysis, e.g. DAVID for GO and KEGG analysis?
    Thanks!
    Thanks!

    • 1) For the example from the pasilla dataset used in the manual it is important to set library type because there is a mix of single end and paired end reads. These need to be tagged so that they can be treated appropriately and we can test to see if the library type is influencing the results. My data was all single end.
      2) HTSeq doesn’t generate the names; it just takes them directly from the GTF file. And I’m offended (-; that you think our insect-type naming strategies are “irregular” because they don’t look like the stupid mammalian names [just kidding]. FBgn IDs are just FlyBase gene numbers (Drosophila). My numbers are generated by the de novo assembler I used and mean absolutely nothing to anyone but myself (-: But yes, for downstream GO terms etc. it is important that the IDs have entries in the relevant databases. If you work on human or mouse it should be no trouble; for some reason there is a lot of work done on these beasts.

      • Thank you. So if it is all single-end or all paired-end it should be fine, but if single-end is mixed with paired-end, the metadata is essential. Am I right?

      • PS: And is FDR = 0.1 the only threshold? And could you please explain the empirical (real) dispersion values (black dots) plot more clearly? This sentence made me confused:
        “The more dispersion, or biological variation, the bigger the difference in counts between treatments has to be before that difference becomes significant!”
        Thank you!

      • Hi Dave, I found a big problem: the gene IDs output from Cuffdiff and edgeR are different, so I cannot find the common DE genes.
        I use the data set with the Drosophila melanogaster genome.
        The edgeR output gene IDs are e.g. FBgn0000370, FBgn0000500, …
        But the Cuffdiff output gene IDs are something like XLOC_000028, XLOC_000038, … with gene symbols (e.g. KH1, RpLP1, …).
        Could you please recommend any software or command to translate them automatically? Or do you have any good idea for how I could generate the same gene IDs?
        Thank you!

      • I found that Cuffdiff’s gene IDs correspond to merged.gtf (from Cuffmerge) but edgeR/DESeq’s gene IDs correspond to genes.gtf (from the reference genome).
        Could I generate the counts table based on HTSeq-count and merged.gtf?


  16. Hi Dave, I have some naive questions:
    1. If I got the raw data from Ion Proton / Illumina, what is the first step of pre-processing those fastq reads? I mean, before I put the reads into TopHat for alignment, (1) what should I do (do I need to remove duplicated reads or too-short reads)? (2) What software could I use? I only know that I can check read quality with the FastQC software.
    And after TopHat, I found that the mapping rate is always ~60%; is that too low? Do I need to re-align the unmapped reads from the TopHat output before going to the downstream analysis (e.g. Cufflinks, edgeR, or Cuffdiff)? Or is there any better choice for alignment?
    Thank you!

  17. Hi
    Thanks for this useful post. I am a very beginner in bioinformatics and programming. I have 2 RNA samples and have already counted the reads using HTSeq, so now I have 2 count.txt files. I want to merge the files and make the table for DESeq. The problem is I do not know how to use the python code that you posted. Can you explain the lines for me and tell me how I can give the directory of my count.txt files for running the program?

    thank you so much

    • Hi there, just run the script from the same directory as the files you want to merge. You need a text guide file (call it "guide_file.txt") that contains filenames and tags like this, with the file name separated from the tag by a space:
      file1.counts untreated1
      file2.counts untreated2
      file3.counts treated1
      file4.counts treated2

      this will generate a tab-separated table like this

      gene untreated1 untreated2 treated1 treated2
      gene1 0 0 0 0
      gene2 1 0 11 10
      etc

      Save this file in the same dir as the script and your count files and just run it using python merge_table.py guide_file.txt

  18. Hi everyone,
    I am trying to filter my data, but when I want to make a plot (scatter or histogram) I get an error that my unfiltered res is not numeric.
    Does anyone know what the problem is? Thanks

    > h1 = hist(res.combined$pval,breaks=50,plot=FALSE)
    Error in hist.default(res.combined$pval, breaks = 50, plot = FALSE) :
    ‘x’ must be numeric
    > h2 = hist(resFilt.combined$pval,breaks=50,plot=FALSE)

  19. Hi Dave
    I am trying to use DESeq to test for differential expression between 2 samples
    I got the below error.

    I would be grateful if you could help me
    regards

    countsTable rownames(countsTable) countsTable head( countsTable)
    untreated1 untreated2
    LOC100506866 3 4
    CHMP1B 280 203
    DISC2 0 0
    MTVR2 0 1
    LOC441204 3 23
    TCOF1 465 154
    > conds cds cds sizeFactors( cds )
    untreated1 untreated2
    1.1109495 0.9001309
    > head(counts(cds))
    untreated1 untreated2
    LOC100506866 3 4
    CHMP1B 280 203
    DISC2 0 0
    MTVR2 0 1
    LOC441204 3 23
    TCOF1 465 154
    > head(counts(cds,normalized=TRUE))
    untreated1 untreated2
    LOC100506866 2.700393 4.443798
    CHMP1B 252.036661 225.522746
    DISC2 0.000000 0.000000
    MTVR2 0.000000 1.110949
    LOC441204 2.700393 25.551838
    TCOF1 418.560884 171.086221
    > cds <- estimateDispersions( cds )
    Error in parametricDispersionFit(means, disps) :
    Parametric dispersion fit failed. Try a local fit and/or a pooled estimation. (See '?estimateDispersions')
    In addition: Warning message:
    glm.fit: algorithm did not converge

    • Hi,
      I have encountered the same error as yours; I am working with a group of data without any replicates. I ran DESeq following Simon’s protocol, and the error happens when I run this step:
      > cds <- estimateDispersions( cds, method="blind", sharingMode="fit-only" )
      Error in parametricDispersionFit(means, disps) :
      Parametric dispersion fit failed. Try a local fit and/or a pooled estimation. (See '?estimateDispersions')
      In addition: Warning message:
      glm.fit: algorithm did not converge

      I’m just wondering if you have solved this problem? How?

  20. Excellent post! Thank you so much!

    I do have a couple of questions:

    1) When I plot the normalized counts between all of my replicates for a single sample (> plotMA(data.frame(baseMean = … ), I end up with a plot with a single dot instead of the filigreed arrow looking plot. Does this mean that the data I have for a single treatment is fairly stable?

    2) Regarding continuity of the ‘cds’ variable. After creating the filtered version ‘cdsFilt’, you plot the filtered histogram for ‘resFilt$pval’ and ‘res$pval’. However, after that point, the code continues to use the original ‘cds’ and ‘res’ rather than the filtered versions. Is this switch intentional or did you save the filtered version to their shorter names (e.g., > cds <- cdsFilt)?

    3) 'cdsBlind' is also calculated twice using the same data — is one (or both) supposed to be created using the filtered version?

    4) The rank(rs)/length(rs) plot — should this be created from the resFilt$pval instead of res$pval?

    5) For my data, the rank(rs)/length(rs) plot has a slight curve near 1 on the x-axis, but for the most part, the points are dispersed over the range {0-1, 0-1}. If I understand correctly, this isn't necessarily bad, correct?

    5) For the ordinary log-ratio vs the moderated log-ratio plot, I get the same basic shape, but the line is slightly offset from the points. Does that indicate a problem with the data?

    That ended up being more questions than I originally expected. Thanks!!

    • Hi Stephen, thanks for your questions/queries. Been a while since I’ve done this, but here are my explanations.
      1) A single dot doesn’t make sense to me. Even if all your replicates were the same you should see a line rather than a dot, with LFC = 0 but a large range of count numbers (making a dotted line along the horizontal axis).
      2) The filtering is optional; it helps most to reduce multiple-testing problems and thus makes sense when looking at DEG lists; most of the later work could be done with either, as it’s really just visualisations of the data. That’s my story anyway! This was just a demo; if it was a real dataset I would consider filtering based on its effect on FDR and perhaps the graph in your (q4).
      3) It is calculated twice; it should only have been calculated once, as in 2 I deliberately used all the data rather than the filtered set.
      4) In my mind it’s correct to use the unfiltered table; after all, this graph is used to show that most low-count genes don’t pass the p-val cut-off, thus justifying filtering.
      5) Bit hard to know from the description: the x axis is count concentration, the y axis is -log10, so bigger along y == smaller (better) p-val.
      5b!) All dark blue dots should essentially run along the diagonal (not sure about ‘slightly’ offset); lower-count (light blue) dots should flatten out to indicate the moderation is working. A lot of the larger logFCs for low counts are just due to divisions of small and very small numbers.
      Hope that helps, and more importantly is sensible; it’s been a while since I did this.

  21. I have a couple of additional questions regarding the process of filtering the data after the initial DESeq run.

    1) When calculating the row sums [ rs = rowSums(counts(cds)) ], is there a reason why the normalized counts are not used? It seems that using the un-normalized counts could lead to a skew in the data if one of the treatment sets is larger than the other.

    2) Should the size factors and dispersions be recalculated for the filtered cds before we rerun the nbiomTest on the filtered data? It seems like these would be very likely to change after filtering, but I’m not sure how much of an effect they have on the analysis.

    3) Also, how stable is DESeq2, and should I switch to it?

    Thanks!

  22. For the normalisation procedure are you sure the reference count value is calculated per condition? I thought it was calculated across all samples…
    This is a brilliant blog by the way (though personally, I prefer your DESeq2 tutorial)

  23. Hello,

    I hope you can help me, I have followed your tutorial for my data, in which I have no replicates, but 5 different conditions. I did use the blind method and sharing fit only.
    I come up with results all the way through, up until the PCA.

    > print(plotPCA(vsd, intgroup=c("conditions")))
    Error in svd(x, nu = 0) : infinite or missing values in ‘x’

    I cannot see where I have gone wrong, since throughout, the rest of the analyses and commands work. Is it something to do with my data having no replicates?

    Thank you for your tutorial.

  24. Thanks! Great site!

    I have a list of genes of interest in a file. I want to make a heat map of just those genes. How can I select them from the whole list.
    genes_of_interest <- scan("gene_list_try", what=character() )
    and then create the heatmap as listed above.

    • Hi there, hopefully this gets you started:
      once you have your array of target names
      if ‘cds’ is your DESeq table object and ‘targets’ is a list of gene names (from your file):

      s<-which(rownames(cds) %in% targets)
      heatmap.2(exprs(vsd)[s,])

  25. Hello,
    Nice tutorial,
    I am trying to extrapolate the protocol to my experiment. In my case I have one control and two conditions ( Control, C1, C2) each has three replicates. Then 9 samples in total. Which is the best approach:
    1) load the read count table with all the control and conditions, normalize the reads, and then conduct nbinomTest between Control Vs C1 and then Control Vs C2 …or
    2) Separate in two tables Control and C1 , and Control and C2 , normalize each table independently and continue with the nbinomTest

    Thanks

    Jose

    • Hi Jose, unless C1 and C2 were connected in some way I think it would be better to do Control v C1 and Control v C2 independently (they are separate hypothesis tests after all). For some time-course stuff that I’ve done I have used them all together, mostly so I could do some clustering over time, but in that case there was an obvious connection between T2, T3 etc., i.e. time. Good luck.

  26. Hi,
    I am having difficulty in running your script merge_tables.py. I am a beginner in python programming. I am getting this error.
    ……………………………………………………………
    Traceback (most recent call last):
    File "merge_tables.py", line 53, in
    gene,count = line.strip().split('\t')
    ValueError: need more than 1 value to unpack
    …………………………………………………………….

    Can anyone help me out?

    Thank you,

    Best Regards,
    Naresh D J

    • Hi Naresh, the script expects a file containing tab separated pairs of ‘gene’ and ‘count’ values, as generated from HT-seq.
      The error is indicating that there is only one item on some line in one of the files so it can’t assign a count value to the ‘count’ variable.

      Try with different pairs of files and see if some work and some don’t. If this happens, look through the ones that break the script for lines with only a single item (probably at the top or bottom of the file). If all of the files break the script then it’s something common to all of them.

      If you know python, catch the error and print the line causing the issue by changing that part of the script:
      try:
          gene, count = line.strip().split('\t')
      except ValueError:
          print line
      good luck

  27. Hi
    Useful tutorial!
    I am comparing two samples without replicates. I am stuck on making the heatmap, as the heatmap obtained only shows colour for the genes with a higher number of reads (blue), with the lower ones in white. I have tried changing the colour scheme; it’s not the colour scheme, I guess.
    for e.g.
    variety 1 variety 2
    gene_id_12345 25000 18000

    So instead of showing gradient colours for these two, it’s showing variety 1 in blue and variety 2 in white. This happens for all the genes, so I am getting a binary kind of graph. I am not very good with stats but have followed the manual completely.

    Any help appreciated.

    Thanks!!
