Oft-repeated commands and workflows [work in progress]

This is just a dumping ground for commands I perform all the time so that I have somewhere to look them up!

Convert a Tophat BAM file into a sorted SAM file for htseq-count


#if Tophat was used, the BAM should already be sorted
samtools view -h tophat_sorted.bam > tophat_sorted.sam

#if you need to sort before converting (output: tophat_sorted.bam)
#note: newer samtools versions use -o, e.g. samtools sort -o tophat_sorted.bam tophat.bam
samtools sort tophat.bam tophat_sorted
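The same conversion from Python, as a minimal sketch using pysam's samtools wrappers (assumes pysam is installed; the file names just match the commands above):

import pysam

#sort the Tophat BAM (pysam wraps the samtools commands)
pysam.sort("-o", "tophat_sorted.bam", "tophat.bam")

#write a SAM file with the header included (-h), as htseq-count expects;
#recent pysam versions return the command output as a string
with open("tophat_sorted.sam", "w") as sam_out:
    sam_out.write(pysam.view("-h", "tophat_sorted.bam"))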

Bash

wc -l seq_file.fas #count lines
grep -c '>' seq_file.fas #count seqs in fasta file
ls -l #permissions and file size
ls -lh #above with human format for sizes
chmod a+wx #add write and execute permissions for everyone
chmod 755 #rwx for owner, read and execute for group/others
chmod -R 755 #change all files in a folder recursively
tail -n 400 #last 400 lines
tail -n +400 #from 400 to end
history | grep "command" #search history for commands
#sed prints to screen; redirect the output or use -i for in-place editing
#print a specific line (here line 100); -n means only that line is shown
sed -n -e '100p' test.txt
#delete a specific line (line 1000) from a file and back up the original
sed -i.bak -e '1000d' file.txt
#delete a range of lines (1 to 10) and back up the original
sed -i.bak -e '1,10d' file.txt
#removes lines 10 to 20 INCLUSIVE, i.e. 11 lines
sed -i -e '10,20d' test.txt
#remove line 10, lines 15-20, line 100
sed -i -e '10d;15,20d;100d' test.txt
#convert fastq to fasta (efficient; p prints the substituted lines); see also the Python sketch after this list
sed -n '1~4 s/^@/>/p;2~4p' seq.fq > seq.fas
#get an md5 checksum for a whole folder (sort by filename so the result is reproducible)
find FOLDER/ -type f -exec md5sum {} \; | sort -k 2 | md5sum > md5.txt
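The fastq to fasta conversion again, as a plain Python sketch (file names are the same ones used in the sed one-liner above):

#minimal fastq -> fasta converter, equivalent to the sed command above
with open('seq.fq') as fq, open('seq.fas', 'w') as fa:
    for i, line in enumerate(fq):
        if i % 4 == 0:        #header line: @name -> >name
            fa.write('>' + line[1:])
        elif i % 4 == 1:      #sequence line
            fa.write(line)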

htseq-count: -s no turns off stranded counting; -m union (the default) sets the overlap mode

htseq-count -s no -m union tophat_sorted.sam > sample_counts.txt
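A quick, hedged sketch for pulling the counts into pandas for a first look (assumes the standard two-column, tab-separated output that htseq-count writes):

import pandas as pd

#load the htseq-count output (tab separated: gene id, count)
counts = pd.read_table('sample_counts.txt', header=None,
                       names=['gene_id', 'count'], index_col=0)

#htseq-count appends summary rows such as __no_feature; drop them
counts = counts[~counts.index.str.startswith('__')]
print(counts.head())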

Vim

v #visual
p #paste
# #search for the word under the cursor
yy #yank (copy) the current line; yw yanks a word
dd #delete the current line
:wq #save and close
ma #mark a; press 'a to jump back to it
mz #mark z, as above
{ and } #jump to the start/end of a paragraph
d} #delete (cut) to the end of the paragraph

Python specific

pip install --upgrade packagename
#virtual env
virtualenv venv
source venv/bin/activate
deactivate #takes no argument
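A quick way to check from inside Python which environment is active; when a virtualenv is activated, sys.prefix points into the venv directory:

import sys

#inside an activated virtualenv this prints the path to venv/
print(sys.prefix)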

print "%f"%(np.mean(seq_length))
#prints 1019.662143
print "%.2f"%(np.mean(seq_length))
#prints 1019.66
print "%.1f"%(np.mean(seq_length))
#prints 1019.7 note rounding up
print "%d"%(np.mean(seq_length))
#prints 1019

#use with open(...) as f for automatic file closing
with open('infile.txt', 'r') as f:
    data = f.read()
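Putting the snippets above together, a small sketch (the FASTA file name is just an example, and it assumes one sequence per line) that collects sequence lengths and prints the mean:

import numpy as np

#collect the length of every sequence line in a FASTA file
seq_length = []
with open('seq_file.fas', 'r') as f:
    for line in f:
        if not line.startswith('>'):    #skip header lines
            seq_length.append(len(line.rstrip()))

print("%.2f" % np.mean(seq_length))     #e.g. 1019.66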

Git

WordPress (see text for tags)

To embed code in a post, wrap it in the code shortcode: [code language="css"] at the start and [/code] at the end (change "css" to match the snippet's language). The Python snippet below is an example of the kind of code posted this way.
import sys

test = sys.argv[1]

#note: this loops over the characters of the first argument
for name in test:
    print name
#another embedded example: data as green circles ('go'), fig_fn205(xmir205) as a black line ('-k')
plot(xmir205,ycbx1,'go',fig_fn205(xmir205),'-k')

R

browseVignettes() #list vignettes for the installed packages in a browser

Matplotlib
Change the font family.

import matplotlib

font = {'family' : 'sans-serif',  #note: 'normal' is a weight, not a valid font family
        'weight' : 'bold',
        'size'   : 22}

matplotlib.rc('font', **font)
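A minimal sketch showing the rc override in context (the data are just filler):

import matplotlib
matplotlib.rc('font', weight='bold', size=22)   #rc settings can also be passed as keyword arguments

import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [2, 4, 8])                  #filler data
plt.title('All text now bold and 22pt')
plt.show()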

Pandas, matplotlib and IPython – all you need for great data analysis

I’ve been playing around a little with the stats-centric programming language R, mostly to get a better handle on the Bioconductor differential gene expression analysis packages edgeR and DESeq. The language is designed to allow for easy manipulation of tabular data as well as providing access to a rich library of statistical packages and graphing tools. One thing I learned was that often it was more efficient (at least for me) to spend a little time pre-formatting the data using Python/Perl before even getting started with R. The charting features of R are pretty cool, but once again I missed my familiar Python environment )-:

But, as always, there is a Pythonic way: I came across a great book called Python for Data Analysis by Wes McKinney. The book introduced me to the pandas library, which contains R-like tools for tabular data manipulation and analysis. The book also introduced the IPython development environment; basically a souped-up, feature-rich but lightweight “Python IDLE”. The best features of IPython for me are the logging capabilities, advanced history and debugging utilities – very cool! IPython has been designed to work well with matplotlib, allowing production and manipulation of nice-looking graphs and charts within an interactive Python environment. I’m still very much learning this (and I’m a hack programmer), but here is some fun data crunching based on a USDA food database wrangled into JSON format by Ashley Williams.
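Roughly the kind of crunching involved, as a hedged sketch; the file name and the nutrient field names ('Protein', 'Total lipid (fat)') are assumptions about how the JSON is laid out:

import json
import pandas as pd
import matplotlib.pyplot as plt

#load the JSON food database (file name is an assumption)
with open('foods.json') as f:
    db = json.load(f)

#flatten to one row per (food, nutrient) pair
rows = []
for food in db:
    for nut in food['nutrients']:
        rows.append({'food': food['description'],
                     'nutrient': nut['description'],
                     'value': nut['value']})
nutrients = pd.DataFrame(rows)

#one column per nutrient, one row per food
wide = nutrients.pivot_table(values='value', index='food',
                             columns='nutrient', aggfunc='first')

#top 50 foods by protein, plotted alongside their fat content
top50 = wide.sort_values('Protein', ascending=False).head(50)
top50[['Protein', 'Total lipid (fat)']].plot(kind='barh', figsize=(8, 14))
plt.tight_layout()
plt.show()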

OK, I think it would look better on one chart. Here are the top 50 protein-containing foods, with their fat content, as part of one graph.

Note that I chopped the y-axis labels off, doh! No problem, just modify them in real time using the slider editor!

Not too bad at all for a newbie. Hopefully by next post I will feel a little more comfortable sharing some hard-learned lessons on data manipulation using pandas.