VIM shortcuts

Basic navigation


h j k l - LEFT DOWN UP RIGHT
2h - 2 steps LEFT
H - home (top screen)
10j - 10 steps DOWN
J - join this line to the line below
{ - UP a section 
} - DOWN a section
o O - open line BELOW open line ABOVE
b - beginning of current word
w - beginning of next word
e - end of current word
$ - end of line 
^ - start of line
H - TOP of screen
L - BOTTOM of screen
v - select text
V - select line
v3j - select 3 lines DOWN

Modifying text


i - insert
a - append
dh - DELETE left
dl - DELETE right
dd - DELETE line
dw - delete word
D - DELETE to end of line
x - DEL
X - BACKSPACE - INDENT line LEFT (use visual for block) 
>> - INDENT line RIGHT  (use visual for block)
p - paste after
P - paste before
3p - paste 3 copies
yy - yank whole line
yw - yank word
y$ - yank to end of line
4yy - yank four lines
Y - yank whole line
J - join current line to next

Save Exit and help


K - HELP (bash, req select word)
ZZ- save and quit
ZQ - quite without saving
z - save

SPECIAL


[ - enter command mode
, - leader key
v - visual
: - enter command mode
i - insert mode
a - append mode
:tabnew - open new tab
,n - left tab
,m - right tab
/ - search # press enter, then n to repeat, N repeat up 
:s/old/new/g - globally change

Additional/Plugins


F2 - paste code blocks from external (press again to end)
n - remove highlight last search
,e - quit
,E - qa! quit all windows
ctl j - move to lower window
ctl k - move to upper window
ctl h - move to left window etc
,s - sort selected
f - fold code block (again to undo)
F - fold all (again to undo)
ctl space - auto complete
shift p - filesearch

Git


:Gstatus - status / stage by "-"
:Gcommit - message 'wq' save
:Gmerge

Cool combinations


cH, dH - copy/delete to top of screen
d3+ - delete line + 2 lines below
y3+ - copy lien + 2 lines below
d/test - delete all words 'test'

Insert breakpoint and ipdb PDB bit…

Setting up VIM as a python IDE

If you are looking for a lightweight but feature rich editor you can’t go past VIM! Sure those keyboard shortcuts are difficult remember and it’s frustratingly hard trying to stop reaching for that bloody mouse (the mouse will actually work in this set-up BTW). However, what I have found is the best way to learn VIM (I don’t think you ever really learn vim) is to make it useful enough that you want to start using it; this helps build up the muscle memory required to become efficient [enough]. The ultimate payback is that with practise you can become really really efficient relative to a user stuck with a mouse and complex GUI. The mouse is so 1980’s anyway 😛

vim_terminal_example1

So this post is about making VIM useful! The steps are based around Ubuntu/Debian Linux but should be adaptable for OSX and other NIX distros. Windows users, like yourselves I have no idea (-;?

Of course nothing I ever do is original and this especially goes for this post which borrows heavily from an excellent (but old) PyCon Asia talk by Martin Brochhaus. The .vimrc and much of the instructions are taken from the talk so feel free to watch it on youtube. I’ve also tried to update some of the info and stick to the minimum to get you coding away ASAP.

Section 1: Basic VIM install

These steps will clone the current version of VIM from source and enable additional features. If you already have a copy installed (ie vim starts when you type ‘vim’) I would first fun ‘apt-get remove vim’ (or your equivalent). First dependencies and some tools.

sudo apt-get update
sudo apt-get install mercurial curl python-pip flake8
sudo apt-get build-dep vim

#install python packages
sudo pip install ipdb
sudo pip install jedi
sudo pip install flake8

As we are going to use a bin folder in our home directory we need to put this location into our path so that our new version of vim will run from the command line.

#add a home bin dir to your path
nano ~/.bashrc

#add the following to the end of the file
if [ -d $HOME/bin ]; then
PATH=$PATH:$HOME/bin
fi

#save and exit

Now we make the ~/bin dir as well as an ~/opt for the vim install. Eventually we will use a simlink from ~/bin (now in our path) to the ~/opt folders in our home directory, the -p flag says ‘don’t complain if this dir is already there’. The final command reloads your bashrc without you having logout. Cool.

mkdir -p ~/bin
mkdir -p ~/opt
source ~/.bashrc

Now to install VIM.

cd ~/Desktop
hg clone https://vim.googlecode.com/hg/ vim
cd vim/src
./configure --enable-pythoninterp --with-features=huge --prefix=$HOME/opt/vim
make
make install
cd ~/bin
ln -s ~/opt/vim/bin/vim

#the following should return /home/YOUR_USERNAME/bin/vim
which vim

Great!

Section 2: The vimrc settings file

Now we need a “.vimrc file”. This file will contain our customisations (the dot says I’m a hidden file). You can get mine from here (download to your desktop or copy contents into a blank txt file). It is worth look at this file in a text editor as it contains some information on the key short-cuts and the plugins we are going to use (BTW in .vimrc files a single quote ” indicates a comment (aka #) a double quote “” indicates a bit of code that can be uncommented to enable something). The next bit of code simply copies the file to your home dir and the gets a custom colour scheme (mostly just to show how to do it).

mv ~/Desktop/vimrc.txt ~/.vimrc
mkdir -p ~/.vim/colors
cd ~/.vim/colors
wget -O wombat256mod.vim http://www.vim.org/scripts/download_script.php?src_id=13400

Section 3: Getting set-up for the plugins with pathogen

This next part is important. To manage plugins we will use a bit of kit called pathogen. The plugins can then be installed (mostly by a git clone) right into the “~/.vim/bundle/” (this will result in a folder structure like this: .vim/bundle/plugin-name) and pathogen will handle everything for us – awesome! Most plugin developers set-up their folders to work nicely with pathogen to make life easy.

mkdir -p ~/.vim/autoload ~/.vim/bundle
curl -so ~/.vim/autoload/pathogen.vim https://raw.githubusercontent.com/tpope/vim-pathogen/master/autoload/pathogen.vim

Section 4: Installing plugins

The first plugin we will use is called powerline and it adds features and makes a better looking status line.

#install powerline plugin
cd ~/.vim/bundle
git clone git://github.com/Lokaltog/vim-powerline.git

The next plugin allows code folding to make it easier to look through long blocks of code. Its simple to use, just type f to collapse a section of code of F to collapse it all. Type again to reverse the folding – sweet as.

# install folding flugin us f to fold block of F to fold all
mkdir -p ~/.vim/ftplugin
wget -O ~/.vim/ftplugin/python_editing.vim http://www.vim.org/scripts/download_script.php?src_id=5492

The next one installs ctrlp which allows nice fuzzy file searches from directly inside vim.

#install ctrlp
cd ~/.vim/bundle
git clone https://github.com/kien/ctrlp.vim.git

Install the jedi plugin which allows awesome auto-completion of commands and imports using [ctl][space].

#install jedi plugin
git clone --recursive https://github.com/davidhalter/jedi-vim.git

Install syntastic is another awesome plugin that does syntax checking and will check your code for compliance with PEP8.

#install syntastic
git clone https://github.com/scrooloose/syntastic.git

Forget tabbing into the terminal to stage/commit/merge etc just install git support with fugitive.

# install git support
cd ~/.vim/bundle
git clone git://github.com/tpope/vim-fugitive.git
vim -u NONE -c "helptags vim-fugitive/doc" -c q

So that is it! Your editor should look something like this:

vim_terminal_example2

Well that’s the install. Next post will be the shortcuts I use all the time and some that I want to learn as a bookmark for myself. Otherwise checkout the links to the plugins for detailed descriptions of what they do. But now you should be up and running with a pretty nice looking IDE!

Make an android app in minutes with the App Inventor 2

Featured

The App Inventor 2 (AI2) is a really cool bit of software from Google and [now] MIT. It contains two GUI tools to make creating Android apps a “relatively” easy task. The first tool is a designer that helps create your screen layouts, pretty much like the layout designers that come with the Android SDK. The really cool and unique part is the “blocks” designer, which helps you create code by moving object blocks around (rather than typing java). These GUI’s dramatically speed up development, whilst also providing an excellent framework to teach people how to code in a fun and accessible way (ie no language syntax).

The following quick run through of a very simple conference app that I’m using for a demo at our teaching symposium. The goal is to do this demo in 10 min, so really this is bare-bones stuff! If you want more I highly recommend the book App Inventor, edition 2, which covers AI2 in much more detail through a series of fun (and useful) projects.

Its worth pointing out now that AI2 has some additional cool features. Firstly, its all on the cloud so you don’t need any software except chrome or firefox (ha ha IE is NOT supported). Secondly, if you are on WIFI, by installing a simple application on your phone you can test the app instantly by scanning a QT code from the computer screen. If you have ever created an app with eclipse and used the emulator for testing you will be delighted by how much simpler and quicker this process is, its super cool –  really!

The conference app is just a listview of names and session times that are clickable. If a item is selected a photo and brief bio pops up for that session. No fancy formatting at this stage, anyway.

1. Go to the AI2 page and login with your google account

The first screen will recommend you install some additional applications to allow easy testing on your phone. By far the simplest is to choose option one, as long as you are sharing a wifi network with your computer and phone. For the rest of this guide I’ll assume this is how you will test your app.

fig1

2. On the menu goto “Projects” – “Start new project” and call it something informative

After a few seconds the designer window will appear. The designer page allows you to layout all of the elements your require for your app. The left hand “palette” contains elements for layout as well as user interaction widgets. These elements can be dragged onto the middle view pain or alternatively if they are a invisible element (a database for example) to the bottom of the viewer screen. The “components” pane allows you to customise each element and finally the properties pane lets you modify size, layout, and text aspects.

fig2

3. Re-title

Just to to see how simple things can get, lets give a more informative title. All we need to do is change the “title text” in the property pane.

fig34. Add the list-view

Ok the meat and potatoes of the app, the listview will be populated with speaker names and times that are clickable. Just drag the listview icon from the palette window to the viewer.

fig4

 5. Adjust the placement and re-sizing to suit your tastes

We can now adjust the listview properties, such as the width and height so that this element fills the available screen space.

fig56. Go to the blocks page and start creating the logic for your app

Now the nuts and bolts (and the cool part). We have to populate our listview with the speaker names when the app is opened, as well as specify some behaviour when the user clicks on the element in the listview. For this work we will need to use the code blocks on the left hand side of the “blocks” screen.

fig6

7. Initialise by populating the list-view with strings once the screen is loaded

Clicking on the “Screen1” icon will bring up all the blocks associated with this choice. We want to drag the “when screen initializes” block to the work area. This “block” of code will execute when the app opens. Notice the gap in the middle, this is where we place the “do” part. In this case what we want to “do” is populate the listview with text for the speaker names.

fig7

To do this we click on the listview block and select the “listview elements” block, notice how this block will fit nicely inside the previous block. Notice also that the open jigsaw like parts at the end of this new block, this is where we add our new list that will hold the speaker string elements. I hope you notice this “feels” like coding nix with objects all the typing.

fig8

Clicking on the list block allows us to make a new list, we click this into our listview and now we populate the list with items that will be displayed.

fig9

So it should look like this (below). The little blue gear lets us increase the number of spots for elements to populate our list. For this app we will have four speakers for this section, so just drag the item block into the right “list block” twice to create two more spaces.

fig10

To actually populate the list we need some text. So we grab the empty text fields and drag them to the open slots on the list.

fig11

By typing between the quotes we can add our custom text. By right clicking on this first text element we can duplicate it for the remaining 3 spaces (or we could just drag 3 more empty elements across).

fig12

We modify each element to be a speaker name. Ok, now when the app opens the listview will be populated by string elements found in the list. Next we need to specify some behaviour when the user clicks on the speakers name.

fig13

At this stage our basic app should be “working”, as in the listview should display, although nothing will happen when we click on each item. Now this is where the simple testing functions of AI2 come to the fore. Just click on the build and then “provide QR code for apk”.

Screenshot 2014-12-04 23.00.00

Screenshot 2014-12-04 23.13.48

Once the QR code pops up just open up the App inventor 2 companion app and scan the code, its bloody brilliant!

Screenshot_2014-12-04-23-19-31

Now we create an action once an element (speaker) in the list is selected by the user. As this is a listview action we click back on the listview icon and grab the “afterPicking” block; thats logical don’t you think?

One thing you might notice about these blocks, if you have coded before, is that they are very much like a normal object that we would specify using code syntax, its just its GUI based.

fig14fig15Our after picking behaviour will be opening a new screen that contains a photo of the speaker and a brief bio. So we need to setup a new screen to contain this new information. I’m being really lazy here and I’m just going to hard wire the page so that only one of the listview items actually works as expected (mine of course). However, in real life you would setup some logic that would ask for the index item that was clicked, and use this to send the user to the correct screen.
 fig16fig17So we have setup the method to open the new screen (above), now we need the screen. We’ll create the “DaveScreen” (as specified by another purple text element block ) that will contain my mock bio and a great photo. To do this just click the “add screen” button at the top of the screen and re-name it.fig18Back in the designer we can populate this new screen with a photo and some text, using a vertical alignment is simple just drag it onto the window and specify its attributes in the right hand panel.fig19Now Drag an image icon onto the panel and then by clicking on “image1” in the components plane we can upload an image and resize it using the properties. You really can’t get much easier than that really, can you!

fig20 fig21

fig23Wow, what a good looking bloke, finally, we create a label and populate it with text in the blocks screen. We could do this in the properties section of the designer page, but since I’m feeling guilty for hard coding the list view, I thought I would use the blocks to set the text.

fig24

Thats it. The app should provide a list of speakers, and when you click on my name (any name actually due to the hard wireing) the new page should open with my bio and a photo.

You can download the files associated with the app from here. These include the APK’s that can be installed on any android phone as well as the *aia files that can be imported into AI2 if you want to play with the code. I’ve included a slightly improved (although still broken) version that replaces the hard-wired listview actions with a call to specific speaker pages (that don’t exist but could be easily made based on the above information).

New Zealanders and their sheep – part 2

Ok, based on the graphs in the last post NZ is slowly being cow-a-fyed, so whats driving this trend. Well google will tell you that its …

WARNING: This data is dodgy, but I’m really just using it to demonstrate how cool pandas is. So I found some information on milk and lamb meet prices, we’ll load them up as dataframes and work out the percent change since 1994 like we did before. We’ll try out the datetime functionality of pandas, which is really quite nice. But first just to import our table from the last post and make the year the index so we can easily merge the new data.

import pandas as pd
per_decline = pd.DataFrame(pd.read_csv('percent_decline.csv'))
cols = per_decline.columns.values
cols[0] = 'Year'
per_decline.columns = cols
per_decline.index = per_decline['Year']
per_decline = per_decline.ix[:,1:] #all rows, skip first column (date is now the index)
per_decline.head()

	Total beef cattle	Total dairy cattle	Total sheep	Total deer	Total pigs	Total horses
Year						
1994	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000
2002	-11.025827	 34.444950	-20.002034	 33.858009	-19.100637	 11.811093
2003	 -8.344764	 32.882482	-20.041908	 37.229441	-10.766476	 18.504488
2004	-11.895128	 34.207998	-20.609926	 42.707754	 -8.072078	 13.376472
2005	-12.366101	 32.506699	-19.379727	 38.499840	-19.230733	-100.000000

Now we are going to create a table from the dodgy lamb price data, this table is in a slightly different format so we will have to use the groupby method to wrangle it into the shape we need.

lamb_data = pd.DataFrame(pd.read_excel('lamb_usd.xlsx',sheetname='lamb'))
lamb_data.head()
	Month	Price	Change
0	 Apr 1994	 130.00	 -
1	 May 1994	 126.59	 -2.62 %
2	 Jun 1994	 127.03	 0.35 %
3	 Jul 1994	 126.11	 -0.72 %
4	 Aug 1994	 119.62	 -5.15 %

Now to use datetime to make an index based on the month data.

lamb_data.index = pd.to_datetime(lamb_data['Month'])
lamb_data=lamb_data.ix[:,1:2] #just grab the price
lamb_data.head()
	Price
Month	
1994-04-02	 130.00
1994-05-02	 126.59
1994-06-02	 127.03
1994-07-02	 126.11
1994-08-02	 119.62

Pandas did a good job of converting the date format into a datetime index. As you’ll see in a second this datetime object has some extra functionality that makes dealing with dates a breeze. Although this new data has the date and price information we need, its divided into quarterly amounts. As you can see by the commented out code, initially I made a mistake and summed these values, but really we want the mean to get the average yearly price. I left the mistake code there as it shows how easy it would have been to get the sum using groupby.

#wrong! lamb_prices = lamb_data.groupby(lamb_data.index.year)['Price'].sum()
lamb_prices = lamb_data.groupby(lamb_data.index.year)['Price'].mean()
lamb_prices = pd.DataFrame(lamb_prices[:-1]) #get rid of 2014
lamb_prices.head()
	Price
1994	 124.010000
1995	 113.242500
1996	 145.461667
1997	 150.282500
1998	 116.013333

We pass the year index to groupby and get it to do its magic on the price column (our only column in this case, but you get the idea), we then just call the mean method to return the mean price per year. The datetime object made specifying the year easy. Now we are going to write a quick function to calculate the percent change since 1994.

def percent(start,data):
    '''calculate percent change relative to first column (1994), better than previous attempt )-:'''
    ans = 100*((start - data)/start)
    return 0-ans

lamb_change = percent(lamb_prices.Price[1994],lamb_prices)
lamb_change.head()
	Price
1994	 0.000000
1995	 -8.682768
1996	 17.298336
1997	 21.185791
1998	 -6.448405

Great! Now just add that column to our original dataframe. Notice how only the intersect of the dates are used, very handy (ie it drops 1995-2001 from the lamb price data as these dates are not in our stock number table)!

per_decline['Lambprice'] = lamb_change
per_decline.head()
	Total beef cattle	Total dairy cattle	Total sheep	Total deer	Total pigs	Total horses	Lambprice
Year							
1994	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000
2002	-11.025827	 34.444950	-20.002034	 33.858009	-19.100637	 11.811093	 17.768056
2003	 -8.344764	 32.882482	-20.041908	 37.229441	-10.766476	 18.504488	 28.869984
per_decline.index=per_decline.index.astype(int) #lamb2
per_decline.plot(kind='barh')
plt.title('Percent change in stock in NZ since 1994')
plt.xlabel('Percent change since 1994')
plt.ylabel('Year')

fig_9

 

The next series of code and graphs adds in milk and lamb prices to try and see why farmers are moving from ovines to bovines!

milk_data = pd.DataFrame(pd.read_excel('milk_prices_usd.xlsx',sheetname='milk'))
milk_data.index=milk_data['year']
milk_data.head()
	year	thousand head	pounds	mill lbs	price_cwt
year					
1989	 1989	 10046	 14323	 143893	 13.56
1990	 1990	 9993	 14782	 147721	 13.68
1991	 1991	 9826	 15031	 147697	 12.24
1992	 1992	 9688	 15570	 150847	 13.09
1993	 1993	 9581	 15722	 150636	 12.80
#get rid of the info we don't need
milk_data = pd.DataFrame(milk_data.ix[5:,:])
milk_change = percent(milk_data.price_cwt[1994],milk_data)
per_decline['milk_price'] = milk_change
per_decline.head()
	Total beef cattle	Total dairy cattle	Total sheep	Total deer	Total pigs	Total horses	Lambprice	milk_price
Year								
1994	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000	 0.000000
2002	-11.025827	 34.444950	-20.002034	 33.858009	-19.100637	 11.811093	 17.768056	 -6.630686
2003	 -8.344764	 32.882482	-20.041908	 37.229441	-10.766476	 18.504488	 28.869984	 -3.469545
2004	-11.895128	 34.207998	-20.609926	 42.707754	 -8.072078	 13.376472	 33.670672	 23.747109
2005	-12.366101	 32.506699	-19.379727	 38.499840	-19.230733	-100.000000	 29.762385	 16.653816
per_decline.plot(kind='barh')

lamb_3 These graphs are a little busy, lets just concentrate on the important stuff.

<pre>animals=['Total dairy cattle','Total sheep','Lambprice','milk_price']
interesting_data=per_decline[animals]
interesting_data.plot(kind='barh')

lamb_4

interesting_data.plot()

Finally!

lamb_5

Pandas data analysis – New Zealanders and their sheep!

Featured

As an Aussie I cop a little bit of flack living in New Zealand. It helps that since I’m from the South of Australia I follow Aussie Rules footy, not this rugby shenanigans, so I don’t bother too much with the constant focus on how much better the All Blacks are relative to our poor Wallabies (Australia has won as many world cups as New Zealand – Take that!).

They make bloody good gum boots (multi-purpose - inside joke (-; )

Kiwi’s have bloody good gum boots; they have to do more than keep the mud off! – inside joke (-;

That said as a bit of fun I thought I would do a post on the Pandas python module using data from StatsNZ. It has too be about sheep right! The actual spreadsheet I downloaded from StatsNZ and used in this is here.

I’ll be using pandas, ipython, and matplotlib to create some graphs to show the decline of sheep in NZ. In the next post I’ll try and work out why (yes I know you could find this out in 3 seconds with google, but thats no fun).

First we will read the file using pandas and make a dataframe. Pandas can import in all kinds of file formats, with excel you need to include a sheet name (that little tab at the bottom of the excel sheet).

#imports
import pandas as pd
import matplotlib.pyplot as plt
#sheetname required for xlsx
data_file = pd.read_excel('livestock1.xlsx',sheetname='livestock')

#make a dataframe
data = pd.DataFrame(data_file)
data.head()
Out[8]:
                                                 1994          2002      2003  \
Total beef cattle                             5047848       4491281   4626617
Calves born alive to beef heifers/cows        1262522       1083485   1079334
Total dairy cattle                            3839184       5161589   5101603
Calves born alive to dairy heifers/cows       2455975       3225238   3115897
Total sheep                              4.946605e+07  3.957184e+07  39552113
-----
                                             2010      2011      2012
Total beef cattle                         3948520   3846414   3734412
Calves born alive to beef heifers/cows     901258    901375    827749
Total dairy cattle                        5915452   6174503   6445681
Calves born alive to dairy heifers/cows   3640914   3884257   3879543
Total sheep                              32562612  31132329  31262715

So we made a dataframe, the head method acts like bash head in that it shows the start of the frame rather than the whole thing. Currently the years are columns and the stock type are rows, lets flip the table, which is super easy!

 #we really want the dates as the index
data = data.T
data.head()
-------
     Total beef cattle Calves born alive to beef heifers/cows  \
1994           5047848                                1262522
2002           4491281                                1083485
2003           4626617                                1079334
2004           4447400                                1013893
2005           4423626                                1018730

Now that we have the years as rows, actually an index, it will be much easier to do our plotting. But the column names are overly informative, as in long, lets shorten them.

#the column names are pretty long, lets fix that now
data.columns
Out[13]: Index([u'Total beef cattle', u'Calves born alive to beef heifers/cows', u'Total dairy cattle', u'Calves born alive to dairy heifers/cows', u'Total sheep', u'Total lambs marked and/or tailed', u'Total deer', u'Fawns weaned', u'Total pigs', u'Piglets weaned', u'Total horses'], dtype=object)
sub_data = data.loc[:,['Total beef cattle','Total dairy cattle','Total sheep','Total deer',
'Total pigs','Total horses']]
sub_data = sub_data.replace('..',0)#replace .. with 0
sub_data = sub_data.astype('float')
sub_data.head()
Out[17]:
      Total beef cattle  Total dairy cattle  Total sheep  Total deer  \
1994            5047848             3839184     49466054     1231109
2002            4491281             5161589     39571837     1647938
2003            4626617             5101603     39552113     1689444
2004            4447400             5152492     39271137     1756888
2005            4423626             5087176     39879668     1705084
2006            4439136             5169557     40081594     1586918
2007            4393617             5260850     38460477     1396023
2008            4136872             5578440     34087864     1223324
2009            4100718             5860776     32383589     1145858
2010            3948520             5915452     32562612     1122695
2011            3846414             6174503     31132329     1088533
2012            3734412             6445681     31262715     1060694

Great! Now to make the plot easier to look at lets divide the dataframe by 1 million.

#now divide by a million
sub_data = sub_data.divide(1000000)
#first plot everything
sub_data.plot(legend=True)
plt.xlabel('Year')
plt.ylabel('Total stock (millions)')
plt.title('NZ farm stock')

Yes that is correct, back in the 90’s there were 50 million sheep in NZ, not bad for a country with a population of ~3 million people. Baa. But their numbers have been in serious decline since then, replaced by their bigger brothers the cows.
Poor sheepies!

Lets face it, NZ is all cows and sheep, lets just look at that data.

#lets just plot cows and sheep, that being the first 3 columns
cow_sheep = sub_data.ix[:,[0,1,2]]
cow_sheep.plot(label=True,title="Numbers of cattle and sheep in NZ")
plt.xlabel('Year')
plt.ylabel('Total stock (millions)')

fig2The data has meat cows and milk cows as separate columns, its easy to combine them.

#a milk cow and meat cow are still cattle
cow_sheep['Total cows'] = cow_sheep.ix[:,0] +  cow_sheep.ix[:,1]
cow_sheep.ix[:,2:].plot(legend=True,title="Total cattle and sheep in NZ")
plt.xlabel('Year')
plt.ylabel('Total stock (millions)')

fig_3So the country is cowafying, mostly dairy replacing beef cattle. Although the increase looks slight, basically its 25% up! That’s a lot of grass, cow shit, and methane!

Now lets look at the change in the numbers of each stock since 1994 (the start of our data). We do this by passing all the data as well as all the data minus 1994 to a function that handles the calculation. Pandas handles this all in the back end and parses the data in the frame through the function, nice.

def percent(dataframe,d1):
    '''calculate percent change relative to first column (1994)'''
    a = 100*((dataframe.ix[0,:]-d1)/dataframe.ix[0,:])
    return 0-a

#pass the entire data frame to this function
percent_data = percent(sub_data,sub_data.ix[0:,:])

percent_data.head()
Out[38]: 
      Total beef cattle  Total dairy cattle  Total sheep  Total deer  \
1994           0.000000            0.000000     0.000000    0.000000   
2002         -11.025827           34.444950   -20.002034   33.858009   
2003          -8.344764           32.882482   -20.041908   37.229441   
2004         -11.895128           34.207998   -20.609926   42.707754   
2005         -12.366101           32.506699   -19.379727   38.499840   

      Total pigs  Total horses  
1994    0.000000      0.000000  
2002  -19.100637     11.811093  
2003  -10.766476     18.504488  
2004   -8.072078     13.376472  
2005  -19.230733   -100.000000  
#the years are plotted as floats, its easy to convert them!
percent_data.index=percent_data.index.astype(int)
#figure 4
percent_data.index=percent_data.index.astype(int)
percent_data.plot(kind='barh')
plt.title('Percent change in stock in NZ since 1994')
plt.xlabel('Percent change since 1994')
plt.ylabel('Year')

fig_4I really want the graph the other way around, so lets re-index. Also, lets get rid of the nags, they look funny because the frame was missing some data. More proof that Far Lap was an Australian horse.

horseless = sub_data.ix[:,:-1]
horseless_per = percent(horseless,horseless.ix[0:,:])
#flip the axis
horseless_per = horseless_per.reindex( index=data.index[ ::-1 ] )
horseless_per.index=horseless_per.index.astype(int)
horseless_per.plot(kind='barh')
plt.title('Percent change in stock in NZ since 1994')
plt.xlabel('Percent change since 1994')
plt.ylabel('Year')

fig_8

Finally, its silly having the 1994 data as its zero, but it was a nice sanity check to make sure the % function was working correctly. But lets get rid of it now by just plotting part of the frame.

horseless_per[:-1].plot(kind='barh')
plt.title('Percent change in stock in NZ since 1994')
plt.xlabel('Percent change since 1994')
plt.ylabel('Year')

The ‘barh’ is just bar graph horizontal.
fig_9So it looks like there was a venison craze in the early 00’s, but mostly just more and more dairy.

#save the file
horseless_per.to_csv('stock_num_diff_average.csv')

Ok, so although there are still a LOT of sheep in NZ, it really is the country of the cow now. What might be driving that? Lets look at some commodity prices in the next post!

How to use DESeq2 to analyse RNAseq data

Featured

News: My colleagues at NZGL have developed an open source R based GUI for generating plots using cuffdiff data. One for deseq2 will be available soon! Feel free to check it out and get back to us with any suggestions. Only requires R-studio.

There is only one thing better than DESeq and thats DESeq2! The updated version is out and I’m keen to give it a whirl. Like with my old DESeq post, once again I am really just following the excellent DESeq2 manual, thanks again to the authors for the great documentation!

Just a quick warning that I haven’t tested this workflow extensively, let me know if things don’t work. Also, DESeq2 is new so some of the function might have changed, so if you have problems makes sure you check what version you are using versus what I used (see sessionInfo below).

The files I used for this are found here if you want to give them a go or follow along. I’ve already got a blow by blow run through on how to do use DESeq and much of that applies to the new package, so here I’ll just concentrate on some DESeq2 specific stuff as well as all the graphs. I’ve included some side by side comparisons between DESeq and DESeq2.

Installing is easy:

source('http://bioconductor.org/biocLite.R')
biocLite('DESeq2')

One important change is that now you can directly create the count table using raw HT-Seq-count output, and I’ll show you how to do that below. Remember HT-Seq-count will create a single file for each replicate of a condition (based on an SAM alignment file), so in my case with two conditions (control and treatment) and 3 reps each, that makes a total of 6 files. I called these files treated1.txt, treated2.txt, treated3.txt, untreated1 untreated2, untreated3 so that i can use grep to import them (explained below). Previously i would use a script to merge them all together, now DESeq2 allows you to import these files directory.

Below I set the directory where the files are located; use grep to catch all these files based on the string match “treated” that they all share (be carefully it doesn’t catch anything else), this is stored in sampleFiles. If you like you could just directly specify the files using “sampleFiles<-c(“treated1.txt”,..etc..”untreated3.txt”). Importantly, we need to setup the sampleConditions, with the same order as that found for the file names in sampleFiles (so it knows what each files represents). Finally we make a dataframe that becomes a deseq table. Note my “#” lines are just nonfunctional comment lines that I am using to print out the output from the screen!

library('DESeq2')
directory<-'/home/dwheeler/Desktop/BLOG/Dec_post2'
#use grep to search for the 'treated' part of filename to collect files
sampleFiles<-grep('treated',list.files(directory),value=TRUE)
# sampleFiles
#[1] 'treated1.txt'   'treated2.txt' 'treated3.txt'  'untreated1.txt'
#[5] 'untreated2.txt' 'untreated3.txt'

sampleCondition<-c('treated','treated','treated','untreated','untreated','untreated')
sampleTable<-data.frame(sampleName=sampleFiles, fileName=sampleFiles, condition=sampleCondition)
####
#sampleTable
#     sampleName       fileName condition
#1   treated1.txt   treated1.txt   treated
#2   treated2.txt   treated2.txt   treated
#3   treated3.txt   treated3.txt   treated
#4 untreated1.txt untreated1.txt untreated
#5 untreated2.txt untreated2.txt untreated
#6 untreated3.txt untreated3.txt untreated
######

ddsHTSeq<-DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory, design=~condition)
#####
#ddsHTSeq
#class: DESeqDataSet
#dim: 7921 6
#exptData(0):
#assays(1): counts
#rownames(7921): seq_1 seq_2 ... seq_7920 seq_7921
#rowData metadata column names(0):
#colnames(6): treated1.txt treated2.txt ... untreated2.txt
#  untreated3.txt
#colData names(1): condition
#######
colData(ddsHTSeq)$condition<-factor(colData(ddsHTSeq)$condition, levels=c('untreated','treated'))

The levels in colData are important because they are used in the log calculations; it makes sense to set untreated or control first so that the direction of the logs fold changes doesn’t confuse everyone (typically we do comparisons to the control)! Now for the guts of the DEseq2 analysis.

dds<-DESeq(ddsHTSeq)
res<-results(dds)
res<-res[order(res$padj),]
head(res)
#DataFrame with 6 rows and 6 columns
#          baseMean log2FoldChange      lfcSE      stat       pvalue
#
#seq_3146  997.5419      0.7894523 0.08297687  9.514125 1.832488e-21
#seq_1802  746.3972      0.5685789 0.08533961  6.662544 2.691282e-11
#seq_2146  406.1395      0.9424543 0.14108613  6.679993 2.389544e-11
#seq_7548  466.5453      0.6036683 0.10178158  5.931017 3.010637e-09
#seq_3240 1569.6556      0.6132326 0.11145966  5.501835 3.758596e-08
#seq_958   149.6504      0.7398193 0.14154162  5.226868 1.724055e-07
#                 padj
#
#seq_3146 1.299050e-17
#seq_1802 6.359498e-08
#seq_2146 6.359498e-08
#seq_7548 5.335601e-06
#seq_3240 5.328937e-05
#seq_958  2.036971e-04

Looking good, time for some plots. BTW I’m using the same dataset I used for the original DESeq blog post (links to data and that blog at top of page).

plotMA(dds,ylim=c(-2,2),main='DESeq2')
dev.copy(png,'deseq2_MAplot.png')
dev.off()
MAPlot of DESeq1 (left) and DESeq2  (right) for the same data

MAPlot of DESeq1 (left) and DESeq2 (right) for the same data

As expected for this dataset there are not many differentially expressed genes (red). There certainly is a difference in the level of scatter with this dataset using DESeq and DESeq2. Also note that there is good reduction in scatter for low count reads (left hand side of the graph) in DESeq2 versus the original version. DESeq tends to be a conservative approach, I like that, and with that in mind the update uses a test called cooks distance to remove outliers from the analysis. Cooks distance looks to see how much each sample contributes to a genes overall value fold change, with samples that cause extreme effects removed. To be specific, the gene will not be analysed for differential expression if one of its samples is considered an outlier. The idea being here that we want to see only DE genes that show a consistent pattern. The draw back of this approach is that there is a loss of power, as some genes that are truly DE will be removed before the statistical tests are performed.

We can save the table, and also print out some information on what the columns mean.

mcols(res,use.names=TRUE)
#DataFrame with 6 rows and 2 columns
#                       type

#baseMean       intermediate
#log2FoldChange      results
#lfcSE               results
#stat                results
#pvalue              results
#padj                results
                                                          #description

#baseMean                        the base mean over all rows
#log2FoldChange log2 fold change (MAP): condition treated vs untreated
#lfcSE           standard error: condition treated vs untreated
#stat            Wald statistic: condition treated vs untreated
#pvalue          Wald test p-value: condition treated vs untreated
#padj            BH adjusted p-values
#write the table to a csv file
write.csv(as.data.frame(res),file='sim_condition_treated_results_deseq2.csv')

BTW take this with a pinch of salt because its only a simple sample dataset, but the difference in gene counts is that deseq only found a single differentially expressed gene (at padj 0.1), whilst deseq2 called this same gene plus 23 others.  Also, reducing the cut-off multiple testing correction to 0.05 only removes 3 genes from the list with DESeq2.

Now we want to transform the raw discretely distributed counts so that we can do clustering. (Note: when you expect a large treatment effect you should actually set blind=FALSE (see https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html).

rld<- rlogTransformation(dds, blind=TRUE)
vsd<-varianceStabilizingTransformation(dds, blind=TRUE)

Here we choose blind so that the initial conditions setting does not influence the outcome, ie we want to see if the conditions cluster based purely on the individual datasets, in an unbiased way. According to the documentation, the rlogTransformation method that converts counts to log2 values is apparently better than the old varienceStabilisation method when the data size factors vary by large amounts.

The code and plot below shows the [nice] effect of the transformation.

par(mai=ifelse(1:4 <= 2, par('mai'), 0))
px     <- counts(dds)[,1] / sizeFactors(dds)[1]
ord    <- order(px)
ord    <- ord[px[ord]<150]
ord    <- ord[seq(1, length(ord), length=50)]
last   <- ord[length(ord)]
vstcol <- c('blue', 'black')
matplot(px[ord], cbind(assay(vsd)[, 1], log2(px))[ord, ], type=l, lty=1, col=vstcol, xlab='n', ylab='f(n)')
legend('bottomright', legend = c(expression('variance stabilizing transformation'), expression(log[2](n/s[1]))), fill=vstcol)
dev.copy(png,'DESeq2_VST_and_log2.png')
Graph showing variance stabilizing transformation for sample 1 (blue) and of the transformation f (n) = log2 (n/s1 ) (black)

Graph showing variance stabilizing transformation for sample 1 (blue) and
of the transformation f (n) = log2 (n/s1 ) (black)

The x axis is the square root of variance over the mean for all samples, so this will naturally included variance due to the treatment. The goal here is to flattern the curve so that there is consistent variance across the read counts, and that is what we got.

library('vsn')
par(mfrow=c(1,3))
notAllZero <- (rowSums(counts(dds))>0)
meanSdPlot(log2(counts(dds,normalized=TRUE)[notAllZero,] + 1), ylim = c(0,2.5))
meanSdPlot(assay(rld[notAllZero,]), ylim = c(0,2.5))
meanSdPlot(assay(vsd[notAllZero,]), ylim = c(0,2.5))

deseq2_stabilizing_comp

This interesting plot shows the standard deviation across all samples against the mean counts using three different methods of transformation. With this data you can see that the shifted logarithm method (left) seems to do pretty badly at for low count genes, with both regularized log (center) and DESeqs variance stabilisation (right) doing a much better job across the entire dynamic range of counts.

For some reason, everyone loves a good heat map!

library('RColorBrewer')
library('gplots')
select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:30]
hmcol<- colorRampPalette(brewer.pal(9, 'GnBu'))(100)
heatmap.2(counts(dds,normalized=TRUE)[select,], col = hmcol,
Rowv = FALSE, Colv = FALSE, scale='none',
dendrogram='none', trace='none', margin=c(10,6))
dev.copy(png,'DESeq2_heatmap1')
dev.off()
heatmap.2(assay(rld)[select,], col = hmcol,
Rowv = FALSE, Colv = FALSE, scale='none',
dendrogram='none', trace='none', margin=c(10, 6))
dev.copy(png,'DESeq2_heatmap2')
dev.off()
heatmap.2(assay(vsd)[select,], col = hmcol,
Rowv = FALSE, Colv = FALSE, scale='none',
dendrogram='none', trace='none', margin=c(10, 6))
dev.copy(png,'DESeq2_heatmap3')
dev.off()
heatmaps

heatmaps

The above shows heatmaps for 30 most highly expressed genes (not necessarily the biggest fold change). The data is of raw counts (left), regularized log transformation (center) and from variance stabilizing transformation (right) and you can clearly see the effect of the transformation has by shrinking the variance so that we don’t get the squish effect shown in the left hand graph.

Now we calculate sample to sample distances so we can make a dendrogram to look at the clustering of samples.

distsRL <- dist(t(assay(rld)))
mat<- as.matrix(distsRL)
rownames(mat) <- colnames(mat) <- with(colData(dds),
paste(condition,sampleFiles , sep=' : '))

#updated in latest vignette (See comment by Michael Love)
#this line was incorrect
#heatmap.2(mat, trace='none', col = rev(hmcol), margin=c(16, 16))
#From the Apr 2015 vignette
hc <- hclust(distsRL)
heatmap.2(mat, Rowv=as.dendrogram(hc),
symm=TRUE, trace='none',
col = rev(hmcol), margin=c(13, 13))
dev.copy(png,'deseq2_heatmaps_samplebysample.png')
dev.off()

dist_matrix

Although this result looks terrible, as we would expect samples to cluster by treatment, in this case I’m actually happy by this result. Why? Well this was actually a control experiment to show that slightly different (and unavoidable) experimental setup for the different samples, wasn’t responsible for the observed expression differences, so seeing that there is little treatment effect makes me happy. Remember, always try and do what Fisher tells us to, replicate, randomised, block.

Similarly the pca.

print(plotPCA(rld, intgroup=c('condition')))
dev.copy(png,'deseq2_pca.png')
dev.off()

deseq2_pca
I hope your’s looks better!

Previously we talked about the cooks distance treatment of outliers, in that a gene is thrown away if one of its samples is deemed to be an outlier. You may not want this to happen so DESeq2 we can take a different approach by replacing the outlier value with one estimated value as predicted by the distribution using the trimmed mean approach. DESeq2 recomends you only do this if you have several replicates per treatment, and indeed it automatically uses this feature if you have 7 or more replicates in your datatable.

ddsClean <- replaceOutliersWithTrimmedMean(dds)
ddsClean <- DESeq(ddsClean)
tab <- table(initial = results(dds)$padj < .1,
cleaned = results(ddsClean)$padj < .1)
addmargins(tab)
write.csv(as.data.frame(tab),file='sim_condition_treated_results_cleaned_summary_deseq2.csv')
resClean <- results(ddsClean)
write.csv(as.data.frame(resClean),file='sim_condition_treated_results_cleaned_deseq2.csv')

In my case it didn’t really make much difference.

Dispersion plot shows how the estimates are shrunk from the gene wise values (black dots) toward the fitted estimates, with the final values used in testing being the blue dots.

plotDispEsts(dds)
Dispersion plots deseq and deseq2 (right)

Dispersion plots deseq and deseq2 (right)

and compared DESeq1 and 2.

dispersion plots Deseq1 Left and right deseq2

dispersion plots Deseq1 Left and right deseq2

Now independent filtering to remove any tests that have little chance of pass to reduce the number of tests we have to perform, thus reducing the effects of multiple testing error. (false discovery). You can see how many genes are rejected based on different values of alpha (FDR)

#filtering threashold
attr(res,'filterThreshold')
#     10%
#91.48005
plot(attr(res,'filterNumRej'),type='b', ylab='number of rejections')
dev.copy(png,'deseq2_filtering_treshold.png')
dev.off()

deseq2_filtering threashold_Ed

W <- res$stat
maxCooks <- apply(assays(dds)[['cooks']],1,max)
idx <- !is.na(W)
plot(rank(W[idx]), maxCooks[idx], xlab='rank of Wald statistic',
ylab='maximum Cook's distance per gene',
ylim=c(0,5), cex=.4, col=rgb(0,0,0,.3))
m <- ncol(dds)
p <- 3
abline(h=qf(.99, p, m - p))
dev.copy(png,'deseq2_cooksdist.png')
dev.off()
cooks distance

cooks distance

Plot of the maximum Cook’s distance per gene over the rank of the Wald statistics for the condition.

Here more about independent filtering. What it shows in genes with very low counts are unlikely to have a significant p-value due to excessive dispersion at the left side of the dynamic range of counts. The y-axis here is -log10, so bigger numbers are smaller p-values (better).

plot(res$baseMean+1, -log10(res$pvalue),
log='x', xlab='mean of normalized counts',
ylab=expression(-log[10](pvalue)),
ylim=c(0,30),
cex=.4, col=rgb(0,0,0,.3))
dev.copy(png,'deseq2_indep_filt.png')
dev.off()

deseq2_indep_filtering

All those dots on the left hand side the graph represent failed tests due to very low count values, thus we can really just get rid of them to reduce our chance of making a type I error.

And again, you can see that only a few small (or no) p-values are discarded by the filtering. NOTE: You might only see blue lines [I’ve broken something??]

use <- res$baseMean > attr(res,'filterThreshold')
table(use)
h1 <- hist(res$pvalue[!use], breaks=0:50/50, plot=FALSE)
h2 <- hist(res$pvalue[use], breaks=0:50/50, plot=FALSE)
colori <- c('do not pass'='khaki', 'pass'='powderblue')
barplot(height = rbind(h1$counts, h2$counts), beside = FALSE,
col = colori, space = 0, main = '', ylab='frequency')
text(x = c(0, length(h1$counts)), y = 0, label = paste(c(0,1)),
adj = c(0.5,1.7), xpd=NA)
legend('topright', fill=rev(colori), legend=rev(names(colori)))

deseq2 pvals and mulittest
The graph on the left ranks the p-values from smallest to biggest (x-axis) and plots them. The black line is the actual p-value numbers (remember only about 23 genes had a p-value lower than 0.05). The red line has a slope that represents the number of tests divided by the false discovery rate (0.1). The idea here is the FDR is controlled at the 0.1% value for all tests that occur to the left of the right-most intersection of the black and red line.

The code for the right hand plot above.

resFilt <- res[use & !is.na(res$pvalue),]
orderInPlot <- order(resFilt$pvalue)
showInPlot <- (resFilt$pvalue[orderInPlot] <= 0.08)
alpha <- 0.1

plot(seq(along=which(showInPlot)), resFilt$pvalue[orderInPlot][showInPlot],
pch='.', xlab = expression(rank(p[i])), ylab=expression(p[i]))
abline(a=0, b=alpha/length(resFilt$pvalue), col='red3', lwd=2)
sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] gplots_2.12.1 RColorBrewer_1.0-5 BiocInstaller_1.12.0
[4] DESeq2_1.2.8 RcppArmadillo_0.3.920.3 Rcpp_0.10.6
[7] GenomicRanges_1.14.3 XVector_0.2.0 IRanges_1.20.6
[10] BiocGenerics_0.8.0

loaded via a namespace (and not attached):
[1] affy_1.40.0 affyio_1.30.0 annotate_1.40.0
[4] AnnotationDbi_1.24.0 Biobase_2.22.0 bitops_1.0-6
[7] caTools_1.16 DBI_0.2-7 DESeq_1.14.0
[10] gdata_2.13.2 genefilter_1.44.0 geneplotter_1.40.0
[13] grid_3.0.2 gtools_3.1.1 KernSmooth_2.23-10
[16] lattice_0.20-24 limma_3.18.5 locfit_1.5-9.1
[19] preprocessCore_1.24.0 RSQLite_0.11.4 splines_3.0.2
[22] stats4_3.0.2 survival_2.37-4 tools_3.0.2
[25] vsn_3.30.0 XML_3.98-1.1 xtable_1.7-1
[28] zlibb

[Updated July ’14: to fix errors with distance matrix plot, cooks distance, and the Benjamini-Hochberg multiple testing adjustment procedure (props to Stefan for pointing them out]

Arch linux on an eeepc 900

This is a summary of my attempt to install the Arch linux distro on my eeepc 900. I had some problems with the login manager and the desktop, but appart from that the base install was not that difficult if you follow the beginners guide. Arch is a hard core distro in that you have to set everything up yourself, but by doing that you get a better understanding of what is going on under the hood.

My little old eeepc 900 running on Arch

Download the Arch iso, I used an external CD-drive to install but alternatively a USB thumb drive could be used. To make it easier, initially I used gparted (http://gparted.sourceforge.net/) GUI to set the / to the 4gb SSD and /home to the 15gb SSD, I formatted both as ext4, which is not recommended because it will destroy your ssd, but apparently ext4 is SSD aware, I just need to work out how to turn journaling off.

Boot to the CD, chose “Boot Arch Linux”. This first section is almost word for word from the beginers guide, and that has a lot more detail, I have skipped the parts that are not relevant to the eeepc.

There was no wireless at startup, on the eeepc it is identified as “wlan0” and the interphase can be checked with, then bring it up and scan

#iwconfig
#ip link set wlan0 up
#iwlist wlan0 scan

Look for the ESSID:”yourwirelessnetworkname”

Backup the original file then modify and set it up for wpa encription.
# mv /etc/wpa_supplicant/wpa_supplicant.conf /etc/wpa_supplicant/wpa_supplicant.conf.original
# wpa_passphrase linksys "my_secret_passkey" > /etc/wpa_supplicant/wpa_supplicant.conf
#wpa_supplicant -B -Dwext -i wlan0 -c /etc/wpa_supplicant/wpa_supplicant.conf

Wait a few seconds for it to associate, now check and requiest an ip address, then ping google to check
#iwconfig wlan0
# dhcpcd wlan0
#ping -c 3 http://www.google.com

Now the hard drive
# fdisk -l

If you formatted your drives like I did (using gpartted) the output should indicate that you have sda1 which is your base and home is sdb1, since we used gparted I can skip all the ugly manual partitioning, so we just need to format them as ext4 (bad idea, see below).

# mkfs.ext4 /dev/sda1
# mkfs.ext4 /dev/sdb1

Remember that this is a bad idea, using ext2 would be safer, but I figure I can work out how to turn of journaling. Save your SSD, don’t make a swap partition!

Mount the base and the home and install the base and devel packages, then setup fstab
# mount /dev/sda1 /mnt
# mkdir /mnt/home && mount /dev/sdb1 /mnt/home
# pacstrap /mnt base base-devel
# genfstab -p /mnt >> /mnt/etc/fstab
#cat nano /mnt/etc/fstab

Modify the mount flags for drive to avoid unessary writes, more infomation at https://wiki.archlinux.org/index.php/Solid_State_Drives

Now install the graphics environment

pacman -S xorg-server xorg-xinit xorg-utils xorg-server-utils xterm

Now let’s enter the new system, start configuration by creating a hostname, or a name for your computer
#arch-chroot /mnt
#nano /etc/hostname

And add “daves-eeepc” or something like that, no quotes! Save [ctl][o] an exit [ctl][x], now modify the /etc/hosts and add the new name
#nano /etc/hosts
add the name as below
127.0.0.1   localhost.localdomain   localhost daves-eeepc
::1         localhost.localdomain   localhost daves-eeepc

Now configure the time, first find out the region and local time and settings
# ls /usr/share/zoneinfo/
America and then for the subdomain
# ls /usr/share/zoneinfo/America
I’m New_York
Now create a sim link to local time directory
# ln -s /usr/share/zoneinfo/America/New_York /etc/localtime
Now set up local settings and uncomment the en_US.UTF-8 UTF-8
setting.
# nano /etc/locale.gen
Then generate the file
# locale-gen

Setup system wide preferences by adding “LANG=en_US.UTF-8” to (dont inc quotes) the file shown using nano.
#nano /etc/locale.conf
Then export the settings and set the clock to utc
# export LANG=en_US.UTF-8
# hwclock --systohc --utc

Now set the network, check the /etc/rc.conf file make sure “network” is between the brackets, also add “net-auto-wireless” for our wireless.
DAEMONS=(syslog-ng network net-auto-wireless netfs crond)

Since we are wireless, we need to do a little work!
#exit
# pacstrap /mnt wireless_tools netcfg
# pacstrap /mnt wpa_supplicant wpa_actiond
# pacstrap /mnt zd1211-firmware
# arch-chroot /mnt

Set the interface in /etc/conf.d/netcfg to
WIRELESS_INTERFACE="wlan0"

Finally create the ramdisk environment, and install the grub boot loader on the root drive, and get rid of any error messages
# mkinitcpio -p linux
# pacman -S grub-bios
# grub-install --target=i386-pc --recheck /dev/sda
# mkdir -p /boot/grub/locale
# cp /usr/share/locale/en\@quot/LC_MESSAGES/grub.mo /boot/grub/locale/en.mo

Now set up grub and create a root password
# pacman -S os-prober
# grub-mkconfig -o /boot/grub/grub.cfg
# passwd

Nearly done with the base install, finally lets reboot

# exit
# umount /mnt/{boot,home,}
# reboot

Once we are back refresh pacman and lets add you as a user and install sudo

# pacman -Syy
# pacman -Syu
# adduser

Add the following, blank is just hit return to accept defaults, add contact details when asked if you like. If you make a mistake just delete and try again (# userdel -r [username])

Login name for new user []: dave
User ID ('UID') [ defaults to next available ]:
Initial group [ users ]:
Additional groups (comma separated) []: audio,games,lp,optical,power,scanner,storage,video
Home directory [ /home/dave ]:
Shell [ /bin/bash ]:
Expiry date (YYYY-MM-DD) []:

Now install and setup sudo, we need to edit the sudoer file with a special editor that has vi underneath, only use this editor as it has failsafes.

#pacman -S sudo
#visudo

Scroll through the file, find the “root  ALL=(ALL) ALL” and on the next line add yourself (without quotes) “dave   ALL=(ALL) ALL”
Get tab complete working

#pacman -S bash-completion

Now install the X-server for graphics

#pacman -S xorg-server xorg-xinit xorg-utils xorg-server-utils xterm
#pacman -S xf86-video-intel xf86-input-synaptics

This is where I ran into trouble following the online guides, we are going to install the XDM login manager and the LXDE desktop both are light and work well on the eeepc.

Firstly, login as a user then

#sudo pacman -S xorg-xdm
#cp /etc/skel/.xsession /etc/skel/.xinitrc ~

These are the launch scripts, xsession calls xinitrc, change the permissions on them
#chmod 744 ~/.xinitrc ~/.xsession

Now we need to increase the run level, in /ect/inittab edit the top and bottom lines by
removing the quotes:
…..
#id:3:initdefault:
…..
id:5:initdefault
…. Unhash (may already be) the line that refers to xdm which we will be using as our desktom manager
x:5:respans:/usr/sbin/xdm -nodeamon

Now install dbus

#sudo pacman -S dbus
#sudo nano /etc/rc.conf

add dbus to the array
DAEMONS=(syslog-ng dbus network net-auto-wireless netfs crond)

Now install lxde and configure open box

#pacman -S lxde
#mkdir -p ~/.config/openbox
#cp /etc/xdg/openbox/menu.xml ~/.config/openbox
#cp /etc/xdg/openbox/rc.xml /etc/xdg/openbox/autostart ~/.config/openbox
# pacman -S gamin
# pacman -S leafpad obconf epdfview

finally add a exec command to the .xinitrc you created during the XDM, add at the bottom (no quotes) “exec startlxde”

That should be it! Reboot and hopefully you will have a simple graphics login page and a functioning (if not light) desktop,

Sources
http://8thstring.blogspot.com/2011/11/arch-linux-on-eee-pc-900-chronicle.html
https://wiki.archlinux.org/index.php/Beginners%27_Guide
https://wiki.archlinux.org/index.php/XDM
https://wiki.archlinux.org/index.php/LXDE

Pandas, matplotlib and Ipython – all you need for great data anaylsis

I’ve been playing around a little with the stats-centric programming language R, mostly to get a better handle on the Bioconductor differential gene expression analysis packages edgeR and DEseq. The language is designed to allow for easy manipulation of tabular data as well as providing access to a rich library of statistical packages and graphing tools. One thing I learned was that often it was more efficient (at least for me) to spend a little time pre-formatting the data using python/perl before even getting started with R. The charting features of R are pretty cool, but once again I missed my familiar Python environment)-:

But, as always there is a Pythonic way, I came across a great book called Python for Data Analysis by Wes McKinney. The book introduced me to the pandas library, which contains R-like tools for tabular data manipulations and analyses. The book also introduced the Ipython development environment; basically a souped up feature rich but light weight “python IDLE”. The best features of Ipython for me are the logging capabilities, advanced history and de-bugging utilities – very cool! Ipython has been designed to work well with the matplotlib, thus allowing production and manipulation of nice looking graphs and charts within an interactive python environment. I’m still very much learning this (and I’m a hack programmer), but here is some fun data crunching based on a USDA food database wrangled into into Json format by Ashley Williams.

Ok, I think it would look better on one chart. Here is the top 50 protein containing foods with their fat content as part of one graph.

Note that I chopped the y-axis labels off, doh, no problem just modify them in real time using the slider editor!

Not too bad at all for a newbee. Hopefully by next post I will feel a little more comfortable to share some hard learned lessons on data manipulation using pandas.

My Review of Python for Data Analysis

Data processing heaven

By dave_au from Rochester on 7/7/2012

 

4out of 5

Pros: Well-written, Easy to understand, Helpful examples

Best Uses: Intermediate

I’ve only made it through the first few chapters, and the book is not complete, but I have been getting regular updates which is great. So far though it is well written and as far as I can tell there are very few technical errors.

It starts with a “Whetting your appetite” chapter where the author takes you on a journey through the possibilities of data analysis in Python; its a lot of fun.

The chapter on getting the most our of IPython has made this book worth the purchase price alone, its my new editor and I love it.

Highly recommended for Bioinformaticians or biologists who have to use R, but would prefer to stick with their beloved Python as much as possible.

(legalese)

Why perl for Science should die!

I think its fair to say the most popular programming/scripting language for Bioinformatics is Perl. My language of choice Python would certainly be coming in second. I reallly really don’t like perl, and I think it is a poor language for Science in general, here’s why!

1. Its hard to read

This is the biggest issue for me. Well commented Perl is actually pretty fun to read, its a punchy language so you don’t have to do too much scrolling to work out what is going on. However, the key point of that last comment is the word “comment”, poorly commented Perl scripts are a nightmare to read. And frankly, most of the scripts that I have read fall into this latter category. Turnover of people in Science is pretty high, so the idea using a language that makes it really hard for other people to just pick up where the other person left of is such a silly idea. The other key thing is that Science should be about transparency, and Perls silly syntax just adds another layer dust that hides mistakes and errors in logic. This is latter point is especially important when bioinformaticians are working with biologists who don’t know a computer language, essentially no oversight or review because the language is so god dam hard to understand unless you know the syntax.