sequences | Dave the Data Dero

What is bash

Bash programming is a form of scripting that originates from the early days when a computer was a keyboard with a flashing cursor on a monochrome screen!

Credit: Wtshymanski @ wikipedia

Bash acts as a command line interpreter, taking input from a user (via the keyboard) and using it to make the computer to act in a certain way. Many of these commands are small scripts that do a multitude of tasks, from text processing to network communications. Although most of us now use a graphic user interface (GUI) to communicate with our computers, a basic understanding of bash programing can dramatically improve our productivity in certain situations. This is especially with dealing with the large files generated by most bioinformatic applications.

Where do I get this bash program

Well technically speaking Bash is not really a program, its more a command line interpreter and a scripting language. That said it is available by default on any NIX (Unix-like system ie ubuntu or fedora) Macintosh (OSX) and windows using a free program called cygwin (there are others as well.

Enough already!

First thing, don’t copy and paste! Why? Typing these commands is part of the hard wiring required to make them stick! Also, “with great power comes great responsibility”, and bash has the power to wipe your computer – no trash recovery – completely gone forever (except for that backup – right?)! By typing commands it gives you another opportunity to think about the command before you execute it and typing is more likely to trigger a big red flag in dangerous situations! Think twice, type once.

Ok, Ok, in the following tutorial I have attempted to provide command line solutions to some common problems faced by life scientists who have to deal with files generated by next generation sequencing technologies. I’ll probably do a few of these as with the explanations they get quite long. Of course, this is not an attempt to teach you to be a terminal master, that can only come from practice, but hopefully it will give you a taste of the power hiding behind that drab terminal exterior. It is also not designed to teach you how to program a computer, but I’m hoping this might be the gateway drug you need to get started.

The solutions in this tutorial were tested on a linux machine (Ubuntu), but should mostly work on Mac and windows through cygwin.

Problem 1. I have this huge file of DNA sequences in FASTA format (>seq1\nATGTAA) and I want to know how many sequences are in there?

Solution. This is a common problem, if your text editor manages to open the file without freezing you are still left with the problem of counting sequences. To our rescue is grep, grep is a little application that searches for strings. In our case as this is a fasta file we can search and count instances of the “>” to let us know how many sequences are in the file. And yes, this is a small file and you probably could have opened it on a 10 year old computer, but use your imagination on this one. Very important put the > in single quotes(‘>’)!!

grep ‘>’ fasta_example.fas

—————-output—————-

>fid|47273465|locus|VBIADEClo150191_2639| >fid|47286894|locus|VBIADEClo157809_4509| >fid|47397394|locus|VBIAERSal180824_4606| >fid|20665099|locus|VBIAciCap40988_0842| >fid|20666571|locus|VBIAciCap40988_1572| >fid|20668069|locus|VBIAciCap40988_2310| >fid|46816701|locus|VBIAciSp155132_0285| >fid|46817921|locus|VBIAciSp155132_0889| >fid|38082798|locus|VBIAciSp155501_0265| >fid|38086805|locus|VBIAciSp155501_2252| >fid|38088330|locus|VBIAciSp155501_3002| >fid|38088628|locus|VBIAciSp155501_3149| >fid|38090581|locus|VBIAciSp155501_4107| >fid|85865228|locus|VBIActMas228228_1098| >fid|101184055|locus|VBIActMas238937_2668|

So grep found all the lines with the ‘>’, but thats not exactly what we wanted. How do we count the lines. Now is a good time to introduce the manual entry (using the man command) available for all bash commands. This manual entry for grep is rather long, however it contains the information we need (highlighted in red below).

man grep

—————-output—————-

GREP(1) GREP(1)NAME

grep, egrep, fgrep, rgrep – print lines matching a pattern

SYNOPSIS

grep [OPTIONS] PATTERN [FILE…]

grep [OPTIONS] [-e PATTERN | -f FILE] [FILE…]

…………….shortened for clarity…………………

General Output Control

-c, –count

Suppress normal output; instead print a count of matching lines

for each input file. With the -v, –invert-match option (see

below), count non-matching lines. (-c is specified by POSIX.)

—————-end of output—————-

It looks like ‘-c’ flag (or switch) will cause grep to count the lines that match the search string rather than just print them out – perfect! Before we try it out its worth taking a second to look at another important part of the manual that specifies how to use grep. BTW you can use either -c or –count, the double minus sign just specifies the longer format of the flag that is often a little more human readable.

grep [OPTIONS] PATTERN [FILE...]

The square brackets are not used as part of the command, they merely specify that whatever is in the brackets is OPTIONAL. So the minimum command is “grep pattern”, with the possibility of adding a OPTION flag/switch and a FILE to search.This time let’s add our optional flag and repeat the search, note that the command order fits the structure described in the manual (grep ->option-> pattern ->file) .

grep -c ‘>’ fasta_example.fasta

—————-output—————-

Thats more like it. But reading through the manual was a little painful, we know we want to count something, so why don’t we use grep to search through the manual entry for grep, to find a line that uses the string ‘count’. Circular enough for you?To do that we need to introduce what are called pipes (“|”), they are the bar symbol on the keyboard (SHIFT backslash on my computer). First the command, then the explanation.

man grep ‘count’ | grep ‘count’

—————-output—————-

-c, –count

Suppress normal output; instead print a count of matching lines

below), count non-matching lines. (-c is specified by POSIX.)

-m NUM, –max-count=NUM

–count option is also used, grep does not output a count

Large repetition counts in the {n,m} construct may cause grep to use

Sweet as Bro! Pipes are a redirection, that is rather than send the output of a program to the screen (called “stdout”) or a file etc, send it to another program. In this case we redirect the output from the man program to the grep program (that the manual entry is about grep is irrelevant to your stupid computer) so that grep can look for the string “count”.

A quick aside: To ‘single quote’, “double quote”, or Not to quote, that is a good question! This will be covered a little later, but for now it is important to use single quotes whenever you want to specify a string of letter(s). We have already used one situation where this can cause problems, for example: grep > filename acts very differently to grep ‘>’ filename. As you will learn now.

Since we are talking about pipes now is a good time to talk about another type of redirection. The > symbol means redirect the output from a command to a file, which it will overwrite if it already exists (>> will append the output onto the end of the file). So

grep -c ‘>’ fasta_example1.fasta > tmp.txt

Will output nothing to the screen, rather it will put the number 15 in a file called tmp.txt. You’ll see in a moment how useful this can be when dealing with sequence files. So what do you think the grep > fasta_example1.fasta (without the single quotes around the > symbol) might do to your wonderful sequence file (Hint, it won’t end well, so if you want to try it use some throw away file)? Well time to get back to some biology!

Problem 2. I have this huge file of DNA sequences in FASTA format (>seq1\nATGTAA) and I want to split it in half (as best I can).

Solution 2. Lets combine what we have learnt with a couple of new commands to solve this problem. Firstly, lets just count the sequences to find out how many each file will have to include for the files to be split in two.

grep -c ‘>’ fasta_example1.fasta

———-output—————-

Well thats not an even number, so lets just put 7 in the first file and 8 in the second file. But first, two new commands, head and tail.

head fasta_example.fasta

—————-shortened output—————-

>fid|47273465|locus|VBIADEClo150191_2639| MIKISEDYIEKRFDNELLRIEAWGKNSLRIRSFVDQNFVDENYALNEKPKLNKDDITINKNEDGSAIIKNGKIKAVLDHRDRITFYNDKNEILLKEYIRLRAVKHDDGGEDVGTIEITKDFNSTLKLKSREY

tail fasta_example.fasta

—————-shortened output—————-

RSGLYEIHQRFRAYPGERIYGLGQHTHGRLDHKGLVLDLVQRNAEVSIPFYLSSRGYGFLWNNPAVGRVEFAENGTRWVADAAPGLDYWVTAGKTPAEILEHYADAVGHAPVLPEWATGFWQCKLRYLGQEELLDVAREYRRRELPLSVIVTDFFHWTAMGDYRFDPEEYPDPEAMMRELDELGVKLMVSIWPTISPLSENYDAMAEQGLLVGADQGVEFHQDIHDKKMPRSLPVAFYDP

Head by default outputs the first 10 lines of a file, whilst tail outputs the last 10 lines. You can change this default behaviour using the -n flag. Lets start by finding out what the 7th sequence is in the file.

grep ‘>’ fasta_example1.fasta | head -n 8

—————-output—————-

>fid|46817921|locus|VBIAciSp155132_0889|

Normally the first part of this command just prints each line that contains the ‘>’ FASTA symbol, but here we piped the output from grep to head, using the -n flag to show only the first 8 lines. So what? Well remember that you file might have 1000s of sequences, finding out which one is the 501’st might be a little more tricky if you lived in a world without grep, head and tail. So now we know that “>fid|46817921|locus|VBIAciSp155132_0889|” is the first sequence that needs to go in our second file. Now once again after looking at the manual page for grep we see that the -n flag reports the line number of the pattern match.

grep -n ‘>fid|46817921|locus|VBIAciSp155132_0889|’ fasta_example1.fasta

—————-output—————-

102:>fid|46817921|locus|VBIAciSp155132_0889|

There are multiple ways of doing this. We could use another command wc -l filename which counts the number of lines in a file, then subtract 102 from this number and feed that to tail, basically to print out the last x lines. However, rather than introducing a new command (too late) lets look at the man page for tail and see if we can do what we want that way.

man tail

—————-output shortened—————-

-n, –lines=K

output the last K lines, instead of the last 10; or use -n +K to

output lines starting with the Kth

Great, so all we need to do is pass the line number we want tail to start from with the switch -n and the “+” symbol. Putting this all together.

head -n 101 fasta_example1.fasta > first_half.fasta
tail -n +102 fasta_example1.fasta > second_hald.fasta
grep '>' first_half.fasta

Lets quickly check to see if it worked.

grep ‘>’ second_half.fasta

>fid|46817921|locus|VBIAciSp155132_0889| >fid|38082798|locus|VBIAciSp155501_0265| >fid|38086805|locus|VBIAciSp155501_2252| >fid|38088330|locus|VBIAciSp155501_3002| >fid|38088628|locus|VBIAciSp155501_3149| >fid|38090581|locus|VBIAciSp155501_4107| >fid|85865228|locus|VBIActMas228228_1098| >fid|101184055|locus|VBIActMas238937_2668|

Perfect! 7 sequences in the first file, 8 in the second. As I said previously this will work just as well with 100000000 sequences as it did in this small example!

Dave the Data Dero

Tag Archives: sequences

Bash for Biologists: A gateway to the terminal!

What is bash