Caveat: Many of the intricacies of Molecular evolution are fuzzy to me!
Intro: Due to the triplet code, changes in third codon position are mostly silent (synonymous), in that they do not result in a change to the encoded amino acid. In contrast, changes that occur in first and second codon positions nearly always result in an amino acid substitution (non-synonymous). Non-synonymous changes are rare as they mostly cause deleterious mutations that are removed from the population through purifying selection. However, in cases where positive selection is acting on a protein sequence, accumulation of changes that lead to the adaptive shift in protein function tend to result in an excess of Non-synonymous mutations. This means the ratio of non-silent to silent changes, as a proportion of each kind of site in the sequence, can be used as a measure of adaptive evolution. More specifically, a dNdS ratio of greater than 1 is used to indicate that a protein is/has been subject to positive selection. Here I’ll describe some of the important parameters when using arguably the most popular method for detecting selection based on dNdS in coding genes, that being the PAML package.
Preliminaries: The predominant program that we will use is called codeml, which (for the most part) takes an alignment of homologous coding regions. The most important aspect of the alignment is that the codon positions are correctly aligned. Codeml make calculations based on the nucleotide positions in the coding triplet, so the the open reading frame in the DNA needs to be maintained if the alignment contains indels or missing bases. The best way to guarantee this outcome is to constrain the DNA alignment based on a protein alignment (i.e. any gaps ‘-’ should be inserted as ‘—’ as not to disrupt reading frame) and in a sense any phylogeny based on coding DNA sequences should always be based such an alignment (evolution acts at the level of the protein after all). A tool for doing this is called transalign, but be warned that 1) your protein and DNA sequences need to be completely complementary (DNA must = protein, no partial or extra triplets etc) (2), they need to share the same names (3), they must be in the same order.
Pairwise dNdS analysis: Codeml reads a control file that contains the run parameters, typically this file is called codeml.ctl, the basic file for doing pairwise dNdS is below.
seqfile = seqfile.txt * sequence data filename
outfile = results.txt * main result file name
treefile = treefile.txt *tree file
noisy = 9 * 0,1,2,3,9: how much rubbish on the screen
verbose = 1 * 1:detailed output
runmode = -2 * -2:pairwise
seqtype = 1 * 1:codons
CodonFreq = 0 * 0:equal, 1:F1X4, 2:F3X4, 3:F61
model = 0 *
NSsites = 0 *
icode = 0 * 0:universal code
fix_kappa = 1 * 1:kappa fixed, 0:kappa to be estimated
kappa = 1 * initial or fixed kappa
fix_omega = 0 * 1:omega fixed, 0:omega to be estimated
omega = 0.5 * initial omega value
I’ve highlighted some important parameters in red, “seqfile” and “outfile” are self explanatory, except to say that the infile needs to be in a “typical” format, I use fasta because it is what I use(-;. The “runmode” is set to -2 to indicate that you are doing pairwise dNdS. “CodonFreq” and “kappa” incorporate parameters into the model that account for biological relevant aspects of protein evolution. CodonFreq controls the codon frequency parameter of the model, setting it to 0 implies that all codons are equally likely, 1 assigns the value based on the average nucleotide frequencies, 2 assigns the values based on the average frequencies of nucleotides in the codon, 3 allow the parameters to be estimated freely. All codons are generally not equally likely due to differing tRNA availability, ignoring these processes can lead to biases in the estimated dNdS. Kappa is the transition to transversion ratio. The chemical properties of DNA mutation create a bias toward transitions, and as transitions at third codon positions are more likely to be silent than transversions, ignoring this parameter nearly always results in an overestimation of dS and thus a correspondingly underestimation of dNdS. If the kappa value is set to 1, the transition bias is ignored, while setting it to 0 allows it to be estimated from the data. With these reference to these parameters, counting models such as NG86 (Nei and Gojobori (1986)) can be mirrored by setting K=1 and codonFreq = 0. Alternative settings for these values comes at a cost in that it increase the number of parameters that need to be estimated, however, their incorporation into the model makes for a more biologically relevant estimation of dNdS.
Final notes: It should be noted that these estimates of dNdS are not very sensitive to detecting positive selection. This is for two main reasons, firstly many amino acids are structurally important for the proteins core function and are thus correspondingly nearly invariably, this results in positive selection acting on a relative small number of codons. Secondly, selection tends to be episodic (occurs in spurts). This means it is difficult to detect positive selection by just taking the average dNdS across the entire protein from a single pairwise comparison. To overcome this methods have been developed that detect changes in dNdS across a phylogeny derived from multiple sequence alignment, and I’ll focus on these methods in the next addition!
I’ve been very slack with references for this post because I wonted to focus on the core considerations of using codeml, however, most of the above is based on reading the excelling coverage of this topic in: Maximum likelihood methods for detecting adaptive protein evolution
Bielawski, JP and Yang, Z (2005) Maximum likelihood methods for detecting adaptive protein evolution. In: Nielsen, R, (ed.) Statistical Methods in Molecular Evolution. (103 – 124). Springer-Verlag: New York.