Codon bias in Streptococcus

Introduction

This exercise is a continuation of the exercise from last week. As last week, the constants and data that we will use in this exercise are defined in the module "exerciseWeek4.py". Start by downloading and skimming it if you don't have it handy from last week.

Codon usage bias refers to differences in the frequency of occurrence of synonymous codons in coding DNA. A codon is a series of three nucleotides (a triplet) that encodes a specific amino acid residue in a polypeptide chain or signals the termination of translation (stop codons). There are 64 different codons (61 encoding amino acids plus 3 stop codons) but only 20 different translated amino acids. This surplus of codons allows many amino acids to be encoded by more than one codon. Because of this redundancy, the genetic code is said to be degenerate. Different organisms often show particular preferences for one of the several codons that encode the same amino acid; that is, one codon is found at a greater frequency than expected by chance. How such preferences arise is a much-debated area of molecular evolution.
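
The numbers above can be checked with a few lines of Python. The following is a minimal sketch that builds the standard genetic code from its usual compact string form (bases in the order T, C, A, G). The names codeTable and synonyms are illustrative only and are not part of exerciseWeek4.py, whose aminoAcidMap is assumed here to hold an equivalent codon-to-amino-acid mapping.

```python
from itertools import product

# Standard genetic code, one letter per codon, codons ordered TTT, TTC, TTA, TTG, TCT, ...
bases = "TCAG"
aminoAcids = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# All 64 triplets, first base varying slowest, paired with their amino acid ('*' = stop).
codons = [''.join(triplet) for triplet in product(bases, repeat=3)]
codeTable = dict(zip(codons, aminoAcids))

# Group synonymous codons by the amino acid they encode.
synonyms = {}
for codon, aa in codeTable.items():
    synonyms.setdefault(aa, []).append(codon)

print("codons: %d" % len(codeTable))
print("stop codons: %d" % len(synonyms['*']))
print("amino acids: %d" % (len(synonyms) - 1))   # do not count '*'
print("codons for leucine: %s" % ", ".join(sorted(synonyms['L'])))
```

Leucine, for example, has six synonymous codons, while methionine and tryptophan have only one each, so codon bias can only show up for amino acids with more than one codon.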

Exercise

Write a function, findCodonBias(orf). Given an open reading frame like the ones in the list exerciseWeek4.exactGenes, it should return a dictionary that maps each amino acid to another dictionary, which in turn maps each codon coding for that amino acid to its relative usage. Your returned dictionary should not contain entries for unused amino acids. You should handle both uppercase and lowercase input.

Example usage:

print findCodonBias(exerciseWeek4.exactGenes[0])
{'A': {'GCA': 0.0, 'GCC': 0.0, 'GCT': 1.0, 'GCG': 0.0},
 'C': {'TGC': 0.0, 'TGT': 1.0},
 'E': {'GAG': 0.33333333333333331, 'GAA': 0.66666666666666663},
 'D': {'GAT': 1.0, 'GAC': 0.0},
 'G': {'GGT': 0.33333333333333331, 'GGG': 0.0, 'GGA': 0.66666666666666663, 'GGC': 0.0},
 'F': {'TTC': 0.0, 'TTT': 1.0},
 'I': {'ATT': 1.0, 'ATC': 0.0, 'ATA': 0.0},
 'H': {'CAC': 0.0, 'CAT': 1.0},
 'K': {'AAG': 0.20000000000000001, 'AAA': 0.80000000000000004},
 '*': {'TAG': 0.0, 'TGA': 1.0, 'TAA': 0.0},
 'M': {'ATG': 1.0},
 'L': {'CTT': 0.0, 'CTG': 0.66666666666666663, 'CTA': 0.0, 'CTC': 0.0, 'TTA': 0.33333333333333331, 'TTG': 0.0},
 'N': {'AAT': 0.5, 'AAC': 0.5},
 'Q': {'CAA': 0.59999999999999998, 'CAG': 0.40000000000000002},
 'P': {'CCT': 0.5, 'CCG': 0.0, 'CCA': 0.5, 'CCC': 0.0},
 'S': {'TCT': 0.0, 'AGC': 0.0, 'TCG': 0.0, 'AGT': 0.5, 'TCC': 0.0, 'TCA': 0.5},
 'R': {'CGA': 0.33333333333333331, 'CGC': 0.0, 'AGA': 0.33333333333333331, 'AGG': 0.0, 'CGG': 0.0, 'CGT': 0.33333333333333331},
 'T': {'ACC': 0.0, 'ACA': 0.0, 'ACG': 0.0, 'ACT': 1.0},
 'W': {'TGG': 1.0},
 'V': {'GTA': 0.66666666666666663, 'GTC': 0.0, 'GTT': 0.16666666666666666, 'GTG': 0.16666666666666666},
 'Y': {'TAT': 1.0, 'TAC': 0.0}}

Think about what it is you need to do. From the dictionaries exercise you know how to count letters in a string using a dictionary. What you need to do here is similar, except that you have an extra layer of dictionaries.

After "uppercasing" orf you can call splitCodons(orf) to get a list of the codons in the ORF. You then loop over the codons, translating each codon into an amino acid aa using translateCodon(codon). Then, if your top dictionary is D, you can count your codons this way:

D[aa][codon] += 1

but before doing so you need to check each time whether the top dictionary has the aa key and whether the nested one has the codon key. If not, you need to create them and set the value of the latter to 0.
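
The counting step can be sketched as follows. This is only an illustration, not the reference solution: the codon list is written out by hand in place of the result of splitCodons(orf), and the small toyMap dictionary is a hypothetical stand-in for translateCodon(codon), so that the snippet runs without exerciseWeek4.py.

```python
# Hypothetical stand-in for translateCodon(): a tiny codon-to-amino-acid table.
toyMap = {'ATG': 'M', 'AAA': 'K', 'AAG': 'K', 'TGA': '*'}

# Stand-in for the list that splitCodons(orf) would return.
codons = ['ATG', 'AAA', 'AAA', 'AAG', 'TGA']

D = {}
for codon in codons:
    aa = toyMap[codon]
    if aa not in D:           # first time we meet this amino acid
        D[aa] = {}
    if codon not in D[aa]:    # first time we meet this codon
        D[aa][codon] = 0
    D[aa][codon] += 1

print(D)
```

After the loop, D['K'] holds the counts {'AAA': 2, 'AAG': 1}, and 'M' and '*' each have a single codon with count 1.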

Your dictionary should not include amino acids that are not in the ORF, but you want all the possible codons for each amino acid represented in your nested dictionaries. Your result would not correctly represent codon bias if the codons that were not used were left out. We include the codons that we did not see in the ORF by looping over exerciseWeek4.aminoAcidMap like this:

for codon, aa in exerciseWeek4.aminoAcidMap.items():

and then, if aa is in your top dictionary and codon is not in the corresponding nested dictionary, you set D[aa][codon] = 0.0.
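
A minimal sketch of this fill-in step, assuming a hand-made dictionary of counts and a miniature codon table. Here toyMap is a hypothetical stand-in for exerciseWeek4.aminoAcidMap and contains only five codons.

```python
# Hypothetical miniature of exerciseWeek4.aminoAcidMap (codon -> amino acid).
toyMap = {'ATG': 'M', 'AAA': 'K', 'AAG': 'K', 'GAA': 'E', 'GAG': 'E'}

# Counts as they might look after the counting loop:
# 'E' never occurred in the ORF, and the codon 'AAG' was never seen.
D = {'M': {'ATG': 1}, 'K': {'AAA': 3}}

for codon, aa in toyMap.items():
    # Only add codons for amino acids that are already in D.
    if aa in D and codon not in D[aa]:
        D[aa][codon] = 0.0

print(D['K'])
print('E' in D)
```

The unseen codon 'AAG' is added to D['K'] with the value 0.0, while the unused amino acid 'E' stays out of the top dictionary.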

Finally you need to normalize the counts so they become frequencies (i.e. count/total). To do that you need to loop over the keys in the top dictionary, use each key to retrieve all the counts for that amino acid, and sum them: nrCodons = sum(D[aa].values()). Then use a nested for loop (a for loop within the for loop) to loop over the keys (the codons) of the corresponding nested dictionary. In that nested for loop you can then divide each count by the total number of codons for that amino acid:

D[aa][codon] /= float(nrCodons)
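
The normalization can be tried out on a small hand-made dictionary of counts. The counts below are made up for illustration and do not come from exerciseWeek4.exactGenes.

```python
# Made-up counts: lysine seen five times, methionine twice.
D = {'K': {'AAA': 4, 'AAG': 1}, 'M': {'ATG': 2}}

for aa in D:
    # Total for this amino acid, computed before any count is changed.
    nrCodons = sum(D[aa].values())
    for codon in D[aa]:
        D[aa][codon] /= float(nrCodons)

print(D['K']['AAA'])
print(D['K']['AAG'])
print(D['M']['ATG'])
```

Note that the total is computed once per amino acid, before the nested loop starts. Recomputing it inside the nested loop would give wrong results, because some of the counts would already have been turned into frequencies.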

All that remains then is to return the finished dictionary.

Write a function printCodonBias(orf) that, given a DNA sequence, prints a pretty table showing, for each amino acid, the relative usage of each codon coding for that amino acid.

Example usage:

printCodonBias(exerciseWeek4.exactGenes[0])
A:
   GCA:   0%
   GCC:   0%
   GCT: 100%
   GCG:   0%
C:
   TGC:   0%
   TGT: 100%
E:
   GAG:  33%
...
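
One way to produce this layout is with string formatting. The sketch below is an assumption about how the table could be built, not the reference solution. The helper name formatCodonBias is hypothetical, and it takes a ready-made dictionary of frequencies (such as the one returned by findCodonBias) rather than an ORF. It sorts the keys to get a stable order and rounds the percentages, both of which are choices made here and may differ from the example above.

```python
def formatCodonBias(bias):
    """Return the table lines for a nested {aa: {codon: frequency}} dictionary."""
    lines = []
    for aa in sorted(bias):
        lines.append("%s:" % aa)
        for codon in sorted(bias[aa]):
            # %3.0f right-aligns the rounded percentage in three characters.
            lines.append("   %s: %3.0f%%" % (codon, 100 * bias[aa][codon]))
    return lines

# Hand-made frequencies for two amino acids.
bias = {'C': {'TGC': 0.0, 'TGT': 1.0},
        'E': {'GAG': 1 / 3.0, 'GAA': 2 / 3.0}}

for line in formatCodonBias(bias):
    print(line)
```

A printCodonBias(orf) function could then call findCodonBias(orf) and print the resulting lines one by one.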