Finding candidate proteins in Streptococcus

Introduction

In this exercise we will be looking at DNA from Streptococcus bacteria. We will work with the detection of open reading frames in this DNA and with the translation from DNA to amino acid sequences. The purpose of these exercises is to get familiar with strings, list and dictionaries and to learn and use some common methods on these data types. Another pupose is to give you an example of breaking down a larger problem into smaller more manageable sub-problems.

Read Wikipedia: Genetic Code and Wikipedia: Open Reading Frame if you do not already know of DNA to amino acid translation and open reading frames.

Start by reading through the exercises before you start coding, so you have a good overview of the tasks a head. As you will see there is quite a bit of hits to help you on you way.

Useful constants and data that we will use in this learning path are defined in the module "exerciseWeek4.py". Start by downloading and skimming it.

Exercise: Translating Open Reading Frames

In this exercise we write code to translate an open reading frame on a DNA sequence into single letters representing the corresponding sequence of amino acids.

Translating a single codon

Write a function, translateCodon(codon) that takes a string codon, and if codon is a valid codon it returns codon translated into the corresponding amino acid. If codon is not a valid codon, the function should return '?'.

Example usage:

print translateCodon('ACG')
'T'

For mapping codons to amino acids you can use the dictionary exerciseWeek4.aminoAcidMap. Notice here, though, that it only maps strings in upper case, so you must make sure that codon is in upper case for the look up. You can translate codon into an upper case version of itself using the upper() method:

codon = codon.upper()

Before you start coding you should outline for yourself intuitively what you need to do to implement the desired functionality. In this case want to translate, or map, between a three letter sequence, codon, and the corresponding one letter symbol for the amino acid that the codon corresponds to. You already have a dictionary, exerciseWeek4.aminoAcidMap that does just that. Your key is the string, codon, with the three bases and the value you retrieve using that key is the letter symbol for the amino acid. Try that out first. E.g.:

import exercisesWeek4

codon = "TTG"
aminoAcid =  exerciseWeek4.aminoAcidMap[codon]
print aminoAcid

What you then need to do is to create a function that takes a string as an argument and returns the corresponding amino acid symbol after having looked it up in the dictionary. Make a function that does just that before you go on.

Before you are completely done you need to make your function handle the situation when the argument to the function is not a key in the exerciseWeek4.aminoAcidMap dictionary. Use an if-else construct to handle the two cases. The boolean expression needs to test if codon is a key in exerciseWeek4.aminoAcidMap. Remember the has_key() method? - or the even cooler in operator on dictionaries?

Splitting an ORF into codons

To translate an entire open reading frame into the corresponding amino acid sequence, we need to split the DNA sequence into codons. Then we can translate each codon using the function translateCodon(codon) we just wrote.

Write a function, splitCodons(orf), that takes a string orf and splits it into triplets returning a list of these triplets.

Example usage:

print splitCodons('atcgagcttadcgta')
['atc', 'gag', 'ctt', 'adc', 'gta']

Divide the exercise into simpler tasks like above. You need to loop over the sequence to perform operations on it. Start by writing a function that prints each character:

def splitCodons(orf):
    for i in range(len(orf)):
        print orf[i]

Try now to figure out how you can modify the function to make it move over the sequence in jumps of three. Look at the documentation for the range function by typing help(range) in the interactive Python shell to see how you can make range return a list of numbers with increments of three like this: 0, 3, 6, 9, 12, ... . Modify your function so that it now prints every third character.

What you want is obviously not every third character but the three characters that follow every third character. You can use index i in the for loop to get the corresponding codon using slicing like this:

codon = orf[i:i+3]

Modify your function so that it prints each codon.

Now all that remains is to put each codon on a list that we can return from the function. You can define a list codonList before your for loop so you have a list to add codons to:

codonList = []

Each codon you slice out of the orf you add to the end of codonList using the append method:

codonList.append(codon)

Modify your function so that it creates and returns a list of all the codons and you are done.

Translating an ORF

Now write a function, translateSequence(orf), that takes a DNA sequence as input and translates it into the corresponding amino acid sequence.

Example usage:

print translateSequence('atcgagcttadcgta')
'IEL?V'

The function should first split the ORF into codons. Use your splitCodons(orf) function to take care of that. So start by writing a function that just calls and prints the result of splitCodons(orf):

def translateSequence(orf):
    codonList = splitCodons(orf)
    print codonList

Make that work first.

Now modify the function by using the translateCodon(codon) function to translate each codon into the corresponding amino acid independently. Instead of printing the codonList create a for loop that iterates over the codons in codonList so you can translate each codon using translateCodon(codon) and print the amino acid letter.

Now modify the function so each amino acid letter is appended to a list letterList in the same way you appended codons to codonList in the previous exercise. Make the function print the list so you can see it looks okay.

Now all you need to do is to turn the list of letters, letterList, into a string that you can return from the function. Use the join() method on an empty string to create a string from the list of amino acids.

aminoAcids = ''.join(letterList)

Exercise: Finding Open Reading Frames

In this exercise we write code that finds and translates open reading frames in a DNA sequence.

Finding the start positions of ORFs in a DNA sequence

Write a function, findStartPositions(seq), that takes a DNA sequence seq as input and returns a list of the positions where the start codons occur. You can get the start codon from exerciseWeek4.startCodon. exerciseWeek4.approximateGenes is a list of DNA sequences that each contain a gene (and thus at least one open reading frame). You can use one of these sequences as input to your function.

Example usage:

print findStartPositions(exerciseWeek4.approximateGenes[0])
[21 39, 47, 94, 197, 395, 450, 458, 539, 568]

Your function should contain a for loop that iterates over all possible positions in the seq where a codon can begin. Not surprisingly these are all the positions except for the last two. So start out with this:

def findStartPositions(seq):
    for i in range(len(seq) - 2):
        print i

Now, instead of just printing i try and make it print the three bases following i using the slicing technique you also used in the splitCodons(orf) function. E.g. if you sequence is 'TGACTG' it should first print 'TGA' then 'GAC' then 'ACT' and so on.

When you have this working you need to replace the printing of codons with an if construct that tests if each triplet is equal to exerciseWeek4.startCodon. Try and make your function only print i when i is the start of an ORF.

Finally, modify the function so all the i values are collected in a list using the same technique as in the splitCodons(orf) function and then return this list from the function.

Finding the next specific codon in an ORF

Now that we can find where the ORFs begin in our sequence we must also be able to identify where each of these end. As you know an ORF ends at any of three different stop codons in the same reading frame as the start codon. So we need to be able to find the next occurence of some specific codon starting from the beginning of the reading frame.

Write a function, findNextCodon(seq, start, codon), that returns the index for beginning of the next occurrence of the codon, codon, in seq when starting at index start and staying in the reading frame of start, i.e. you should look at codons in jumps of three. If the function does not find codon in the string it should return None.

Example usage:

print findNextCodon('AAAAATTTAATTTAA', 1, 'TTT')
10

Your function should contain a for loop that iterates over all the relevant starts of codons. Remember that no valid codon can start at the last two positions in the sequence. E.g. if start is 7 and len(seq) is 20 then the relevant indexes are 7, 10, 13, 16. Figure out how to make the range function return the appropriate list of numbers.

Start by writing a function just with this for loop that prints the numbers you get from the correct use of range.

def findNextCodon(seq, start, codon):
    for i in range(  ??  ):
         print i

When you have that working use the slicing technique to print the codons starting at each position instead.

Finally, add an if statement that tests if each codon is equal to codon and if so makes the function return i.

Finding the next stop codon in an ORF

We need to write a function, findNextStopCodon(seq, start) that returns the index for the beginning of the next stop codon in seq starting in position start and staying in start's reading frame. If there is no stop codon in start's reading frame the function should return None. You can find the stop codons in exerciseWeek4.stopCodons (and you probably want to reuse the function you just wrote).

Example usage:

print findNextStopCodon('AAATAGATGAAAA', 1)
7

This is a tricky function so we will walk through it together:

def findNextStopCodon(seq, start):
    seq = seq.upper()
    results = []
    for stopCodon in exerciseWeek4.stopCodons:
        pos = findNextCodon(seq, start, stopCodon)
        if pos != None:
            results.append(pos)
    if len(results) > 0:
        return min(results)
    else:
        return None

First we upper case the seq. Then we define a list to hold the indexes for our candidate stop codons. Then we loop over the possible stop codons to find the next of each in the reading frame. If findNextCodon(seq, start, stopCodon) does not find anything it returns None. We only want to keep results that are not None which is why we test if pos != None. At the end we test if we have any indexes in our results list. If we do find indexes for stop codons we return the smallest of them which is the one closet to the start If we do not find any stop codons the function returns None by default.

Finding ORFs

Write a function, findOpenReadingFrames(seq). It should return a list of all open reading frames in the input sequence seq, where an open reading frame is represented by a tuple (start, stop). start is the position of the start codon and stop is the position of the stop codon. The function should handle both uppercase and lowercase inputs.

Example usage:

print findOpenReadinFrames(exerciseWeek4.approximateGenes[0])
[(21, 48), (39, 48), (47, 89), (94, 154), (197, 503),
 (395, 503), (450, 456), (458, 503), (539, 569), (568, 634)]

First write a function that first uses findStartPositions(seq) to get a list of all the start positions in seq: codons in seq:

def findOpenReadingFrames(seq):
    startPositions = findStartPositions(seq)

When you have that working add a for loop that iterates over the start positions so you can find the position of the next stop codon for each start position. Use findNextStopCodon(seq, start) for that. Try and print the start and end indexes you find.

Finally, you need to add the (start, end) pairs to a list so you can return them. To append a pair to a list, you need to write:

L.append((start,stop))

notice the double parenthesis!

Test your function. Chances are that some of the end positions you get are None. This is because there was no stop codon for some of the start codons found. Add an if statement to your function to control that only pairs with valid stop codons are added to the list of results.

Finding AND translating ORFs

Write a function findAndTranslateORFs(seq). Given a DNA sequence seq it should return a list of the translations of all open reading frames in the sequence. It should handle both uppercase and lowercase inputs.

Example usage:

print findAndTranslateORFs(exerciseWeek4.approximateGenes[0])
['MPKKLAMTK*', 'MTK*', 'MILLLYCGAVKIQL*', 'MIQEVATRSVALPRNRCRFS*',
'MANKKIRIRLKAYEHRTLDTAAEKIVETATRTGATVAGPVPLPTERSLYTIIRATHKYKDSREQFEMRTHKRLVDIINPTQKTVDALMKLDLPSGVNVEIKL*',
'MRTHKRLVDIINPTQKTVDALMKLDLPSGVNVEIKL*', 'ML*', 'MKLDLPSGVNVEIKL*',
'MELEHELNST*', 'MKKINLPRNRSFCVRFSIFILS*']

The function should call findOpenReadingFrames(seq) to get the list of (start, end) pairs and then use a for loop to call translateSequence(seq[start:stop+3] on each pair:

def findAndTranslateOpenReadingFrames(seq):
    for start, stop in findOpenReadingFrames(seq):
        print translateSequence(seq[start:stop+3])

All that remains then is to add each amino acid sequence to a list that the function can return.

Hint: To check your result note that all returned sequences should start with a start codon 'M', end with a stop codon '*' and contain no stop codons in the middle.