This page contains both explanations of the topics we cover and exercises for you to do. Read through the page and complete the exercises at the end. Spend any remaining time playing around with the code examples in the explanatory text. Remember that the key here is learning by doing. Type some of the code examples into your IDLE editor and try them out yourself, and be curious: see what happens if you change things a bit.

Creating dictionaries

Dictionaries are data structures that work as tables, mapping from "keys" to "values". A value can be anything, e.g. a number, a string, a dictionary or a list. The key is the name you assign to that value so you can retrieve it when you need to use it. Keys can be almost anything as well, as long as they cannot be changed once created (so a list or a dictionary cannot be used as a key), but usually strings or numbers are used.

You create an empty dictionary like this:

D = {}

and you can then assign values to a key like this:

D["firstname"] = "Doris"

D["lastname"] = "Patterson"

Here "firstname" and "lastname" are the keys and "Doris" and "Patterson" are the associated values.
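As a small sketch of the point that values can be of any type, the dictionary below stores a string, a number, a list and another dictionary. The record is made up for illustration: the name person and the keys "age", "children" and "address" are assumptions, not part of the exercises. print is written with parentheses and a single argument so the sketch runs in both Python 2 and Python 3.

```python
# A made-up record for illustration; keys and values are assumptions.
person = {}
person["firstname"] = "Doris"              # a string
person["age"] = 34                         # a number
person["children"] = ["Anna", "Bo"]        # a list
person["address"] = {"city": "Aarhus"}     # another dictionary

# A value is retrieved with the key it was stored under. If the value
# is itself a list or a dictionary, you can index into it right away.
firstChild = person["children"][0]
city = person["address"]["city"]

print(firstChild)   # Anna
print(city)         # Aarhus
```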

You get the value associated with a key like this:

print D["firstname"]
Doris

This works if there is a value associated with the key. Otherwise you will get an error (a KeyError).
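To see what happens with a missing key, here is a minimal sketch that looks up "middlename", which was never assigned. It uses Python's try/except statement to catch the KeyError instead of letting the program stop. The variable name middle is only chosen for this illustration.

```python
D = {}
D["firstname"] = "Doris"
D["lastname"] = "Patterson"

# Looking up a key that is not in the dictionary raises a KeyError.
try:
    middle = D["middlename"]
except KeyError:
    # We end up here because "middlename" was never assigned.
    middle = None
    print("There is no key called middlename.")
```

Without the try/except, the lookup would stop the program with an error message naming the missing key.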

In a programming context you will, most often, use variables for keys and values:

key = 34
value = "string associated with the key 34"
D[key] = value
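Using variables for keys and values is what lets you fill a dictionary in a loop. The sketch below assumes two made-up lists, codes and names, holding one-letter base codes and their full names, and pairs them up one at a time.

```python
# Hypothetical example data: one-letter base codes and their full names.
codes = ["A", "C", "G", "T"]
names = ["adenine", "cytosine", "guanine", "thymine"]

baseNames = {}
for i in range(len(codes)):
    key = codes[i]
    value = names[i]
    baseNames[key] = value   # the same kind of assignment as D[key] = value

print(baseNames["G"])   # guanine
```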

Dictionary methods

To test if a given key is in the dictionary you can use a test like this:

if key in D:
    print key, "is in the dictionary."

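Alternatively, there is a method called has_key(key) that returns True if the dictionary has the key key and False otherwise (has_key only exists in Python 2; it was removed in Python 3, where you must use the in test above):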

print D.has_key("firstname")
True

print D.has_key("middlename")
False

You can get a list of all keys in the dictionary using the keys() method:

for key in D.keys():
    print key

and a list of all the values in the dictionary using the values() method:

for value in D.values():
    print value

The items() method gives you both, by giving you a list of key-value pairs:

for key, value in D.items():
    print key, "points to", value
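One thing to be aware of is that you should not rely on the order in which keys(), values() and items() return their elements; in Python 2 that order is arbitrary. If you need a predictable order, one option is to sort the keys first, as in this sketch. The names sortedKeys and lines are chosen for this illustration only, and print is written with parentheses so the code runs in both Python 2 and Python 3.

```python
D = {"firstname": "Doris", "lastname": "Patterson"}

# sorted() returns the keys as a list in alphabetical order.
sortedKeys = sorted(D.keys())
print(sortedKeys)   # ['firstname', 'lastname']

# Visit the key-value pairs in alphabetical order of the keys.
lines = []
for key in sortedKeys:
    lines.append(key + " points to " + D[key])

for line in lines:
    print(line)
```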

There are plenty more dictionary methods. Look them up in the Python documentation and try them out for yourself.
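As a starting point, here is a short sketch of two of the other built-in dictionary methods, get() and pop(). They are picked only as examples; the variable names first, middle and removed are made up for this illustration.

```python
D = {"firstname": "Doris", "lastname": "Patterson"}

# get() returns the value for a key, or the given default if the key
# is missing, so it never raises a KeyError.
first = D.get("firstname", "unknown")
middle = D.get("middlename", "unknown")

# pop() removes a key from the dictionary and returns its value.
removed = D.pop("lastname")

print(first)     # Doris
print(middle)    # unknown
print(removed)   # Patterson
print(len(D))    # 1, since only "firstname" is left
```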

Exercises

In this exercise you will work on an HIV-1 sequence. First download the module exerciseWeek3.py, if you have not done so already, and put it in the folder where you keep your other Python code. Importing it like this:

from exerciseWeek3 import *

will import the HIV sequence as a string hivSeq.

Write a function, countBases(seq), that, given a DNA string seq, returns a dictionary that maps each base to the number of occurrences of that base in seq.

Example usage:

print countBases("ACTGGCCCT")
{'A': 1, 'C': 4, 'T': 2, 'G': 2}

Then try it out on your HIV sequence and see what you get.

Write a function, printBaseComposition(seq), that, given a DNA string seq, prints a nice table with the proportion of each base. Call countBases(seq) from within printBaseComposition(seq) to count the bases.

Example usage:

printBaseComposition("ACTG")
A: 0.25
C: 0.25
T: 0.25
G: 0.25

Now download this file:

hivsequences.txt

Once you have downloaded the file you can open it in Notepad (on Windows) or TextEdit (on a Mac) and see what it looks like. Each line in the sequence file has a sequence name followed by a space followed by a sequence, like this, but with longer sequences of course:

HIV1.A.1 ATGGGTGCGAGAGCGTCAATATTAAGCGGGGGAAGATTAG...
HIV1.A.2 TGGAAGGGCTAATTTACTCCAAGAAAAGACAAGACATCCT...
HIV1.A.3 TGGATGGGTTAATTTACTCCAAGAAAAGGCAAGAAATCCT...
HIV1.A.4 TTGAAAAGCGAAAGTAACAGGGACTCGAAAGCGAAAGTTC...
HIV1.B.1 TGGAAGGGCTAATTCACTCCCAACGAAGACAAGATATCCT...
HIV1.B.2 GAGCCTGGGAGCTCTCTGGCTAGCTGGGGAACCCACTGCT...
HIV1.B.3 GGACCTGAAAGCGAAAGAGAAACCAGAGGAGCTCTCTCGA...
HIV1.B.4 GCGTCAGTATTAAGCGGGGGAAAATTAGATACATGGGAGA...
HIV1.C.1 GACTTGAAAGCGAAAGTAAGACCAGAGGAGATCTCTCGAC...
HIV1.C.2 AAATCTCTAGCAGTGGCGCCCGAACAGGGGACCTGAAAGC...
HIV1.C.3 AAATCTCTAGCAGTGGCGCCCGAACAGGGACCTGAAAGCG...
HIV1.C.4 TCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGACCT...
HIV1.D.1 GGTCTCTCTGGTTAGACCAGATTTGAGCCTGGGAGCTCTC...
HIV1.D.2 GCGAGAGCGTCAATATTAAGCGGGGGAAAATTGGATGCAT...
HIV1.D.3 GCGAGAGCGTCAGTATTAAGCGGGGGACAATTAGATGCAT...
HIV1.D.4 CTGAAAGCGAAAGTAGAACCAGAGGAGATCTCTCGACGCA...

Understand and explain in detail what the code below does and how. Hint: copy it into your editor so you can print the values of the different variables.

hivFile = open("hivsequences.txt", 'r')
statistics = {}
for l in hivFile:
    name, seq = l.split()
    if name not in statistics:
        statistics[name] = {}
    for b in seq:
        if b not in statistics[name]:
            statistics[name][b] = 0
        statistics[name][b] += 1
hivFile.close()

for name in statistics:
    print name
    total = sum(statistics[name].values())
    for b in statistics[name]:
        print "\t", b, statistics[name][b] / float(total)

Solutions to exercise

Download linked file

Contact info

Office address:
Bioinformatics Research Centre (BiRC)
Aarhus University
C.F. Møllers Allé 8
DK-8000 Aarhus C
Denmark
Office phone:
+45 871 55558
Mobile phone:
3013 8342
Email:
kaspermunch@birc.au.dk