Association Mapping

About the project

An important challenge in medicine and human genetics is to locate disease-affecting genes and gene variants, in order to study a disease, screen for high-risk individuals, and ultimately help prevent or cure the disease. In spite of intense research, many problems remain unsolved: in understanding the underlying biology of common diseases, in the statistical modelling of diseases, and in efficient computational methods for locating causative variants.

This project is concerned with the development of computational methods and computer tools for locating disease genes. The amount of data and the complexity of the problems make computer tools essential for successful studies. Recent improvements in genotyping technology now allow simultaneous genotyping of hundreds of thousands of polymorphisms, so data analysis is becoming the bottleneck of studies, and it is increasingly important to develop better and faster analysis methods.

The project runs from February 2008 to January 2011 at the Bioinformatics Research Center (BiRC), University of Aarhus. It continues a project that ran February 2006 – January 2007 at the Department of Statistics, University of Oxford (funded by FNU, grant 272-05-0283), and February 2007 – January 2008 at the Bioinformatics Research Center, University of Aarhus (funded by FTP, grant 274-05-0365). That project was in turn a continuation of an ISIS Katrinebjerg collaboration between Bioinformatics ApS and the Bioinformatics Research Center, University of Aarhus (March 2004 – February 2006).


The following publications have resulted from this project:

Using biological networks to search for interacting loci in genomewide association studies
M. Emily, T. Mailund, J. Hein, L. Schauser and M.H. Schierup
European Journal of Human Genetics; doi: 10.1038/ejhg.2009.15
Haplotype frequencies in a sub-region of chromosome 19q13.3, related to risk and prognosis of cancer, differ dramatically between ethnic groups
M.H. Schierup, T. Mailund, H. Li, J. Wang, A. A. Tjønneland, U. Vogel, L. Bolund and B. Nexø.
BMC Medical Genetics, 10(20); doi:10.1186/1471-2350-10-20
A fast algorithm for genome wide haplotype pattern mining
S. Besenbacher, C.N.S. Pedersen and T. Mailund
BMC Bioinformatics, 10(Suppl 1): S74 doi:10.1186/1471-2105-10-S1-S74
Local phylogeny mapping of quantitative traits: Higher accuracy and better ranking than single marker association in genomewide scans
S. Besenbacher, T. Mailund and M.H. Schierup
Genetics 181 747-753 2009; doi:10.1534/genetics.108.092643
SNPFile - A software library and file format for large scale association mapping and population genetics studies
J. Nielsen and T. Mailund
BMC Bioinformatics 9(526). doi:10.1186/1471-2105-9-526
Efficient Whole-Genome Association Mapping using Local Phylogenies for Unphased Genome Data
Z. Ding, T. Mailund and Y.S. Song
Bioinformatics 24(19): 2215–2221. doi:10.1093/bioinformatics/btn406
On Recombination Induced Multiple and Simultaneous Coalescent Events
J. Davies, F. Simancik, R. Lyngsø, T. Mailund, and J. Hein
Genetics 177: 2151–2160. doi:10.1534/genetics.107.071126
Experiences with GeneRecon on MiG
T. Mailund, C.N.S. Pedersen, J. Bardino, B. Vinter, and H.H. Karlsen
In Future Generation Computer Systems 2007 23 580–586; doi:10.1016/j.future.2006.09.003.
Whole genome association mapping by incompatibilities and local perfect phylogenies
T. Mailund, S. Besenbacher and M.H. Schierup
BMC Bioinformatics 2006 7(454). doi:10.1186/1471-2105-7-454.
The effective size of the Icelandic population and the prospects for LD mapping: inference from unphased microsatellite markers
T. Bataillon, T. Mailund, S. Thorlacius, E. Steingrimsson, T. Rafnar, M.M. Halldorsson, V. Calian, and M.H. Schierup
European Journal of Human Genetics 2006, 14, 1044–1053. doi:10.1038/sj.ejhg.5201669.
GeneRecon—A coalescent based tool for fine-scale association mapping
T. Mailund, M.H. Schierup, C.N.S. Pedersen, J.N. Madsen, J. Hein, and L. Schauser
Bioinformatics 2006 22 (18): 2317–2318; doi:10.1093/bioinformatics/btl153.
CoaSim: A Flexible Environment for Simulating Genetic Data under Coalescent Models
T. Mailund, M.H. Schierup, C.N.S. Pedersen, P.J.M. Mechlenborg, J.N. Madsen, and L. Schauser
BMC Bioinformatics 2005, 6:252. doi:10.1186/1471-2105-6-252.
Initial experiences with GeneRecon on MiG
T. Mailund, C.N.S. Pedersen, J. Bardino, B. Vinter, and H.H. Karlsen
Proceedings of The 2005 International Conference on Grid Computing and Applications (GCA'05)

As part of the project, we have held the following tutorials:

Association Mapping: Fundamental Principles and Applications
T. Mailund, M.H. Schierup, J. Hein, L. Schauser, and J.N. Madsen
At Pacific Symposium on Biocomputing (PSB) 2006.
Association Mapping: Design Issues and Data Analysis Approaches
L. Schauser, T. Mailund, J.N. Madsen, J. Hein and M.H. Schierup
At Pacific Symposium on Biocomputing (PSB) 2005.

In addition, results from the project have been presented at various seminars and workshops.

Software developed for the project

During this project and the preceding ISIS Katrinebjerg project, we have developed several new software tools. All of them have been released under the GNU General Public License and are listed below. A roadmap of future development can be found here.

Linux users who manage binary packages with yum can subscribe to software developed for the project through the "Add Yum Repository" link.

Linux users who use apt-get or one of its derivatives (aptitude, Synaptic, ...) can subscribe to software developed for the project by adding the line:

deb mailund main

to /etc/apt/sources.list. You can either do this by editing the file directly, or using e.g. Synaptic to add a new repository (Settings > Repositories > Third-party software > Add).

SNPfile: File format for SNP marker data

SNPfile is a library and API for manipulating large SNP datasets with associated meta-data, such as marker names, marker locations, individuals' phenotypes, etc. in an I/O efficient binary file format.

SMA: Single Marker Association Tests

SMA consists of a small collection of programs that perform different tests for association between genotypes at a single marker and a binary disease status.
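
As an illustration of what a single-marker test computes, the following is a minimal Python sketch of an allelic chi-square test on a 2x2 table of allele counts in cases and controls. It is not SMA's code or interface; the function names and the example counts are hypothetical, and SMA may implement other tests than this one.

```python
import math


def allelic_chi_square(case_counts, control_counts):
    """Pearson chi-square statistic (1 d.f.) for a 2x2 allele-count table.

    case_counts and control_counts are (count_allele_1, count_allele_2)
    pairs. Hypothetical helper for illustration only.
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    row_cases, row_controls = a + b, c + d
    col_allele_1, col_allele_2 = a + c, b + d
    if 0 in (row_cases, row_controls, col_allele_1, col_allele_2):
        raise ValueError("table has an empty row or column")
    # Shortcut formula for 2x2 tables: n (ad - bc)^2 / product of margins.
    return n * (a * d - b * c) ** 2 / (
        row_cases * row_controls * col_allele_1 * col_allele_2
    )


def p_value_1df(chi2):
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    # With 1 d.f. the statistic is a squared standard normal, so
    # P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


if __name__ == "__main__":
    # Made-up counts: 60/40 alleles in cases, 40/60 in controls.
    chi2 = allelic_chi_square((60, 40), (40, 60))
    print(chi2, p_value_1df(chi2))
```

In a genome-wide scan a test like this would be repeated for every marker, which is why the resulting p-values need correction for multiple testing.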

Blossoc: Block-Association Mapping

Blossoc is a linkage disequilibrium association mapping tool. It attempts to build (perfect) genealogies for each site in the input, scores these according to non-random clustering of affected individuals, and judges high-scoring areas to be likely candidates for containing disease-affecting variation. The local genealogy trees are built with a number of heuristics that are not guaranteed to produce the true trees, but that have the advantage over more sophisticated methods of being extremely fast. Blossoc can therefore handle much larger datasets than more sophisticated tools, at the cost of some accuracy.
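
The idea of finding a region around a site in which a perfect phylogeny can exist can be sketched with the standard four-gamete test. The Python below is a minimal illustration, assuming phased biallelic haplotypes coded as strings of 0 and 1. The function names and the greedy window-growing strategy are assumptions made for this example, not Blossoc's actual heuristics or interface.

```python
from itertools import combinations


def compatible(haplotypes, i, j):
    """Four-gamete test: sites i and j admit a perfect phylogeny if fewer
    than four of the gametes 00, 01, 10, 11 are observed."""
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) < 4


def all_compatible(haplotypes, lo, hi):
    """True if every pair of sites in the closed interval [lo, hi] is compatible."""
    return all(
        compatible(haplotypes, i, j)
        for i, j in combinations(range(lo, hi + 1), 2)
    )


def compatible_window(haplotypes, site):
    """Greedily grow an interval around `site` while all sites in it remain
    pairwise compatible. Returns (left, right), both inclusive."""
    n_sites = len(haplotypes[0])
    left = right = site
    grew = True
    while grew:
        grew = False
        if left > 0 and all_compatible(haplotypes, left - 1, right):
            left -= 1
            grew = True
        if right < n_sites - 1 and all_compatible(haplotypes, left, right + 1):
            right += 1
            grew = True
    return left, right


if __name__ == "__main__":
    # Made-up haplotypes; sites 0 and 3 show all four gametes.
    haps = ["0000", "1000", "1100", "1110", "0001", "1001"]
    print(compatible_window(haps, 1))
```

A tree built on the markers inside such a window could then be scored for clustering of affected individuals. That scoring step is not shown here.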

HapCluster++: A C++ implementation of the HapCluster method

HapCluster++ is a C++ implementation of the HapCluster MCMC association mapping method. Based on a simple model of relatedness, it searches the state space of haplotype clusters and scores them for significant clustering of cases rather than controls.

GeneRecon: MCMC Based Fine-Scale Association Mapping

GeneRecon is a software package for linkage disequilibrium mapping using coalescent theory. It is based on a Bayesian Markov-chain Monte Carlo (MCMC) method for fine-scale linkage-disequilibrium gene mapping using high-density marker maps. GeneRecon explicitly models the genealogy of a sample of the case chromosomes in the vicinity of a disease locus. Given case and control data in the form of genotype or haplotype information, it estimates a number of parameters, most importantly, the disease position.

CoaSim: A Flexible Environment for Simulating Genetic Data

CoaSim is a tool for simulating the coalescent process with recombination and gene conversion under various demographic models. It effectively constructs the ancestral recombination graph for a given number of individuals and uses this to simulate samples of SNP, micro-satellite, and other haplotypes/genotypes. The generated sample can afterwards be separated into cases and controls, depending on the states of selected individual markers. The tool can accordingly also be used to construct case and control data sets for association studies.
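
To give a feel for the underlying process, the following minimal Python sketch simulates the basic coalescent for a sample of n lineages. It assumes a constant population size, measures time in coalescent units, and leaves out recombination, gene conversion, demography and mutation. It therefore produces a single tree, not an ancestral recombination graph. The function name and the event representation are hypothetical and unrelated to CoaSim's own interface.

```python
import random


def simulate_coalescent(n, rng=None):
    """Simulate a coalescent tree for n sampled lineages.

    Returns a list of (time, child_a, child_b, parent) merge events.
    Leaves are numbered 0..n-1 and internal nodes n..2n-2.
    """
    rng = rng or random.Random()
    lineages = list(range(n))
    next_node = n
    time = 0.0
    events = []
    while len(lineages) > 1:
        k = len(lineages)
        # With k lineages, the waiting time to the next coalescence is
        # exponential with rate k(k-1)/2.
        time += rng.expovariate(k * (k - 1) / 2.0)
        # A uniformly chosen pair of lineages merges into a new ancestor.
        a, b = rng.sample(lineages, 2)
        lineages.remove(a)
        lineages.remove(b)
        lineages.append(next_node)
        events.append((time, a, b, next_node))
        next_node += 1
    return events


if __name__ == "__main__":
    for event in simulate_coalescent(5, random.Random(1)):
        print(event)
```

Mutations could be placed on the branches of such a tree to generate marker data, and one marker could then be chosen as the disease locus to split the sample into cases and controls, in the spirit of the description above.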

CoAnnealing: Simulated Annealing Search for Coalescent Trees

CoAnnealing is a tool for inferring the local phylogeny (coalescent tree) at a point in a genomic region, based on haplotype sequence data. It deals with recombination events by factoring out all possible positions of recombination and searches for the most likely tree using a simulated annealing algorithm.


For comments or questions, please contact Thomas Mailund, Bioinformatics Research Center (BiRC), University of Aarhus, Høegh-Guldbergsgade 10, DK-8000 Århus C.