CoaSim

About CoaSim

CoaSim is a tool for simulating the coalescent process with recombination and gene conversion under various demographic models. It effectively constructs the ancestral recombination graph (ARG) for a given number of individuals and uses this to simulate samples of SNP, micro-satellite, and other haplotypes/genotypes. The generated sample can afterwards be separated into cases and controls, depending on the states of selected individual markers. The tool can accordingly also be used to construct case and control data sets for association studies.

To cite CoaSim, please use:

CoaSim: A Flexible Environment for Simulating Genetic Data under Coalescent Models
T. Mailund, M.H. Schierup, C.N.S. Pedersen, P.J.M. Mechlenborg, J.N. Madsen, and L. Schauser
In BMC Bioinformatics 2005, 6:252. doi:10.1186/1471-2105-6-252.

CoaSim is developed in collaboration between Bioinformatics ApS and Bioinformatics Research Center (BiRC) and released under the GNU General Public License.

Installation

CoaSim is written in C++, Guile Scheme and Python, and is available as source code (under the GNU General Public License, GPL) and as binary Linux RPM files. The source code has been successfully compiled on various Linux and UNIX systems, under OS X, and under Windows with Cygwin. As I have only limited access to architectures other than Linux, it is not possible for me to make binary distributions for other platforms, but if anyone is willing to build the distributions I will be more than happy to put them on this site.

The most recent versions can be downloaded below; older versions are available from here.

Binary Distributions

The rpm-files were built on Linux Fedora Core 3 and 4, but should run on any system having guile-1.6 (for coasim-guile) and qt-3.3 (for coasim-gui).

The binary Python distributions were built on Linux Fedora Core 3 or 4 against Python 2.3.

Source Code Distributions

Building the GUI version

To build the source files, first untar the core module and place it in a directory called Core:

    $ tar zxf coasim-core-version.tar.gz
    $ mv coasim-core-version Core
    $ cd Core
    $ ./configure
    $ make

To build the GUI version, untar the source files next to the Core module, install the designer plugins used, then run qmake and build:

    $ cd ..
    $ tar zxf coasim-gui-version.tar.gz
    $ cd coasim-gui-version/designer_plugins
    $ qmake
    $ make install
    $ cd ..
    $ qmake
    $ make

If you do not have write access to QTDIR/plugins, you will need to install the plugins locally, e.g. use qtconfig to add $(HOME)/.qt/plugins to the plugin path and do:

    $ cd ..
    $ tar zxf coasim-gui-version.tar.gz
    $ cd coasim-gui-version/designer_plugins
    $ qmake
    $ INSTALL_ROOT=~/.qt make install
    $ cd ..
    $ qmake
    $ make

Building the Guile version

To build the Guile version from the source files, untar the file and build the Core module:

    $ tar zxf coasim-guile-version.tar.gz
    $ cd coasim-guile-version/Core
    $ ./configure
    $ make

and then build the Guile module:

    $ cd ../Guile
    $ ./configure
    $ make

You will need to install the program, not just build it, for the Scheme modules to load correctly. Alternatively, set the GUILE_LOAD_PATH environment variable manually; see Getting Started (Guile) for more details.

Building the Python version

To build the Python version (please keep in mind that this is a beta release, so do not expect full functionality from it yet), first untar the source code and build the Core module:

    $ tar zxf coasim-python-version.tar.gz
    $ cd coasim-python-version/Core
    $ ./configure
    $ make

and then build the Python module:

    $ cd ../Python
    $ python setup.py build 

This will build a module that you can import into your Python scripts. You will need to either install it

    $ python setup.py install 

or make sure it is in your PYTHONPATH.
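A quick way to check whether the module can be found is to ask Python directly. The sketch below is only an illustration: it assumes a modern Python 3 interpreter (the binary distributions mentioned above target Python 2.3, where importlib is not available), and module_available is a helper name made up for this example, not part of CoaSim.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module called `name` can be found
    on the current module search path (sys.path, which includes the
    directories listed in PYTHONPATH)."""
    return importlib.util.find_spec(name) is not None
```

Calling module_available("CoaSim") after installing, or after adding the build directory to PYTHONPATH, should then return True.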

Usage

CoaSim comes in two flavours: a graphical interface with limited functionality, for exploratory use, and a script-based version, based on either Guile Scheme or Python, for power users.

GUI Version

Installing the graphical user interface version (coasim-gui) should, on GNOME or KDE desktops, add an icon in the start menu for running CoaSim. If this is not the case, the tool can be started on the command-line with the command:

    $ coasim_gui

Setting up a simulation consists of specifying the parameters for building the ARG (recombination rate, exponential growth rate, gene conversion rates, etc.) and list of markers (positions, types, and mutation parameters).

In coasim_gui, the simulation parameters are set in a dialog window, a second dialog window shows the status of the simulation while it is running, and the simulated sequences are presented in a third dialog window from which they can be saved to file.

For more details, see the Getting Started (GUI) manual.

[Screenshots: the input dialog, the simulation status, and the output dialog]

Warning: The GUI version of CoaSim is limited in functionality compared to the Guile Scheme version. I currently do not have the resources to develop both versions, so I have focused my efforts on the Scheme version and discontinued development of the GUI version. If I get the time for it later, or some assistance in maintaining the tool, I will update the GUI version with the missing functionality; in the meantime I refer power users to the Guile Scheme version.

Guile Version

The Scheme-based version (coasim-guile) is started from the command-line; the parameters for the simulation or simulations to be run are described in one or more configuration scripts, which are written in the Scheme programming language. Starting CoaSim with the configuration script simulation.scm is done as:

    $ coasim_guile simulation.scm

Run coasim_guile --help to get a complete list of command-line options accepted by CoaSim.

Setting up a simulation consists of specifying the parameters for building the ARG (recombination rate, exponential growth rate, gene conversion rates, etc.) and list of markers (positions, types, and mutation parameters).

In coasim_guile the parameters are specified in a Guile Scheme script. This script also specifies how the simulation is performed, and how the ARG and simulated sequences should be manipulated when available, e.g. saved to files, or split into cases and controls based on one or more trait markers.

Consult the reference manual for a list of the CoaSim specific Scheme functions. More documentation can be found in the Getting Started (Guile) manual, and the CoaSim/Guile Manual.

Python Version

The Python version is used as a Python module, by importing it as

     >>> import CoaSim

after which simulations can be set up and executed.

More documentation can be found in the Getting Started (Python) manual, and the CoaSim/Python Manual.

Example Scripts

A number of smaller example scripts are included in the distribution. These are briefly described in the Getting Started (Guile) manual. Below are examples of more complex scripts and a short description of the projects in which they have been used.

Association Mapping Study

To evaluate a fine-scale association mapping tool, GeneRecon, we performed a large scale simulation study where CoaSim was used to generate the input data for GeneRecon.

Initial results of this study are reported in:

Initial experiences with GeneRecon on MiG
T. Mailund, C.N.S. Pedersen, J. Bardino, B. Vinter, and H.H. Karlsen
In proceedings of The 2005 International Conference on Grid Computing and Applications (GCA'05)

and

Experiences with GeneRecon on MiG
T. Mailund, C.N.S. Pedersen, J. Bardino, B. Vinter, and H.H. Karlsen
To appear in Future Generation Computer Systems doi:10.1016/j.future.2006.09.003.

The simulation script simulates, for each run, a dataset for three mapping experiments, one with 20 uniformly placed SNP markers on the 'basis' region, one with twice as many SNP markers placed on a region twice the size, and one with twice as many SNP markers on the 'basis' region, giving a denser marker placement.

The way this is achieved is to split the region into three parts: the left quarter, the middle half, and the right quarter. On each of the left and right regions, 10 markers are placed, and in the middle 40 markers are placed. To generate the 'basis' region, half of the middle 40 markers are used; to generate the wider region, the 10 on the left and right are used together with 20 from the middle region; and to get the denser region, the middle 40 are used. Transforming the simulated data into these three regions is not done in the CoaSim script but was done in a post-processing step.
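The marker layout described above can be sketched as a small post-processing helper. This is not the script used in the study; it is a minimal illustration that assumes the 60 simulated markers are stored as columns ordered left to right, and that "half of the middle 40" means every other middle marker. The function names are made up for this example.

```python
def marker_subsets(n_left=10, n_middle=40, n_right=10):
    """Return column indices for the three mapping experiments.

    Columns are assumed to be ordered left to right along the region.
    Taking every other middle marker is an assumption; the text only
    says that half of the middle markers are used.
    """
    left = list(range(n_left))
    middle = list(range(n_left, n_left + n_middle))
    right = list(range(n_left + n_middle, n_left + n_middle + n_right))
    half_middle = middle[::2]
    return {
        "basis": half_middle,                 # 20 markers on the middle half
        "wide": left + half_middle + right,   # 40 markers on the whole region
        "dense": middle,                      # 40 markers on the middle half
    }


def project(haplotypes, columns):
    """Keep only the given marker columns of each haplotype (list of alleles)."""
    return [[hap[c] for c in columns] for hap in haplotypes]
```

For each simulated data set, project(haplotypes, marker_subsets()["wide"]) would then give the data for the wider region, and similarly for the other two.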

The affected/unaffected status of simulated sequences was determined using a trait marker that was removed from the data after the sequences were split in two. All mutants are output as cases and all wild-types as controls; this is done to keep track of which cases are mutants and which are not after the simulation, and the "true" dataset analysed by GeneRecon is, in post-processing, sampled from the simulated data in ways depending on the chosen disease model.
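The split on a trait marker can be illustrated outside CoaSim as follows. This is a minimal sketch, not the code used in the study: it assumes each simulated sequence is a list of integer alleles and that the mutant allele at the trait marker is coded as 1; both the representation and the function name are assumptions made for this example.

```python
def split_cases_controls(haplotypes, trait_index, mutant_allele=1):
    """Split sequences into cases (mutants) and controls (wild-types)
    based on the allele at the trait marker, and remove the trait
    marker column from the returned sequences."""
    cases, controls = [], []
    for hap in haplotypes:
        # Drop the trait marker so it does not appear in the output data.
        stripped = hap[:trait_index] + hap[trait_index + 1:]
        if hap[trait_index] == mutant_allele:
            cases.append(stripped)
        else:
            controls.append(stripped)
    return cases, controls
```

Sampling the final data set under a given disease model would then be a separate step applied to the two returned lists.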

The Effective Size of the Icelandic Population

This study is described in the paper:

The effective size of the Icelandic population and the prospects for LD mapping: inference from unphased microsatellite markers
T. Bataillon, T. Mailund, S. Thorlacius, E. Steingrimsson, T. Rafnar, M.M. Halldorsson, V. Calian, and M.H. Schierup
European Journal of Human Genetics 2006, 14, 1044–1053. doi:10.1038/sj.ejhg.5201669.

In the study, CoaSim was used to simulate genotype data under various simulation parameters, such as stationary population size vs. exponential growth, varying mutation rates, and three different mutation models: the infinite sites SNP model, the K-allele model, and the step-wise mutation model. The results of the simulations were then used to estimate parameters from genotype data sampled in the Icelandic population.

To get the same amount of variation in the simulated genotypes, we rescaled the mutation rate when simulating under exponential growth. To be able to do this, we first calculated the expected length of the genealogy under varying growth parameters using the script:

Knowing the scaling factors, we simulated data under various parameters using the scripts:

The LD-study-K-allele.scm and LD-study-SNP.scm scripts use the built-in marker types ms-marker and snp-marker, while LD-study-step-wise.scm uses a custom marker.
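The idea behind rescaling the mutation rate can be sketched independently of CoaSim. The code below is not one of the study's scripts; it is a rough illustration that estimates the expected total branch length of a genealogy without recombination under exponential growth, using the standard time rescaling of coalescent waiting times, and compares it to the analytical value for a constant population. It assumes time is measured in units where each pair of lineages coalesces at rate 1 at the present, and that the population shrinks by a factor exp(-beta * t) going back in time; CoaSim's own parameterization of the growth parameter may differ. All function names are made up for this example.

```python
import math
import random


def total_branch_length(n, beta, rng):
    """Simulate one genealogy for n sequences and return its total
    branch length, with exponential growth parameter beta (beta = 0
    gives a constant population size)."""
    t = 0.0       # time of the most recent coalescence event
    total = 0.0   # accumulated branch length
    for k in range(n, 1, -1):
        # Waiting time while k lineages remain, on the constant-size scale.
        s = rng.expovariate(k * (k - 1) / 2.0)
        if beta > 0:
            # Map the constant-size waiting time to the growth time scale.
            wait = math.log1p(beta * s * math.exp(-beta * t)) / beta
        else:
            wait = s
        total += k * wait
        t += wait
    return total


def expected_length(n, beta, reps=5000, seed=1):
    """Monte Carlo estimate of the expected total branch length."""
    rng = random.Random(seed)
    return sum(total_branch_length(n, beta, rng) for _ in range(reps)) / reps


def mutation_rate_scale(n, beta, reps=5000, seed=1):
    """Factor by which to multiply the mutation rate under growth to get
    the same expected number of mutations as for a constant population."""
    constant = 2.0 * sum(1.0 / i for i in range(1, n))  # analytical value
    return constant / expected_length(n, beta, reps, seed)
```

Since growth shortens the genealogy, mutation_rate_scale returns a factor above 1 for positive beta, so the mutation rate is scaled up to compensate.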

The Effect of 1 Recombination

In this study we wanted to examine the effect of a single recombination on the simulated genealogy, in particular the effect of the age of recombination or the number of lineages at the time of the recombination.

We used the first simulations to get a feeling for the distribution of the number of recombinations to expect under various growth parameters and recombination rates. For constant population size, this can be calculated analytically, but for exponential growth we had to simulate it. We adjusted the recombination rate according to the expected size of the genealogy, similar to the rescaling of the mutation rate in the project described above, and then calculated the number of recombinations for each simulated ARG.

We then simulated sequence data under varying recombination rates and exponential growth parameters, using simulation callbacks to extract information about the number, age and lineages of each recombination.

The population growth parameter, β, was provided on the command line, to make it simple to simulate under varying models from the scripts that handled the simulations.

As a spin-off we examined the effect of no recombinations: when sampling trees simulated under various parameters, conditional on no recombinations having occurred, we calculated statistics about the genealogy. Here we used simulation-callbacks to perform rejection sampling, restarting the simulation whenever a recombination event occurred.
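The rejection sampling pattern described here can be illustrated with a toy stand-in for the simulator. The sketch below does not use CoaSim's API: toy_simulate is a hypothetical, heavily simplified event process (only the number of lineages is tracked), and RecombinationSeen and simulate_without_recombination are names made up for this example. What it shows is the control flow: a callback aborts the run on the first recombination, and the caller restarts until a run completes.

```python
import random


class RecombinationSeen(Exception):
    """Raised by the callback to abort the current simulation run."""


def toy_simulate(n, rho, rng, on_recombination):
    """Toy event process: with k lineages, coalescences happen at rate
    k(k-1)/2 and recombinations at rate k*rho/2. Returns the number of
    coalescence events once a single lineage remains."""
    k = n
    coalescences = 0
    while k > 1:
        coal_rate = k * (k - 1) / 2.0
        rec_rate = k * rho / 2.0
        if rng.random() < rec_rate / (coal_rate + rec_rate):
            on_recombination(k)   # the callback may abort the run here
            k += 1
        else:
            k -= 1
            coalescences += 1
    return coalescences


def simulate_without_recombination(n, rho, rng, max_tries=100000):
    """Restart the toy simulation until a run finishes without any
    recombination; returns (result, number of attempts used)."""
    def abort(_lineages):
        raise RecombinationSeen()

    for attempt in range(1, max_tries + 1):
        try:
            return toy_simulate(n, rho, rng, abort), attempt
        except RecombinationSeen:
            continue   # reject this run and start over
    raise RuntimeError("no recombination-free run within max_tries")
```

Aborting from inside the callback avoids simulating the rest of a run that would be rejected anyway, which matters when recombination-free runs are rare.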

The actual analysis of the data was not performed in CoaSim, but handled in a post-processing step. To properly simulate the process we needed the marker positions to depend on the length of the genealogy on each side of the recombination — something that is not currently supported in CoaSim — so as a bit of a kludge we placed the markers at the two extremes of the interval, discarded recombinations at the end, and moved the positions in the post-processing step.

With the Python bindings for CoaSim it becomes possible to combine all the analysis in a few Python scripts:

For the first two scripts above, the analysis tools are also wrapped, so they cannot be used directly as is, but can be examined to see how the rejection sampling is used to ensure a single recombination.

The first two scripts also use the Python newick parser available here.

Contact

For bug-reports or feature requests, please use our bug-tracking software.

For comments or questions, please contact Thomas Mailund <mailund@birc.au.dk>, Bioinformatics Research Center (BiRC), University of Aarhus, Høegh-Guldbergsgade 10, DK-8000 Århus C.
