HapCluster++ is a software package for linkage disequilibrium mapping. It is based on a Bayesian Markov-chain Monte Carlo (MCMC) method for fine-scale linkage-disequilibrium gene mapping using high-density marker maps. HapCluster++ is a C++ implementation of the method described in the paper below (the original implementation was in R).
Fine Mapping of Disease Genes via Haplotype Clustering. E.R.B. Waldron, J.C. Whitaker and D.J. Balding. Genetic Epidemiology. 30: 170–179. (2006)
HapCluster++ is developed in collaboration with David Balding's group at Imperial College London, and is released under the GNU General Public License.
HapCluster++ is written in C++ and is available as source code (under the GNU General Public License, GPL) and as binary packages in Linux RPM and Debian formats. The source code has been successfully compiled on various Linux and UNIX systems. As I have only limited access to architectures other than Linux, I cannot make binary distributions for other platforms, but if anyone is willing to build them I will be more than happy to put them on this site.
HapCluster++ requires the Boost Library and the GNU Scientific Library (GSL) to be installed. For SNPfile support, the SNPfile library is needed.
The most recent versions can be downloaded below; older versions are available from here.
The RPM files were built on Fedora Core 5. The deb files were built on Ubuntu Feisty Fawn. If you have problems installing them on other RPM- or Debian-based systems, please let me know.
To build the source files, first uncompress and untar the file, then run 'configure' and finally 'make'. To test that the build was successful, run 'make check'. To install the program, run 'make install'.
$ tar zxf hapcluster-version.tar.gz
$ cd hapcluster-version
$ ./configure
$ make
$ make check
$ make install
HapCluster++ is started on the command line, taking as input a file containing marker positions and a file containing phased haplotypes:
$ hapcluster positions.txt haplotypes.txt
The haplotype file contains one line per haplotype. A haplotype is a list of space-separated alleles, each either '0' or '1', with any negative number denoting missing data. Consecutive haplotypes are paired into genotypes: counting lines from zero, for even j the lines j and j+1 are taken as the two haplotypes of individual j/2.
The first column is a 'pseudo'-allele used for the case/control dichotomy: a '0' in the first column means that the haplotype is a control haplotype, and a '1' in the first column means that it is a case haplotype.
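The file layout described above can be checked before a run with a small script. The following Python sketch is only an illustration and is not part of HapCluster++: the names Haplotype, parse_haplotype_line and pair_haplotypes are hypothetical, missing alleles are stored as None, and rejecting any case/control value other than 0 or 1 is an assumption of the sketch.

```python
from typing import List, NamedTuple, Optional, Tuple


class Haplotype(NamedTuple):
    is_case: bool                  # first column: 1 = case, 0 = control
    alleles: List[Optional[int]]   # 0 or 1; None marks missing data


def parse_haplotype_line(line: str) -> Haplotype:
    """Parse one line: a case/control pseudo-allele followed by the alleles."""
    fields = [int(token) for token in line.split()]
    if not fields:
        raise ValueError("empty haplotype line")
    status, raw_alleles = fields[0], fields[1:]
    # Assumption of this sketch: the status column must be exactly 0 or 1.
    if status not in (0, 1):
        raise ValueError(f"case/control column must be 0 or 1, got {status}")
    alleles: List[Optional[int]] = []
    for allele in raw_alleles:
        if allele < 0:
            alleles.append(None)       # any negative number means missing
        elif allele in (0, 1):
            alleles.append(allele)
        else:
            raise ValueError(f"allele must be 0, 1 or negative, got {allele}")
    return Haplotype(status == 1, alleles)


def pair_haplotypes(haps: List[Haplotype]) -> List[Tuple[Haplotype, Haplotype]]:
    """Pair lines j and j+1 (j even, counted from zero) as individual j/2."""
    if len(haps) % 2 != 0:
        raise ValueError("expected an even number of haplotypes")
    return [(haps[j], haps[j + 1]) for j in range(0, len(haps), 2)]


# Two made-up haplotypes forming one individual: a case and a control line.
example = ["1 0 1 -1 0", "0 1 1 0 0"]
individuals = pair_haplotypes([parse_haplotype_line(l) for l in example])
print(len(individuals), "individual(s) read")
```

The example data are invented for illustration; to check a real input, read the lines of the haplotype file instead of the example list.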
HapCluster++ currently implements an experimental extension for genotypes. If the input contains genotypes rather than phased haplotypes, use the option -u and specify one individual per line and use 0 for homozygote 0, 1 for homozygote 1, and 2 for heterozygote.
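The genotype coding can be illustrated with a short Python sketch that collapses two allele values into the codes listed above. It is not taken from HapCluster++: the function name genotype_code is hypothetical, and because the text does not say how missing data or the case/control column are written in genotype mode, the sketch handles only the alleles 0 and 1.

```python
def genotype_code(allele_a: int, allele_b: int) -> int:
    """Return 0 for homozygote 0, 1 for homozygote 1, 2 for heterozygote."""
    for allele in (allele_a, allele_b):
        # Assumption: only observed alleles are handled; missing data is
        # rejected because its genotype-mode coding is not specified here.
        if allele not in (0, 1):
            raise ValueError(f"allele must be 0 or 1, got {allele}")
    if allele_a != allele_b:
        return 2            # heterozygote
    return allele_a         # homozygote: 0 stays 0, 1 stays 1


# Made-up marker alleles for one individual, given as (allele, allele) pairs.
markers = [(0, 0), (1, 1), (0, 1), (1, 0)]
print(" ".join(str(genotype_code(a, b)) for a, b in markers))
```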
When run, the program outputs samples from the posterior density of disease loci. These samples can then be analysed in other software packages, such as R.
By default, samples of the disease locus are written to standard output, but they can be written to a file instead using the option -s. This is especially recommended when running HapCluster in verbose mode (option -v).
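As an illustration of the kind of downstream analysis meant here, the Python sketch below summarises a list of sampled locus positions by their mean, median and an equal-tailed interval. It assumes the samples have already been read into a list of numbers; the layout of the sample output is not described on this page, so parsing is left out, and the function names are hypothetical.

```python
import statistics
from typing import Dict, List, Sequence


def quantile(sorted_values: Sequence[float], q: float) -> float:
    """Linear-interpolation quantile of an already sorted sequence."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def summarise_locus_samples(samples: List[float], level: float = 0.95) -> Dict[str, float]:
    """Summarise posterior samples of the disease locus position."""
    if not samples:
        raise ValueError("no samples to summarise")
    ordered = sorted(samples)
    tail = (1 - level) / 2          # probability mass cut from each tail
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "lower": quantile(ordered, tail),
        "upper": quantile(ordered, 1 - tail),
    }


# Invented sample positions, only to show the call.
print(summarise_locus_samples([0.41, 0.39, 0.44, 0.40, 0.42, 0.38, 0.43]))
```

An equal-tailed interval is only one possible summary; a histogram or density plot of the samples shows multimodal posteriors more faithfully.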
Run hapcluster --help to get a complete list of command-line options accepted by HapCluster. Please refer to the Getting Started manual for more details.
Programmers wishing to extend HapCluster or build new methods based on it can find documentation of the source code here.
Simulated test examples can be downloaded below. The positions-*.*.txt files contain the marker positions, and the haplotypes-*.*.txt files contain the haplotypes.
Two datasets with 200 markers in a region corresponding to recombination rate ρ = 4000, or about 10 Mb. Each contains 1000 affected and 1000 unaffected haplotypes; the disease risk for a mutant is 10%, the risk for a wildtype is 5%.
Two datasets with 200 markers in a region corresponding to recombination rate ρ = 40, or about 100 kb. Each contains 1000 affected and 1000 unaffected haplotypes; the disease risk for a mutant is 10%, the risk for a wildtype is 5%.
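The pairing of ρ with a physical length in the two descriptions above is consistent with the common population-genetic scaling ρ = 4 · Ne · r · L. The Python sketch below is a rough check only: the effective population size Ne = 10,000 and the per-base recombination rate r = 1e-8 per generation are assumptions chosen for illustration and are not stated on this page.

```python
def scaled_recombination_rate(length_bp: float,
                              effective_size: float = 10_000,
                              rate_per_bp: float = 1e-8) -> float:
    """Return rho = 4 * Ne * r * L for a region of length_bp base pairs.

    The default Ne and r are illustrative assumptions, not values taken
    from the HapCluster++ documentation.
    """
    return 4 * effective_size * rate_per_bp * length_bp


print(round(scaled_recombination_rate(10_000_000)))  # 10 Mb region, about 4000
print(round(scaled_recombination_rate(100_000)))     # 100 kb region, about 40
```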
For bug-reports or feature requests, please use our bug-tracking software.
For comments or questions, please contact Thomas Mailund <mailund@birc.au.dk>, Bioinformatics Research Center (BiRC), University of Aarhus, Høegh-Guldbergsgade 10, DK-8000 Århus C.