Korak: An efficient method for exploring the space of gene tree/species tree reconciliations in a probabilistic framework
Jean-Philippe Doyon, Sylvie Hamel, and Cedric Chauve
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012
A user manual of the program
Given a gene tree G, a species tree S (both in Newick format), duplication/loss rates, and branch lengths for S, Korak
computes the following:
- The size and diameter of the whole space of reconciliations between G and S.
- The likelihood of the LCA-based reconciliation.
- The sum of the likelihoods over all reconciliations located in a subspace of the whole space of reconciliations (given a maximal depth), that is, the probability mass of the considered subspace.
- The (exact/approximate) posterior probability (given the tree G) of each visited reconciliation.
- A comparison of the probabilistic analysis with the real reconciliation (for simulated gene trees only).
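The relationship between the likelihoods, the probability mass, and the posterior probabilities listed above can be illustrated with a minimal sketch. It is not taken from the Korak source code: the function name `posterior_from_log_likelihoods` and its arguments are hypothetical, and it assumes that the posterior of a reconciliation is its likelihood divided by a normalizing sum. When the sum over the whole space is supplied, the result is exact; when only the visited subspace is available, normalizing over it gives an approximation.

```python
import math


def posterior_from_log_likelihoods(log_liks, log_total=None):
    """Turn log-likelihoods of visited reconciliations into posteriors.

    log_liks  : list of log-likelihoods, one per visited reconciliation
                (hypothetical input, not Korak's actual output format).
    log_total : log of the summed likelihood over the whole space, if known.
                If None, the sum over the visited reconciliations is used,
                which is an assumption yielding approximate posteriors.
    """
    if log_total is None:
        # Log-sum-exp: subtract the maximum before exponentiating to
        # avoid underflow with very small likelihoods.
        m = max(log_liks)
        log_total = m + math.log(sum(math.exp(x - m) for x in log_liks))
    return [math.exp(x - log_total) for x in log_liks]


# Three visited reconciliations with illustrative (made-up) likelihoods.
log_liks = [math.log(0.02), math.log(0.01), math.log(0.01)]

# Approximate posteriors: normalized over the visited subspace only.
approx = posterior_from_log_likelihoods(log_liks)

# Exact posteriors: normalized by a known total over the whole space
# (0.08 is an arbitrary illustrative value).
exact = posterior_from_log_likelihoods(log_liks, log_total=math.log(0.08))

print(approx)
print(exact)
```

Working in log space is a design choice of this sketch: likelihoods of reconciliations can be very small, and subtracting the maximum log-likelihood before exponentiating keeps the sum numerically stable.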
The archive Exploration.tgz contains the following:
- The C++ code, which consists of four libraries (folder LIBRARY) and the main project (folder EXPLORATION/exploration/).
- Several input files (folder EXPLORATION/INPUT_FILES/).
Follow these steps to build the binary called Korak:
1. Download the archive Exploration.tgz.
2. Create a new directory and move the archive into it.
3. Extract the archive: 'tar -zxvf Exploration.tgz'
4. 'cd EXPLORATION/exploration/'
5. Type 'cmake .' to build the makefile corresponding to your system.
6. Type 'make' to build the binary file called Exploration (several warnings are written to the shell; this is normal, and cleaning them up is still to be done).
7. Execute the binary: './Exploration D ../INPUT_FILES/DATA2 E L Q D'
The options of the program and the format of the output are described in manualExploration.pdf. The same example as in Step 7 above is used.
Output files of the Exploration program:
- exploration.log2 : summary of the running time
- exploration.results : results of the computation (see manualExploration.pdf)
Probabilistic Analysis on Real and Simulated Gene Trees
This section contains the following:
- Input data:
  - Real data for 12 fungal genomes:
    - 1278 real gene family trees.
    - Species tree with branch lengths (in millions of years) and duplication and loss rates (computed by CAFE).
  - Synthetic data based on the real data above:
    - Simulated gene trees using a "recursive" birth-and-death process (see the paper).
    - Based on the rates R computed by CAFE, three duplication/loss rate categories are considered:
      - 1051 trees with R x 1;
      - 1025 trees with R x 1.4;
      - 924 trees with R x 1.8.
- Output data:
  - Real gene trees:
    - Complete exploration
    - Incomplete exploration
  - Simulated gene trees
Input gene trees
Wapinski, A. Pfeffer, N. Friedman, and A. Regev. Natural history and
evolutionary principles of gene duplication in fungi. Nature,
De Bie, N. Cristianini, J.P. Demuth, and M.W. Hahn. CAFE: a
computational tool for the study of gene family evolution.
Bioinformatics, 22(10):1269–1271, 2006.