Korak: An efficient method for exploring the space of gene tree/species tree reconciliations in a probabilistic framework

IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012

Given a gene tree G, a species tree S (both in Newick format), duplication/loss rates, and branch lengths for S, Korak computes the following:

- The size and diameter of the whole space of reconciliations between G and S.
- The likelihood of the LCA-based reconciliation.
- The sum of the likelihoods over all reconciliations located in a subspace of the whole space of reconciliations (given a maximal depth), that is, the probability mass of the considered subspace.
- The (exact/approximate) posterior probability (given the tree G) of each visited reconciliation.
- A comparison of the probabilistic analysis with the real reconciliation (for simulated gene trees only).
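As an illustration of what "LCA-based reconciliation" refers to, the following minimal sketch maps each internal gene-tree node to the lowest common ancestor of its species in S and labels it as a duplication or a speciation. It is not Korak's implementation and does not compute any likelihood. The nested-tuple tree encoding, the function names, and the toy trees are assumptions made for this example only.

```python
def leaves(tree):
    """Set of leaf labels below a node (a leaf is a string, an inner node a pair)."""
    if isinstance(tree, str):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])


def lca(species_tree, wanted):
    """Lowest species-tree node whose leaf set contains every species in `wanted`."""
    node = species_tree
    while not isinstance(node, str):
        for child in node:
            if wanted <= leaves(child):
                node = child  # descend while one child still covers all species
                break
        else:
            break  # no single child covers them: this node is the LCA
    return node


def lca_events(gene_tree, species_tree, leaf_species):
    """Post-order list of (gene node, mapped species node, event label)."""
    events = []

    def visit(node):
        if isinstance(node, str):
            return {leaf_species[node]}
        below = [visit(child) for child in node]
        here = lca(species_tree, below[0] | below[1])
        # Duplication if a child maps to the same species node as its parent.
        duplicated = any(lca(species_tree, s) == here for s in below)
        events.append((node, here, "duplication" if duplicated else "speciation"))
        return below[0] | below[1]

    visit(gene_tree)
    return events


# Toy input (hypothetical): species tree ((A,B),C) and a gene tree with two copies in A.
S = (("A", "B"), "C")
G = (("a1", "b1"), ("a2", "c1"))
leaf_species = {"a1": "A", "b1": "B", "a2": "A", "c1": "C"}
events = lca_events(G, S, leaf_species)
for gene_node, species_node, label in events:
    print(gene_node, "->", species_node, label)
```

In this toy case the root of G is labeled a duplication because its right child already maps to the root of S.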

User manual for the program: manualExploration.pdf.

The archive Exploration.tgz contains the following:

- The C++ code, which consists of four libraries (folder LIBRARY) and the main project (folder EXPLORATION/exploration/).
- Several input files (folder EXPLORATION/INPUT_FILES/).

Follow these steps to build and run Korak (the binary is named Exploration):

- Download the archive Exploration.tgz.
- Create a new directory and move the archive into it.
- Extract the archive: 'tar -zxvf Exploration.tgz'
- Change to the project directory: 'cd EXPLORATION/exploration/'
- Type 'cmake .' to generate the makefile corresponding to your system.
- Type 'make' to build the binary file called Exploration (several warnings are written to the shell; this is expected, and cleaning them up remains to be done).
- Execute the binary: './Exploration D ../INPUT_FILES/DATA2 E L Q D'
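The steps above can be scripted. The sketch below is one possible way to do so in Python; the commands are copied verbatim from the steps, while the names BUILD_STEPS and run_steps and the dry-run default are assumptions of this example. It assumes a POSIX system with tar, cmake, and make installed, and that it is started from the directory holding Exploration.tgz.

```python
import subprocess

# (command, working directory) pairs, copied from the build steps above.
BUILD_STEPS = [
    (["tar", "-zxvf", "Exploration.tgz"], "."),
    (["cmake", "."], "EXPLORATION/exploration"),
    (["make"], "EXPLORATION/exploration"),
    (["./Exploration", "D", "../INPUT_FILES/DATA2", "E", "L", "Q", "D"],
     "EXPLORATION/exploration"),
]


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv, cwd in steps:
        print(f"[{cwd}] {' '.join(argv)}")
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(argv, cwd=cwd, check=True)


run_steps(BUILD_STEPS)  # dry run: only prints the commands
```

Calling run_steps(BUILD_STEPS, dry_run=False) would actually execute the commands.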

Output files of the Exploration program:

- exploration.log2: summary of the running time
- exploration.results : results of the computation (see manualExploration.pdf)

Probabilistic Analysis on Real and Simulated Gene Trees

This section contains the following:

- Input data:
  - Real data for 12 fungal genomes [1]:
    - 1278 real gene family trees.
    - Species tree with branch lengths (in millions of years) and duplication and loss rates (computed by CAFE [2]).
  - Synthetic data based on the real data above:
    - Simulated gene trees using a "recursive" birth-and-death process (see the paper).
    - Based on the rates R computed by CAFE, three duplication/loss rate categories are considered:
      - 1051 trees with R x 1;
      - 1025 trees with R x 1.4;
      - 924 trees with R x 1.8.
- Output data:
  - Real gene trees:
    - Complete exploration
    - Incomplete exploration
  - Simulated gene trees:
    - Complete exploration
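The difference between complete and incomplete exploration can be illustrated with a small numerical sketch. It assumes that the posterior probability of a visited reconciliation is its likelihood divided by the summed likelihood of all visited reconciliations, which would be exact when the whole space is explored and approximate when only a subspace is. The function names and the likelihood values are hypothetical and are not taken from Korak's output.

```python
import math


def log_sum_exp(log_values):
    """Stable log of a sum of exponentials (likelihoods are often tiny)."""
    top = max(log_values)
    return top + math.log(sum(math.exp(v - top) for v in log_values))


def posteriors(log_likelihoods):
    """Normalize per-reconciliation log-likelihoods over the visited set."""
    log_mass = log_sum_exp(list(log_likelihoods.values()))
    probs = {name: math.exp(ll - log_mass) for name, ll in log_likelihoods.items()}
    return probs, math.exp(log_mass)


# Hypothetical likelihoods of three visited reconciliations.
visited = {"r1": math.log(0.006), "r2": math.log(0.003), "r3": math.log(0.001)}
post, mass = posteriors(visited)
print(post, mass)
```

Working in log space avoids underflow when many small likelihoods are summed.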

Input gene trees

| Data Set | Increasing Factor (I.F.) | Gene Trees | Branch Lengths (in time) and Rates | Species Tree (12 fungal genomes) |
| --- | --- | --- | --- | --- |
| Real gene trees | Not applicable | 1278 trees, realGeneTree.tgz | edgeValues-1 | |
| Simulated gene trees | 1 | 1051 trees, simulatedGeneTree_1.tgz | edgeValues-1 | |
| Simulated gene trees | 1.4 | 1025 trees, simulatedGeneTree_1.4.tgz | edgeValues-1.4 | |
| Simulated gene trees | 1.8 | 924 trees, simulatedGeneTree_1.8.tgz | edgeValues-1.8 | |
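The increasing factors listed above multiply the base rates R. A minimal sketch of that scaling follows, assuming the factor applies equally to the duplication and the loss rate of every branch. The branch names and rate values are placeholders, not the CAFE estimates, and the layout of the edgeValues files is not assumed here.

```python
INCREASING_FACTORS = (1.0, 1.4, 1.8)


def scale_rates(base_rates, factor):
    """Multiply every (duplication, loss) rate pair by the increasing factor."""
    return {
        branch: (dup * factor, loss * factor)
        for branch, (dup, loss) in base_rates.items()
    }


# Placeholder rates per species-tree branch (hypothetical values).
base = {"branch_1": (0.002, 0.003), "branch_2": (0.001, 0.004)}
categories = {factor: scale_rates(base, factor) for factor in INCREASING_FACTORS}
for factor, rates in categories.items():
    print(factor, rates)
```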

Probabilistic analysis

| Reconciliation Tree Explored | Real Gene Trees | Simulated Gene Trees, I.F. 1 | Simulated Gene Trees, I.F. 1.4 | Simulated Gene Trees, I.F. 1.8 |
| --- | --- | --- | --- | --- |
| Whole tree | realGeneTree_CompleteExploration.tgz | simulatedGeneTree_CompleteExploration_1.tgz | simulatedGeneTree_CompleteExploration_1.4.tgz | simulatedGeneTree_CompleteExploration_1.8.tgz |
| Subtree with depth 0 | realGeneTree_Depth_0.tgz | | | |
| Subtree with depth 1 | realGeneTree_Depth_1.tgz | | | |
| Subtree with depth 2 | realGeneTree_Depth_2.tgz | | | |
| Subtree with depth 3 | realGeneTree_Depth_3.tgz | | | |
| Subtree with depth 4 | realGeneTree_Depth_4.tgz | | | |
| Subtree with depth 5 | realGeneTree_Depth_5.tgz | | | |
| Subtree with depth 6 | realGeneTree_Depth_6.tgz | | | |
| Subtree with depth 7 | realGeneTree_Depth_7.tgz | | | |
| Subtree with depth 8 | realGeneTree_Depth_8.tgz | | | |
| Subtree with depth 9 | realGeneTree_Depth_9.tgz | | | |
| Subtree with depth 10 | realGeneTree_Depth_10.tgz | | | |
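Once downloaded, the depth-limited result archives can be unpacked in one pass. The sketch below uses Python's standard tarfile module; the archive names follow the pattern listed above, while the helper names and the one-folder-per-archive layout are assumptions of this example. The internal contents of the archives are not described here.

```python
import tarfile
from pathlib import Path


def depth_archives(max_depth=10):
    """Archive names for depths 0..max_depth, following the listed pattern."""
    return [f"realGeneTree_Depth_{depth}.tgz" for depth in range(max_depth + 1)]


def extract_available(download_dir, out_dir):
    """Extract every depth archive found in download_dir; return the names handled."""
    extracted = []
    for name in depth_archives():
        archive = Path(download_dir) / name
        if not archive.is_file():
            continue  # skip archives that were not downloaded
        target = Path(out_dir) / archive.stem  # one folder per archive
        target.mkdir(parents=True, exist_ok=True)
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target)
        extracted.append(name)
    return extracted


print(depth_archives())
```

As with any archive, extraction should only be run on files obtained from a trusted source.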

References

[1] I. Wapinski, A. Pfeffer, N. Friedman, and A. Regev. Natural history and evolutionary principles of gene duplication in fungi. Nature, 449:54–61, 2007.

[2] T. De Bie, N. Cristianini, J.P. Demuth, and M.W. Hahn. CAFE: a computational tool for the study of gene family evolution. Bioinformatics, 22(10):1269–1271, 2006.