Genetree

The long term evolution of DNA can be modelled using the coalescent. The Markov chain simulation technique developed by Griffiths and Tavaré can be used to produce likelihood estimates for DNA sequence data under the infinitely-many-sites model. The program genetree will calculate not only likelihoods but also simulate likelihood surfaces for parameter estimationin subdivided populations with or without population growth. A whole host of statistics can be calculated including times to the most recent common ancestor, probabilities of the location of the most recent common ancestors, times to the different mutations and the probabilities of these having occurred in the various subpopulations.

The migration model is not fixed and can be specified by the user. Indeed the migration process need not even have a stationary distribution so absorption of the process into subpopulations is allowed. Likelihood estimation of migration matrix parameters can also be carried out. For two individuals the exact, rather than simulated, likelihoods can be calculated.

genetree supersedes ptreesim and timesim. Distribution is free, however it is requested that bug reports and queries are reported to R. C. Griffiths. We have endeavoured to check most options for bugs but bugs may still occur!

  • Quick Introduction and References
  • Data Preparation and Format
  • Genetree Options and Examples
  • Hypothesis Testing
  • Graphical Output (Program treepic)
  • Genetree Download

The URL for this website is: http://www.maths.monash.edu.au/~mbahlo/mpg/gtree.html

Quick Introduction and References

The coalescent is a popular model of genetic evolution. It is the limiting distribution (when the population size (N) is large) of many models of genetic evolution such as the Wright-Fisher and Moran models. For the Wright-Fisher model time is scaled in units of 2N generations for a diploid population or in units of N for a haploid population. The coalescent is a good approximation even when the population sizes are only moderate in reality. In the coalescent with constant population size the time distribution when there are i ancestors of a sample is exponential with mean 2/(i(i-1)).

Summary statistics such as the number of pairwise differences or the number of polymorphic sites are often used to make inferences on the evolutionary process of the population from the sample. Conditioning on a sample and finding the probability of its history yields much more information for estimation of parameters. Exact methods fail as the distributions are far too complicated to be derived analytically. Instead Monte Carlo Markov chain methods have been applied with success. For a different Monte Carlo Method see Joe Felsenstein's[1] work.

Griffiths and Tavaré pioneered their Markov chain simulation technique for a simple Kimura model in Simulating probability distributions in the coalescent. Run times were long however and inference not easy. The same technique was applied to the already well known infinitely-many-sites model, a much simpler model for the mutations in a sample of polymorphic sites from different individuals in Unrooted genealogical tree probabilities in the infinitely-many-sites model. An especially important feature of the infinitely-many-sites model is the property that a given tree is synonymous with the binary "incidence" matrix of mutations on lineages. The incidence matrix is simply derived by denoting an ancestor base with a 0 and the mutant base with a 1. The condition for each polymorphic site is that a mutation can only have occurred once there.

In Unrooted genealogical tree probabilities in the infinitely-many-sites model the ties between unrooted trees, trees where the haplotype of the MRCA (most recent common ancestor) and rooted trees were also explored, an important feature when analysing real data and wanting to find the MRCA.

In Sampling theory for neutral alleles in a varying environment variable population size was added to the model. A feature of the coalescent when the population size is constant is that times while there are n,..,2 ancestors are independent. This makes the process so easy to simulate forward or backward (dependent on the data). With varying population size ancestral times form a Markov chain.

The program genetree presented here not only includes the variable population size scenario with mutation parameter and TMRCA (Time to the most recent common ancestor) estimates, but can also deal with all of these in a subdivided population.

On a last note it should be pointed out that all results from such analysis should be considered as a rough guide only. The stochastic nature of the process used to describe the genealogy, the assumptions that have to be made and the possible removal of data can all have important effects on inference and lead to unusual conclusions. It is important that no analysis is done blindly but with the inclusion of prior knowledge and great care. Results should be used only when other evidence (studies using different DNA or fossil evidence) also indicate similar results.

Some nice results have recently been obtained with regards to modern human evolution. A possible ancient asian origin has been found not only in beta-globin data but also nuclear and mtDNA studies using these methods. See Archaic African and Asian Lineages in the Genetic Ancestry of Modern Humans for details. For the future it is envisaged that estimation of divergence times for populations will also be implemented. Other possibilities are the inclusion of bottlenecks and selection. genetree does not include recombination or selection. Recombination has already been investigated by several authors including Griffiths and Marjoram in Ancestral inference from samples of DNA sequences with recombination.

Papers for downloading

  • Inference from gene trees in a subdivided population[2]

Papers with descriptions of the computational methods used in genetree

  • Computational Methods for the Coalescent[3]

Relevant Papers

  • The Age of a Mutation in a General Coalescent Tree[4]
  • The Ages of Mutations in Gene Trees[5]

Papers for Reference

  • Simulating probability distributions in the coalescent.Theor. Popn. Biol., 46, 131-159, 1994. R. C. Griffiths, S. Tavaré
  • Sampling theory for neutral alleles in a varying environment. Phil.Trans. R. Soc. Lond. B. 344, 403-410, 1994. R.C. Griffiths, S. Tavaré.
  • Ancestral inference in population genetics. Statistical Science, 9, 307-319, 1994. R. C. Griffiths, S. Tavaré.
  • Unrooted genealogical tree probabilities in the infinitely-many-sites model. Math. Biosci., 127, 77-98, 1995. R.C. Griffiths, S. Tavaré.
  • Archaic African and Asian Lineages in the Genetic Ancestry of Modern Humans. Am. J. Hum. Genetic.,60, 772-789, 1997. Rosalind M. Harding, et al.

Other Papers of Interest

  • Inferring coalescence times from DNA sequence data. Genetics,145, 505-518 D. Balding, P. Donnelly, R.C. Griffiths, S. Tavaré.
  • On the time since Adam. Science,272, 1357-1359, 1996. D.J. Balding, P. Donnelly, R.C. Griffiths, S. Tavaré.
  • Markov chain inference methods in population genetics. Mathl. Comput. Modelling,23, 141-158, 1996. R.C. Griffiths, S. Tavaré.
  • Ancestral inference from samples of DNA sequences with recombination. J.Comp.Biol.,3, 479-502, 1996. P. Marjoram, R.C. Griffiths.
  • Efficient algorithms for inferring evolutionary trees. Networks,21, 19-28, 1991. D. Gusfield.
  • Numerical methods for inferring evolutionary trees. Quarterly Review of Biology, 57, 379-404, 1982.

Data Preparation and Format

Fully aligned sequence data is required as input by genetree.

The data can be in the form of a string of bases (just the polymorphic sites), as binary sequences (see references) or as mutation paths forming a gene tree (see references). The recursion for the likelihood of the sample actually operates on the tree (each haplotype is then described by a mutation path).

Each haplotype sequence requires a population label and a number representing the frequency of that haplotype. In the case of a panmictic population, or a single population, only the frequencies of the different haplotypes are required. These label(s) are separated from the haplotype (either a string of bases or a binary sequence or in the mutation sequence for the tree format) by a colon. Thus for the user with data in the raw form (just strings of bases), the data has to be entered only as different types, consisting of the haplotype (only the polymorphic sites) and with the frequency, or the number of occurrences of the type in the sample.

Inputting the data in the base form or binary sequences requires the -j or -J options. These options will change the data into a tree.

An example:

base formbinary form

0 1 : CGGTTCA 0 1 : 0 0 0

1 2 : AGGTTCA 1 2 : 1 0 0

2 1 : ATGTTCA 2 1 : 1 1 0

1 1 : AGGTT.. 1 1 : 1 0 1

In this data set there are three subpopulations, labelled 0, 1 and 2. There are four polymorphic sites out of a total of 7 sites, however the last polymorphic site is a deletion of two bases and we can merge this into a single site. It will be assigned a different "mutation" rate later on. The first type is 0 CGGTTCA, where 0 is the subpopulation label and CGGTTCA the haplotype. The frequencies are in the second column. Thus the first type has a frequency of one, that is there is only one of its kind in the sample.

Note that all the non-polymorphic sites are ignored (they are considered in the mutation rate) because they don't give any information towards the estimation of the parameters.

If the incidence matrix (the binary matrix describing the haplotypes) is not in the infinitely-many-sites format genetree will print an error message indicating which site(s) violate the conditions. See -j and -J options for more detail.

To the infinitely-many-sites model from raw data

For best results the data should ideally be immediately compatible with the infinitely-many-sites model or close to it. To make data infinitely-many-sites compatible requires that the data file only contains sites which

  • are polymorphic
  • have only mutated or changed once, so there has been no back or parallel mutation in the data.

Example: The simplest data set to violate this

0 1 : AG

1 2 : AT

2 1 : CT

1 1 : CG

which in the binary haplotype description is

0 1 : 0 0

1 2 : 0 1

2 1 : 1 1

1 1 : 1 0

The matrix

0 0

0 1

1 1

1 0

is often referred to as the incidence matrix of mutations on lineages in the literature. In this data set a parallel mutation has occurred at site 1/site 2. The pattern of binary sequences shows the violation in the pattern of the last three haplotypes. For further details see Efficient algorithms for inferring evolutionary trees and Numerical methods for inferring evolutionary trees. Removal of either site or any of the last three individuals, assuming that 0 is the wild type or ancestor base, will make the data set infinitely-many-sites model compatible. If there is a known parallel mutation from a wild type at a site it could be presented as two sites to make data compatible.

A program called seq2tr is used by genetree to change the base or binary haplotype description of the data into the genetree or report on the inconsistencies in the data with regards to the infinitely-many-sites model. It can also be used separately and has the command line

seq2tr seq_file tree_out_file [ancestor_base_file]

As an example consider the last data set. seq2tr produces the following output

Sequences are incompatible in broad sense

Compatibility matrix (& broad, * narrow, . ok)

12

1 .&

2 *.

Incompatible in the broad sense means that even toggling the 0 and 1's in each column, that is at each segregating site, (this produces a different rooted tree) will not make this data set consistent with the infinitely-many-sites model and a gene tree can not be produced.

The user can verify that changing 0's to 1's in either column still results in the same problem pattern of 0's and 1's (see references). Incompatible in the narrow sense means that this site is only a problem with the current arrangement of 1's and 0's for this particular rooted tree. Toggling sites thus classified will make the data infinitely-many-sites model compatible and able to be turned into a unique tree.

Although genetree provides feedback on which sites are incompatible but it may be more suitable to remove individuals rather than a particular site from the data. There are programs available that will prune the data for you according to weightings of columns (polymorphic sites) and rows (haplotypes) but the user may have some prior ideas as to what information should be retained and what can be pruned. Indicators that may aid in making these decisions include:

  • Maintaining the haplotype distribution amongst subpopulations
  • Maintaining the same proportions of individuals in the subpopulations

The less data has to be pruned the better. The infinitely-many-sites model has the distinct advantage however of computational efficiency. Even so, as subdivision and population growth are added, run times can be very long!

genetree Options

Typing genetree and pressing return will show some of the options that genetree supports. This is what will appear:

Usage: genetree tree_file theta runs seed {options}

Ancestral inference for gene trees

Version 8.3, Bob Griffiths, 11/05/98

Options on command line or in a file (or both)

{file containing options: @file_name}

-j input sequences [sequence file name],

-J input sequences [sequence file name] [ancestor file name],

(tree will be output in tree_file)

-z generate all possible trees with different roots

-e exponential growth rate file (array of rates)

[cr for more; space-cr to quit]

-s number of subpopulations (default - number in tree_file)

-m migration rate file (matrix of rates, diagonals zero)

-p subpopulation relative size file (array of rates)

[cr for more; space-cr to quit]

-f surface outfile_name (g m theta likelihood sd_like)

-g theta surface domain [theta0 theta1 surface_points]

-i growth magnification surface domain [g0 g1 surface_points]

-h migration surface domain [m0 m1 surface_points]

-H migration matrix surface [infile] [outfile]

-k tmrca distribution [n_cells top_value outfile]

-K tmrca and age distributions [n_cells top_value outfile] [cr for more; space-cr to quit]

-o output level [value 0-7] (default 3)

1 likelihood

2 show tree

3 TMRCA

4 age of mutations

5 MRCA distribution in subpopulations

6 mutation distribution in subpopulations

7 subpopulation TMRCA

-c tree picture file name (default: tree.ps)

-C [gene population size] [generation size] (for tree picture)

-2 calculate exact likelihood from 2 ancestors

-l multiple file summary [0 or 1] (default 1)

-b generate batch commands for multiple trees

-P calculate average pairwise differences

-x maximum events in one simulation run (default - 500)

-y maximum types in the ancestry of the sample (default - 30)

To run, genetree needs a data file, with the data in the tree format, a mutation rate, the number of replications or runs and a random number generator seed.

Example: The file named tree contains the data

1 : 1 0

2 : 0

3 : 2 0

1 : 3 1 0

Typing genetree tree 1.0 100000 771

puts the following output on the screen:

1 : 1 0

2 : 0

3 : 2 0

1 : 3 1 0

Estimated likelihood of the tree 9.2433e-04, se 3.9629e-05

Mean TMRCA 1.5203e+00, sd 7.0484e-01

Elapsed time = 1 sec

genetree tree 1.0 1000 771

Software version 7.7, 12/3/98

Date 3-13-1998, time 13:33:20

This is a sample from a panmictic population, there is no population subdivision.

Example 2: Typing the command

genetree tree1 2.0 100000 6652 -m mig

produces

1 2 : 2 0

0 2 : 1 0

1 1 : 3 1 0

0 1 : 0

Estimated likelihood of the tree 1.1607e-04, se 1.2648e-06

Mean TMRCA 1.3311e+00, sd 5.5821e-01

Elapsed time = 160 sec

genetree tree1 2.0 100000 6652 -m mig

Backward migration rate matrix

0.000 1.000

1.000 0.000

Software version 8.0, 28/03/98

Date 4-3-1998, time 12:09:16

This is a sample from a subdivided population which has two subpopulations. This is the default output for genetree. More or less output can be specified by using the -o option. To redirect the output to a file in UNIX the user should use the unix command >.

For example

genetree tree 1.0 100000 771 > tree.out

redirects the screen output to a file called tree.out. All output files are placed in the current directory unless otherwise directed. If a command in the syntax line, in the option explanations below, is surrounded by bracket () then this implies that this option is only necessary when a the sample is from a subdivided population. The migration matrix that is used in most of the examples below is contained in the file mig.

It is

0.0 1.0

1.0 0.0

-@ If there are many options required it is often convenient to write some or all of these

commands into a separate file, an input file. This shortens the command line considerably.

Syntax:genetree tree_file theta runs seed @input_file

The @ operator tells genetree to look in the file called input_file for the option parameters.

Example:genetree data6 1.0 100000 616 -2 @input

The file input contains

-m mig

-e exp1

-o 4

The order of the options does not matter.

-j Lets the user enter data in the form described in Data Preparation and Format . It is then changed into the tree format with 0 denoting the ancestral base. Note this option only takes the seq_file when the haplotypes are in the binary sequence form and assumes that 0 is the ancestral type at a site. All sites must be polymorphic otherwise genetree will not run and display the error message:

Site i must be a segregating site.

Syntax:genetree tree_file -j seq_file

Example:genetree tree.dat -j seq.dat

The file seq.dat contains

0 1 : 1 0 0 0

1 2 : 0 0 0 0

2 3 : 0 1 0 0

2 1 : 1 0 0 1

The file tree.dat has the gene tree

0 1 : 1 0

1 2 : 0

2 3 : 2 0

2 1 : 3 1 0

-J Converts sequence data from a more general form, using bases or binary data, into a tree and allows specification of the ancestor bases. Note: ancestor bases must be specified if bases are used and all sites must be polymorphic.

Syntax:genetree tree_file -J seq_file anc_file

Example:genetree tree.dat -J seq.dat anc.dat

The files seq.dat

0 1 : A . T

1 2 : A . C

1 2 : C T T

and anc.dat

A . T

produce tree.dat

0 1 : 0

1 2 : 3 0

1 2 : 2 1 0

Note: The dot may indicate a missing base (in which case it is assumed to be different to the known base) or an insertion or deletion.

-z Generates all rooted trees from a given tree in the file tree.dat. These can then used by the -b option to calculate the likelihood of the unrooted tree. The probability of the unrooted tree is simply the sum of all the rooted tree probabilities.There is no need to use the sequence file. All the rooted trees are deduced from the rooted tree that is already in the tree format, in the file tree.dat. All possible trees are generated in the files tree_dat.i, i=0,..,s, where s is the number of polymorphic sites or alternatively if the original tree file was tree then the rooted trees will be in the files tree.i, i=0,..,s,.