PUZZLE Logo TREE-PUZZLE Manual PPUZZLE Logo
Maximum likelihood analysis for nucleotide, amino acid, and two-state
data
Version 5.0
October 2000
Copyright 1999-2000 by Heiko A. Schmidt, Korbinian Strimmer, Martin
Vingron, and Arndt von Haeseler
Copyright 1995-1999 by Korbinian Strimmer and Arndt von Haeseler
Heiko A. Schmidt, email: h.schmidt@dkfz-heidelberg.de, Theoretical
Bioinformatics, DKFZ, Im Neuenheimer Feld 280, D-69124 Heidelberg,
Germany.
Korbinian Strimmer, email: korbinian.strimmer@zoo.ox.ac.uk, Department
of Zoology, University of Oxford, South Parks Road, Oxford OX1 3PS, UK.
Martin Vingron, email: vingron@dkfz-heidelberg.de, Theoretical
Bioinformatics, DKFZ, Im Neuenheimer Feld 280, D-69124 Heidelberg,
Germany.
Arndt von Haeseler, email: haeseler@eva.mpg.de, Max-Planck-Institute
for Evolutionary Anthropology, Inselstr. 22, D-04103 Leipzig, Germany.
The official name of the program has been changed to TREE-PUZZLE to
avoid legal conflict with the Fraunhofer Gesellschaft. We are sorry for
any inconvenience this may cause to you. Any reference to PUZZLE in
this package is only colloquial and refers to TREE-PUZZLE.
TREE-PUZZLE is a computer program to reconstruct phylogenetic trees
from molecular sequence data by maximum likelihood. It implements a
fast tree search algorithm, quartet puzzling, that allows analysis of
large data sets and automatically assigns estimations of support to
each internal branch. TREE-PUZZLE also computes pairwise maximum
likelihood distances as well as branch lengths for user specified
trees. Branch lengths can also be calculated under the
clock-assumption. In addition, TREE-PUZZLE offers a novel method,
likelihood mapping, to investigate the support of a hypothesized
internal branch without computing an overall tree and to visualize the
phylogenetic content of a sequence alignment. TREE-PUZZLE also conducts
a number of statistical tests on the data set (chi-square test for
homogeneity of base composition, likelihood ratio to test the clock
hypothesis, Kishino-Hasegawa test). The models of substitution provided
by TREE-PUZZLE are TN, HKY, F84, SH for nucleotides, Dayhoff, JTT,
mtREV24, BLOSUM 62, VT, WAG for amino acids, and F81 for two-state
data. Rate heterogeneity is modelled by a discrete Gamma distribution
and by allowing invariable sites. The corresponding parameters can be
inferred from the data set.
TREE-PUZZLE is available free of charge from
http://www.tree-puzzle.de/ (TREE-PUZZLE home page)
http://www.dkfz-heidelberg.de/tbi/tree-puzzle/ (TREE-PUZZLE home
page mirror at DKFZ)
http://iubio.bio.indiana.edu/soft/molbio/evolve (IUBio archive
www, USA)
ftp://iubio.bio.indiana.edu/molbio/evolve (IUBio archive ftp,
USA)
ftp://ftp.ebi.ac.uk/pub/software (European Bioinformatics
Institute, UK)
ftp://ftp.pasteur.fr/pub/GenSoft (Institut Pasteur, France)
TREE-PUZZLE is written in ANSI C. It will run on most personal
computers and workstations if compiled by an appropriate C compiler.
The tree reconstruction part of TREE-PUZZLE has been parallelized using
the Message Passing Interface (MPI) library standard (Snir et al., 1998
and Gropp et al., 1998). If desired to run TREE-PUZZLE in parallel you
need an implementation of the MPI library on your system as well.
Please read the installation section for more details.
We suggest that this documentation should be read before using
TREE-PUZZLE the first time. If you do not have the time to read this
manual completely please do read at least the sections Input/Output
Conventions and Quick Start below. Then you should be able to use the
TREE-PUZZLE program, especially if you have some experience with the
PHYLIP programs. The other sections should then be read at a later
time.
To find out what's new in version 5.0 please read the Version History.
_______________________________________________________________________
Legal Stuff
TREE-PUZZLE 5.0 is (c) 1999-2000 Heiko A. Schmidt, Korbinian Strimmer,
Martin Vingron, and Arndt von Haeseler.
Earlier PUZZLE versions were (c) 1995-1999 by Korbinian Strimmer and
Arndt von Haeseler.
The software and its accompanying documentation are provided as is,
without guarantee of support or maintenance. The whole package is
licensed under the GNU public license, except for the parts indicated
in the sources where the copyright of the authors does not apply.
Please see http://www.opensource.org/licenses/gpl-license.html for
details.
Installation
Introduction
TREE-PUZZLE is an ANSI C application to reconstruct phylogenetic trees
from molecular sequence data by maximum likelihood. It implements a
fast tree search algorithm, quartet puzzling, that allows analysis of
large data sets and automatically assigns estimations of support to
each internal branch. Rate heterogeneity (invariable sites plus Gamma
distributed rates) is incorporated in all models of substitution
available (nucleotides: SH, TN, HKY, F84, and submodels; amino acids:
Dayhoff, JTT, mtREV24, BLOSUM 62, VT, and WAG; two-state data: F81).
All parameters including rate heterogeneity can be estimated from the
data by maximum likelihood approaches. TREE-PUZZLE also computes
pairwise maximum likelihood distances as well as branch lengths for
user specified trees. In addition, TREE-PUZZLE offers a novel method,
likelihood mapping, to investigate the support of internal branches
without computing an overall tree.
Input/Output Conventions
Quick Start
-
Prepare your sequence input file and, optionally, your tree input file.
Then start the TREE-PUZZLE program. TREE-PUZZLE will choose
automatically the nucleotide or the amino acid mode. If more than 85%
of the characters (not counting the - and ?) in the sequences are A, C,
G, T, U or N, it will be assumed that the sequences consists of
nucleotides. If your data set contains amino acids TREE-PUZZLE suggests
whether you have amino acids encoded on mtDNA or on nuclear DNA, and
selects the appropriate model of amino acid evolution. If your data set
contains nucleotides the default model of sequence evolution chosen is
the HKY model. Parameters need not to be specified, they will be
estimated by a maximum likelihood procedure from the data. If
TREE-PUZZLE detects a usertree file stated at the command line or one
called "intree" it automatically switches to the input tree mode.
-
Then, a menu (PHYLIP "look and feel") appears with default options set.
It is possible to change all available options. For example, if you
want to incorporate rate heterogeneity you have to select option "w" as
rate heterogeneity is switched off by default. Then type "y" at the
input prompt and start the analysis. You will see a number of status
messages on the screen during computation. When the analysis is
finished all output files (e.g., "outfile", "outtree", "outdist",
"outqlist", "outlm.eps", "outpstep", "outptlist" or
"INFILENAME.puzzle", "INFILENAME.tree", "INFILENAME.dist",
"INFILENAME.qlist", "INFILENAME.eps", "INFILENAME.pstep",
"INFILENAME.ptorder") will be in the same directory as the input files.
-
To obtain a high quality picture of the output tree (including node
labels) you might want to use use the TreeView program by Roderic Page.
It is available free of charge and runs on MacOS and MS-Windows. It can
be retrieved from http://taxonomy.zoology.gla.ac.uk/rod/treeview.html.
TreeView understands the CLUSTAL W treefile conventions, reads
multifurcating trees and is able to simultaneously display branch
lengths and support values for each branch. Open the
"INFILENAME.tree"/"outtree" file with TreeView, choose "Phylogram" to
draw branch lengths, and select "Show internal edge labels".
-
On a Unix you can use the TreeTool program to display and manipulate
TREE-PUZZLE trees (See
ftp://rdp.life.uiuc.edu/pub/RDP/programs/TreeTool for precompiled Sun
executables. A version that runs on Linux has been prepared by Anders
Holmberg from the Dept. of Biochemistry at the Royal Institute of
Technology, Stockholm).
Models of Sequence Evolution
-
Here we give a brief overview over the models implemented in
TREE-PUZZLE. Formulas are written in TeX style.
-
Models of Substitution
-
The substitution process is modelled as reversible time homogeneous
stationary Markov process. If the corresponding stationary nucleotide
(amino acid) frequencies are denoted pi_i the most general rate matrix
for the transition from nucleotide (amino acid) i to j can be written
as
| Q_{ij} pi_j for i != j
R_{ij} = |
| - Sum_m Q_{im} pi_m for i == j
-
The matrix Q_{ij} is symmetric with Q_{ii} == 0 (diagonals are zero).
For nucleotides the most general model built into TREE-PUZZLE is the
Tamura-Nei model (TN, Tamura and Nei, 1993). The matrix Q_{ij} for this
model equals
| 4*t*gamma/(gamma+1) for i -> j pyrimidine transition
|
Q_{ij} = | 4*t/(gamma+1) for i -> j purine transition
|
| 1 for i -> j transversion
-
The parameter gamma is called the "Y/R transition parameter" whereas t
is the "Transition/transversion parameter". If gamma is equal to 1 we
get the HKY model (Hasegawa et al., 1985). Note, the ratio of the
transition and transversion rates (without frequencies) is kappa = 2*t.
There is a subtle but important difference between the
transition-transversion parameter, the expected transition-transversion
ratio, and the observed transition transversion ratio. The
transition-transversion parameter simply is a parameter in the rate
matrix. The expected transition-transversion ratio is the ratio of
actually occurring transitions to actually occurring transversions
taking into account nucleotide frequencies in the alignment. Due to
saturation and multiple hits not all substitutions are observable.
Thus, the observed transition-transversion ratio counts observable
transitions and transversions only. If the base frequencies in the HKY
model are homogeneous (pi_i = 0.25) HKY further reduces to the Kimura
model. In this case t is identical to the expected
transition/transversion ratio. If t is set to 0.5 the Jukes-Cantor
model is obtained. The F84 model (as implemented in the various PHYLIP
programs, Felsenstein, 1984) is a special case of the Tamura-Nei model.
-
For amino acids the matrix Q_{ij} is fixed and does not contain any
free parameters. Depending on the type of input data four different
Q_{ij} matrices are available in TREE-PUZZLE. The Dayhoff (Dayhoff et
al., 1978) and JTT (Jones et al., 1992) matrices are for use with
proteins encoded on nuclear DNA, the mtREV24 (Adachi and Hasegawa,
1996) matrix is for use with proteins encoded on mtDNA, and the BLOSUM
62 (Henikoff and Henikoff, 1992) and the WAG model (Whelan and Goldman)
are for more distantly related amino acid sequences. The WAG matrix has
been infered from a database of 3905 globular protein sequences,
forming 182 distinct gene families spanning a broad range of
evolutionary distances (Whelan and Goldman). The VT model is based an
new estimator for amino acid replacement rates, the resolvent method.
The VT matrix has been computed from a large set alignments of varying
degree of divergence. Hence VT is for use with proteins of distant
relatedness as well (Mueller and Vingron, 2000).
-
For doublets (pairs of dependent nucleotides) the SH model (Schoeniger
and von Haeseler, 1994) is implemented in TREE-PUZZLE. The
corresponding matrix Q_{ij} reads
| 2*t for i -> j transition substitution
|
Q_{ij} = | 1 for i -> j transversion substitution
|
| 0 for i -> j two substitutions
-
The SH model basically is a F81 model (Felsenstein, 1981) for single
substitutions in doublets.
-
Models of Rate Heterogeneity
Available Options
All options can be selected and changed after TREE-PUZZLE has read the
input file. Depending on the input files options are preselected and
displayed in a menu ("PHYLIP look and feel"):
GENERAL OPTIONS
b Type of analysis? Tree reconstruction
k Tree search procedure? Quartet puzzling
v Approximate quartet likelihood? No
u List unresolved quartets? No
n Number of puzzling steps? 1000
j List puzzling step trees? No
o Display as outgroup? Gibbon
z Compute clocklike branch lengths? No
e Parameter estimates? Approximate (faster)
x Parameter estimation uses? Neighbor-joining tree
SUBSTITUTION PROCESS
d Type of sequence input data? Nucleotides
m Model of substitution? HKY (Hasegawa et al. 1985)
t Transition/transversion parameter? Estimate from data set
f Nucleotide frequencies? Estimate from data set
RATE HETEROGENEITY
w Model of rate heterogeneity? Uniform rate
Quit [q], confirm [y], or change [menu] settings:
-
By typing the letters shown in the menu you can either change settings
or enter new parameters. Some options (for example "m" and "w") can be
invoked several times to switch through a number of different settings.
The parameters of the models of sequence evolution can be estimated
from the data by a variety of procedures based on maximum likelihood.
The analysis is started by typing "y" at the input prompt. To quit the
program type "q".
-
The following table lists in alphabetical order all TREE-PUZZLE
options. Be aware, however, not all of them are accessible at the same
time:
-
Gamma rate heterogeneity parameter alpha. This is the so-called shape
parameter of the Gamma distribution.
b
-
Type of analysis. Allows to switch between tree reconstruction by
maximum likelihood and likelihood mapping.
c
-
Number of rate categories (4-16) for the discrete Gamma distribution
(rate heterogeneity).
d
-
Data type. Specifies whether nucleotide, amino acid sequences, or
two-state data serve as input. The default is automatically set by
inspection of the input data. After TREE-PUZZLE has selected an
appropriate data type (marked by 'Auto:') the 'd'-option changes the
type in the following order: selected type -> Nucleotides -> Amino
acids -> automatically selected type.
e
-
Approximation option. Determines whether an approximate or the exact
likelihood function is used to estimate parameters of the models of
sequence evolution. The approximate likelihood function is in most
cases sufficient and is faster.
f
-
Base frequencies. The maximum likelihood calculation needs the
frequency of each nucleotide (amino acid, doublet) as input.
TREE-PUZZLE estimates these values from the sequence input data. This
option allows specification of other values.
g
-
Group sequences in clusters. Allows to define clusters of sequences as
needed for the likelihood mapping analysis. Only available when
likelihood mapping is selected ("b" option).
h
-
Codon positions or definition of doublets. For nucleotide data only. If
the TN or HKY model of substitution is used and the number of sites in
the alignment is a multiple of three the analysis can be restricted to
each of the three codon positions and to the 1st and 2nd positions. If
the SH model is used this options allows to specify that the 1st and
2nd codon positions in the alignment define a doublet.
i
-
Fraction of invariable sites. Probability of a site to be invariable.
This parameter can be estimated from the data by TREE-PUZZLE (only if
the approximation option for the likelihood function is turned off).
j
-
List puzzling steps trees. Writes all intermediate trees (puzzling step
trees) used to compute the quartet puzzling tree into a file, either as
a list of topologies ordered by number of occurrences (*.ptorder), or
as list about the chronological occurrence of the topologies (*.pstep),
or both.
k
-
Tree search. Determines how the overall tree is obtained. The topology
is either computed with the quartet puzzling algorithm or is defined by
the user. Maximum likelihood branch lengths will be computed for this
tree. Alternatively, a maximum likelihood distance matrix only can also
be computed (no overall tree).
l
-
Location of root. Only for computation of clock-like maximum likelihood
branch lengths. Allows to specify the branch where the root should be
placed in an unrooted tree topology. For example, in the tree
(a,b,(c,d)) l = 1 places the root at the branch leading to sequence a
whereas l=5 places the root at the internal branch.
m
-
Model of substitution. The following models are implemented for
nucleotides: the Tamura-Nei (TN) model, the Hasegawa et al. (HKY)
model, and the Schoeniger & von Haeseler (SH) model. The SH model
describes the evolution of pairs of dependent nucleotides (pairs are
the first and the second nucleotide, the third and the fourth
nucleotide and so on). It allows for specification of the
transition-transversion ratio. The original model (Schoeniger & von
Haeseler, 1994) is obtained by setting the transition-transversion
parameter to 0.5. The Jukes-Cantor (1969), the Felsenstein (1981), and
the Kimura (1980) model are all special cases of the HKY model.
For amino acid sequence data the Dayhoff et al. (Dayhoff) model, the
Jones et al. (JTT) model, the Adachi and Hasegawa (mtREV24) model, the
Henikoff and Henikoff (BLOSUM 62), the Mueller and Vingron (VT), and
the Whelan and Goldman (WAG) substitution model are implemented in
TREE-PUZZLE. The mtREV24 model describes the evolution of amino acids
encoded on mtDNA, and BLOSUM 62 is for distantly related amino acid
sequences, as well as the VT model. After TREE-PUZZLE has selected an
appropriate amino acid substitution model (marked by 'Auto:') the
'm'-option changes the model in the following order: selected model ->
Dayhoff -> JTT -> mtREV24 -> BLOSUM62 -> VT -> WAG -> automatically
selected model
For more information please read the section in this manual about
models of sequence evolution. See also option "w" (model of rate
heterogeneity).
n
-
If tree reconstruction is selected: number of puzzling steps. Parameter
of the quartet puzzling tree search. Generally, the more sequences are
used the more puzzling steps are advised. The default value varies
depending on the number of sequences (at least 1000).
If likelihood mapping is selected: number of quartets in a likelihood
mapping analysis. Equal to the number of dots in the likelihood mapping
diagram. By default 10000 dots/quartets are assumed. To use all
possible quartets in clustered likelihood mapping you have to specify a
value of n=0.
o
-
Outgroup. For displaying purposes of the unrooted quartet puzzling tree
only. The default outgroup is the first sequence of the data set.
p
-
Constrain the TN model to the F84 model. This option is only available
for the Tamura-Nei model. With this option the expected (!)
transition-transversion ratio for the F84 model have to be entered and
TREE-PUZZLE computes the corresponding parameters of the TN model (this
depends on base frequencies of the data). This allows to compare the
results of TREE-PUZZLE and the PHYLIP maximum likelihood programs which
use the F84 model.
q
-
Quits analysis.
r
-
Y/R transition parameter. This option is only available for the TN
model. This parameter is the ratio of the rates for pyrimidine
transitions and purine transitions. You do not need to specify this
parameter as TREE-PUZZLE estimates it from the data. For precise
definition please read the section in this manual about models of
sequence evolution.
s
-
Symmetrize doublet frequencies. This option is only available for the
SH model. With this option the doublet frequencies are symmetrized. For
example, the frequencies of "AT" and "TA" are then set to the average
of both frequencies.
t
-
Transition/transversion parameter. For nucleotide data only. You do not
need to specify this parameter as TREE-PUZZLE estimates it from the
data. The precise definition of this parameter is given in the section
on models of sequence evolution in this manual.
u
-
Show unresolved quartets. During the quartet puzzling tree search
TREE-PUZZLE counts the number of unresolved quartet trees. An
unresolved quartet is a quartet where the maximum likelihood values for
each of the three possible quartet topologies are so similar that it is
not possible to prefer one of them (Strimmer, Goldman, and von
Haeseler, 1997). If this option is selected you will get a detailed
list of all starlike quartets. Note, for some data sets there may be a
lot of unresolved quartets. In this case a list of all unresolved
quartets is probably not very useful and also needs a lot of disk
space.
v
-
Approximate quartet likelihood. For the quartet puzzling tree search
only. Only for very small data sets it is necessary to compute an exact
maximum likelihood. For larger data sets this option should always be
turned on.
w
-
Model of rate heterogeneity. TREE-PUZZLE provides several different
models of rate heterogeneity: uniform rate over all sites (rate
homogeneity), Gamma distributed rates, two rates (1 invariable + 1
variable), and a mixed model (1 invariable rate + Gamma distributed
rates). All necessary parameters can be estimated by TREE-PUZZLE. Note
that whenever invariable sites are taken into account the parameter
estimation will invoke the "e" option to use an exact likelihood
function. For more detailed information please read the section in this
manual about models of sequence evolution. See also option "m" (model
of substitution).
x
-
Selects the methods used in the estimation of the model parameters.
Neighbor-joining tree means that a NJ tree is used to estimate the
parameters. Quartet sampling means that a number of random sets of four
sequences are selected to estimate parameters.
y
-
Starts analysis.
z
-
Computation of clock-like maximum likelihood branch lengths. This
option also invokes the likelihood ratio clock test.
Other Features
-
For nucleotide data TREE-PUZZLE computes the expected
transition/transversion ratio and the expected pyrimidine
transition/purine transition ratio corresponding to the selected model.
Base frequencies play an important role in the calculation of both
numbers.
-
TREE-PUZZLE also tests with a 5% level chi-square-test whether the base
composition of each sequence is identical to the average base
composition of the whole alignment. All sequences with deviating
composition are listed in the TREE-PUZZLE report file. It is desired
that no sequence (possibly except for the outgroup) has a deviating
base composition. Otherwise a basic assumption implicit in the maximum
likelihood calculation is violated.
-
A hidden feature of TREE-PUZZLE (since version 2.5) is the employment
of a weighting scheme of quartets (Strimmer, Goldman, and von Haeseler,
1997) in the quartet puzzling tree search.
-
TREE-PUZZLE also computes the average distance between all pairs of
sequences (maximum likelihood distances). The average distances can be
viewed as a rough measure for the overall sequence divergence.
-
If more than one input tree is provided TREE-PUZZLE uses the
Kishino-Hasegawa test (1989) to check which trees are significantly
worse than the best tree.
-
If clock-like maximum-likelihood branch lengths are computed
TREE-PUZZLE checks with the help of a likelihood-ratio test
(Felsenstein, 1988) whether the data set is clock-like.
-
TREE-PUZZLE also detects sequences that occur more than once in the
data and that therefore can be removed from the data set to speed up
analysis.
-
If rate heterogeneity is taken into account in the analysis TREE-PUZZLE
also computes the most probable assignment of rate categories to
sequence positions, according Felsenstein and Churchill (1996).
Interpretation and Hints
-
Quartet Puzzling Support Values
The quartet puzzling (QP) tree search estimates support values for each
internal branch. They can be interpreted in much the same way as
bootstrap values (though they should not be confused with them).
Branches showing a QP reliability from 90% to 100% can be considered
very strongly supported. Branches with lower reliability (> 70%) can in
principle be also trusted but in this case it is advisable to check how
well the respective internal branch does in comparison to other
branches in the tree (i.e. check relative reliability). If you are
interested in a branch with a low confidence it is also important to
check the alternative groupings that are not included in the QP tree
(they are listed in the TREE-PUZZLE report file in *.** format). There
should be a substantial gap between the lowest reliability value of the
QP tree and the most frequent grouping that is not included in the QP
tree.
-
Percentage of Unresolved Quartets
TREE-PUZZLE computes the number and the percentage of completely
unresolved maximum likelihood quartets. An unresolved quartet is a
quartet where the maximum likelihood values for each of the three
possible quartet topologies are so similar that it is not possible to
prefer one of them (Strimmer, Goldman, and von Haeseler, 1997). The
percentage of the unresolved quartets among all possible quartets is an
indicator of the suitability of the data for phylogenetic analysis. A
high percentage usually results in a highly multifurcating quartet
puzzling tree. If you only have a few unresolved quartets we recommend
to invoke option "u" to get a list of all these quartets. In a
likelihood mapping analysis the percentage of completely unresolved
quartets is shown in the central region of the triangle diagram.
-
Automatic Parameter Estimation
-
Likelihood Mapping
Likelihood mapping (Strimmer and von Haeseler, 1997) is a method to
analyzethe support for internal branches in a tree without having to
compute an overall tree. Every internal branch in an a completely
resolved tree defines up to four clusters of sequences. Sometimes only
the relationship of these groups are of interest and not details of the
structure of the clusters themselves. Then a likelihood mapping
analysis is sufficient. The corresponding likelihood mapping triangle
diagrams (as contained in various output files generated by
TREE-PUZZLE) will illucidate the possible relationships in detail.
-
Batch Mode
Running TREE-PUZZLE from a Unix batch file is straightforward despite
the lack of command switches. For example, to run TREE-PUZZLE with a
the transition/transversion parameter equal to 10 the following lines
in a batch file are sufficient:
puzzle << !
t
10
y
!
All other parameters can also be accessed the same way.
Limits and Error Messages
-
TREE-PUZZLE has a built-in limit to allow data sets only up to 257
sequences in order to avoid overflow of internal integer variables. At
least 32767 sites should be possible depending on the compiler used.
Computation time will be the largest constraint even if sufficient
computer memory is available. If rate heterogeneity is taken into
account every additional category slows down the overall computation by
the amount of time needed for one complete run assuming rate
homogeneity.
-
If problems are encountered TREE-PUZZLE terminates program execution
and returns a plain text error message. Depending on the severity
errors can be classified into three groups:
-
"HALT " errors: Very severe. You should never ever see one of these
messages. If so, please contact the developers!
"Unable to proceed" errors: Harmless but annoying. Mostly memory errors
(not enough RAM) or problems with the format of the input files.
Other errors: Completely uncritical. Occur mostly when options of
TREE-PUZZLE are being set.
-
A standard machine (1996 Unix workstation) with 32 to 64 MB RAM
TREE-PUZZLE can easily do maximum likelihood tree searches including
estimation of support values for data sets with 50-100 sequences. As
likelihood mapping is not memory consuming and computationally quite
fast it can be applied to large data sets as well.
Are Quartets Reliable?
-
Quartets may be intrinsically one of the most difficult phylogenies to
resolve accurately (cf. Hillis, 1996). It has been asked whether this
is a problem for quartet puzzling because it works with quartets.
-
However, this is not true. According to Hillis' findings (Hillis,
1996), quartets can be hard, but extra information helps. That is, if
all you have are data on species (A, B, C, D) then it might be
relatively difficult to find the correct tree for them. But if you have
additional data (species E, F, G, ...) and try to find a tree for all
the species, then that part of the tree relating (A, B, C, D) will more
likely be correct than if you had just the data for (A, B, C, D). In
Hillis' big 'model' tree, there are many examples of subsets of 4
species which in themselves might be hard to resolve correctly, but
which are correctly resolved thanks to the (...large amount of...)
additional data. TREE-PUZZLE (quartet puzzling) also gains advantage
from extra data in the same way. It's 'understanding' or resolution of
the quartet (A, B, C, D) might be incorrect, but the information on the
relationships of (A, B, C, D) implicit in its treatment of (A, B, C,
E), (A, B, E, D), (A, E, C, D), (E, B, C, D), (A, B, C, F), (A, B, F,
D), (A, F, C, D), (F, B, C, D), (A, B, C, G), etc. etc. should overcome
this problem.
-
The facts about how well TREE-PUZZLE actually works have been
investigated in the Strimmer and von Haeseler (1996) and Strimmer,
Goldman, and von Haeseler (1997) papers. Their results cannot be
altered by Hillis' findings. Considered as a heuristic search for
maximum likelihood trees, quartet puzzling works very well.
-
(This section follows N. Goldman, personal communication).
Other Programs
There are a number of other very useful and widespread programs to
reconstruct phylogenetic relationships and to analyse molecular
sequence data that are available free of charge. Here are the URLS of
some web pages that provide links to most of them (including the PHYLIP
package and the MOLPHY and PAML maximum likelihood programs):
Joe Felsenstein's list of programs (well-organized and pretty
exhaustive):
http://evolution.genetics.washington.edu/phylip/software.html
"Tree of Life" software page:
http://phylogeny.arizona.edu/tree/programs/programs.html
European Bioinformatics Institute:
http://www.ebi.ac.uk/biocat/biocat.html
Acknowledgements
The maximum likelihood kernel of TREE-PUZZLE is an offspring of the
program NucML/ProtML version 2.2 by Jun Adachi and Masami Hasegawa
(ftp://sunmh.ism.ac.jp/pub/molphy). We thank them for generously
allowing us to use the source code of their program. We would also like
to thank the European Bioinformatics Institute (EBI), the Institut
Pasteur, and the University of Indiana (i.e. Don Gilbert) for kindly
distributing the TREE-PUZZLE program. We thank Stephane Bortzmeyer for
his with debugging of floating point exception errors. We also thank
Peter Foster for pointing out the inconsistency in the invariable site
models in respect to other programs. Finally we thank the Deutsche
Forschungsgemeinschaft (VI 160/3-1 and Ha 1628/4-1) and the
Max-Planck-Society for financial support.
References
-
Adachi, J., and M. Hasegawa. 1996. MOLPHY: programs for molecular
phylogenetics, version 2.3. Institute of Statistical Mathematics,
Tokyo.
-
Adachi, J., and M. Hasegawa. 1996. Model of amino acid substitution in
proteins encoded by mitochondrial DNA. J. Mol. Evol. 42: 459-468.
-
Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. 1978. A model of
evolutionary change in proteins. In: Dayhoff, M. O. (ed.) Atlas of
Protein Sequence Structure, Vol. 5, Suppl. 3. National Biomedical
Research Foundation, Washington DC, pp. 345-352.
-
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: A maximum
likelihood approach. J. Mol. Evol. 17: 368-376.
-
Felsenstein, J. 1984. Distance methods for inferring phylogenies: A
Justification. Evolution 38: 16-24.
-
Felsenstein, J. 1988. Phylogenies from molecular sequences: Inference
and reliability. Annu. Rev. Genet. 22: 521-565.
-
Felsenstein, J. 1993. PHYLIP (Phylogeny Inference Package) version
3.5c. Distributed by the author. Department of Genetics, University of
Washington, Seattle.
-
Felsenstein, J., and G.A. Churchill. 1996. A hidden Markov model
approach to variation among sites in rate of evolution. Mol. Biol.
Evol. 13: 93-104.
-
Gropp, W., S. Huss-Lederman, A. Lumsdaine, E. Lusk, B. Nitzberg, W.
Saphir, and M. Snir. 1998. MPI - The Complete Reference: Volume 2, The
MPI Extensions. 2nd Edition, The MIT Press, Cambridge, MA.
-
Gu, X., Y.-X. Fu, and W.-H. Li. 1995. Maximum likelihood estimation of
the heterogeneity of substitution rate among nucleotide sites. Mol.
Biol. Evol. 12: 546-557.
-
Hasegawa, M., H. Kishino, and K. Yano. 1985. Dating of the human-ape
splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:
160-174.
-
Henikoff, S., J. G. Henikoff. 1992. Amino acid substitution matrices
from protein blocks. PNAS (USA) 89:10915-10919.
-
Hillis, D. M. 1996. Inferring complex phylogenies. Nature 383:130-131.
-
Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules.
In: Munro, H. N. (ed.) Mammalian Protein Metabolism, New York: Academic
Press, pp. 21-132.
-
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid
generation of mutation data matrices from protein sequences. CABIOS 8:
275-282.
-
Kimura, M. 1980. A simple method for estimating evolutionary rates of
base substitutions through comparative studies of nucleotide sequences.
J. Mol. Evol. 16: 111-120.
-
Kishino, H., and M. Hasegawa. 1989. Evaluation of the maximum
likelihood estimate of the evolutionary tree topologies from DNA
sequence data, and the branching order in Hominoidea. J. Mol. Evol. 29:
170-179.
-
Mueller, T., and M. Vingron. 2000. Modeling Amino Acid Replacement. J.
Comp. Biol., to appear (preprint of the article)
-
Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method
for reconstructing phylogenetic trees. Mol. Biol. Evol. 4: 1406-425.
-
Schoeniger, M., and A. von Haeseler. 1994. A stochastic model for the
evolution of autocorrelated DNA sequences. Mol. Phyl. Evol. 3: 240-247.
-
Snir, M., S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra. 1998.
MPI - The Complete Reference: Volume 1, The MPI Core. 2nd Edition, The
MIT Press, Cambridge, MA.
-
Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet
maximum likelihood method for reconstructing tree topologies. Mol.
Biol. Evol. 13: 964-969.
-
Strimmer, K., N. Goldman, and A. von Haeseler. 1997. Bayesian
probabilities and quartet puzzling. Mol. Biol. Evol. 14: 210-211.
-
Strimmer, K., and A. von Haeseler. 1997. Likelihood-mapping: a simple
method to visualize phylogenetic content of a sequence alignment. PNAS
(USA). 94:6815-6819.
-
Tamura, K., and M. Nei. 1993. Estimation of the number of nucleotide
substitutions in the control region of mitochondrial DNA in humans and
chimpanzees. Mol. Biol. Evol. 10: 512-526.
-
Tamura K. 1994. Model selection in the estimation of the number of
nucleotide substitutions. Mol. Biol. Evol. 11: 154-157.
-
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W:
Improving the sensitivity of progressive multiple sequence alignment
through sequence weighting, positions-specific gap penalties and weight
matrix choice. Nucl. Acids Res. 22: 4673-4680.
-
Whelan, S. and Goldman, N. 2000. A new empirical model of amino acid
evolution. Manuscript in prep.
-
Yang, Z. 1994. Maximum likelihood phylogenetic estimation from DNA
sequences with variable rates over sites: approximate methods. J. Mol.
Evol. 39:306-314.
Known Bugs
On Alpha based computers sometimes floating point exception errors
occur. Some of those result on a bug in the malloc routine in the
system routines of the Compaq operating system. We recomend to use the
GNU cc compiler (http://egcs.gnu.org), which does not use the system
malloc routine. For other occurrances of the floating point exception
we need datasets and information about the operating system to
reproduce and debug those errors.
Version History
-
The TREE-PUZZLE program has first been distributed in 1995 under the
name PUZZLE. Since then it has been continually improved. Here is a
list of the most important changes.
-
5.0 Puzzle tree reconstruction part parallelized using the MPI standard
(Message Passing Interface).
Possibility added to give input file and user tree file at the command
line. Output files renamed to the form PREFIX.EXTENSION, where PREFIX
is the input file name or, if used, the user tree file name. The
EXTENSION could be one of the following: puzzle (PUZZLE report), tree
(tree file), dist (ML distance file), eps (likelihood mapping output in
eps format), qlist (bad quartets), qstep (puzzling step tree IDs as
they occur in the analysis), or qtorder (sorted unique list of puzzling
step trees).
The likelihood value is added to the treefile as a leading comment ("[
lh=x.xxx ]") to the tree string.
VT (variable time) matrix (Mueller and Vingron, 2000) and WAG matrix
(Whelan and Goldman, 2000) added to the AA substitution models.
The Data type and AA-model options in the menu now show the
automatically set type/model first. These can now be changed using 'd'
or 'm' key in an order independent from the type/model selected. This
makes it possible to select a desired AA substitution model or data
type by piping letters to the standard input without knowing PUZZLE's
preselection.
Parameters are written to file when estimated before evaluation of the
quartets.
The inconsistency to respect to other programs in handling invariable
sites has been fixed.
Some minor bug fixes (e.g. the clockbug and another in the optimization
routine have been fixed).
4.0.2 Update to provide precompiled Windows 95/98/NT executables. In
addition: Internal rearrangement of rate matrices. Improved BLOSUM 62
matrix. Endless input loop for input files restricted to 10 trials.
Source code clean up to remove compile time warnings. Explicit quit
option in menu. Changes in NJ tree code. Updates of documentation
(address changes, correction of errors).
4.0.1 Maintenance release. Correction of mtREV matrix. Fix of the
"intree bug". Removal of stringent runtime-compatibility check to allow
out-of-the-box compile on alpha. More accurate gamma distribution
allowing 16 instead of 8 categories and ensuring a better alpha > 1.0.
Update of documentation (mainly address changes). More Unix-like file
layout, and change of license to GPL.
4.0 Executables for Windows 95/NT and OS/2 instead of MS-DOS.
Computation of clock-like branch lengths (also for amino acids and for
non-binary trees). Automatic likelihood ratio clock test. Model for
two-state sequences data (0,1) included. Display of most probable
assignment of rates to sites. Identification of groups of identical
sequences. Possibility to read multiple input trees. Kishino-Hasegawa
test to check whether trees are significantly different. BLOSUM 62
model of amino acid substitution (Henikoff-Henikoff, 1992). Use of
parameter alpha instead of eta = 1/(1+alpha) (for rate heterogeneity).
Improvements to user interface. SH model can be applied to 1st and 2nd
codon positions. Automatic check for compatible compiler settings.
Workaround for severe runtime problem when the gcc compiler was used.
3.1 Much improved user interface to rate heterogeneity (less confusing
menu, rearranged outfile, additional out-of-range check). Possibility
to read rooted input trees (automatic removal of basal bifurcation).
Computation of average distance between all pairs of sequences. Fix of
a bug that caused PUZZLE 3.0 to crash on some systems (DEC Alpha).
Cosmetic changes in program and documentation.
3.0 Rate heterogeneity included in all models of substitution (Gamma
distribution plus invariable sites). Likelihood mapping analysis with
Postscript output added. Much more sophisticated maximum likelihood
parameter estimation for all model parameters including those of rate
heterogeneity. Codon positions selectable. Update to mtREV24. New icon.
Less verbose runtime messages. HTML documentation. Better internal
error classification. More information in outfile (number of constant
positions etc.).
2.5.1 Fix of a bug (present only in version 2.5) related to computation
of the variance of the maximum likelihood branch lengths that caused
occasional crashes of PUZZLE on some systems when applied to data sets
containing many very similar sequences. Drop of support for non-FPU
Macintosh version. Corrections in manual.
2.5 Improved QP algorithm (Strimmer, Goldman, and von Haeseler, 1997).
Bug fixes in ML engine, computation of ML distances and ML branch
lengths, optional input of a user tree, F84 model added, estimation of
all TN model parameters and corresponding standard errors, CLUSTAL W
treefile convention adopted to allow to show branch lengths and QP
support values simultaneously, display of unresolved quartets, update
of mtREV matrix, source code more compatible with some almost-ANSI
compilers, more safety checks in the code.
2.4 Automatic data type recognition, chi-square-test on base
composition, automatic selection of best amino acid model, estimation
of transition-transversion parameter, ASCII plot of quartet puzzling
tree into the outfile.
2.3 More models, many usability improvements, built-in consensus tree
routines, more supported systems, bug fixes, no more dependencies of
input order. First EBI distributed version.
2.2 Optimized internal data structure requiring much less computer
memory. Bug fixes.
2.1 Bug fixes concerning algorithm and transition/transversion
parameter.
2.0 Complete revision merging the maximum likelihood and the quartet
puzzling routines into one user friendly program. First electronic
distribution.
1.0 First public release, presented at the 1995 phylogenetic workshop
(15-17 June 1995) at the University of Bielefeld, Germany.
|