version 3.6
PROTPARS -- Protein Sequence Parsimony Method
(C) Copyright 1986-2002 by the University of Washington. Written by
Joseph Felsenstein. Permission is granted to copy this document
provided that no fee is charged for it and that this copyright notice
is not removed.
This program infers an unrooted phylogeny from protein sequences, using
a new method intermediate between the approaches of Eck and Dayhoff
(1966) and Fitch (1971). Eck and Dayhoff (1966) allowed any amino acid
to change to any other, and counted the number of such changes needed
to evolve the protein sequences on each given phylogeny. This has the
problem that it allows replacements which are not consistent with the
genetic code, counting them equally with replacements that are
consistent. Fitch, on the other hand, counted the minimum number of
nucleotide substitutions that would be needed to achieve the given
protein sequences. This counts silent changes equally with those that
change the amino acid.
The present method insists that any changes of amino acid be consistent
with the genetic code so that, for example, lysine is allowed to change
to methionine but not to proline. However, changes between two amino
acids via a third are allowed and counted as two changes if each of the
two replacements is individually allowed. This sometimes allows changes
that at first sight you would think should be outlawed. Thus we can
change from phenylalanine to glutamine via leucine in two steps total.
Consulting the genetic code, you will find that there is a leucine
codon one step away from a phenylalanine codon, and a leucine codon one
step away from glutamine. But they are not the same leucine codon. It
actually takes three base substitutions to get from either of the
phenylalanine codons TTT and TTC to either of the glutamine codons CAA
or CAG. Why then does this program count only two? The answer is that
recent DNA sequence comparisons seem to show that synonymous changes
are considerably faster and easier than ones that change the amino
acid. We are assuming that, in effect, synonymous changes occur so much
more readily that they need not be counted. Thus, in the chain of
changes TTT (Phe) --> CTT (Leu) --> CTA (Leu) --> CAA (Glu), the middle
one is not counted because it does not change the amino acid (leucine).
To maintain consistency with the genetic code, it is necessary for the
program internally to treat serine as two separate states (ser1 and
ser2) since the two groups of serine codons are not adjacent in the
code. Changes to the state "deletion" are counted as three steps to
prevent the algorithm from assuming unnecessary deletions. The state
"unknown" is simply taken to mean that the amino acid, which has not
been determined, will in each part of a tree that is evaluated be
assumed be whichever one causes the fewest steps.
The assumptions of this method (which has not been described in the
literature), are thus something like this:
-
Change in different sites is independent.
-
Change in different lineages is independent.
-
The probability of a base substitution that changes the amino acid
sequence is small over the lengths of time involved in a branch of
the phylogeny.
-
The expected amounts of change in different branches of the
phylogeny do not vary by so much that two changes in a high-rate
branch are more probable than one change in a low-rate branch.
-
The expected amounts of change do not vary enough among sites that
two changes in one site are more probable than one change in
another.
-
The probability of a base change that is synonymous is much higher
than the probability of a change that is not synonymous.
That these are the assumptions of parsimony methods has been documented
in a series of papers of mine: (1973a, 1978b, 1979, 1981b, 1983b,
1988b). For an opposing view arguing that the parsimony methods make no
substantive assumptions such as these, see the works by Farris (1983)
and Sober (1983a, 1983b, 1988), but also read the exchange between
Felsenstein and Sober (1986).
The input for the program is fairly standard. The first line contains
the number of species and the number of amino acid positions (counting
any stop codons that you want to include).
Next come the species data. Each sequence starts on a new line, has a
ten-character species name that must be blank-filled to be of that
length, followed immediately by the species data in the one-letter
code. The sequences must either be in the "interleaved" or "sequential"
formats described in the Molecular Sequence Programs document. The I
option selects between them. The sequences can have internal blanks in
the sequence but there must be no extra blanks at the end of the
terminated line. Note that a blank is not a valid symbol for a
deletion.
The protein sequences are given by the one-letter code used by
described in the Molecular Sequence Programs documentation file. Note
that if two polypeptide chains are being used that are of different
length owing to one terminating before the other, they should be coded
as (say)
HIINMA*????
HIPNMGVWABT
since after the stop codon we do not definitely know that there has
been a deletion, and do not know what amino acid would have been there.
If DNA studies tell us that there is DNA sequence in that region, then
we could use "X" rather than "?". Note that "X" means an unknown amino
acid, but definitely an amino acid, while "?" could mean either that or
a deletion. The distinction is often significant in regions where there
are deletions: one may want to encode a six-base deletion as "-?????"
since that way the program will only count one deletion, not six
deletion events, when the deletion arises. However, if there are
overlapping deletions it may not be so easy to know what coding is
correct.
One will usually want to use "?" after a stop codon, if one does not
know what amino acid is there. If the DNA sequence has been observed
there, one probably ought to resist putting in the amino acids that
this DNA would code for, and one should use "X" instead, because under
the assumptions implicit in this parsimony method, changes to any
noncoding sequence are much easier than changes in a coding region that
change the amino acid, so that they shouldn't be counted anyway!
The form of this information is the standard one described in the main
documentation file. For the U option the tree provided must be a rooted
bifurcating tree, with the root placed anywhere you want, since that
root placement does not affect anything.
The options are selected using an interactive menu. The menu looks like
this:
Protein parsimony algorithm, version 3.6
Setting for this run:
U Search for best tree? Yes
J Randomize input order of sequences? No. Use input order
O Outgroup root? No, use as outgroup species 1
T Use Threshold parsimony? No, use ordinary parsimony
C Use which genetic code? Universal
M Analyze multiple data sets? No
I Input sequences interleaved? Yes
0 Terminal type (IBM PC, VT52, ANSI)? (none)
1 Print out the data at start of run No
2 Print indications of progress of run Yes
3 Print out tree Yes
4 Print out steps in each site No
5 Print sequences at all nodes of tree No
6 Write out trees onto tree file? Yes
Are these settings correct? (type Y or the letter for one to change)
The user either types "Y" (followed, of course, by a carriage-return)
if the settings shown are to be accepted, or the letter or digit
corresponding to an option that is to be changed.
The options U, J, O, T, W, M, and 0 are the usual ones. They are
described in the main documentation file of this package. Option I is
the same as in other molecular sequence programs and is described in
the documentation file for the sequence programs. Option C allows the
user to select among various nuclear and mitochondrial genetic codes.
There is no provision for coping with data where different genetic
codes have been used in different organisms.
In the U (User tree) option, the trees should not be preceded by a line
with the number of trees on it.
Output is standard: if option 1 is toggled on, the data is printed out,
with the convention that "." means "the same as in the first species".
Then comes a list of equally parsimonious trees, and (if option 2 is
toggled on) a table of the number of changes of state required in each
position. If option 5 is toggled on, a table is printed out after each
tree, showing for each branch whether there are known to be changes in
the branch, and what the states are inferred to have been at the top
end of the branch. If the inferred state is a "?" there will be
multiple equally-parsimonious assignments of states; the user must work
these out for themselves by hand. If option 6 is left in its default
state the trees found will be written to a tree file, so that they are
available to be used in other programs.
If the U (User Tree) option is used and more than one tree is supplied,
the program also performs a statistical test of each of these trees
against the best tree. This test, which is a version of the test
proposed by Alan Templeton (1983) and evaluated in a test case by me
(1985a). It is closely parallel to a test using log likelihood
differences due to Kishino and Hasegawa (1989), and uses the mean and
variance of step differences between trees, taken across positions. If
the mean is more than 1.96 standard deviations different then the trees
are declared significantly different. The program prints out a table of
the steps for each tree, the differences of each from the best one, the
variance of that quantity as determined by the step differences at
individual positions, and a conclusion as to whether that tree is or is
not significantly worse than the best one.
The program is derived from MIX but has had some rather elaborate
bookkeeping using sets of bits installed. It is not a very fast program
but is speeded up substantially over version 3.2.
__________________________________________________________________
TEST DATA SET
5 10
Alpha ABCDEFGHIK
Beta AB--EFGHIK
Gamma ?BCDSFG*??
Delta CIKDEFGHIK
Epsilon DIKDEFGHIK
__________________________________________________________________
CONTENTS OF OUTPUT FILE (with all numerical options on)
-
Protein parsimony algorithm, version 3.6
-
requires a total of 16.000
-
steps in each position:
0 1 2 3 4 5 6 7 8 9
*-----------------------------------------
0! 3 1 5 3 2 0 0 2 0
10! 0
-
From To Any Steps? State at upper node
( . means same as in the node below it on tree)
-
1 ANCDEFGHIK
1 2 no ..........
2 Gamma yes ?B..S..*??
2 3 yes ..?.......
3 4 yes ?IK.......
4 Epsilon maybe D.........
4 Delta yes C.........
3 Beta yes .B--......
1 Alpha maybe .B........
-
+--Epsilon
+--4
+--3 +--Delta
! !
+--2 +-----Gamma
! !
1 +--------Beta
!
+-----------Alpha
-
remember: this is an unrooted tree!
-
requires a total of 16.000
-
steps in each position:
0 1 2 3 4 5 6 7 8 9
*-----------------------------------------
0! 3 1 5 3 2 0 0 2 0
10! 0
-
From To Any Steps? State at upper node
( . means same as in the node below it on tree)
-
1 ANCDEFGHIK
1 2 no ..........
2 3 maybe ?.........
3 4 yes .IK.......
4 Epsilon maybe D.........
4 Delta yes C.........
3 Gamma yes ?B..S..*??
2 Beta yes .B--......
1 Alpha maybe .B........
-
+--Epsilon
+-----4
! +--Delta
+--3
! ! +--Gamma
1 +-----2
! +--Beta
!
+-----------Alpha
-
remember: this is an unrooted tree!
-
requires a total of 16.000
-
steps in each position:
0 1 2 3 4 5 6 7 8 9
*-----------------------------------------
0! 3 1 5 3 2 0 0 2 0
10! 0
-
From To Any Steps? State at upper node
( . means same as in the node below it on tree)
1 ANCDEFGHIK
1 3 no ..........
3 4 yes ?IK.......
4 Epsilon maybe D.........
4 Delta yes C.........
3 2 no ..........
2 Gamma yes ?B..S..*??
2 Beta yes .B--......
1 Alpha maybe .B........
|