CS566 Tutorial Presentation
Phylogenetic Analysis
Dayong Guo
Introduction
Phylogenetics is the study of evolutionary relatedness among species, populations, or a set of sequences. The idea was first stated by Ernst Haeckel in his theory that "ontogeny recapitulates phylogeny" [1]. Alongside traditional studies of morphology and phenotype, molecular analysis with modern computational tools has shown unique strength in phylogenetics, because DNA, RNA and protein sequence data are naturally discrete. However, it is difficult to infer a phylogenetic tree from a multiple sequence alignment because of the ambiguity of insertions and deletions. Several computational algorithms have therefore been developed to build phylogenetic trees from multiple input sequences. The most commonly used types of algorithm include distance-matrix methods (e.g. neighbor-joining), maximum parsimony, maximum likelihood and Bayesian inference.
PHYLIP [2] (PHYLogeny Inference Package) is one of the most popular tools for phylogenetic analysis. It includes parsimony, distance-matrix and likelihood methods, so we can practice and compare these algorithms in PHYLIP on some input datasets.
Dataset
Both an artificial dataset and a real dataset were used. The original sequences were first aligned with ClustalX.
Artificial dataset from our textbook page 303:
>SeqA
ACGCGTTGGGCGATGGCAAC
>SeqB
ACGCGTTGGGCGACGGTAAT
>SeqC
ACGCATTGAATGATGATAAT
>SeqD
ACACATTGAGTGATAATAAT
Real dataset from NCBI: sequences of bone morphogenetic protein 2 (BMP2) from mouse, rat, human and frog. BMP2 is a conserved protein with ~90% identity among species.
>human BMP2
MVAGTRCLLALLLPQVLLGGAAGLVPELGRRKFAAASSGRPSSQPSDEVLSEFELRLLSMFGLKQRPTPS
RDAVVPPYMLDLYRRHSGQPGSPAPDHRLERAASRANTVRSFHHEESLEELPETSGKTTRRFFFNLSSIP
TEEFITSAELQVFREQMQDALGNNSSFHHRINIYEIIKPATANSKFPVTRLLDTRLVNQNASRWESFDVT
PAVMRWTAQGHANHGFVVEVAHLEEKQGVSKRHVRISRSLHQDEHSWSQIRPLLVTFGHDGKGHPLHKRE
KRQAKHKQRKRLKSSCKRHPLYVDFSDVGWNDWIVAPPGYHAFYCHGECPFPLADHLNSTNHAIVQTLVN
SVNSKIPKACCVPTELSAISMLYLDENEKVVLKNYQDMVVEGCGCR
>rat BMP2
MVAGTRCLLVLLLPQVLLGGAAGLIPELGRKKFAGASRPLSRPSEDVLSEFELRLLSMFGLKQRPTPSKD
VVVPPYMLDLYRRHSGQPGALAPDHRLERAASRANTVLSFHHEEAIEELSEMSGKTSRRFFFNLSSVPTD
EFLTSAELQIFREQMQEALGNSSFQHRINIYEIIKPATASSKFPVTRLLDTRLVTQNTSQWESFDVTPAV
MRWTAQGHTNHGFVVEVAHLEEKPGVSKRHVRISRSLHQDEHSWSQVRPLLVTFGHDGKGHPLHKREKRQ
AKHKQRKRLKSSCKRHPLYVDFSDVGWNDWIVAPPGYHAFYCHGECPFPLADHLNSTNHAIVQTLVNSVN
SKIPKACCVPTELSAISMLYLDENEKVVLKNYQDMVVEGCGCR
>mouse BMP2
MVAGTRCLLVLLLPQVLLGGAAGLIPELGRKKFAAASSRPLSRPSEDVLSEFELRLLSMFGLKQRPTPSK
DVVVPPYMLDLYRRHSGQPGAPAPDHRLERAASRANTVRSFHHEEAVEELPEMSGKTARRFFFNLSSVPS
DEFLTSAELQIFREQIQEALGNSSFQHRINIYEIIKPAAANLKFPVTRLLDTRLVNQNTSQWESFDVTPA
VMRWTTQGHTNHGFVVEVAHLEENPGVSKRHVRISRSLHQDEHSWSQIRPLLVTFGHDGKGHPLHKREKR
QAKHKQRKRLKSSCKRHPLYVDFSDVGWNDWIVAPPGYHAFYCHGECPFPLADHLNSTNHAIVQTLVNSV
NSKIPKACCVPTELSAISMLYLDENEKVVLKNYQDMVVEGCGCR
>frog BMP2
MVAGIHSLLLLLFYQVLLSGCTGLIPEEGKRKYTESGRSSPQQSQRVLNQFELRLLSMFGLKRRPTPGKN
VVIPPYMLDLYHLHLAQLAADEGTSAMDFQMERAASRANTVRSFHHEESMEEIPESREKTIQRFFFNLSS
IPNEELVTSAELRIFREQVQEPFESDSSKLHRINIYDIVKPAAAASRGPVVRLLDTRLVHHNESKWESFD
VTPAIARWIAHKQPNHGFVVEVTHLDNDKNVPKKHVRISRSLTPDKDNWPQIRPLLVTFSHDGKGHALHK
RQKRQARHKQRKRLKSSCRRHPLYVDFSDVGWNDWIVAPPGYHAFYCHGECPFPLADHLNSTNHAIVQTL
VNSVNTNIPKACCVPTELSAISMLYLDENEK
Methods
For both the artificial and the real input sequences, the following steps were followed to generate the final phylogenetic trees. The Fitch-Margoliash algorithm represents the distance methods, and the Protpars algorithm represents the parsimony methods. The results are then compared and discussed.
1) The input dataset of FASTA sequences is loaded into CLUSTALW (from EBI) to generate an alignment file in PHYLIP format;
2) The alignment file is loaded into the PHYLIP programs to generate an output file containing the matrix and the tree. The available algorithms include:
Distance methods:
Dnadist    DNA distance matrix calculation
Protdist   Protein distance matrix calculation
Fitch      Fitch-Margoliash tree drawing method without molecular clock
Kitsch     Fitch-Margoliash tree drawing method with molecular clock
Neighbor   Neighbor-Joining and UPGMA tree drawing method
Character-based methods:
Dnapars    DNA parsimony
Dnapenny   DNA parsimony using branch-and-bound
Dnaml      DNA maximum likelihood without molecular clock
Dnamlk     DNA maximum likelihood with molecular clock
Protpars   Protein parsimony
Proml      Protein maximum likelihood
3) Draw the tree with the result from the previous step. Options include:
Drawgram   Draws a rooted tree
Drawtree   Draws an unrooted tree
Retree     Interactive tree rearrangement
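The hand-off between step 1 and step 2 depends on the alignment being in PHYLIP format: a header line with the number of sequences and the number of sites, followed by one record per sequence whose name occupies a fixed 10-character field. The sketch below writes that layout in its sequential form for the artificial dataset. It is only an illustration of the file format; the function name `to_phylip` is hypothetical, and in this report the file was produced by CLUSTALW, not by a script.

```python
def to_phylip(alignment):
    """Format {name: aligned_sequence} as a sequential PHYLIP alignment."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    (n_sites,) = lengths
    # Header: number of sequences, then number of aligned sites.
    lines = [f"{len(alignment):>5} {n_sites:>5}"]
    for name, seq in alignment.items():
        # Names are truncated or padded to a fixed 10-character field.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


# The four artificial sequences from the Dataset section (already aligned).
artificial = {
    "SeqA": "ACGCGTTGGGCGATGGCAAC",
    "SeqB": "ACGCGTTGGGCGACGGTAAT",
    "SeqC": "ACGCATTGAATGATGATAAT",
    "SeqD": "ACACATTGAGTGATAATAAT",
}

phylip_text = to_phylip(artificial)
print(phylip_text, end="")
```

PHYLIP programs look for a file named infile in the working directory by default, so saving this text under that name is one way to pass it to the programs listed above.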
Results
1. Using the artificial sequences with distance method Fitch-Margoliash tree drawing without molecular clock:
1)Alignment by CLUSTALW is:
4 20
SeqC ACGCATTGAA TGATGATAAT
SeqD ACACATTGAG TGATAATAAT
SeqA ACGCGTTGGG CGATGGCAAC
SeqB ACGCGTTGGG CGACGGTAAT
2) Distance matrix generated by protdist.exe:
4
SeqC 0.000000 0.146148 0.497602 0.387604
SeqD 0.146148 0.000000 0.574539 0.456486
SeqA 0.497602 0.574539 0.000000 0.220676
SeqB 0.387604 0.456486 0.220676 0.000000
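To make the idea of a distance matrix concrete, the sketch below computes two simple nucleotide distances for the same four sequences: the p-distance (fraction of differing sites) and its Jukes-Cantor correction. This is not the model used to produce the matrix above, so the numbers are not expected to match it; the sketch only shows how pairwise distances arise from an alignment. The function names are illustrative.

```python
from itertools import combinations
from math import log

# Aligned artificial sequences, in the order used by the matrix above.
alignment = {
    "SeqC": "ACGCATTGAATGATGATAAT",
    "SeqD": "ACACATTGAGTGATAATAAT",
    "SeqA": "ACGCGTTGGGCGATGGCAAC",
    "SeqB": "ACGCGTTGGGCGACGGTAAT",
}


def p_distance(a, b):
    """Fraction of compared sites that differ, skipping gapped columns."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor correction of a nucleotide p-distance (valid for p < 0.75)."""
    return -0.75 * log(1 - 4 * p / 3)


for a, b in combinations(alignment, 2):
    p = p_distance(alignment[a], alignment[b])
    print(f"{a} {b}  p = {p:.3f}  JC = {jukes_cantor(p):.3f}")
```

Even with this simpler measure, the largest distances separate SeqC and SeqD from SeqA and SeqB, which agrees with the overall pattern of the matrix above.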
3) Tree generated by fitch.exe:
4 Populations
Fitch-Margoliash method version 3.66
                 __ __             2
                 \  \   (Obs - Exp)
Sum of squares = /_ /_  ------------
                               2
                  i  j      Obs
Negative branch lengths not allowed
+------SeqD
!
!      +--SeqB
1------2
!      +------SeqA
!
+-SeqC
remember: this is an unrooted tree!
Sum of squares = 0.00014
Average percent standard deviation = 0.37228
Between   And     Length
-------   ---     ------
1         SeqD    0.10906
1         2       0.29559
2         SeqB    0.05363
2         SeqA    0.16705
1         SeqC    0.03709
(SeqD:0.10906,(SeqB:0.05363,SeqA:0.16705):0.29559,SeqC:0.03709);
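The fit statistics in this output can be checked by hand from the branch lengths and the distance matrix. The sketch below takes the expected distance between two sequences to be the sum of branch lengths on the path joining them, then evaluates the sum of squares of (Obs - Exp)/Obs. Two points are assumptions about how fitch reports its numbers: the sum runs over both (i, j) and (j, i), and the average percent standard deviation is 100 * sqrt(SS / (N - 2)), with N the number of off-diagonal matrix entries. The function names are illustrative.

```python
from math import sqrt

# Branch lengths copied from the fitch output ("1" and "2" are internal nodes).
edges = {
    ("1", "SeqD"): 0.10906,
    ("1", "2"): 0.29559,
    ("2", "SeqB"): 0.05363,
    ("2", "SeqA"): 0.16705,
    ("1", "SeqC"): 0.03709,
}

# Observed distances copied from the distance matrix (upper triangle).
observed = {
    ("SeqC", "SeqD"): 0.146148,
    ("SeqC", "SeqA"): 0.497602,
    ("SeqC", "SeqB"): 0.387604,
    ("SeqD", "SeqA"): 0.574539,
    ("SeqD", "SeqB"): 0.456486,
    ("SeqA", "SeqB"): 0.220676,
}


def build_adjacency(edges):
    """Turn the edge list into an undirected adjacency map."""
    adj = {}
    for (u, v), length in edges.items():
        adj.setdefault(u, {})[v] = length
        adj.setdefault(v, {})[u] = length
    return adj


def path_length(adj, start, goal, seen=frozenset()):
    """Sum of branch lengths on the unique tree path from start to goal."""
    if start == goal:
        return 0.0
    seen = seen | {start}
    for nxt, length in adj[start].items():
        if nxt not in seen:
            rest = path_length(adj, nxt, goal, seen)
            if rest is not None:
                return length + rest
    return None  # dead end: goal is not reachable through this branch


def fit_statistics(edges, observed):
    adj = build_adjacency(edges)
    ss = 0.0
    for (a, b), obs in observed.items():
        exp = path_length(adj, a, b)
        # Factor 2: assumed to count both (i, j) and (j, i) in the matrix.
        ss += 2 * ((obs - exp) / obs) ** 2
    n = 2 * len(observed)  # off-diagonal entries of the square matrix
    apsd = 100 * sqrt(ss / (n - 2))
    return ss, apsd


ss, apsd = fit_statistics(edges, observed)
print(f"Sum of squares = {ss:.5f}")
print(f"Average percent standard deviation = {apsd:.5f}")
```

Because the branch lengths in the output are rounded to five decimals, the recomputed values should agree with the reported ones only approximately.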
4) Using drawtree.exe:
2. Using the artificial sequences with Character based method Protpars Protein parsimony:
1)Alignment by CLUSTALW is:
4 20
SeqC ACGCATTGAA TGATGATAAT
SeqD ACACATTGAG TGATAATAAT
SeqA ACGCGTTGGG CGATGGCAAC
SeqB ACGCGTTGGG CGACGGTAAT
2)Using protpars.exe to generate tree:
Protein parsimony algorithm, version 3.66
One most parsimonious tree found:
      +--SeqB
   +--3
+--2  +--SeqA
!  !
1  +-----SeqD
!
+------SeqC
remember: this is an unrooted tree!
requires a total of 14.000
(((SeqB,SeqA),SeqD),SeqC);
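The parsimony criterion itself can be illustrated with Fitch's small-parsimony count, which finds the minimum number of state changes a given tree needs to explain each alignment column. The sketch below applies it at the nucleotide level to the three possible groupings of the four sequences. This is a simplified stand-in and not the Protpars algorithm, which scores changes differently, so its step totals are not comparable with the 14 steps reported above. The rooting of each tree is arbitrary and does not affect the count.

```python
alignment = {
    "SeqA": "ACGCGTTGGGCGATGGCAAC",
    "SeqB": "ACGCGTTGGGCGACGGTAAT",
    "SeqC": "ACGCATTGAATGATGATAAT",
    "SeqD": "ACACATTGAGTGATAATAAT",
}

# The three unrooted topologies for four taxa, written as nested pairs.
topologies = {
    "AB|CD": (("SeqA", "SeqB"), ("SeqC", "SeqD")),
    "AC|BD": (("SeqA", "SeqC"), ("SeqB", "SeqD")),
    "AD|BC": (("SeqA", "SeqD"), ("SeqB", "SeqC")),
}


def fitch(node, column):
    """Return (possible states, minimum changes) for one alignment column."""
    if isinstance(node, str):  # leaf: the observed state, no changes yet
        return {column[node]}, 0
    left_states, left_cost = fitch(node[0], column)
    right_states, right_cost = fitch(node[1], column)
    common = left_states & right_states
    if common:  # children can agree: no extra change at this node
        return common, left_cost + right_cost
    # children disagree: one change is needed on a branch below this node
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Total number of changes over all columns for one tree."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total


scores = {label: parsimony_score(tree, alignment)
          for label, tree in topologies.items()}
for label, score in scores.items():
    print(label, score)
```

Under this simple count, the tree that groups SeqA with SeqB needs the fewest changes, which matches the grouping in the Protpars tree above.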
3) Using drawtree.exe:
3. Using the real sequences with distance method Fitch-Margoliash tree drawing without molecular clock:
1) Alignment by CLUSTALW is:
4 400
rat MVAGTRCLLV LLLPQVLLGG AAGLIPELGR KKFAGAS--R PLSRPSEDVL
mouse MVAGTRCLLV LLLPQVLLGG AAGLIPELGR KKFAAASS-R PLSRPSEDVL
human MVAGTRCLLA LLLPQVLLGG AAGLVPELGR RKFAAASSGR PSSQPSDEVL
frog MVAGIHSLLL LLFYQVLLSG CTGLIPEEGK RKYTESG--R SSPQQSQRVL
SEFELRLLSM FGLKQRPTPS KDVVVPPYML DLYRRHSGQ- ---PGALAPD
SEFELRLLSM FGLKQRPTPS KDVVVPPYML DLYRRHSGQ- ---PGAPAPD
SEFELRLLSM FGLKQRPTPS RDAVVPPYML DLYRRHSGQ- ---PGSPAPD
NQFELRLLSM FGLKRRPTPG KNVVIPPYML DLYHLHLAQL AADEGTSAMD
HRLERAASRA NTVLSFHHEE AIEELSEMSG KTSRRFFFNL SSVPTDEFLT
HRLERAASRA NTVRSFHHEE AVEELPEMSG KTARRFFFNL SSVPSDEFLT
HRLERAASRA NTVRSFHHEE SLEELPETSG KTTRRFFFNL SSIPTEEFIT
FQMERAASRA NTVRSFHHEE SMEEIPESRE KTIQRFFFNL SSIPNEELVT
SAELQIFREQ MQEALGN-SS FQHRINIYEI IKPATASSKF PVTRLLDTRL
SAELQIFREQ IQEALGN-SS FQHRINIYEI IKPAAANLKF PVTRLLDTRL
SAELQVFREQ MQDALGNNSS FHHRINIYEI IKPATANSKF PVTRLLDTRL
SAELRIFREQ VQEPFESDSS KLHRINIYDI VKPAAAASRG PVVRLLDTRL
VTQNTSQWES FDVTPAVMRW TAQGHTNHGF VVEVAHLEEK PGVSKRHVRI
VNQNTSQWES FDVTPAVMRW TTQGHTNHGF VVEVAHLEEN PGVSKRHVRI
VNQNASRWES FDVTPAVMRW TAQGHANHGF VVEVAHLEEK QGVSKRHVRI
VHHNESKWES FDVTPAIARW IAHKQPNHGF VVEVTHLDND KNVPKKHVRI
SRSLHQDEHS WSQVRPLLVT FGHDGKGHPL HKREKRQAKH KQRKRLKSSC
SRSLHQDEHS WSQIRPLLVT FGHDGKGHPL HKREKRQAKH KQRKRLKSSC
SRSLHQDEHS WSQIRPLLVT FGHDGKGHPL HKREKRQAKH KQRKRLKSSC
SRSLTPDKDN WPQIRPLLVT FSHDGKGHAL HKRQKRQARH KQRKRLKSSC
KRHPLYVDFS DVGWNDWIVA PPGYHAFYCH GECPFPLADH LNSTNHAIVQ
KRHPLYVDFS DVGWNDWIVA PPGYHAFYCH GECPFPLADH LNSTNHAIVQ
KRHPLYVDFS DVGWNDWIVA PPGYHAFYCH GECPFPLADH LNSTNHAIVQ
RRHPLYVDFS DVGWNDWIVA PPGYHAFYCH GECPFPLADH LNSTNHAIVQ
TLVNSVNSKI PKACCVPTEL SAISMLYLDE NEKVVLKNYQ DMVVEGCGCR
TLVNSVNSKI PKACCVPTEL SAISMLYLDE NEKVVLKNYQ DMVVEGCGCR
TLVNSVNSKI PKACCVPTEL SAISMLYLDE NEKVVLKNYQ DMVVEGCGCR
TLVNSVNTNI PKACCVPTEL SAISMLYLDE NEK------
2) Distance matrix by protdist.exe:
4
rat 0.000000 0.038781 0.081053 0.335587
mouse 0.038781 0.000000 0.078105 0.326765
human 0.081053 0.078105 0.000000 0.317470
frog 0.335587 0.326765 0.317470 0.000000
3) Tree generated by fitch.exe:
4 Populations
Fitch-Margoliash method version 3.66
                 __ __             2
                 \  \   (Obs - Exp)
Sum of squares = /_ /_  ------------
                               2
                  i  j      Obs
Negative branch lengths not allowed
+mouse
!
! +------frog
1-2
! +-human
!
+rat
remember: this is an unrooted tree!
Sum of squares = 0.00030
Average percent standard deviation = 0.54531
Between   And     Length
-------   ---     ------
1         mouse   0.01776
1         2       0.02722
2         frog    0.28449
2         human   0.03298
1         rat     0.02102
(mouse:0.01776,(frog:0.28449,human:0.03298):0.02722,rat:0.02102);
4) Using drawtree.exe:
4. Using the real sequences with character based method Protpars.exe protein parsimony:
1) Alignment by CLUSTALW is the same as 3.1.
2) Using protpars.exe to generate tree:
Protein parsimony algorithm, version 3.66
One most parsimonious tree found:
      +--frog
   +--3
+--2  +--human
!  !
1  +-----mouse
!
+------rat
remember: this is an unrooted tree!
requires a total of 227.000
(((frog,human),mouse),rat);
3) Using drawtree.exe:
Discussion
For the short and simple artificial dataset, the distance method gave a result similar to that of the parsimony method, and both are very close to the original result on textbook page 303. However, with the real dataset of BMP2 sequences, the two methods generated trees that differ markedly. The distance method made frog the most distant group and rat and mouse the closest groups. The parsimony tree drew rat far away from the other groups. According to the biological evidence, the distance model should fit the evolutionary history better in the BMP2 example. Since the parsimony algorithm requires very high homology among sequences, the improper structure of the parsimony tree could be the result of the BMP2 proteins being less homologous across species than the method requires.
The PHYLIP program set includes multiple algorithms and options, which provide convenience and flexibility, and different versions allow it to run on various OS platforms. However, the text-based interface is not user-friendly, and there is no online service.
References
1. Haeckel, E., Riddle of the Universe at the Close of the Nineteenth Century. 1866.
2. Felsenstein, J., PHYLIP. 2006.