CS566 Tutorial Presentation

Phylogenetic Analysis

Dayong Guo

Introduction

Phylogenetics is the study of evolutionary relatedness among species, populations, or sets of sequences. The idea was stated early by Ernst Haeckel in his theory that "ontogeny recapitulates phylogeny"[1]. Alongside traditional studies of morphology and phenotype, molecular analysis with modern computational tools has shown particular strength in phylogenetics, because DNA, RNA and protein sequence data are naturally discrete. However, it is difficult to infer a phylogenetic tree from multiple sequence alignments because of the ambiguity of insertions and deletions. Several computational algorithms have therefore been developed to build phylogenetic trees from multiple input sequences. The most commonly used types of algorithms include distance-matrix methods (e.g. neighbor-joining), maximum parsimony, maximum likelihood and Bayesian inference. PHYLIP[2] (PHYLogeny Inference Package) is one of the most popular tools for phylogenetic analysis. It includes parsimony, distance-matrix and likelihood methods, so we can practice and compare these algorithms in PHYLIP with some input datasets.

Dataset

Both an artificial dataset and a real dataset were used. The original sequences were first aligned with ClustalX.

Artificial dataset from our textbook page 303:

>SeqA

ACGCGTTGGGCGATGGCAAC

>SeqB

ACGCGTTGGGCGACGGTAAT

>SeqC

ACGCATTGAATGATGATAAT

>SeqD

ACACATTGAGTGATAATAAT

Real dataset from NCBI: protein sequences of bone morphogenetic protein 2 (BMP2) from mouse, rat, human and frog. BMP2 is a conserved protein with ~90% identity among species.

>human BMP2

MVAGTRCLLALLLPQVLLGGAAGLVPELGRRKFAAASSGRPSSQPSDEVLSEFELRLLSMFGLKQRPTPS

RDAVVPPYMLDLYRRHSGQPGSPAPDHRLERAASRANTVRSFHHEESLEELPETSGKTTRRFFFNLSSIP

TEEFITSAELQVFREQMQDALGNNSSFHHRINIYEIIKPATANSKFPVTRLLDTRLVNQNASRWESFDVT

PAVMRWTAQGHANHGFVVEVAHLEEKQGVSKRHVRISRSLHQDEHSWSQIRPLLVTFGHDGKGHPLHKRE

KRQAKHKQRKRLKSSCKRHPLYVDFSDVGWNDWIVAPPGYHAFYCHGECPFPLADHLNSTNHAIVQTLVN

SVNSKIPKACCVPTELSAISMLYLDENEKVVLKNYQDMVVEGCGCR

>rat BMP2

MVAGTRCLLVLLLPQVLLGGAAGLIPELGRKKFAGASRPLSRPSEDVLSEFELRLLSMFGLKQRPTPSKD

VVVPPYMLDLYRRHSGQPGALAPDHRLERAASRANTVLSFHHEEAIEELSEMSGKTSRRFFFNLSSVPTD

EFLTSAELQIFREQMQEALGNSSFQHRINIYEIIKPATASSKFPVTRLLDTRLVTQNTSQWESFDVTPAV

MRWTAQGHTNHGFVVEVAHLEEKPGVSKRHVRISRSLHQDEHSWSQVRPLLVTFGHDGKGHPLHKREKRQ

AKHKQRKRLKSSCKRHPLYVDFSDVGWNDWIVAPPGYHAFYCHGECPFPLADHLNSTNHAIVQTLVNSVN

SKIPKACCVPTELSAISMLYLDENEKVVLKNYQDMVVEGCGCR

>mouse BMP2

MVAGTRCLLVLLLPQVLLGGAAGLIPELGRKKFAAASSRPLSRPSEDVLSEFELRLLSMFGLKQRPTPSK

DVVVPPYMLDLYRRHSGQPGAPAPDHRLERAASRANTVRSFHHEEAVEELPEMSGKTARRFFFNLSSVPS

DEFLTSAELQIFREQIQEALGNSSFQHRINIYEIIKPAAANLKFPVTRLLDTRLVNQNTSQWESFDVTPA

VMRWTTQGHTNHGFVVEVAHLEENPGVSKRHVRISRSLHQDEHSWSQIRPLLVTFGHDGKGHPLHKREKR

QAKHKQRKRLKSSCKRHPLYVDFSDVGWNDWIVAPPGYHAFYCHGECPFPLADHLNSTNHAIVQTLVNSV

NSKIPKACCVPTELSAISMLYLDENEKVVLKNYQDMVVEGCGCR

>frog BMP2

MVAGIHSLLLLLFYQVLLSGCTGLIPEEGKRKYTESGRSSPQQSQRVLNQFELRLLSMFGLKRRPTPGKN

VVIPPYMLDLYHLHLAQLAADEGTSAMDFQMERAASRANTVRSFHHEESMEEIPESREKTIQRFFFNLSS

IPNEELVTSAELRIFREQVQEPFESDSSKLHRINIYDIVKPAAAASRGPVVRLLDTRLVHHNESKWESFD

VTPAIARWIAHKQPNHGFVVEVTHLDNDKNVPKKHVRISRSLTPDKDNWPQIRPLLVTFSHDGKGHALHK

RQKRQARHKQRKRLKSSCRRHPLYVDFSDVGWNDWIVAPPGYHAFYCHGECPFPLADHLNSTNHAIVQTL

VNSVNTNIPKACCVPTELSAISMLYLDENEK

Methods

For both the artificial and the real input sequences, the following steps were followed to generate the final phylogenetic trees. The Fitch-Margoliash algorithm (Fitch) was used to represent the distance methods, and the Protpars algorithm was used to represent the parsimony methods. The results were then compared and discussed.

1) The input dataset of FASTA sequences is loaded into CLUSTALW (from EBI) to generate an alignment file in PHYLIP format;

2) The alignment file is loaded into the PHYLIP programs to generate matrix and tree output files. The available programs include:

Distance methods:

Dnadist     DNA distance matrix calculation
Protdist    Protein distance matrix calculation
Fitch       Fitch-Margoliash tree drawing method without molecular clock
Kitsch      Fitch-Margoliash tree drawing method with molecular clock
Neighbor    Neighbor-Joining and UPGMA tree drawing method

Character-based methods:

Dnapars     DNA parsimony
Dnapenny    DNA parsimony using branch-and-bound
Dnaml       DNA maximum likelihood without molecular clock
Dnamlk      DNA maximum likelihood with molecular clock
Protpars    Protein parsimony
Proml       Protein maximum likelihood

3) Draw the tree from the result of the previous step. The available programs include:

Drawgram    Draws a rooted tree
Drawtree    Draws an unrooted tree
Retree      Interactive tree rearrangement
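The hand-off in step 1 depends on two plain-text formats: FASTA, where every record starts with a ">" header line, and the PHYLIP alignment format, where a first line gives the number of sequences and the alignment length and each name is padded to ten characters. As an illustration only, the following Python sketch converts already-aligned FASTA records into sequential PHYLIP format. The helper names `parse_fasta` and `to_phylip` are hypothetical and are not part of CLUSTALW or PHYLIP, and the sketch does no alignment itself.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word after ">" as the sequence name.
            parts = line[1:].split()
            records.append([parts[0] if parts else "unnamed", ""])
        elif records:
            records[-1][1] += line
        else:
            raise ValueError("sequence data found before the first '>' header")
    return [(name, seq) for name, seq in records]


def to_phylip(records):
    """Format aligned (name, sequence) pairs as sequential PHYLIP text."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    header = f"{len(records):5d} {lengths.pop():5d}"
    # Names are truncated or padded to exactly ten characters.
    rows = [f"{name[:10]:<10}{seq}" for name, seq in records]
    return "\n".join([header] + rows) + "\n"


fasta = """>SeqA
ACGCGTTGGGCGATGGCAAC
>SeqB
ACGCGTTGGGCGACGGTAAT
>SeqC
ACGCATTGAATGATGATAAT
>SeqD
ACACATTGAGTGATAATAAT
"""

print(to_phylip(parse_fasta(fasta)), end="")
```

A record whose header line lacks the leading ">" is read as sequence data by a parser like this one, so the headers are worth checking before the file is submitted.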

Results

1. Using the artificial sequences with the distance method (Fitch-Margoliash tree drawing without molecular clock):

1) Alignment by CLUSTALW is:

    4     20
SeqC      ACGCATTGAA TGATGATAAT
SeqD      ACACATTGAG TGATAATAAT
SeqA      ACGCGTTGGG CGATGGCAAC
SeqB      ACGCGTTGGG CGACGGTAAT

2) Distance matrix generated by protdist.exe:

    4
SeqC        0.000000  0.146148  0.497602  0.387604
SeqD        0.146148  0.000000  0.574539  0.456486
SeqA        0.497602  0.574539  0.000000  0.220676
SeqB        0.387604  0.456486  0.220676  0.000000
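For comparison with the matrix above, a distance can also be computed by hand from the alignment. The Python sketch below is illustrative only: it treats the four artificial sequences as DNA and computes the uncorrected proportion of differing sites (the p-distance) together with the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3). This is not the model that Protdist applies, so the values are not expected to match the matrix. The function names are hypothetical, and gaps are not handled because this alignment has none.

```python
import math
from itertools import combinations

ALIGNMENT = {
    "SeqA": "ACGCGTTGGGCGATGGCAAC",
    "SeqB": "ACGCGTTGGGCGACGGTAAT",
    "SeqC": "ACGCATTGAATGATGATAAT",
    "SeqD": "ACACATTGAGTGATAATAAT",
}


def p_distance(a, b):
    """Proportion of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs / len(a)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for a nucleotide p-distance."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for (n1, s1), (n2, s2) in combinations(ALIGNMENT.items(), 2):
    p = p_distance(s1, s2)
    print(f"{n1}-{n2}: p = {p:.3f}, JC69 = {jukes_cantor(p):.3f}")
```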

3) Tree generated by fitch.exe:

   4 Populations

Fitch-Margoliash method version 3.66

                  __ __             2
                  \  \   (Obs - Exp)
Sum of squares =  /_ /_  ------------
                                2
                   i  j      Obs

Negative branch lengths not allowed

  +------SeqD
  !
  !      +--SeqB
  1------2
  !      +------SeqA
  !
  +-SeqC

remember: this is an unrooted tree!

Sum of squares = 0.00014

Average percent standard deviation = 0.37228

Between        And            Length
-------        ---            ------
   1          SeqD            0.10906
   1             2            0.29559
   2          SeqB            0.05363
   2          SeqA            0.16705
   1          SeqC            0.03709

(SeqD:0.10906,(SeqB:0.05363,SeqA:0.16705):0.29559,SeqC:0.03709);
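The two summary statistics in this output can be checked from the numbers shown. The criterion printed in the Fitch output header weights each squared difference between an observed distance (from the matrix) and the expected distance (the sum of branch lengths on the path between two tips) by 1/Obs^2. The Python sketch below hard-codes the four-taxon tree reported above. Two points are assumptions chosen because they agree with the reported values: the sum runs over all ordered pairs i != j, and the average percent standard deviation is taken as 100 * sqrt(SS / (N - 2)), with N the number of ordered pairs. The variable and function names are illustrative.

```python
import math

# Observed distances from the Protdist matrix (one entry per unordered pair).
OBSERVED = {
    ("SeqC", "SeqD"): 0.146148,
    ("SeqC", "SeqA"): 0.497602,
    ("SeqC", "SeqB"): 0.387604,
    ("SeqD", "SeqA"): 0.574539,
    ("SeqD", "SeqB"): 0.456486,
    ("SeqA", "SeqB"): 0.220676,
}

# Each tip: (internal node it attaches to, branch length), from the Fitch table.
TIPS = {
    "SeqD": (1, 0.10906),
    "SeqC": (1, 0.03709),
    "SeqB": (2, 0.05363),
    "SeqA": (2, 0.16705),
}
INTERNAL = 0.29559  # branch between internal nodes 1 and 2


def expected(a, b):
    """Path length between two tips on the fitted tree."""
    (node_a, len_a), (node_b, len_b) = TIPS[a], TIPS[b]
    path = len_a + len_b
    if node_a != node_b:
        path += INTERNAL
    return path


def fit_statistics():
    """Return (sum of squares, average percent standard deviation)."""
    half = sum(((obs - expected(a, b)) / obs) ** 2
               for (a, b), obs in OBSERVED.items())
    ss = 2.0 * half            # assumption: count both (i, j) and (j, i)
    n = 2 * len(OBSERVED)      # assumption: N is the number of ordered pairs
    apsd = 100.0 * math.sqrt(ss / (n - 2))
    return ss, apsd


ss, apsd = fit_statistics()
print(f"Sum of squares = {ss:.5f}")
print(f"Average percent standard deviation = {apsd:.5f}")
```

Most of the misfit comes from the four distances that cross the internal branch; the SeqC-SeqD and SeqA-SeqB distances are fitted almost exactly.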

4) Using drawtree.exe:

2. Using the artificial sequences with the character-based method (Protpars protein parsimony):

1) Alignment by CLUSTALW is:

    4     20
SeqC      ACGCATTGAA TGATGATAAT
SeqD      ACACATTGAG TGATAATAAT
SeqA      ACGCGTTGGG CGATGGCAAC
SeqB      ACGCGTTGGG CGACGGTAAT

2) Using protpars.exe to generate the tree:

Protein parsimony algorithm, version 3.66

One most parsimonious tree found:

        +--SeqB
     +--3
  +--2  +--SeqA
  !  !
  1  +-----SeqD
  !
  +------SeqC

remember: this is an unrooted tree!

requires a total of 14.000

(((SeqB,SeqA),SeqD),SeqC);
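To illustrate what a parsimony program optimizes, the Python sketch below counts the minimum number of changes needed on each of the three possible unrooted trees for four taxa, using Fitch's small-parsimony algorithm. It treats the artificial sequences as nucleotides and charges one step for every change, which is an assumption of this sketch. Protpars reads the letters as amino acids and counts steps according to the genetic code, so its total of 14.000 is not expected to be reproduced. The trees are written as nested pairs with an arbitrary root, since the parsimony count does not depend on where the root is placed.

```python
SEQS = {
    "SeqA": "ACGCGTTGGGCGATGGCAAC",
    "SeqB": "ACGCGTTGGGCGACGGTAAT",
    "SeqC": "ACGCATTGAATGATGATAAT",
    "SeqD": "ACACATTGAGTGATAATAAT",
}


def fitch(tree, column):
    """Return (possible states, steps) for one alignment column.

    `tree` is either a tip name or a pair of subtrees.
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_states, left_steps = fitch(tree[0], column)
    right_states, right_steps = fitch(tree[1], column)
    shared = left_states & right_states
    if shared:
        # The children agree on at least one state: no extra change needed.
        return shared, left_steps + right_steps
    # Otherwise one change is required on a branch below this node.
    return left_states | right_states, left_steps + right_steps + 1


def parsimony_score(tree, seqs):
    """Total number of changes over all alignment columns."""
    length = len(next(iter(seqs.values())))
    total = 0
    for i in range(length):
        column = {name: seq[i] for name, seq in seqs.items()}
        total += fitch(tree, column)[1]
    return total


TOPOLOGIES = {
    "(SeqA,SeqB)|(SeqC,SeqD)": (("SeqA", "SeqB"), ("SeqC", "SeqD")),
    "(SeqA,SeqC)|(SeqB,SeqD)": (("SeqA", "SeqC"), ("SeqB", "SeqD")),
    "(SeqA,SeqD)|(SeqB,SeqC)": (("SeqA", "SeqD"), ("SeqB", "SeqC")),
}

for label, tree in TOPOLOGIES.items():
    print(label, parsimony_score(tree, SEQS))
```

Under these assumptions the grouping (SeqA,SeqB)|(SeqC,SeqD) needs the fewest changes, which is the same grouping as in the Protpars tree above.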

3) Using drawtree.exe:

3. Using the real sequences with the distance method (Fitch-Margoliash tree drawing without molecular clock):

1) Alignment by CLUSTALW is:

    4    400
rat       MVAGTRCLLV LLLPQVLLGG AAGLIPELGR KKFAGAS--R PLSRPSEDVL
mouse     MVAGTRCLLV LLLPQVLLGG AAGLIPELGR KKFAAASS-R PLSRPSEDVL
human     MVAGTRCLLA LLLPQVLLGG AAGLVPELGR RKFAAASSGR PSSQPSDEVL
frog      MVAGIHSLLL LLFYQVLLSG CTGLIPEEGK RKYTESG--R SSPQQSQRVL

          SEFELRLLSM FGLKQRPTPS KDVVVPPYML DLYRRHSGQ- ---PGALAPD
          SEFELRLLSM FGLKQRPTPS KDVVVPPYML DLYRRHSGQ- ---PGAPAPD
          SEFELRLLSM FGLKQRPTPS RDAVVPPYML DLYRRHSGQ- ---PGSPAPD
          NQFELRLLSM FGLKRRPTPG KNVVIPPYML DLYHLHLAQL AADEGTSAMD

          HRLERAASRA NTVLSFHHEE AIEELSEMSG KTSRRFFFNL SSVPTDEFLT
          HRLERAASRA NTVRSFHHEE AVEELPEMSG KTARRFFFNL SSVPSDEFLT
          HRLERAASRA NTVRSFHHEE SLEELPETSG KTTRRFFFNL SSIPTEEFIT
          FQMERAASRA NTVRSFHHEE SMEEIPESRE KTIQRFFFNL SSIPNEELVT

          SAELQIFREQ MQEALGN-SS FQHRINIYEI IKPATASSKF PVTRLLDTRL
          SAELQIFREQ IQEALGN-SS FQHRINIYEI IKPAAANLKF PVTRLLDTRL
          SAELQVFREQ MQDALGNNSS FHHRINIYEI IKPATANSKF PVTRLLDTRL
          SAELRIFREQ VQEPFESDSS KLHRINIYDI VKPAAAASRG PVVRLLDTRL

          VTQNTSQWES FDVTPAVMRW TAQGHTNHGF VVEVAHLEEK PGVSKRHVRI
          VNQNTSQWES FDVTPAVMRW TTQGHTNHGF VVEVAHLEEN PGVSKRHVRI
          VNQNASRWES FDVTPAVMRW TAQGHANHGF VVEVAHLEEK QGVSKRHVRI
          VHHNESKWES FDVTPAIARW IAHKQPNHGF VVEVTHLDND KNVPKKHVRI

          SRSLHQDEHS WSQVRPLLVT FGHDGKGHPL HKREKRQAKH KQRKRLKSSC
          SRSLHQDEHS WSQIRPLLVT FGHDGKGHPL HKREKRQAKH KQRKRLKSSC
          SRSLHQDEHS WSQIRPLLVT FGHDGKGHPL HKREKRQAKH KQRKRLKSSC
          SRSLTPDKDN WPQIRPLLVT FSHDGKGHAL HKRQKRQARH KQRKRLKSSC

          KRHPLYVDFS DVGWNDWIVA PPGYHAFYCH GECPFPLADH LNSTNHAIVQ
          KRHPLYVDFS DVGWNDWIVA PPGYHAFYCH GECPFPLADH LNSTNHAIVQ
          KRHPLYVDFS DVGWNDWIVA PPGYHAFYCH GECPFPLADH LNSTNHAIVQ
          RRHPLYVDFS DVGWNDWIVA PPGYHAFYCH GECPFPLADH LNSTNHAIVQ

          TLVNSVNSKI PKACCVPTEL SAISMLYLDE NEKVVLKNYQ DMVVEGCGCR
          TLVNSVNSKI PKACCVPTEL SAISMLYLDE NEKVVLKNYQ DMVVEGCGCR
          TLVNSVNSKI PKACCVPTEL SAISMLYLDE NEKVVLKNYQ DMVVEGCGCR
          TLVNSVNTNI PKACCVPTEL SAISMLYLDE NEK------

2) Distance matrix by protdist.exe:

    4
rat         0.000000  0.038781  0.081053  0.335587
mouse       0.038781  0.000000  0.078105  0.326765
human       0.081053  0.078105  0.000000  0.317470
frog        0.335587  0.326765  0.317470  0.000000
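With only four sequences there are three possible unrooted groupings, and a distance matrix can be screened with the four-point condition: for distances that fit a tree, the grouping ab|cd that matches the tree gives the smallest of the three sums d(a,b) + d(c,d). The Python sketch below applies this check to the matrix above. It is a quick screen, not the least-squares search that Fitch performs, and the function names are hypothetical.

```python
# Pairwise distances from the Protdist matrix above.
DIST = {
    frozenset(("rat", "mouse")): 0.038781,
    frozenset(("rat", "human")): 0.081053,
    frozenset(("rat", "frog")): 0.335587,
    frozenset(("mouse", "human")): 0.078105,
    frozenset(("mouse", "frog")): 0.326765,
    frozenset(("human", "frog")): 0.317470,
}


def d(a, b):
    """Look up the distance between two taxa, in either order."""
    return DIST[frozenset((a, b))]


def pair_sums(taxa):
    """Return d(a,b) + d(c,e) for each of the three groupings of four taxa."""
    a, b, c, e = taxa
    groupings = [((a, b), (c, e)), ((a, c), (b, e)), ((a, e), (b, c))]
    return {g: d(*g[0]) + d(*g[1]) for g in groupings}


sums = pair_sums(("rat", "mouse", "human", "frog"))
best = min(sums, key=sums.get)
for grouping, total in sums.items():
    print(grouping, round(total, 6))
print("best-supported grouping:", best)
```

For this matrix the smallest sum belongs to the grouping that pairs rat with mouse and human with frog.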

3) Tree generated by fitch.exe:

   4 Populations

Fitch-Margoliash method version 3.66

                  __ __             2
                  \  \   (Obs - Exp)
Sum of squares =  /_ /_  ------------
                                2
                   i  j      Obs

Negative branch lengths not allowed

  +mouse
  !
  ! +------frog
  1-2
  ! +-human
  !
  +rat

remember: this is an unrooted tree!

Sum of squares = 0.00030

Average percent standard deviation = 0.54531

Between        And            Length
-------        ---            ------
   1         mouse            0.01776
   1             2            0.02722
   2          frog            0.28449
   2         human            0.03298
   1           rat            0.02102

(mouse:0.01776,(frog:0.28449,human:0.03298):0.02722,rat:0.02102);

4) Using drawtree.exe:

4. Using the real sequences with the character-based method (Protpars protein parsimony):

1) Alignment by CLUSTALW is the same as 3.1.

2) Using protpars.exe to generate tree:

Protein parsimony algorithm, version 3.66

One most parsimonious tree found:

        +--frog
     +--3
  +--2  +--human
  !  !
  1  +-----mouse
  !
  +------rat

remember: this is an unrooted tree!

requires a total of 227.000

(((frog,human),mouse),rat);

3) Using drawtree.exe:

Discussion

For the short and simple artificial dataset, the distance method gave a result similar to that of the parsimony method, and both are very close to the original result in the textbook (page 303). For the real dataset of BMP2 sequences, however, the two methods produced trees that are drawn with very different branch lengths. The distance tree shows frog as the most distant group and rat and mouse as the closest pair, whereas the parsimony tree draws rat far away from the other groups. According to the biological evidence, the distance model should fit the evolutionary history better in the BMP2 example. Since the parsimony algorithm requires very high homology among the sequences, the improper structure of the parsimony tree could be the result of the BMP2 proteins being less similar across species than the method requires.
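Because both programs report unrooted trees, it can help to compare the two results as unrooted topologies rather than by how the diagrams are drawn. The Python sketch below extracts the non-trivial splits (the two-way partitions of the taxa induced by internal branches) from the Newick strings in the Results and compares them. It is a minimal parser that assumes plain taxon names and optional numeric branch lengths; the function names are hypothetical.

```python
import re


def clades(newick):
    """Return the leaf set enclosed by every pair of parentheses."""
    # Drop branch lengths such as ":0.10906" and the trailing semicolon.
    text = re.sub(r":[0-9.eE+-]+", "", newick).strip().rstrip(";")
    stack, found, name = [], [], ""
    for ch in text:
        if ch == "(":
            stack.append(set())
        elif ch in ",)":
            if name.strip():
                stack[-1].add(name.strip())
            name = ""
            if ch == ")":
                done = stack.pop()
                found.append(frozenset(done))
                if stack:
                    stack[-1] |= done
        else:
            name += ch
    return found


def unrooted_splits(newick):
    """Return the set of non-trivial splits of an unrooted tree."""
    found = clades(newick)
    leaves = max(found, key=len)  # the outermost parentheses hold all taxa
    splits = set()
    for clade in found:
        rest = leaves - clade
        if len(clade) >= 2 and len(rest) >= 2:
            splits.add(frozenset([clade, rest]))
    return splits


fitch_real = "(mouse:0.01776,(frog:0.28449,human:0.03298):0.02722,rat:0.02102);"
protpars_real = "(((frog,human),mouse),rat);"
fitch_art = "(SeqD:0.10906,(SeqB:0.05363,SeqA:0.16705):0.29559,SeqC:0.03709);"
protpars_art = "(((SeqB,SeqA),SeqD),SeqC);"

print(unrooted_splits(fitch_real) == unrooted_splits(protpars_real))
print(unrooted_splits(fitch_art) == unrooted_splits(protpars_art))
```

For the trees reported in the Results, this check finds the same single internal split in both methods for each dataset. This suggests that the visible difference between the BMP2 diagrams lies in how the unrooted trees are drawn and in the branch lengths, which Protpars does not report, rather than in the grouping itself.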

The PHYLIP package includes multiple algorithms and options, which provides convenience and flexibility, and versions are available for various operating systems. However, its text-based interface is not user-friendly, and it offers no online service.

References

1. Haeckel, E., Riddle of the Universe at the Close of the Nineteenth Century. 1866.

2. Felsenstein, J., PHYLIP. 2006.