1. Analyses of concatenated chloroplast protein sequences

a) The data set used for the analyses presented in Fig. 2

The file “protein_alignment.txt” shows the sequence alignment of the proteins encoded by the 53 genes that are common to the chloroplast genomes of Mesostigma viride, three other green algae, three land plants and Cyanophora paradoxa. All of the gaps and ambiguously aligned regions were eliminated from the alignment to produce the data set that was used for the phylogenetic analyses presented in Fig. 2. This data set is presented in PHYLIP format in the file “protein_dataset.txt”; it contains a total of 10,629 amino acid positions, of which 5,001 are variable. The number of amino acid positions contributed by each protein in the data set is as follows: AtpA (486), AtpB (466), AtpE (122), AtpF (157), AtpH (80), CcsA (205), ClpP (179), PetA (276), PetB (215), PetD (160), PetG (37), PetL (28), PsaA (745), PsaB (732), PsaC (80), PsaI (30), PsaJ (35), PsbA (345), PsbB (506), PsbC (461), PsbD (343), PsbE (69), PsbF (33), PsbH (54), PsbI (36), PsbJ (39), PsbK (39), PsbL (38), PsbM (29), PsbN (42), PsbT (29), RbcL (475), Rpl2 (250), Rpl14 (113), Rpl16 (132), Rpl20 (106), Rpl36 (36), RpoA (175), RpoB (803), RpoC1 (408), RpoC2 (550), Rps2 (213), Rps3 (159), Rps4 (149), Rps7 (155), Rps8 (104), Rps11 (121), Rps12 (121), Rps14 (89), Rps18 (51), Rps19 (92), Ycf4 (170), and Ycf9 (61). The file “aa_composition.xls” reports the amino acid composition of the concatenated proteins from each of the taxa examined.
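The per-protein counts listed above can be checked against the stated totals. The following is a minimal sketch of that arithmetic; the variable names are illustrative, and the figure of 5,001 variable positions is taken from the text rather than recomputed from the alignment.

```python
# Per-protein numbers of amino acid positions, copied from the list above.
protein_lengths = {
    "AtpA": 486, "AtpB": 466, "AtpE": 122, "AtpF": 157, "AtpH": 80,
    "CcsA": 205, "ClpP": 179,
    "PetA": 276, "PetB": 215, "PetD": 160, "PetG": 37, "PetL": 28,
    "PsaA": 745, "PsaB": 732, "PsaC": 80, "PsaI": 30, "PsaJ": 35,
    "PsbA": 345, "PsbB": 506, "PsbC": 461, "PsbD": 343, "PsbE": 69,
    "PsbF": 33, "PsbH": 54, "PsbI": 36, "PsbJ": 39, "PsbK": 39,
    "PsbL": 38, "PsbM": 29, "PsbN": 42, "PsbT": 29,
    "RbcL": 475,
    "Rpl2": 250, "Rpl14": 113, "Rpl16": 132, "Rpl20": 106, "Rpl36": 36,
    "RpoA": 175, "RpoB": 803, "RpoC1": 408, "RpoC2": 550,
    "Rps2": 213, "Rps3": 159, "Rps4": 149, "Rps7": 155, "Rps8": 104,
    "Rps11": 121, "Rps12": 121, "Rps14": 89, "Rps18": 51, "Rps19": 92,
    "Ycf4": 170, "Ycf9": 61,
}

n_proteins = len(protein_lengths)
total_positions = sum(protein_lengths.values())

# Number of variable positions as reported in the text (not recomputed here).
variable_positions = 5001
constant_positions = total_positions - variable_positions

print(n_proteins, total_positions, constant_positions)
```

The list accounts for 53 proteins and 10,629 positions, and the difference between the total and the variable positions gives 5,628 constant sites.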

b) Effect of outgroup selection on support for T1

Even when the Cyanophora sequences were excluded from the data set of 10,629 amino acid positions or when other outgroup sequences were used to root the green algal phylogeny, T1 was the best topology observed. To test the influence of the outgroup, we conducted maximum likelihood (ML) analyses on three independent data sets of 50 concatenated proteins containing 9,715 amino acid positions per taxon. The selected outgroups were the chloroplast proteins of Cyanophora, those of the red alga Porphyra purpurea, and a combination of the latter proteins with those of the cyanobacterium Synechocystis sp. PCC6803. T1 was supported by RELL bootstrap values greater than 97.8% in the analyses of the three data sets, and both T2 and T3 were significantly worse than T1 in the analyses with the individual Cyanophora and Porphyra proteins. In all trees inferred from the data set containing the combined outgroup sequences, Mesostigma formed a monophyletic group with green plants, with this prasinophyte being found either at the most supported position (T1) or within Streptophyta (T3).

c) Effect of constant site removal on support for T1

We found that the presence of invariable sites in the data set of 10,629 positions has little effect on the tree-building conclusions derived from the analyses presented in Fig. 2. After removal of all 5,628 constant sites (of which 5,585 are invariable as estimated by SPLITSTREE) from this data set, bootstrap support for T1 remained at 100% in symmetric distance analyses under the Dayhoff et al. and JTT-F models, with either a uniform or a gamma-distributed rate of substitution. In ML analysis, we noted only a negligible increase in bootstrap support (99.38% versus 99.81%) after removal of all constant sites. Consistent with the latter observation, bootstrap support for T1 in ML analyses was not significantly affected by removal of increasing proportions (0 to 1, in 11 even steps) of constant sites from the data set of 10,629 positions (see table). The confidence limit of tree topologies under the Kishino-Hasegawa (KH) test remained essentially the same; T2 and T3 proved to be significantly worse than T1 in all analyses.
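The removal of a given proportion of constant sites can be illustrated with a short sketch. It assumes that a constant site is an alignment column in which all taxa carry the same residue, and that the columns to drop are chosen at random; the text does not state how the removed sites were selected, so that choice, the function names and the toy alignment are all hypothetical.

```python
import random


def is_constant(column):
    """A column is constant when every taxon carries the same residue."""
    return len(set(column)) == 1


def drop_constant_sites(alignment, fraction_removed, seed=0):
    """Return a copy of the alignment with a fraction of constant columns removed.

    alignment: dict mapping taxon name -> aligned sequence (all the same length).
    fraction_removed: 0.0 keeps every site, 1.0 removes all constant sites.
    The constant columns to drop are picked at random (an assumption).
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    columns = [tuple(alignment[n][i] for n in names) for i in range(length)]
    constant_idx = [i for i, col in enumerate(columns) if is_constant(col)]
    n_drop = round(fraction_removed * len(constant_idx))
    dropped = set(random.Random(seed).sample(constant_idx, n_drop))
    kept = [i for i in range(length) if i not in dropped]
    return {n: "".join(alignment[n][i] for i in kept) for n in names}


# Hypothetical three-taxon alignment, for illustration only.
toy = {"taxonA": "MKVLAT", "taxonB": "MKILAS", "taxonC": "MRVLAT"}
reduced = drop_constant_sites(toy, 1.0)
print(reduced)
```

Calling the function with fractions from 0 to 1 in steps of 0.1 would give a series of 11 data sets of the kind described above.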

Effect of removal of constant sites on support for T1 in ML analyses

Proportion of constant sites in the data set / RELL bootstrap support for T1 / Δ ln L (T2 versus T1) / T2 significantly worse than T1 in the KH test at P < 0.01? / T2 significantly worse at P < 0.05? / Δ ln L (T3 versus T1) / T3 significantly worse than T1 in the KH test at P < 0.01? / T3 significantly worse at P < 0.05?
0 / 0.9981 / -43.1 ± 14.4 / yes / yes / -43.4 ± 14.5 / yes / yes
0.1 / 0.9971 / -48.8 ± 16.5 / yes / yes / -49.2 ± 16.6 / yes / yes
0.2 / 0.9971 / -52.9 ± 18.4 / yes / yes / -53.9 ± 18.4 / yes / yes
0.3 / 0.9969 / -56.6 ± 20.0 / yes / yes / -58.0 ± 20.1 / yes / yes
0.4 / 0.9963 / -59.7 ± 21.5 / yes / yes / -61.4 ± 21.6 / yes / yes
0.5 / 0.9965 / -62.3 ± 22.9 / yes / yes / -64.3 ± 22.9 / yes / yes
0.6 / 0.9945 / -64.6 ± 24.1 / yes / yes / -67.0 ± 24.2 / yes / yes
0.7 / 0.9942 / -66.6 ± 25.3 / yes / yes / -69.3 ± 25.3 / yes / yes
0.8 / 0.9923 / -68.4 ± 26.3 / no / yes / -71.3 ± 26.4 / yes / yes
0.9 / 0.9938 / -70.0 ± 27.4 / no / yes / -73.2 ± 27.4 / yes / yes
1* / 0.9938 / -71.5 ± 28.3 / no / yes / -74.9 ± 28.3 / yes / yes

*This analysis was carried out independently from that shown in Fig. 2.
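The log-likelihood differences and standard errors in the table are the quantities on which the KH test is based. The sketch below shows one common formulation of the test, assuming that per-site log-likelihoods are available for the two topologies being compared. The function name, the toy values and the two-sided normal approximation used for the P value are illustrative assumptions; the sketch is not meant to reproduce the implementation or the significance calls of the programs used in this study.

```python
import math


def kh_test(site_lnl_best, site_lnl_alt):
    """Compare two topologies from their per-site log-likelihoods.

    Returns the total log-likelihood difference (alternative minus best),
    its estimated standard error, and a two-sided P value from a normal
    approximation (an assumption of this sketch).
    """
    n = len(site_lnl_best)
    diffs = [alt - best for alt, best in zip(site_lnl_alt, site_lnl_best)]
    delta = sum(diffs)
    mean = delta / n
    # Variance of the summed difference, estimated from the site-wise spread.
    variance = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    se = math.sqrt(variance)
    z = delta / se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return delta, se, p_two_sided


# Hypothetical per-site log-likelihoods for four sites under two topologies.
lnl_t1 = [-2.0, -1.5, -3.0, -2.5]
lnl_t2 = [-2.4, -1.6, -3.5, -2.6]
delta, se, p_value = kh_test(lnl_t1, lnl_t2)
print(f"delta lnL = {delta:.2f} +/- {se:.2f}, P = {p_value:.3f}")
```

A negative difference that is large relative to its standard error indicates that the alternative topology fits the data significantly worse than the best one.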

2. Analyses of concatenated chloroplast rRNA gene sequences

The file “rRNA_alignment.txt” shows the alignment of the chloroplast small and large subunit rRNA gene sequences from the same organisms that were used for the protein analyses presented in Fig. 2. All of the gaps and ambiguously aligned regions were eliminated from this alignment to produce the data set presented in PHYLIP format in the file “rRNA_dataset.txt”. This data set contains a total of 4,016 sites, of which 2,990 are constant and 2,687 are invariable (as estimated by SPLITSTREE). Analyses of the data with the ML (NUCML, PAUP and PUZZLE), maximum parsimony (PAUP) and distance (symmetric and asymmetric, PAUP) methods all revealed that T1 is the best topology. The bootstrap values supporting this topology were greater than 94%, and removal of invariable sites in LogDet and ML analyses did not result in significant changes. In NUCML analysis of the complete data set, T1 and T3 were recovered in 99.0% and 1.0% of RELL bootstrap samples, respectively, and T2 was not detected. The KH test (PUZZLE and NUCML) revealed that T1 is significantly better (P < 0.05) than T3 under models of uniform and gamma-distributed rates of substitution, and removal of all invariable or constant sites (PAUP and NUCML) did not affect the ability of the data to discriminate between these two topologies.
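The RELL bootstrap proportions quoted above come from resampling per-site log-likelihoods rather than re-estimating trees for each replicate. The following is a minimal sketch of that idea, assuming the per-site log-likelihoods of each candidate topology are already known; the function name, the number of replicates and the toy values are hypothetical and do not correspond to the NUCML output.

```python
import random


def rell_support(site_lnl_by_tree, n_replicates=1000, seed=0):
    """Estimate RELL bootstrap proportions for a set of candidate topologies.

    site_lnl_by_tree: dict mapping topology name -> list of per-site
    log-likelihoods (same sites, same order, for every topology).
    """
    rng = random.Random(seed)
    trees = list(site_lnl_by_tree)
    n_sites = len(site_lnl_by_tree[trees[0]])
    wins = {tree: 0 for tree in trees}
    for _ in range(n_replicates):
        # Resample sites with replacement, then re-sum the log-likelihoods.
        sample = [rng.randrange(n_sites) for _ in range(n_sites)]
        totals = {
            tree: sum(site_lnl_by_tree[tree][i] for i in sample)
            for tree in trees
        }
        wins[max(totals, key=totals.get)] += 1
    return {tree: wins[tree] / n_replicates for tree in trees}


# Hypothetical per-site log-likelihoods for two topologies over six sites.
toy_lnl = {
    "T1": [-1.0, -2.0, -1.5, -2.2, -1.1, -3.0],
    "T3": [-1.2, -1.9, -1.9, -2.3, -1.0, -3.4],
}
print(rell_support(toy_lnl))
```

Each proportion is the fraction of resampled data sets in which the corresponding topology has the highest total log-likelihood.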

3. Analyses of actin-coding sequences

In favouring the placement of Mesostigma within Streptophyta (a position corresponding to T3), the actin gene trees reported in ref. 10 contrast with the chloroplast trees inferred in our study. Close inspection of the chloroplast and actin trees reveals that their topologies differ only with respect to the placement of the outgroup taxon (Cyanophora). To determine if there is a real conflict between these trees, we tested whether the actin data can provide unequivocal support for T3 under the KH test.

We analysed the actin-coding sequences of most of the taxa examined in ref. 10. Of the 23 actin-coding sequences analysed in the latter study, only two (those of Microthamnion kuetzingianum and Glaucocystis nostochinearum) are not publicly available in databases. The 21 available sequences were unambiguously aligned, and selection of first and second codon positions yielded a data set of 730 nucleotides as reported in ref. 10. Analysis of this data set with the maximum likelihood (NUCML, PAUP and PUZZLE), maximum parsimony (PAUP) and distance (PAUP) methods gave essentially the same results as those reported in ref. 10: trees compatible with T3 proved to be the best supported topologies, with bootstrap values higher than 80%. In NUCML analyses, trees compatible with T1 accounted for a significant proportion (12.7%) of RELL bootstrap samples.
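The selection of first and second codon positions mentioned above amounts to dropping every third nucleotide of an in-frame coding alignment. A minimal sketch follows, assuming each sequence starts at the first position of a codon; the function name and the toy sequence are hypothetical.

```python
def first_and_second_positions(coding_seq):
    """Keep codon positions 1 and 2, dropping every third nucleotide.

    Assumes the sequence is in frame, i.e. index 0 is a first codon position.
    """
    return "".join(base for i, base in enumerate(coding_seq) if i % 3 != 2)


# Hypothetical in-frame fragment of three codons: ATG GCC AAG.
toy_cds = "ATGGCCAAG"
reduced_cds = first_and_second_positions(toy_cds)
print(reduced_cds)
```

Under this scheme, a data set of 730 nucleotides corresponds to 365 aligned codons.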

The KH test was carried out under the HKY model of nucleotide substitution, with either a uniform or a gamma-distributed rate of substitution across sites. These analyses revealed that trees compatible with T3 are not significantly better (P ≥ 0.30; uniform rate, Δ ln L = -13.8 ± 12.1; gamma-distributed rate, Δ ln L = -7.6 ± 5.9) than those compatible with T1, indicating that the actin data set cannot distinguish between these alternative hypotheses and that there is no real conflict between the chloroplast and actin data sets. The T2 and T3 topologies were also compared under the KH test; trees compatible with T3 proved to be significantly better than those compatible with T2 at P < 0.05 (Δ ln L = -21.1 ± 10.1) when the substitution rate was uniform (but not when the rate was heterogeneous; Δ ln L = -8.5 ± 5.4). This result is consistent with the parsimony (MacClade) analysis reported in ref. 10, which revealed a significant cost (10 additional steps in the maximum parsimony tree of 499 steps) for repositioning Mesostigma at the base of Chlorophyta (a topology compatible with T2).