SUPPLEMENTARY TEXT

Nonparametric Shimodaira-Hasegawa tests. We conducted Shimodaira-Hasegawa (SH) tests of tree topologies (45) within a ML framework using PAUP*4.0b10. For a given dataset, the SH test uses the difference in log likelihoods of competing topologies as the test statistic, and the null distribution of the test statistic is obtained by nonparametric bootstrapping using resampling of estimated log likelihoods (RELL). The SH test is relatively conservative and is a statistically appropriate test of topologies when some of the topologies are generated from the data at hand (18). To avoid potential bias towards higher levels of significance due to small numbers of topologies (53), we added 100 random topologies to each test. We used the SH test for 3 different analyses:

Analysis 1. To test the overall congruence of tree topologies for the 27 diverse strains, we obtained optimal models and ML trees for 5 datasets: the 7 MLST genes (the MLST dataset), the 7 SAS genes (the SAS dataset), the 14 MLST and SAS genes (the COMBINED dataset), the 2 housekeeping genes that flank the agr locus (the FLANKING dataset), and the 4 agr genes of the P2 operon (the CODING dataset) (Supplementary Table 2). For each of these 5 datasets, we compared the 5 ML trees plus 100 random trees separately generated for each dataset, with the SH test using 1000 nonparametric bootstrap replicates of RELL. Statistical significance was assessed at the P<0.05 level.

Analysis 2. To test particular relationships between agr groups and clone phylogeny, we formulated 3 constraint trees that represent the competing hypotheses of agr evolution. Hypothesis A proposes that the species is subdivided into 5 groups that correspond to agr groups. We identified 5 distinct agr sequence clusters in this study, the 4 recognized agr groups plus a novel agr group, here called agr I/IV. Hypothesis B proposes that the species is subdivided into 3 groups that correspond to agr groups. In this case, agrs I, IV, and I/IV were lumped into a single group that reflects their relative sequence similarity. Hypothesis C was formulated from the COMBINED tree, and proposes that the species is subdivided into 2 groups that each consist of multiple agr groups. The 3 constraint trees were formulated as follows. First, we ran a MP analysis of the COMBINED dataset and retained only those trees compatible with each constraint. Second, the parameters of the optimal model for the COMBINED dataset were estimated on the most parsimonious tree compatible with each constraint. Third, we ran a ML analysis with the estimated parameters and retained only those trees compatible with each constraint. Finally, for the COMBINED dataset, we compared the 3 ML constraint trees plus 100 random trees generated from the COMBINED dataset, with the SH test as outlined above.

Analysis 3. To test the congruence of the CODING tree topologies from the separate agr groups with their corresponding COMBINED tree topologies, we obtained optimal models and ML trees for 6 additional datasets including separate CODING and COMBINED datasets for the 14 strains of agrs I, IV and I/IV, the 7 strains of agr II, and the 6 strains of agr III (Supplementary Table 2). For each of these 6 datasets, we compared the ML tree from the CODING dataset, plus the ML tree from the COMBINED dataset, plus 100 random trees separately generated for each dataset, with the SH test as outlined above.
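The logic of the SH test with RELL resampling, as used in the 3 analyses above, can be sketched in a few lines of Python. This is an illustrative sketch only and is not the PAUP* implementation: the function name sh_test and the input layout (a table of per-site log likelihoods, one row per candidate topology) are assumptions made for the example, and the centering step follows the usual textbook description of the test.

```python
import random


def sh_test(site_lnl, n_boot=1000, seed=0):
    """Sketch of an SH test using RELL resampling.

    site_lnl[i][k] is the log likelihood of site k under topology i
    (hypothetical input layout; the values come from an ML program).
    Returns (observed statistics, one-sided P values), one per topology.
    """
    rng = random.Random(seed)
    n_trees = len(site_lnl)
    n_sites = len(site_lnl[0])

    # Observed statistic: how far each topology falls below the best one.
    totals = [sum(row) for row in site_lnl]
    best = max(totals)
    observed = [best - t for t in totals]

    # RELL: resample sites with replacement and re-sum the fixed per-site
    # log likelihoods, rather than re-optimizing every replicate.
    boot = []
    for _ in range(n_boot):
        idx = [rng.randrange(n_sites) for _ in range(n_sites)]
        boot.append([sum(row[k] for k in idx) for row in site_lnl])

    # Center each topology's replicate totals on their own mean, so that
    # the replicates conform to the null of equally good topologies.
    means = [sum(rep[i] for rep in boot) / n_boot for i in range(n_trees)]

    counts = [0] * n_trees
    for rep in boot:
        centered = [rep[i] - means[i] for i in range(n_trees)]
        top = max(centered)
        for i in range(n_trees):
            # Count replicates at least as extreme as the observed value.
            if top - centered[i] >= observed[i]:
                counts[i] += 1

    return observed, [c / n_boot for c in counts]
```

Because each replicate statistic is measured against the best topology in the whole candidate set, the composition of that set affects the P values, which is why the number of topologies included in each test matters.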

Sequence simulations and parametric bootstrap tests. Since the relationship between agr groups and clone phylogeny is tested by reference to an inferred phylogenetic tree, it is helpful to understand some statistical properties of the datasets that underlie the tree. In particular, how much error is in our datasets, and could the tree be improved if more data were collected? These questions can be approached by Monte Carlo simulation of sequences that have evolved according to the same optimal models and ML trees, with the same branch lengths, as those of the real datasets (42). Additionally, sequences that are simulated under particular hypotheses can be used in parametric bootstrap tests to compare with sequences from the real datasets (20). We used Seq-Gen 1.2.6 (38) to simulate sequences for 2 different analyses:

Analysis 1. To study statistical error and consistency in our datasets, we simulated replicates of our datasets assuming the same optimal models and ML trees as the real datasets. For the MLST, SAS, and COMBINED datasets, the parameters of the optimal models were reoptimized on the ML trees. For each dataset, the reoptimized model parameters and the ML tree with branch lengths were taken as input by Seq-Gen to simulate 100 datasets of the same sequence length as the real dataset. For each of the 100 simulated datasets, we obtained an ML tree using the reoptimized model, a neighbor-joining starting tree, and nearest-neighbor-interchange (NNI) branch swapping. Statistical error was quantified as the proportion of simulated trees with an identical topology to that used as input for the simulations. Statistical consistency can only be determined with infinitely long sequences (42), so the results presented here should be viewed as a trend over sequence lengths likely to be used in real studies. To determine whether the datasets would converge on the correct tree as more data are added, we simulated a series of 100 datasets that included 2x, 4x, and 8x the sequence length of the real datasets and calculated their statistical error.
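The proportion of simulated trees that recover the input topology can be computed by reducing each tree to its set of nontrivial bipartitions and comparing the sets. The following Python sketch illustrates the idea and is not the procedure actually run for this study: the nested-tuple tree representation and the function names are assumptions made for the example, and in practice the trees would first be read from the output of the ML program.

```python
def leaf_set(node):
    """Collect the taxon names below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def splits(tree, taxa):
    """Return the nontrivial bipartitions of a tree, ignoring the root."""
    ref = min(taxa)  # reference taxon used to give each split one form
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # Always store the side of the split that excludes the reference.
        side = taxa - clade if ref in clade else clade
        if 1 < len(side) < len(taxa) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def topology_match_proportion(input_tree, simulated_trees):
    """Proportion of simulated trees with the input (unrooted) topology."""
    taxa = leaf_set(input_tree)
    target = splits(input_tree, taxa)
    hits = sum(splits(tree, taxa) == target for tree in simulated_trees)
    return hits / len(simulated_trees)
```

For example, with the hypothetical 5-taxon input tree ((("A", "B"), "C"), ("D", "E")), a simulated tree written as (("A", "B"), ("C", ("D", "E"))) counts as a match because it differs only in root position, whereas ((("A", "C"), "B"), ("D", "E")) does not.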

Analysis 2. To more thoroughly test the relationship between agr groups and clone phylogeny, we simulated datasets assuming that the 3 hypotheses outlined above were true. The ML trees most compatible with the 3 hypotheses, used previously for the SH tests, were reused for this analysis. For each hypothesis, 100 datasets were simulated with the reoptimized model parameters and the ML tree with branch lengths. A test statistic was calculated from the real data as the difference in log likelihood between an unconstrained tree and a tree constrained to each hypothesis. The null distribution of the test statistic was obtained from the simulated datasets.
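Once the test statistic has been computed for the real data and for each simulated dataset, the final comparison in a parametric bootstrap test is a simple tail proportion. The Python sketch below shows one common convention and is an assumption rather than a description of the exact calculation used here: the function name is hypothetical, and the P value is taken as the fraction of simulated statistics that are at least as large as the observed one.

```python
def parametric_bootstrap_p(lnl_unconstrained, lnl_constrained, simulated_deltas):
    """Sketch of the final step of a parametric bootstrap test.

    lnl_unconstrained, lnl_constrained: log likelihoods of the real data
    on the unconstrained tree and on the tree constrained to a hypothesis.
    simulated_deltas: the same difference computed for each dataset
    simulated under that hypothesis (the null distribution).
    Returns (observed difference, one-sided P value).
    """
    delta_obs = lnl_unconstrained - lnl_constrained
    # Fraction of the null distribution at least as extreme as observed.
    exceed = sum(d >= delta_obs for d in simulated_deltas)
    return delta_obs, exceed / len(simulated_deltas)
```

A small P value indicates that the observed difference in log likelihood is larger than expected if the hypothesis used for the simulations were true, so that hypothesis is rejected.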