Effective population size shapes the patterns of molecular evolution in Drosophila

Background: A corollary of the nearly neutral theory of molecular evolution is that effective population size determines the efficiency of natural selection. Population size also seems to play a key role in the evolution of genome complexity [7]. Multispecies data on nucleotide diversity can be used to test the expected relationship between population size and selection efficiency at the DNA level. Using nucleotide data from Drosophila species, here we study the levels of synonymous polymorphism among species and the possible causes of their differences.

Results: Differences in the levels of synonymous polymorphism among Drosophila species were accurately assessed, showing that around 22% of the variation in synonymous polymorphism is explained by true differences among species. The differences in synonymous polymorphism are not greatly affected by differences in gene function, phylogenetic relationships among species or demography. Selection efficiency seems to be positively correlated with the levels of synonymous polymorphism: 1) codon bias is positively correlated with the levels of synonymous polymorphism; 2) synonymous polymorphism is negatively correlated with the proportion of repetitive sequences in the genomes; and 3) there are significant differences in the proportion of adaptive substitutions in orthologous genes of pairs of species with significant differences in the levels of synonymous polymorphism.

Conclusions: Differences in the levels of synonymous polymorphism are likely due to differences in long-term effective population size. These differences seem to have implications for the molecular evolution of Drosophila species, as suggested by the positive relationships found between the levels of synonymous polymorphism and (1) evidence of positive selection on coding DNA and (2) the amount of bias for preferred codons. Moreover, species with low levels of synonymous polymorphism have a higher proportion of repetitive sequences in their genomes, suggesting a lower efficiency of natural selection in species with smaller population size. Effective population size seems to affect the patterns of molecular evolution even within a group of closely related species such as those belonging to the Drosophila genus.

Background

The neutral theory of molecular evolution claims that most evolutionary changes at the molecular level are caused not by Darwinian natural selection acting on advantageous mutants, but by the random fixation of selectively neutral variants through the cumulative effect of sampling drift on the continued input of new mutations. According to the neutral theory, the rate of mutant substitution in evolution is equal to the neutral mutation rate, and this rate is independent of population size and environmental conditions [8]. The neutral theory assumes that mutations can be classified as neutral or deleterious. However, Ohta [9] proposed that most “neutral mutations” are in fact slightly deleterious rather than strictly neutral. In this nearly neutral theory, the fate of a mutation depends on the relative strengths of selection and drift [10]. This means that the evolutionary rate is higher in smaller populations than in larger ones, because a slightly deleterious mutation behaves as selectively neutral when the product of the effective population size and the selection coefficient (Nes) is smaller than unity, whereas it may be efficiently selected if Nes is larger than unity [11]. The effective population size (Ne) determines the degree to which gene frequencies are faithfully transmitted across generations [7]. Estimates of the effective population size can be obtained from neutral nucleotide variation, since under the neutral theory θ ≈ 4Neµ (where θ is the average heterozygosity per site for neutral mutations, Ne is the effective population size and µ is the neutral mutation rate per nucleotide site per generation) [11].
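The relation θ ≈ 4Neµ and the Nes threshold described above can be illustrated with a minimal Python sketch. The function names and the numerical values used here (θ = 0.02, µ = 5e-9) are illustrative assumptions only; they are not estimates from this study.

```python
def effective_population_size(theta: float, mu: float) -> float:
    """Rearrange theta ~ 4 * Ne * mu to obtain Ne (diploid, neutral sites)."""
    if mu <= 0:
        raise ValueError("mutation rate must be positive")
    return theta / (4.0 * mu)


def behaves_as_neutral(ne: float, s: float) -> bool:
    """Nearly neutral rule of thumb: drift dominates when |Ne * s| < 1."""
    return abs(ne * s) < 1.0


# Hypothetical inputs, chosen only to show the arithmetic.
theta = 0.02   # assumed neutral diversity per site
mu = 5e-9      # assumed neutral mutation rate per site per generation
ne = effective_population_size(theta, mu)

# The same slightly deleterious mutation (s = -1e-5) in two populations.
small_pop_neutral = behaves_as_neutral(1e4, -1e-5)   # Ne * s = -0.1
large_pop_neutral = behaves_as_neutral(ne, -1e-5)    # Ne * s is about -10
```

Under these assumed values the mutation is effectively neutral in the small population but visible to selection in the large one, which is the contrast the nearly neutral theory predicts.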

In recent years, slightly advantageous as well as slightly deleterious amino acid substitutions have been shown to occur in protein evolution [10, 12, 13]. Bustamante et al. [12] showed that the mean selection coefficients are very weak for Arabidopsis (Nes around -2 to 0.5) and Drosophila (Nes around 0.5 to 3), giving support to the nearly neutral theory [10]. Moreover, Fay et al. [14] and Eyre-Walker et al. [15] showed that a high fraction of amino acid mutations are slightly deleterious in mammals and Drosophila. However, the proportion of substitutions that have been driven by positive selection seems to be high in humans and Drosophila (around 30–50%), showing that the fraction of positively selected mutations is not as small as would be expected under the neutral theory [13, 16-19]. Overall, both genetic drift and positive selection are driving forces of molecular evolution, but the relative impact of the two forces in shaping the observed patterns of nucleotide variability is still under discussion.

Effective population size is a key factor in the nearly neutral theory of molecular evolution, because the fate of a mutation is determined by the product Nes. Akashi [3] showed that, for weakly selected silent mutations, the differences in codon bias between D. melanogaster and D. simulans could be explained by differences in effective population size. He estimated a coefficient of selection acting on preferred codons in D. simulans of Nes ≈ 2. Based on the observation that D. simulans seems to have a larger population size than D. melanogaster [20], Akashi [3] concluded that selection for preferred codons is inefficient in D. melanogaster. Furthermore, Vicario et al. [21] found differences in the levels of genomic codon bias among Drosophila species, suggesting differences in effective population size as a possible explanation.

A recent study that searched for positively selected genes in the genomes of humans and chimpanzees showed that, although the rate of nonsynonymous substitution is higher in humans than in chimpanzees, the number of genes with evidence of positive selection is substantially smaller in humans [22]. These results are explained by the reduced efficacy of natural selection in humans because of their smaller long-term effective population size.

Recently, a new and more far-reaching role has been given to the effective population size in relation to the evolution of genome complexity. Lynch and Conery [7] have proposed that the changes in genome complexity from prokaryotes to multicellular eukaryotes, including gene number, intron abundance and mobile genetic elements, emerged passively in response to the long-term population-size reductions that accompanied increases in organism size. According to this hypothesis, much of the restructuring of eukaryotic genomes was initiated by non-adaptive processes, which in turn provided novel substrates for the secondary evolution of phenotypic complexity by natural selection [7].

Multispecies data on nucleotide diversity are a very useful source of information for testing the expected relationship between population size and selection at the DNA level. Drosophila species have been the focus of genetic and evolutionary studies for decades, and the availability of the sequences of 12 genomes of these species now makes them a focus of evolutionary genomics [21]. The availability of nucleotide polymorphism data for these species (DPDB, “Drosophila Polymorphism Data Base” [23, 24]) allows us to test the hypothesis that effective population size and selection efficiency are positively correlated. Since synonymous polymorphism is thought to be mainly neutral (πs ≈ θ), the differences detected in the levels of synonymous polymorphism among species could be due to differences in effective population size. Therefore, in this paper we first assess the differences in the levels of synonymous polymorphism among Drosophila species. We then evaluate different possible causes that could explain the detected differences. The results indicate that effective population size seems to be a major factor in explaining the differences detected in the levels of synonymous polymorphism among species. Furthermore, we have tested the hypothesis that selection efficiency is positively correlated with effective population size by relating the levels of synonymous polymorphism to i) the bias for preferred codons, ii) genome content, and iii) the amount of adaptive substitutions. The results show that both the level of codon bias and the proportion of adaptive substitutions seem to take higher values in species with higher levels of synonymous polymorphism. Moreover, species with low synonymous polymorphism have a higher proportion of repetitive sequences in their genomes, suggesting a diminished efficiency of selection in species with smaller population size. Our results provide strong evidence in favour of the importance of effective population size as a factor determining the patterns of molecular evolution even within a group of closely related species such as those belonging to the Drosophila genus.

Results

Differences in the levels of synonymous polymorphism among species

DPDB (Drosophila Polymorphism DataBase [23, 24]) provides polymorphism data by gene and species in the Drosophila genus, together with several evaluations of the quality of the estimates and additional information on the genes, such as their coordinates in the D. melanogaster genome and their Gene Ontology classification. A link to the source information of the sequences in GenBank and EMBL, and the possibility of reanalyzing any gene-species dataset, are also provided on the web page.

We filtered the data from DPDB (see Methods), obtaining a dataset with 751 estimates of synonymous polymorphism (πs) [25], belonging to 482 different genes of 15 Drosophila species. The DPDB accession numbers and the values of πs and Tajima’s D (Tajima 1989) by species and gene are listed in Supplementary Table 1. Figure 1 shows the mean πs values per species.

To assess the differences in synonymous polymorphism levels among Drosophila species, we performed an unbalanced analysis of variance. Since different types of genes are included in the 15 analyzed species, the datasets by species could contain a biased sample of genes with high codon bias or with a high rate of selective sweeps, which would decrease the level of synonymous polymorphism in a particular species. Therefore, a two-factor analysis (gene and species) was performed. The result shows that significant differences were found for the species factor, but neither the gene factor nor the interaction between species and genes was significant (Table 1). Since the data of D. melanogaster alone account for 40% of the dataset (Figure 1), the same analysis was performed excluding the data of this species, showing a similar result (Table 1).

To evaluate which species differed in their πs values, we performed a post hoc pairwise Tukey test estimating the differences among pairs of species. Because the effect of a single gene among the 482 different genes analyzed can go undetected by the analysis of variance, we also performed a paired t test among orthologous genes of pairs of species (Table 2). The results of the Tukey test showed that the pairs of species dmel-dsim, dsim-dmir, dpse-dmir, dyak-dsim, dmir-dmel, dsim-dsan, dmir-dari, dmir-dame, dmir-dkik and dkik-dsan are significantly different in their levels of πs. The results of the paired t test were coincident, showing that the pairs of species dmel-dsim (N=114), dsim-dyak (N=10) and dsim-dmir (N=10) are significantly different in their πs values (Table 2). Significant differences detected by the Tukey test were borderline significant for the pairs of species dpse-dmir (N=9) and dmir-dmel (N=10), and not significant for dsim-dsan (N=5), by the paired t test (Table 2). Moreover, the paired t test detected significant differences that were undetected by the Tukey test between the pairs of species dsim-dsec, dsec-dmau, dmel-dsec and dmel-dpse (Table 2). Finally, there are significant differences between pairs of species detected by the Tukey test (dmir-dari, dmir-dame, dmir-dkik, dmir-dmoj and dkik-dsan) that could not be tested by paired t tests because no data for orthologous genes are available for these pairs of species. These results are consistent with the previous ANOVA, indicating that the differences in πs among species are not due to the different genes sampled within each species, because the paired t test compares the same set of genes for each pair of species.
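The logic of the paired comparison, in which each orthologous gene contributes one πs value per species and only the within-gene differences are tested, can be sketched in Python as follows. This is a minimal illustration using only the standard library; the function name and the πs values are hypothetical and are not taken from Table 2. It returns the t statistic and degrees of freedom, and a p-value would still have to be obtained from a t distribution.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(pi_s_a, pi_s_b):
    """Paired t statistic for pi_s values of the same genes in two species.

    Returns (t, degrees_of_freedom). Each position in the two lists must
    refer to the same orthologous gene.
    """
    if len(pi_s_a) != len(pi_s_b):
        raise ValueError("both species need one value per orthologous gene")
    diffs = [a - b for a, b in zip(pi_s_a, pi_s_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two gene pairs are required")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical pi_s values for four orthologous genes in two species.
species_a = [0.030, 0.025, 0.040, 0.035]
species_b = [0.010, 0.015, 0.020, 0.005]
t_value, dof = paired_t_statistic(species_a, species_b)
```

Because the test works on per-gene differences, variation among genes that is shared by both species cancels out, which is why it complements the ANOVA-based Tukey test.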

The differences in the levels of πs between dmel and dsim were highly significant by both the Tukey test and the paired t test. African and derived populations of D. melanogaster and D. simulans seem to have different levels of synonymous polymorphism [26-29]. We therefore grouped the gene sequences by their origin and recalculated the πs values by gene. We obtained a dataset of 91 orthologous genes from non-African populations and 23 orthologous genes from African populations of D. melanogaster and D. simulans. We detected significant differences in πs values between non-African populations of these two species, but the differences were not significant in African populations (Table 2).

All comparisons in Table 2, including those of dsim and dmel with other species, are based on non-African populations of these species.

Various causal factors could produce the differences detected in the levels of πs among species in the above analysis: 1) different selective constraints in diverse functional categories of genes of Drosophila species could affect the πs levels through selection acting on linked non-synonymous sites [19, 21, 30]; 2) species belonging to different species groups have different average recombination rates [24, 33], which could affect the levels of synonymous polymorphism [31]; and 3) deviations from the neutral equilibrium caused by demographic events, such as population bottlenecks, population structure or population expansion, could influence the level of synonymous polymorphism [34, 35].

Differences in selective constraint among genes. We performed ten different ANOVAs testing the model πs = species + GO, where GO is a component that sorts genes as belonging or not to one of ten different categories of biological process at level 3 of the Gene Ontology (see Methods). The results are summarized in Table 4. While the species component has a significant effect on the levels of πs, differences in the levels of πs due to differences in functional category were detected only for genes belonging to functional categories related to metabolic processes (cellular metabolism, macromolecule metabolism and primary metabolism) and sexual reproduction. These genes have significantly lower πs values than genes from other functional categories (Supplementary Results, Table 2).

The functional category effect suggests different selective constraints for these genes. To test this assumption, the levels of constraint (πn / πs) were investigated for the four categories, showing that the genes involved in metabolic processes are significantly more constrained than the rest of the genes (average πn / πs = 0.13 vs. 0.28), while the genes involved in sexual reproduction are less constrained than the other genes (average πn / πs = 0.30 vs. 0.22), although in this last case the differences were not significant. Nevertheless, the percentage of the variance explained by the differences between species was 14.38% and the mean percentage of the variance explained by the functional differences was 2.6%, demonstrating once more that the variation in the levels of πs is mainly due to true differences among species (Table 3).

Taxonomic relationships among species. We performed a fully nested ANOVA to test whether the differences in the levels of synonymous polymorphism could be explained by differences in the subgenus or taxonomic group of the species. The differences in the levels of synonymous polymorphism explained by differences in subgenus were null (Table 5), differences among species groups explained a very low percentage of the total variance (1.42%), and differences among species within groups explained a significant percentage of the total variation in the levels of synonymous polymorphism (14.2%; Table 5). This indicates that the differences in the levels of synonymous polymorphism can be attributed mainly to differences among species within species groups.

The results in the previous section show that some percentage of the variance of πs is explained by differences in the selective constraints of the genes. Therefore, to obtain a more accurate percentage of the variance explained by the differences among species, the data were grouped by the functional category of the genes, and the fully nested ANOVA was performed for each functional category. In all analyses the species component explained a higher percentage of the variance (Supplementary Results, Table 3). The average percentage of the variance in the levels of πs explained by the differences among species, over the 10 datasets obtained when genes were grouped by function, was 22% (Supplementary Results, Table 3).

Deviations from the neutral equilibrium. To evaluate whether the differences in the levels of synonymous polymorphism could be due to deviations of the populations from the neutral equilibrium, we tested for differences in Tajima’s D statistic [34]. Tajima’s D measures the deviation from the mutation-drift equilibrium expected in a population of constant size. Deviation from the neutral equilibrium could be produced by selective events, such as selective sweeps, or by demographic events, such as bottlenecks or population expansion [34]. However, demographic events will generate an overall deviation across the whole genome [13]. Therefore, if the levels of synonymous polymorphism are affected by demographic events in a species, the mean value of Tajima’s D would be different for this species. The mean values of D are shown in Table 1. As for πs, we performed an unbalanced ANOVA to detect differences among the mean values of D by species, and a post hoc Tukey test and paired t tests for orthologous genes to detect differences among pairs of species. The ANOVA showed significant differences in the mean values of D among species (F(14,736) = 3.01, p < 0.001). The Tukey test showed that the differences could be mainly attributed to D. miranda, which has a mean value of D significantly different from that of the rest of the species. Moreover, differences were found between the pair of species dyak-dsim. The paired t test confirmed this last difference. However, the differences in Tajima’s D between D. miranda and the rest of the species were not detected by the paired t test. An examination of the data for this species showed that the differences in the values of D in D. miranda are due to 16 genes from the NeoX chromosome of this species. Bachtrog and Andolfatto [2] found that genes from the NeoX chromosome seem to deviate from equilibrium, possibly owing to a higher rate of adaptive substitutions on this chromosome combined with demographic events [2]. The Tukey analysis without these genes does not show significant differences in Tajima’s D between D. miranda and the rest of the species.
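For reference, Tajima’s D compares two estimators of θ, the mean number of pairwise differences and the number of segregating sites scaled by a sample-size constant, and divides their difference by its expected standard deviation. The following Python sketch follows the standard formula from Tajima (1989); the function name, the toy alignments and the assumption of gap-free sequences of equal length are illustrative, and it does not reproduce the DPDB pipeline or restrict the calculation to synonymous sites.

```python
import math
from itertools import combinations


def tajimas_d(seqs):
    """Tajima's D for a list of aligned, gap-free sequences (n >= 4).

    Returns None when there are no segregating sites, because D is
    undefined in that case.
    """
    n = len(seqs)
    if n < 4:
        raise ValueError("the variance term is zero for fewer than 4 sequences")
    if any(len(s) != len(seqs[0]) for s in seqs):
        raise ValueError("sequences must be aligned to the same length")

    # S: number of alignment columns with more than one state.
    seg_sites = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if seg_sites == 0:
        return None

    # Mean number of pairwise differences over all pairs of sequences.
    pairs = list(combinations(seqs, 2))
    total_diffs = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    mean_pairwise = total_diffs / len(pairs)

    # Sample-size constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (mean_pairwise - seg_sites / a1) / math.sqrt(variance)


# Toy alignments (hypothetical): intermediate-frequency variants versus
# an excess of singletons.
d_mixed = tajimas_d(["AAAA", "AAAT", "AATT", "ATTT"])
d_singletons = tajimas_d(["AAA", "TAA", "ATA", "AAT"])
```

An excess of rare variants, as expected after a population expansion or a selective sweep, pushes D below zero, whereas an excess of intermediate-frequency variants pushes it above zero.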