How Many Unique Names Are There?

Supporting online material

How many unique names are there?

In the main text, we discuss briefly how many unique species names there might be. Table S1 summarizes recent estimates. We include this review of previous literature here because it emphasizes the importance of working from data where the issues of synonymy are resolved. Simply, working from species lists that have not resolved synonymy can lead to estimates of total species numbers that differ by a factor of two or more.

The data

The World Checklist of Selected Plant Families (WCSP) (2008) is an authoritative, fully synonymised database of seed plant species that has been peer reviewed and remains under continuous review. We received a download of the database in November 2008 containing all accepted species so far available in the checklist; that is, all species currently in the database that have been formally described and are considered to be both biologically and nomenclaturally valid, henceforth “unique species”. For each species we extracted the earliest date of publication. For example, Anacamptis morio (L.) R. M. Bateman, Pridgeon & M. W. Chase (Bateman et al. 1997) was first described by Linnaeus (Linné 1753), who placed it in the genus Orchis. We therefore use Linnaeus as the authority and 1753 as the date of earliest description.

Details of the statistical methods

The predicted values come from multiplying the actual number of taxonomists active in each five-year interval by the predicted number of species described per taxonomist. As an example, imagine a hypothetical interval with three unique species discovered (S1, S2, S3). If three taxonomists (A1, A2, A3) authored the publication naming S1, and the publications naming S2 and S3 were each authored by different single authors (A4, A5, respectively), then we consider there to be five taxonomists working during that interval.
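The counting rule in this example can be sketched in a few lines of Python. This is only an illustration: the record layout, the function names, and the choice to align intervals on multiples of five are assumptions, as is counting an author once per interval even if they appear on several publications.

```python
from collections import defaultdict

def interval_start(year, width=5):
    """Return the first year of the fixed-width interval containing `year`."""
    return year - (year % width)

def count_taxonomists(descriptions, width=5):
    """Map each interval start to (species described, distinct authors).

    `descriptions` is an iterable of (species, year, authors) tuples.
    """
    authors_by_interval = defaultdict(set)
    species_by_interval = defaultdict(int)
    for species, year, authors in descriptions:
        start = interval_start(year, width)
        authors_by_interval[start].update(authors)  # a set, so each author counts once
        species_by_interval[start] += 1
    return {
        start: (species_by_interval[start], len(authors_by_interval[start]))
        for start in sorted(species_by_interval)
    }

# The hypothetical interval from the text: S1 has three authors, S2 and S3 one each.
# The years are invented for illustration.
records = [
    ("S1", 1901, ["A1", "A2", "A3"]),
    ("S2", 1902, ["A4"]),
    ("S3", 1904, ["A5"]),
]
counts = count_taxonomists(records)
print(counts)  # {1900: (3, 5)}
```

The result gives three species and five taxonomists for the interval, matching the count described above.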

The predicted number of species per taxonomist is the product of two quantities. One is the number of unknown species: the total number of species in the family minus the number already described. The second is the changing efficiency of taxonomists. This is an uncorrected value, essentially a potential efficiency, that is corrected downwards to get the realized value as the supply of unknown species declines.

Several of the experts from whom we solicited opinions on the number of remaining species doubted that taxonomic efficiency had increased. Empirically, it has increased in nearly all the analyses we performed, though the increase is not always marked. The model does not require an increase: the slope of the line can be zero. Examples are Figures 1 and 2 in the main text. Of course, the observed numbers of species per taxonomist eventually decrease as the supply of unknown species declines.

Why did we not use a linear model?

First, consider what it takes to linearise the problem. The basic equation is

$$S_i = T_i (a + b Y_i)(S_T - C_i) \qquad (1)$$

or

$$S_i / T_i = (a + b Y_i)(S_T - C_i)$$

where $a$, $b$ and $S_T$ are constants, $T_i$ is the number of taxonomists, $Y_i$ is the year, and $C_i$ is the cumulative number of species $S_i$ described.

Assume there is no year effect, i.e. $b = 0$:

$$S_i = a T_i (S_T - C_i)$$

$$S_i = a S_T T_i - a T_i C_i$$

which is of the form $y = \beta_0 T + \beta_1 T C$, a linear regression with $\beta_0 = a S_T$ and $\beta_1 = -a$.

So $y = 0$ (no species are described) when $\beta_0 = -\beta_1 C$, or $C = -\beta_0 / \beta_1$; this is the estimate of $S_T$.

Doing this calculation for orchids gives an estimate of ~34,000 species, rather higher than any of the estimates we obtained using our more complex model. We ran such analyses on all the data and that comparison is typical.
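A minimal sketch of this linearised calculation follows, assuming NumPy and a synthetic, noise-free series generated from the model with no year effect. The function name and all data values are invented for illustration, and the cumulative total is taken here to be the number of species described before each interval, which is an assumption about the bookkeeping.

```python
import numpy as np

def linear_total_estimate(S, T, C):
    """Regress S on T and T*C without an intercept and return -b0/b1."""
    X = np.column_stack([T, T * C])
    (b0, b1), *_ = np.linalg.lstsq(X, S, rcond=None)
    return -b0 / b1

# Synthetic series from S = a * T * (S_T - C); none of these numbers are real data.
true_total, a = 1000.0, 0.002
T = np.linspace(5, 30, 25)      # taxonomists per interval
S = np.empty_like(T)            # species described per interval
C = np.empty_like(T)            # species described before each interval
described = 0.0
for i, t in enumerate(T):
    C[i] = described
    S[i] = a * t * (true_total - described)
    described += S[i]

estimate = linear_total_estimate(S, T, C)
print(estimate)  # close to 1000 for this noise-free series
```

With real data the two regression coefficients are estimated with error, so the ratio can be far less stable than in this idealised case.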

The associated plots of predicted species per taxonomist would show a continually declining ratio, as eventually it must. This has two related failings. The first is that it does not capture the increase in species per taxonomist apparent in many of the analyses, especially the non-monocots of Figure 1. The second is that it gives too shallow a slope to the last century of observations and, consequently, produces high estimates of total numbers of species.

Second, consider the full model, equation (1), written out term by term:

$$S_i \propto T_i (a + b Y_i)(S_T - C_i)$$

or

$$S_i \propto a S_T T_i + b S_T T_i Y_i - a T_i C_i - b T_i Y_i C_i \qquad (2)$$

Isn’t this a linear model? Well, no. Equation (2) has four terms but only three unknown parameters, $a$, $b$ and $S_T$, so the four coefficients ($a S_T$, $b S_T$, $-a$ and $-b$) cannot be estimated independently. Given this constraint, and the fact that there are only three parameters to search over, with fairly obvious plausible values for all of them, a grid search followed by a steepest descent method to find the minimum sum of squares is an efficient way to estimate $S_T$. (One can certainly use “off the shelf” statistical packages, of course.)
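The sketch below illustrates the general idea of searching for the total that minimises the sum of squares. It is not the procedure used for the results reported here. It swaps the steepest-descent stage for something simpler: for each candidate total on a grid, the two efficiency parameters are solved by linear least squares (the model is linear in them once the total is fixed), and the grid is then narrowed around the best candidate. The function names, the synthetic data, and measuring year from the first interval are all assumptions.

```python
import numpy as np

def fit_efficiency(S, T, Y, C, total):
    """With S_T fixed at `total`, solve for a and b by linear least squares."""
    remaining = total - C
    X = np.column_stack([T * remaining, T * Y * remaining])
    coef, *_ = np.linalg.lstsq(X, S, rcond=None)
    rss = float(np.sum((S - X @ coef) ** 2))
    return coef[0], coef[1], rss

def estimate_total(S, T, Y, C, lo, hi, points=201, rounds=4):
    """Grid search for S_T on [lo, hi], shrinking the grid around the best value each round."""
    for _ in range(rounds):
        grid = np.linspace(lo, hi, points)
        rss = [fit_efficiency(S, T, Y, C, g)[2] for g in grid]
        best = grid[int(np.argmin(rss))]
        step = grid[1] - grid[0]
        lo, hi = best - step, best + step
    a, b, best_rss = fit_efficiency(S, T, Y, C, best)
    return best, a, b, best_rss

# Synthetic, noise-free data generated from equation (1); all values are made up.
true_a, true_b, true_total = 5e-4, 4e-6, 2000.0
T = np.linspace(2, 40, 50)      # taxonomists per interval
Y = 5.0 * np.arange(50)         # years since the first interval
S = np.empty(50)
C = np.empty(50)
described = 0.0
for i in range(50):
    C[i] = described            # species described before interval i
    S[i] = T[i] * (true_a + true_b * Y[i]) * (true_total - described)
    described += S[i]

# The search starts just above the number of species already described.
total, a, b, rss = estimate_total(S, T, Y, C, lo=described + 1.0, hi=10000.0)
print(total, a, b, rss)
```

On noise-free data the search recovers the generating total closely. With real, spiky data the surface is flatter and the result depends on the scale on which residuals are measured.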

The issue of spiky data

Our data are “spiky”: in some intervals, a few important monographs were written by a few taxonomists (Schlechter for the orchids, for example), so that the species-to-taxonomist ratios were very much higher than either before or after. These spikes are typical of the earlier history of taxonomy, when there were many unknown species and few taxonomists, not of the last three-quarters of a century.

Initially, we minimized the arithmetic sum of squares. The estimates we obtained were very similar to the ones reported in the main text. Nonetheless, the total residual sum of squares tends to be dominated by one or two of these spikes. So, for the results in the main text, we logarithmically transformed the data.

This transformation creates special difficulties for families with relatively few species, roughly one thousand or fewer. In the first few five-year intervals, there are often very few species described, and sometimes none at all, and this creates large residuals on the logarithmic scale. We removed this problem by analyzing only those five-year intervals after 1760 for which there was a cumulative total of 40 or more described species.
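The selection rule and the logarithmic sum of squares might be sketched as follows. The function names and the toy counts are hypothetical. Two details are assumptions rather than statements of the procedure used: "after 1760" is read as intervals starting in 1760 or later, and the cumulative total includes the current interval.

```python
import numpy as np

def select_intervals(years, S, min_year=1760, min_cumulative=40):
    """Mask of intervals starting at or after `min_year` whose running total has reached `min_cumulative`."""
    years = np.asarray(years)
    cumulative = np.cumsum(S)   # running total, including the current interval
    return (years >= min_year) & (cumulative >= min_cumulative)

def log_rss(observed, predicted):
    """Sum of squared differences on the natural-log scale (inputs must be positive)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sum((np.log(observed) - np.log(predicted)) ** 2))

# Toy counts of species described per five-year interval; not real data.
years = np.array([1750, 1755, 1760, 1765, 1770, 1775])
S = np.array([12, 0, 3, 30, 25, 40])
mask = select_intervals(years, S)
print(years[mask])  # [1765 1770 1775]
```

Here the first three intervals are excluded because the running total has not yet reached 40, which also removes the interval with no species described.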

We notice that spikes become less frequent in more recent decades as the number of taxonomists increases and, thus, as the influence of individual taxonomists declines. The decline in spikes may also reflect research assessment exercises that penalize those who do not publish annually.

Certainty about parameter values

The disadvantage of fitting non-linear models with off-the-shelf statistical packages is that they provide only approximate standard errors of the estimates. We instead employ the familiar jack-knife procedure. As described in the main text, we remove each five-year interval in turn and show the effect on the estimate of unknown species. The estimates are encouragingly robust.

We report these estimates as their maximum and minimum and not as standard deviations (from which one might calculate confidence intervals) for two reasons. Firstly, there is no reason to think that these jack-knife estimates are normally distributed. Indeed, they are certainly not: they are bounded away from the known cumulative number of species, for example. Secondly, there is a sense in which the data are a complete enumeration of the available data and not a sample of it. Thus, removing a five-year interval with a spike of taxonomic activity is not removing a point subject, as it were, to some vagaries of chance. Rather, it asks what happens to the estimate of total species when we count the species found in that interval in the cumulative species total but do not include that interval as an observation. Such an estimate of total species may be unusually high, but it is informative nonetheless. By reporting such maximum values (and not just the standard error of all of them) we record such information.
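The leave-one-interval-out procedure can be sketched as below. To keep the example short it uses the simple no-year-effect regression as a stand-in estimator rather than the full model, and the data are synthetic with one artificial spike. The function names and all numbers are assumptions. The key point is that cumulative totals are computed once from the full series, so species from the dropped interval still count in later cumulative totals.

```python
import numpy as np

def linear_total_estimate(S, T, C):
    """Stand-in estimator: regress S on T and T*C (no intercept), return -b0/b1."""
    X = np.column_stack([T, T * C])
    (b0, b1), *_ = np.linalg.lstsq(X, S, rcond=None)
    return -b0 / b1

def jackknife_range(S, T, estimator=linear_total_estimate):
    """Drop each interval as an observation in turn; return (min, max) of the estimates."""
    S = np.asarray(S, dtype=float)
    T = np.asarray(T, dtype=float)
    # Species described before each interval, from the FULL series.
    C = np.concatenate([[0.0], np.cumsum(S)[:-1]])
    estimates = []
    for i in range(len(S)):
        keep = np.arange(len(S)) != i
        estimates.append(estimator(S[keep], T[keep], C[keep]))
    return min(estimates), max(estimates)

# Synthetic series with one monograph-style spike; none of these numbers are real data.
true_total, a = 1000.0, 0.002
T = np.linspace(5, 30, 25)
S, described = [], 0.0
for i, t in enumerate(T):
    s = a * t * (true_total - described)
    if i == 8:
        s *= 3.0                # artificial spike in one interval
    S.append(s)
    described += s

lo, hi = jackknife_range(S, T)
print(lo, hi)
```

In this toy series, dropping the spiked interval leaves observations that follow the generating model exactly, so the reported range brackets the generating total.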

We can also assess the sensitivity of our analyses to the interval used. We did so by reanalyzing our data over ten-year intervals. The differences in the estimates are small and our general conclusions are unchanged.

Why fits of species described per year are broadly quadratic in form

Notice that the decline in remaining species is very roughly linear, and would be exactly so were the numbers of species described per year constant. (The numbers of selected non-monocots described per year have been roughly constant since about 1850; Figure 1, bottom left.) Were the numbers exactly constant, one could replace equation (2) in the text, which depends on the remaining species, with an equation for the remaining species $S_R$ that depends on year:

$$S_R = c - d \cdot \mathrm{year}$$

and thus

$$S_i / T_i \propto (a + b \cdot \mathrm{year})(c - d \cdot \mathrm{year}) \qquad (3)$$

This is a quadratic function of year, which is also a rough description of Figure 1 (top, red line). This explains why efforts such as that of Bebber et al. (2007), who fit polynomial curves to species versus year, will sometimes provide sensible estimates but will equally sometimes fail if the assumptions made in this paragraph do not hold.
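As a worked step, writing $Y$ for year (a shorthand introduced here), the product in equation (3) expands as follows, which makes the quadratic form explicit:

```latex
\begin{align*}
(a + bY)(c - dY) &= ac + (bc - ad)\,Y - bd\,Y^{2}, \\
Y^{*} &= \frac{bc - ad}{2bd}.
\end{align*}
```

Assuming $b > 0$ and $d > 0$, the coefficient of $Y^{2}$ is negative, so the curve is concave: species described per taxonomist rises to a peak at $Y^{*}$ and declines thereafter.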

Parameter sensitivity

Any non-linear model such as eq. 5 raises the question of how sensitive the estimated parameters are. One can readily calculate how changing the value of each of the three parameters alters the residual sum of squares. Approximately, a given proportional change in total species numbers alters the residual sum of squares by ten times as much as the same proportional change in either of the other two parameters.
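One way to carry out such a calculation is sketched below: scale each parameter in turn by the same proportion and record the increase in the residual sum of squares. The function names, the 5% perturbation, and the synthetic data are assumptions. The sketch works on the arithmetic scale for simplicity, so it does not reproduce the ten-fold figure quoted above.

```python
import numpy as np

def rss(a, b, total, S, T, Y, C):
    """Residual sum of squares for S = T*(a + b*Y)*(S_T - C) on the arithmetic scale."""
    predicted = T * (a + b * Y) * (total - C)
    return float(np.sum((S - predicted) ** 2))

def rss_changes(a, b, total, S, T, Y, C, fraction=0.05):
    """Increase in RSS when each parameter in turn is scaled by (1 + fraction)."""
    base = rss(a, b, total, S, T, Y, C)
    f = 1.0 + fraction
    return {
        "a": rss(a * f, b, total, S, T, Y, C) - base,
        "b": rss(a, b * f, total, S, T, Y, C) - base,
        "S_T": rss(a, b, total * f, S, T, Y, C) - base,
    }

# Synthetic, noise-free data generated from the model; all values are made up.
a0, b0, total0 = 5e-4, 4e-6, 2000.0
T = np.linspace(2, 40, 50)      # taxonomists per interval
Y = 5.0 * np.arange(50)         # years since the first interval
S = np.empty(50)
C = np.empty(50)
described = 0.0
for i in range(50):
    C[i] = described            # species described before interval i
    S[i] = T[i] * (a0 + b0 * Y[i]) * (total0 - described)
    described += S[i]

changes = rss_changes(a0, b0, total0, S, T, Y, C)
print(changes)
```

Comparing the three entries shows which parameter the fit constrains most tightly for a given data set.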

This has two important consequences. First, it gives us confidence that the estimate of the total number of species is tightly constrained. Second, it means that differences in the slope and intercept of the taxonomic efficiency do not greatly alter the residual sum of squares and, by extension, do not greatly affect the estimate of the total number of species.

An explanation of why descriptions across families may not approximate all families modelled together

In period one, let T taxonomists describe S species in family A and T taxonomists describe S species in family B. They could be the same or different taxonomists involved. In period two, U taxonomists describe R species in family A and the same number in family B and again they could be the same or different taxonomists. For both families, the number of species per taxonomist is

Period one: $S/T$

Period two: $R/U$

Suppose that these ratios are equal, i.e.

$$S/T = R/U \quad \text{or equivalently} \quad S/R = T/U \qquad (4)$$

and because they do not decrease over time, the predicted number of species in each family would be infinite.

The total number of species described for both families is 2S in period one and 2R in period two. Let the total number of taxonomists for period one be X1, and for period two be X2. The following inequalities hold.

$$T \le X_1 \le 2T \quad \text{and} \quad U \le X_2 \le 2U \qquad (5)$$

We now expect that when we combine the families, the ratio of species per taxonomist will decrease, indicating that taxonomists are running out of species to describe, i.e.

$$2S/X_1 > 2R/X_2 \quad \text{or equivalently} \quad S/R > X_1/X_2$$

or, from (4),

$$T/U > X_1/X_2$$

The smallest value that $X_1/X_2$ can take is $T/(2U)$, i.e. when $X_1 = T$ and $X_2 = 2U$. Because $T/(2U) < T/U$, the inequality can then be satisfied. This occurs when the same taxonomists describe species in both families in period one, but by period two each family has its own specialist taxonomists, a trend that is familiar.
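A small numerical illustration of this extreme case follows. The taxonomist labels and species counts are invented. In period one the same two taxonomists work on both families, and in period two each family has four specialists of its own.

```python
def species_per_taxonomist(species, *author_sets):
    """Species described divided by the number of distinct taxonomists involved."""
    taxonomists = set().union(*author_sets)
    return species / len(taxonomists)

# Period one: the same T = 2 taxonomists describe S = 10 species in each family.
family_a_1 = {"P", "Q"}
family_b_1 = {"P", "Q"}
# Period two: each family has its own U = 4 specialists, describing R = 20 species each.
family_a_2 = {"P", "Q", "R1", "R2"}
family_b_2 = {"V", "W", "X", "Z"}

# Ratios within one family (both families have the same counts by construction).
per_family = (
    species_per_taxonomist(10, family_a_1),
    species_per_taxonomist(20, family_a_2),
)
# Ratios when the two families are pooled: 2S over X1, then 2R over X2.
combined = (
    species_per_taxonomist(20, family_a_1, family_b_1),
    species_per_taxonomist(40, family_a_2, family_b_2),
)
print(per_family)  # (5.0, 5.0)
print(combined)    # (10.0, 5.0)
```

Within each family the ratio stays at five species per taxonomist, yet the pooled ratio halves from ten to five, purely because the taxonomists became specialised.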

Expert Opinion

To have numbers to compare to our model output, we solicited expert opinion from taxonomists known to be experts at describing species from certain families. The expert taxonomists did not provide details of the methodologies they employed. They likely varied widely, from estimations based on recent fieldwork, to extrapolations from datasets, to simple speculation.

Supporting References

Bateman, R., Pridgeon, A. & Chase, M. 1997 Phylogenetics of subtribe Orchidinae (Orchidoideae, Orchidaceae) based on nuclear ITS sequences. 2. Infrageneric relationships and reclassification to achieve monophyly of Orchis sensu stricto. Lindleyana 12, 113-141.

Bebber, D., Marriott, F., Gaston, K., Harris, S. & Scotland, R. 2007 Predicting unknown species numbers using discovery curves. Proceedings of the Royal Society B: Biological Sciences 274, 1651-1658.

Bramwell, D. 2002 How many plant species are there? Plant Talk 28, 32-34.

Govaerts, R. 2001 How many species of seed plants are there? Taxon 50, 1085-1090.

Govaerts, R. 2003 How many species of seed plants are there?: a response. Taxon 52, 583-584.

Groombridge, B. & Jenkins, M. 2000 Global biodiversity: Earth's living resources in the 21st century. Cambridge: World Conservation Press.

Hammond, P. 1992 Species inventory. In Global biodiversity: status of the Earth’s living resources (ed. B. Groombridge), pp. 17–39. London.

Hawksworth, D. & Kalin-Arroyo, M. 1995 Magnitude and distribution of biodiversity. Global biodiversity assessment. Cambridge: Cambridge University Press.

Heywood, V., Brummitt, R., Culham, A. & Seberg, O. 2007 Flowering plant families of the world. Richmond, Surrey: Royal Botanic Gardens, Kew.

Linné, C. 1753 Species plantarum. Impensis Laurentii Salvii, Stockholm 2, 970-971.

Mabberley, D. 1997 The Plant-Book: a portable dictionary of the vascular plants. Cambridge University Press.

May, R. 1990 How many species? Philosophical Transactions of the Royal Society of London B 330, 293-304.

Paton, A., Brummitt, N., Govaerts, R., Harman, K., Hinchcliffe, S., Allkin, B. & Lughadha, E. 2008 Towards Target 1 of the Global Strategy for Plant Conservation: a working list of all known plant species-progress and prospects. Taxon 57, 602-611.

Scotland, R. & Wortley, A. 2003 How many species of seed plants are there? Taxon 52, 101-104.

Stebbins, G. 1974 Flowering plants: evolution above the species level. Cambridge: Harvard University Press.

Thorne, R. 2000 Classification and geography of dicotyledons. Botanical Review 66, 441-650.

Thorne, R. 2002 How many species of seed plants are there? Taxon 51, 511-512.

Wilson, E. 1992 The Diversity of Life. Cambridge: Harvard University Press.

World Checklist of Selected Plant Families 2008. The Board of Trustees of the Royal Botanic Gardens, Kew. Published on the Internet;

Wortley, A. & Scotland, R. 2004 Synonymy, sampling and seed plant numbers. Taxon 53, 478-480.

Table S1. Estimates of the numbers of accepted names of flowering plants (angiosperms) and of seed plants (angiosperms and gymnosperms).

Reference and year / Estimate of accepted names / Taxonomic group
Scotland & Wortley (2003) / 223,300 / Seed plants
Stebbins (1974) / 231,413 / Seed plants
Hawksworth & Kalin-Arroyo (1995) / 240,000 / Seed plants
Wilson (1992) / 248,400 / Seed plants
Mabberley (1997) / 249,500 / Seed plants
Thorne (2000) / 257,400 / Seed plants
Thorne (2002) / 258,650 / Seed plants
May (1990); Hammond (1992); Groombridge & Jenkins (2000) / 270,000 / Seed plants
Heywood et al. (2007) / 284,372–293,575 / Flowering plants
Wortley & Scotland (2004) / 346,527 / Seed plants
Paton et al. (2008) / 352,282 / Flowering plants
Paton et al. (2008) / 379,881 / Seed plants, plus ferns and bryophytes
Bramwell (2002) / 421,968 / Seed plants
Govaerts (2001) / 422,127 / Seed plants
Govaerts (2003) / 446,600 / Seed plants