J. Gregori et al. Inference with viral diversity indices - Supplementary material

**Inference with viral quasispecies diversity indices**

Josep Gregori, Miquel Salicrú, Esteban Domingo, Alex Sanchez, Francisco Rodríguez-Frías, Josep Quer

**SUPPLEMENTARY MATERIAL**

## 1 GLOSSARY

Quasispecies terms (Perales et al. 2012)

Consensus sequence: in a set of aligned nucleotide or amino acid sequences, the one that results from taking the most common residue at each position.

Complexity of a mutant spectrum: number of mutations and genomic sequences in a viral population. It is often quantified by the mutation frequency and the Shannon entropy.

Genetic distance: Usually the Hamming distance between pairs of sequences. Different evolutionary models may be used to introduce corrections to the observed number of differences.

Master or dominant sequence: the genomic nucleotide sequence that dominates a mutant spectrum because of its superior fitness. It may or may not be identical to the consensus sequence. The most abundant genome may still be a minority relative to the ensemble of low frequency variants. Owing to the abundance of quasineutral mutations and epistatic interactions in viral genomes, there might be a large ensemble of sequences of almost identical fitness that compose a ‘master phenotype’.

Mutation frequency: the proportion of mutated sites in a population of viral genomes with respect to the dominant haplotype (Eq. VIII), to the consensus sequence, or to a given reference.

Mutation rate: the frequency of occurrence of a mutation during viral genome replication.

Mutant spectrum: the ensemble of mutant genomes that compose a viral quasispecies. It is also termed mutant swarm or mutant cloud.

Rate of evolution: the frequency of mutations that become dominant (i.e., are represented in the consensus sequence) as a function of time. It may refer to evolution within a host individual or upon epidemic expansion of a virus.

Viral quasispecies: a set of viral genomes that belong to a replicative unit and are subjected to genetic variation, competition, and selection, and which acts as a unit of selection. It has been extended to mean ensembles of similar viral genomes generated by a mutation–selection process.

Biodiversity terms

Complexity: Any index or a set of indices quantifying the variability of a viral population in a wide sense, including richness, diversity and heterogeneity.

Diversity index: A measure of compositional complexity expressing the degree of variation of forms in a community. A function of the frequencies of the different species (haplotypes), usually given by an entropy expression (Magurran, 2004). The Shannon entropy (S) and the Gini-Simpson index are examples of diversity indices.

Effective number of species: number of equally common species, estimated by the Hill numbers of different orders (Jost, 2006; Chao & Jost, 2012).

Evenness index: The ratio of a diversity index I to Imax, where Imax is the value that I would take if the abundances in the sample were equal. Sn is an example of an evenness index (Hill, 1973).

Heterogeneity: A measure of diversity taking into account differences among individuals. It is a function of the number of haplotypes, their frequencies, and their differences. The mutation frequency and the nucleotide diversity are examples of heterogeneity measures.

Hill numbers: A function transforming a diversity index into an effective number of species. The richness is the Hill number of order 0, the exponential of the Shannon entropy is the Hill number of order 1, and the inverse of the Simpson index (one minus the Gini-Simpson index) is the Hill number of order 2 (Hill, 1973; Chao & Jost, 2012).

Phylogenetic diversity: A measure of heterogeneity using the branch lengths in a phylogenetic tree as distances (Chao et al., 2010).

Richness: Number of species in a community (Magurran, 2004). In the context of a quasispecies, it is the number of haplotypes, that is, of different genomes found in the population. The number of polymorphic sites and the number of mutations may be considered richness indices as well.

True diversity: See effective number of species (Jost, 2006).
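As an illustration of the effective number of species, the following minimal sketch computes plug-in Hill numbers of orders 0, 1 and 2 from haplotype counts. The function name `hill_number` and the example counts are hypothetical, and no bias correction is attempted.

```python
import math

def hill_number(counts, q):
    """Plug-in Hill number of order q from haplotype counts."""
    total = sum(counts)
    p = [c / total for c in counts if c > 0]
    if q == 1:
        # Limit q -> 1: exponential of the Shannon entropy
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    # General case: (sum p_i^q)^(1/(1-q)); q = 0 gives the richness,
    # q = 2 gives 1 / sum(p_i^2)
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))

# Hypothetical haplotype counts, for illustration only
counts = [50, 30, 15, 5]
for order in (0, 1, 2):
    print(order, hill_number(counts, order))
```

For a perfectly even population all orders coincide with the number of haplotypes; the more uneven the frequencies, the faster the values decrease with the order.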

## 2 DIVERSITY INDICES AND INFERENCE

**Shannon entropy**

The Shannon entropy (S) was originally developed in the domain of information exchange (Shannon, 1948), and is related to the transmission capacity of a communication channel. It measures the average unpredictability, or lack of information, contained in a set of items in terms of its alphabet. It soon found its place in ecology (Hutcheson, 1970) and later in virology (Korber et al., 1994). In genetics, S is used with two different approaches. The first is an analysis of the diversity of each sequence position (the columns of a multiple alignment), of either nucleotide or amino acid sequences, where the alphabet size (4 or 20, respectively) is known (Korber et al., 1994). Alternatively, the analysis may be by genomes or haplotypes (the rows of the multiple alignment), where the alphabet size is unknown (Pawlotsky et al., 1998), as is the case in ecology. In this work we use the second approach as a measure of the global quasispecies complexity. In this context it is a function of the number of haplotypes in the viral population and their relative frequencies, and its maximum likelihood estimator (MLE) is:

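$$\hat{S} = -\sum_{i=1}^{h} \hat{p}_i \ln \hat{p}_i, \qquad \hat{p}_i = \frac{n_i}{N} \tag{I}$$

with $\hat{p}_i$ the MLE of the relative frequency of the i-th haplotype, $n_i$ the observed count of the i-th haplotype, h the number of observed haplotypes, and N the sample size.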

It is known that the Shannon entropy as a function of proportion estimates is negatively biased (Magurran, 2010), and Hutcheson (1970) provided an approximation to the bias by applying a Taylor series expansion:

$$E(\hat{S}) \approx S - \frac{H - 1}{2N} \tag{II}$$

where S is the exact value, and H is the estimated number of haplotypes in the population. The observed number of haplotypes is also a negatively biased estimate of H, which may be corrected, among other methods, by the Chao 1 method (Chao, 1984):

$$\hat{H} = h + \frac{f_1^2}{2 f_2} \tag{III}$$

where h is the number of haplotypes observed in the sample, and $f_1$ and $f_2$ are the numbers of singletons (haplotypes with a single copy) and doubletons (haplotypes with two copies) in the sample.
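The estimators described so far can be sketched as follows, assuming the plug-in entropy $-\sum \hat{p}_i \ln \hat{p}_i$, a first-order bias term $(H-1)/(2N)$, and the classic Chao 1 form $h + f_1^2/(2 f_2)$. The function names and the example counts are hypothetical, and the fallback used when there are no doubletons is an assumption of this sketch rather than something stated in the text.

```python
import math

def shannon_mle(counts):
    """Plug-in (MLE) Shannon entropy from haplotype counts."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

def chao1(counts):
    """Chao 1 estimate of the number of haplotypes in the population."""
    h = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)   # singletons
    f2 = sum(1 for c in counts if c == 2)   # doubletons
    if f2 == 0:
        # Assumption: bias-corrected variant when no doubletons are observed
        return h + f1 * (f1 - 1) / 2.0
    return h + f1 * f1 / (2.0 * f2)

def shannon_bias_corrected(counts):
    """Plug-in entropy plus the first-order bias term (H - 1) / (2N)."""
    n = sum(counts)
    return shannon_mle(counts) + (chao1(counts) - 1.0) / (2.0 * n)

# Hypothetical sample of N = 100 reads in 8 haplotypes
counts = [60, 20, 10, 4, 2, 2, 1, 1]
print(shannon_mle(counts), chao1(counts), shannon_bias_corrected(counts))
```

With these counts there are two singletons and two doubletons, so the Chao 1 estimate adds one unseen haplotype to the eight observed.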

The delta method and the asymptotic normality of the multinomial distribution provide the means for statistical inference, as established by Hutcheson (1970) for the Shannon entropy. Salicrú and colleagues (Salicru et al., 1993; Pardo et al., 1997) provided an elegant generalization of this inference to a wide family of entropy indices. The estimated variance of the Shannon entropy is given by:

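$$\hat{\sigma}^2_{S} = \frac{1}{N}\left[\sum_{i=1}^{h} \hat{p}_i \ln^2 \hat{p}_i - \left(\sum_{i=1}^{h} \hat{p}_i \ln \hat{p}_i\right)^2\right] \tag{IV}$$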

With this estimated variance we may compute confidence intervals, or test the equality of the Shannon entropies of two samples by means of the Z-test:

$$H_0: S(P_1) = S(P_2), \qquad H_1: S(P_1) \neq S(P_2), \qquad Z = \frac{\hat{S}(P_1) - \hat{S}(P_2)}{\sqrt{\hat{\sigma}^2_{S_1} + \hat{\sigma}^2_{S_2}}} \tag{V}$$

where P1 and P2 are the vectors of observed haplotype frequencies of the two samples to compare, H0 is the null hypothesis, H1 is the alternative hypothesis, and Z is a statistic asymptotically distributed as the standard normal, with each variance estimated by (IV) using the respective sample size, N1 or N2.
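A minimal sketch of this inference follows, assuming the delta-method variance $\frac{1}{N}[\sum \hat{p}_i \ln^2 \hat{p}_i - (\sum \hat{p}_i \ln \hat{p}_i)^2]$ and a two-sided test. The function names and the counts are hypothetical. Note that the estimated variance is zero when all haplotypes have exactly the same frequency, in which case the statistic is undefined.

```python
import math

def shannon_and_variance(counts):
    """Plug-in Shannon entropy and its delta-method variance estimate."""
    n = sum(counts)
    p = [c / n for c in counts if c > 0]
    s = -sum(pi * math.log(pi) for pi in p)
    second_moment = sum(pi * math.log(pi) ** 2 for pi in p)
    # (sum p ln p)^2 equals s^2 because s = -sum p ln p
    return s, (second_moment - s ** 2) / n

def z_test(counts1, counts2):
    """Two-sided Z-test for equality of two Shannon entropies."""
    s1, v1 = shannon_and_variance(counts1)
    s2, v2 = shannon_and_variance(counts2)
    z = (s1 - s2) / math.sqrt(v1 + v2)
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical samples of 100 reads each
sample_a = [70, 20, 10]
sample_b = [34, 33, 33]
z, p_value = z_test(sample_a, sample_b)

# Approximate 95% confidence interval for the entropy of sample_a
s, var = shannon_and_variance(sample_a)
ci = (s - 1.96 * math.sqrt(var), s + 1.96 * math.sqrt(var))
```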

**The normalized Shannon entropy**

In the ecology literature the Shannon entropy is normalized to the natural logarithm of the estimated number of species (the size of the alphabet), so that a population where all species are equally represented corresponds to a maximum normalized entropy of 1, whereas a population with a single species is a population of minimum entropy, with Sn = 0. The proof is as follows:

$$S_{max} = -\sum_{i=1}^{h} \frac{1}{h} \ln \frac{1}{h} = \ln h$$

where h is the number of species, and 1/h is the frequency of each species in the population.

$$\hat{S}_n = \frac{\hat{S}}{\ln h} \tag{VI}$$

Then Sn varies from 0 to 1 and is a measure of evenness of the population.

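By Taylor series expansion we find the estimated variance of Sn, assuming h to be a constant, as (Salicru et al., 1993; Pardo et al., 1997):

$$\hat{\sigma}^2_{S_n} = \frac{1}{N \ln^2 h}\left[\sum_{i=1}^{h} \hat{p}_i \ln^2 \hat{p}_i - \left(\sum_{i=1}^{h} \hat{p}_i \ln \hat{p}_i\right)^2\right] \tag{VII}$$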

Sn is also asymptotically normal and the same inference as for S may be used.

Sn is not sensitive to heterogeneity. For example, a viral population made up of two haplotypes at 50% each, with just one substitution between them, has the same Sn as a population with 100 haplotypes at 1% each and a mean of 10 differences among them.
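The insensitivity of Sn to heterogeneity can be checked numerically. The sketch below assumes Sn is the plug-in Shannon entropy divided by the logarithm of the number of observed haplotypes; the function name is hypothetical, and sequence differences do not enter the computation at all.

```python
import math

def normalized_shannon(counts):
    """Plug-in Shannon entropy divided by ln(h), h = observed haplotypes."""
    n = sum(counts)
    p = [c / n for c in counts if c > 0]
    h = len(p)
    if h < 2:
        return 0.0  # a single haplotype: minimum entropy
    s = -sum(pi * math.log(pi) for pi in p)
    return s / math.log(h)

two_haplotypes = [50, 50]        # two haplotypes at 50% each
hundred_haplotypes = [1] * 100   # one hundred haplotypes at 1% each
print(normalized_shannon(two_haplotypes))
print(normalized_shannon(hundred_haplotypes))
```

Both populations reach the maximum evenness (up to floating-point rounding), regardless of how many substitutions separate their haplotypes.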

**The mutation frequency**

Mf is a heterogeneity measure that takes as reference either the most represented haplotype, also known as the dominant or master sequence (Ramirez et al., 2013), or the consensus sequence (Cabot et al., 2000), and computes the number of observed differences of each individual genome with respect to this reference. The value is normalized to the total number of nucleotides sequenced. The higher its value, the more dissimilar the individuals in the population are with respect to the reference.

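$$Mf = \frac{1}{N\,l}\sum_{i=1}^{h} n_i\, m_{1i} = \frac{1}{l}\sum_{i=1}^{h} \hat{p}_i\, m_{1i} = \frac{1}{l}\langle \hat{P} \,|\, M \rangle \tag{VIII}$$

where l is the amplicon sequence length, N the sample size (number of sequences), $n_i$ the observed count of the i-th haplotype, and $m_{1i}$ the number of substitutions between the i-th haplotype and the master sequence, which without loss of generality is taken as the first. Vector bracket notation is used in the right-hand term, with $\hat{P}$ the vector of haplotype frequencies and M the vector of distances $m_{1i}$.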

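By the delta method and the asymptotic normality of the multinomial distribution we find the variance of Mf, which may be used to obtain confidence intervals or to compare the mutation frequencies of two samples by means of the Z-test.

$$\hat{\sigma}^2_{Mf} = \frac{1}{N\,l^2}\left[\sum_{i=1}^{h} \hat{p}_i\, m_{1i}^2 - \left(\sum_{i=1}^{h} \hat{p}_i\, m_{1i}\right)^2\right] \tag{IX}$$

$$H_0: Mf(P_1, M_1) = Mf(P_2, M_2), \qquad H_1: Mf(P_1, M_1) \neq Mf(P_2, M_2), \qquad Z = \frac{\widehat{Mf}_1 - \widehat{Mf}_2}{\sqrt{\hat{\sigma}^2_{Mf_1} + \hat{\sigma}^2_{Mf_2}}} \tag{X}$$

where P1 and P2 are the vectors of observed haplotype frequencies of the two samples to compare, M1 and M2 the vectors of Hamming distances of each haplotype with respect to the reference in the respective sample, H0 is the null hypothesis, H1 the alternative hypothesis, and Z a statistic asymptotically distributed as the standard normal.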
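The mutation frequency and its variance can be sketched as follows, taking the most abundant haplotype as the reference and assuming the delta-method variance $\frac{1}{N l^2}[\sum \hat{p}_i m_{1i}^2 - (\sum \hat{p}_i m_{1i})^2]$. The function names, the toy sequences and the counts are hypothetical, and only substitutions between aligned sequences of equal length are counted.

```python
def hamming(a, b):
    """Number of mismatching positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

def mutation_frequency(haplotypes, counts):
    """Return (Mf, estimated variance) relative to the dominant haplotype."""
    n = sum(counts)
    length = len(haplotypes[0])
    master = haplotypes[counts.index(max(counts))]
    m = [hamming(master, hpl) for hpl in haplotypes]
    p = [c / n for c in counts]
    mean_m = sum(pi * mi for pi, mi in zip(p, m))        # <P|M>
    mean_m2 = sum(pi * mi * mi for pi, mi in zip(p, m))
    mf = mean_m / length
    var = (mean_m2 - mean_m ** 2) / (n * length ** 2)
    return mf, var

# Toy alignment: three haplotypes of length 10 observed in 100 reads
haplotypes = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAT"]
counts = [80, 15, 5]
mf, var = mutation_frequency(haplotypes, counts)
```

Two samples can then be compared with the same Z statistic used for the entropy, replacing the entropies and their variances by the values returned here.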

**The t-test**

When the sample size is small, as in the CCSS case, the Welch t-test (Hutcheson, 1970) should be preferred to the Z-test.

(XI)

where X stands for any of S, Sn or Mf, and σX is obtained from (IV), (VII) or (IX).

**The nucleotide diversity**

The nucleotide diversity (Nei, 1987) considers the differences between each pair of genomes in the population and is a more general measure of heterogeneity than the mutation frequency.

$$\hat{\pi} = \frac{1}{l}\sum_{i=1}^{h}\sum_{j=1}^{h} \hat{p}_i\, \hat{p}_j\, m_{ij}, \qquad \hat{p}_i = \frac{n_i}{N} \tag{XII}$$

where $m_{ij}$ is the number of differences between the i-th and the j-th haplotype, and the other variables are as in (VIII). The MLE is biased, and the bias is a function of the sample size.

$$E(\hat{\pi}) = \frac{N-1}{N}\,\pi \tag{XIII}$$

The variance of π may be found in (Nei, 1987). As a quadratic form, this index is asymptotically distributed as a sum of a normal and a linear combination of Chi-square distributions (Dik & Gunst, 1985), so neither the Z-test nor the t-test applies. A resampling test is a good choice in this case.
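The text does not specify which resampling test to use; the sketch below shows one possible choice, a permutation test that shuffles reads between the two samples. It computes the nucleotide diversity as the mean number of differences per site over all pairs of distinct reads, which corresponds to the plug-in estimate multiplied by N/(N-1). All function names and parameters are hypothetical, and the quadratic cost per permutation makes it suitable only for small samples.

```python
import random

def hamming(a, b):
    """Number of mismatching positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))

def nucleotide_diversity(reads):
    """Mean pairwise differences per site over all pairs of distinct reads."""
    n = len(reads)
    if n < 2:
        raise ValueError("at least two reads are required")
    length = len(reads[0])
    total = sum(hamming(reads[i], reads[j])
                for i in range(n) for j in range(i + 1, n))
    return 2.0 * total / (n * (n - 1) * length)

def permutation_test(reads1, reads2, n_perm=1000, seed=0):
    """Two-sided permutation p-value for a difference in nucleotide diversity."""
    rng = random.Random(seed)
    observed = abs(nucleotide_diversity(reads1) - nucleotide_diversity(reads2))
    pooled = list(reads1) + list(reads2)
    n1 = len(reads1)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign reads to the two samples at random
        diff = abs(nucleotide_diversity(pooled[:n1])
                   - nucleotide_diversity(pooled[n1:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive
    return (hits + 1) / (n_perm + 1)
```

The null hypothesis tested here is that the reads of both samples are exchangeable, that is, drawn from the same population; a bootstrap confidence interval for the difference would be an alternative design.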

REFERENCES

Cabot,B. et al. (2000) Nucleotide and amino acid complexity of hepatitis C virus quasispecies in serum and liver. J Virol 74, 805-811.

Chao,A. (1984) Nonparametric estimation of the number of classes in a population. Scand. J. Statist., 11, 265-270.

Chao,A. et al. (2010) Phylogenetic diversity measures based on Hill numbers. Philos Trans R Soc Lond B Biol Sci 365, 3599-3609.

Chao,A. and Jost,L. (2012) Diversity measures. In Hastings,A. and Gross,L. (eds.), Encyclopedia of Theoretical Ecology. University of California Press, Berkeley.

Dik,J. and Gunst,M. (1985) The distribution of general quadratic forms in normal distributions. Statistica Neerlandica, 39, 14-26.

Hill,M. (1973) Diversity and evenness: a unifying notation and its consequences. Ecology 54, 427-432.

Hutcheson,K. (1970) A test comparing diversities based on the Shannon formula. J Theor Biol 29, 151-154.

Jost,L. (2006) Entropy and diversity. Oikos 113, 363-375.

Korber,B.T. et al. (1994) Genetic differences between blood- and brain-derived viral sequences from human immunodeficiency virus type 1-infected patients: evidence of conserved elements in the V3 region of the envelope protein of brain-derived sequences. J. Virol., 68, 7467-7481.

Magurran,A. (2004) Measuring Biological Diversity, Oxford: Blackwell Publishing.

Magurran,A. and McGill,B.J. (eds) (2010). Biological diversity: frontiers in measurement and assessment. Oxford, UK.: Oxford University Press.

Nei,M. (1987) Molecular Evolutionary Genetics. New York: Columbia University Press.

Pardo,L. et al. (1997) Large sample behavior of entropy measures when parameters are estimated. Commun Statist Theory Meth 26, 483-501.

Pawlotsky,J.M. et al. (1998) Interferon resistance of hepatitis C virus genotype 1b: relationship to nonstructural 5A gene quasispecies mutations. J Virol 72, 2795-2805.

Perales,C. et al. (2012) The impact of quasispecies dynamics on the use of therapeutics. Trends Microbiol 20, 595-603.

Ramirez,C. et al. (2013) A comparative study of ultra-deep pyrosequencing and cloning to quantitatively analyze the viral quasispecies using hepatitis B virus infection as a model. Antiviral Res.

Salicru,M. et al. (1993) Asymptotic distributions of (h,Fi)-entropies. Comm Statist Theory Meth 22, 2015-2031.

Shannon,C. (1948) A mathematical theory of communication. Bell System Technical J 27, 379-423.

Nov. 2013