MALLAT Léo, Bachelor's degree internship report, 2007

Host laboratory: U550 HGID Necker; internship supervisor: A. Alcais

Family-based association studies: Is a case-control analysis using the parents of the familial sample a valid replication study?

Introduction

Many common diseases are thought to result from interactions between genetic and environmental factors. Such diseases include cancer and cardiovascular, neuropsychiatric and infectious diseases. Genetic epidemiology aims at identifying the genes (and the polymorphisms of these genes) that have a significant influence on complex diseases, and the possible interactions of these genes with relevant environmental factors. The primary tools of genetic epidemiology are statistical (Abel and Dessein, 1998; Khoury et al., 1993; Lander and Schork, 1994) and combine epidemiological and genetic information, in particular genetic markers. Indeed, the recent establishment of a genetic map of the human genome based on highly polymorphic markers (Dib et al., 1996) and the growing availability of single nucleotide polymorphisms (SNPs, of which several million are now known across the human genome) have greatly increased the amount of genetic information available to genetic epidemiology studies. Genetic epidemiology includes two main kinds of approaches: linkage and association studies. Linkage studies are used to locate chromosomal regions that potentially contain the gene(s) of interest. The general principle of linkage analysis is to search for chromosomal regions that segregate nonrandomly with the phenotype of interest within families. Association studies are used to investigate the role of polymorphisms (or alleles) of genes, either by focusing on a few candidate regions or by using a genome-wide search.

Classical association studies are population-based case-control studies that compare the frequency of a given marker allele A between unrelated affected (cases) and unaffected (controls) subjects (Khoury et al., 1993; Lander and Schork, 1994). The rationale for association studies is that the tested allele is either the causal allele itself (later denoted as D) or in linkage disequilibrium with D. Linkage disequilibrium requires two conditions: (i) genetic linkage between the marker locus and the disease locus (generally tight linkage; in particular, the two loci can be within the same gene) and (ii) preferential association of allele A with allele D, i.e., the A-D haplotype is more frequent than expected from the respective frequencies of A and D (e.g., many present cases are due to an ancestral allele D, and the ancestor who transmitted this allele carried the A-D haplotype). It should be noted that linkage alone, whatever its intensity, is not sufficient to lead to association and, therefore, that absence of association does not exclude linkage. One major problem with the case-control design is that both genotype and haplotype frequencies can vary between geographic or ethnic populations. If the sample of cases and the sample of controls are not well matched for ethnicity, false positive associations can occur because of the confounding effects of cryptic population stratification (Marchini, 2004). To circumvent the difficulties in selecting an appropriate control group, it is possible to use familial controls and to perform a family-based association study.

Several family-based association designs have been proposed. The most commonly used is the Transmission Disequilibrium Test (TDT) (Spielman et al. 1993). Its general principle is to search for a distortion of the transmission of alleles from heterozygous parents to affected offspring. This strategy avoids spurious gene-phenotype associations due to inappropriately chosen controls or population substructures. Therefore, the TDT really tests for linkage disequilibrium, i.e. linkage between the marker and disease locus and allelic association between A and D, so that its null hypothesis is no linkage or no allelic association (Lander and Kruglyak 1995; Risch and Merikangas 1995). The TDT design is based on familial trios that consist of one father, one mother and one affected offspring. In the context of a binary trait (e.g. affected/unaffected), the criterion for a trio to be included in the analysis is that it contains an affected child. As mentioned above, complex diseases are almost by definition common diseases, and it is therefore quite usual that a certain proportion of the parents are affected. As the parents are always genotyped and their clinical status is often known, they could be used as a sample of unrelated cases and controls. For that reason, it could be interesting to perform a case-control analysis on the parents of a TDT sample in order to replicate the findings obtained with the classical TDT strategy.

However, it is not clear whether this case-control study using the sample of parents is really independent of the TDT analysis performed on a sample consisting of these parents and their affected children. Stated differently, it is unknown whether the case-control analysis among the parents can be considered as a valid replication study. The goal of this work was therefore to investigate this issue through a large simulation study.

Material and Methods

Case-Control analyses

Case-control studies are classical tools in genetic epidemiology to test the association between a marker and a disease (ref). For a given SNP with two alleles A and B, they are based on the comparison of the distribution of genotypes (genotypic test) between patients (cases) and healthy individuals (controls). In order to compare the distribution of the genotypes, cases and controls are divided into three categories, each corresponding to one genotype (AA, AB or BB), as shown in Table 1.

Table 1: Distribution of the genotypes according to the affected status

The two distributions are compared using a chi-square test with two degrees of freedom (df) which will be denoted as χ²cc

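$$\chi^2_{cc} = \sum_{i}\sum_{j} \frac{(n_{ij} - c_{ij})^2}{c_{ij}}, \quad \text{with } c_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n}$$

where nij is the number of subjects in row i (cases or controls) and column j (genotype) of Table 1, ni· and n·j are the corresponding row and column totals, n is the total number of subjects, and cij is the count expected under the null hypothesis.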

Once χ²cc has been computed, the test consists in comparing its value to a tabulated distribution. A value greater than a given threshold (which is a function of the type I error chosen by the investigators) allows us to reject the null hypothesis of no association between the marker and the disease at a given significance level (p-value). For example, for a χ²cc with two df, a threshold of 5.99 corresponds to a type I error of 0.05, and a χ² greater than this threshold allows us to conclude that there is an association with a p-value < 0.05.
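The genotypic test described above can be illustrated with a minimal sketch in Python. It is not the code used in this study: the function names and the genotype counts are hypothetical and were chosen only for illustration, and the p-value helper relies on the closed forms of the chi-square tail that exist for one and two df.

```python
import math

def genotypic_chi2(cases, controls):
    """Pearson chi-square on a 2 x k table of genotype counts.

    cases, controls: counts given in the same genotype order, e.g. (AA, AB, BB).
    Assumes every genotype column has a non-zero total.
    Returns (chi2, df).
    """
    table = [list(cases), list(controls)]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, len(col_totals) - 1

def chi2_pvalue(chi2, df):
    """Upper-tail probability of a chi-square, closed forms for df = 1 and 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(chi2 / 2.0))
    if df == 2:
        return math.exp(-chi2 / 2.0)
    raise ValueError("this sketch only handles df = 1 or 2")

# Hypothetical genotype counts (AA, AB, BB), invented for illustration
cases = (30, 40, 10)
controls = (20, 50, 30)
chi2, df = genotypic_chi2(cases, controls)
p_value = chi2_pvalue(chi2, df)
print(f"chi2 = {chi2:.3f}, df = {df}, p = {p_value:.4f}")
```

Merging two genotype columns before calling the function (for instance AB and BB for a dominant allele B) gives the one-df version of the same test.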

The Transmission Disequilibrium Test (TDT)

Unlike case-control samples, which consist of independent cases and controls, the TDT sample consists of trios with two parents and one affected child. The general principle of this test is to search for a distortion of the transmission of alleles from the parents to the affected offspring. Under the null hypothesis of no linkage or no association between the marker and the disease locus, heterozygous AB parents transmit allele A or allele B to their affected children with the same probability of 0.5. As shown in Table 2, the non-transmitted parental alleles can be considered as the controls of the transmitted alleles.

Table 2: Contingency table of the TDT

                        Non-transmitted allele
Transmitted allele      A         B
A                       n1        n3
B                       n2        n4

As an example, n1 is the number of AA parents, who transmitted one allele A to their affected child and did not transmit their other allele A, while n2 is the number of AB parents who transmitted allele B to their offspring and did not transmit allele A. The appropriate test for such a contingency table is a matched test known as the McNemar χ². It is based only on the numbers of informative transmissions, n2 and n3, and is asymptotically distributed as a χ² with one df. This χ² will be denoted as χ²tdt.

$$\chi^2_{tdt} = \frac{(n_2 - n_3)^2}{n_2 + n_3}$$
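As an illustration, the McNemar statistic of the TDT can be sketched in a few lines of Python. The function name and the transmission counts below are hypothetical. Only the two informative counts from heterozygous parents enter the computation.

```python
import math

def tdt_chi2(n2, n3):
    """McNemar chi-square (one df) from the informative transmissions.

    n2: AB parents who transmitted B (A not transmitted)
    n3: AB parents who transmitted A (B not transmitted)
    """
    informative = n2 + n3
    if informative == 0:
        raise ValueError("no heterozygous parent: the TDT is undefined")
    return (n2 - n3) ** 2 / informative

# Hypothetical counts, invented for illustration
chi2_tdt = tdt_chi2(n2=60, n3=90)
p_value = math.erfc(math.sqrt(chi2_tdt / 2.0))  # upper tail of a one-df chi-square
print(f"chi2_tdt = {chi2_tdt:.2f}, p = {p_value:.4f}")
```

Transmissions from homozygous parents (n1 and n4) are ignored because they carry no information on preferential transmission.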

Simulation of the families

We generated trios with both a disease locus and a marker locus. The disease locus was characterized by two alleles D and d of population frequencies q and 1-q, respectively. The effect of allele D in the population was modeled by three penetrances fdd, fdD and fDD, which are the probabilities of presenting the disease for a subject with a dd, dD or DD genotype, respectively. The penetrances were parametrized in terms of genotypic relative risks (GRRs), γ1=fdD/fdd and γ2=fDD/fdd (Risch and Merikangas, 1996). These two parameters represent the relative risk of being affected for a person carrying one (dD genotype) or two (DD genotype) copies of allele D as compared to a person carrying no copy (dd genotype). A multiplicative model corresponds to γ2 =(γ1)², while a dominant model is characterized by γ1 =γ2 ≥1 and a recessive model by γ1 =1 and γ2 ≥1. Table 3 provides the complete characteristics of the two multiplicative (X1-X2), dominant (D1-D2) and recessive (R1-R2) models simulated. X1, D1, and R1 correspond to a modest genetic effect, while X2, D2, and R2 correspond to a strong genetic effect.

Table 3: Parameters of simulated multiplicative, recessive and dominant models

An additional parameter to consider was the prevalence of the disease, i.e. the proportion of affected people in a population (denoted as F). We considered a common disease with a higher prevalence in the parents than in the children and set F=0.08 in the parents and F’=0.02 in the children. Once F, q, γ1, and γ2 have been set, fdd and the two other penetrances can be derived from the following relationship:

F=[q² γ2 +2q(1-q)γ1+(1-q)²]fdd
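The relationship above can be inverted to obtain the three penetrances, as in the following Python sketch. The function name and the numerical values of q, γ1 and γ2 are hypothetical (they are not the parameters of Table 3). The same function applies to the children by passing F’ instead of F.

```python
def penetrances(prevalence, q, gamma1, gamma2):
    """Return (fdd, fdD, fDD) from the prevalence, the frequency q of allele D
    and the genotypic relative risks gamma1 and gamma2."""
    weight = q ** 2 * gamma2 + 2 * q * (1 - q) * gamma1 + (1 - q) ** 2
    f_dd = prevalence / weight
    f_dD = gamma1 * f_dd
    f_DD = gamma2 * f_dd
    if f_DD > 1 or f_dD > 1:
        raise ValueError("parameters imply a penetrance greater than 1")
    return f_dd, f_dD, f_DD

# Hypothetical multiplicative model (gamma2 = gamma1 squared), parental prevalence 0.08
f_dd, f_dD, f_DD = penetrances(prevalence=0.08, q=0.2, gamma1=2.0, gamma2=4.0)
print(f"fdd = {f_dd:.4f}, fdD = {f_dD:.4f}, fDD = {f_DD:.4f}")
```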

For the marker locus, we considered two alleles A and B, of respective population frequencies P and (1-P). Under the null hypothesis of no linkage disequilibrium, the marker and the disease locus were unlinked and we fixed P=0.5. Under the alternative hypothesis we considered a perfect linkage disequilibrium between the marker and the disease locus, i.e., the marker and the disease locus were completely confounded.

Simulations followed a three-step design: (i) random generation of the parental genotypes at the marker and the disease locus with probabilities depending on q and P, (ii) random generation of the child genotypes following Mendelian inheritance of the parental alleles, (iii) determination of the clinical status of the children and their parents. The clinical status was a binary variable (affected/unaffected), randomly drawn according to the genotype of the individual and the corresponding penetrance. The trio was retained only if the child was affected. Each replicate consisted of 400 trios with an affected child. For each replicate and all the models, we computed the classical χ²tdt and the χ²cc on the parents only (i.e. cases corresponded to affected parents and controls to unaffected parents). For the dominant and recessive models, the TDT and genotypic tests could be extended and were performed after an appropriate encoding of the genotypes. For example, in the case-control study, if we consider a dominant allele B, then the AB and BB genotypes are merged, Table 1 becomes a two-by-two contingency table and χ²cc becomes a test with one df. Finally, we repeated the simulation process until the desired number of replicates was reached under each model.
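The three-step design can be sketched as follows in Python for the null hypothesis (unlinked marker and disease loci). This is only an illustration under stated assumptions, not the program used in this study: parental genotypes are assumed to follow Hardy-Weinberg proportions, all function and variable names are hypothetical, and the penetrances are invented values rather than those of Table 3.

```python
import random

def draw_genotype(freq, rng):
    """Copies (0, 1 or 2) of an allele of frequency freq.
    Assumption: Hardy-Weinberg proportions (two independent allele draws)."""
    return int(rng.random() < freq) + int(rng.random() < freq)

def transmit(genotype, rng):
    """Mendelian transmission: 1 if the parent passes on the allele, else 0."""
    if genotype == 1:                 # heterozygote: each allele with probability 0.5
        return int(rng.random() < 0.5)
    return genotype // 2              # homozygote: 0 -> 0, 2 -> 1

def is_affected(disease_genotype, pen, rng):
    """Binary status drawn from the penetrance of the genotype (dd, dD, DD)."""
    return rng.random() < pen[disease_genotype]

def simulate_trio(q, P, pen_parents, pen_child, rng):
    """Steps (i) to (iii) under H0; repeats until the child is affected."""
    while True:
        # (i) parental genotypes: copies of D (disease locus) and of A (marker locus)
        parents = [{"disease": draw_genotype(q, rng), "marker": draw_genotype(P, rng)}
                   for _ in range(2)]
        # (ii) independent Mendelian transmission at the two unlinked loci
        for parent in parents:
            parent["sent_disease"] = transmit(parent["disease"], rng)
            parent["sent_marker"] = transmit(parent["marker"], rng)
        child = {"disease": sum(p["sent_disease"] for p in parents),
                 "marker": sum(p["sent_marker"] for p in parents)}
        # (iii) clinical status of each member of the trio
        for parent in parents:
            parent["affected"] = is_affected(parent["disease"], pen_parents, rng)
        child["affected"] = is_affected(child["disease"], pen_child, rng)
        if child["affected"]:         # the trio is retained only if the child is affected
            return parents, child

rng = random.Random(2007)
pen_parents = (0.05, 0.10, 0.20)      # hypothetical (fdd, fdD, fDD) for the parents
pen_child = (0.0125, 0.025, 0.05)     # hypothetical penetrances for the children
replicate = [simulate_trio(0.2, 0.5, pen_parents, pen_child, rng) for _ in range(400)]
n_affected_parents = sum(p["affected"] for parents, _ in replicate for p in parents)
```

Under the alternative hypothesis considered here, the marker and the disease locus are confounded, so the marker genotype would simply be taken equal to the disease genotype instead of being drawn separately.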

Simulations under H0

Under the null hypothesis of no linkage disequilibrium, the goal was to investigate whether or not the results of χ²cc depended on those of χ²tdt. In a first step, we studied the distribution of χ²cc according to two levels of χ²tdt (≥ or < a given threshold). We generated enough replicates to obtain 100,000 χ²tdt values greater than or equal to a threshold of 3.84 (the classical χ² threshold corresponding to a type I error of 0.05). To obtain 100,000 χ²tdt ≥ 3.84 under H0, we needed to generate about 2,000,000 replicates. We studied the proportions of χ²cc exceeding different thresholds (corresponding to type I errors of 0.05, 0.01 and 0.001) according to the value of χ²tdt (≥ or < 3.84). In a second step, we computed the correlations between χ²cc and χ²tdt and tested them using the nonparametric Spearman correlation coefficient.
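The first step can be sketched as a simple tabulation in Python, assuming that each replicate has been summarized as a pair (χ²tdt, χ²cc). The function name, the group labels and the toy values are hypothetical.

```python
def exceedance_by_tdt(pairs, cc_threshold, tdt_threshold=3.84):
    """Proportion of replicates with chi2_cc >= cc_threshold, computed
    separately for chi2_tdt >= tdt_threshold and chi2_tdt < tdt_threshold.

    pairs: iterable of (chi2_tdt, chi2_cc), one pair per replicate.
    """
    totals = {"tdt_high": 0, "tdt_low": 0}
    hits = {"tdt_high": 0, "tdt_low": 0}
    for chi2_tdt, chi2_cc in pairs:
        group = "tdt_high" if chi2_tdt >= tdt_threshold else "tdt_low"
        totals[group] += 1
        if chi2_cc >= cc_threshold:
            hits[group] += 1
    # None marks a group that contains no replicate
    return {g: (hits[g] / totals[g] if totals[g] else None) for g in totals}

# Toy values, invented for illustration only
toy_pairs = [(4.0, 6.5), (5.0, 1.0), (1.0, 7.0), (0.5, 0.2), (2.0, 0.1)]
proportions = exceedance_by_tdt(toy_pairs, cc_threshold=5.99)
```

If the two tests are independent under H0, both proportions are expected to be close to the nominal type I error associated with the χ²cc threshold.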

Simulations under H1

Under the alternative hypothesis, we investigated the power of χ²cc. The power was estimated by the proportion of replicates leading to a χ²cc greater than the threshold corresponding to a given type I error. For example, for a one-df test, the power of the parental case-control study at 5% was the proportion of replicates with χ²cc ≥ 3.84 (the corresponding threshold for the two-df genotypic test is 5.99). For all the models, 1,000 replicates were sufficient because the power of the tests was generally much greater than 20%.

We investigated the power of the parental case-control study in different situations including various prevalence values. Moreover, we tried to evaluate the impact of the TDT ascertainment (i.e. parents selected through their affected child) on the parental case-control analysis. Therefore, we conducted additional analyses using individuals that were not selected on the basis of the affected status of their child. For each replicate we recorded the Na and Nu numbers of affected and unaffected parents, respectively. An additional parental sample without children was then generated using the same procedure as described earlier without any selection, until Na affected and Nu unaffected individuals were obtained. Genotypic case-control analyses were performed using this new “unselected” sample. We also compared the power of the usual case-control analysis (“unselected” sample) and the TDT-selected case-control analysis by computing the asymptotic relative efficiency (ARE) as proposed by Pitman in 1948.

Results

Under H0

Under the null hypothesis of no linkage disequilibrium, parental case-control analyses are presented according to the χ²tdt value (≥ or < 3.84). The mean number of affected parents varied from 65 to 78 depending on the genetic model. Table 4 shows the observed proportions of χ²cc exceeding the thresholds for type I errors (α) of 0.05, 0.01 and 0.001 for all the models. The asymptotic type I errors and their prediction intervals are also given. It is noteworthy that, among the replicates with χ²tdt ≥ 3.84, all the observed proportions fell within the prediction interval. Among the replicates with χ²tdt < 3.84, all but five observed values fell within the prediction interval. Overall, under any model, the distribution of χ²cc is compatible with the asymptotic distribution of a χ², whatever the outcome of χ²tdt.

Table 4: Distribution of false positive parental case-control analyses according to the results of the TDT

*: Prediction interval = α ± 1.96 [α(1-α)/n]^(1/2), with n=100,000 (χ²tdt ≥ 3.84) or n=1,900,000 (χ²tdt < 3.84)
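As a worked example of the footnote formula, the following Python sketch computes the prediction interval for α = 0.05 and n = 100,000. The function name is hypothetical.

```python
import math

def prediction_interval(alpha, n):
    """Bounds alpha +/- 1.96 * sqrt(alpha * (1 - alpha) / n)."""
    half_width = 1.96 * math.sqrt(alpha * (1 - alpha) / n)
    return alpha - half_width, alpha + half_width

low, high = prediction_interval(alpha=0.05, n=100_000)
print(f"[{low:.5f}, {high:.5f}]")
```

For these values the half-width is about 0.00135, so an observed proportion of false positives between roughly 0.0486 and 0.0514 is compatible with a nominal type I error of 0.05.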

Moreover, we did not observe any significant Spearman correlation coefficient between the χ²tdt and the χ²cc values. The correlation coefficients varied from -0.06 to 0.03 and were never significant at the 0.05 level. These results were identical when the correlations were computed separately among false positive TDT and negative TDT replicates, and were also similar under all the genetic models (data not shown). In conclusion, whatever the genetic model, these data indicate that the two tests can be considered as independent under the null hypothesis of no linkage disequilibrium.

Under H1

Under the hypothesis of perfect linkage disequilibrium between the marker and the disease locus, the power of the parental case-control analysis is reported in Table 5. The study is limited to the genetic models with a modest effect because, in the models with a strong genetic effect, the power was saturated at 100% even for a type I error of 10⁻³.

Table 5: Power of the parental case-control analysis and of the TDT

We observe that the power of the TDT is greater than that of the parental case-control analysis. This was expected because of the large difference in the number of affected people considered: 400 children for the TDT versus ~70 parents (for a parental prevalence of 0.08) for the parental case-control study. To confirm this explanation, we performed the same study using different parental prevalences. The power of the parental case-control study according to the prevalence of the disease is shown in Figure 1. Under the three models with a modest genetic effect, we observed that the power of the parental case-control studies increased with the number of affected parents and nearly reached the power of the TDT for a prevalence of 0.15. As expected, the power of the TDT remained constant whatever the prevalence.

Figure 1: Power of the TDT and of the parental case-control analysis as a function of the prevalence under the X1 (A), D1 (B) and R1 (C) models

Finally, in order to compare the power of a case-control analysis on the parents of a TDT replicate to the power of a ‘classical’ case-control analysis, tests were performed on independent samples of individuals that were not ascertained through an affected child. Therefore, for each TDT sample (from which we used the parents to perform the case-control study), we generated another parental sample, regardless of the status of the children. In order to avoid bias due to unequal numbers of affected individuals, we selected the same numbers of affected and unaffected parents in the two studies. Moreover, we used the Pitman method to compare the efficiency of both tests. The ARE was calculated as the ratio of the efficiency of the parental (TDT-selected) case-control analysis to the efficiency of the case-control analysis on an “unselected” sample. The mean and the standard deviation of χ²cc and χ²cc unselected, and the ARE, are shown in Table 6 for the six genetic models.

Table 6: Mean (standard deviation) of the parental case-control analysis (CC) and the case-control analysis on an unselected sample (CCu) and ARE.

On average, χ²cc is greater than χ²cc unselected. The case-control analysis on TDT-selected people is generally more efficient than the usual case-control analysis, especially under the models with a strong genetic effect. A possible explanation is that there are more genetic cases among the affected parents, since having an affected child increases the probability of carrying a genetic predisposition to the disease. Under the alternative hypothesis, we also computed the Spearman correlation coefficients between χ²cc and χ²tdt for each model, and none was significant.

Discussion

The results presented here are in favor of independence between the TDT and the parental case-control analysis, whatever the genetic model. Indeed, the χ²cc values were never influenced by the χ²tdt values. This property indicates that the use of a parental case-control study as a replication study for the TDT is a valid strategy. This is an important observation because a replication study is an absolute prerequisite to confirm the result of a primary analysis and subsequently decrease the number of false positive findings. The advantage of this use of the case-control study is that it is a replication 'for free', requiring no additional sampling or genotyping effort. In addition, we observed that the TDT ascertainment (i.e. the use of parents who have an affected child) improved the power of the case-control analysis as compared to a ‘classical’ case-control analysis using unselected individuals. We suppose that this improvement is due to an increased number of genetic cases among the affected parents, because they were recruited on the basis of one affected child. Therefore, the use of the TDT-derived parental sample is not only a robust strategy but also a powerful one, because such a sample is likely to be enriched in genetic cases as compared to a classical case-control study.