
SUPPLEMENTAL MATERIAL

Power in GWAS:

Lifting the curse of the clinical cut-off

Sophie van der Sluis1

Danielle Posthuma1

Michel G. Nivard2

Matthijs Verhage1

Conor V. Dolan2,3

1Complex Trait Genetics, Dept. Functional Genomics & Dept. Clinical Genetics, Center for Neurogenomics and Cognitive Research (CNCR), FALW-VUA, Neuroscience Campus Amsterdam, VU University medical center (VUmc). Email:

2 Biological Psychology, VU University Amsterdam.

3Department of Psychology, FMG, University of Amsterdam, Roeterstraat 15, 1018 WB Amsterdam, The Netherlands

The simulation

Conditional on the gene effects, we assumed a standard normally distributed (N(0,1)) underlying latent trait (i.e., the trait of interest that we wish to measure and for which we wish to identify the genetic background), and we randomly generated latent trait scores for Nsubj=5000 unrelated subjects. Ten causal variants were then simulated with a minor allele frequency (MAF) of .2 (genotype groups coded 1/2/3 in the phenotype-creating simulation, with 1 corresponding to the homozygous minor allele genotype), with effect sizes ranging from .2 to 2% of the variance in the latent trait score. Adding the gene effects to the standard normal conditional latent trait score resulted in an approximately normally distributed unconditional latent trait score with a mean of ~1 and a variance of ~1.09.
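
The data-generating step can be summarized in the following minimal Python/NumPy sketch (not the authors' original code; the seed, variable names, and the mapping of effect sizes to regression weights are illustrative assumptions):

```python
# Sketch of the latent-trait simulation: 5000 subjects, 10 causal variants with MAF=.2
# explaining .2-2% of the latent-trait variance (illustrative assumptions throughout).
import numpy as np

rng = np.random.default_rng(1)

n_subj, n_snp, maf = 5000, 10, 0.2
r2 = np.linspace(0.002, 0.02, n_snp)            # variance explained per causal variant

# minor-allele counts under Hardy-Weinberg proportions; the phenotype-creating step in the
# text uses a 1/2/3 coding with 1 = homozygous minor genotype, i.e. 3 - (minor-allele count)
minor_count = rng.binomial(2, maf, size=(n_subj, n_snp))
geno_123 = 3 - minor_count

# weights chosen so that each variant explains roughly r2 of the latent-trait variance:
# var(beta * genotype) = beta**2 * 2 * maf * (1 - maf)
beta = np.sqrt(r2 / (2 * maf * (1 - maf)))

# latent trait: standard-normal score conditional on the genes, plus the summed gene effects
theta = rng.standard_normal(n_subj) + geno_123 @ beta
```

The 0/1/2 minor-allele-count coding used in the regression analyses further below is available directly as `minor_count`.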

We then simulated 30 extreme items (Figure S1a) and 30 items that covered the entire phenotypic range (Figure S1b). The 30 extreme items mainly distinguish between cases and controls, i.e., between people with and without an extreme phenotype, while the 30 items that cover the entire phenotypic range also distinguish between subjects whose phenotype falls well within the normal range.

The reliability of these items, defined in the context of a latent factor model, ranged from 0.01 to 0.25 (corresponding to factor loadings ranging from .1 to .5, respectively). In terms of internal consistency (Cronbach's alpha), this resulted in a 30-item instrument (consisting of either 30 extreme items or 30 items covering the entire phenotypic range) with a reliability of .84, which is realistic for depression questionnaires (Beck, Steer & Garbin, 1988; Radloff, 1977, 1991) and for scales that measure behavioural problems like ADHD (e.g., for Conners' behavioural rating scales: Sparrow, 2010).

We simulated 3-point scale items (e.g., "0: does not apply", "1: applies somewhat", "2: certainly does apply"), with endorsement rates of the three answer categories depending on what is called "the difficulty of the item" in the context of Item Response Theory (IRT) modelling. In short, easy items (located at the left of the phenotypic scale, Figure S1b) are endorsed by almost everybody (i.e., most subjects score 2 on these items), while difficult items (or extreme items, at the right-hand side of the phenotypic scale, Figure S1a) are endorsed by only a small percentage of the general population (i.e., most subjects score 0 on these items). For the 30 extreme items, 95-98% of the subjects in a general population would score 0, 1.5-4% would score 1, and only .5-1% would score 2. For the 30 items covering the entire phenotypic range, the endorsement rates of the three categories varied widely. Figure S2 plots the endorsement rates for the 30 extreme items (Figure S2a) and the 30 items covering the entire phenotypic range (Figure S2b).

In practice, continuous item scores were created first for every individual and were subsequently categorized. The score on item j for subject i was calculated as

yij = λj*θi + εij,

where λj denotes the factor loading of item j, θi the latent trait score of subject i, and εij the individual-specific residual of the item (i.e., the part of the item score that is not related to the subject's latent trait score). The resulting continuous (normally distributed) item scores were subsequently categorized into three scores (0, 1, 2) according to the endorsement rates illustrated in Figure S2.
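
This item-generating step can be sketched as follows, continuing from the latent-trait sketch above; the loadings and the single set of endorsement-rate thresholds used here are illustrative stand-ins for the per-item values shown in Figures S1 and S2:

```python
# Continuous item scores under the factor model, then categorization at threshold values
# implied by assumed endorsement rates (illustrative; uses theta, rng, n_subj from above).
import numpy as np
from scipy.stats import norm

n_item = 30
lam = np.linspace(0.1, 0.5, n_item)                 # factor loadings; item reliability = lam**2
theta_std = (theta - theta.mean()) / theta.std()    # latent trait from above, standardized here

# y_ij = lambda_j * theta_i + eps_ij, with residual variance 1 - lambda_j**2
eps = rng.standard_normal((n_subj, n_item)) * np.sqrt(1 - lam**2)
y_cont = theta_std[:, None] * lam + eps

# thresholds for an "extreme" item: roughly 96% score 0, 3% score 1, 1% score 2
t1, t2 = norm.ppf(0.96), norm.ppf(0.99)
y_cat = np.digitize(y_cont, [t1, t2])               # categorized 0/1/2 item scores

sum_skew = y_cat.sum(axis=1)                        # skewed sum score over the 30 extreme items
```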

The categorized item scores of the 30 extreme items and the 30 items covering the entire phenotypic scale were subsequently summed to obtain the individuals' overall test scores: sum_skew (based on the 30 extreme items) and sum_tot (based on the 30 items across the scale).

Using a 2-parameter IRT model, we also calculated the individual subjects' expected factor scores, using either their scores on the 30 categorized extreme items or their scores on the 30 categorized items covering the entire phenotypic range.
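
As an illustration of this scoring step, the sketch below computes expected a posteriori (EAP) factor scores under a normal-ogive graded-response parametrization, with the loadings and thresholds taken as known (the values used to simulate the items above); it is a stand-in for, not a reproduction of, the scoring routine actually used:

```python
# EAP factor scores for 0/1/2 item responses under a normal-ogive graded model
# (illustrative sketch; lam, t1, t2 are the simulation values from the sketch above).
import numpy as np
from scipy.stats import norm

def eap_scores(y_cat, lam, t1, t2, n_grid=61):
    grid = np.linspace(-4, 4, n_grid)                     # quadrature points for theta
    prior = norm.pdf(grid)                                # standard-normal prior on theta
    s = np.sqrt(1 - lam**2)                               # residual SD of each continuous item
    p_ge1 = norm.cdf((lam * grid[:, None] - t1) / s)      # P(y >= 1 | theta), (n_grid, n_item)
    p_ge2 = norm.cdf((lam * grid[:, None] - t2) / s)      # P(y >= 2 | theta)
    p_cat = np.stack([1 - p_ge1, p_ge1 - p_ge2, p_ge2])   # P(y = c | theta), (3, n_grid, n_item)
    logp = np.log(np.clip(p_cat, 1e-12, None))

    n_subj, n_item = y_cat.shape
    loglik = np.zeros((n_subj, n_grid))
    for j in range(n_item):                               # sum the item log-likelihoods
        loglik += logp[y_cat[:, j], :, j]

    post = np.exp(loglik - loglik.max(axis=1, keepdims=True)) * prior
    return (post * grid).sum(axis=1) / post.sum(axis=1)   # posterior mean of theta per subject

factor_skew = eap_scores(y_cat, lam, t1, t2)              # factor scores from the extreme items
```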

In addition, the sum score based on the 30 extreme items (sum_skew) was dichotomized in two ways: we either applied a "clinical" cut-off criterion such that the 12% of subjects with the highest sum_skew scores were coded 1 and the remainder of the sample 0, or we coded the 50% highest-scoring subjects 1 and the 50% lowest-scoring subjects 0. We also categorized the sum_skew score into three categories, each covering approximately 33% of the sample.
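
These quantile-based recodings of sum_skew can be sketched as follows (exact handling of ties at the cut points is not specified here, so the splits below are only approximately 88/12, 50/50, and 33/33/33):

```python
# Dichotomization and categorization of the skewed sum score (continuing the sketches above).
import numpy as np

def cut_by_quantiles(x, probs):
    """Recode x into ordinal categories 0, 1, ... using the quantiles in `probs` as cut points."""
    return np.digitize(x, np.quantile(x, probs))

dich_clinical = cut_by_quantiles(sum_skew, [0.88])        # ~highest 12% coded 1, remainder 0
dich_median   = cut_by_quantiles(sum_skew, [0.50])        # 50%-50% split
cat_tertiles  = cut_by_quantiles(sum_skew, [1/3, 2/3])    # three ~equally sized categories
```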

Finally, the sum_skew scores were subjected to a square-root or a normal-scores transformation. Note that these latter two transformations are often recommended before conducting an analysis like regression (whose standard tests assume normally distributed residuals) when the dependent variable of interest (in our case the sum_skew score) is not normally distributed. The distributions of all 10 phenotypic operationalisations discussed here are illustrated in Figure S3.
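
The two transformations can be sketched as follows; the rank-based inverse-normal variant shown here is one common way to compute normal scores and may differ in detail from the transformation actually applied:

```python
# Square-root and rank-based inverse-normal ("normal scores") transformations of sum_skew
# (continuing the sketches above; the rank offset of 0.5 is an assumption).
import numpy as np
from scipy.stats import norm, rankdata

sum_sqrt = np.sqrt(sum_skew)                              # square-root transformation
ranks = rankdata(sum_skew)                                # average ranks; ties receive the mean rank
sum_normal = norm.ppf((ranks - 0.5) / len(sum_skew))      # normal scores transformation
```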


All 10 phenotypic operationalisations were subsequently regressed on the 10 causal variants. In these regression analyses, the homozygous major allele genotype was coded 0 (i.e., carrying 0 minor alleles), the heterozygous genotype was coded 1, and the homozygous minor allele genotype was coded 2. We also included one genetic variant that was not related to the phenotype, so that we could examine the type-I error (false positive) rate.
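
A per-variant regression under this additive coding can be sketched as follows, continuing the sketches above (`minor_count`, `rng`, and `n_subj` come from the first sketch and `sum_skew` from the item sketch; the extra null variant is simulated here only to track the type-I error rate):

```python
# Single-variant regressions of a phenotype on minor-allele counts (0/1/2), plus one
# unassociated variant for the type-I error check (illustrative, not the original analysis code).
from scipy import stats

def snp_pvalue(phenotype, allele_count):
    """Two-sided p-value for the slope in a simple linear regression of phenotype on allele count."""
    return stats.linregress(allele_count, phenotype).pvalue

null_snp = rng.binomial(2, 0.2, size=n_subj)              # variant unrelated to the phenotype
p_causal = [snp_pvalue(sum_skew, minor_count[:, j]) for j in range(minor_count.shape[1])]
p_null = snp_pvalue(sum_skew, null_snp)
```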

This entire data simulation + analysis was repeated Nsim=2000 times, and for each simulated genetic variant we counted the number of times that it was picked up in the regression given a genome-wide criterion α of 1e-07 (i.e., the number of times out of Nsim=2000 that the observed p-value in the regression was < 1e-07). The results (in percentages) are shown in Table S1 and plotted in Figure S4. (Note that Figure S4 is similar to the published figure in the manuscript, except that a) it also includes the results for other operationalisations of the phenotype and b) the numbering of the models is different.)
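
The power entries in the tables below then correspond to a counting loop of the following form, where `simulate_once()` is a hypothetical wrapper around the full data-generation and regression pipeline sketched above (it is not a function defined in this material):

```python
# Counting genome-wide-significant hits over Nsim replications (illustrative skeleton).
import numpy as np

n_sim, alpha_gw = 2000, 1e-7
hits = np.zeros(10)                        # one counter per causal variant

for _ in range(n_sim):
    pvals = np.array(simulate_once())      # hypothetical: returns the 10 per-variant p-values
    hits += pvals < alpha_gw

power_percent = 100 * hits / n_sim         # the percentages reported in Table S1
```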

In short, the results show that the skewed sum score (2) performs worse than the latent trait score (1) and the normally distributed sum score (7), and that the power decreases dramatically when the skewed sum score is categorized (3-5), especially when a clinical cut-off criterion is used (3). Clearly, when the trait is polygenic (rather than Mendelian) and cases and controls differ quantitatively, as stipulated in the common-trait common-variant hypothesis underlying GWAS, the test statistic associated with the correlation between the genotype and the case-control phenotype is generally smaller than the test statistics associated with the other phenotypic measures. This is mainly due to the larger standard error of the estimate. Consequently, the power to detect the causal locus drops dramatically.


Table S1: Power in percentages (for MAF=.2 and factor loadings ranging from .1-.5)
Phenotypic operationalisations / Causal variant effect size (% variance explained)
0.2% / 0.4% / 0.6% / 0.8% / 1% / 1.2% / 1.4% / 1.6% / 1.8% / 2% / Check: 0% (null variant)
Latent trait (1) / 0.75 / 10.45 / 34.20 / 64.30 / 85.90 / 96.10 / 98.65 / 99.70 / 100.00 / 100.00 / 5.65
Sum extreme items (2) / 0.10 / 1.15 / 3.40 / 11.65 / 24.80 / 42.00 / 56.30 / 72.20 / 84.40 / 90.85 / 5.50
Dich (88% - 12%) (3) / 0.00 / 0.00 / 0.05 / 0.10 / 0.35 / 1.05 / 1.85 / 3.50 / 5.90 / 9.15 / 5.40
Dich (50%-50%) (4) / 0.00 / 0.15 / 1.00 / 3.65 / 6.20 / 16.10 / 25.35 / 36.00 / 49.25 / 59.65 / 5.25
Categorized (33%/33%/33%) (5) / 0.00 / 0.65 / 2.50 / 7.20 / 17.00 / 31.20 / 44.90 / 56.80 / 71.90 / 79.85 / 5.30
Factor score extreme items (6) / 0.05 / 1.15 / 5.50 / 17.00 / 33.55 / 53.55 / 68.20 / 82.40 / 90.80 / 95.60 / 5.55
Sqrt transformation (9) / 0.10 / 1.05 / 4.15 / 13.35 / 27.35 / 46.05 / 58.95 / 74.60 / 86.30 / 92.20 / 5.50
Normal score transformation (10) / 0.10 / 0.95 / 4.35 / 14.00 / 28.20 / 48.05 / 61.40 / 75.85 / 87.20 / 93.10 / 5.65
Sum items covering entire phenotypic range (7) / 0.15 / 2.60 / 12.55 / 31.20 / 53.90 / 75.50 / 86.20 / 93.95 / 97.95 / 99.25 / 4.95
Factor score items covering entire phenotypic range (8) / 0.20 / 3.30 / 14.20 / 35.35 / 57.60 / 79.70 / 89.35 / 95.40 / 98.80 / 99.45 / 5.35
Note: Power in percentages for simulations using factor loadings ranging from .1 to .5: the percentage of Nsim=2000 simulations in which the causal variants were picked up (MAF=.2, effect sizes varying from .2 to 2% of the variance explained in the latent trait score). The numbers in parentheses correspond to the numbering in Figure S4. The last column shows the false positive rate for the unassociated variant given α=.05 (none of the false positive rates deviate significantly from the expected 5%). Nsubj=5000 in all 10 scenarios.


To assess the generalizability of the simulation results to scenarios in which the MAF of the causal variants is not .2, we repeated the study with exactly the same simulation settings, changing only the minor allele frequency to MAF=.05 (Figure S5 and Table S2) or to MAF=.5 (Figure S6 and Table S3). The main results remain the same: dichotomizing a skewed sum score using a clinical cut-off is very deleterious for the statistical power to detect causal variants. The difference in power between a skewed sum score and a sum score based on items covering the entire trait range is less dramatic when the MAF is very low (.05 versus .2 and .5).

In addition, we repeated the simulations with MAF=.2 but changed the settings for the factor loadings. In the original simulation, the factor loadings ranged between .1 and .5, corresponding to inter-item correlations between .01 and .25, which is rather low but results in a 30-item instrument with a realistic reliability of .84. We added two simulations: a) factor loadings ranging between .3 and .6 (inter-item correlations between .09 and .36, and a reliability of the 30-item instrument of .91: Table S4 and Figure S7), and b) factor loadings ranging between .3 and .9 (inter-item correlations between .09 and .81, and a reliability of the 30-item instrument of .98: Table S5 and Figure S8). Again, the main results remained the same: the power to detect a trait-associated SNP diminishes dramatically if a skewed but continuous trait measure is categorized before analysis, and especially if a clinical cut-off criterion is used to dichotomize it. In addition, the power of the sum score based on items covering the entire trait range remains considerably higher than that of the sum score based on extreme items only.

The power of future GWAS could thus improve considerably if researchers used phenotypic instruments that resolve individual differences among cases as well as controls, i.e., across the entire trait range. Practically, there are at least two ways in which current instruments could be adjusted.

1) One could complement the current extreme items with easy items and items of medium difficulty. This is, however, easier said than done. For example, consider an attention deficit/hyperactivity scale like the Child Behavior Checklist (CBCL; Achenbach, 1991), which includes items like "Often fails to pay close attention or makes careless mistakes", "Often does not seem to listen when spoken to directly", and "Often has difficulty organizing tasks and activities". One could add items stating almost the opposite, e.g., "Pays meticulous attention" and "Listens carefully when spoken to", but items of medium difficulty are difficult to compose.

2) Rather than adding easy/medium items, one could adjust the items' answer categories. For instance, the answer categories for the attention deficit/hyperactivity items mentioned above are: "this item describes a particular child not at all / just a little / quite a bit / very much". Swanson et al. (2006) suggested changing the frame of reference of the scale and asking instead: "compared to other children, does this child display the following behaviour far below average / below average / slightly below average / average / slightly above average / above average / far above average". These authors developed the SWAN (Strengths and Weaknesses of ADHD symptoms and Normal behaviour scale), an instrument very much like the CBCL; due to the different rating scale, however, overall scores on the SWAN are approximately normally distributed, whereas the original CBCL scores are very skewed (see, e.g., Polderman et al., 2007, for an illustration). That is, because teachers/parents are asked to compare a child's behaviour to that of the average child, and can indicate not only that the child displays certain behaviour much more often than average but also much less often, overall scores on the SWAN are approximately normally distributed in a general population sample.

For many traits/instruments, changing the answer options and frame of reference, rather than the actual items, is probably easier to implement in practice. Also, changing the rating scale is directly applicable to many types of instruments. For instance, the answer options for depression items like "I felt sad", "I felt lonely", "I felt my life was a failure", and "I felt people disliked me" are usually something like "Rarely / sometimes / occasionally / often". Changing the answer options to "compared to other people, did you feel […] far less often / less often / slightly less often / about as often / slightly more often / more often / far more often?" is easy to implement. The resulting scale still allows one to distinguish between cases and controls, but also distinguishes among controls: some people do have feelings of loneliness or sadness, but only about as often as anyone else, while others hardly ever experience loneliness or sadness.

A drawback of this rescaling, however, could be that participants are not only asked to evaluate their own behaviour, but also to compare their behaviour to that of others. This, of course, requires some insight into, or knowledge of, "average behaviour", and what is considered "average" may differ from person to person. However, answer options like "rarely" and "occasionally" are also open to subjective evaluation; self-report instruments often suffer from this "frame-of-reference" dependency.

Important to note is that simple rescaling will not always result in normally distributed scores. For instance, schizophrenia symptoms like odd beliefs, unusual perceptual experiences, delusions, hallucinations, apathy, and catatonic behaviour are simply quite extreme and may not be well suited to evaluation on a "gradual" scale.

Whatever strategy one chooses to obtain more normally distributed test scores, the newly developed instruments of course require careful validation, standardization, and, especially, close comparison to the original instruments, for which valuable information is already available.


Table S2: Power in percentages (for MAF=.05 and factor loadings ranging from .1-.5)
Phenotypic operationalisations / Causal variant effect size (% variance explained)
0.2% / 0.4% / 0.6% / 0.8% / 1% / 1.2% / 1.4% / 1.6% / 1.8% / 2%
Latent trait (1) / 0.80 / 9.45 / 36.80 / 64.25 / 85.90 / 94.65 / 98.85 / 99.40 / 99.95 / 100.00
Sum extreme items (2) / 0.00 / 1.75 / 9.90 / 23.80 / 46.55 / 65.25 / 81.90 / 91.05 / 96.60 / 99.00
Dich (88% - 12%) (3) / 0.00 / 0.00 / 0.00 / 0.00 / 0.00 / 0.05 / 0.10 / 0.10 / 0.30 / 0.85
Dich (50%-50%) (4) / 0.00 / 0.45 / 2.15 / 6.60 / 14.40 / 24.35 / 36.00 / 52.45 / 69.50 / 78.70
Categorized (33%/33%/33%) (5) / 0.00 / 0.70 / 4.50 / 12.30 / 26.55 / 43.15 / 58.00 / 71.65 / 84.85 / 91.65
Factor score extreme items (6) / 0.00 / 2.50 / 14.40 / 31.70 / 55.20 / 73.30 / 88.20 / 93.95 / 97.85 / 99.50
Sqrt transformation (9) / 0.10 / 2.40 / 13.25 / 31.05 / 53.00 / 71.50 / 86.55 / 93.00 / 97.90 / 99.30
Normal score transformation (10) / 0.05 / 2.30 / 12.85 / 30.60 / 52.70 / 70.60 / 86.65 / 92.85 / 97.75 / 99.35
Sum items covering entire phenotypic range (7) / 0.20 / 2.60 / 14.45 / 31.25 / 53.55 / 73.70 / 87.25 / 93.30 / 97.75 / 99.60
Factor score items covering entire phenotypic range (8) / 0.30 / 3.15 / 15.80 / 34.75 / 57.25 / 77.00 / 89.90 / 94.80 / 98.75 / 99.70
Note: Power in percentages: the percentage of Nsim=2000 simulations in which the causal variants were picked up (MAF=.05, effect sizes varying from .2 to 2% of the variance explained in the latent trait score). The numbers in parentheses correspond to the numbering in Figure S5. Nsubj=5000 in all 10 scenarios.
Table S3: Power in percentages (for MAF=.5 and factor loadings ranging from .1-.5)
Phenotypic operationalisations / Causal variant effect size (% variance explained)
0.2% / 0.4% / 0.6% / 0.8% / 1% / 1.2% / 1.4% / 1.6% / 1.8% / 2%
Latent trait (1) / 0.90 / 9.05 / 34.80 / 65.60 / 85.90 / 95.10 / 98.75 / 99.70 / 100.00 / 99.95
Sum extreme items (2) / 0.00 / 0.30 / 1.85 / 5.05 / 13.00 / 22.95 / 36.15 / 47.60 / 60.90 / 71.60
Dich (88% - 12%) (3) / 0.00 / 0.00 / 0.00 / 0.55 / 0.60 / 1.55 / 2.75 / 3.85 / 7.20 / 10.90
Dich (50%-50%) (4) / 0.00 / 0.15 / 0.50 / 1.30 / 3.35 / 6.55 / 10.95 / 16.95 / 25.35 / 31.85
Categorized (33%/33%/33%) (5) / 0.00 / 0.30 / 1.10 / 2.45 / 6.75 / 12.20 / 19.40 / 28.95 / 40.55 / 50.90
Factor score extreme items (6) / 0.00 / 0.50 / 3.15 / 7.85 / 19.10 / 32.60 / 46.00 / 60.75 / 73.25 / 83.10
Sqrt transformation (9) / 0.00 / 0.30 / 1.95 / 4.60 / 11.65 / 21.25 / 33.25 / 44.10 / 58.30 / 69.60
Normal score transformation (10) / 0.00 / 0.45 / 2.05 / 5.20 / 13.00 / 24.10 / 36.75 / 48.30 / 62.20 / 72.60
Sum items covering entire phenotypic range (7) / 0.25 / 2.50 / 11.80 / 29.75 / 52.05 / 70.70 / 85.60 / 91.75 / 97.10 / 98.95
Factor score items covering entire phenotypic range (8) / 0.25 / 2.75 / 13.80 / 33.35 / 58.00 / 75.50 / 89.00 / 94.60 / 98.35 / 99.35
Note: Power in percentages: the percentage of Nsim=2000 simulations in which the causal variants were picked up (MAF=.5, effect sizes varying from .2 to 2% of the variance explained in the latent trait score). The numbers in parentheses correspond to the numbering in Figure S6. Nsubj=5000 in all 10 scenarios.
Table S4: Power in percentages (for MAF=.2 and factor loadings ranging from .3-.6)
Phenotypic operationalisations / Causal variant effect size (% variance explained)
0.2% / 0.4% / 0.6% / 0.8% / 1% / 1.2% / 1.4% / 1.6% / 1.8% / 2%
Latent trait (1) / 0.45 / 9.80 / 35.65 / 65.50 / 85.80 / 95.65 / 99.30 / 99.70 / 100.00 / 100.00
Sum extreme items (2) / 0.05 / 1.60 / 7.95 / 20.20 / 41.60 / 61.45 / 76.15 / 87.25 / 94.85 / 97.55
Dich (88% - 12%) (3) / 0.00 / 0.00 / 0.30 / 1.05 / 1.00 / 3.75 / 5.15 / 10.40 / 15.65 / 22.15
Dich (50%-50%) (4) / 0.05 / 0.45 / 2.65 / 8.75 / 16.85 / 31.15 / 46.15 / 61.35 / 74.90 / 83.05
Categorized (33%/33%/33%) (5) / 0.05 / 1.05 / 7.50 / 17.10 / 34.35 / 53.25 / 71.30 / 81.75 / 90.20 / 94.95
Factor score extreme items (6) / 0.10 / 2.95 / 13.45 / 30.10 / 53.95 / 74.90 / 86.85 / 94.60 / 98.00 / 99.20
Sqrt transformation (9) / 0.10 / 3.00 / 12.00 / 28.75 / 51.80 / 71.65 / 85.05 / 93.10 / 97.50 / 99.20
Normal score transformation (10) / 0.10 / 3.10 / 12.35 / 29.05 / 52.05 / 72.10 / 85.60 / 93.25 / 97.80 / 99.25
Sum items covering entire phenotypic range (7) / 0.20 / 5.25 / 22.40 / 45.60 / 69.85 / 86.55 / 94.30 / 98.15 / 99.70 / 99.80
Factor score items covering entire phenotypic range (8) / 0.15 / 4.90 / 22.55 / 47.40 / 71.45 / 87.35 / 95.10 / 98.30 / 99.55 / 99.95
Note: Power in percentages: the percentage of Nsim=2000 simulations in which the causal variants were picked up (MAF=.2, factor loadings .3-.6, effect sizes varying from .2 to 2% of the variance explained in the latent trait score). The numbers in parentheses correspond to the numbering in Figure S7. Nsubj=5000 in all 10 scenarios.
Table S5: Power in percentages (for MAF=.2 and factor loadings ranging from .3-.9)
Phenotypic operationalisations / Causal variant effect size (% variance explained)
0.2% / 0.4% / 0.6% / 0.8% / 1% / 1.2% / 1.4% / 1.6% / 1.8% / 2%
Latent trait (1) / 0.55 / 10.10 / 32.50 / 64.95 / 85.70 / 96.70 / 98.65 / 99.80 / 100.00 / 100.00
Sum extreme items (2) / 0.10 / 2.20 / 8.70 / 25.20 / 49.10 / 68.45 / 83.15 / 92.25 / 96.45 / 98.50
Dich (88% - 12%) (3) / 0.00 / 0.05 / 0.30 / 0.95 / 2.30 / 6.25 / 11.00 / 16.60 / 23.30 / 31.85
Dich (50%-50%) (4) / 0.10 / 0.65 / 4.55 / 14.90 / 28.90 / 49.55 / 64.05 / 77.50 / 87.30 / 93.35
Categorized (33%/33%/33%) (5) / 0.20 / 2.35 / 9.25 / 25.25 / 46.80 / 68.95 / 81.60 / 90.90 / 95.70 / 98.60
Factor score extreme items (6) / 0.30 / 4.35 / 17.75 / 41.95 / 67.50 / 86.65 / 94.60 / 98.00 / 99.30 / 99.85
Sqrt transformation (9) / 0.25 / 4.20 / 16.40 / 39.30 / 63.35 / 84.10 / 93.40 / 97.15 / 99.15 / 99.65
Normal score transformation (10) / 0.25 / 4.20 / 16.40 / 39.75 / 64.20 / 83.85 / 93.20 / 97.30 / 99.20 / 99.70
Sum items covering entire phenotypic range (7) / 0.60 / 7.80 / 26.90 / 56.05 / 79.95 / 93.10 / 97.80 / 99.60 / 100.00 / 99.95
Factor score items covering entire phenotypic range (8) / 0.50 / 7.80 / 27.45 / 56.95 / 81.15 / 94.15 / 97.85 / 99.60 / 100.00 / 99.95
Note: Power in percentages: the percentage of Nsim=2000 simulations in which the causal variants were picked up (MAF=.2, factor loadings .3-.9, effect sizes varying from .2 to 2% of the variance explained in the latent trait score). The numbers in parentheses correspond to the numbering in Figure S8. Nsubj=5000 in all 10 scenarios.


References:

Achenbach, T.M. (1991). Manual for the Child Behavior Checklist/4–18. Burlington, VT: University of Vermont, Department of Psychiatry.

Beck, A.T., Steer, R.A., & Garbin, M.G. (1988). Psychometric properties of the Beck Depression Inventory: Twenty-five years of evaluation. Clinical Psychology Review, 8, 77-100.

Polderman, T.J.C., Derks, E.M., Hudziak, J.J., Verhulst, F.C., Posthuma, D., & Boomsma, D.I. (2007). Across the continuum of attention skills: a twin study of the SWAN ADHD rating scale. Journal of Child Psychology and Psychiatry, 48(11), 1080-1087.

Radloff, L.S. (1977). The CES-D Scale: A Self-Report Depression Scale for Research in the General Population. Applied Psychological Measurement, 1, 385-401.

Radloff, L.S. (1991). The use of the Center for Epidemiologic Studies Depression Scale in adolescents and young adults. Journal of Youth and Adolescence, 20(2), 149-166.

Sparrow, E.P. (2010). Essentials of Conners’ Behavior Assessments. John Wiley & Sons, Inc., Hoboken, New Jersey.

Swanson, J.M., Schuck, S., Mann, M., Carlson, C., Hartman, K., Sergeant, J.A., Clevinger, W., Wasdell, M., & McCleary, R. (2006). Categorical and dimensional definitions and evaluations of symptoms of ADHD: The SNAP and SWAN Rating Scales. Retrieved May 2006 from