MENDELIAN AND COMPLEX GENETIC DISORDERS

(MASTER IN GENETICS AND GENOMICS)

COMPUTER SESSIONS

Bru Cormand

OBJECTIVE

To run a case-control association study between a psychiatric disorder (ADHD) and a candidate gene (NTF3, encoding neurotrophin 3). We will use real data (Ribasés et al. 2008. Biological Psychiatry 63:935-945).

BIOINFORMATIC TOOLS

Free-access software that can be run online or locally (Windows):

- GENETIC POWER CALCULATOR (GPC)

- HAPLOVIEW

- STRUCTURE

- SNPassoc (R)

COURSE MATERIALS:

GENETIC POWER CALCULATOR (GPC)

Shaun Purcell

Harvard Medical School

Purcell S, Cherny SS, Sham PC. Genetic Power Calculator: design of linkage and association genetic mapping studies of complex traits. Bioinformatics 2003;19(1):149-150

This site provides automated power analysis for variance components (VC) quantitative trait locus (QTL) linkage and association tests in sibships, and other common tests.

The main parameters that must be specified by the user are:

- High-risk allele frequency, for the 'A' allele. Typically this would be rare, say under 0.10.
- The disease prevalence in the general population (K).
- The genotypic relative risks for the 'Aa' and 'AA' genotypes relative to the baseline 'aa' genotype risk, r(aa). The baseline risk is calculated from the parameters above: the prevalence K equals f(AA)r(AA) + f(Aa)r(Aa) + f(aa)r(aa), where f() is the genotype frequency and r() is the genotypic risk, or P(disease|genotype). Rearranging gives a formula for r(aa) in terms of the other parameters. The genotypic relative risks for the 'Aa' and 'AA' genotypes equal r(Aa)/r(aa) and r(AA)/r(aa), respectively.
- Sample size: as well as specifying the number of affected individuals (cases), the user must specify the control:case ratio. If this equals one, there are as many controls as cases. If it were 0.5 and there were 200 cases, there would be 100 controls, etc.
- Linkage disequilibrium: the procedure is for a marker B in linkage disequilibrium with the test locus A. To specify power at the test locus itself, set the LD measure (D') to 1 and the allele frequencies of A and B equal.
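The rearrangement for r(aa) described above can be sketched in a few lines of Python. This is a minimal sketch, not the Genetic Power Calculator's own code: it assumes Hardy-Weinberg proportions for the genotype frequencies f(), and the function names and example values (allele frequency 0.10, prevalence 0.05, relative risks 1.5 and 2.0) are hypothetical choices for illustration.

```python
def genotype_frequencies(p):
    """Genotype frequencies for risk-allele frequency p (assumes Hardy-Weinberg proportions)."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}


def baseline_risk(p, prevalence, grr_Aa, grr_AA):
    """Solve K = f(AA)r(AA) + f(Aa)r(Aa) + f(aa)r(aa) for r(aa),
    using r(Aa) = grr_Aa * r(aa) and r(AA) = grr_AA * r(aa)."""
    f = genotype_frequencies(p)
    return prevalence / (f["AA"] * grr_AA + f["Aa"] * grr_Aa + f["aa"])


def number_of_controls(n_cases, control_case_ratio):
    """Number of controls implied by the number of cases and the control:case ratio."""
    return round(n_cases * control_case_ratio)


# Hypothetical illustration values (not taken from the course data).
p, K, grr_Aa, grr_AA = 0.10, 0.05, 1.5, 2.0
r_aa = baseline_risk(p, K, grr_Aa, grr_AA)
risks = {"aa": r_aa, "Aa": grr_Aa * r_aa, "AA": grr_AA * r_aa}
print("baseline risk r(aa):", round(r_aa, 5))
print("controls for 200 cases at ratio 0.5:", number_of_controls(200, 0.5))
```

Multiplying each genotype frequency by its risk and summing should return the prevalence K, which is a quick check that the inputs are consistent.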

The output gives the baseline genotypic risk r(aa) and the genotypic odds ratios for the 'Aa' and 'AA' genotypes (for rare diseases these will be very similar to the genotypic relative risks). We assume that the biallelic Aa locus is the true causative polymorphism.
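To see why the genotypic odds ratio approaches the genotypic relative risk when the disease is rare, the two can be compared directly. The sketch below is illustrative only; the function name and the baseline risks used are hypothetical.

```python
def genotypic_odds_ratio(risk_genotype, risk_baseline):
    """Odds of disease for a genotype divided by the odds for the baseline 'aa' genotype."""
    odds_genotype = risk_genotype / (1.0 - risk_genotype)
    odds_baseline = risk_baseline / (1.0 - risk_baseline)
    return odds_genotype / odds_baseline


relative_risk = 2.0
for r_aa in (0.001, 0.2):  # a rare and a common disease (hypothetical baseline risks)
    odds_ratio = genotypic_odds_ratio(relative_risk * r_aa, r_aa)
    print(f"r(aa)={r_aa}: relative risk={relative_risk}, odds ratio={odds_ratio:.3f}")
```

With a low baseline risk the odds ratio stays close to the relative risk, while with a high baseline risk the two diverge noticeably.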

Power is given for various values of alpha for the user-specified sample size. Also, required sample size is given for various levels of alpha for the user-specified power. Note that the number of cases for a specific power refers to the number of affected individuals along with the appropriate number of controls as specified by the control:case ratio.

The output also gives the expected allele and genotype frequencies for cases and controls. A chi-squared test statistic (and associated power at alpha=0.05) is given for a test of Hardy-Weinberg equilibrium in cases and controls (the presence of H-W disequilibrium in cases but not in controls can be indicative of an association).
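The expected genotype frequencies in cases and controls follow from Bayes' rule: P(genotype|case) = f(genotype)r(genotype)/K and, for unaffected controls, P(genotype|control) = f(genotype)(1 - r(genotype))/(1 - K). The following is a minimal sketch under two assumptions: Hardy-Weinberg proportions in the general population, and controls drawn only from unaffected individuals. The names and numbers are hypothetical, and GPC's own implementation may differ in detail.

```python
def case_control_genotype_frequencies(p, prevalence, grr_Aa, grr_AA):
    """Expected genotype frequencies in cases and in unaffected controls."""
    q = 1.0 - p
    freq = {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}
    # Baseline risk r(aa) obtained by rearranging K = sum of f(g) * r(g).
    r_aa = prevalence / (freq["AA"] * grr_AA + freq["Aa"] * grr_Aa + freq["aa"])
    risk = {"AA": grr_AA * r_aa, "Aa": grr_Aa * r_aa, "aa": r_aa}
    cases = {g: freq[g] * risk[g] / prevalence for g in freq}
    controls = {g: freq[g] * (1.0 - risk[g]) / (1.0 - prevalence) for g in freq}
    return cases, controls


def allele_frequency(genotype_freq):
    """Frequency of allele 'A' given genotype frequencies."""
    return genotype_freq["AA"] + 0.5 * genotype_freq["Aa"]


# Hypothetical illustration values (not taken from the course data).
cases, controls = case_control_genotype_frequencies(0.10, 0.05, 1.5, 2.0)
print("cases:   ", {g: round(v, 4) for g, v in cases.items()})
print("controls:", {g: round(v, 4) for g, v in controls.items()})
```

When the relative risks are above 1, the 'A' allele is expected to be more frequent in cases than in controls, which is the difference the association test tries to detect.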

INPUT:

HAPLOVIEW

Jeffrey Barrett, Julian Maller and David Bender.

Broad Institute (MIT, Harvard, Whitehead Institute)

Barrett JC, Fry B, Maller J, Daly MJ. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005;21(2):263-5

Haploview is designed to simplify and expedite the process of haplotype analysis by providing a common interface to several tasks relating to such analyses. Haploview currently supports the following functionalities:

- LD & haplotype block analysis
- haplotype population frequency estimation
- single SNP and haplotype association tests
- permutation testing for association significance
- implementation of Paul de Bakker's Tagger tag SNP selection algorithm
- automatic download of phased genotype data from HapMap
- visualization and plotting of PLINK whole genome association results including advanced filtering options

Haploview is fully compatible with data dumps from the HapMap project and the Perlegen Genotype Browser. It can analyze thousands of SNPs (tens of thousands in command line mode) in thousands of individuals.

INPUT:

The input files can be generated automatically from ENSEMBL, which contains genotypes at thousands of SNPs in different populations. Alternatively, we can build the files ourselves (with extensions .ped and .info) in the following format:

Sample.ped

Family  Indiv.  Father  Mother  Sex  Aff.status  marker1  marker2  …
1       1       0       0       1    1           1 1      2 3
1       2       0       0       1    1           1 2      2 2
1       3       0       0       1    2           2 1      2 2
1       4       0       0       2    2           3 1      1 3
…

Sex (1=male, 2=female), Aff. status (1=healthy, 2=patient)

sample.info

marker1 19345

marker2 19799

marker3 …
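To make the two layouts concrete, the following Python sketch reads one line of each file. It assumes whitespace-separated fields, six leading columns in the .ped file (family, individual, father, mother, sex, affection status) followed by two alleles per marker, and a marker name plus position in the .info file. The function names and the example line are illustrative and are not part of Haploview.

```python
from typing import List, NamedTuple, Tuple


class PedRecord(NamedTuple):
    family: str
    individual: str
    father: str
    mother: str
    sex: int        # 1=male, 2=female
    affected: int   # 1=healthy, 2=patient
    genotypes: List[Tuple[str, str]]  # one (allele1, allele2) pair per marker


def parse_ped_line(line: str) -> PedRecord:
    """Split one .ped line into its six fixed columns and the allele pairs."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 columns followed by two alleles per marker")
    alleles = fields[6:]
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return PedRecord(fields[0], fields[1], fields[2], fields[3],
                     int(fields[4]), int(fields[5]), genotypes)


def parse_info_line(line: str) -> Tuple[str, int]:
    """Split one .info line into marker name and position."""
    name, position = line.split()
    return name, int(position)


# Illustrative line: family 1, individual 1, no parents, male, healthy, two markers.
record = parse_ped_line("1 1 0 0 1 1 1 1 2 3")
print(record)
print(parse_info_line("marker1 19345"))
```

The number of allele pairs in each .ped line should equal the number of lines in the .info file, so comparing the two counts is a simple consistency check before loading the files.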

STRUCTURE

Daniel Falush, Matthew Stephens, Jonathan Pritchard, Peter Donnelly, William Wen

The University of Chicago

Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics 2000;155(2):945-59

The program structure is a free software package for using multi-locus genotype data to investigate population structure. Its uses include inferring the presence of distinct populations, assigning individuals to populations, studying hybrid zones, identifying migrants and admixed individuals, and estimating population allele frequencies in situations where many individuals are migrants or admixed. It can be applied to most of the commonly used genetic markers, including SNPs, microsatellites, RFLPs and AFLPs.

INPUT:

ID individual   marker 1   marker 2   …   marker n
Pepet           1          3              2
                1          2              3
Pepita          2          2              2
                1          2              2
Ferran          1          1              1
                1          3              3

SNPassoc v. 1.9-2 (R>=3.0.0)

Juan R González, Lluís Armengol, Elisabet Guinó, Xavier Solé, and Víctor Moreno

Centre de Regulació Genòmica (CRG)

González JR, Armengol L, Solé X, Guinó E, Mercader JM, Estivill X, Moreno V. SNPassoc: an R package to perform whole genome association studies. Bioinformatics 2007;23(5):644-5

This package carries out the most common analyses performed in whole genome association studies. These analyses include descriptive statistics and exploratory analysis of missing values, calculation of Hardy-Weinberg equilibrium, analysis of association based on generalized linear models (either for quantitative or binary traits), and analysis of multiple SNPs (haplotype and epistasis analysis). Permutation tests and related tests (sum statistic and truncated product) are also implemented.

INPUT1 (GENOTYPES):

Pair  Individual  Subdiagn.  Sex  Case/Ctrl  SNP1  SNP2  …
1     1           0          1    1          12    13
1     2           2          1    2          31    12
…

Subdiagnosis (0=healthy, 1=combined, 2=inattentive, 3=hyperactive/impulsive, 4=residual), Sex (1=male, 2=female), Case/Ctrl (1=healthy, 2=patient)

INPUT2 (ALLELES):

Pair  Individual  Subdiagn.  Sex  Case/Ctrl  SNP1  SNP2  …
1     1           0          1    1          11    11
1     1           0          1    1          22    33
1     2           2          1    2          33    11
1     2           2          1    2          11    22
…

Subdiagnosis (0=healthy, 1=combined, 2=inattentive, 3=hyperactive/impulsive, 4=residual), Sex (1=male, 2=female), Case/Ctrl (1=healthy, 2=patient)

INPUT3 (INSTRUCTIONS)

> tdah<-read.table("C:/NTF3.txt",header=TRUE)

This command loads the data from the file “NTF3.txt” (prepared in the above format) into a data frame called “tdah”, which SNPassoc can process. If the first row of the file contains the column headings, this has to be indicated with “header=TRUE”.

> tdah[1:11,1:11]

Displays the contents of the data frame “tdah” that we have created (rows and columns 1 to 11).

> tdah2<-setupSNP(tdah,6:11,sep="")

Converts columns 6 to 11 of the data frame “tdah” into SNP format (the two alleles separated by a “/”), which the following commands can understand.

> tableHWE(tdah2,CC)

To test (chi-square) Hardy-Weinberg equilibrium at SNPs contained in tdah2. If we wish to run separate analyses in cases and controls we need to indicate “CC” (the column where we establish who is a case and who is a control). This option also produces combined case+control calculations.
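The idea behind a chi-square test of Hardy-Weinberg equilibrium can be sketched as follows. This is a textbook Pearson chi-square with one degree of freedom written in plain Python, shown only to illustrate the calculation; it is an assumption that it resembles what tableHWE does, and the routine used inside SNPassoc may differ. The function name and genotype counts are hypothetical.

```python
import math


def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Pearson chi-square test of Hardy-Weinberg proportions from genotype counts.

    Returns the statistic and its p-value (1 degree of freedom)."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # observed frequency of allele A
    q = 1.0 - p
    if p <= 0.0 or q <= 0.0:
        raise ValueError("the SNP is monomorphic; the test is undefined")
    observed = (n_AA, n_Aa, n_aa)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotype counts, not the NTF3 data.
for counts in ((25, 50, 25), (40, 20, 40)):
    chi2, p_value = hwe_chi_square(*counts)
    print(counts, f"chi2={chi2:.2f}", f"p={p_value:.3g}")
```

Running the function separately on the genotype counts of cases and of controls mirrors the stratified analysis described above.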

> association(CC~snp(NTF3rs6332),data=tdah2)

Here we test (chi-square) for differences in the allele/genotype frequencies of a given SNP (NTF3rs6332) between cases and controls. Different genetic models are considered (codominant, dominant, recessive, overdominant, additive). If we wish to analyse the data under a specific model, this has to be indicated, e.g. association(CC~snp(NTF3rs6332),data=tdah2,model="dominant"). The output includes the calculation of the frequencies and an association p-value.
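How the genetic models regroup the three genotypes before testing can be illustrated with a small Python sketch. It applies a plain Pearson chi-square to genotype counts ordered as (reference homozygote, heterozygote, variant homozygote). SNPassoc itself is based on generalized linear models, so its p-values need not coincide with these. The additive model is omitted here, and all names and counts are hypothetical.

```python
import math


def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency
    table given as a list of rows (here: one row for cases, one for controls)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


def chi_square_p_value(chi2, df):
    """Upper-tail p-value, using the closed forms for 1 and 2 degrees of freedom."""
    if df == 1:
        return math.erfc(math.sqrt(chi2 / 2.0))
    if df == 2:
        return math.exp(-chi2 / 2.0)
    raise ValueError("only df = 1 or 2 are handled in this sketch")


def collapse(counts, model):
    """Group (reference homozygote, heterozygote, variant homozygote) counts by model."""
    hom_ref, het, hom_var = counts
    if model == "codominant":
        return [hom_ref, het, hom_var]
    if model == "dominant":
        return [hom_ref, het + hom_var]
    if model == "recessive":
        return [hom_ref + het, hom_var]
    if model == "overdominant":
        return [hom_ref + hom_var, het]
    raise ValueError(f"unknown model: {model}")


# Hypothetical genotype counts, not the NTF3 data.
cases, controls = (30, 50, 20), (50, 40, 10)
for model in ("codominant", "dominant", "recessive", "overdominant"):
    chi2, df = pearson_chi_square([collapse(cases, model), collapse(controls, model)])
    print(f"{model}: chi2={chi2:.2f}, df={df}, p={chi_square_p_value(chi2, df):.4f}")
```

The codominant model keeps all three genotype classes (2 degrees of freedom), whereas the other models merge two classes and so test a single contrast (1 degree of freedom).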

> association(CC~sex+snp(NTF3rs6332),data=tdah2)

As above, but here we include gender as a covariate, i.e. we adjust for gender. ‘sex’ is the heading of the column that records the gender of each individual.

> WGassociation(CC,data=tdah2,model="all")

We test (chi-square) for differences in the allele and genotype frequencies of a series of SNPs (in this case all the SNPs included in the data set ‘tdah2’) between cases and controls. Different genetic models are considered (codominant, dominant, recessive, overdominant, additive). The output reports only p-values. If we are interested in a specific genetic model, we should change ‘all’ to ‘dominant’, ‘recessive’, ‘codominant’, etc.