Biostatistics 237 / Biomathematics 207B / HG207B
March 9, 2004
Account name: m237
Password: winter2002, win2002
Laboratory #9:
FAMILY BASED ASSOCIATION TESTS: TDT AND GAMETE COMPETITION
(A) The data sets:
This exercise has two parts. In part I, we will run the TDT and the gamete competition on angiotensin I-converting enzyme (ACE) as a qualitative trait. The ACE gene is located on 17q23. When running the TDT or the gamete competition for qualitative traits, we will consider anyone with an ACE level of less than -0.648 to be affected. The data set for part I consists of extended and nuclear families from Oxford phenotyped for ACE and genotyped for the insertion-deletion (ID) polymorphism and for a highly informative polymorphism in the neighboring growth hormone (GH) gene. As a prelude to part I, we will run the Combining_alleles option of Mendel 5.0 to reduce the number of GH alleles and avoid sparse data problems.
In part II, we will use the gamete competition as a test of family-based association with a quantitative trait. The data set for part II consists of extended and nuclear families from Jamaica phenotyped for ACE. We will examine three SNPs located within the ACE locus. The data consist of ACE levels on 405 people and SNP data on 489 people in 83 pedigrees. We will first run each SNP separately and then use the SNPs in combination.
Copy the following data sets from the F:\class\bio237 folder to your directory:
Part I files:
pedoxf.in
locoxf.in
mapoxf.in
concomb.in
contdt.in
congam.in
Part II files:
Consnp.in
locsnp.in
mapsnp.in
varace.in
pedsnp.in
Part I: TDT and gamete competition for a qualitative trait
(B) Reducing the number of GH alleles.
To avoid having a very large number of possible cells, many with no data, we will combine alleles at the GH locus. This is absolutely necessary for the gamete competition, which will otherwise run extremely slowly. The TDT will run reasonably well without collapsing the number of alleles; however, because of the discrete nature of the data, very sparse data can lead to false inference with both methods. Combining very rare alleles avoids this problem.
The control file in this case, concomb.in, is the following:
concomb.in:
!input files
LOCUS_FILE=LOCoxf.IN
MAP_FILE=MApoxf.IN
PEDIGREE_FILE=PEDoxf.IN
! reading input
AFFECTED=1
AFFECTED_LOCUS_OR_FACTOR=ACE
READ_PEDIGREE_RECORDS = F
pedigree_list_read=true
allele_separator=-
male =1
female=2
!new output
new_pedigree_file=pednew.in
new_locus_file=locnew.in
! analysis options
ANALYSIS_OPTION=Combining_alleles
OUTPUT_FILE=comb.out
Maximum_combined_alleles=7
This Mendel option creates a new locus file and a corresponding pedigree file, so it is important to specify the new file names. Combining_alleles uses the allele frequencies in the locus file to determine which alleles will be combined. The program combines alleles until there are no more than the user-specified maximum number of alleles and each remaining allele is at least as frequent as the user-specified minimum allele frequency. The defaults are a maximum of 10 alleles and a minimum allele frequency of 0.05. The minimum number of alleles is 2, even if one of them has an allele frequency less than the specified minimum allele frequency.
Run the Combining_alleles option of Mendel 5.0 using Gregor by reading in this control file, writing out the control.in file and selecting the option "Run Mendel". Examine the new pedigree file and locus file and note the changes.
The new pedigree file is formatted (the top line gives the Fortran format) and the number of alleles at the GH locus has been reduced.
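The stopping rule described above can be sketched in a few lines of Python. This is only an illustrative heuristic that repeatedly pools the two rarest groups; it is an assumption made for the example and not a description of how Mendel actually chooses which alleles to combine. The function name and the toy frequencies are hypothetical.

```python
def combine_alleles(freqs, max_alleles=10, min_freq=0.05):
    """Pool the two rarest groups until both limits are met.

    freqs maps allele label -> frequency. Returns a list of
    (member_labels, pooled_frequency) groups, rarest first.
    Illustrative heuristic only; Mendel's own pooling rule may differ.
    """
    groups = [([label], f) for label, f in freqs.items()]
    while len(groups) > 2:  # never go below 2 alleles
        groups.sort(key=lambda g: g[1])
        too_many = len(groups) > max_alleles
        too_rare = groups[0][1] < min_freq
        if not (too_many or too_rare):
            break
        (a, fa), (b, fb) = groups[0], groups[1]
        groups = [(a + b, fa + fb)] + groups[2:]
    return groups


# Toy frequencies, not taken from the Oxford data.
toy = {"1": 0.50, "2": 0.30, "3": 0.12, "4": 0.04, "5": 0.03, "6": 0.01}
for members, freq in combine_alleles(toy, max_alleles=7, min_freq=0.05):
    print(members, round(freq, 5))
```

With these toy values the three rarest alleles end up in one pooled group, and the loop stops as soon as every group meets the minimum frequency.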
(5(1X,A8),(T51,3(1X,A8),:))
1 1 1 1-2 20-20
2
1 2 2 1-2 8-8
1
The new locus file decodes the combined alleles.
GH Autosome 7 0
6 0.12378 ORIGINAL ALLELE NUMBERS: 6 9 14
7 0.08870 ORIGINAL ALLELE NUMBERS: 3 4 5 7 12
8 0.10916 ORIGINAL ALLELE NUMBERS: 1 2 8 10 18
11 0.10039 ORIGINAL ALLELE NUMBERS: 11 16
13 0.07115 ORIGINAL ALLELE NUMBERS: 13 15
19 0.13158 ORIGINAL ALLELE NUMBERS: 17 19
20 0.37524 ORIGINAL ALLELE NUMBERS: 20
Note, for example, that alleles 1, 2, 10 and 18 have all been combined with allele 8.
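As a quick check on the new locus file, the allele lines shown above can be parsed to confirm that the combined frequencies still sum to 1 and to look up where each original allele went. The sketch below pastes the seven allele lines in as a string; the parsing function is hypothetical and assumes every allele line contains the text "ORIGINAL ALLELE NUMBERS:" exactly as printed here.

```python
LOCUS_LINES = """\
6 0.12378 ORIGINAL ALLELE NUMBERS: 6 9 14
7 0.08870 ORIGINAL ALLELE NUMBERS: 3 4 5 7 12
8 0.10916 ORIGINAL ALLELE NUMBERS: 1 2 8 10 18
11 0.10039 ORIGINAL ALLELE NUMBERS: 11 16
13 0.07115 ORIGINAL ALLELE NUMBERS: 13 15
19 0.13158 ORIGINAL ALLELE NUMBERS: 17 19
20 0.37524 ORIGINAL ALLELE NUMBERS: 20
"""


def parse_combined(text):
    """Return {new_label: (frequency, [original labels])}."""
    table = {}
    for line in text.strip().splitlines():
        head, originals = line.split("ORIGINAL ALLELE NUMBERS:")
        label, freq = head.split()
        table[label] = (float(freq), originals.split())
    return table


table = parse_combined(LOCUS_LINES)
total = sum(freq for freq, _ in table.values())
# Reverse lookup: original allele number -> combined allele label.
original_to_new = {old: new for new, (_, olds) in table.items() for old in olds}

print(f"{len(table)} combined alleles, frequencies sum to {total:.5f}")
print("original allele 18 is now allele", original_to_new["18"])
```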
(C) Running the TDT
The control file now uses the new pedigree and locus files. The pedigree file is formatted, so we no longer have the command pedigree_list_read=true. Instead we use the default, pedigree_list_read=false.
contdt.in:
!input files
LOCUS_FILE=LOCnew.IN
MAP_FILE=MAPoxf.IN
PEDIGREE_FILE=PEDnew.IN
! reading input
AFFECTED=1
AFFECTED_LOCUS_OR_FACTOR=ACE
READ_PEDIGREE_RECORDS = F
allele_separator=-
male =1
female=2
!new output
new_pedigree_file=pednew.in
new_locus_file=locnew.in
! analysis options
ANALYSIS_OPTION=TDT
OUTPUT_FILE=TDT.out
Summary_File=TDTsum.out
samples=100000
In the pedigree file, males are designated with 1 and females with 2. Affecteds (ACE less than -0.648) are designated as 1 and unaffecteds as 2. Because the p-values are estimated by Monte Carlo simulation, we need to specify the number of samples. The default is 10,000 but we have increased the number to 100,000.
Run the TDT option of Mendel 5.0 using Gregor by reading in this control file, writing out the control.in file and selecting the option "Run Mendel". There will be two output files, a summary file and a full output file. Examine them both. Note that the p-value is given as 0.0000. The actual p-value is not 0.0000. It is reported as such because none of the 100,000 samples gave a statistic that was as extreme as or more extreme than the observed statistic. You should report the p-value as "less than 1 x 10^-5" (that is, < 1/samples) in this case.
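The affection coding can be written down as a one-line rule. The sketch below is a hypothetical helper, not part of Mendel; it uses the cutoff and the 1/2 codes given above, and it assumes that a person with no ACE measurement is left blank.

```python
ACE_CUTOFF = -0.648  # threshold used in this lab


def affection_code(ace_level):
    """Return "1" (affected), "2" (unaffected) or "" (unknown).

    Leaving unphenotyped people blank is an assumption of this sketch.
    """
    if ace_level is None:
        return ""
    return "1" if ace_level < ACE_CUTOFF else "2"


# Illustrative ACE levels, including one missing value.
for level in (-1.788, -0.395, None):
    print(level, "->", repr(affection_code(level)))
```

Because the rule is a strict "less than", a person whose level equals the cutoff exactly is coded as unaffected.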
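The reporting rule for a Monte Carlo p-value can be sketched as follows. The function name is hypothetical; the only logic it encodes is the rule stated above, namely that zero exceedances out of N samples should be reported as "less than 1/N" rather than as zero.

```python
def report_p_value(n_extreme, n_samples):
    """Format an empirical p-value from a Monte Carlo test.

    n_extreme is the number of simulated statistics at least as
    extreme as the observed one; n_samples is the number of samples.
    """
    if n_extreme == 0:
        # The estimate is bounded by the resolution of the simulation.
        return f"< {1 / n_samples:.0e}"
    return f"{n_extreme / n_samples:.5f}"


print(report_p_value(0, 100000))
print(report_p_value(250, 100000))
```

Increasing the number of samples lowers the smallest p-value that can be resolved, which is why the control file raises samples from the default.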
(D) Running the gamete competition on a qualitative trait.
We will now analyze the data in pednew.in using the gamete competition. The gamete competition uses data from all the affecteds in the pedigree rather than just the trios with affected children. It allows for missing data.
The control file, congam.in, has the following form:
!input files
LOCUS_FILE=LOCnew.IN
MAP_FILE=MApoxf.IN
PEDIGREE_FILE=PEDnew.IN
! reading input
AFFECTED=1
AFFECTED_LOCUS_OR_FACTOR=ACE
READ_PEDIGREE_RECORDS = F
allele_separator=-
male =1
female=2
!new output
new_pedigree_file=pednew.in
new_locus_file=locnew.in
! analysis options
ANALYSIS_OPTION=gamete_competition
model=2
OUTPUT_FILE=gam.out
Summary_File=gamsum.out
The notable differences between this control file and the one for the TDT are:
(1) No samples are specified (asymptotic p-values only).
(2) There are model options. Models 1 and 2 are for qualitative traits. Models 3 and 4 are for quantitative traits. Models 1 and 3 use the allele frequencies given in the locus file. Models 2 and 4 jointly estimate the allele frequencies.
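The four model options form a two-by-two table, which can be captured in a small lookup. The helper below is hypothetical and exists only to make the table explicit; the model numbers and their meanings are those listed above.

```python
# model number -> (trait type, source of allele frequencies)
MODELS = {
    1: ("qualitative", "locus file"),
    2: ("qualitative", "estimated"),
    3: ("quantitative", "locus file"),
    4: ("quantitative", "estimated"),
}


def choose_model(trait_type, estimate_frequencies):
    """Return the model number for a trait type and frequency choice."""
    source = "estimated" if estimate_frequencies else "locus file"
    for number, description in MODELS.items():
        if description == (trait_type, source):
            return number
    raise ValueError(f"unknown trait type: {trait_type!r}")


print(choose_model("qualitative", True))
print(choose_model("quantitative", True))
```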
Run the gamete competition option of Mendel 5.0 using Gregor by reading in this control file, writing out the control.in file and selecting the option "Run Mendel". Again there will be two output files, a summary file and a full output file. Examine them both and compare the results with the results for the TDT.
Part II: Running the gamete competition on a quantitative trait.
(E) The input files.
The control file, Consnp.in contains:
!input files
MAP_FILE = mapsnp.in
PEDIGREE_FILE = Pedsnp.in
variable_file=varace.in
LOCUS_FILE = locsnp.in
! output files
SUMMARY_FILE = Sumsnp.out
OUTPUT_FILE = Mendsnp.out
! instructions to read input
map_list_read=true
MALE = 1
FEMALE = 2
quantitative_trait=ACE
! analysis specific information
analysis_option=Gamete_competition
MODEL = 4
Transform = STANDARDIZE::ACE
Because we are running a quantitative trait and we want to jointly estimate the allele frequencies, the model option is 4. We need to specify a variable_file and the name of the quantitative trait. We asked that the trait be standardized (subtracting the mean and dividing by the standard deviation), although it isn't necessary in this case because the ACE values have already been standardized in the process of adjusting for age and sex differences.
There are some changes in the locus file and the pedigree file from the first part of the lab. The SNPs have already been combined for you into a single locus. Two of the 8 haplotypes were estimated to be very rare, so they were combined with other haplotypes. The markers are treated as non-codominant, so we must specify the relationship of the phenotypes to the genotypes in the locus file.
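Standardizing a trait to mean 0 and standard deviation 1 can be sketched as below. The use of the sample standard deviation (n - 1 denominator) is an assumption of this example, and the ACE values are illustrative rather than taken from the data set.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD; an assumption of this sketch
    return [(v - mean) / sd for v in values]


ace = [-0.395, -1.788, 0.412, 1.103, 0.050]  # illustrative ACE levels
z = standardize(ace)
print([round(v, 3) for v in z])
```

Applying the transformation to values that are already standardized leaves them essentially unchanged, which is why it is harmless here.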
t469 AUTOSOME 627
ATA 0.40190
ATG 0.00780
ACA 0.06740
ACG 0.18310
TEA 0.01340 TEA is TTA+TCA
TEG 0.32640 TEG is TTG+TCG
Note that because 122 denotes A/A T/C A/G, a double heterozygote, we need to specify that there are two haplotype configurations that are consistent with the multilocus genotype.
111 1
ATA/ATA
.
.
122 2
ATA/ACG
ATG/ACA
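The haplotype configurations consistent with an unphased multilocus genotype can be enumerated mechanically. The sketch below is a hypothetical helper: it fixes the orientation of the first SNP and tries both orientations of each remaining SNP, then removes duplicates. It works with the raw three-SNP haplotypes and does not apply the pooling of rare haplotypes (TEA, TEG) used in the locus file.

```python
from itertools import product


def haplotype_pairs(genotypes):
    """List the unordered haplotype pairs consistent with a genotype.

    genotypes is a list of (allele1, allele2) tuples, one per SNP,
    with phase unknown.
    """
    first, rest = genotypes[0], genotypes[1:]
    pairs = set()
    for flips in product([False, True], repeat=len(rest)):
        hap1, hap2 = first[0], first[1]
        for (a, b), flip in zip(rest, flips):
            if flip:
                a, b = b, a
            hap1 += a
            hap2 += b
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


# Code 122: A/A at the first SNP, T/C at the second, A/G at the third.
code_122 = [("A", "A"), ("T", "C"), ("A", "G")]
for h1, h2 in haplotype_pairs(code_122):
    print(f"{h1}/{h2}")
```

A genotype that is homozygous at every SNP, such as 111, yields a single configuration, while each additional heterozygous SNP beyond the first doubles the count.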
We will also run the SNPs as single loci. These have also been coded with a single number designation so we need to "decode" them in the locus file.
Snp4 AUTOSOME 2 3
A 0.80000
T 0.20000
1 1
A/A
2 1
A/T
3 1
T/T
Finally, I have used a Fortran format to read in the single loci snp4, snp6 and snp9 as well as the multilocus SNP genotype for SNPs 4, 6, and 9 combined.
(3X,I5,A8)
(16X,3A8,7X,2A1,T69,5X,3A1,T69,2(2X,2A8))
10 1
1 1 -0.395
2 2 112 112 122 122 -1.788
This is an "old style" MENDEL pedigree file. The first Fortran format statement reads in the number of individuals in the pedigree and the family id number. The second Fortran format statement reads the information for each individual. There are data for 4 multilocus SNP combinations and we want to use only the last one. We could set this up through the map file, but here I have just skipped over all the data I didn't want to include using T69 (tab to the 69th column). I first read in the data for the individual SNPs, then I return to the same column position and read in the data as a multilocus SNP.
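The trick of reading the same columns twice can be illustrated with a tiny interpreter for the three edit descriptors involved: nX skips n columns, Tc moves to column c, and Aw reads w characters. The reader below is a hypothetical sketch that handles only those descriptors, and the record is made up for illustration; it does not reproduce the exact column layout of pedsnp.in.

```python
def read_fixed(line, descriptors):
    """Apply a list of (kind, n) Fortran-style descriptors to a line.

    ("X", n) skips n columns, ("T", c) jumps to 1-based column c,
    ("A", w) reads a w-character field.
    """
    fields, pos = [], 0  # pos is a 0-based cursor
    for kind, n in descriptors:
        if kind == "X":
            pos += n
        elif kind == "T":
            pos = n - 1
        elif kind == "A":
            fields.append(line[pos:pos + n])
            pos += n
        else:
            raise ValueError(f"unsupported descriptor: {kind}")
    return fields


# Equivalent to the fragment T69,5X,3A1,T69,2X,A8.
fragment = [("T", 69), ("X", 5), ("A", 1), ("A", 1), ("A", 1),
            ("T", 69), ("X", 2), ("A", 8)]

# Made-up record: 68 blank columns, then the code 122 in columns 74-76.
record = " " * 68 + "     122  "
fields = read_fixed(record, fragment)
print(fields[:3], repr(fields[3]))
```

The first pass picks up the three single-SNP codes one character at a time; after the second T69 the same characters are read again as one eight-character multilocus field.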
(F) Run Mendel 5.0 using the Gregor interface. Load in Consnp.in, write a new control.in file and run.
The output:
There is a summary file that should look like:
MARKER  P-VALUE  MAX OMEGA  FREQ     ALLELE  MIN OMEGA  FREQ     ALLELE
NAME                                 NAME                        NAME
Snp4    0.00000  1.07786    0.33534  T       0.00000    0.66466  A
Snp6    0.00000  0.00000    0.57608  C       -1.21367   0.42392  T
Snp9    0.00000  1.40464    0.51465  G       0.00000    0.48535  A
t469    0.00000  1.52939    0.32207  TEG     0.00000    0.40515  ATA
There is also a more complete output file with the actual test statistics, all parameter estimates and their standard errors. The statistics are:
THE LIKELIHOOD RATIO TEST STATISTIC IS 0.4917E+02 AT LOCUS Snp4.
THE LIKELIHOOD RATIO TEST STATISTIC IS 0.6006E+02 AT LOCUS Snp6.
THE LIKELIHOOD RATIO TEST STATISTIC IS 0.7655E+02 AT LOCUS Snp9.
THE LIKELIHOOD RATIO TEST STATISTIC IS 0.8129E+02 AT LOCUS t469.
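To see why the summary file prints these p-values as 0.00000, the likelihood ratio statistics can be converted to asymptotic p-values. The sketch below assumes a chi-square reference distribution with degrees of freedom equal to the number of alleles minus one (1 for each SNP, 5 for the six-haplotype locus t469); that choice is an assumption of this example, so check the full output file for the degrees of freedom Mendel actually uses. The survival function is written out from the standard closed forms so that only the standard library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable (integer df)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus a finite series in x / (1*3*5*...).
    term, total = 1.0, 0.0
    for j in range((df - 1) // 2):
        if j > 0:
            term *= x / (2 * j + 1)
        total += term
    tail = math.sqrt(2 * x / math.pi) * math.exp(-half) * total
    return math.erfc(math.sqrt(half)) + tail


# locus -> (likelihood ratio statistic, number of alleles)
results = {"Snp4": (49.17, 2), "Snp6": (60.06, 2),
           "Snp9": (76.55, 2), "t469": (81.29, 6)}
p_values = {name: chi2_sf(stat, n_alleles - 1)
            for name, (stat, n_alleles) in results.items()}
for name, p in p_values.items():
    print(f"{name}: p = {p:.2e}")
```

Under these assumptions every p-value is far below 0.000005, so each one rounds to 0.00000 at the five decimal places shown in the summary file.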
(G) NO Homework - Please start working on your project data. In next week's laboratory I will reserve time at the end for you to get help running Mendel with your project data if you are having problems.