A Quality Control Algorithm for Filtering SNPs in GWAS

Monnat Pongpanich ()

December 7, 2009

List of acronyms

MAF: minor allele frequency

MSP: missing proportion

HWE: Hardy Weinberg Equilibrium

cs: affected individuals

cn: unaffected individuals

all: combined affected and unaffected individuals

Background

This program filters SNPs for quality control purpose based on SNP’s QC variables value. It requires one input dataset and will give a list of filtered SNP as the final output.

Input: The input dataset has N rows by 8 columns. Each row represents SNP and the 8 columns are 1) SNP rsid 2) MSPcs 3) MSPcn 4) MSPall 5) MAFcs 6) MAFcn 7) MAFall 8) HWE P value respectively. The header for the input dataset is required and the column must be in the order mentioned above. Each column is separated by space. An example of input file is given in “input” folder.

Output: The output has one column: filtered SNP rsid.

Running

This program can be run in two modes: step by step and batch mode.

To run the program type “./main Rbindir Rsourcedir Cbindir

Rbindir is the directory of R bin folder

Rsourcedir is the directory of Rsource folder of this program

Cbindir is the directory of Cbinary folder of this program

For example, at the program installed path,

Type “A” if you want to run in step by step mode. Type “B” if you want to run in batch mode.

A. Step by step mode has 11 steps

Step 1: Type “1” to run this step. This step calculates log(MAF pool), MSPcs/MAFcs, and MSPcn/MAFcn. Zero MAF will be replaced with minimum of MAFall besides 0.

Input parameters:

1) Full path of input file

2) Directory that will keep the output files from this step

Output:

1) “preinitial.txt” file has 3 extra calculated columns from the input file.

Step2: Type “2” to run this step. This step will pre-clean SNPs that has HWE P value less than a specified cutoff value.

Input parameters:

1) Full path of input file

2) Directory that keeps the output files from step 1

3) Directory that will keep the output files from this step

4) Cutoff for HWE P value

Output:

1) “freqall.txt” has one columns: MAFall which zero value has not been replace.

2) “freqnozero.txt” has three columns: MAFcs, MAFcn, MAFall which zero value has been replaced.

3) “initial.txt” has eight columns: id, rsid, MSPcs, MSPcn, MSPall, log(freqall), MSPcs/MAFcs, MSPcn/MAFcn

Step 3: Type “3” to run this step. This step will apply PCA on 6 variables: MSPcs, MSPcn, MSPall, log(MAFall), MSPcs/MAFcs, MSPcn/MAFcn

Input parameter:

1) Directory that keeps the output files from step 2

2) Directory that will keep the output files from this step

Output:

1) “pca.txt” has three columns: id, pc1, pc2

2) “importance.txt” gives the standard deviations of the principal components

3) “rotation.txt” gives the matrix of variable loadings

Step 4: Type “4” to run this step. This step calculates k-th nearest neighbour distance for each SNP.

Input parameters:

1) Directory that keeps the output files from step 3

2) Directory that will keep the output files from this step

3) Minimum number of points in the neighbourhood (MinPts; four is recommended)

Output:

1) “fourdist.txt” has one column: k-th nearest neighbour distance

Step 5: Type “5” to run this step. This step generates k-nearest neighbour graph used for initial guess of changing point step.

Input parameter:

There are two r values that can be calculated: “suggested r” and “upper r”. Suggested r is the value we recommend to use in DBSCAN and upper r is the upper limit of r. User can fine-tune r value but we recommend not to go beyond upper r value.

1) Directory that keeps the output files from step 4 for finding suggested r or directory that keeps the output file from step 8 for finding upper r

2) Directory that will keep the output files from this step

3) Enter “1” if you want to calculate suggested r in step 6

Enter “2” if you want to calculate upper r in step 6

Output:

1) k-th nearest neighbour graph (“fourdist.tif” for suggested r and “fourdistupper.tif” for upper r).

Step 6: Type “6” to run this step. This step will calculate r and change point.

Input parameters:

1) Directory that keeps the output files from step 4 for finding suggested r or directory that keeps the output file from step 8 for finding upper r

2) Directory that will keep the output files from this step

3) In some cases, there are two curves in the graph. We only want the change point in the second curve; therefore, we exclude points around the first curve region. For example, in the figure below, the starting point is 270,000

4) In some cases, there are outlier points at the tip. We exclude them so that it will not influence the change point answer. For example, in the following figure, the end SNP is 472,000 (with the staring SNP is 270,000)

The following parameters are initial guess

5) Initial guess for y-intercept

6) Slope of the first line

7) Slope of the second line

8) Change point

9) sd

10) Enter “1” if you want to calculate suggested r

Enter “2” if you want to calculate upper r

Output:

1) “epssuggested.txt” (for suggested eps) or “epsupperbound.txt” (for upper r) file that contain value of y-intercept (alpha), slope of line 1 (beta1), slope of line 2 (beta2), change point (cp), sd (sd), r value (eps), the value of function to be minimized (objective), an integer code: 0 indicates successful convergence (convergence), number of iteration performed (iterations).

User must make sure that it has successful convergence (0) before continuing the next step. If the program does not converge, change the initial guess.

Step 7: Type “7” to run this step. This step run DBSCAN

Input arguments:

1) Directory that keeps the output files from step 3

2) Directory that will keep the output files from this step

3) r value (answer from step 6)

4) Minimum number of points in the neighbourhood (four is recommended)

Output:

1) “dbscan.txt” has two columns: snpid and cluster id

Step 8: Type “8” to run this step. This step will prepare the file needed for calculating upper r. Then to find upper r, user has to re-run step 5 and 6 but enter 2 for the last parameter.

Note: If user does not want to find upper r. This step can be omitted.

Input parameters:

1) Directory that keeps the output files from step 6

2) Directory that keeps the output files from step 4

3) Directory that will keep the output files from this step

Output:

1) “fourdistupper.txt” has one column: k-th nearest neighbour distance of point that is in the right of change point in step 6

Step 9: Type “9” to run this step. This step will prepare the file needed for monotonicity step.

Input parameters:

1) Directory that keeps the output files from step 7

2) Directory that keeps the output files from step 2

3) Directory that will keep the output files from this step

4) Number of people for calculating epsilon in cases. Epsilon is calculated from #of people/# of cases. For example, if user enter 1 and there are 1944 cases, the epsilon for cases is 1/1944

5) Number of people for calculating epsilon in controls. Epsilon is calculated from #of people/# of controls. For example, if user enter 1 and there are 3004 controls, the epsilon for control is 1/3004

6) Number of people for calculating epsilon in pool. Epsilon is calculated from #of people/( # of cases + # of controls). For example, if user enter 1 and there are 1944 cases, 3004 controls, the epsilon for pool is 1/(1944+3004)

We recommend enter 1 for all epsilon

7) Number of cases in the study

8) Number of controls in the study

Output:

1) “largestcluster.txt” has 10 columns: id, rsid, MSPcs, MSPcn, MSPall, MAFcs, MAFcn, MAFall, MSPcs/MAFcs, MSPcn/MAFcn. It contains only SNPs that are in the largest cluster from DBSCAN

2) “othercluster.txt” has 10 columns: same as largestcluster.txt but contains SNPs which are not kept in largestcluster.txt

Step 10: Type “10” to run this step. This step finds which SNP should also be kept for monotonicity.

Input parameters:

1) Directory that keeps the output files from step 9

2) Directory that will keep the output files from this step

Output:

1) “bringback.txt” has one column: id of snp that will also be kept

Step 11: Type “11” to run this step. This step finalize the answer: filtered SNPs

Input parameters:

1) Directory that keeps the output files from step 7

2) Directory that keeps the output files from step 10

3) Directory that will keep the output files from this step

Output:

1) “filteredSNPs.txt” has one column: rsid of kept SNPs

Type “0” to go back to the main menu

B. Batch mode

Batch mode will ask for full path of input file, the directory that will keep all files generated from this program. It will then stop at a certain point to ask for required parameters needed at that moment, e.g., HWE P value, number of neighbour (MinPts). The explanation for these parameters is in step by step section. It will automatically exit the program when it is done.

Note: please make sure before entering r value that the program converges.