SOP: How I calculate the EHH significance

This SOP describes how I calculate whether the EHH of a given haplotype differs significantly from the EHH of other haplotypes.

In Sweep

I load the data of the chromosomal region against which I want to test. The resulting significance level depends on the region used. I commonly use a region of 2.5 million base pairs each upstream and downstream of my region of interest (5 Mbp in total).
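The window boundaries can be worked out with a few lines of Python. This is only a minimal sketch: the function name `flanked_window` and the example coordinates are hypothetical and not taken from Sweep or from real data.

```python
FLANK_BP = 2_500_000  # flank on each side, as used in this SOP


def flanked_window(roi_start, roi_end, flank=FLANK_BP):
    """Return (start, end) of the region to load around a region of interest."""
    if roi_start > roi_end:
        raise ValueError("roi_start must not exceed roi_end")
    # Positions are 1-based, so the window cannot start before base 1.
    return max(1, roi_start - flank), roi_end + flank


# Hypothetical region of interest (illustrative coordinates only)
print(flanked_window(56_000_000, 56_050_000))
```

The loaded region is then the two 2.5 Mbp flanks plus the length of the region of interest itself.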

I alter the default setting for core selection:
- I select “Set Cores” at the lower left corner of the Sweep window.
- I deselect “No more than -20- SNPs”.
- I close the “Set Cores” window.

I leave the remaining default settings unaltered (e.g. the distance of 300 Kb).

I select the pull-down menu “File”, then “Export data” and then “EHH vs. Frequency data”.

I select a place to save my file and a filename, e.g. “downloadname_EHH_vs_frequency”. I do not give the file an extension.

I select the pull-down menu “Tools” and then “EHH significance calculator”. A window pops up.

I leave the “Number of Frequency Bins: -20-” unaltered.

In the window’s lower left I press the “+”-button. A window pops up.

I open the file I prepared.

In the list that appears, I select the file I prepared.

I press “Calculate” at the lower right of the window. A window pops up.

I select a place to save the new file and a filename, e.g. “downloadname_EHH_significance”.

I am finished and can close the calculator and Sweep.

In Excel

I open the file I just saved in Excel (right-click, “Open” / “Open with”, then select Excel). A window pops up.

To import the data correctly,
- I select “Weiter” (“Next”) on the first screen,
- I select “Weiter” (“Next”) on the second screen,
- I press “Weitere…” (“Advanced…”) on the third screen and then
  - I select “.” as “Dezimaltrennzeichen” (decimal separator),
  - I select “ ” (space) as “1000er-Trennzeichen” (thousands separator),
  - I deselect “Nachstehendes Minuszeichen für negative Zahlen” (trailing minus for negative numbers),
  - I press “OK”.
- I then select “Fertigstellen” (“Finish”).

I save the file as an .xls file using the same file name as above.

SNPs of interest

I can find my SNP of interest using Excel’s search function (Ctrl+F).

Note that the entries under “Genes in Region” are presumably misattributed, because Sweep relies on older genome builds.

Region of the gene of interest

-> compare columns 40 to 43 in the sample file on SLC12A3

To find the region of my gene of interest, I correct the positions of the linkage blocks:
- I access my gene of interest in NCBI dbSNP:
  - I enter the gene,
  - I select the tab “Human: ____”,
  - I select any SNP,
  - under “Gene view” I select “in gene region” and press “Go”.
- I look for any SNP with the H symbol (H for HapMap) in the “Validation” column.
- I look for this SNP in the EHH_significance.xls file.
- When I have found one SNP from dbSNP in the xls file, I look for the SNPs at the start or end of its linkage block (columns 8 and 9 in the xls file) and check whether those are listed in dbSNP. I proceed with one of these SNPs.
- I copy the chromosomal position of one of these SNPs from the second dbSNP column to the EHH_significance.xls file.
- Using the chromosomal position, I calculate corrected positions for the start and end bases of the linkage blocks.
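The position correction can be sketched in Python. This sketch assumes that the difference between the Sweep position and the current dbSNP position of the reference SNP is a constant shift that applies to the whole linkage block; that is an assumption, not something Sweep guarantees. The function names and all coordinates are hypothetical.

```python
def position_offset(dbsnp_position, sweep_position):
    """Shift between the current dbSNP position and the position used by Sweep."""
    return dbsnp_position - sweep_position


def correct_block(block_start, block_end, offset):
    """Apply the same shift to the start and end base of a linkage block.

    Assumption: the shift is constant across the block.
    """
    return block_start + offset, block_end + offset


# Hypothetical reference SNP and linkage block (illustrative numbers only)
offset = position_offset(dbsnp_position=55_465_000, sweep_position=55_460_000)
print(correct_block(55_455_000, 55_470_000, offset))
```

If several reference SNPs are available, comparing their offsets is a simple check of whether the constant-shift assumption holds in the region.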

I search for my gene of interest in NCBI dbGene. I find the chromosomal position under “Genomic context” at the end of the “Sequence” line.

Back in the EHH_significance.xls file, I can identify which linkage blocks belong to my gene of interest.
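Identifying the relevant linkage blocks amounts to an interval overlap check. The following is a minimal sketch, assuming that a block counts as belonging to the gene as soon as it overlaps the gene by at least one base; the function names and coordinates are hypothetical.

```python
def overlaps(block_start, block_end, gene_start, gene_end):
    """True if the closed intervals [block_start, block_end] and
    [gene_start, gene_end] share at least one base."""
    return block_start <= gene_end and gene_start <= block_end


def blocks_in_gene(blocks, gene_start, gene_end):
    """Keep the (start, end) blocks that overlap the gene."""
    return [b for b in blocks if overlaps(b[0], b[1], gene_start, gene_end)]


# Hypothetical corrected block positions and gene coordinates
blocks = [(100, 200), (250, 400), (450, 600), (700, 800)]
print(blocks_in_gene(blocks, gene_start=300, gene_end=500))
```

Blocks that only partly overlap the gene are kept on purpose, since a block boundary rarely coincides with a gene boundary.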

p-values

I find the p-values in the last four columns.
Note that the logarithms of the p-values are given as positive numbers, although by calculation they must be negative (the logarithm of a value below 1 is negative).
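The sign issue can be made concrete with a small conversion back to p-values. This sketch assumes that the file reports the base-10 logarithm with the sign dropped; the base is an assumption that should be checked against the plain p-value columns. The function name is hypothetical.

```python
def p_from_reported_log(reported_log, base=10.0):
    """Convert a log p-value that is reported as a positive number back to p.

    Assumption: the reported value is the logarithm (base 10 by default)
    with its sign dropped.
    """
    # abs() makes the conversion independent of the sign convention.
    return base ** (-abs(reported_log))


print(p_from_reported_log(2.0))
```

Under this assumption, a reported value of 2 corresponds to p = 0.01, and larger reported values mean smaller p-values.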

In the SLC12A3 sample, the third most frequent haplotype in the linkage block from SNPs 260-267 is significantly selected. Adjustment for multiple testing is debatable but is not suitable for an exploratory study. The results have to be interpreted with corresponding caution.

Comparing the haplotype sequences in column 17, one can see that the variant alleles of the second, third and fourth SNP occur exclusively in this haplotype.
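This comparison can also be done programmatically. The sketch below assumes that each haplotype is a string with one character per SNP and that all strings have the same length. The sequences shown are hypothetical and are not the actual SLC12A3 data, and the function name is illustrative.

```python
def exclusive_positions(haplotypes, target_index):
    """Return 1-based SNP positions at which the target haplotype carries
    an allele that none of the other haplotypes carries."""
    target = haplotypes[target_index]
    others = [h for i, h in enumerate(haplotypes) if i != target_index]
    if any(len(h) != len(target) for h in others):
        raise ValueError("all haplotypes must have the same length")
    return [
        pos + 1
        for pos, allele in enumerate(target)
        if all(other[pos] != allele for other in others)
    ]


# Hypothetical haplotype sequences for a block of eight SNPs
haplotypes = [
    "ACGTACGT",
    "ACGTACGA",
    "ATACACGT",  # third haplotype
    "ACGTGCGT",
    "ACGTACTT",
]
print(exclusive_positions(haplotypes, target_index=2))
```

With these made-up sequences, the third haplotype is exclusive at the second, third and fourth SNP, which mirrors the pattern described above.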

From column 5, I can see that these are the variant alleles of rs8063870, rs8054731 and rs1529930.

Comparing the EHH or REHH (columns 19 and 20) among the five haplotypes in this linkage block, I see that the EHH of haplotype three is greater than that of the others. This implies that haplotype three was under positive (not negative) selection.

It remains unclear which of these variant alleles, or whether a linked allele, caused the selection.