SNP in James Watson’s genome were obtained from dbSNP using the limits option with the following settings:

Individual SNP: Watson

SNP Class: SNP

The query returned 3,303,479 SNP.

The data were downloaded in “Flatfile” format and filtered with the script parseWatsonFlatfile.pl to keep only SNP with a unique mapping to the human reference assembly GRCh37 (Hg19).
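The filtering step can be sketched in Python as follows. This is not parseWatsonFlatfile.pl, which is not reproduced here; it is a minimal illustration that assumes "unique mapping" means exactly one CTG line for the chosen assembly, and that fields are separated by "|" as in the example record. The function names `parse_fields` and `unique_mapping` are hypothetical.

```python
from typing import Dict, List, Optional


def parse_fields(line: str) -> List[str]:
    """Split one flatfile line on '|' and trim whitespace around each field."""
    return [field.strip() for field in line.split("|")]


def unique_mapping(record_lines: List[str], assembly: str = "GRCh37") -> Optional[Dict]:
    """Return chromosome, position and orientation if the record has exactly
    one CTG line for the given assembly; otherwise return None.

    Assumption: a "unique mapping" is a single CTG line for the assembly.
    """
    hits = []
    for line in record_lines:
        fields = parse_fields(line)
        if fields[0] != "CTG":
            continue
        # Keep key=value fields; bare fields such as the contig accession are skipped.
        attrs = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
        if attrs.get("assembly") == assembly:
            hits.append(attrs)
    if len(hits) != 1:
        return None
    hit = hits[0]
    if not hit.get("chr-pos", "").isdigit():
        return None  # no numeric chromosome position, so treat as unmapped
    return {"chr": hit["chr"], "pos": int(hit["chr-pos"]), "orient": hit["orient"]}


# Lines taken from the rs2034920 example record in these notes.
record = [
    "SNP | alleles=C/T | het=0.25 | se(het)=0.2500",
    "CTG | assembly=Celera | chr=X | chr-pos=135324510 | NW_927722.1 | ctg-start=726 | ctg-end=726 | loctype=2 | orient=-",
    "CTG | assembly=GRCh37 | chr=X | chr-pos=134948034 | NT_011786.16 | ctg-start=19215744 | ctg-end=19215744 | loctype=2 | orient=-",
    "CTG | assembly=HuRef | chr=X | chr-pos=124239596 | NW_001842404.1 | ctg-start=1451 | ctg-end=1451 | loctype=2 | orient=-",
]

print(unique_mapping(record))
```

A record with two GRCh37 CTG lines, or none, returns None and would be dropped under this interpretation.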

This produced 1,982,981 SNP. Many SNP had multiple annotated “reference” alleles, which I do not understand. Mappings to other assemblies were also in the file, but the annotation does not specify which assembly a “reference” allele refers to. The reference alleles were associated with different LocusIDs, but I am surprised that there are multiple loci at one genomic position. I have written to dbSNP for clarification on this, but they are not usually very responsive.

SNP were reported on either the forward or the reverse strand, apparently at random. parseWatsonFlatfile.pl complemented SNP in negative orientation and indicated this by setting strand to “plus (complemented)”. As it was not clear which allele was the reference allele at each position, I obtained the reference allele from the GRCh37 assembly for each position using testGetChrDynamicOnWatson.pl. This checked that one of the two alleles for the SNP equaled the reference. At 1,816 (<0.1%) loci neither allele matched the reference. These may be triallelic SNP but were discarded. The high percentage of matches indicates that the mapping was probably accurate. The reference allele corresponded to the first or second allele given in about half the cases each, so the ordering of the alleles in dbSNP appears to be random. testGetChrDynamicOnWatson.pl sorted the two alleles so that the reference allele appeared first in the list. An example entry in Flatfile format, with multiple different reference sequences, is shown below.

Positions were mapped back to Hg18 using liftOver; 230 of 1,980,976 lines did not map from Hg19 to Hg18. The output files from liftOver were converted from BED format to workflow input format using convertBedToWorkflowIn.pl. All lines had a double underscore in the metadata, which was removed with “sed -i 's/__/_/' filename”; trailing underscores were removed with “sed -i 's/_$//' filename”.

The data files have no headers. The columns are:

Chromosome / position / reference / genotype / dbSNP id / concatenated metadata

The concatenated metadata field is “chromosome, position, reference, genotype, dbSNP id, heterozygote ratio, strand, dbSNP annotation”.
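The strand handling and allele ordering described above can be sketched in Python. This is an illustration only, not the code in parseWatsonFlatfile.pl or testGetChrDynamicOnWatson.pl. The function names are hypothetical, the strand label "plus" for uncomplemented SNP is my assumption (the notes only give the label “plus (complemented)”), and the reference base in the example is made up rather than looked up in GRCh37.

```python
from typing import List, Optional, Tuple

# Single-base complement table; for one-base alleles this equals the reverse complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def to_plus_strand(alleles: List[str], orient: str) -> Tuple[List[str], str]:
    """Complement alleles reported in '-' orientation and return a strand label."""
    if orient == "-":
        return [a.translate(COMPLEMENT) for a in alleles], "plus (complemented)"
    return list(alleles), "plus"  # assumed label for SNP already on the plus strand


def reference_first(alleles: List[str], ref_base: str) -> Optional[List[str]]:
    """Reorder alleles so the reference allele comes first.

    Returns None when no allele matches the reference base, in which case
    the locus would be discarded.
    """
    if ref_base not in alleles:
        return None
    return [ref_base] + [a for a in alleles if a != ref_base]


# rs2034920 is annotated alleles=C/T with orient=- on GRCh37.
alleles, strand = to_plus_strand(["C", "T"], "-")
print(alleles, strand)

# Hypothetical reference base, for illustration only.
print(reference_first(alleles, "A"))
print(reference_first(alleles, "C"))
```

Returning None for a mismatch keeps the discard decision with the caller, so mismatching loci can be counted before they are dropped.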

testGetChrDynamicOnWatson.pl was also used to compare the speed of fetching chromosome sequences dynamically from Ensembl with that of unzipping them from local gz files. A run took about forty minutes fetching data directly from Ensembl and about four minutes using local files. A second run fetching data from Ensembl was much faster, so the remote timing evidently depends on network speed.
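For the local-file route, a reference base lookup might look like the Python sketch below. It assumes one gzipped FASTA file per chromosome containing a single sequence, and 1-based positions as in the flatfile; the function name `base_at` is hypothetical, and this is not how testGetChrDynamicOnWatson.pl is written.

```python
import gzip


def base_at(fasta_gz_path: str, position: int) -> str:
    """Return the upper-case base at a 1-based position in a gzipped FASTA file.

    Assumes the file holds a single sequence; header lines starting with '>'
    are skipped and sequence lines are counted in order.
    """
    if position < 1:
        raise ValueError("position must be 1 or greater")
    seen = 0  # number of bases read so far
    with gzip.open(fasta_gz_path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                continue
            seq = line.strip()
            if position <= seen + len(seq):
                return seq[position - seen - 1].upper()
            seen += len(seq)
    raise ValueError("position is beyond the end of the sequence")
```

This streams the file once per lookup, which is fine for a spot check. For nearly two million positions it would be much faster to read each chromosome into memory once and index into the string.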

rs2034920 | Homo sapiens | 9606 | snp | genotype=YES | submitterlink=YES | updated 2010-03-10 14:21

ss2944057 | TSC-CSHL | TSC1048850 | orient=+ | ss_pick=YES

ss3428840 | SC_JCM | AC021135.4_32988 | orient=+ | ss_pick=NO

ss8529251 | SC_SNP | NT_011786.13_10516297 | orient=- | ss_pick=NO

ss24730596 | PERLEGEN | afd4202396 | orient=- | ss_pick=NO

ss24786399 | SEQUENOM | sqnm63986 | orient=+ | ss_pick=NO

ss43661523 | ABI | hCV26955825 | orient=- | ss_pick=NO

ss77807861 | HGSV | Cor12156_SNV_20070510.chrX_134673554 | orient=- | ss_pick=NO

ss86171279 | HGSV | Cor18517_SNV_20070510.chrX_134673554 | orient=- | ss_pick=NO

ss94408327 | BCMHGSC_JDW | JWB-2707043 | orient=- | ss_pick=NO

ss105766884 | BGI | BGI_rs2034920 | orient=- | ss_pick=NO

ss113028958 | 1000GENOMES | CEU.trio.12.15.2008_3916722_chrX_134775700 | orient=- | ss_pick=NO

ss115656873 | ILLUMINA-UK | NA18507_000074277_NCBI36.1_chrX_134775700 | orient=- | ss_pick=NO

ss142771783 | ENSEMBL | ENSSNP9152744 | orient=- | ss_pick=NO

ss157738254 | GMI | GMI_SNP_260052379 | orient=- | ss_pick=NO

ss159351186 | ILLUMINA | Human660W-Quad_v1_A_rs2034920-128_T_R_1549860631 | orient=- | ss_pick=NO

ss159745778 | SEATTLESEQ | RP13-36C9.6-134775700 | orient=- | ss_pick=NO

SNP | alleles=C/T | het=0.25 | se(het)=0.2500

VAL | validated=YES | min_prob=? | max_prob=? | notwithdrawn

CTG | assembly=Celera | chr=X | chr-pos=135324510 | NW_927722.1 | ctg-start=726 | ctg-end=726 | loctype=2 | orient=-

CTG | assembly=GRCh37 | chr=X | chr-pos=134948034 | NT_011786.16 | ctg-start=19215744 | ctg-end=19215744 | loctype=2 | orient=-

CTG | assembly=HuRef | chr=X | chr-pos=124239596 | NW_001842404.1 | ctg-start=1451 | ctg-end=1451 | loctype=2 | orient=-

LOC | LOC100133581 | locus_id=100133581 | fxn-class=coding-synonymous | allele=C | frame=3 | residue=N | aa_position=96

LOC | LOC100133581 | locus_id=100133581 | fxn-class=reference | allele=T | frame=3 | residue=N | aa_position=96

LOC | CT45A5 | locus_id=441521 | fxn-class=coding-synonymous | allele=C | frame=3 | residue=N | aa_position=96

LOC | CT45A5 | locus_id=441521 | fxn-class=reference | allele=T | frame=3 | residue=N | aa_position=96

LOC | CT45A6 | locus_id=541465 | fxn-class=reference | allele=C | frame=3 | residue=N | aa_position=97

LOC | CT45A6 | locus_id=541465 | fxn-class=coding-synonymous | allele=T | frame=3 | residue=N | aa_position=97

SEQ | 61657911 | source-db=blastmb | seq-pos=536 | orient=+

SEQ | 239752428 | source-db=blastmb | seq-pos=370 | orient=+

SEQ | 239757917 | source-db=blastmb | seq-pos=294 | orient=+