5.9 Final Protocol for Mitochondrial and Nuclear Capture Enrichment

JRA2-Report October 2013


The following modifications and strategies for capture enrichment are based on the protocol presented in Deliverable 5.7.

Library pooling and Double capture - Providing better conditions for hybridization reaction

By comparing sequencing results from samples in different states of preservation, we determined that preparing a single library, followed by a single capture and a single sequencing run, does not yield sufficient data for subsequent population genetic analysis.

At various steps of the NGS workflow, enzymatic reactions are applied to prepare samples for Illumina sequencing. Because no individual enzymatic reaction is 100 % efficient, each library undergoes a random loss of DNA molecules that is unique to that library.

Combining different libraries compensates for these stochastic losses: pooling several DNA libraries prepared from the same extract increases the complexity of the DNA molecules.

We modified the pipeline to increase the amount of unique information obtained from a sample in the most efficient and least expensive way. By pooling aliquots of multiple libraries from one sample in a single enrichment and sequencing run, the amount of unique information gained from that run can be increased by a factor corresponding to the number of libraries.

If the capture is made from independently prepared libraries, we obtain 100 % more unique sequencing reads. Every unique read adds information.

If the capture is made from exactly the same library, the gain in complexity is only 4.6 %.

Conversely, we did not observe that sequencing a second, independent capture product of the same library provides more unique sequencing reads; this is shown for two different samples (Fig. 1).
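
The gain in unique reads described above can be sketched as a set comparison. This is a minimal illustration, not the pipeline used for Fig. 1: it assumes that each unique molecule is identified by the chromosome, start and end of its mapped fragment, and the function name and all coordinates are hypothetical toy values.

```python
def unique_read_gain(first, second):
    """Percent increase in unique molecules when `second` is added to `first`."""
    first, second = set(first), set(second)
    if not first:
        raise ValueError("first library must contain at least one molecule")
    # Only molecules not already seen in the first data set count as a gain.
    new_molecules = second - first
    return 100.0 * len(new_molecules) / len(first)


# Hypothetical molecule identifiers: (chromosome, start, end) of each fragment.
lib_a = {("chr15", 100, 160), ("chr15", 220, 275),
         ("chrM", 10, 70), ("chrM", 300, 352)}
# An independently prepared library: different molecules survived preparation.
lib_b = {("chr15", 105, 171), ("chr15", 400, 455),
         ("chrM", 12, 80), ("chrM", 500, 561)}
# A second capture of library A: it can only re-sample molecules from A.
recapture_of_a = {("chr15", 100, 160), ("chrM", 10, 70), ("chrM", 300, 352)}

independent_gain = unique_read_gain(lib_a, lib_b)
same_library_gain = unique_read_gain(lib_a, recapture_of_a)
print(f"independent library: +{independent_gain:.1f} % unique molecules")
print(f"same library again:  +{same_library_gain:.1f} % unique molecules")
```

In this toy data the independent library doubles the number of unique molecules, whereas the repeated capture of the same library adds none; real data will fall between these extremes.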

The quality of a capture experiment is defined as the proportion of DNA library fragments obtained at the end of the enrichment process that can be assigned exclusively to the target region, the so-called on-target ratio.
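
As a rough sketch, the on-target ratio can be computed from mapped read coordinates and the target intervals. The code below assumes that a read counts as on-target when it overlaps a target interval by at least one base; the actual assignment criterion, the function names and the coordinates are assumptions made for illustration.

```python
def overlaps(read_start, read_end, target_start, target_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return read_start < target_end and target_start < read_end


def on_target_ratio(reads, targets):
    """Fraction of reads that overlap any target interval on the same chromosome."""
    if not reads:
        return 0.0
    hits = 0
    for chrom, start, end in reads:
        if any(chrom == t_chrom and overlaps(start, end, t_start, t_end)
               for t_chrom, t_start, t_end in targets):
            hits += 1
    return hits / len(reads)


# Hypothetical target interval and mapped reads as (chromosome, start, end).
targets = [("chr15", 1000, 2000)]
reads = [
    ("chr15", 1950, 2010),  # overlaps the target
    ("chr15", 2000, 2060),  # starts exactly at the target end: off-target
    ("chr15", 500, 560),    # upstream of the target
    ("chr1", 1200, 1260),   # same coordinates, different chromosome
]
ratio = on_target_ratio(reads, targets)
print(f"on-target ratio: {ratio:.2f}")
```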

Because our target region is relatively small (175 kb), we recommend capturing each sample twice in succession to increase its on-target ratio.

This approach provides a larger number of molecules to saturate the baits during the second hybridization reaction.

The enriched library output of the first capture is used as input for the second capture. Both hybridization steps are incubated overnight. The increased on-target ratio furthermore resulted in greater coverage depth.
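
Why a second round helps for a small target can be illustrated with a simple model. This is a sketch under strong assumptions, not a description of measured behaviour: each capture round is assumed to enrich target over non-target molecules by the same constant factor, the starting on-target fraction is approximated as target size divided by an assumed human genome size of about 3.1 Gb, and the fold value is a hypothetical placeholder.

```python
TARGET_SIZE_BP = 175_000        # target region size stated in the text
GENOME_SIZE_BP = 3_100_000_000  # assumption: approximate human genome size
ASSUMED_FOLD = 1000.0           # hypothetical enrichment factor per capture round


def on_target_after_capture(p_before, fold):
    """On-target fraction after one round that enriches target molecules `fold` times."""
    if not 0.0 < p_before < 1.0:
        raise ValueError("p_before must lie strictly between 0 and 1")
    if fold <= 0.0:
        raise ValueError("fold must be positive")
    enriched = fold * p_before
    # Renormalise: enriched target molecules relative to all retained molecules.
    return enriched / (enriched + (1.0 - p_before))


p0 = TARGET_SIZE_BP / GENOME_SIZE_BP          # before any capture
p1 = on_target_after_capture(p0, ASSUMED_FOLD)  # after the first capture
p2 = on_target_after_capture(p1, ASSUMED_FOLD)  # after the second capture
for label, p in (("input", p0), ("capture 1", p1), ("capture 2", p2)):
    print(f"{label:>9}: on-target fraction {p:.5f}")
```

Under this model the odds of a molecule being on-target are multiplied by the fold factor in every round, so two rounds act like a single round with the squared factor. A target that is still a minority after the first capture can therefore dominate the library after the second.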

Determination of capture efficiency with quantitative PCR

Quantitative real-time PCR (qPCR) provides a fast, sensitive and low-cost method for estimating the enrichment efficiency of a DNA library before sequencing.

For this purpose we selected one specific locus that meets two criteria:

i) it is included in the nuclear array we developed

ii) its primer system is well established for use with ancient DNA

HERC2 rs12913832 (amplicon length 76 bp) is a single nucleotide polymorphism near the OCA2 gene that may be functionally linked to brown or blue eye color through reduced promoter activity of the OCA2 gene.

We investigated the coverage of this particular locus in four selected samples using quantitative PCR and sequencing read depth.

This example illustrates that the fold enrichment can be calculated for this control locus.
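
One common way to estimate fold enrichment from qPCR is the difference in threshold cycle (Ct) of the control locus before and after capture. The sketch below is not necessarily the calculation used for this report: it assumes that equal amounts of pre-capture and post-capture library are used as template and that the amplification efficiency is the same in both reactions. The function name and the Ct values are hypothetical.

```python
def fold_enrichment(ct_pre, ct_post, efficiency=2.0):
    """Fold enrichment of a locus from Ct values before and after capture.

    `efficiency` is the amplification factor per cycle (2.0 means perfect doubling).
    """
    if efficiency <= 1.0:
        raise ValueError("efficiency must be greater than 1")
    # A lower Ct after capture means more template copies of the locus.
    return efficiency ** (ct_pre - ct_post)


# Hypothetical Ct values for the control locus, not measured data.
ct_before_capture = 30.0
ct_after_capture = 20.0
enrichment = fold_enrichment(ct_before_capture, ct_after_capture)
print(f"estimated fold enrichment: {enrichment:.0f}x")
```

A difference of ten cycles at perfect efficiency corresponds to a 1024-fold enrichment; with a lower per-cycle efficiency the same Ct difference yields a smaller estimate.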

We can demonstrate that there is a correlation between qPCR measurements and sequencing results in that specific nuclear region.
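
Such a correlation can be quantified with the Pearson coefficient, for example. The sketch below is illustrative only: the input lists are placeholder numbers chosen to be perfectly proportional, not the values measured in the four samples, and with so few samples any coefficient should be interpreted cautiously.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for four samples (NOT measurements from this report).
qpcr_fold_enrichment = [1.0, 2.0, 3.0, 4.0]
read_depth_at_locus = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(qpcr_fold_enrichment, read_depth_at_locus)
print(f"Pearson r = {r:.3f}")
```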

This screening method is a good way to check whether the enrichment process was successful and to avoid sequencing runs that yield insufficient data.

We are currently screening our nuclear sequencing data for a second and a third control locus that are suitable for use as a qPCR-based enrichment quality check.

Detecting mutations

With the ability to find and remove contaminating sequences, and with the improved per-base coverage due to library pooling, it is now possible to call mutations correctly in diploid organisms. The GATK pipeline is used in combination with hard filtering. So far, 15 SNPs called in two samples have been verified against Sanger sequencing data from previous studies. The SNP calls followed the same parameters as the variant detection in mitochondrial DNA, with two additions: the organism is now treated as diploid, which allows heterozygous genotypes, and a PHRED-scaled quality is generated for each possible genotype, which may not be below 50 (error probability of 0.001).
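
The genotype quality threshold can be sketched as a simple filter on already parsed calls. This is a minimal illustration, not the GATK hard-filtering step itself: it assumes that the PHRED-scaled genotype quality is available as a `GQ` value per call, and the records, site names and function names are hypothetical. It uses the standard PHRED definition Q = -10 log10(P); under that definition an error probability of 0.001 corresponds to a score of 30, and a score of 50 to 0.00001.

```python
import math


def phred_to_error_probability(q):
    """Convert a PHRED-scaled quality to an error probability."""
    return 10.0 ** (-q / 10.0)


def error_probability_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a PHRED-scaled quality."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -10.0 * math.log10(p)


def passes_genotype_quality(call, min_gq=50):
    """Keep a call only if its genotype quality reaches the threshold."""
    return call["GQ"] >= min_gq


# Hypothetical genotype calls; "0/1" is heterozygous, "1/1" homozygous alternate.
calls = [
    {"site": "site_1", "genotype": "0/1", "GQ": 99},
    {"site": "site_2", "genotype": "1/1", "GQ": 35},
    {"site": "site_3", "genotype": "0/1", "GQ": 50},
]
kept = [call["site"] for call in calls if passes_genotype_quality(call)]
print("calls kept:", kept)
print("error probability at Q50:", phred_to_error_probability(50))
print("PHRED score for p = 0.001:", error_probability_to_phred(0.001))
```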