S1. SUPPLEMENTARY NOTES

ARRAY DESIGN DETAILS

First Phase Design

  1. Source Sequence. Retrieved human chromosome sequences from UCSC (Hinrichs, Karolchik et al. 2006) and used RepeatMasker (Smit, Hubley et al. 1999-2004) to identify repeat elements.
  2. Import the target region list. Ensured that the chromosome name and coordinates specified for every region were valid. Also tested that there were no overlaps between regions in the input list. Overlaps and duplicates in the initial lists were resolved and resulted in a final list of 11,000 target chromosomal regions, mostly corresponding to exons.
  3. Add flanking sequence to a uniform length of 1200 bp. Flanking regions were added to all regions less than 600 bp so that their final size would be 1200 bp, to accommodate >8 50-60mer probes per region. Regions that were already 1200 bp or greater were left untouched. The resulting region is always centered on the original target (i.e. the same amount of flank is added to each side).
  4. Probe extraction. Extracted probe sequences at 5 bp intervals from within each target region. Checked for ambiguous bases and counted the number of RepeatMasked bases. At each 5 bp interval, the length of the extracted probe was varied within 54 bp +/- 10 bp to achieve a target Tm of 76 ºC. If the resulting Tm was still more than 5 ºC from the target Tm, the probe was discarded. Also tested for simple low complexity elements (homopolymers, dipolymers, etc.) at this stage and discarded probes if they were encountered. Finally, if more than 25% of the bases of a probe were repeat-masked, the probe was discarded. Of ~9.4 million possible probes, a total of ~3.6 million probes passed these simple tests and were stored as the complete ‘unfiltered’ probe set. The most common reason for failure of a probe at this early stage was the presence of a repeat element. Probe Tm was calculated according to the ‘Nearest Neighbour’ approach described by Breslauer, Frank et al. (1986) using the thermodynamic measurements of Sugimoto, Nakano et al. (1996).
  5. Random sequence generation. Random probes were generated to be used as negative controls and to estimate background hybridization. They were selected to uniformly cover the range of probe Tm and length represented by the actual region probes extracted above. Initially, ~1.9 million random probes were generated. These probes were subjected to the same quality tests as the experimental probes and were additionally tested to ensure that they had minimal homology to the human genome.
  6. Probe folding. All probes were folded by MultiRNAFold to identify sequences which form hairpins or duplexes (Andronescu, Aguirre-Hernandez et al. 2003; Andronescu, Fejes et al. 2004).
  7. Low complexity testing. The ‘mdust’ algorithm (Hancock and Armstrong 1994) was used to identify low complexity elements which were not previously identified by searching for homopolymers, dipolymers, etc.
  8. Specificity testing. Each probe was mapped to the complete human genome sequence using BLAST 2.2.15 (with a word size of 20). The probe needed to be successfully mapped back to its source location, and the number of hits to other regions was noted.
  9. Cycle calculation. The number of cycles required by NimbleGen to synthesize each probe was calculated by a modification of their cycle calculator tool (written in C). I modified this tool to accept a more convenient file format as input. The algorithm was not altered.
  10. Summary statistics. At this point various figures and statistics were generated to represent the distribution of values resulting from the tests conducted in steps 5-8 for all probes. These statistics were used to choose reasonable cutoffs for filtering probes in the following step.
  11. Filtering region probes. Probes which did not meet particular cutoffs for the values calculated in steps 5-9 were removed from consideration for the final array. Specifically, a probe was required to have: (a) a probe length of 54 bp +/- 10 bp, (b) a Tm of 76 ºC +/- 4.5 ºC, (c) less than 10% of its length representing repeat-masked sequence, (d) a free energy of hairpin folding greater than -10.0 kcal/mol, (e) a free energy of dimerization greater than -26.0 kcal/mol, (f) low complexity bases occupying 10% or less of the probe length, (g) no non-specific BLAST hits of 80% of the probe length or greater, (h) no more than 1 non-specific BLAST hit of 75% of the probe length or greater, (i) no more than 4 non-specific BLAST hits of 50% of the probe length or greater and (j) no more than 178 cycles required for synthesis according to NimbleGen’s cycle calculator. After filtering for these 10 criteria, a pool of 2.8 million region probes remained for potential inclusion on the final array design.
  12. Filtering random control probes. Filtering of random sequence probes was done exactly as for region probes except that no blast hits of any length to the human genome were allowed. After filtering, a pool of 1.5 million random probes remained for potential inclusion on the final array design.
  13. Probe selection. The probe selection process involved cycling through all 11,000 target regions and selecting probes for each region until 99% of 385,000 probes were identified. At every cycle for each region, the ‘best’ probe was determined by considering its Tm and length as well as its distance from probes already selected within the region. The selection algorithm attempted to maximize the distance between probes, promote even coverage of each region, and minimize probe overlap. The remaining 1% of the array was filled by selecting random control probes. These were selected to uniformly represent the range of Tm and probe length of all region probes selected.
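The ten cutoffs in step 11, and the stricter rule for random controls in step 12, can be expressed as a single predicate. The following is a minimal Python sketch, not the original pipeline code: the `Probe` record and its field names are hypothetical, and it assumes each BLAST hit has already been reduced to the fraction of the probe length it covers.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    # Hypothetical record; field names are illustrative only.
    length: int                # probe length in bp
    tm: float                  # melting temperature in degrees C
    repeat_frac: float         # fraction of bases that are repeat-masked
    hairpin_dg: float          # free energy of hairpin folding, kcal/mol
    dimer_dg: float            # free energy of dimerization, kcal/mol
    low_complexity_frac: float # fraction of bases flagged as low complexity
    hit_fracs: list            # per BLAST hit: hit length / probe length
    cycles: int                # synthesis cycles from the cycle calculator


def passes_region_filters(p: Probe) -> bool:
    """Apply criteria (a)-(j); hit_fracs holds non-specific hits only."""
    hits = p.hit_fracs
    return (
        abs(p.length - 54) <= 10                 # (a)
        and abs(p.tm - 76.0) <= 4.5              # (b)
        and p.repeat_frac < 0.10                 # (c)
        and p.hairpin_dg > -10.0                 # (d)
        and p.dimer_dg > -26.0                   # (e)
        and p.low_complexity_frac <= 0.10        # (f)
        and sum(f >= 0.80 for f in hits) == 0    # (g)
        and sum(f >= 0.75 for f in hits) <= 1    # (h)
        and sum(f >= 0.50 for f in hits) <= 4    # (i)
        and p.cycles <= 178                      # (j)
    )


def passes_random_filters(p: Probe) -> bool:
    """Random controls: same cutoffs, plus no BLAST hit of any length."""
    return passes_region_filters(p) and len(p.hit_fracs) == 0
```

A region probe with one 76% off-target hit passes criterion (h), whereas the same probe would be rejected as a random control because any hit to the genome disqualifies it.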

Design Outcome

  • The 11,000 regions targeted by the design represent ~1.5% of the human genome and the majority of these regions encompass 1 or more exons.
  • At least 1 probe was selected for 10,625 (97%) of the total 11,000 regions. 96% of regions have 6 or more probes. Most regions are ~1200 bp in size and typically have 25-35 probes. Larger regions may have as many as 62 probes.
  • 375 regions have 0 probes, but in general other exons in the corresponding genes were more successful. Furthermore, 50 of these regions with 0 probes were Y chromosome controls and only 225 were exons of primary gene targets.
  • The final microarray design consists of 385,000 probes, of which, 99% correspond to the 10,625 successful targeted regions. The remaining 1% of the array consists of random sequence probes.
  • This 385k design will be used to profile a series of test samples and will ultimately be used to generate a 72k design compatible with NimbleGen’s ‘4-plex’ format. The 5-6 probes from each region which are determined to have the best performance during the test phase will be selected for the 72k design.
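The region and probe counts quoted above can be cross-checked with simple arithmetic. The short Python sketch below uses only figures stated in this section; the variable names are illustrative.

```python
total_regions = 11_000
regions_without_probes = 375
array_size = 385_000

# Regions with at least one probe, and their share of all targets.
successful_regions = total_regions - regions_without_probes
success_fraction = successful_regions / total_regions

# 99% of the array is region probes; the remainder is random controls.
region_probe_slots = array_size * 99 // 100
random_probe_slots = array_size - region_probe_slots

print(successful_regions, round(success_fraction * 100, 1))
print(region_probe_slots, random_probe_slots)
```

This gives 10,625 successful regions (96.6%, i.e. ~97%), 381,150 region probes and 3,850 random control probes.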

References

Andronescu, M., R. Aguirre-Hernandez, et al. (2003). "RNAsoft: A suite of RNA secondary structure prediction and design software tools." Nucleic Acids Res 31(13): 3416-22.

Andronescu, M., A. P. Fejes, et al. (2004). "A new algorithm for RNA secondary structure design." J Mol Biol 336(3): 607-24.

Breslauer, K. J., R. Frank, et al. (1986). "Predicting DNA duplex stability from the base sequence." Proc Natl Acad Sci U S A 83(11): 3746-50.

Hancock, J. M. and J. S. Armstrong (1994). "SIMPLE34: an improved and enhanced implementation for VAX and Sun computers of the SIMPLE algorithm for analysis of clustered repetitive motifs in nucleotide sequences." Comput Appl Biosci 10(1): 67-70.

Hinrichs, A. S., D. Karolchik, et al. (2006). "The UCSC Genome Browser Database: update 2006." Nucleic Acids Res 34(Database issue): D590-8.

Smit, A. F. A., R. Hubley, et al. (1999-2004). RepeatMasker Open-3.0.

Sugimoto, N., S. Nakano, et al. (1996). "Improved thermodynamic parameters and helix initiation factor to predict stability of DNA duplexes." Nucleic Acids Res 24(22): 4501-5.

Testing Of First Phase Design

- Identified 10 individuals with known CNVs covered by probes on the designed array.

- Hybridized patient DNA to arrays according to the manufacturer’s specifications (see below for protocol).

- Ensured that probes within known CNVs identified the CNV; if not, the probe was eliminated.

- The remaining probes outside of the known CNVs were used to assess probe performance.

- The mean log2 ratio for each probe (taken from all 10 individual hybs) and the standard deviation (SD) of the log2 ratio were used as selection criteria. The following are criteria I used to select probes to eliminate:

a) Probes with a mean log2 ratio >0.25 and any SD

b) Probes with a mean log2 ratio >0.2 and a SD >0.15

c) Probes with a mean log2 ratio >0.1 and a SD >0.25
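Criteria (a) to (c) combine into one elimination rule per probe. The Python sketch below applies the thresholds exactly as written, to the signed mean log2 ratio; whether the absolute value was intended is not stated here, so that is left as an open assumption. The function name is illustrative.

```python
def should_eliminate(mean_log2: float, sd_log2: float) -> bool:
    """Return True if a probe meets any of elimination criteria (a)-(c)."""
    return (
        mean_log2 > 0.25                            # (a) any SD
        or (mean_log2 > 0.2 and sd_log2 > 0.15)     # (b)
        or (mean_log2 > 0.1 and sd_log2 > 0.25)     # (c)
    )
```

For example, a probe with a mean log2 ratio of 0.22 is kept when its SD is 0.10 but eliminated when its SD is 0.20.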

Second Phase Design

- A script was used to select the best performing probe (log2 ratio closest to 0 with low SD) for each exon within each target gene until all exons were covered. A second round selected the next best probe for each exon within each gene, and the cycle continued until all 135,000 probes for the final 12-plex 135K array were filled.

- All 135,000 probes selected for the second phase design were scanned manually to identify genes that did not have any probes and exons/regions that had fewer than 5 probes.
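The round-based selection described above can be sketched as follows. This is an illustrative Python reconstruction, not the original script: it assumes exon identifiers are unique across genes, and it ranks probes by absolute mean log2 ratio first and SD second, which is one possible reading of "closest to 0 with low SD".

```python
from collections import defaultdict


def select_probes(probes, capacity):
    """Pick probes round by round until `capacity` is reached.

    probes: iterable of (probe_id, exon_id, mean_log2, sd_log2).
    Returns the selected probe ids in selection order.
    """
    by_exon = defaultdict(list)
    for probe_id, exon_id, mean_log2, sd_log2 in probes:
        by_exon[exon_id].append((abs(mean_log2), sd_log2, probe_id))
    for ranked in by_exon.values():
        ranked.sort()  # best probe first (assumed ranking, see above)

    selected = []
    round_index = 0
    while len(selected) < capacity:
        added = False
        for exon_id in sorted(by_exon):
            ranked = by_exon[exon_id]
            if round_index < len(ranked):
                selected.append(ranked[round_index][2])
                added = True
                if len(selected) == capacity:
                    return selected
        if not added:
            break  # every exon has run out of probes
        round_index += 1
    return selected
```

Each round takes at most one additional probe per exon, so exons with few usable probes are covered before well-covered exons receive extra probes.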

NIMBLEGEN HYB PROTOCOL AND NEXUS ANALYSIS PROTOCOL

500 ng of DNA was labeled according to the manufacturer’s specifications (Roche NimbleGen CGH Analysis User’s Guide V5.1 Mar 16, 2009). Arrays were hybridized in the NimbleGen Hybridization 12-plex System and washed manually. The arrays were scanned with a Molecular Devices GenePix 4000B at 5 µm. Sample Tracking Control was used in all 12-plex reactions. Data were extracted using NimbleScan v2.5. Only those arrays passing all QC measures outlined in the NimbleScan v2.5 Software User Guide were analysed for CNVs. SegMNT files were generated using the following settings: Min segment difference = 0.1, Min segment length = 5, Acceptance percentiles = 0.999, Averaging window = 1X, including non-unique probes, spatial correction, normalized. SegMNT files were imported into Nexus v4.0. The gender of each sample was included to allow for probe correction on the sex chromosomes. CNVs were identified with the following settings: log2 ratio >0.2 (duplications) or <-0.2 (deletions) with a minimum of 5 probes affected; the maximum probe distance was <10 Kb (calls separated by >10 Kb were treated as separate CNVs). All CNVs identified in each hybridization were visually assessed for confidence and to determine whether they were present in the other parental hybridization but not called by the software. De novo CNVs were identified when a set of probes identified by the platform algorithm was called as a deletion in the child relative to both parents or as a duplication in the child relative to both parents on the same platform. Males who inherited an X-chromosome CNV from the mother were also identified by the software because of the correction.
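The calling settings above (log2 ratio threshold, minimum of 5 probes, 10 Kb maximum probe distance) can be illustrated with a simplified per-probe sketch in Python. This is not the Nexus algorithm, which operates on segmented data; it only shows how the three settings interact. It treats deletions as log2 ratios below -0.2, which is an interpretation of the settings, and the function and parameter names are hypothetical.

```python
def call_cnvs(probes, threshold=0.2, min_probes=5, max_gap=10_000):
    """Group consecutive same-direction probes into candidate CNVs.

    probes: list of (position_bp, log2_ratio) for one chromosome,
    sorted by position. Returns (start, end, kind, n_probes) tuples.
    """
    calls = []
    run, run_kind = [], None  # positions and direction of the open run
    for pos, ratio in probes:
        if ratio > threshold:
            kind = "duplication"
        elif ratio < -threshold:
            kind = "deletion"
        else:
            kind = None
        # A probe extends the open run only if it has the same direction
        # and lies within max_gap of the previous probe in the run.
        extends = (kind is not None and kind == run_kind
                   and pos - run[-1] <= max_gap)
        if not extends:
            if run_kind is not None and len(run) >= min_probes:
                calls.append((run[0], run[-1], run_kind, len(run)))
            run, run_kind = [], kind
        if kind is not None:
            run.append(pos)
    if run_kind is not None and len(run) >= min_probes:
        calls.append((run[0], run[-1], run_kind, len(run)))
    return calls
```

Runs of fewer than 5 probes, or runs broken by a gap of more than 10 Kb, produce no call in this sketch, which is why visual review of borderline regions remains useful.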

QUANTITATIVE PCR VALIDATION

Primer Design

1. Selecting Target Sequence from UCSC browser. Go to the UCSC genome browser, human genome, and select the appropriate build. Go to the target sequence, i.e., if a gene, enter the gene name in the search box and jump to the gene. If needed, zoom in to the exon of interest. Obtain the DNA sequence of the selected region on the browser. On the ‘Get DNA’ page, make sure ‘mask repeats’ is checked and get the DNA sequence with repeats visualized. Examine the sequence for repeats. Make a selection of sequence that does not have repeats. Sequence length can be from one exon (about 200 bp) to 1000 bp. Copy and paste this sequence into a .txt file and save.

2. Designing primers using PrimerExpress. Open PrimerExpress. Open a DNA PCR document and import the relevant .txt file with the DNA sequence. Select design parameters of min length of 100 bp and max length of 150 bp and run the program. Obtain the list of primer pairs.

3. Ensuring primers are specific. Use in-silico PCR. Make sure the appropriate genome version is selected and then load your forward and reverse primer sequences and submit. The output generated should be a single amplicon of the same length as denoted in PrimerExpress, with 100% sequence match. Click on the browser location link to see where the primers are positioned on the genome and make sure they map to the exact exon entered into the design software. Discard primers that return more than one unique hit or do not generate a 100% match hit.

4. Screening Primers for secondary structure. Use Beacon Design. Select the ‘SYBR green’ qPCR oligo analysis page. Paste in the forward and reverse primer sequences, leave all other parameters as default, and analyze. Check the results for primer cross binding, primer self-self binding and primer hairpins. Choose primers with no secondary structure noted; in the event secondary structure is found, choose primers whose delta G (Gibbs free energy) score for each reaction is greater than -3.0, indicating a low likelihood of this occurring. Only primers that are unique (in-silico PCR) and pass the secondary structure test (Beacon Design) are used for qPCR.
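The acceptance rules in steps 3 and 4 can be summarised as a single check per primer pair. The Python sketch below is illustrative only: it assumes the in-silico PCR and Beacon Design results have been transcribed by hand into simple lists, and the function and parameter names are hypothetical rather than part of either tool.

```python
def primer_pair_passes(insilico_hits, expected_length, structure_dgs,
                       min_dg=-3.0):
    """Check one primer pair against the specificity and structure rules.

    insilico_hits: list of (amplicon_length_bp, percent_identity),
        one entry per amplicon returned by in-silico PCR.
    expected_length: amplicon length reported by the design software.
    structure_dgs: delta G values reported for cross binding,
        self-self binding and hairpins (empty if none were reported).
    """
    # Step 3: exactly one amplicon, of the expected length, 100% match.
    if len(insilico_hits) != 1:
        return False
    length, identity = insilico_hits[0]
    if length != expected_length or identity < 100.0:
        return False
    # Step 4: every reported secondary structure must be weak.
    return all(dg > min_dg for dg in structure_dgs)
```

A pair with a single perfect 120 bp amplicon and a weakest structure score of -2.9 passes, while the same pair with a hairpin at -3.5 is rejected.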

Quantitative PCR protocol

Primers were designed within ID candidate genes within the identified CNVs, avoiding benign polymorphisms listed in the DGV (version- variation.hg18.v10.nov.2010) as described above. Primers were designed using Primer Express (Applied Biosystems) and purchased from Integrated DNA Technologies in lab ready format. The patient's DNA was diluted in PCR-grade water, and the quality and concentration were assessed using a spectrophotometer (Nanodrop, Thermo Scientific). Primers were optimized for qPCR by standard PCR amplification (50 ng/μl sample DNA concentration) on a positive and negative control. PCR product was visualized on a 2% agarose gel stained with ethidium bromide. The presence of only a single band of the expected size in the control DNA, the absence of primer-dimers, and the absence of any amplification in the blank were considered indicative of a primer set that could be used for qPCR.

CNVs were validated by qPCR (ΔΔCt method) using SYBR Green (Applied Biosystems). Testing was performed in triplicate on child, mother, father and pooled Promega reference sample, on an endogenous control gene (H6PD) and target gene. Promega pooled male and female sample (catalogue#: G3041) was used for autosomal CNVs and Promega pooled female-only sample (catalogue#: G1521) was used for X linked CNVs. Sample DNAs were diluted to 30 ng/µl, and concentration and quality were re-assessed using a spectrophotometer (Nanodrop, Thermo Scientific). 30 ng of sample DNA was combined in a 10 µl reaction mixture with 5 nM forward and reverse primer and SYBR green master mix solution (Quanta Biosciences). qPCR thermal cycle parameters were programmed for each test based on the preceding standard PCR amplification protocol optimization (see above). Testing was performed on an ABI7500 fast DNA sequencer, and melt curve analysis was also conducted as an additional QC metric (amplifications where the melt curve did not show the expected single peak corresponding to the melting temperature of the amplicon were discarded). Results were visualized using Applied Biosystems software (7500fastSDS software, Applied Biosystems) as outlined in the software user guide. PCR amplification curves were examined visually for amplification efficiency and the software was programmed to only use runs with 100% PCR amplification efficiency for result generation. A heterozygous deletion was confirmed with the following settings: RQ=0.5, range 0.3<RQ<0.7; a normal two copy state was considered when RQ=1, range 0.8<RQ<1.3; and a heterozygous duplication was considered when RQ=1.5, range 1.3<RQ<2. Only those results that were replicated by all triplicates were considered true positives.
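The ΔΔCt calculation and the RQ ranges above can be illustrated with a short Python sketch. It assumes the pooled reference sample serves as the calibrator and uses the standard RQ = 2^(-ΔΔCt) relation, which holds at 100% amplification efficiency. It averages the triplicate Ct values for brevity rather than checking each replicate separately. Function names are illustrative and do not correspond to the instrument software.

```python
from statistics import mean


def relative_quantity(target_ct, control_ct, ref_target_ct, ref_control_ct):
    """RQ by the delta-delta-Ct method from lists of replicate Ct values.

    target_ct / control_ct: test sample, target gene and control gene.
    ref_target_ct / ref_control_ct: same genes in the reference sample.
    """
    delta_sample = mean(target_ct) - mean(control_ct)
    delta_ref = mean(ref_target_ct) - mean(ref_control_ct)
    return 2 ** -(delta_sample - delta_ref)


def classify_rq(rq):
    """Map an RQ value onto the copy-number ranges given in the text."""
    if 0.3 < rq < 0.7:
        return "heterozygous deletion"
    if 0.8 < rq < 1.3:
        return "normal"
    if 1.3 < rq < 2.0:
        return "heterozygous duplication"
    return "inconclusive"  # outside every stated range
```

For example, if the target gene amplifies one cycle later than in the reference sample after normalising to the control gene, ΔΔCt is 1 and RQ is 0.5, which falls in the heterozygous deletion range.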
