CNVS calling and visualization from NGS data

CNVs calling Modules

Table of Contents

INTRODUCTION

INTRODUCTION REFERENCES

GLOSSARY and MODULE OVERVIEW

MODULE DESCRIPTION

CNV CALLING: three different path

1.ERDS/SVA path (DOC) (see STEP5 “Sequence Variant Analyzer v1.0, for hg18 annotations” pipeline, for “Sequence Variant Analyzer v1.1, for has not been released yet a compatible ERDS version)

2.BOWTIE/CNVer/SAVANT path (DOC+PEM)

2.1BOWTIE alignment

2.1.1Build bowtie index for the reference genome

2.1.2Bowtie alignment with SAM production

2.1.3SAM2BAM conversion

2.2CNVer

2.2.1CNVer call

2.2.2Sort .bam

2.2.3Indexing the .bam file

2.2.4Visualization

3.CNVseq path (DOC)

3.1Hits file production

3.2CNV call with CNVseq

INTRODUCTION

The field of computational methods for discovering structural variation on NGS data is a still open computational and bioinformatics challenge. Many tools are being currently available (as described at The framework of all structural variant discovery methods is to detect anomalous “signature” or patterns, and then call the underlying variant.There are four main strategies able to map sequence reads to the reference genome and to identify discordant signatures /patterns diagnostic for different classes of structural variants. None of them is comprehensive, with each method having peculiar strengths and weakness in detection, depending on the variant type or the properties of the underlying sequence at the variant locus. The four methods are:

(1)Read pair technologies (aka Paired End Mapping strategies): these software rely on the information that reside in the (1) order, (2) distance and (3) orientation of the reads to detect structural variants (deletion, insertion, inversions, indels). These methods allow to detect (in principle) almost all the classes of variation. Reads that map too far apart suggest the presence of a deletion, reads too close together may be indicative of insertions, and reads with orientation inconsistencies can point to inversions and tandem duplications. When only one member of the pair map to the reference genome (aka singlet read), the companion read suggest the presence of variant sequences not included in the reference genome. Some computational tools based on read-pair approach (There are now many computational tools based on a read-pair approach, including PEMer[1], VariationHunter[2-4], BreakDancer[5], MoDIL[6], MoGUL[7], HYDRA[8], Corona[9] and SPANNER [10]. This is a powerful strategy with some limitations :(1) resolving ambiguous mapping assignments in repetitive regions may be difficult; (2) they depend on the insert size that usually follows a distribution far from being perfectly tight rising the problem in detecting larger events.

(2)Read-depth methods (aka Depth of Coverage strategies): these software are based on the assumption that mapping depth across the whole genome follows a random (typically Poisson or modified Poisson) distribution, and then the number of reads expected to map within a region to be proportional to the number of times that region appears in the sequenced genome. Thus, a region that has been deleted or duplicated will display either less or more reads than expected. Read depth is theonly sequencing-based method to accurately predict absolute copy numbers[11-12], even if (1) the breakpoint resolutionis often poor and (2) they are less sensitive than PEM methods for the detection of smaller events.

Some softwares use a combination of the two previously described approaches to more reliably detect CNVs (CNVer [13] and Genome STRiP [12]).

(3)Split-read approaches: these methods are able to detect deletions and smaller insertions down to single-base-pair resolution. They are based on the detection of “split” sequence-read signature, namely when the alignment of a read to the genome is broken: thus, a stretch of gaps in the read indicates a deletion or in the reference indicates an insertion.The application of this method is still limited owing the computation overhead of the local gapped alignment and currently available only in the unique regions of the genome. A tool available is PINDEL[14].

(4)De novo assembly: de novo assembly would be in theory able to capture all the forms of structural variation with reads long enough. However the sequence-assembly approaches are not extensively used yet, and typically a combination of de novo and local assembly algorithm are used to form contigs that are then compared to reference genome.Well-known de novo assembly algorithms are EULER-USR [15], ABySS [16], SOAPdenovo[17] and ALLPATHS-LG [18], Cortex assembler [13], NovelSeq framework [19], TIGRA[13].

INTRODUCTION REFERENCES

[1]Korbel, J. O. et al. PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biol. 10, R23 (2009).

[2]Hormozdiari, F., Alkan, C., Eichler, E. E. & Sahinalp, S. C. Combinatorial algorithms for structural variationdetection in high-throughput sequenced genomes.Genome Res. 19, 1270–1278 (2009).

[3]Hormozdiari, F. et al. Next-generation VariationHunter:combinatorial algorithms for transposon insertiondiscovery. Bioinformatics (Oxford, England) 26,i350–i357 (2010).

[4]Hormozdiari, F., Hajirasouliha, I., A., M., Eichler, E. E.Sahinalp, S. C. Simultaneous structural variationdiscovery in multiple paired-end sequenced genomes.Proc. RECOMB 2011 (in the press).

[5]Chen, K. et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation.Nature Methods 6, 677–681 (2009).

[6] Lee, S., Hormozdiari, F., Alkan, C. & Brudno, M. MoDIL: detecting small indels from clone-endsequencing with mixtures of distributions. NatureMethods 6, 473–474 (2009).

[7] Lee, S., Xing, E. & Brudno, M. MoGUL: detectingcommon insertions and deletions in a population.Proc. RECOMB 2010 6044, 357–368 (2010).

[8] McKernan, K. J. et al. Sequence and structural variation in a human genome uncovered by short-read,massively parallel ligation sequencing using two-base

encoding. Genome Res. 19, 1527–1541 (2009).

[9] Mills, R. E. et al. Mapping copy number variation at fine scale by population scale genome sequencing.Nature 470, 59–65 (2011).

[10] Alkan, C. et al. Personalized copy number and segmental duplication maps using next-generationsequencing. Nature Genet. 41, 1061–1067 (2009).

[11] Sudmant, P. H. et al. Diversity of human copynumber variation and multicopy genes. Science 330,641–646 (2010).

[12] Medvedev, P., Fiume, M., Dzamba, M., Smith, T. &Brudno, M. Detecting copy number variation with matedshort reads. Genome Res. 20, 1613–1622 (2010).

[13] Handsaker, R. E., Korn, J. M., Nemesh, J. & McCarroll, S. A. Discovery and genotyping of genomestructural polymorphism by sequencing on a

population scale. Nature Genet. 13 Feb 2011

[14] Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect breakpoints of large deletions and medium sized insertions

from paired-end short reads. Bioinformatics (Oxford, England) 25, 2865–2871 (2009).

[15] Chaisson, M. J., Brinza, D. & Pevzner, P. A. De novo fragment assembly with short mate-paired reads: does the read length matter? Genome Res. 19, 336–346 (2009).

[16] Simpson, J. T. et al. ABySS: a parallel assembler for short read sequence data. Genome Res. 19, 1117–1123 (2009).

[17] Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265–272 (2009).

[18] Gnerre, S. et al. High-quality draft assemblies ofmammalian genomes from massively parallel sequencedata. Proc. Natl Acad. Sci. USA 108, 1513–1518

(2011).

[19] Hajirasouliha, I. et al. Detection and characterization of novel sequence insertions using paired-endnext-generation sequencing. Bioinformatics (Oxford,

England) 26, 1277–1283 (2010).

GLOSSARY and MODULE OVERVIEW

“Input”: name of the input file

“Label”: is specified when the name of the input on the pipeline canvas has to be slightly different form the one specified in the Input section to be more clear.

“Tool”: script/program in use

“Server Location”: location on the fgene server

“Output”: name of the output file

1

CNVS calling and visualization from NGS data

sss

1

CNVS calling and visualization from NGS data

MODULE DESCRIPTION

GOAL: call CNVs starting from raw reads (FASTQ format)

FINAL OUTPUT: files with CNV calls

CNV CALLING: three different path

1.ERDS/SVA path (DOC)(see STEP5 “Sequence Variant Analyzer v1.0, for hg18 annotations” pipeline, for “Sequence Variant Analyzer v1.1, for has not been released yet a compatible ERDS version)

Input: alignment.rmd.sorted.md.recal.clean.bam (= output STEP#)

Tool:erds

Server Location: /applications/scripts_sva/erds1.01/erds.sh

Output: .gsap file (visualization of CNVs in Sequence Variant Analyzer).From the GUI is possible to export the large variants in a .csv file.

Example:

Pipeline Module:

2.BOWTIE/CNVer/SAVANT path (DOC+PEM)

(PREREQUISITE: download the companion package from the CNVer website It allows to useCNVer on alignments performed against the UCSC ref genome. See notes on ensemble alignments.)

2.1BOWTIE alignment

2.1.1Build bowtie index for the reference genome

Input: ref.fa reference files (preferentially the UCSC fasta reference genome, chr1-22,X,Y. See notes if using ensemble genome)

Tool: bowtie (bowtie-build command)

Server Location: /applications/rseqtools/example_dataset/bowtie-0.12.7

Output: series of .ebwt files

Example: bowtie-build ${CD}/human_genome2.fa ucsc_hg18_new_bowtie > \

ucsc_hg18_new_bowtie.log

Pipeline Module:

2.1.2Bowtie alignment with SAM production

Input: sequence.fastq file (SANGER FORMAT, output of STEP#1, I)

Label: Illumina reads sequence.fastq files

Tool: bowtie

Server Location: /applications/BOWTIE/bowtie-0.12.7

Output: alignment.sam

Example: bowtie ${CD}/ucsc_hg18_new_bowtie -1 ${CD}/941408_fwd.fastq -2

${CD}/941408_rev.fastq -v 3 -a -m 600 --best --strata --sam

${CD}/941408_bduc_ucsc_hg18_new.sam >

${CD}/941408_bduc_ucsc_hg18_new.log

Pipeline Module:

2.1.3SAM2BAM conversion

Input: alignment.sam

Tool: bowtiepicard (SamFormatConverter.jar)

Server Location: /applications/picard_1.38/picard-tools-1.38

Output: alignment.bam

Example: -jar /apps/picard/1.45/bin/SamFormatConverter.jar INPUT= ${CD}/941408_bduc_ucsc_hg18_new.sam \

OUTPUT=${CD}/941408_bduc_ucsc_hg18_new.bam

2.2CNVer

2.2.1CNVer call

Input: alignment.bam

Tool: CNVer (cnver.pl)

Server Location: /applications/CNVer/cnver-0.8.1/src

Output: .cnv files

Example: cnver.pl --map_list /projects2/CNVer_0.8.1_testing/map_list.txt --ref_folder /applications/CNVer/cnver-0.8.1/hg18comp --work_dir /projects2/CNVer_0.8.1_testing --read_len 101 --mean_insert 175 --stdev_insert 25 --min_mps 3

Pipeline Module:

2.2.2Sort .bam

Input: alignment.bam file

Tool: samtools (sort option)

Server Location: /applications/samtools-0.1.7_x86_64-linux

Output: alignment.sorted.bam file

Example: /applications/samtools-0.1.7_x86_64-linux/samtools sort /projects/pipelineCache/pipeline/2011January27_15h51m34s061ms/SamToolsRemoveDuplicates_1.OutputNo-DuplicatesBAMfile-1.bam /projects/pipelineCache/pipeline/2011January27_15h51m34s061ms/SamToolsSort_1.OutputSortedBAMfile-1.bam

Pipeline Module:

2.2.3Indexing the .bam file

Input: alignment.sorted.md.bam file

Tool: samtools (index option)

Server Location: /applications/samtools-0.1.7_x86_64-linux

Output: alignment.sorted.md.bam.bai file

Example: /applications/samtools-0.1.7_x86_64-linux/samtools index /projects/pipelineCache/pipeline/2011January27_15h51m34s061ms/SamToolsCamldMDtag_1.OutputNo-DuplicatesBAMfile-1.bam

Pipeline Module:

2.2.4Visualization

2.2.4.1BAM sorting

Input: alignment.bam

Tool: samtools (sort

Server Location: /applications/SAVANT_gb_updated

Output: alignment.genome.cov.bam file

Example:

Pipeline Module:

2.2.4.2Coverage track production

Input:alignment.bam

Tool: SAVANT genome browser (GUI)

Server Location: /applications/SAVANT_gb_updated

Output: alignment.genome.cov.bam file

Example:

Pipeline Module:

2.2.4.3 Formatting the ref.fa file in a ref.fa.savant

Input: ref.fa file (see #2.1.1)

Tool: SAVANT genome browser (GUI)

Server Location: /applications/SAVANT_gb_updated

Output: ref.fa.savant project

Example:

Pipeline Module:

2.2.4.4Formatting the cnv file in cnv.bed

Input: .cnv file

Tool: SAVANT genome browser (GUI)

Server Location: /applications/SAVANT_gb_updated

Output: .cnv.bed

Example:

Pipeline Module:

2.2.4.5Visualization

Input: alignment.sorted.bam, alignment, genome.cov.bam, .cnv.bed files

Tool: SAVANT genome browser

Server Location: /applications/SAVANT_gb_updated

Output: saved in a .savant project

Example:

Pipeline Module:

ENSEMBL ALIGNMENTS NOTES

PREREQUISITE: perform the alignments on a 1-22,X,Y ensemble reference genome. AVOID the contigs!

1)Conversion chr1-22 (UCSC) to 1-22 (ENSEMBL).

  1. In the hg18comp folder remove chr from allchr.txt, autosomes.txt.
  2. In the folders inside hg18comp (contig_breaks_folder, fasta_files_folder, repeat_regions_folder, self_alignments_folder) change the name of the files and also the content of the file accordingly chr1-22 (UCSC) to 1-22 (ENSEMBL).

3.CNVseq path (DOC)

3.1Hits file production

Input: alignment.bam

Tool: samtools (view option)

Server Location: /applications/samtools-0.1.7_x86_64-linux

Output: .hits file

Example: samtools view -F 4 my.bam | perl -lane 'print "$F[2]\t$F[3]"' > my.hits /ifs/ccb/CCB_SW_Tools/BioinformaticsGenetics/samtools/samtools-0.1.10/samtools view -bT /ifs/ccb/CCB_SW_Tools/BioinformaticsGenetics/MAQ_BWA_2010/test_data/ref_chr2.fasta.fai -o /ifs/pl_cache/cranium/pipelnvr/2010December02_09h30m18s797ms/SamToolsView_1.OutputBAMfile-1.bam /ifs/pl_cache/cranium/pipelnvr/2010December02_09h30m18s797ms/SamToolsmaq2sam-long_1.OutputSAMfile-1.sam

Pipeline Module:

3.2CNV call with CNVseq

Input: .hits file

Tool: CNVseq+R

Server Location: /applications/CNVseq/cnv-seq

Output: .hits.cnv file

Example: ./cnv-seq.pl --test SMS-135.hits --ref JAM-230.hits --genome human

In R:

library(proto)

library(grid)

library(cnv)

data <- read.delim("JLK-227.hits-vs-JAM-230.hits.log2-0.6.pvalue-0.001.minw-4.cnv") #upload the .cnv file

cnv.summary(data) # produce a description of the dataset like that:

CNV percentage in genome: 0.8%

CNV nucleotide content: 24516307

CNV count: 403

Mean size: 60835

Median size: 135865

Max Size: 503201

Min Size: 20129

plot.cnv(data, chromosome=2, from=140036061, to=144238634) # to plot a specific regionso in this case you can insert the interval you want and see the log R ratio OR check the cnv.print file and look inside that if your region is among the significant ones.

plot.cnv(data, CNV=4, upstream=4e+6, downstream=4e+6) #to plot a CNV (when you do cnv.print(data) look at the CNVid and them choose what CNV to be plotted)

Pipeline Module:

NOTE:

  • What a user might want to do with the plotting (see OUTPUT PLOT)?

(1)Automatically export ALL the pdf plots of all the CNVs (so this step may be connected directly to the CNVseq. I guess would be great to put a parameter that if checked allows this automatic production:

plot.cnv.all <- function (data, chrom.gap = 2e+07,

colour = 5, title = WG plots, ylim = c(-2,2),

xlabel = "Chromosome")

(2) Automatically export ALL the pdf plots of all the CNVs in one chromosome by time (so this step may be connected directly to the CNVseq. I guess would be great to put a parameter that if checked allows this automatic production (with the possibility to choose the chromosome):

plot.cnv.chr <- function (data, chromosome = the number of the chromosome, from = NA(beginning coordinate), to = (ending coordinate), title = chromosome x,

ylim = c(-4, 4), glim = c(-2, 2),

xlabel = "Position (bp)")

(3) In the vast majority of the cases, the user will download the cnv file produced by CNVseq, will examine it and choose some events that are more interesting. In this case what I figure in my mind is the possibility to check a parameter that says “by now stop the process to CNVseq”. Once the user has downloaded and examined the .cnv files and has x regions to visualize, he can go back to the same module (without having restarted it) and export plots:

By genomic region: plot.cnv(data, chromosome=2, from=140036061, to=144238634) # to plot a specific region so in this case you can insert the interval you want and see the log R ratio

ByCNVid: plot.cnv(data, CNV=4 (this the is found by the user in the cnv file, upstream=4e+6, downstream=4e+6)

  • What a user might want to do with the output description file (see OUTPUT DATA DESCRIPTION)?

This step is ok.

1