CNVS calling and visualization from NGS data
CNVs calling Modules
Table of Contents
INTRODUCTION
INTRODUCTION REFERENCES
GLOSSARY and MODULE OVERVIEW
MODULE DESCRIPTION
CNV CALLING: three different path
1.ERDS/SVA path (DOC) (see STEP5 “Sequence Variant Analyzer v1.0, for hg18 annotations” pipeline, for “Sequence Variant Analyzer v1.1, for has not been released yet a compatible ERDS version)
2.BOWTIE/CNVer/SAVANT path (DOC+PEM)
2.1BOWTIE alignment
2.1.1Build bowtie index for the reference genome
2.1.2Bowtie alignment with SAM production
2.1.3SAM2BAM conversion
2.2CNVer
2.2.1CNVer call
2.2.2Sort .bam
2.2.3Indexing the .bam file
2.2.4Visualization
3.CNVseq path (DOC)
3.1Hits file production
3.2CNV call with CNVseq
INTRODUCTION
The field of computational methods for discovering structural variation on NGS data is a still open computational and bioinformatics challenge. Many tools are being currently available (as described at The framework of all structural variant discovery methods is to detect anomalous “signature” or patterns, and then call the underlying variant.There are four main strategies able to map sequence reads to the reference genome and to identify discordant signatures /patterns diagnostic for different classes of structural variants. None of them is comprehensive, with each method having peculiar strengths and weakness in detection, depending on the variant type or the properties of the underlying sequence at the variant locus. The four methods are:
(1)Read pair technologies (aka Paired End Mapping strategies): these software rely on the information that reside in the (1) order, (2) distance and (3) orientation of the reads to detect structural variants (deletion, insertion, inversions, indels). These methods allow to detect (in principle) almost all the classes of variation. Reads that map too far apart suggest the presence of a deletion, reads too close together may be indicative of insertions, and reads with orientation inconsistencies can point to inversions and tandem duplications. When only one member of the pair map to the reference genome (aka singlet read), the companion read suggest the presence of variant sequences not included in the reference genome. Some computational tools based on read-pair approach (There are now many computational tools based on a read-pair approach, including PEMer[1], VariationHunter[2-4], BreakDancer[5], MoDIL[6], MoGUL[7], HYDRA[8], Corona[9] and SPANNER [10]. This is a powerful strategy with some limitations :(1) resolving ambiguous mapping assignments in repetitive regions may be difficult; (2) they depend on the insert size that usually follows a distribution far from being perfectly tight rising the problem in detecting larger events.
(2)Read-depth methods (aka Depth of Coverage strategies): these software are based on the assumption that mapping depth across the whole genome follows a random (typically Poisson or modified Poisson) distribution, and then the number of reads expected to map within a region to be proportional to the number of times that region appears in the sequenced genome. Thus, a region that has been deleted or duplicated will display either less or more reads than expected. Read depth is theonly sequencing-based method to accurately predict absolute copy numbers[11-12], even if (1) the breakpoint resolutionis often poor and (2) they are less sensitive than PEM methods for the detection of smaller events.
Some softwares use a combination of the two previously described approaches to more reliably detect CNVs (CNVer [13] and Genome STRiP [12]).
(3)Split-read approaches: these methods are able to detect deletions and smaller insertions down to single-base-pair resolution. They are based on the detection of “split” sequence-read signature, namely when the alignment of a read to the genome is broken: thus, a stretch of gaps in the read indicates a deletion or in the reference indicates an insertion.The application of this method is still limited owing the computation overhead of the local gapped alignment and currently available only in the unique regions of the genome. A tool available is PINDEL[14].
(4)De novo assembly: de novo assembly would be in theory able to capture all the forms of structural variation with reads long enough. However the sequence-assembly approaches are not extensively used yet, and typically a combination of de novo and local assembly algorithm are used to form contigs that are then compared to reference genome.Well-known de novo assembly algorithms are EULER-USR [15], ABySS [16], SOAPdenovo[17] and ALLPATHS-LG [18], Cortex assembler [13], NovelSeq framework [19], TIGRA[13].
INTRODUCTION REFERENCES
[1]Korbel, J. O. et al. PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biol. 10, R23 (2009).
[2]Hormozdiari, F., Alkan, C., Eichler, E. E. & Sahinalp, S. C. Combinatorial algorithms for structural variationdetection in high-throughput sequenced genomes.Genome Res. 19, 1270–1278 (2009).
[3]Hormozdiari, F. et al. Next-generation VariationHunter:combinatorial algorithms for transposon insertiondiscovery. Bioinformatics (Oxford, England) 26,i350–i357 (2010).
[4]Hormozdiari, F., Hajirasouliha, I., A., M., Eichler, E. E.Sahinalp, S. C. Simultaneous structural variationdiscovery in multiple paired-end sequenced genomes.Proc. RECOMB 2011 (in the press).
[5]Chen, K. et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation.Nature Methods 6, 677–681 (2009).
[6] Lee, S., Hormozdiari, F., Alkan, C. & Brudno, M. MoDIL: detecting small indels from clone-endsequencing with mixtures of distributions. NatureMethods 6, 473–474 (2009).
[7] Lee, S., Xing, E. & Brudno, M. MoGUL: detectingcommon insertions and deletions in a population.Proc. RECOMB 2010 6044, 357–368 (2010).
[8] McKernan, K. J. et al. Sequence and structural variation in a human genome uncovered by short-read,massively parallel ligation sequencing using two-base
encoding. Genome Res. 19, 1527–1541 (2009).
[9] Mills, R. E. et al. Mapping copy number variation at fine scale by population scale genome sequencing.Nature 470, 59–65 (2011).
[10] Alkan, C. et al. Personalized copy number and segmental duplication maps using next-generationsequencing. Nature Genet. 41, 1061–1067 (2009).
[11] Sudmant, P. H. et al. Diversity of human copynumber variation and multicopy genes. Science 330,641–646 (2010).
[12] Medvedev, P., Fiume, M., Dzamba, M., Smith, T. &Brudno, M. Detecting copy number variation with matedshort reads. Genome Res. 20, 1613–1622 (2010).
[13] Handsaker, R. E., Korn, J. M., Nemesh, J. & McCarroll, S. A. Discovery and genotyping of genomestructural polymorphism by sequencing on a
population scale. Nature Genet. 13 Feb 2011
[14] Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect breakpoints of large deletions and medium sized insertions
from paired-end short reads. Bioinformatics (Oxford, England) 25, 2865–2871 (2009).
[15] Chaisson, M. J., Brinza, D. & Pevzner, P. A. De novo fragment assembly with short mate-paired reads: does the read length matter? Genome Res. 19, 336–346 (2009).
[16] Simpson, J. T. et al. ABySS: a parallel assembler for short read sequence data. Genome Res. 19, 1117–1123 (2009).
[17] Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265–272 (2009).
[18] Gnerre, S. et al. High-quality draft assemblies ofmammalian genomes from massively parallel sequencedata. Proc. Natl Acad. Sci. USA 108, 1513–1518
(2011).
[19] Hajirasouliha, I. et al. Detection and characterization of novel sequence insertions using paired-endnext-generation sequencing. Bioinformatics (Oxford,
England) 26, 1277–1283 (2010).
GLOSSARY and MODULE OVERVIEW
“Input”: name of the input file
“Label”: is specified when the name of the input on the pipeline canvas has to be slightly different form the one specified in the Input section to be more clear.
“Tool”: script/program in use
“Server Location”: location on the fgene server
“Output”: name of the output file
1
CNVS calling and visualization from NGS data
sss
1
CNVS calling and visualization from NGS data
MODULE DESCRIPTION
GOAL: call CNVs starting from raw reads (FASTQ format)
FINAL OUTPUT: files with CNV calls
CNV CALLING: three different path
1.ERDS/SVA path (DOC)(see STEP5 “Sequence Variant Analyzer v1.0, for hg18 annotations” pipeline, for “Sequence Variant Analyzer v1.1, for has not been released yet a compatible ERDS version)
Input: alignment.rmd.sorted.md.recal.clean.bam (= output STEP#)
Tool:erds
Server Location: /applications/scripts_sva/erds1.01/erds.sh
Output: .gsap file (visualization of CNVs in Sequence Variant Analyzer).From the GUI is possible to export the large variants in a .csv file.
Example:
Pipeline Module:
2.BOWTIE/CNVer/SAVANT path (DOC+PEM)
(PREREQUISITE: download the companion package from the CNVer website It allows to useCNVer on alignments performed against the UCSC ref genome. See notes on ensemble alignments.)
2.1BOWTIE alignment
2.1.1Build bowtie index for the reference genome
Input: ref.fa reference files (preferentially the UCSC fasta reference genome, chr1-22,X,Y. See notes if using ensemble genome)
Tool: bowtie (bowtie-build command)
Server Location: /applications/rseqtools/example_dataset/bowtie-0.12.7
Output: series of .ebwt files
Example: bowtie-build ${CD}/human_genome2.fa ucsc_hg18_new_bowtie > \
ucsc_hg18_new_bowtie.log
Pipeline Module:
2.1.2Bowtie alignment with SAM production
Input: sequence.fastq file (SANGER FORMAT, output of STEP#1, I)
Label: Illumina reads sequence.fastq files
Tool: bowtie
Server Location: /applications/BOWTIE/bowtie-0.12.7
Output: alignment.sam
Example: bowtie ${CD}/ucsc_hg18_new_bowtie -1 ${CD}/941408_fwd.fastq -2
${CD}/941408_rev.fastq -v 3 -a -m 600 --best --strata --sam
${CD}/941408_bduc_ucsc_hg18_new.sam >
${CD}/941408_bduc_ucsc_hg18_new.log
Pipeline Module:
2.1.3SAM2BAM conversion
Input: alignment.sam
Tool: bowtiepicard (SamFormatConverter.jar)
Server Location: /applications/picard_1.38/picard-tools-1.38
Output: alignment.bam
Example: -jar /apps/picard/1.45/bin/SamFormatConverter.jar INPUT= ${CD}/941408_bduc_ucsc_hg18_new.sam \
OUTPUT=${CD}/941408_bduc_ucsc_hg18_new.bam
2.2CNVer
2.2.1CNVer call
Input: alignment.bam
Tool: CNVer (cnver.pl)
Server Location: /applications/CNVer/cnver-0.8.1/src
Output: .cnv files
Example: cnver.pl --map_list /projects2/CNVer_0.8.1_testing/map_list.txt --ref_folder /applications/CNVer/cnver-0.8.1/hg18comp --work_dir /projects2/CNVer_0.8.1_testing --read_len 101 --mean_insert 175 --stdev_insert 25 --min_mps 3
Pipeline Module:
2.2.2Sort .bam
Input: alignment.bam file
Tool: samtools (sort option)
Server Location: /applications/samtools-0.1.7_x86_64-linux
Output: alignment.sorted.bam file
Example: /applications/samtools-0.1.7_x86_64-linux/samtools sort /projects/pipelineCache/pipeline/2011January27_15h51m34s061ms/SamToolsRemoveDuplicates_1.OutputNo-DuplicatesBAMfile-1.bam /projects/pipelineCache/pipeline/2011January27_15h51m34s061ms/SamToolsSort_1.OutputSortedBAMfile-1.bam
Pipeline Module:
2.2.3Indexing the .bam file
Input: alignment.sorted.md.bam file
Tool: samtools (index option)
Server Location: /applications/samtools-0.1.7_x86_64-linux
Output: alignment.sorted.md.bam.bai file
Example: /applications/samtools-0.1.7_x86_64-linux/samtools index /projects/pipelineCache/pipeline/2011January27_15h51m34s061ms/SamToolsCamldMDtag_1.OutputNo-DuplicatesBAMfile-1.bam
Pipeline Module:
2.2.4Visualization
2.2.4.1BAM sorting
Input: alignment.bam
Tool: samtools (sort
Server Location: /applications/SAVANT_gb_updated
Output: alignment.genome.cov.bam file
Example:
Pipeline Module:
2.2.4.2Coverage track production
Input:alignment.bam
Tool: SAVANT genome browser (GUI)
Server Location: /applications/SAVANT_gb_updated
Output: alignment.genome.cov.bam file
Example:
Pipeline Module:
2.2.4.3 Formatting the ref.fa file in a ref.fa.savant
Input: ref.fa file (see #2.1.1)
Tool: SAVANT genome browser (GUI)
Server Location: /applications/SAVANT_gb_updated
Output: ref.fa.savant project
Example:
Pipeline Module:
2.2.4.4Formatting the cnv file in cnv.bed
Input: .cnv file
Tool: SAVANT genome browser (GUI)
Server Location: /applications/SAVANT_gb_updated
Output: .cnv.bed
Example:
Pipeline Module:
2.2.4.5Visualization
Input: alignment.sorted.bam, alignment, genome.cov.bam, .cnv.bed files
Tool: SAVANT genome browser
Server Location: /applications/SAVANT_gb_updated
Output: saved in a .savant project
Example:
Pipeline Module:
ENSEMBL ALIGNMENTS NOTES
PREREQUISITE: perform the alignments on a 1-22,X,Y ensemble reference genome. AVOID the contigs!
1)Conversion chr1-22 (UCSC) to 1-22 (ENSEMBL).
- In the hg18comp folder remove chr from allchr.txt, autosomes.txt.
- In the folders inside hg18comp (contig_breaks_folder, fasta_files_folder, repeat_regions_folder, self_alignments_folder) change the name of the files and also the content of the file accordingly chr1-22 (UCSC) to 1-22 (ENSEMBL).
3.CNVseq path (DOC)
3.1Hits file production
Input: alignment.bam
Tool: samtools (view option)
Server Location: /applications/samtools-0.1.7_x86_64-linux
Output: .hits file
Example: samtools view -F 4 my.bam | perl -lane 'print "$F[2]\t$F[3]"' > my.hits /ifs/ccb/CCB_SW_Tools/BioinformaticsGenetics/samtools/samtools-0.1.10/samtools view -bT /ifs/ccb/CCB_SW_Tools/BioinformaticsGenetics/MAQ_BWA_2010/test_data/ref_chr2.fasta.fai -o /ifs/pl_cache/cranium/pipelnvr/2010December02_09h30m18s797ms/SamToolsView_1.OutputBAMfile-1.bam /ifs/pl_cache/cranium/pipelnvr/2010December02_09h30m18s797ms/SamToolsmaq2sam-long_1.OutputSAMfile-1.sam
Pipeline Module:
3.2CNV call with CNVseq
Input: .hits file
Tool: CNVseq+R
Server Location: /applications/CNVseq/cnv-seq
Output: .hits.cnv file
Example: ./cnv-seq.pl --test SMS-135.hits --ref JAM-230.hits --genome human
In R:
library(proto)
library(grid)
library(cnv)
data <- read.delim("JLK-227.hits-vs-JAM-230.hits.log2-0.6.pvalue-0.001.minw-4.cnv") #upload the .cnv file
cnv.summary(data) # produce a description of the dataset like that:
CNV percentage in genome: 0.8%
CNV nucleotide content: 24516307
CNV count: 403
Mean size: 60835
Median size: 135865
Max Size: 503201
Min Size: 20129
plot.cnv(data, chromosome=2, from=140036061, to=144238634) # to plot a specific regionso in this case you can insert the interval you want and see the log R ratio OR check the cnv.print file and look inside that if your region is among the significant ones.
plot.cnv(data, CNV=4, upstream=4e+6, downstream=4e+6) #to plot a CNV (when you do cnv.print(data) look at the CNVid and them choose what CNV to be plotted)
Pipeline Module:
NOTE:
- What a user might want to do with the plotting (see OUTPUT PLOT)?
(1)Automatically export ALL the pdf plots of all the CNVs (so this step may be connected directly to the CNVseq. I guess would be great to put a parameter that if checked allows this automatic production:
plot.cnv.all <- function (data, chrom.gap = 2e+07,
colour = 5, title = WG plots, ylim = c(-2,2),
xlabel = "Chromosome")
(2) Automatically export ALL the pdf plots of all the CNVs in one chromosome by time (so this step may be connected directly to the CNVseq. I guess would be great to put a parameter that if checked allows this automatic production (with the possibility to choose the chromosome):
plot.cnv.chr <- function (data, chromosome = the number of the chromosome, from = NA(beginning coordinate), to = (ending coordinate), title = chromosome x,
ylim = c(-4, 4), glim = c(-2, 2),
xlabel = "Position (bp)")
(3) In the vast majority of the cases, the user will download the cnv file produced by CNVseq, will examine it and choose some events that are more interesting. In this case what I figure in my mind is the possibility to check a parameter that says “by now stop the process to CNVseq”. Once the user has downloaded and examined the .cnv files and has x regions to visualize, he can go back to the same module (without having restarted it) and export plots:
By genomic region: plot.cnv(data, chromosome=2, from=140036061, to=144238634) # to plot a specific region so in this case you can insert the interval you want and see the log R ratio
ByCNVid: plot.cnv(data, CNV=4 (this the is found by the user in the cnv file, upstream=4e+6, downstream=4e+6)
- What a user might want to do with the output description file (see OUTPUT DATA DESCRIPTION)?
This step is ok.
1