Workshop for NGS data analysis
(GCBA 815)
Overview
Steps
- Process raw data
- Check sequencing quality
- Filter and / or trim reads if necessary
- Map to genome
- Mark duplicates
- Realign indels
- Recalibrate bases
- Variant discovery
- Call variants
- Filter variants
- Refine variant calls
- Annotate variants
- Evaluate variants
01. Log in using SSH:
Host name: cbsb.unmc.edu
Username: <your UNMC id>
Password: <your password>
02. Either: create a new folder for your project, or use the pre-made script.
- cd /storage/gcba815/USER_NAME
- mkdirNGS_analysis
- cd NGS_analysis
03. Log your progress:
- script NGS_analysis.txt
04. Create folders and copy necessary files:
- mkdir {alignments,raw_data,genome,tmp,vcf_files}
- cp /home/adam.cornish/GCBA815/raw_data/*fastqraw_data
- cp /home/adam.cornish/GCBA815/genome/chr22.fa genome
05. Build necessary indexes and dictionary:
- samtoolsfaidx genome/chr22.fa
- java -jar /home/adam.cornish/GCBA815/tools/picard.jar CreateSequenceDictionary R=genome/chr22.fa O=genome/chr22.dict
- cd genome & bwa index chr22.fa
- cd ../raw_data
06. Assess the quality of the reads:
- fastqc -f fastq read1.fastq read2.fastq
07. Filter / Trim the reads if necessary:
- fqtrim -l 50 -p 8 -o filtered.fastq read1.fastq,read2.fastq
- fastqc -f fastq read1.filtered.fastq read2.filtered.fastq
- cd ..
08. Align the reads to chromosome 22:
- bwamem -R "@RG\tID:SAMPLE_NAME\tPL:illumina\tSM:SAMPLE_NAME" -t 8 genome/chr22.fa raw_data/read1.filtered.fastq raw_data/read2.filtered.fastq > alignments/SAMPLE_NAME.raw.sam
09. Convert the SAM file to BAM:
- samtools view -bS alignments/SAMPLE_NAME.raw.sam -o alignments/SAMPLE_NAME.raw.bam
10. Sort the BAM file:
- java -jar /home/adam.cornish/GCBA815/tools/picard.jar SortSam I=alignments/SAMPLE_NAME.raw.bam O=alignments/SAMPLE_NAME.sorted.bam SO=coordinate VALIDATION_STRINGENCY=SILENT
11. Mark PCR duplicates:
- java -jar /home/adam.cornish/GCBA815/tools/picard.jar MarkDuplicates I=alignments/SAMPLE_NAME.sorted.bam O=alignments/SAMPLE_NAME.dup.bam M=tmp/SAMPLE_NAME.mark_dups_metrics_file VALIDATION_STRINGENCY=SILENT
- samtools index alignments/SAMPLE_NAME.dup.bam
12. Realign around indels:
- java -jar /home/adam.cornish/GCBA815/tools/GenomeAnalysisTK-3.4.jar -T RealignerTargetCreator -R genome/chr22.fa -I alignments/SAMPLE_NAME.dup.bam -known /home/adam.cornish/GCBA815/genome/chr22_mills.vcf -o tmp/SAMPLE_NAME.intervals
- java -jar /home/adam.cornish/GCBA815/tools/GenomeAnalysisTK-3.4.jar -T IndelRealigner -R genome/chr22.fa -I alignments/SAMPLE_NAME.dup.bam -known /home/adam.cornish/GCBA815/genome/chr22_mills.vcf -o alignments/SAMPLE_NAME.indels.bam -targetIntervalstmp/SAMPLE_NAME.intervals
13. Perform base quality score recalibration:
- java -jar /home/adam.cornish/GCBA815/tools/GenomeAnalysisTK-3.4.jar -T BaseRecalibrator -R genome/chr22.fa -I alignments/SAMPLE_NAME.indels.bam -knownSites /home/adam.cornish/GCBA815/genome/chr22_dbsnp_138.vcf -o tmp/SAMPLE_NAME.grp
- java -jar /home/adam.cornish/GCBA815/tools/GenomeAnalysisTK-3.4.jar -T PrintReads -R genome/chr22.fa -I alignments/SAMPLE_NAME.indels.bam -BQSR tmp/SAMPLE_NAME.grp -o alignments/SAMPLE_NAME.BQSR.bam
14. Perform variant calling using HaplotypeCaller
- java -jar /home/adam.cornish/GCBA815/tools/GenomeAnalysisTK-3.4.jar -T HaplotypeCaller -R genome/chr22.fa -I alignments/SAMPLE_NAME.BQSR.bam --dbsnp /home/adam.cornish/GCBA815/genome/chr22_dbsnp_138.vcf -o vcf_files/SAMPLE_NAME.raw.vcf
15. Split the variants into indels and SNVs
- java -jar /home/adam.cornish/GCBA815/tools/GenomeAnalysisTK-3.4.jar -T VariantAnnotator -R genome/chr22.fa -A VariantType --variant vcf_files/SAMPLE_NAME.raw.vcf -o vcf_files/SAMPLE_NAME.raw.vartype
- grep -P '^#' vcf_files/SAMPLE_NAME.raw.vartypevcf_files/SAMPLE_NAME.raw.snv.vcf
- grep -P '^#' vcf_files/SAMPLE_NAME.raw.vartypevcf_files/SAMPLE_NAME.raw.indel.vcf
- grep -P 'VariantType=SNP' vcf_files/SAMPLE_NAME.raw.vartypevcf_files/SAMPLE_NAME.raw.snv.vcf
- grep -P '(VariantType=DEL|VariantType=INS)' vcf_files/SAMPLE_NAME.raw.vartypevcf_files/SAMPLE_NAME.raw.indel.vcf
- rmvcf_files/*vartype
16. Filter indels:
- java -jar /home/adam.cornish/GCBA815/tools/GenomeAnalysisTK-3.4.jar -T VariantFiltration -R genome/chr22.fa -V vcf_files/SAMPLE_NAME.raw.indel.vcf -filter 'QD < 2.0 || FS > 200.0 || SOR > 10.0 || InbreedingCoeff < -0.8 || ReadPosRankSum < -20.0' -filterName 'indel_filter' -o vcf_files/SAMPLE_NAME.filtered.indel.vcf
17. Filter SNVs:
- java -jar /home/adam.cornish/GCBA815/tools/GenomeAnalysisTK-3.4.jar -T VariantFiltration -R genome/chr22.fa -V vcf_files/SAMPLE_NAME.raw.snv.vcf -filter 'QD < 2.0 || FS > 60.0 || SOR > 4.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0' -filterName 'SNV_filter' -o vcf_files/SAMPLE_NAME.filtered.snv.vcf
18. Annotate indels:
- java -jar /home/adam.cornish/snpeff_data/snpEff/snpEff.jar -c /home/adam.cornish/snpeff_data/snpEff/snpEff.config -s vcf_files/SAMPLE_NAME.snv.html hg19 vcf_files/SAMPLE_NAME.filtered.snv.vcf > vcf_files/SAMPLE_NAME.annotated.snv.vcf
19. Annotate SNVs:
- java -jar /home/adam.cornish/snpeff_data/snpEff/snpEff.jar -c /home/adam.cornish/snpeff_data/snpEff/snpEff.config -s vcf_files/SAMPLE_NAME.snv.html hg19 vcf_files/SAMPLE_NAME.filtered.snv.vcf > vcf_files/SAMPLE_NAME.annotated.snv.vcf
20. Kill the script command: ctrl+D