Workshop for NGS data analysis

(GCBA 815)

Overview

Steps

  1. Process raw data
  2. Check sequencing quality
  3. Filter and / or trim reads if necessary
  4. Map to genome
  5. Mark duplicates
  6. Realign indels
  7. Recalibrate bases
  8. Variant discovery
  9. Call variants
  10. Filter variants
  11. Refine variant calls
  12. Annotate variants
  13. Evaluate variants

01. Log in using SSH:

Host name: cbsb.unmc.edu

Username: <your UNMC id>

Password: <your password>

02. Either: create a new folder for your project, or use the pre-made script.

  1. cd /storage/gcba815/USER_NAME
  2. mkdirNGS_analysis
  3. cd NGS_analysis

03. Log your progress:

  1. script NGS_analysis.txt

04. Create folders and copy necessary files:

  1. mkdir {alignments,raw_data,genome,tmp,vcf_files}
  2. cp /home/adam.cornish/GCBA815/raw_data/*fastqraw_data
  3. cp /home/adam.cornish/GCBA815/genome/chr22.fa genome

05. Build necessary indexes and dictionary:

  1. samtoolsfaidx genome/chr22.fa
  2. java -jar /home/adam.cornish/GCBA815/tools/picard.jar CreateSequenceDictionary R=genome/chr22.fa O=genome/chr22.dict
  3. cd genome & bwa index chr22.fa
  4. cd ../raw_data

06. Assess the quality of the reads:

  1. fastqc -f fastq read1.fastq read2.fastq

07. Filter / Trim the reads if necessary:

  1. fqtrim -l 50 -p 8 -o filtered.fastq read1.fastq,read2.fastq
  2. fastqc -f fastq read1.filtered.fastq read2.filtered.fastq
  3. cd ..

08. Align the reads to chromosome 22:

  1. bwamem -R "@RG\tID:SAMPLE_NAME\tPL:illumina\tSM:SAMPLE_NAME" -t 8 genome/chr22.fa raw_data/read1.filtered.fastq raw_data/read2.filtered.fastq > alignments/SAMPLE_NAME.raw.sam

09. Convert the SAM file to BAM:

  1. samtools view -bS alignments/SAMPLE_NAME.raw.sam -o alignments/SAMPLE_NAME.raw.bam

10. Sort the BAM file:

  1. java -jar /home/adam.cornish/GCBA815/tools/picard.jar SortSam I=alignments/SAMPLE_NAME.raw.bam O=alignments/SAMPLE_NAME.sorted.bam SO=coordinate VALIDATION_STRINGENCY=SILENT

11. Mark PCR duplicates:

  1. java -jar /home/adam.cornish/GCBA815/tools/picard.jar MarkDuplicates I=alignments/SAMPLE_NAME.sorted.bam O=alignments/SAMPLE_NAME.dup.bam M=tmp/SAMPLE_NAME.mark_dups_metrics_file VALIDATION_STRINGENCY=SILENT
  2. samtools index alignments/SAMPLE_NAME.dup.bam

12. Realign around indels:

  1. java -jar /home/adam.cornish/GCBA815/tools/GenomeAnalysisTK-3.4.jar -T RealignerTargetCreator -R genome/chr22.fa -I alignments/SAMPLE_NAME.dup.bam -known /home/adam.cornish/GCBA815/genome/chr22_mills.vcf -o tmp/SAMPLE_NAME.intervals
  2. java -jar /home/adam.cornish/GCBA815/tools/GenomeAnalysisTK-3.4.jar -T IndelRealigner -R genome/chr22.fa -I alignments/SAMPLE_NAME.dup.bam -known /home/adam.cornish/GCBA815/genome/chr22_mills.vcf -o alignments/SAMPLE_NAME.indels.bam -targetIntervalstmp/SAMPLE_NAME.intervals

13. Perform base quality score recalibration:

  1. java -jar /home/adam.cornish/GCBA815/tools/GenomeAnalysisTK-3.4.jar -T BaseRecalibrator -R genome/chr22.fa -I alignments/SAMPLE_NAME.indels.bam -knownSites /home/adam.cornish/GCBA815/genome/chr22_dbsnp_138.vcf -o tmp/SAMPLE_NAME.grp
  2. java -jar /home/adam.cornish/GCBA815/tools/GenomeAnalysisTK-3.4.jar -T PrintReads -R genome/chr22.fa -I alignments/SAMPLE_NAME.indels.bam -BQSR tmp/SAMPLE_NAME.grp -o alignments/SAMPLE_NAME.BQSR.bam

14. Perform variant calling using HaplotypeCaller

  1. java -jar /home/adam.cornish/GCBA815/tools/GenomeAnalysisTK-3.4.jar -T HaplotypeCaller -R genome/chr22.fa -I alignments/SAMPLE_NAME.BQSR.bam --dbsnp /home/adam.cornish/GCBA815/genome/chr22_dbsnp_138.vcf -o vcf_files/SAMPLE_NAME.raw.vcf

15. Split the variants into indels and SNVs

  1. java -jar /home/adam.cornish/GCBA815/tools/GenomeAnalysisTK-3.4.jar -T VariantAnnotator -R genome/chr22.fa -A VariantType --variant vcf_files/SAMPLE_NAME.raw.vcf -o vcf_files/SAMPLE_NAME.raw.vartype
  2. grep -P '^#' vcf_files/SAMPLE_NAME.raw.vartypevcf_files/SAMPLE_NAME.raw.snv.vcf
  3. grep -P '^#' vcf_files/SAMPLE_NAME.raw.vartypevcf_files/SAMPLE_NAME.raw.indel.vcf
  4. grep -P 'VariantType=SNP' vcf_files/SAMPLE_NAME.raw.vartypevcf_files/SAMPLE_NAME.raw.snv.vcf
  5. grep -P '(VariantType=DEL|VariantType=INS)' vcf_files/SAMPLE_NAME.raw.vartypevcf_files/SAMPLE_NAME.raw.indel.vcf
  6. rmvcf_files/*vartype

16. Filter indels:

  1. java -jar /home/adam.cornish/GCBA815/tools/GenomeAnalysisTK-3.4.jar -T VariantFiltration -R genome/chr22.fa -V vcf_files/SAMPLE_NAME.raw.indel.vcf -filter 'QD < 2.0 || FS > 200.0 || SOR > 10.0 || InbreedingCoeff < -0.8 || ReadPosRankSum < -20.0' -filterName 'indel_filter' -o vcf_files/SAMPLE_NAME.filtered.indel.vcf

17. Filter SNVs:

  1. java -jar /home/adam.cornish/GCBA815/tools/GenomeAnalysisTK-3.4.jar -T VariantFiltration -R genome/chr22.fa -V vcf_files/SAMPLE_NAME.raw.snv.vcf -filter 'QD < 2.0 || FS > 60.0 || SOR > 4.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0' -filterName 'SNV_filter' -o vcf_files/SAMPLE_NAME.filtered.snv.vcf

18. Annotate indels:

  1. java -jar /home/adam.cornish/snpeff_data/snpEff/snpEff.jar -c /home/adam.cornish/snpeff_data/snpEff/snpEff.config -s vcf_files/SAMPLE_NAME.snv.html hg19 vcf_files/SAMPLE_NAME.filtered.snv.vcf > vcf_files/SAMPLE_NAME.annotated.snv.vcf

19. Annotate SNVs:

  1. java -jar /home/adam.cornish/snpeff_data/snpEff/snpEff.jar -c /home/adam.cornish/snpeff_data/snpEff/snpEff.config -s vcf_files/SAMPLE_NAME.snv.html hg19 vcf_files/SAMPLE_NAME.filtered.snv.vcf > vcf_files/SAMPLE_NAME.annotated.snv.vcf

20. Kill the script command: ctrl+D