SOAPindel 1.0 User Guide

Shengting Li

Bioinformatics Research Centre (BiRC), AarhusUniversity, Denmark
Beijing Genomics Institute, Shenzhen (BGI-Shenzhen), China

August 9, 2011

1.Introduction

2.Requirements

3.Workflow

4.Command & Input files

5.Parameters

String & Number Parameters

Switch Parameters

6.Output files

Log

Result

1.Introduction

SOAPindel 1.0 is focusing on calling indels from the next-generation paired-end sequencingdata.

2.Requirements

SOAPindel needs two input data sources:

  1. The reference sequence file used to align the reads. It must be in Fasta format.
  2. The files with read-alignments. SOAPindel accepts both SOAP and SAM formats as input. When the input files are in SOAP format, users must also provide the raw reads files in Fasta or Fastq format.SOAPindel can guess the library insert sizes by itself, but if users could provide the correct ones, it will save some time.

Sequencing platform:Theoretically, SOAPindel is designed for all paired-end sequencing data because it doesn’t consider any qualities for now. So it works for both Ilumina GA data and 454 data, but we’ve only tested it on Ilumina GA data.

3.Workflow

  1. Get fuzzy postions of unmatched reads
  2. Sort fuzzy postions
  3. Find possible SNPs in all potential regions
  4. Cluster unmatched reads to regions
  5. Split reference sequence
  6. Get unmatched reads
  7. Get matched reads
  8. Do local assembly and alignment to get indels
  9. Filter results

4.Command & Input files

perl indel_detection.pl <mapping.listreference_seq(DIR|FILE)> [<reads.list>]

  1. mapping.list>: reads mapping files list. File format:

path_of_alignment_filelib_insert_sizedeviation_of_insert_sizereads_lengthtype_of_file

path_of_alignment_file: better set to absolute path.

lib_insert_size: insert size of library

deviation_of_insert_size: normally set to SD*2 (~20% of library insert size).

reads_length: max read length in the alignment file

type_of_file: this value should be set to

SINGLE: SOAP unpaired alignment hits file

PAIR: SOAP alignment file, which is not necessary to occur in the list. Only when you want to calculate the lib_insert_size and deviation automatically, you could add this file and must add it right before the corresponding “SINGLE” file in the list.

BOTH:bwa alignment file

  1. reference_seq(DIR|FILE): reference sequence folderor fasta file.

Folder:All reference sequence files must be put in this folder and named like <chromosome_name.fa>. Each file contains only one chromosome. Every file should be in Fasta format and have id: “>chromosome_name“ (case sensitive).

Fasta file: all sequence must be put in a fasta file and have id:“>chromosome_name “.

  1. reads.list: reads files list. Only needed by SOAP, just put the absolute path of files in each line.
  2. If you put pared-end reads in separate files, the pared-end reads id could be same but the file name must contain _a, _b or _1, _2 so SOAPindel can tell the order of reads.
  3. If you put pared-end reads in one file, the reads name must be like reads_id/1 & reads_id/2.

The command could be run directly, but for the big data set (like one or more chromosomes), users are suggested to use “-p”to print out the script and run it manually (please see the details in Parameters section).

5.Parameters

String & Number Parameters

-chr chromosomes (ALL)

This parameter could be a string of chromosomes with regions or not, split by underline “+”, e.g. “chr7:26730761-27230760+chr7:89428340-90542763+chr20” means detecting all indels from 26730761 to 27230760 and 89428340 to 90542763 on chr7 and the whole chr20. If you want to detect a lot of regions, you could put them in one file (each line contains a region) and use the file name as the parameter. DEFAULT: ALL, means detect every thing in the alignment files.

-llayer_num (2)

The min valley depth allowed in one cluster. Any thinner layer will be cut and start to find a new cluster. DEFAULT: 2, could be set to 1/10 of sequencing coverage and not less than 2.

-k kmer size (25)

Dimensional of de Bruijn graph. DEFAULT: 25, should be not more than ½ reads length and not less than 10.

-cpucpu number (7)

Number of local CPUs used. This number doesn’t affect jobs number of qsub or other multi-task submitting methods.

-ext extension length for every cluster (50)

In order to find the indels near the edge of cluster, SOAPindelextends every cluster on each side. DEFAULT: max(ext, reads length), take the longer one.

-wwindow size for successive homogeneous indels checking (30)

In order to filter out false positive sites, SOAPindel limits max number of successive homogeneous indels in a certain size window. DEFUALT: 30

-nmax successive homogeneous indels #/window (2)

Max number of successive homogeneous indels allowed. DEFAULT: 2, should be set to higher number if the divergence of the reference and sample is too big, e.g. 7 for Human vs Chimp.

-x_numread_max_hits# (1)

Don’t consider any reads that hit the reference more than this parameter. DEFAULT: 1, means don’t consider any repeats.

-mmmax_mismatch (int(read_len/25)+1)

Max mismatches allowed in one read. DEFAULT: int(read_len/25)+1.

-il max gap between nails (100)

SOAPindel will stop extending the path if the distance of two adjacent nails is too big. This parameter could set the limitation of indel size softly. DEFAULT: 100

-ol overlap length when cut cluster by max length (il)

If the cluster is too long, SOAPindel will cut it into segments with certain length and certain overlap. DEFAULT: 100

-xl max cluster length (ol*3)

If the cluster is longer than this parameter, SOAPindel will cut it into segments with certain length and certain overlap. DEFAULT: 300

-fmt mapping format (SOAP) [SOAP|SAM]

The format of input alignment files. Only support SOAP and SAM (.sam or .bam, .bam need the path of samtools) formats now. DEFAULT: SOAP

-sdp use bigger deviation (deviation*sdp) to filter insert size of pair aligned reads (2.0)

SOAPindel considers a pair of reads unmapped if the distance between them is biggerthan insert_size*(1+deviation*sdp) or smaller than insert_size*(1-deviation*sdp). DEFAULT: 2.0

-wdwork_dir (.)

SOAPindel will create this folder for all internal and final result.DEFAULT: current folder.

-cmpath_of_cross_match, only works when -ucm is set ()

-st path of samtools, only need for .bam file

-mc max contigs to align (100)

If the local assembly generates too many contigs, the local alignment only aligns part of them. DEFAULT: 100

-mx unique reads max reliable coverage (auto)

Even if filter reads with SNPs, sometimes there are still too many coverage of reads in some regions (mostly repeats) and could affect the assembly unexpectedly. SOAPindel will drop reads randomly to reduce the coverage to a reasonable number.

-aa 0|1|2 (0)

0: Normal

1: Only run script before assembly-alignment

2: Only run script after assembly-alignment

When user want to run assembly-alignment part manually, this parameter could help to split the script.

Switch Parameters

-p only print script of every step on screen, don't run the database.

Print out the script for user to run it manually.

-qseq print sampleseq for PCR

SOAPindel doesn’t output sequence of sample around the indels by default because it will generate too huge result. When users need to do PCR validation or similar options, this parameter could help.

-qsub use qsub to submit jobs

If users want to use PBS (qsub) or other batch jobs submission system, the @run_sh variable in the indel_detection.pl must be modified manually in order to fit the object system.

-pp print progress in log (debug)

-no_fs don't filter reads with SNPs (debug)

-ucm use cross_match to do local alignment (debug)

-t test time and memoery(debug)

6.Output files

Log

SOAPindel stores running log in the WORK_DIR/log folder. Users can use “cat WORK_DIR/log/*.log to check if there is something wrong during the process.

If the “-pp” option switched on, users can see the progress in percentage.

Result

SOAPindel stores results in the WORK_DIR/result/chr*, one folder for one chromosome.There would be8 files for each chromosome:

chr*_2L.mutation.raw
chr*_2L.mutation.raw2
chr*_2L.mutation.sorted
chr*_2L.mutation_sfo1.list
chr*_2L.heterozygous.list
chr*_2L.indel.list
chr*.HEAD
chr*_2L.indel.VCF

All filtered mutations (include SNPs, only those within 300bp around the indel would be detect) are listed in the file:

WORK_DIR/result/chr*/chr*_2L.mutation_sfo1.list.

The indels are listed in the file: WORK_DIR/result/chr*/chr*_2L.indel.list.

chr*_2L.indel.VCF is the vcfv4.0 format file. chr*.HEAD is the head of vcf file.

All these files are same format. Here is the description of each columns:

  1. cluster regions id
  2. assembled contigs id
  3. chromosome
  4. type[-size](S:SNP; D:Deletion; I:Insertion; N:Heterozygote Position)
  5. start position on reference chromosome
  6. localstart position on thecluster.
  7. indel range (local start -local end: indels could happen on any position between the start and end, so the local end - local start + size = tandem repeat length, and tandem repeat length / size = repeat times)
  8. indel allele
  9. average coverage for insertion; minimum coverage for deletion
  10. minimum coverage for insertion; overlapping coverage for deletion

“+” for homozygote; sample genotype / reference genotype ratio for heterozygote

  1. L(left flank length from the assembled contig start to indel start)_R(right flank length from indel end to the contig end)
  2. 10 bp left flank on reference_10bp right flank on reference
  3. Empty by default; if the -qseq is set, this column is (left flank of the contig)_(right flank of the contig) without allele sequence.
  4. used internal
  5. successive homogeneous indels #/window size
  6. used internal

And there is one statistics file: WORK_DIR/result/chrALL.stat.tab.you’d better keep the “result” directory, if you want to save storage, those two files,*mutation.raw2 and *mutation.sorted, could be removed.