Multiplex sequencing of pooled mitochondrial genomes – a crucial step towards biodiversity analysis using mito-metagenomics

Min Tang1, Meihua Tan2,1, Guanliang Meng3,1, Shenzhou Yang1, Xu Su1, Shanlin Liu1, Wenhui Song1, Yiyuan Li1, Qiong Wu1, Aibing Zhang4, Xin Zhou1*

1 China National GeneBank, BGI-Shenzhen, Beishan Road, Beishan Industrial Zone, Yantian District, Shenzhen, Guangdong Province, 518083, China

2 University of Chinese Academy of Sciences, 19A Yuquan Road, Shijingshan District, Beijing, 100094, China

3 China University of Geosciences, 388 Lumo Road, Wuhan, 430074, China

4 Capital Normal University, Beijing, 100094, China

* Xin Zhou. Tel: +86-0755-25273620; Fax: +86-0755-25273620; Email:

Present Address: Yiyuan Li, Department of Biological Sciences, Galvin Life Science Center, University of Notre Dame, IN 46556, USA

Appendix 2. Supplementary notes

S1. Pipeline: software and scripts

We summarize the software and scripts used in the paper. Relevant files (package “mt10kScripts”) are available at: http://sourceforge.net/projects/mt10k/files/?source=navbar.

1.Data filtering:

#adapter remove QC: rawread_filter.pl

Shell command:

perl /bin/rawread_filter.pl -lis list.txt -nma 5 -qua 10 -zip n

Result files: 130608_I189_FCD20KDACXX_L8_SZAXPI029527-158_1.fq 130608_I189_FCD20KDACXX_L8_SZAXPI029527-158_2.fq

*Change name to: clean_data.1.fq, clean_data.2.fq

#Transfer fastq to fasta:

Shell command:

perl /bin/fq2fa.pl clean_data.1.fq clean_data.1

perl /bin/fq2fa.pl clean_data.2.fq clean_data.2

Result files: clean_data.1.fa, clean_data.2.fa

Shell command:

cat ./clean_data.1.fa ./clean_data.2.fa > clean_data.1_2.fa

#preliminary mito-reads searching:

Shell command:

/bin/blastall –p blastn -i clean_data.1.fa -F F -d 20140310_phylum_Arthropoda_699_classs_Asteroidea_7_family_Cyprinidae_10.Refseq.fasta -o cleandata.1.blastout -m 8 -e 1e-5

/bin/blastall –p blastn -i clean_data.2.fa -F F -d 20140310_phylum_Arthropoda_699_classs_Asteroidea_7_family_Cyprinidae_10.Refseq.fasta -o cleandata.2.blastout -m 8 -e 1e-5

Result files: cleandata.1.blastout cleandata.2.blastout

*Select reads with identity > 30% into file “raw_reads12_seq”

#Second round of mito-read filtering using a 51-mer search:

Shell command:

perl /bin/onerun.pl raw_reads12_seq clean_data.1_2.fa

Result file: idout

perl /bin/final.pl clean_data.1.fq clean_data.2.fq idout

Result files: f_clean_data.1.fq f_clean_data.2.fq. Only these two files were used for subsequent assemblies.

2. Assembly

#Parallel assembly using 3 different programs: [configuration file “assembly.lib” used here is available in package “mt10kScripts”, please reset your own parameters properly]

a.  SOAPdenovo

Shell command:

/bin/soapdenovo-63mer all -s assembly.lib -K 61 -u -R -k 45 -o k61 > k61.infor

less k61.scafSeq |perl -e ‘while(>){chomp; if(/>/){my @a=split /\s+/; print”$a[0]\*$a[1]\_D\n”;}else{print”$_\n”;} }’ > denovo61.scafSeq

Result file: denovo61.scafSeq

b.  SOAPdenovo-Trans

Shell command:

/bin/SOAPdenovo-Trans-127mer pregraph -s assembly.lib -K 71 -o k71 -i 10 -n

/bin/SOAPdenovo-Trans-127mer contig -g k71 -q 5 -Q 2 -e 2 -M 3

/bin/SOAPdenovo-Trans-127mer map -s assembly.lib -g k71 -r -f

/bin/SOAPdenovo-Trans-127mer scaff -g k71 -F -L 100 -t 1 –r

less k71.scafSeq |perl -e ‘while(>){chomp; if(/>/){my @a=split /\s+/; if(/Locus/){ print”$a[0]\*$a[2]\_T\n”;}else{print”$a[0]\*$a[1]\_T\n”;}}else{print”$_\n”;} }’ > trans71.scafSeq

Result file: trans71.scafSeq

c.  IDBA-UD

Shell command:

/bin/fq2fa --merge f_clean_data.1.fq f_clean_data.2.fq f_clean_data.f.fa

/bin/idba_ud -r f_clean_data.f.fa -o ./ --num_threads 12

Result file: scaffold.fa

*Pool different assemblies together & annotation for mito-scaffold extraction:

Shell command:

cat denovo61.scafSeq trans71.scafSeq scaffold.fa > multi_assembly.fa

perl /bin/MT_annotation_BGI.pl -i multi_assembly.fa -d /bin/MitoProtein_RefSeq_Arthropoda604_starfish2_zebrafish1.fasta -o ./

perl /bin/getMitoGscaf.pl multi_assembly.fa.solar.genewise.gff.cds.position.cds multi_assembly.fa multi

mv multi_mitogeneScaf.fa multi_scaffold.fa

Result file: multi_scaffold.fa

#Scaffold concatenation

Shell command:

perl /bin/tgicl.nozmsort.pl multi_scaffold.fa -l 100 -c 10 -v 10000 -p 99 -O '-repeat_stringency 0.95 -minmatch 35 -minscore 35' & echo finish tgicl at `date +%y-%m-%d.%H:%M:%S`i

cat ./asm_*/align > align & echo get align at `date +%y-%m-%d.%H:%M:%S`

perl /bin/phrap.id.list.pl align align & echo get cluster id at `date +%y-%m-%d.%H:%M:%S`

perl /bin/get_single.pl align.cluster multi_scaffold.fa single & echo get single at `date +%y-%m-%d.%H:%M:%S`

cat asm_*/contigs > asm_cluster.fa & echo get cluster sequences at `date +%y-%m-%d.%H:%M:%S`

cat single.prefect.fa > single.fa & echo get singleton sequences at `date +%y-%m-%d.%H:%M:%S`

cat asm_*/contigs single.fa > tgicl_cluster_and_single.fa & echo get cluster out file at `date +%y-%m-%d.%H:%M:%S`

Result file: tgicl_cluster_and_single.fa

#TGICL tends to fail in merging overlapping terminuses when the overlapping regions are less than 100bp in length or either end contains Ns, e.g., CL21.Contig1 CL21.Contig2, CL5.Contig1 CL5.Contig2 CL5.Contig3. Consequently, manual examination of terminuses of assembled mitogenomes is necessary.

#Annotation & mito-scaffold extraction

Shell command:

perl /bin/MT_annotation_BGI.pl -i tgicl_cluster_and_single.fa -d /bin/MitoProtein_RefSeq_Arthropoda604_starfish2_zebrafish1.fasta -o ./

perl /bin/getMitoGscaf.pl tgicl_cluster_and_single.fa.solar.genewise.gff.cds.position.cds tgicl_cluster_and_single.fa multi

Result file: multi_mitogeneScaf.fa

#Calculate lengths and depths of mito-scaffolds:

Shell command:

perl /bin/fq2fa.pl f_clean_data.1.fq f_clean_data.1

Result file: f_clean_data.1.fa

perl /bin/fq2fa.pl f_clean_data.2.fq f_clean_data.2

Result file: f_clean_data.2.fa

cat ./f_clean_data.1.fa ./f_clean_data.2.fa > f_clean_data.fa

/bin/formatdb -i multi_mitogeneScaf.fa -p F -o F

/bin/blastall -i f_clean_data.fa -d multi_mitogeneScaf.fa -o reads1_2.blastout -m 8 -e 1e-5 -p blastn -F F

perl /bin/filterBlastoutbyLenIden.pl reads1_2.blastout 150 100 #Please change your parameter here properly.

perl /bin/get_coverage_depth_of_every_scaffold_from_blastout.v4.pl -fa multi_mitogeneScaf.fa reads1_2.blastout.len150.Iden100 > multi_mitogeneScaf.fa.length_abundance.txt

perl /bin/fa_length.pl multi_mitogeneScaf.fa

Result files: multi_mitogeneScaf.fa.length_abundance.txt multi_mitogeneScaf.fa.length

3. Taxonomic assignments for protein coding genes and scaffolds

Shell command:

less multi_mitogeneScaf.fa.solar.genewise.gff.cds.position.cds |perl -e '$/=">"; >;while(>){chomp; my @a=split /\n/; $title=shift @a; $seq=join"", @a; $title=~s/\s+/\_/g; print">$title\n$seq\n";}' > S_I.position.cds_reform

/bin/megablast -i S_I.position.cds_reform -d /bin/NCBImitoNT.fa -e 1e-5 -F F -m 8 -b 5 -o PCG_NCBI_MTal.out

perl /bin/blastout_merge_filter.pl PCG_NCBI_MTal.out

perl /bin/taxon.pl PCG_NCBI_MTal.out.1st.filter

less PCG_NCBI_MTal.out.1st.filter.taxa_infor |perl -e 'my %hash; while(>) {chomp (my $str=$_); my @a = split /\s+/, $str; unless($hash{$a[0]}){print "$str\n"; $hash{$a[0]}=1;}}' > besthit_taxon_infor.txt

perl /bin/PCG_taxon_assign.pl multi_mitogeneScaf.fa.length_abundance.txt besthit_taxon_infor.txt 49species.txt #Replace your own species.txt following the format in 49species.txt from the uploaded package "mt10kScripts".

*Retain mito-scaffolds assigned to input taxa (Fig. S1 and Section S2 in Appendix 2). Add sample ID to each scaffold following format in "confirmedCOI.txt" from the uploaded package “mt10kScripts” which includes confirmed CO1-scaffolds and non-CO1-scaffolds.

#Filter out unassigned COI-scaffolds for identification via BOLD:

Shell command:

less multi_mitogeneScaf.fa.solar.genewise.gff.cds.position.cds |perl -e '$/=">"; >; while(>){next unless(/COX1/); chomp; my @a=split /\n/; my $title=shift @a; $title=~s/\[mRNA\]//; $title=~s/\s+/_/g; my $cds=join"", @a; print">$title\n$cds\n";}' > mito_COI.cds

perl /bin/get_unassignedCOI.pl multi_mitogeneScaf.fa.length_abundance.txt mito_COI.cds confirmedCOI.txt

Result file: unassigned_COI.cds

*Identify COIs (unassigned_COI.cds) via BOLD

*Produce a file following format in “V1confirmedScaf.txt” from the uploaded package “mt10kScripts”.

4. Mitogenome construction

#Select candidate assemblies that may contain erroneous concatenations: scaffolds with repetitive PCGs, complete circular mitogenomes, and scaffolds with overlapped regions but missed by TGICL.

Shell command:

perl /bin/assign2table.pl multi_mitogeneScaf.fa.length multi_mitogeneScaf.fa.length_abundance.txt V1confirmedScaf.txt multi_mitogeneScaf.fa.solar.genewise.gff.cds.position.cds

*Result file: gene_scaf_table.txt -> Copy into Excel; sort by "Sample_ID" and "genename1" -> produce a file following format in “Vscaf_gene_table.xlsx” from the uploaded package “mt10kScripts”.

Shell command:

perl /bin/mito_scaf.pl multi_mitogeneScaf.fa Vscaf_gene_table.txt

perl /bin/overCircular_check.pl mito_scaffolds.fa Vscaf_gene_table.txt #-> overCircularScaf.fa

perl /bin/circle_check.pl mito_scaffolds.fa #->circle_checked.txt

*Find overlapped regions for remaining scaffolds longer than 15,000 bp using Geneious. These overlapped regions might contain a few different nucleotides due to artefact caused by TGICL and were overlooked by the script.-> updated V1mitoscaffolds.fa

#Annotation for V1mitoscaffolds.fa & update scaf_gene_table.txt

Shell command:

perl /bin/MT_annotation_BGI.pl -i V1mitoscaffolds.fa -d /bin/MitoProtein_RefSeq_Arthropoda604_starfish2_zebrafish1.fasta -o ./

perl /bin/fa_length.pl V1mitoscaffolds.fa

perl /bin/assign2table.pl V1mitoscaffolds.fa.length multi_mitogeneScaf.fa.length_abundance.txt V1confirmedScaf.txt V1mitoscaffolds.fa.solar.genewise.gff.cds.position.cds

Result file: gene_scaf_table.txt -> Copy into Excel; sort by "Sample_ID" and "genename1" -> produce a file following format in “Vscaf_gene_table.xlsx” from the uploaded package “mt10kScripts”.

Result file for next step: updated Vscaf_gene_table.txt -> V2scaf_gene_table.txt

#Convert all mito-scaffolds to the "same" strand (the strand encoding COX1) for cross-taxon alignment.

Shell command:

perl /bin/Ori_Unify.pl V1mitoscaffolds.fa V2scaf_gene_table.txt V2

Result files: V2.mitoscaffolds.fa V2.scaf_gene_table.txt

*Sort the table by "Sample_ID" and "genename1".

*For successfully assembled PCGs that failed in annotation, estimated positions are provided.

#PCGs extraction & Alignment:

Shell command:

perl /bin/geneSeq_get_ref_post_Reverse_complement.pl V2.mitoscaffolds.fa V2.scaf_gene_table.txt COX1

perl /bin/geneSeq_get_ref_post_Reverse_complement.pl V2.mitoscaffolds.fa V2.scaf_gene_table.txt COX2

perl /bin/geneSeq_get_ref_post_Reverse_complement.pl V2.mitoscaffolds.fa V2.scaf_gene_table.txt COX3

perl /bin/geneSeq_get_ref_post_Reverse_complement.pl V2.mitoscaffolds.fa V2.scaf_gene_table.txt CYTB

perl /bin/geneSeq_get_ref_post_Reverse_complement.pl V2.mitoscaffolds.fa V2.scaf_gene_table.txt ATP6

perl /bin/geneSeq_get_ref_post_Reverse_complement.pl V2.mitoscaffolds.fa V2.scaf_gene_table.txt ATP8

perl /bin/geneSeq_get_ref_post_Reverse_complement.pl V2.mitoscaffolds.fa V2.scaf_gene_table.txt ND1

perl /bin/geneSeq_get_ref_post_Reverse_complement.pl V2.mitoscaffolds.fa V2.scaf_gene_table.txt ND2

perl /bin/geneSeq_get_ref_post_Reverse_complement.pl V2.mitoscaffolds.fa V2.scaf_gene_table.txt ND3

perl /bin/geneSeq_get_ref_post_Reverse_complement.pl V2.mitoscaffolds.fa V2.scaf_gene_table.txt ND4

perl /bin/geneSeq_get_ref_post_Reverse_complement.pl V2.mitoscaffolds.fa V2.scaf_gene_table.txt ND4L

perl /bin/geneSeq_get_ref_post_Reverse_complement.pl V2.mitoscaffolds.fa V2.scaf_gene_table.txt ND5

perl /bin/geneSeq_get_ref_post_Reverse_complement.pl V2.mitoscaffolds.fa V2.scaf_gene_table.txt ND6

Result files: (PCGs)_raw_seq.fa (PCGs)_gene_cut_record.txt

#PCGs: COX1 COX2 COX3 CYTB ATP6 ATP8 ND1 ND2 ND3 ND4 ND4L ND5 ND6

* PCGs of 6 model insects (Drosophila melanogaster, D. simulans, D. pseudoobscura, Aedes aegypti, Danaus plexippus, and Tribolium castaneum) are downloaded from NCBI and used as the reference profile. Extracted PCGs of assembled scaffolds are aligned with the reference profile using Clustalw.

*Alignments are manually examined using MEGA. Redundant Ns are removed or additional gaps are added according to the alignment results.

Results file: (PCGs)_aligned.fas

#Place aligned PCGs back to replace the extracted PCGs.

Shell command:

perl /bin/get_alignedGeneBack_V2.1.pl V2.mitoscaffolds.fa (PCGs)_aligned.fas (PCGs)_gene_cut_record.txt

Result file: PCG_aligned_updated_Mito_scaf.fa -> Change the name to V2.mitoscaffolds.fa for placing back PCGs in the next step.

*Manually check corresponding assemblies from 3 assemblers to correct artefacts by TGICL -> produce updated mito-scaffolds: V3.mitoscaffolds.fa

#Verifying erroneous sites with exceptionally low coverage

Shell command:

/bin/formatdb -i V3.mitoscaffolds.fa -p F -o F

/bin/blastall -i f_clean_data.fa -d V3.mitoscaffolds.fa -m 8 -e 1e-5 -p blastn -F F -o readsMapping.out

perl /bin/filterBlastoutbyLenIden.pl readsMapping.out 150 100 #Result file: readsMapping.out.len150.Iden100

perl /bin/get_coverage_depth_of_every_scaffold_from_blastout.v4.pl -fa V3.mitoscaffolds.fa readsMapping.out.len150.Iden100 > V3.mitoscaffolds.length_abundance.txt

Result files: readsMapping.out.len150.Iden100.depth.pos & V3.mitoscaffolds.length_abundance.txt

perl /bin/depth.pos.svg_v6.pl readsMapping.out.len150.Iden100.depth.pos

Result file: (scaffoldID).svg

*Identifying sites with unexceptionally low depth

#Aligning reads to regions with potential errors

Shell command:

/bin/bwa index -a is V3.mitoscaffolds.fa

/bin/bwa aln -I V3.mitoscaffolds.fa f_remove_adapter_high_quality.1.fq.gz > V3.mitoscaffolds.1.aln.out.sai

/bin/bwa aln -I V3.mitoscaffolds.fa f_remove_adapter_high_quality.2.fq.gz > V3.mitoscaffolds.2.aln.out.sai

/bin/bwa sampe V3.mitoscaffolds.fa V3.mitoscaffolds.1.aln.out.sai V3.mitoscaffolds.2.aln.out.sai f_remove_adapter_high_quality.1.fq.gz f_remove_adapter_high_quality.2.fq.gz |gzip > V3.mitoscaffolds.aln-pe.sam.gz

/bin/samtools view -bSh V3.mitoscaffolds.aln-pe.sam.gz > V3.mitoscaffolds.aln-pe.bam

/bin/samtools sort V3.mitoscaffolds.aln-pe.bam V3.mitoscaffolds.aln-pe.sorted

/bin/samtools index V3.mitoscaffolds.aln-pe.sorted.bam

/bin/samtools mpileup -f V3.mitoscaffolds.fa V3.mitoscaffolds.aln-pe.sorted.bam > V3.mitoscaffolds.aln-pe.sorted.pileup

*Correct problematic bases referring to V3.mitoscaffolds.aln-pe.sorted.pileup

S2. Taxon assignments for protein coding genes and mito-scaffolds

Fig. S1. Taxon assignments for PCGs and scaffolds

1.  A table was generated first with a list of taxonomy hierarchies for the input taxa (“Morphological taxon table”). Taxonomic ranks were provided to the finest level for all pooled taxa (i.e., Phylum, Class, Order, Family, Sub-family, Genus, and Species). When multiple taxa belonged to the same higher taxonomic rank, the higher ranks were coded in a way to include the range of species codes. E.g., sp3, sp4, and sp5 were all from the same genus g(3-5), the same subfamily sf(3-5), etc.

2.  All protein coding genes (PCGs) identified on the final scaffolds were each megablasted against the NCBI PCG reference database (containing 886,010 genes), producing a list of top-hit matches. For each of these top-hit NCBI matches, a table of taxonomy hierarchies was also generated (“Top hit taxonomic hierarchy table”) as in Step 1.

3.  The top hit NCBI taxon for each of all PCGs was compared with the “Morphological taxon table” list from the finest taxonomic rank (e.g., Species). If a species name was matched with the input taxon list, a species-level match was made for the given PCG; otherwise, higher taxon ranks were subsequently compared one by one till the Order-level. For example, if a ND2 gene from the scaffolds was blasted with a best match with a NCBI reference sequence that was listed as Drosophila repleta, which was not included in the 3 Drosophila species in our input list, this ND2 gene was then assigned to the Genus Drosophila.

4.  After Steps 1-3, PCGs identified from the final scaffolds should have all been assigned to a taxon rank, which was listed in the input taxon table.

5.  Taxon assignments for scaffolds followed these rules (examples were taken from real data):

a)  If all PCGs assembled on the same scaffold were assigned to the same taxon, the scaffold was then assigned to the corresponding taxon (e.g., case).

b)  If all PCGs of the same scaffold were assigned to different taxon levels (e.g., some to species-level, others to genus-level), but with no conflicting assignment results (e.g., species belonging to the genus), the species-level assignment was made to the scaffold (e.g., ).

c)  Sometimes when multiple species from the same genus (or higher ranks) were pooled (e.g., ), some of them could not be assigned to the finest taxon level (e.g., ) due to incompleteness of the reference library. However, the exact taxon assignment could be made for using a process of elimination because congeners were already assigned to species). If the assignments for these closely related taxa failed, Sanger sequences would be used for verification of the assembled scaffolds.

d)  When fragmented scaffolds with similar average depths could be assigned to the same taxon, and the PCGs identified on these scaffolds were not overlapping, these scaffolds could be associated to the same taxon and virtually linked on a super-scaffold (connected by dashed lines in Fig. 2, e.g., ).

e)  Occasionally, PCGs of the same scaffolds showed conflicting taxon assignment results due to poor coverage of reference sequences, final assignment for the scaffold was inferred based on a majority consensus basis. For example, in , one of the 6 PCGs (marked in red color) of 2 scaffolds showed conflicting results from other PCGs by assigning the PCG to subfamily 9 (sf9), while most other PCGs were assigned to sf(10-11). The assignment to sf9 was then considered erroneous, which was likely caused by the incompleteness of the reference for that specific gene. In , one PCG was assigned to g11, which was NOT conflicting with assignments of other PCGs (except for the one assigned to sf9). Therefore scaffold was assigned to g11, which only contained 1 species (sp11) in the input taxon table. Consequently, was then assigned to g10 (and thus sp10).

S3. Mitogenome annotation

Fig. S2. Annotation for 6 assembled mitogenomes using Geneious

The protein-coding, tRNA and rRNA genes were annotated using Geneious V6.1.2: ORFs were identified first; CDS were confirmed by ORF positions and homology-based searches against the reference mitogenome database downloaded from GenBank; tRNAs were identified by ARWEN webserver with default parameters and tRNAscan1.21 webserver (http://lowelab.ucsc.edu/tRNAscan-SE/) using search mode “EufindtRNA->cove”, source “mito/chloroplast”, genetic code “Invertebrate Mito” and cove score cutoff at 0.1. Finally, A+T-rich regions were determined via sequence alignments with homologous regions of full-length insect mitochondrial genomes.

All mitogenomes assembled in this study are available at NCBI with accession numbers KM244654 - KM244713. Circular mitogenomes of 6 species are presented in Fig. S3.

S4. Correlation of DNA quality and assembly quality