Bioinformatics Processing Pipeline for pmoA Amplicons Sequencing Data
----based on protein sequences level
Basic Information
Author: Li Chaonan
E-mail:
Organization: Li XZ’s Lab, Chengdu Institute of Biology, Chinese Academy of Sciences
Notes
This pipeline was established based on existing bioinformatics tools and could be free available at: We integrated these functions with some scripts we implemented for data processing of deduced amino acid sequences. Please note that freeware not means no limitations. All codes copyright original author. Users are free to use, copy, modify or spread for academic necessary, but not for any commercial use without permission. Please contact Li, Chaonan or Li, Jiabao if there are any questions, mail: or
Running Environmental Requirements
Except some existing tools, we added some scripts to generate OTU table for downstream statistical analyses in this processing pipeline. These Python scripts were implemented using Python (version 2.7), Bio-Python packages and Linux shell. Therefore, please ensure your WorkStation meet these demands as following before use it to polish sequencing data.
(1)The Operate System must be Linux/UNIX (we have tested on Bio-Linux OS)
(2)Python interpreter (Version 2.7) and compatible Bio-Python package are necessary
(3)QIIME pipeline should be installed before use
(4)Other bioinformatics tools including FLASH, FastX, Usearch8, UCLUST, BLAST and Biom also are needed
(5)The command lines colored with red are the scripts we implemented; The detail usage of these scripts can be found though type “python filename.py -h” in terminal.
Database Construction
To align and annotation deduced amino acid sequence, we firstly constructed two databases: amino acid alignment database and species assignment database. These two databases referred to databases processing pmoA DNA sequences proposed by Marc, G.D. et al (2013)[1]. The detail steps we provided as following:
Step 1: Downloading amino acid sequences from protein database of GenBank
python amino_seq_download.py -i pmoa_dna_seq.fasta -d prot_gi_nucl_ac.txt -o prot_seqs.fasta
Step 2: Species assignment database construction
python species_assignment_db_creation.py -i pmoA_tax.txt -d prot_gi_nucl_ac.txt -o pmoA_assignment_db.txt
Step 3: Amino acid sequence alignment database construction
makeblastdb -in prot_seqs.fasta -dbtype prot -parse_seqids -out pmoA_balst_db
Sequences Processing
Step 1: Assemble pair-end reads using FLASH tool
/home/server/soft/FLASH/flash -m 30 -M 250 -x 0.25 -r 240 -f 400 -s 50 R1.fastq R2.fastq
Step 2: Split sequence data from library
cat out.extendedFrags.fastq | /home/server/soft/fastx_toolkit/scripts/fastx_barcode_splitter.pl --bcfile barcode.txt --mismatches 0 --prefix ./r1_ --suffix ".fastq" –-bol
Step 3: Generate reverse compliment sequences of unmatched reads from step 2
fastx_reverse_complement -i r1_unmatched.fastq -o reverse.fastq
Step 4: Split reverse sequence library generated by step 3
cat reverse.fastq | /home/server/soft/fastx_toolkit/scripts/fastx_barcode_splitter.pl --bcfile barcode.txt --prefix ./r2_ --suffix ".fastq" –-bol
Step 5: Combine all sequence samples generated by step 2 and step 3
cat r1_1. fastq r2_1. fastq > combination/1.fastq
…
cat *.fastq > m.fastq
…
Step 6: Quality control of all raw sequences
convert_fastaqual_fastq.py -c fastq_to_fastaqual -f m.fastq
split_libraries.py -m map.txt -f m.fna -q m.qual -s 20 -o m.lib -a 6 -r -l 350 -b 12 -e 0
Step 7: Chimeras sequences check (Usearch 8)
Firstly, we should split the file saving sequence of all sampleds produced in step 6 into single file. All these single file must be fasta format (end with “.fasta”) and be placed into a new folder in which any file except these fasta files should not appear.
split_sequence_file_on_sample_ids.py -i seqs.fna -o fa/
python chimeras_check.py –p /home/qual_data/single_file/
cat *.fasta > nonchimeras.fasta
Step 8: Frame shift check
After chimeras check, users should combine all samples file for frame shift checking. Before perform frame shift checking, FrameBot should be installed. The detail of installing and usage can be found at
mkdir frambot
java -jar ~/RDPTools/FrameBot.jar framebot -N -o ./frambot/licn_pmoA/home/student/RDPTools/Framebot/refset/pmoA_prot_ref.fasta ./nonchimeras.fasta
Step 9: Filter samples with low sequence counts
python filter_sample.py -i ./nonchimeras/ -o ./nonchimeras/cleaned_sample/ -n 2000
Step 10: OTU clustering using UCLUST algorithm
uclust --sort cleaned_readsresamples.fasta --output seqs_sorted.fasta
uclust --input seqs_sorted.fasta --uc reads_resample_clust_result.uc --id 0.93
uclust --uc2clstr reads_resample_clust_result.uc --output reads_resample_clust_result.clstr
Step 11: Pick OTUs
python pick_otu.py -i reads_resample_clust_result.clstr -o readsresample_prot_otu.txt
Step 12: Pick represent sequences
python pick_rep_seq.py -i readsresample_prot_otu.txt -o readsresample_prot_otu_rep_seq.fasta -f all_seqs_derep_prot_corr.fasta
Step 13: Amino acid alignment using BLAST
Please note that the parameter of ‘outfmt’ in blastp must be set as 7 because the current version of script we provided only can be used to parse this format.
blastp -query /home/student/converted_data/m.lib/fa/nonchimeras/resample/readsresampled_rep_prot.fasta -db pmoa -evalue 1e-10 -num_threads 10 -max_target_seqs 5 -outfmt 7 -out /home/student/converted_data/m.lib/fa/nonchimeras/resample/readsresampled_rep_prot_blastp.xml
Step 14: Parse result of BLAST
python read_blast.py -i readsresampled_rep_prot_blastp.xml -o readsresampled_rep_prot_blastp_result.txt
Step 15: Species assignment
python assign_taxonomy_prot_otu.py -d pmoa_aa_taxonomy.txt -b readsresampled_rep_prot_blastp_result_82.txt -s 82 -o readsresampled_rep_prot_blastp_result_assign_82.txt
Step 16: Make OTU table
python make_otu_table.py -i readsresample_prot_otu.txt -o readsresample_prot_otu_table.txt -a readsresampled_rep_prot_blastp_result_assign.txt -n 2
Step 17: Alpha diversity calculation
biom convert -i readsresample_prot_otu_table.txt -o readsresample_prot_otu_table.biom --table-type="OTU table" --process-obs-metadata taxonomy --to-hdf5
alpha_diversity.py -i readsresample_prot_otu_table.biom -m chao1,observed_otus,shannon -o div_alpha.txt
References
1.Marc, G.D., et al., Classification of pmoA amplicon pyrosequences using BLAST and the lowest common ancestor method in MEGAN. Frontiers in Microbiology, 2013. 5(34): p. 1-11.