Bioinformatics Processing Pipeline for Pmoa Amplicons Sequencing Data

Bioinformatics Processing Pipeline for pmoA Amplicons Sequencing Data

----based on protein sequences level

Basic Information

Author: Li Chaonan

E-mail:

Organization: Li XZ’s Lab, Chengdu Institute of Biology, Chinese Academy of Sciences

Notes

This pipeline was established based on existing bioinformatics tools and could be free available at: We integrated these functions with some scripts we implemented for data processing of deduced amino acid sequences. Please note that freeware not means no limitations. All codes copyright original author. Users are free to use, copy, modify or spread for academic necessary, but not for any commercial use without permission. Please contact Li, Chaonan or Li, Jiabao if there are any questions, mail: or

Running Environmental Requirements

Except some existing tools, we added some scripts to generate OTU table for downstream statistical analyses in this processing pipeline. These Python scripts were implemented using Python (version 2.7), Bio-Python packages and Linux shell. Therefore, please ensure your WorkStation meet these demands as following before use it to polish sequencing data.

(1)The Operate System must be Linux/UNIX (we have tested on Bio-Linux OS)

(2)Python interpreter (Version 2.7) and compatible Bio-Python package are necessary

(3)QIIME pipeline should be installed before use

(4)Other bioinformatics tools including FLASH, FastX, Usearch8, UCLUST, BLAST and Biom also are needed

(5)The command lines colored with red are the scripts we implemented; The detail usage of these scripts can be found though type “python filename.py -h” in terminal.

Database Construction

To align and annotation deduced amino acid sequence, we firstly constructed two databases: amino acid alignment database and species assignment database. These two databases referred to databases processing pmoA DNA sequences proposed by Marc, G.D. et al (2013)[1]. The detail steps we provided as following:

Step 1: Downloading amino acid sequences from protein database of GenBank

python amino_seq_download.py -i pmoa_dna_seq.fasta -d prot_gi_nucl_ac.txt -o prot_seqs.fasta

Step 2: Species assignment database construction

python species_assignment_db_creation.py -i pmoA_tax.txt -d prot_gi_nucl_ac.txt -o pmoA_assignment_db.txt

Step 3: Amino acid sequence alignment database construction

makeblastdb -in prot_seqs.fasta -dbtype prot -parse_seqids -out pmoA_balst_db

Sequences Processing

Step 1: Assemble pair-end reads using FLASH tool

/home/server/soft/FLASH/flash -m 30 -M 250 -x 0.25 -r 240 -f 400 -s 50 R1.fastq R2.fastq

Step 2: Split sequence data from library

cat out.extendedFrags.fastq | /home/server/soft/fastx_toolkit/scripts/fastx_barcode_splitter.pl --bcfile barcode.txt --mismatches 0 --prefix ./r1_ --suffix ".fastq" –-bol

Step 3: Generate reverse compliment sequences of unmatched reads from step 2

fastx_reverse_complement -i r1_unmatched.fastq -o reverse.fastq

Step 4: Split reverse sequence library generated by step 3

cat reverse.fastq | /home/server/soft/fastx_toolkit/scripts/fastx_barcode_splitter.pl --bcfile barcode.txt --prefix ./r2_ --suffix ".fastq" –-bol

Step 5: Combine all sequence samples generated by step 2 and step 3

cat r1_1. fastq r2_1. fastq > combination/1.fastq

…

cat *.fastq > m.fastq

…

Step 6: Quality control of all raw sequences

convert_fastaqual_fastq.py -c fastq_to_fastaqual -f m.fastq

split_libraries.py -m map.txt -f m.fna -q m.qual -s 20 -o m.lib -a 6 -r -l 350 -b 12 -e 0

Step 7: Chimeras sequences check (Usearch 8)

Firstly, we should split the file saving sequence of all sampleds produced in step 6 into single file. All these single file must be fasta format (end with “.fasta”) and be placed into a new folder in which any file except these fasta files should not appear.

split_sequence_file_on_sample_ids.py -i seqs.fna -o fa/

python chimeras_check.py –p /home/qual_data/single_file/

cat *.fasta > nonchimeras.fasta

Step 8: Frame shift check

After chimeras check, users should combine all samples file for frame shift checking. Before perform frame shift checking, FrameBot should be installed. The detail of installing and usage can be found at

mkdir frambot

java -jar ~/RDPTools/FrameBot.jar framebot -N -o ./frambot/licn_pmoA/home/student/RDPTools/Framebot/refset/pmoA_prot_ref.fasta ./nonchimeras.fasta

Step 9: Filter samples with low sequence counts

python filter_sample.py -i ./nonchimeras/ -o ./nonchimeras/cleaned_sample/ -n 2000

Step 10: OTU clustering using UCLUST algorithm

uclust --sort cleaned_readsresamples.fasta --output seqs_sorted.fasta

uclust --input seqs_sorted.fasta --uc reads_resample_clust_result.uc --id 0.93

uclust --uc2clstr reads_resample_clust_result.uc --output reads_resample_clust_result.clstr

Step 11: Pick OTUs

python pick_otu.py -i reads_resample_clust_result.clstr -o readsresample_prot_otu.txt

Step 12: Pick represent sequences

python pick_rep_seq.py -i readsresample_prot_otu.txt -o readsresample_prot_otu_rep_seq.fasta -f all_seqs_derep_prot_corr.fasta

Step 13: Amino acid alignment using BLAST

Please note that the parameter of ‘outfmt’ in blastp must be set as 7 because the current version of script we provided only can be used to parse this format.

blastp -query /home/student/converted_data/m.lib/fa/nonchimeras/resample/readsresampled_rep_prot.fasta -db pmoa -evalue 1e-10 -num_threads 10 -max_target_seqs 5 -outfmt 7 -out /home/student/converted_data/m.lib/fa/nonchimeras/resample/readsresampled_rep_prot_blastp.xml

Step 14: Parse result of BLAST

python read_blast.py -i readsresampled_rep_prot_blastp.xml -o readsresampled_rep_prot_blastp_result.txt

Step 15: Species assignment

python assign_taxonomy_prot_otu.py -d pmoa_aa_taxonomy.txt -b readsresampled_rep_prot_blastp_result_82.txt -s 82 -o readsresampled_rep_prot_blastp_result_assign_82.txt

Step 16: Make OTU table

python make_otu_table.py -i readsresample_prot_otu.txt -o readsresample_prot_otu_table.txt -a readsresampled_rep_prot_blastp_result_assign.txt -n 2

Step 17: Alpha diversity calculation

biom convert -i readsresample_prot_otu_table.txt -o readsresample_prot_otu_table.biom --table-type="OTU table" --process-obs-metadata taxonomy --to-hdf5

alpha_diversity.py -i readsresample_prot_otu_table.biom -m chao1,observed_otus,shannon -o div_alpha.txt

References

1.Marc, G.D., et al., Classification of pmoA amplicon pyrosequences using BLAST and the lowest common ancestor method in MEGAN. Frontiers in Microbiology, 2013. 5(34): p. 1-11.