Supplementary Appendix 2. Spike-in experiment with Salmonella typhimurium

The V3, V4 and V5 regions of the 16S rRNA gene are commonly used for bacterial sequencing. However, different regions contain different levels of variability and may differ in their power to detect bacterial species. The choice of PCR primers may also play a role. Therefore, we designed a spike-in experiment to assess the performance and classification accuracy of the three regions. The PCR primers developed for these regions were:

V3-341Forward: 5′-ACACTCTTTCCCTACACGACGCTCTTCCGATCTCCTACGGGAGGCAGCAG-3′

V3-518Reverse: 5′-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTATTACCGCGGCTGCTGG-3′

V4-Forward: 5′-ACACTCTTTCCCTACACGACGCTCTTCCGATCTGTGCCAGCMGCCGCGGTAA-3′

V4-Reverse: 5′-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGGACTACHVGGGTWTCTAAT-3′

V5-Forward: 5′-ACACTCTTTCCCTACACGACGCTCTTCCGATCTGATTAGATACCCTGGTAG-3′

V5-Reverse: 5′-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCCGTCAATTCMTTTGAGTTT-3′
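The three forward primers begin with the same stretch of sequence, as do the three reverse primers, and only the 3′ ends differ by region. The following Python sketch separates each primer into that shared 5′ prefix and its region-specific tail, and lists any IUPAC degenerate bases (M, H, V, W) in the tail. The appendix does not say what the shared prefix is; treating it as a common overhang is an assumption, and the helper names `split_shared_prefix` and `degenerate_positions` are illustrative only.

```python
import os

# Primer sequences copied verbatim from the list above (5' to 3').
primers = {
    "V3-341Forward": "ACACTCTTTCCCTACACGACGCTCTTCCGATCTCCTACGGGAGGCAGCAG",
    "V3-518Reverse": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTATTACCGCGGCTGCTGG",
    "V4-Forward": "ACACTCTTTCCCTACACGACGCTCTTCCGATCTGTGCCAGCMGCCGCGGTAA",
    "V4-Reverse": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGGACTACHVGGGTWTCTAAT",
    "V5-Forward": "ACACTCTTTCCCTACACGACGCTCTTCCGATCTGATTAGATACCCTGGTAG",
    "V5-Reverse": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCCGTCAATTCMTTTGAGTTT",
}


def split_shared_prefix(names):
    """Return the longest common 5' prefix and each primer's remaining tail."""
    seqs = [primers[n] for n in names]
    prefix = os.path.commonprefix(seqs)  # character-wise common prefix
    return prefix, {n: primers[n][len(prefix):] for n in names}


def degenerate_positions(seq):
    """List (index, base) for every base that is not A, C, G or T."""
    return [(i, b) for i, b in enumerate(seq) if b not in "ACGT"]


forward = [n for n in primers if n.endswith("Forward")]
reverse = [n for n in primers if n.endswith("Reverse")]
fwd_prefix, fwd_tails = split_shared_prefix(forward)
rev_prefix, rev_tails = split_shared_prefix(reverse)

for prefix, tails in ((fwd_prefix, fwd_tails), (rev_prefix, rev_tails)):
    print("shared 5' prefix:", prefix)
    for name, tail in tails.items():
        print(f"  {name}: {tail}  degenerate: {degenerate_positions(tail)}")
```

The degenerate codes follow the standard IUPAC nucleotide alphabet (M = A or C; W = A or T; H = A, C or T; V = A, C or G), so a primer containing them is in practice a mixture of several oligonucleotides.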

The Salmonella typhimurium we used is a laboratory strain, S. typhi 700720 (Cat# 700720D-5, ATCC). In the Greengenes database it is one of the bacteria used to define Salmonella enterica, with OTU_3620. DNA of Salmonella typhimurium was mixed with a DNA extract from a fecal sample donated by a healthy donor at 1%, 5%, 25% and 75%. A closed-reference OTU picking protocol was used to search the Greengenes database at 97% similarity. We tested two clustering algorithms, UCLUST and USEARCH. An Illumina HiSeq 2000 was used to generate 100 bp paired-end reads. Across the 4 samples for each region, the average number of paired reads was 4,192,154 (SD 362,263.1), 2,825,426 (SD 691,593.1) and 5,593,790 (SD 979,528.3) for the V3, V4 and V5 regions, respectively. For the V3 and V5 regions, 91.3% and 95.0% of paired-end reads could be joined by fastq-join, whereas only 0.13% of paired reads from the V4 region could be joined. Therefore, for subsequent analysis, we used joined reads from the V3 and V5 regions and forward reads from the V4 region. The same QIIME quality filter and chimera filter were applied to the reads as described in “Fecal DNA isolation and multiplex sequencing” in the Methods section.
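As a rough illustration of what the joining rates mean in read counts, the Python sketch below multiplies the reported per-sample averages by the reported joining percentages. It assumes the percentages apply to the pooled reads of each region, which the text does not state explicitly, and it ignores the later quality and chimera filters; the variable names are illustrative.

```python
# Average paired reads per sample and fraction joined, as reported in the text.
mean_pairs = {"V3": 4_192_154, "V4": 2_825_426, "V5": 5_593_790}
joined_fraction = {"V3": 0.913, "V4": 0.0013, "V5": 0.950}

# Approximate joined read pairs per sample (assumes the joining rate
# applies uniformly across the samples of a region).
joined = {r: mean_pairs[r] * joined_fraction[r] for r in mean_pairs}

# Reads carried into downstream analysis before filtering:
# joined reads for V3 and V5, forward reads only for V4.
usable = {
    "V3": joined["V3"],
    "V4": float(mean_pairs["V4"]),  # every pair contributes its forward read
    "V5": joined["V5"],
}

for region in ("V3", "V4", "V5"):
    print(f"{region}: ~{joined[region]:,.0f} joined pairs, "
          f"~{usable[region]:,.0f} reads used per sample")
```

Under these assumptions only a few thousand V4 pairs per sample would join, which is consistent with the decision to fall back on forward reads for that region.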

UCLUST Results

For the V3 region, we were able to identify S. typhi at the family level (Enterobacteriaceae) but were unable to label it with a taxon at the genus or species level. At the species level, it was assigned to an unspecified OTU with an estimated abundance of 0.23%, 5.56%, 21.55% and 66.54% at the true spike-in levels of 1%, 5%, 25% and 75%, respectively.

For the V4 region, UCLUST correctly identified S. typhi at the family level but assigned it to the wrong genus, Erwinia, with an unspecified species. The relative abundance was estimated to be 1.1%, 5.1%, 20.9% and 65.7% at the true spike-in levels of 1%, 5%, 25% and 75%, respectively.

For the V5 region, we correctly identified S. typhi at the family level (Enterobacteriaceae) but assigned it to the wrong genus (Erwinia instead of Salmonella). At the species level, it was assigned to an unspecified species taxon with an estimated abundance of 0.67%, 4.09%, 17.8% and 64.1% at the true spike-in levels of 1%, 5%, 25% and 75%, respectively.

USEARCH Results

For the V3 region, USEARCH identified S. typhi correctly at the family level. At the genus level, instead of assigning S. typhi to the correct genus, it split S. typhi between two genera, Erwinia and Enterobacter. For the 1%, 5%, 25% and 75% spike-in samples, the genus estimates were Enterobacter (1.0%) + Erwinia (0.2%), Enterobacter (4.6%) + Erwinia (0.8%), Enterobacter (17.8%) + Erwinia (3.2%) and Enterobacter (54.7%) + Erwinia (10.1%), respectively. At the species level, the estimated species corresponding to the spike-in were g_enterobacter;s_amnigenus (1.0%) + g_Erwinia;s_ (0.2%), g_enterobacter;s_amnigenus (4.6%) + g_Erwinia;s_ (0.8%), g_enterobacter;s_amnigenus (17.6%) + g_Erwinia;s_ (3.2%) and g_enterobacter;s_amnigenus (54.3%) + g_Erwinia;s_ (10.1%), respectively.

For the V4 region, S. typhi was correctly identified at the family level but assigned to the wrong genus, Klebsiella, with an unspecified species. The estimated abundance was 1.1%, 5.2%, 21.3% and 67.3% at the true spike-in levels of 1%, 5%, 25% and 75%, respectively.

For the V5 region, USEARCH assigned S. typhi to the correct genus (Salmonella) and species (Salmonella enterica). At the species level, the estimated abundance was 0.9%, 4.2%, 18.4% and 66.4% for the true spike-in levels of 1%, 5%, 25% and 75%, respectively.

Given the above results, we decided to use the V5 region. The V5 primers produced more sequence reads under the same experimental conditions. Most importantly, the combination of the V5 region and the USEARCH algorithm was the only approach that correctly identified S. typhi at both the genus and the species level. The estimated species-level abundance was close to the true spike-in level: 0.9%, 4.2%, 18.4% and 66.4% for the true spike-in levels of 1%, 5%, 25% and 75%, respectively.
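To compare the abundance estimates side by side, the Python sketch below collects the values reported in this appendix and computes, for each region and algorithm, the mean absolute deviation from the true spike-in level and the recovery ratio (estimate divided by true level). These two summary measures are our illustrative choice and were not part of the original analysis. For USEARCH on V3, the Enterobacter and Erwinia genus-level estimates are summed, which is also an assumption about how to treat the split assignment.

```python
true_levels = [1.0, 5.0, 25.0, 75.0]  # true spike-in levels (%)

# USEARCH V3 split the spike-in between two genera; sum them (assumption).
enterobacter = [1.0, 4.6, 17.8, 54.7]
erwinia = [0.2, 0.8, 3.2, 10.1]

# Estimated abundances (%) as reported in the text above.
estimates = {
    ("UCLUST", "V3"): [0.23, 5.56, 21.55, 66.54],
    ("UCLUST", "V4"): [1.1, 5.1, 20.9, 65.7],
    ("UCLUST", "V5"): [0.67, 4.09, 17.8, 64.1],
    ("USEARCH", "V3"): [a + b for a, b in zip(enterobacter, erwinia)],
    ("USEARCH", "V4"): [1.1, 5.2, 21.3, 67.3],
    ("USEARCH", "V5"): [0.9, 4.2, 18.4, 66.4],
}


def mean_abs_error(est):
    """Mean absolute deviation from the true level, in percentage points."""
    return sum(abs(e - t) for e, t in zip(est, true_levels)) / len(true_levels)


def recovery(est):
    """Estimated abundance divided by true spike-in level, per sample."""
    return [e / t for e, t in zip(est, true_levels)]


for (algorithm, region), est in estimates.items():
    ratios = ", ".join(f"{r:.2f}" for r in recovery(est))
    print(f"{algorithm:8s} {region}: MAE = {mean_abs_error(est):.2f} points; "
          f"recovery = {ratios}")
```

These numbers describe only how close the abundance estimates are; they do not capture whether the taxon was labeled correctly, which is the criterion on which the V5 and USEARCH combination stands apart.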