Supplementary Methods and Materials

SUPPLEMENTARY METHODS AND MATERIALS

Comparative genomics reveals adaptations of a halotolerant thaumarchaeon in the interfaces of brine pools in the Red Sea

David Kamanda Ngugi1†, Jochen Blom2, Intikhab Alam3, Mamoon Rashid1, Wail Ba-Alawi3, Guishan Zhang1, Tyas Hikmawan1, Yue Guan1, Andre Antunes1*, Rania Siam4, Hamza El Dorry4, Vladimir Bajic3, and Ulrich Stingl1†

1Red Sea Research Centre, King Abdullah University of Science and Technology (KAUST), 4700 KAUST, Thuwal 23955-6900, Saudi Arabia; 2Bioinformatics and Systems biology, Justus Liebig University Giessen, Germany; 3Computational Bioscience Research Centre, KAUST; 4Department of Biology, American University in Cairo, Egypt. *Current address: Institute for Biotechnology and Bioengineering, Centre of Biological Engineering, University of Minho, Portugal.

Sample collection, DNA extraction, and chemical analyses

Brine-seawater interface (BSI) samples from five brine pools of the Red Sea (Figure S1) were collected during the 3rd WHOI-KAUST Red Sea sampling expedition on board the R/V Aegaeo in Oct./Nov. 2010. A rosette water sampler mounted with ten-litre Niskin bottles for sample collection and a SIMRAD EK50 echo sounder system for depth monitoring was deployed at each sampling location. Simultaneously, a conductivity-temperature-depth (CTD) unit (Sea-Bird Electronic, Bellevue, Washington) was also deployed to measure temperature, salinity, oxygen, and pressure; salinity was also evaluated using a hand-held refractometer (Atago, Tokyo, Japan). At each site, 180 litres of the BSI or brine sample were first pre-filtered through a 5-mm SMWP membrane (Ø 290 mm, Millipore, Ireland) to remove suspended particles, and then filtered and concentrated using Tangential Flow Filtration (TFF) system with a 0.1-mm Durapore PVDF filter (Pellican 2 Cassette Filter, size 0.5 m2, Millipore, MA, USA). The filters were separately placed inside sterile polythene bags, flash frozen in liquid nitrogen, and then stored on dry ice on board the ship. Subsequently, DNA was extracted from the filters using the protocol described by (Ngugi et al., 2012). In order to place the archaeal community of the BSI into perspective, we additionally analysed the thaumarchaeal community of samples in the overlying deep-seawater (200–1,500 m) of Atlantis II Deep (21°20.76’N, 38°04.68’ E; Oct 2008) based on clone libraries using our previously extracted DNA (Ngugi & Stingl 2012). Nutrient concentrations and elemental composition of samples from the BSI and the brine were determined using the commercial service provided by GEOMAR Helmholtz Centre for Ocean Research (Kiel, Germany; Table 1).

Pyrosequencing of 16S rRNA gene amplicons

The V6–V9 region of the archaeal 16S rRNA encoding genes were PCR amplified using the Archaea-specific forward primer, A934F (Teske & Sorensen 2008), and the universal reverse primer, U1492R (Lane 1991). For bacterial 16S amplicons (V3–V6 region), the forward (B343F) and reverse (B1099rc) primers were used (Liu et al., 2007). In both cases, the primers consisted of the FLX Titanium adaptors (Lib-L), and a domain-specific primer fused with a two-nucleotide linker (to reduce unspecific binding) and a sample-specific 8-bp long barcode for multiplexing. HPLC-purified primers were obtained from Eurofins MWG Operon (Germany).

PCR was performed in 50-ml reactions that contained (final concentrations): 25 ml PfuUltra II Hotstart PCR master mix (Strategene, La Jolla, CA, USA), 0.2 mM of each primer (Eurofins MWG, Ebersberg, Germany), and 10 ng of template DNA. The PCRs (in triplicate) were run with the following conditions: initial denaturation at 94°C for 2 min; followed by 30 cycles at 94°C for 20 s, 54°C (Archaea) or 48°C (Bacteria) for 20 s, 72°C for 2 min; and a final extension at 72°C for 10 min. The barcoded 16S rRNA gene amplicons (~600 and 850 bp for Archaea and Bacteria, respectively) from the different samples were pooled together in equimolar amounts after purification with MiniElute Spin Columns (Qiagen, Hilden, Germany) and quantification with Quanti-iT PicoGreen (Molecular Probes Invitrogen). The pooled amplicons library was then sequenced on a 454 GS FLX+ Sequencer (454 Life Sciences, Branford, CT, USA) with one-fourth of a pico-titer plate each at our Genomics Core Lab facility in KAUST (Thuwal, Saudi Arabia) as per the manufacture’s instructions. Pre-processing, OTUs clustering (3% distance threshold), and taxonomic assignment of sequence data was done using the MOTHUR software package (v1.29; Schloss et al., 2009) as previously described (Ngugi et al., 2012). The total number of post-processed reads were for the Bacteria (405 ± 105 bp): 31,809 (Atlantis II Deep), 35, 825 (Discovery Deep), 3,592 (Erba Deep), 101,477 (Kebrit Deep), and 19,311 (Nereus Deep); and for Archaea (470 ± 58 bp): 38,573 (Atlantis II Deep), 18,466 (Discovery Deep), 27,050 (Erba Deep), 30,941 (Kebrit Deep), and 20,115 (Nereus Deep).

Clone libraries and phylogenetic analysis

Nearly full-length 16S rRNA genes were PCR amplified using an Archaea-specific forward primer, Arch21F, (Delong 1992), and the universal reverse primer, U1492R (Lane 1991). Each PCR reaction mixture contained: (final concentrations) 1´ PCR buffer (New England Biolabs), dNTPs (100 mM), primers (1 mM), Taq DNA polymerase (2.5 U; Invitrogen), and 10 ng of template DNA. The PCR conditions were same as for pyrotag sequencing except that the annealing temperature was set at 52°C. The archaeal ammonia mono-oxygenase (amoA) gene and key CO2-fixation gene (4-hbd) encoding for the 4-hydroxybutyryl-CoA dehydratase were amplified respectively using primers AmoA-ModF/R and 4HBD 312F/4HBD 1360R as described previously (Yakimov et al., 2011). PCR amplicons were purified with the MiniElute Spin Columns (Qiagen, Hilden, Germany), and cloned into the pCR®2.1 vector (Invitrogen) as per the manufacturers protocol. Colonies with the correct insert size were then bi-directionally sequenced with M13 primers on an ABI 3730´l DNA Analyzer in our Genomics Core Lab Facility at KAUST, and assembled using Geneious Pro software (v5.6.6; http://www.geneious.com/).

Prior to phylogenetic analysis, all putatively chimeric sequences were checked and removed using Bellerophon in Greengenes (http://greengenes.lbl.gov/), ChimeraSlayer (http://microbiomeutil.sourceforge.net/), and Blast-based searches. The total number of post-processed 16S rRNA gene sequences (per sample) were: 137 (200 m), 38 (700 m), 68 (1,500 m), 143 (Atlantis II Deep), 134 (Erba Deep), and 124 (Kebrit Deep). Representative sequence clusters at 97% identity level were then selected using MOTHUR prior to phylogenetic inference. These selected representative sequences were then aligned using the SINA web alignment tool (http://www.arb-silva.de/) based on a non-redundant 16S reference database (SILVA ver.111), and imported into the ARB software package (Ludwig et al., 2004). After manual curation of the alignments, a distance-based phylogenetic tree (Figure 1B) was constructed using the neighbor-joining algorithm and the Jukes Cantor distance correction model (1000 bootstrap replicates) in PAUP (v4.0b1; Swofford 2003). Maximum parsimony was also used to test the stability of tree topology.

For the functional genes (amoA and 4-hbd; Figure S3), nucleotide sequences were first translated with the Transec tool (http://www.ebi.ac.uk/), and aligned with their closest blast hits using ClustalW, as implemented in Geneious Pro software (v5.6.6; http://www.geneious.com/). The aligned amino acid sequences were then used for phylogenetic inference in PAUP as described above.

Single-cell sorting, amplification, genome sequencing, and assembly

BSI samples that had been collected from Atlantis II Deep in the Red Sea (Figure S1), were sent for single-cell sorting and whole-genome amplification (Swan et al. 2011; Stepanauskas 2012) at the Single Cell Genomics Center, Bigelow Laboratory for Ocean Sciences (Maine, USA). Cell sorting, whole genome amplification (WGA), PCR-based screening of 16S rRNA positive cells and QC were performed using protocols described on their website (www.bigelow.org/scgc; Rinke et a., 2014). The genomic DNA following the multiple displacement amplification (MDA)-step was then diluted 50-fold in TE buffer for PCR screening of candidate amoA-carrying cells from a total of 80 cells. Seven amoA-positive MDA libraries were then sequenced using the Illumina GAIIx platform (San Diego, CA, USA) at KAUST.

Assembly of the single-cell amplified genomes (SAGs) was generated using a pipeline that employs a choice of assemblers designed for single-cell sequencing data, including VelvetSC (Chitsaz et al., 2011), Spades (Bankevich et al., 2012), and IDBA-UD v.1.1.1 (Peng et al., 2012), along with several pre and post assembly data quality checks using Trimmomatic (Lohse et al., 2012). Here, IDBA-UD was benchmarked as the overall best assembler for our SAGs as we could reconstruct longer contigs with higher accuracy to the reference genome, Nitrosopumilus maritimus SCM1 (Walker et al., 2010). We discarded assembled contigs that were less than 1 kb in length, and retained only those that gave consistent GC (± 5% of target genome) and phylogenetic profile based on Tetranucleotide frequency binning using Metawatt Binner (v1.7; Strous et al., 2012), and for which more than half of their protein-coding genes were best blast hits to published thaumarchaeal genomes. The selected contigs were then ordered using the ProgressiveMAUVE software (Darling et al. 2010) based on the genome of N. maritimus.

Gene prediction and annotation

Functional annotation of coding regions from the ordered contigs proceeded by using an in-house annotation pipeline (www.cbrc.kaust.edu.sa/indigo; Alam et al., 2013), which carries out a series of steps including: (1) data quality control with fastaclean (Exonerate package, (Slater & Birney 2005); (2) RNA prediction using RNAmmer (Lagesen et al., 2007) and tRNAscanSE (Schattner et al. 2005); (3) ORF prediction using GeneMark (Besemer et al. 2001) and Prodigal (Hyatt et al. 2010); (4) Blastp searches against the NCBI non-redundant (nr) database (Altschul et al., 1997), and Uniprot and Reverse Position Specific hmmpfam-based searches (Marchler-Bauer et al., 2011) against Conserved Domain Databases (Pfam, SMART, COG, PRK, TIGRFam). Interproscan analysis was also carried out for GO terms and signature domains and combined with KEGG pathway mappings. Putative transporter proteins were deduced using the Transport Classification Database (TCDB; Saier et al., 2009) and also via TransAAP, a web-based transporter annotation tool (http://www.membranetransport.org/).

Genome-level identity and synteny estimations

The relatedness between our SAGs and published thaumarchaeal genomes was evaluated based on the average nucleotide identity of the assembled nucleotides (Teeling et al., 2004) using the program JSpecies (Richter & Rosselló-Móra 2009). The expected genome size was expolated from the frequency counts of expected highly conserved single-copy genes found in all eight genomes of Marine Group I.1a thaumarchaea (801 in total; Table S3). To complement the average nucleotide identity analysis results and to aid in selecting SAGs genomes for a combined assembly, we also evaluated the overall degree of gene order conservation (synteny) among pairs of SAGs based on homologous protein-coding genes (Table S4), as described by Yelton et al. (2011).

Comparative genome analyses

To assess the degree of homology in the proteomes of all Nitrosopumilus genomes, we used the phylogenomic platform EDGAR (Blom et al., 2009). EDGAR defines orthologous proteins using bidirectional best blast hits based on Blast score ratio values (in our case >37) to identify conserved and unique gene sets among pairs of genomes. For this analysis the complete genome of N. maritimus SCM1 (Walker et al., 2010) was used as the reference. We refer to sets of genes of the reference genome for which an orthologous gene could be identified in every other genome as “the core genome”. In contrast, genes of the reference strain with no ortholog to any of the other genome are referred to here as “strain-specific” genes, while genes present in some genomes but not all as the “flexible genome”. Moreover, the pan-genome of a set of genomes consists here of both the core and the variable genomes (Tettelin et al., 2005). Genes in the core or variable genomes were further characterized by functional assignment to categories in the Cluster of Orthologous Genes (COGs; Tatusov et al., 1997) database using blast (e-value >10e–5). Unless otherwise mentioned, the gene locus IDs of the reference genome (N. maritimus) will throughout be used to provide gene context.

The evolutionary relationship of our SAGs or the co-assembled RSA3 genome (see main text for details) with other thaumarchaea in the genera Nitrosopumilus (Walker et al., 2010; Park et al., 2012a; 2012b), “Nitrosoarchaeum” (Kim et al., 2011; Mosier et al., 2012), “Cenarchaeum” (Hallam et al. 2006a; 2006b), and “Nitrososphaera” (Spang et al., 2012) was further tested using single-copy conserved gene set of 131 (for analysis of all SMGI clade SAGs; Figure S4) or 301 (when only the co-assembled genome was considered; Figure 3A) respectively, that were orthologous in their genomes. Orthologous genes were generated using an automated pipeline, Hal with the inflation parameter set at 1.5 for the clustering algorithm (Robbertse et al., 2011). The resultant core orthologous genes from individual genomes were then concatenated, aligned using MUSCLE (Edgar 2004), followed by phylogenetic inference with RAxML (v7.04; Stamatakis et al., 2008), based on the amino acid substitution model PROTGAMMAI + WAG with 1000 runs for bootstrap support.

Multiple protein sequence alignment of proline dehydrogenases (PutA and PRODH) from several characterized organisms and sequenced genomes were obtained from NCBI, aligned with CLUSTW (v2.0.12; Thompson et al., 1994), and used for phylogenetic inference with RaxML as described above (Figure S6). The stability of the tree topology was also tested using neighbor-joining (Saitou & Nei 1987) and maximum parsimony (PAUP v4.0b1; Swofford 2003) approaches.

Determination of isoelectric point of the predicted proteomes

Extremely halophilic prokaryotes are characterized by an acidic proteome since their intracellular proteins are adapted to function at high salinity, which is a consequence of accumulating molar concentrations of KCl to balance the externally high salt concentration (Oren 2011; and references therein). In contrast, halotolerant microorganisms keep their intracellular ionic concentrations at low levels, and instead use organic solutes to produce the necessary osmotic balance (Oren 2011). An acidic shift of the proteome can therefore be used a proxy for evaluating the osmotic strategy employed by an organism. In the absence of proteomic data, the isoelectric point (pI) of protein-coding genes provides a good indicator of the acidic nature of the predicted proteome. Therefore, we compared the pI of predicted proteome of the composite Nitrosopumilus-like RSA3 genome (Figure 4) with those from extreme halophiles (e.g., Halobacteria salinarum and Salinibacter ruber), moderate halophiles (e.g., Chromohalobacter salexigens), and typical marine bacteria (e.g., SAR11). To make these analyses robust we calculated the pI of the bulk predicted proteome and also the fractionated proteome that include (1) all protein-coding genes with no trans-membrane helices (TMH), (2) protein-coding genes with only one TMH, and (3) those with 2 or more TMHs. The logic was that proteins with TMH interact with the extracellular milieu and are intimately responsible for regulating cell communication and adhesion, and allow controlled exchange of chemicals across the cell membrane. A special category (those with one TMH) includes mostly proteins with signal peptides, many of which have a single N-terminal TM segment and are usually secreted proteins.