Dataset (Supplemental)

The supplemental dataset lists, for each sequence, its source, the major lineage and subgroup designation based on our phylogenetic analysis, the G-protein binding partner, cell type, tissue expression, and counter-ion location, together with the references supporting each piece of information. All sequences acquired from genome data are manually curated and can be found on the UCSC genome browser (http://genomewiki.ucsc.edu/index.php/Opsin_evolution:_update_blog) or in the supplemental alignment. For records without an associated GenBank accession number (123 sequences), the source is listed as ‘Genome’ or ‘EST library’. Because genome assemblies are unstable and change as improvements are made, the accessions used at the time the gene models were made are transient. These labels therefore indicate that the gene model was obtained by a BLAST of close homologs against GenBank genomic sequences or ESTs, rather than by a BLAST against trace reads or transcripts or by extraction from published literature. To find the source of one of these gene models, BLAST the sequence supplied in our supplemental alignment against the raw genomic and EST data at GenBank for the species in question to find the current assembly, scaffold, contig, trace or transcript number. To validate the model for consistency, compare it visually to known orthologs in our compilation of manually curated sequences (http://genomewiki.ucsc.edu/index.php/Opsin_evolution:_update_blog) for matching exon phase and length, perform a simple multiple alignment against orthologs using Multalin (http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_multalin.html), and examine the gene model for conservation at known conserved sequence motifs. For more details on the methods used, see below.

METHODS (Supplemental)

Sequence Acquisition

Opsin sequences were acquired from GenBank using one of two methods. To include sequences for which expression data are available, opsin transcript sequences were mined from GenBank for major taxonomic and known opsin groups using queries of either the nucleotide or the ‘transcriptome shotgun assembly’ databases. To generate the dataset of genomically derived opsins, we used conventional transcript-derived entries from GenBank as tBlastn queries against the wgs division of GenBank (where all genome assemblies are stored). Because opsins have a minimum of ~20% protein identity to bovine rhodopsin, as determined by the degree of conservation across all GPCRs, and because other opsins in the collection will outperform bovine rhodopsin as a query in specific cases, this method has more than enough sensitivity to detect even short homologous exons in the bovine rhodopsin K296 region. Thus the complete opsin repertoire, including new homology classes, can be recovered from each species provided it is present in the assembly. Sequences recovered using this methodology often need curation, both in regions of weak alignment against the original data for a particular species and in cases of misassembly stutter. Missing exons were often locatable at the NCBI trace archive in the form of isolated reads that were omitted from the assembly, taking into account the risk of gathering inappropriate exons from unsuspected gene duplications. Recovered opsin sequences were quality checked for the presence of a lysine in the 7th transmembrane helix (bovine rhodopsin site 296), conservation of invariant opsin residues and motifs, and a better back-BLAST to opsins than to any annotated non-opsin GPCR at the GenBank nucleotide division. With the exception of species too closely related to one already represented in the dataset, all metazoan genomes at GenBank as of 1 Dec 2010 are represented in the dataset used in this study.
No species diverging earlier than the ctenophore (notably neither the sponge Amphimedon queenslandica nor the choanoflagellate Monosiga brevicollis) contains a GPCR with a lysine in a position alignable with the K296 motif of bovine rhodopsin. A curated set of recovered opsins from genome trace files, updated monthly, is available at the UCSC genome browser (http://genomewiki.ucsc.edu/index.php/Opsin_evolution:_update_blog).
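
The lysine quality check described above can be illustrated with a minimal Python sketch. It assumes a candidate sequence has already been aligned to a reference (bovine rhodopsin in the real case), with gaps written as '-'. The function names and the toy sequences are hypothetical and are not part of the original pipeline; with full-length bovine rhodopsin as the reference row, the site argument would be 296.

```python
def column_of_reference_site(aligned_ref: str, site: int, gap: str = "-") -> int:
    """Return the 0-based alignment column that holds residue number
    `site` (1-based, gaps not counted) of the aligned reference row."""
    seen = 0
    for col, ch in enumerate(aligned_ref):
        if ch != gap:
            seen += 1
            if seen == site:
                return col
    raise ValueError(f"reference has fewer than {site} residues")


def has_lysine_at_site(aligned_ref: str, aligned_query: str, site: int,
                       gap: str = "-") -> bool:
    """True if the query carries a lysine (K) in the column that aligns
    with residue `site` of the reference."""
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("rows must come from the same alignment")
    col = column_of_reference_site(aligned_ref, site, gap)
    return aligned_query[col].upper() == "K"


# Toy rows (not real opsin sequences): residue 5 of the reference is a K.
toy_ref = "MK-ACKLT"
print(has_lysine_at_site(toy_ref, "MRGACKLT", 5))  # candidate keeps the lysine
print(has_lysine_at_site(toy_ref, "MRGACRLT", 5))  # candidate has R instead
```

Counting only non-gap characters in the reference row keeps the residue numbering tied to the reference protein rather than to alignment columns, which shift whenever gaps are introduced.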

For each opsin extracted from a genome project, the location and phase of each intron within the coding region were determined by alignment to genomic contigs. All opsins to date utilize standard GT-AG splice donors and acceptors. We parsimoniously resolved each significant sequence change (i.e. intron changes and indels) as a gain or loss by determining its ancestral status using multiple outgroups. In bilaterians, notably the tunicate Ciona and insects including Drosophila, very rapid turnover of introns (both gain and loss) occurs; however, this does not interfere with opsin classification because enough species with conservatively evolving introns are available to reconstruct the evolutionary history.
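
As an illustration of the intron bookkeeping described above, the following Python sketch derives intron phases from the coding length of each exon and tests an intron for the canonical GT-AG boundaries. It uses the usual convention that the phase is the number of coding bases preceding the intron, modulo 3. The function names and the exon lengths are hypothetical examples, not values from the dataset.

```python
from typing import List


def intron_phases(exon_cds_lengths: List[int]) -> List[int]:
    """Phase of each intron, given the coding length of every exon in
    transcript order. Phase 0: intron lies between two codons; phase 1:
    after the first base of a codon; phase 2: after the second base."""
    phases = []
    coding_bases = 0
    # Every exon except the last is followed by an intron.
    for length in exon_cds_lengths[:-1]:
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases


def is_gt_ag(intron_seq: str) -> bool:
    """True if the intron starts with GT and ends with AG."""
    s = intron_seq.upper()
    return len(s) >= 4 and s.startswith("GT") and s.endswith("AG")


# Hypothetical gene model with four coding exons (three introns).
print(intron_phases([120, 100, 77, 60]))
print(is_gt_ag("gtaagtttcccttttcag"))
```

Comparing the phase lists of two gene models is one simple way to check whether introns at matching positions are plausibly orthologous.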

Phylogenetic Analyses

For phylogenetic analyses, sequences that spanned less than half of the transmembrane regions of the protein were discarded, resulting in 889 transcript plus genome trace opsin sequences. To root our opsin phylogenetic analyses, 22 non-opsin GPCRs from the human genome were used as outgroups: somatostatin receptor, opioid receptor mu 3, galanin receptor, chemokine (C-C motif) receptor, bradykinin receptor 1, uracil/cys-leukotriene dual receptor, cys-leukotriene receptor 1, purinergic receptor, orexin receptor, tachykinin receptor, neuromedin U receptor, pyroglutamylated RFamide peptide receptor, human orphan receptor 19, pancreatic polypeptide receptor, neuropeptide Y receptor, prolactin releasing hormone, human orphan receptor 161, alpha-1D-adrenergic receptor, thyrotropin-releasing hormone receptor, thyrotropin receptor, adenosine A3 receptor, and opiate receptor-like 1. This set of sequences was selected as outgroups based on previous phylogenetic studies of opsin and GPCR evolution (Davies et al. 2010; Fredriksson et al. 2003; Plachetzki et al. 2010; Porter et al. 2007; Suga et al. 2008) and on a rigorous procedure of BLASTing human opsins against all other human GPCRs (for a detailed description of the BLAST procedure see: http://genomewiki.ucsc.edu/index.php/Opsin_evolution#GPCR_outgroup_sequences). Furthermore, rerunning the phylogeny reconstruction without outgroup sequences does not significantly change the sequences within each of the major clades or the relationships among them, with the exception of three sequences (Platynereis dumerilii TMT1 and TMT2; Strongylocentrotus purpuratus encephalopsin) that are placed at the base of the ‘C-type’ clade when rooted but at the base of the ‘Cnidops’ clade when unrooted (data not shown).
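
The length filter mentioned at the start of this section could be expressed along the following lines. This Python sketch rests on an assumption: "spanning half of the transmembrane regions" is read here as having residues in at least half of the alignment columns that belong to transmembrane helices. The exact criterion used is not specified above, and the column ranges, function names, and sequences are hypothetical.

```python
from typing import List, Tuple


def tm_coverage(aligned_seq: str, tm_columns: List[Tuple[int, int]],
                gap: str = "-") -> float:
    """Fraction of transmembrane alignment columns in which the sequence
    has a residue. `tm_columns` holds 0-based, half-open column ranges."""
    total = 0
    covered = 0
    for start, end in tm_columns:
        segment = aligned_seq[start:end]
        total += len(segment)
        covered += sum(1 for ch in segment if ch != gap)
    return covered / total if total else 0.0


def keep_sequence(aligned_seq: str, tm_columns: List[Tuple[int, int]],
                  min_fraction: float = 0.5) -> bool:
    """Discard sequences covering less than `min_fraction` of TM columns."""
    return tm_coverage(aligned_seq, tm_columns, gap="-") >= min_fraction


# Two hypothetical helices spanning columns 0-3 and 6-9 of a toy alignment.
helices = [(0, 4), (6, 10)]
print(keep_sequence("ACDE--FGHI", helices))  # full coverage, kept
print(keep_sequence("AC--------", helices))  # one quarter covered, discarded
```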

Amino acid sequences of the 889 opsins mined from GenBank and the 22 human outgroup GPCR sequences were aligned using the online MAFFT v6.0 server (http://mafft.cbrc.jp/alignment/server/) (Katoh et al. 2005a; Katoh et al. 2005b; Katoh et al. 2002). The aligned dataset was then trimmed to remove the N- and C-terminal sequences, leaving only the transmembrane and loop regions of the protein for further phylogenetic analyses. The resulting alignment has been provided as a supplemental FASTA data file.
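
The trimming step can be sketched in Python as a column slice applied to every row of a FASTA alignment. This is an illustration only: the trim boundaries used for the real dataset are not given here, so the column indices, the function names, and the toy alignment below are hypothetical.

```python
from typing import Dict


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {name: sequence}; names end at the first space."""
    parts: Dict[str, list] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def trim_alignment(alignment: Dict[str, str], first_col: int,
                   last_col: int) -> Dict[str, str]:
    """Keep alignment columns first_col..last_col-1 (0-based, half-open)
    in every row, dropping the N- and C-terminal columns outside them."""
    if len({len(seq) for seq in alignment.values()}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    return {name: seq[first_col:last_col] for name, seq in alignment.items()}


# Toy alignment; columns 4-9 stand in for the transmembrane and loop core.
toy = ">seqA\nMM--ACDEFG--KK\n>seqB\n--MTACDE-GHHK-\n"
for name, seq in trim_alignment(read_fasta(toy), 4, 10).items():
    print(name, seq)
```

Trimming by alignment column rather than by residue number keeps every row the same length, which downstream phylogenetic programs require.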

The aligned and trimmed dataset was used to reconstruct a maximum likelihood phylogeny using Randomized Axelerated Maximum Likelihood (RAxML) v.7.2.7 with rapid bootstrapping as implemented on the Cyberinfrastructure for Phylogenetic Research (CIPRES) Portal v.3.1 (Miller et al. 2010; Stamatakis 2006; Stamatakis et al. 2008; Stamatakis et al. 2005). Using the resulting phylogeny, character mapping of amino acid residues at particular counterion sites was accomplished in Mesquite v2.72 (Maddison & Maddison 2010) using unordered parsimony reconstruction.
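
For readers unfamiliar with unordered parsimony, the following Python sketch shows the down-pass of Fitch's algorithm on a rooted binary tree, in which any state change costs one step. It is a simplified illustration rather than a description of Mesquite's implementation, which was the software actually used. The tree, tip names, and residue states are hypothetical.

```python
from typing import Dict, Set, Tuple, Union

# A tree is either a tip name or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch(tree: Tree, tip_states: Dict[str, str]) -> Tuple[Set[str], int]:
    """Fitch down-pass: return the set of most-parsimonious states at the
    root of `tree` and the minimum number of state changes below it."""
    if isinstance(tree, str):
        return {tip_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, tip_states)
    right_set, right_cost = fitch(right, tip_states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # Children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical residues (E or Y) at one candidate counterion site.
toy_tree = ((("tipA", "tipB"), "tipC"), ("tipD", "tipE"))
toy_states = {"tipA": "E", "tipB": "E", "tipC": "Y", "tipD": "Y", "tipE": "Y"}
root_states, changes = fitch(toy_tree, toy_states)
print(sorted(root_states), changes)
```

In this toy case a single change, on the branch leading to tipA and tipB, explains the observed distribution of residues. A full reconstruction would add an up-pass to assign states at every internal node.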

References

Davies, W. L., Hankins, M. W. & Foster, R. G. 2010. Vertebrate ancient opsin and melanopsin: divergent irradiance detectors. Photochemical & Photobiological Sciences 9, 1444-1457.

Fredriksson, R., Lagerstrom, M. C., Lundin, L. G. & Schioth, H. B. 2003. The G-protein-coupled receptors in the human genome form five main families. Phylogenetic analysis, paralogon groups, and fingerprints. Molecular Pharmacology 63, 1256-1272.

Katoh, K., Kuma, K., Miyata, T. & Toh, H. 2005a. Improvement in the accuracy of multiple sequence alignment program MAFFT. Genome Inform 16, 22-33.

Katoh, K., Kuma, K., Toh, H. & Miyata, T. 2005b. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 33, 511-8.

Katoh, K., Misawa, K., Kuma, K. & Miyata, T. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30, 3059-66.

Maddison, W. P. & Maddison, D. R. 2010. Mesquite: a modular system for evolutionary analysis. Version 2.72.

Miller, M. A., Holder, M. T., Vos, R., Midford, P. E., Liebowitz, T., Chan, L., Hoover, P. & Warnow, T. 2010. The CIPRES Portals. CIPRES.

Plachetzki, D. C., Fong, C. R. & Oakley, T. H. 2010. The evolution of phototransduction from an ancestral cyclic nucleotide gated pathway. Proceedings of the Royal Society B 277, 1963-1969.

Porter, M. L., Cronin, T. W., McClellan, D. A. & Crandall, K. A. 2007. Molecular characterization of crustacean visual pigments and the evolution of pancrustacean opsins. Molecular Biology and Evolution 24, 253-268.

Stamatakis, A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688-90.

Stamatakis, A., Hoover, P. & Rougemont, J. 2008. A rapid bootstrap algorithm for the RAxML Web servers. Syst Biol 57, 758-71.

Stamatakis, A., Ludwig, T. & Meier, H. 2005. RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 21, 456-63.

Suga, H., Schmid, V. & Gehring, W. J. 2008. Evolution and functional diversity of jellyfish opsins. Current Biology 18, 51-55.