2010 PCB 5530 Class Projects

● Background

These projects take cutting-edge genome research into the classroom and so ‘bring to life’ what you are learning. They will train you to integrate different kinds of information in a team setting.

The class is split into three groups, each coordinated by a postdoctoral instructor from my group. Groups are:

Group 1 - Niacin / Group 2 - Thiamine / Group 3 - Met salvage
Jeffrey Waller / Linda Jeanguenin / Océane Frelin

Kevin Cooper / Yezhang Ding / Joseph Collins
Amrit Ghimire / Elton Goncalves / Cintia Leite Ribeiro
Yuan Yu / Amanda Pendleton / Yih-Feng Hsieh
Yang Hu

Each group will carry out the following tasks:

- Annotate a particular pathway or set of pathways of vitamin- or cofactor-related metabolism in the maize genome using a standard format.

- Identify genes that are missing from the pathways (‘functions without a gene’) and genes of unknown function that are in some way associated with the pathways (‘genes without a function’).

- Predict candidates for missing genes and predict functions for genes of unknown function using the tools of plant-prokaryote comparative genomics.

Each group will meet several times, under the supervision of the postdoctoral instructor, to divide up the work to be done, to discuss progress, and to integrate and write up a report in the format below.

Project reports. Reports should be submitted to Dr. Andrew Hanson as an electronic file and a good quality hard copy, by 5 pm, Friday, November 19, 2010.

Grading. For each student, 50% of the grade will be based on the performance of their group as a whole as judged from the project report. The other 50% will be based on the postdoctoral instructor’s assessment of that student’s contribution to the group effort (independent of the group size).

Outcomes. It is anticipated that, in the best cases, the groups’ predictions will form part of a publication in a peer-reviewed journal, in which case the group members will be included as authors.

● Report format – summary

Reports should be arranged in four sections:

1. A diagram summarizing the metabolites and enzymes of the pathways, pathway variants in different organisms, subcellular compartmentation in Arabidopsis, and maize genes that are missing or have mysterious paralogs. Use the format on p. 2 (a PowerPoint file is available as a template).

2. A table listing:

- All pathway enzymes (give EC nos.) and transporters (whether or not they have plant homologs)

- Gramene identifiers (e.g. GRMZM2G107665) of the corresponding maize genes

- AGI numbers (e.g. At3g12930) of the corresponding Arabidopsis genes

- Predicted and experimentally determined subcellular location of the Arabidopsis proteins

- Arabidopsis mutant phenotypes (if available), e.g. lethality, growth defect, metabolome change

- When there is >1 maize or Arabidopsis gene, show a phylogenetic tree relating them

3. A figure showing expression in various organs (the ‘development’ display) from the Golm Arabidopsis Expression dbase Multiple Expression Query for each Arabidopsis gene in the pathways.

4. Summaries and supporting evidence for two predictions of candidates for maize genes ‘missing’ from the pathways, or for roles of ‘unknown function’ genes associated with the pathways (no more than 1 page total per gene).

● Report format – example

1. Pathway diagram. This is a simplified version showing only reactions in the cytosol. A full diagram (required for your report) would have three parts, each similar to the one shown, representing the reactions found in cytosol, mitochondria, and plastids. See also Fig. 2, PMID: 10785666.

2. Summary table & phylogenetic trees

Gene name / Enzyme (EC no.) / Maize genes / Arabidopsis genes / Arabidopsis Location 1 / Arabidopsis Phenotype
folD / 5,10-Methylene-THF dehydrogenase (EC 1.5.1.5); 5,10-methenyl-THF cyclohydrolase (EC 3.5.4.9) / GRMZM2G130790; GRMZM2G150485; GRMZM2G143230 / At2g38660; At3g12290; At4g00600; At4g00620 / Mito P; Cytosol P E, Plasma memb E; Chloro P; Chloro P E / Abnormal shape seedling
metF / Methylene-THF reductase (EC 1.5.1.20) / GRMZM2G082463 / At2g44160; At3g59970 / Cytosol P; Cytosol P /
ygfZ / n/a / GRMZM2G107665 a / At4g12130; At1g60990 / Mito P; Chloro P E / Morphology & metabolites normal
etc / etc / etc / etc / etc / etc

1 P = prediction; E = experimental evidence

a Identical to GRMZM2G147498

MEGA phylogenetic tree for FolD proteins

MEGA phylogenetic tree for MetF proteins

MEGA phylogenetic tree for YgfZ proteins

3. Expression in various organs

folD: At2g38660, At3g12290, At4g00620/At4g00600

metF: At2g44160 (At3g59970 not on chip)

ygfZ: At4g12130, At1g60990

4. Predictions

● YgfZ (GRMZM2G107665). YgfZ occurs in all plants, all other eukaryotes, most bacteria, and some archaea. YgfZ is a paralog of the folate-dependent GcvT protein of the glycine cleavage complex, and so is likely to use a folate. Bacterial genes encoding YgfZ often cluster with diverse iron/sulfur (Fe/S) proteins (Fig. 1A), and transcriptomic (ref. 1) and proteomic (refs. 2, 3) data show induction by oxidative stress and confirm an Fe/S association. YgfZ is required for full activity of certain Fe/S enzymes in E. coli (ref. 4) and yeast (ref. 5). Arabidopsis ygfZ At4g12130 is co-expressed with the iron storage protein ferritin 2 (Fig. 1B). We therefore predict that YgfZ is a novel folate-dependent protein involved in assembly or repair of Fe/S proteins, particularly under oxidative stress.

1. Zheng M, Wang X, Templeton LJ, Smulski DR, LaRossa RA, Storz G (2001) DNA microarray-mediated transcriptional profiling of the Escherichia coli response to hydrogen peroxide. J Bacteriol 183: 4562-4570

2. Hu P, Janga SC, Babu M, Díaz-Mejía JJ, Butland G et al (2009) Global functional atlas of Escherichia coli encompassing previously uncharacterized proteins. PLoS Biol 7: e96

3. Chen JW, Sun CM, Sheng WL, Wang YC, Syu WJ (2006) Expression analysis of up-regulated genes responding to plumbagin in Escherichia coli. J Bacteriol 188: 456-463

4. Ote T, Hashimoto M, Ikeuchi Y, Su'etsugu M, Suzuki T, Katayama T, Kato J (2006) Involvement of the Escherichia coli folate-binding protein YgfZ in RNA modification and regulation of chromosomal replication initiation. Mol Microbiol 59: 265-275

5. Gelling C, Dawes IW, Richhardt N, Lill R, Mühlenhoff U (2008) Mitochondrial Iba57p is required for Fe/S cluster formation on aconitase and activation of radical SAM enzymes. Mol Cell Biol 28: 1851-1861

● COG0212 (GRMZM2G038128). The COG0212 protein is a paralog of YgfA (5-formyl-THF cycloligase). COG0212 occurs in plants, animals, archaea, and some bacteria. Comparative genomics analysis shows that COG0212 occurs in many archaea that lack folates (Fig. 2A), and that in most other organisms it co-occurs with YgfA; these data suggest that COG0212 differs from YgfA in function and has nothing to do with folates. Comparative genomics analysis also reveals clustering of archaeal and bacterial COG0212 genes with various genes of thiamine metabolism and transport, and with genes encoding the pyruvate dehydrogenase complex, which requires thiamine (Fig. 2B-D). Also, Arabidopsis COG0212 is co-expressed with pyruvate dehydrogenase kinase, which regulates the pyruvate dehydrogenase complex (Fig. 2E). We therefore predict that COG0212 mediates a reaction in thiamine metabolism, most probably a salvage reaction. COG0212 cannot mediate a biosynthetic reaction because it is present in animals, and animals do not synthesize thiamine.

● Instructions and recommendations

Start by identifying all the known metabolites, enzymes and their EC numbers, and transporters in the assigned pathway in plants, bacteria, yeast, and animals. Remember that some pathways have variants; be sure to include these. This work will yield the equivalent of a KEGG pathway map.

Next, identify first Arabidopsis and then maize orthologs for as many as possible of the enzymes and transporters, using BlastP searches of Arabidopsis and maize proteins (at NCBI and Maizesequence.org), AraCyc, the KEGG pathway database, and the bibliome. Also identify which enzymatic or transport steps have no corresponding gene in plants, i.e. are cases of ‘missing genes’, and look for paralogs of the known pathway enzymes. Paralogs are almost always interesting targets for function predictions, but remember that they may be ‘overannotated’ (via homology) as pathway enzymes even though they are not. Note also:

- Metabolites, enzymes, and genes have been given various names over the years, and GenBank contains different versions of predictions for the same genes/proteins. Hence multiple gene/protein accession numbers can refer to the same gene/protein.

- Genes can be fused together, so it is important to check whether the proteins you identify have any ‘extra’ domains (use the NCBI Conserved Domains tool). Such domains may carry functions that have yet to be discovered.
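As an illustration of the ortholog search described above, here is a minimal Python sketch of the ‘reciprocal best hit’ heuristic. This is one common way to shortlist ortholog pairs from BlastP results, not a method prescribed for the projects. It assumes you have already reduced your results to rows of (query, subject, bit score), for example from tabular BLAST output. The gene names and scores below are invented placeholders.

```python
def best_hits(rows):
    """Return {query: subject}, keeping the highest-bit-score hit per query."""
    best = {}
    for query, subject, bitscore in rows:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_rows, reverse_rows):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(forward_rows)
    reverse = best_hits(reverse_rows)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical rows: (query, subject, bit score). Not real search results.
arabidopsis_vs_maize = [
    ("AtGeneA", "ZmGene1", 410.0),
    ("AtGeneA", "ZmGene2", 395.0),
    ("AtGeneB", "ZmGene2", 120.0),
]
maize_vs_arabidopsis = [
    ("ZmGene1", "AtGeneA", 408.0),
    ("ZmGene2", "AtGeneA", 390.0),
    ("ZmGene2", "AtGeneB", 118.0),
]

pairs = reciprocal_best_hits(arabidopsis_vs_maize, maize_vs_arabidopsis)
print(pairs)
```

In this toy case ZmGene2 is left unpaired even though it scores well against AtGeneA. Such leftovers are often paralogs, and they deserve a closer look rather than being discarded.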

Where there are truly multiple genes in Arabidopsis or maize (not just multiple entries for the same gene) that correspond to a pathway step, align all the sequences and draw a phylogenetic tree (e.g. using MEGA 4). This will show which maize genes correspond to which Arabidopsis genes.
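Distance-based tree methods start from a table of pairwise distances between the aligned sequences. The Python sketch below computes the simplest such measure, the p-distance (fraction of differing residues), to show what the tree is built from. It is not a replacement for a tree-building program. The three short sequences and their names are made up, and gapped columns are skipped pairwise, which is an assumption of this sketch.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


# Hypothetical aligned protein fragments (all the same length).
aligned = {
    "maize_1": "MKTAYIAKQR",
    "maize_2": "MKTAYLAKQR",
    "arabidopsis_1": "MKSAYIA-QR",
}

distances = {
    (x, y): p_distance(aligned[x], aligned[y])
    for x, y in combinations(sorted(aligned), 2)
}
closest = min(distances, key=distances.get)

for pair, d in sorted(distances.items()):
    print(pair, round(d, 3))
print("closest pair:", closest)
```

The pair with the smallest distance would be joined first in a simple clustering. Real analyses use full-length alignments and a substitution model.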

Use TargetP and Predotar to predict subcellular locations of proteins, and use PPDB and SUBA II (and the literature) to find experimental evidence on subcellular location.

Mine plant phenome databases (RAPID, SeedGenes, Chloroplast2010, BAPDB) for information on mutant phenotypes, if available.

Use the Golm Arabidopsis Expression dbase Multiple Expression Query tool to plot the expression in different organs of each Arabidopsis gene in the pathway.

For ‘missing genes’, predict candidates from Arabidopsis and maize based on homology with proteins from other organisms and/or comparative genomics analysis.

Use comparative genomics to identify candidates for unknown proteins that are (i) common to plants and bacteria, and (ii) associated in some way with the assigned pathway. Examples would be:

- paralogs (see above)

- cases where some bacteria have a gene in which a domain of unknown function is fused to an enzyme of the assigned pathway, and the unknown domain has a homolog in plants

- cases where an unknown gene is clustered on the chromosome in diverse bacteria with genes of the assigned pathway, and the unknown gene has a homolog in plants.
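The chromosomal clustering test in the last example can be sketched in Python as follows. The sketch assumes that each genome is given as an ordered list of gene-family labels, and that ‘clustered’ means lying within a chosen number of gene positions of a pathway gene. The genome names, gene labels, and window size are all hypothetical. Each genome is treated as linear, which ignores the fact that most bacterial chromosomes are circular.

```python
def genomes_with_clustering(genomes, unknown, pathway_genes, window=3):
    """Names of genomes where `unknown` lies within `window` positions of a pathway gene."""
    hits = []
    for name, gene_order in genomes.items():
        unknown_pos = [i for i, g in enumerate(gene_order) if g == unknown]
        pathway_pos = [i for i, g in enumerate(gene_order) if g in pathway_genes]
        if any(abs(u - p) <= window for u in unknown_pos for p in pathway_pos):
            hits.append(name)
    return sorted(hits)


# Hypothetical gene orders; "unkX" is the unknown gene, "pathA"/"pathB" are pathway genes.
genomes = {
    "genome_A": ["x1", "unkX", "pathA", "pathB", "x2"],
    "genome_B": ["pathA", "x3", "x4", "x5", "x6", "unkX"],
    "genome_C": ["x7", "pathB", "x8", "unkX"],
    "genome_D": ["pathA", "pathB", "x9"],
}

clustered = genomes_with_clustering(genomes, "unkX", {"pathA", "pathB"}, window=3)
print(clustered)
```

Clustering seen in several distantly related genomes is stronger evidence than clustering in a few close relatives, so note how diverse the genomes in the result are.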

Then use comparative genomics analysis (including post-genomic evidence, e.g. microarray data) and the bibliome to predict a function as precise as possible for the ‘unknown proteins’.
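One simple way to use microarray data as post-genomic evidence is to rank genes by how closely their expression follows that of the unknown gene across samples. The Python sketch below uses the Pearson correlation coefficient for this. The expression values and gene labels are invented for illustration, and Pearson correlation is only one of several possible co-expression measures.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical expression values for the same four samples in each gene.
expression = {
    "unknown_gene": [2.0, 4.0, 6.0, 8.0],
    "candidate_partner": [1.0, 2.1, 2.9, 4.2],
    "unrelated_gene": [5.0, 1.0, 4.0, 2.0],
}

target = expression["unknown_gene"]
ranked = sorted(
    ((gene, pearson(target, values))
     for gene, values in expression.items() if gene != "unknown_gene"),
    key=lambda item: item[1],
    reverse=True,
)
for gene, r in ranked:
    print(gene, round(r, 3))
```

A high correlation is supporting evidence only. It is most convincing when it agrees with the gene clustering, fusion, or phylogenetic evidence.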

For two cases from your predictions for ‘missing genes’ or ‘unknown proteins’ (you could take one of each, or two of either), summarize the evidence for your prediction in no more than 1 page total per case. Make use of figures to present the evidence.