Structuring Data Files and Data Matrices Discussion Document V0

Discussion document v0.1 – Structuring data files associated with ISA-TAB

Authors

Philippe Rocca-Serra, Susanna-Assunta Sansone

Scope

This document has been drafted for the BioInvestigation Index system. It does NOT aim to detail all possible data formats associated with an ISA-TAB file, and the authors rely on existing guidelines wherever these exist.

Others can consider this document as a source of information.

Table of Contents

1. Microarray application

2. Mass spectrometry application

2.1 Protein identification file

2.2 Peptide identification file

2.3 Post-translational modification file

2.4 Metabolite identification

3. NMR spectroscopy application to metabolite identification

4. High throughput sequencing application

5. Genotyping application

1. Microarray application

For DNA microarray-based assays, we require submitters to follow the MAGE-TAB recommendations for matrix reporting. When providing raw data, provide the native raw data files, that is, all the files generated from the scanner output. We are reproducing the MAGE-TAB documentation below to provide information regarding normalized and transformed data files. The MAGE-TAB group has devised a way to encode information so that it can be related to hybridization objects and to the Reporter and Composite Element entries defined in the Array Design File.

2. Mass spectrometry application

For peptide and protein identification, we again follow the practices established by the PSI PRIDE format, which are described here:

2.1 Protein identification file

Valid File Header Name / Data Type / Description
Protein Unique ID
Peptide Count / integer / Number of peptides found to match that protein
Identification Type(*) / string / From a controlled vocabulary, either gel Free or 2D-Gel based
Protein Accession / string / An accession number from the UniProt database
Accession Version
Splice Isoform / string
Search Database / string
Search DB Version / string
Spectrum Reference / string
Score / float
Threshold / float
Search Engine / string / Controlled vocabulary
Sequence Coverage / float / Percentage of protein sequence coverage
Gel Link(*) / URL / A path and filename pointing to either gel images or gel image quantitation files
X-Coord(*) / float / Gel spot x coordinate
Y-Coord(*) / float / Gel spot y coordinate
Mol. Wt.(*) / float / Estimate of protein molecular weight at spot location
pI(*) / float / Isoelectric value at spot location
Term Name
Term Accession
Term Source REF

(*) The fields marked with an asterisk (shaded light grey in the original table) are optional and should only be filled in if Identification Type takes the “2D-Gel based” value.
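The rule on the optional gel fields can be checked mechanically. The following is a minimal sketch only: it assumes the protein identification file is tab-delimited with the header names above in its first row, that the "(*)" marker is a footnote sign and not part of the header, and that the controlled-vocabulary value is spelled "2D-Gel based". The function name check_gel_fields is hypothetical. Identification Type itself is not treated as a gel-only field here, since it is the field that drives the rule.

```python
import csv
import io

# Header names copied from the table above, without the "(*)" footnote marker.
GEL_ONLY_FIELDS = ["Gel Link", "X-Coord", "Y-Coord", "Mol. Wt.", "pI"]
# Assumed spelling of the controlled-vocabulary term.
GEL_BASED_VALUE = "2D-Gel based"


def check_gel_fields(tsv_text):
    """Return (row_number, message) tuples for rows that fill gel-only
    fields although Identification Type is not the gel-based value."""
    problems = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row_number, row in enumerate(reader, start=1):
        id_type = (row.get("Identification Type") or "").strip()
        # Missing columns and empty cells both count as "not filled".
        filled = [f for f in GEL_ONLY_FIELDS if (row.get(f) or "").strip()]
        if filled and id_type != GEL_BASED_VALUE:
            problems.append(
                (row_number, "gel-only fields filled: " + ", ".join(filled))
            )
    return problems
```

Row numbers count data rows, starting at 1 after the header row. An empty result means no row violates the rule; it does not validate any other column.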

2.2 Peptide identification file

Valid File Header Name / Data Type / Description
Peptide Unique ID / string / Locally unique and resolvable identifier
Protein Unique ID / string / Locally unique and resolvable identifier
Protein Accession / string / UniProt database identifier
Sequence / string / The sequence of the peptide as identified by mass spectrometry (please do not include flanking sequences here)
Start / integer / The start position of the match in the protein
End / integer
Spectrum Reference / string / Spectrum reference identifier pointing to an mzData object
Term Name / string
Term Accession / string
Term Source REF / string
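Because the peptide file refers back to the protein file through Protein Unique ID, a simple consistency check between the two files is possible. The sketch below assumes both files are tab-delimited with the header names listed above, and that Start should not be greater than End; neither assumption is stated in this document. The function names read_rows and check_peptides are hypothetical.

```python
import csv
import io


def read_rows(tsv_text):
    """Parse tab-delimited text into a list of dicts keyed by header name."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def check_peptides(protein_tsv, peptide_tsv):
    """Return a list of messages for peptide rows that reference an unknown
    protein or whose Start position lies after their End position."""
    known_proteins = {row["Protein Unique ID"] for row in read_rows(protein_tsv)}
    problems = []
    for row in read_rows(peptide_tsv):
        peptide_id = row["Peptide Unique ID"]
        protein_id = row["Protein Unique ID"]
        if protein_id not in known_proteins:
            problems.append(f"{peptide_id}: unknown protein {protein_id}")
        # Assumption: positions are integers and Start <= End.
        if int(row["Start"]) > int(row["End"]):
            problems.append(f"{peptide_id}: Start is after End")
    return problems
```

The check only covers the link between the two files and the ordering of the positions; it does not verify that the sequence actually occurs in the protein.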

2.3 Post-translational modification file

The table below lists the valid headers that can be used to report post-translational modifications on identified peptides.

Valid File Header Name / Data Type
Peptide Unique ID / string
Peptide Sequence / string
Location (Relative to Peptide) / string
Accession / string
PTM Database / string
Database Version / string
Average Delta Value / float
Mono Delta Value

2.4 Metabolite identification

Valid File Header Name / Data Type / Description
Metabolite Unique ID / string / Locally unique and resolvable identifier
Chemical Entity Unique ID / string / Locally unique and resolvable identifier
Chemical Entity Accession / string / A chemical database identifier such as ChEBI, CAS or ChemIDplus
Spectrum Reference / string / Spectrum reference identifier pointing to an mzData object
Identification Type(*) / string / From a controlled vocabulary, either gas or liquid chromatography
Retention Time / float / The retention time on the column. Specify the unit using the Term Source mechanism
Term Name / string
Term Accession / string
Term Source REF / string
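To illustrate how a metabolite identification file with these headers could be produced, here is a minimal sketch. It assumes a tab-delimited layout with the header names in the first row and drops the "(*)" marker from the header; the function name write_metabolite_table and the record values in the usage note are hypothetical.

```python
import csv
import io

# Header names copied from the table above, in the same order.
METABOLITE_HEADERS = [
    "Metabolite Unique ID",
    "Chemical Entity Unique ID",
    "Chemical Entity Accession",
    "Spectrum Reference",
    "Identification Type",
    "Retention Time",
    "Term Name",
    "Term Accession",
    "Term Source REF",
]


def write_metabolite_table(records):
    """Serialise a list of dicts as tab-delimited text.

    Missing fields are written as empty cells; a key that is not one of
    METABOLITE_HEADERS makes csv.DictWriter raise ValueError.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=METABOLITE_HEADERS,
        delimiter="\t",
        restval="",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

A call such as write_metabolite_table([{"Metabolite Unique ID": "M1", "Retention Time": 12.4, "Term Name": "minute"}]) yields a header row followed by one data row. Putting the unit name in Term Name is only one possible reading of the Term Source mechanism mentioned above.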

3. NMR spectroscopy application to metabolite identification

Reporting is similar to the mass spectrometry application when reporting metabolite identification (section 2.4).

4. High throughput sequencing application

Create a zip or tar.gz archive containing all files related to a given sample.

Manufacturer / File Format / Comment
Solexa / FASTQ OR srf OR {_seq.txt AND _prb.txt AND _sig2.txt} / Solexa can export in three different formats; select one
454 / .sff format
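The per-sample archive described above can be built with the Python standard library. This is a minimal sketch that assumes one tar.gz archive named after the sample and files stored flat inside it; the function name archive_sample and the naming scheme are hypothetical, not requirements of this document.

```python
import tarfile
from pathlib import Path


def archive_sample(sample_name, file_paths, out_dir):
    """Bundle all files belonging to one sample into <sample_name>.tar.gz
    and return the path of the archive."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    archive_path = out_dir / f"{sample_name}.tar.gz"
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in map(Path, file_paths):
            # Store only the file name so the archive has no directory prefix.
            archive.add(path, arcname=path.name)
    return archive_path
```

For a Solexa sample exported as text files, the list passed in would contain the sample's _seq.txt, _prb.txt and _sig2.txt files. Storing files flat means two input files with the same name in different directories would collide inside the archive.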

5. Genotyping application

Guidelines issued by the Wellcome Trust Case Control Consortium could be used as a starting point.

See also the HapMap consortium, for instance.
