Applied Bioinformatics
Finding your way in biological information
Table of Contents:
Applied Bioinformatics
Terms and abbreviations
Introduction
Finding back precise information
Searching and finding unknown information
Managing information
Complex analysis for the Lancet paper
Suggested Solutions (for all questions):
Additional URLs of Interest
Vs 070801 FvV
Terms and abbreviations
Affymetrix / Commercial supplier of DNA hybridisation microarraysGoogle / Web-Based Keyword Search Engine for indexed World Wide Web pages
Scholar Google / Web-Based Keyword Search Engine for indexed scientific papers
PubMed / Web-Based Keyword Search Engine at NCBI for indexed molecular biology scientific papers with links to molecular biology information
GenBank / DNA sequence repository in the United States of America
EMBL / European Molecular Biology Library: DNA sequence repository in Europe
DDBJ / DNA DataBase of Japan: DNA sequence repository in Japan
Swiss-Prot / Protein sequence repository in Europe
PIR / Protein Information Resource: Protein sequence repository in the USA
UniProt / Collaborative protein sequence repository in the USA and Europe
PDB / Protein DataBase: Protein structure repository in the USA
Ensembl / Genome Sequence of Animals, repository and interface, in Europe
UCSC / Genome Sequence of Animals, repository and interface, in the USA
GEO / DNA microarray expression information repository in the USA
ArrayExpress / DNA microarray expression information repository in Europe
OMIM / Disease and genome information repository in the USA
Entrez Gene / Web-Based Keyword Search Engine for biological molecular information, based in the USA
SRS / Web-Based Keyword Search Engine for biological molecular information, based in Europe
NCBI / National Center for Biological Information, USA
EBI / European Bioinformatics Institute, UK
BLAST / Basic Local Alignment Search Tool, a software tool to detect DNA or Protein sequences similar to a DNA or Protein query sequence
CLUSTALW / Cluster Alignment W, a software tool to align DNA or Protein sequences
ExPASy / Expert Protein Analysis System for analysis of protein sequence properties
Abcam / Antibody commercial provider
Invitrogen / Molecular Reagent commercial provider
Treefam / Sequence Alignment database
TextPad / Text Editor, with special interesting features
iRider / Web Browser, with special interesting features
Deltagen / Transgenic mice commercial provider
Introduction
Suppose the following situation:
You have received a copy of the recent Lancet paper (attached) with the request by the your department head to obtain additional information.
The paper lists database identifiers, gene names and symbols, tissue cell lines, disease types, Affymetrix database identifiers, etc. You need to find and to keep track of all this information quickly, cheaply, and understandable by your supervisor.
You would usually start laboratory work and confirm experimental findings described, but now you need to review the paper, to discuss conclusions, and to formulate follow-up studies.
Initial attempts lead you to Google and various redundant sources on the World Wide Web, and you find that a lot of information is updated very quickly. So your search must be easy to repeat in the future.
This workshop will help you:
- find known information – the database record of the sequence with a particular accession number
- search unknown information – mouse sequences corresponding to the listed human sequences
- manage all this information – quickly, cheaply, understandably, and reproducibly
Finding back precise information
A bit of history will explain some terms and avoid further confusion.
In the 1980s, several databases started to collect sequence information: GenBank in the USA, EMBL in Europe and DDBJ in Japan for DNA; SwissProt in Switzerland and PIR in the USA for proteins; PDB in the USA for protein structures. Every database lists each entry (sequence or structure) with a unique and typical ‘database identifier’. To avoid confusion and facilitate data exchange, DNA and protein databases agreed to give an ‘accession number’ to each entry, a number that is unique but common in all databases. The consequence is: the ‘accession numbers’ almost never change, and they are almost always the same if you search GenBank or EMBL or DDBJ, or if you search UniProt or PIR. In contrast, ‘database identifiers’ change all the time.
In the 1990s, the Internet became extremely popular by the World Wide Web invented at the CERN. Suddenly databases were accessible (by SRS) and searchable (by BLAST) on Web pages. It became apparent that different biologists were naming similar genes with completely different names, and were giving similar names to completely different genes in other species. Commissions started to propose official gene names and symbols. The 1990s were also characterised by the development of high throughput techniques such as genome sequencing, and genomic databases such as Ensembl and UCSC appeared.
In the 2000s, gene expression arrays enter mainstream biology, and require databases that are hosting and presenting results from those experiments, such as GEO in the USA and ArrayExpress at the EBI in Europe. Geneticists run high throughput analyses and start to produce information on sequence variation, association with disease. OMIM stores this type of information in a biologist-friendly way.
Literature mining, high throughput investigation of protein interactions, measurement of metabolic fluxes, promoter modulation, etc. are starting to generate information that can be used in modelling simple biological systems. Integration and representation of this new type of integrated information is a major endeavour for the next decade in bioinformatics.
More information can be found at the following resources:
- Google and Scholar Google
- SRS: molecular information
- Entrez Gene: molecular information
- Ensembl or UCSC browser: genomic information
- NCBI: information resource
- EBI: information resource
- OMIM: disease information
System to search for keywords on the World Wide Web, and so much more…
by Google Inc
example:
Search for the paper of Wang et al. (Foekens is corresponding author) in the Lancet.
Download the Affymetrix relevant chip annotation information from the Affymetrix web site. tip: you will probably have to register in order to obtain the information.
Other search examples, which is the most convenient? Search for human kinases
Search for Escherichia coli proteins longer than 100 amino acids, good luck with Google
Scholar Google
System to search for keywords in scientific literature
by Google Inc
example:
Search for the Lancet paper of Wang et al. again.
What are Dr. Foekens’ publications, what has he been working on, what is his curriculum? tip: use Google, Scholar Google. Which is the most convenient?
Other search examples, search for human kinases
Search for Escherichia coli proteins longer than 100 amino acids; good luck
SRS
Sequence Retrieval System
System to search for keywords in sequence-related databases integrated to SRS
by Thure Etzold
Dutch SRS server:
example:
Search for the Lancet paper of Wang et al. again.
Search for a gene of interest in the Lancet list.
Search for Escherichia coli proteins longer than 100 amino acids
Search for sequences published by Jean-Marc Neefs
Entrez Gene
Automated curated system to search for genes and related information of (almost) completely sequenced genomes
by NCBI
http://ncbi.nlm.nih.gov/Entrez
example:
Search for a gene of interest in the Lancet list.
Search for Glutamate Transporters
Search for cirrhosis related genes.
Ensembl
Automated curated system for genomic information of (almost) completely sequenced genomes
by EBI and Sanger Centre
example:
Search for a gene of interest in the Lancet list.
Search for fish Glutamate Transporters
Search for a few Affymetrix identifiers from the Lancet paper if available.
UCSC Genome Browser
Automated curated system for genomic information of (almost) completely sequenced genomes
by University of California at Santa Cruz
http://genome.ucsc.edu/index.html
example:
Search for a gene of interest in the Lancet list.
Search for Glutamate Transporters
Search for a few Affymetrix identifiers from the Lancet paper if available.
NCBI
National Center for Biological Information
http://ncbi.nlm.nih.gov/
Integrated system to search for any sequence-related information
example:
Search for Glutamate Transporters and follow links from there
EBI
European Bioinformatics Institute
Integrated system to search for any sequence-related information
example:
Search for a gene of interest in the Lancet list.
Search for Glutamate Transporters and link out from there
OMIM
Online Mendelian Inheritance in Man
Hand-curated system to search for disease-related keywords and genes
by Victor McKusick
http://www.ncbi.nlm.nih.gov/sites/entrez?db=OMIM
example:
Search for a gene of interest in the Lancet list.
Search for cirrhosis related genes
Search for Glutamate Transporters in human and other species
Searching and fFinding unknown information
You have a few sequences listed. Is there any other related information?
More information can be found at the following resources:
BLAST: sequence similarity
CLUSTALW: sequence alignment
ExPASy: Sequence Analysis
BLAST
Basic Local Alignment Search Tool
Searches for sequence similar to a query protein or DNA sequence in any specified protein or DNA sequence database
by David Lipman
http://www.ncbi.nlm.nih.gov/BLAST/
Dutch BatchBlast:
CoPub mapper:
example:
From the gene of interest in the Lancet sequence, find similar sequences.
CLUSTALW
Cluster Alignment W
‘Align’, or display DNA or Protein sequences, such that individual sequences are placed horizontally, and residues in common positions between sequences are placed vertically. Amino acids ‘missing’ in shorter sequences are replaced by ‘gap’ symbols to ensure that all sequences in the alignment are equally long.
by Julie Thomson
example:
From the sequences obtained with BLAST, create a sequence alignment for mammalian sequences only.
Align human Glutamate Transporters
ExPASy
Expert Protein Analysis System to analyse protein sequences
by Gasteiger and Bairoch
example:
Find human Glutamate Transporter 1
Managing information
Use the following resources:
- Microsoft Excel: manage tabular data
- TextPad: manage test data
- iRIDER: manage Web Resources
Excel tips
- Ordering information. Make sure your table has headers for ALL columns, otherwise your data will mix up
- Adding information. You can press CTRL+D to add the same information as the cell above; or CTRL+R to add the same information as the cell left. It also works with selections.
- Summarizing information. Pivot tables are a fabulous tool to summarize large tables
- Linking Out. The HYPERLINK function allows to calculate as many hyperlinks as you like from information in one or several tables. Left-clicking the cell opens the link immediately. The example below takes you step by step to creating URL in Excel.
We must have gene symbols in Excel column A, rows 1 to 76
Consider the formula:
There are 2 parts in this URL:
which is the core or the root or the general term of the URL, and
GSK3B
which is gene specific (gene symbol)
We could write the URL for every gene in every cell in column B. This takes time, cannot be changed easily, and is not 100% accurate
We can save time, become 100% accurate, and flexible. Write “GSK3B” in cell A1, and write the following in cell B1:
="
When we copy it to the next rows, the gene name changes in column B automatically.
This is still no clickable URL. To make it clickable in Excel, write the following in cell B1:
=HYPERLINK("
Now the URL is clickable, but quite ugly. To improve, write the following in cell B1:
=HYPERLINK(" "&A1)
The URL stays clickable, but now the text "Ensembl GSK3B" is shown.
When we copy it to the next rows, the gene name changes in column B automatically.
We can still do much better.
write the following in cell Z1:
write the following in cell B1:
=HYPERLINK($Z$1&A1,"Ensembl "&A1)
You will see no change, but the formula is now much smaller, and will only take one value: Z1 for the link, and all gene names
Please note the '$' signs, which fix the reference of the formula to cell Z1 in all cases
The advantage is that you need to make only one change when the URL changes
for example:
write the following in cell Z2:
write the following in cell B1:
=HYPERLINK($Z$1&A1,"Ensembl Human "&$A1)
gives you the human link
write the following in cell C1:
=HYPERLINK($Z$2&A1,"Ensembl Mouse "&$A1)
gives you the mouse link
you can copy B1 and C1 to all rows, and have 152 URLs defined immediately
Exercise: write hyperlinks in column D for rat (Rattus_norvegicus) links
- Merging information. Combining INDEX and MATCH functions in INDEX(column_you_need, MATCH(cell_reference, column_you_search , 0)) will find information from the ‘column_you_need’ for the ‘cell_reference” and add it to your current data. We will practise a few examples.
Merging information in Excel
Excel is bad at merging data in a standard way. Merging rows will only keep the top row; merging columns will only keep the left column; merging files just does not work.
If you type
=MATCH(first_cell, column_A, 0)
in a cell, Excel will find the first row in 'column_A' with the same value as 'first_cell' and return the number of that row.
If you type
=INDEX(column_B, any_number)
in a cell, Excel will return the value of row 'any_number' in column 'column_B'.
This is powerful if you make Excel calculate this 'any_number' by the MATCH formula described above.
This results in:
=INDEX(column_B, MATCH(first_cell, column_A, 0))
Let us work with one example:
Copy the following simple table (tab-delimited text) in Excel
NrLetterWhat?NrColor
1a1red
2b3green
3c2yellow
4d5blue
5e4black
6f
question is: what color are a, b, c, d, e?
Next to "a", type =INDEX(select the column with Color, MATCH(select cell in first column with Nr, select second column with Nr,0))
In the cell, the result should be "red"
Now be careful. As long as references are not fixed by adding '$' to column or rows, these will change when you copy the cells!
So fix references (add '$' where necessary) and copy cells where you want them
This also works between Excel workbooks or between worksheets. It looks impressive but the principle stays the same.
When you are satisfied with the merge results, always copy the new information as values.
Otherwise you will lose the information when the file changes or is deleted.
Now you are ready to make some Excel magic!
TextPad tips
- TextPad has a number of unique interesting time saving features
- Ordering information. Sorting text line by line allows deleting duplicate lines.
- Adding information. Copy and paste works as usual but also as text block
- Regular Expressions. Allow replacement of spaces by tabs, or very complex operations, line by line. Use POSIX settings
- Linking Out. Hyperlinks in the text can be opened by pressing CTRL+G
- Powerful Macros. Everything typed is recorded and can be played back several times. The disadvantage is that you cannot yet edit the saved macros.
iRIDER tips
- iRIDER has a number of unique interesting time saving features
- It is a tabbed browser that allows searching hundreds of links pasted in the application simultaneously. You can view the first results while the rest of the pages are downloaded. A selection in a Web page text followed by a right click will open all selected links.
The complex analysis on the next page will require a combination of the tools mentioned above.
For analysis of the Lancet paper, you can quickly find relevant information TODAY, without programming anything. Writing Excel formulas or TextPad macros is not REAL programming, but will get almost any mundane job done VERY quickly. Of course, if you need to perform more complicated data analysis in a more systematic way, and interfaced on the Web for other users, you will have to resort to programming.
Anyway, programming will use all of the techniques that have been outlined in this workshop: data retrieval, regular expression, data parsing, collection, dynamic URL generation, relational database queries, chaining, etc. Programming will add speed, presentation layers, allow wider application and encourage a systematic approach.
If you really, really want to start programming, use PERL for coding, R for analysis, MySQL for database and all World Wide Web resources for help. They are all for free.
I have brought a few books of interest for bioinformatics starters. If you need additional details and more references, please provide your e-mail contact and I will forward these to you.
Complex analysis for the Lancet paper
In the Lancet Paper, you have a list of accession numbers, gene symbols, Affymetrix identifiers. What are these about?
Get the summary information of table 3 in a table. Tip: use Acrobat Reader and Textpad to obtain a tab-delimited file and put the results in Excel. You cannot obtain these data in Excel immediately. Why not?
Locate the dataset discussed in the paper in the GEO database. Discuss the expression patterns for the 76 reported genes of interest. tip: use Google.
Obtain the protein sequences (if available) for the 76 reported genes of interest. tip: use the gene identifiers or the Affymetrix numbers as a start.
Is there any way to manage this: without programming a database? I think you can: Affymetrix data contains both probeset identifiers and gene symbols. You can search for gene symbols in SRS or NCBI or ExPASy, and get to protein sequences from there. You only need to put the sequences back to the existing table of 76 probesets in the correct order in Excel. We will work through this example step by step.