Applied Bioinformatics

Finding your way in biological information

Table of Contents:

Terms and abbreviations

Introduction

Finding back precise information

Searching and finding unknown information

Managing information

Complex analysis for the Lancet paper

Suggested Solutions (for all questions):

Additional URLs of Interest

Vs 070801 FvV

Terms and abbreviations

Affymetrix / Commercial supplier of DNA hybridisation microarrays
Google / Web-Based Keyword Search Engine for indexed World Wide Web pages
Scholar Google / Web-Based Keyword Search Engine for indexed scientific papers
PubMed / Web-Based Keyword Search Engine at NCBI for indexed molecular biology scientific papers with links to molecular biology information
GenBank / DNA sequence repository in the United States of America
EMBL / European Molecular Biology Library: DNA sequence repository in Europe
DDBJ / DNA DataBase of Japan: DNA sequence repository in Japan
Swiss-Prot / Protein sequence repository in Europe
PIR / Protein Information Resource: Protein sequence repository in the USA
UniProt / Collaborative protein sequence repository in the USA and Europe
PDB / Protein DataBase: Protein structure repository in the USA
Ensembl / Genome Sequence of Animals, repository and interface, in Europe
UCSC / Genome Sequence of Animals, repository and interface, in the USA
GEO / DNA microarray expression information repository in the USA
ArrayExpress / DNA microarray expression information repository in Europe
OMIM / Disease and genome information repository in the USA
Entrez Gene / Web-Based Keyword Search Engine for biological molecular information, based in the USA
SRS / Web-Based Keyword Search Engine for biological molecular information, based in Europe
NCBI / National Center for Biological Information, USA
EBI / European Bioinformatics Institute, UK
BLAST / Basic Local Alignment Search Tool, a software tool to detect DNA or Protein sequences similar to a DNA or Protein query sequence
CLUSTALW / Cluster Alignment W, a software tool to align DNA or Protein sequences
ExPASy / Expert Protein Analysis System for analysis of protein sequence properties
Abcam / Antibody commercial provider
Invitrogen / Molecular Reagent commercial provider
Treefam / Sequence Alignment database
TextPad / Text Editor, with special interesting features
iRider / Web Browser, with special interesting features
Deltagen / Transgenic mice commercial provider

Introduction

Suppose the following situation:

You have received a copy of the recent Lancet paper (attached) with the request by the your department head to obtain additional information.

The paper lists database identifiers, gene names and symbols, tissue cell lines, disease types, Affymetrix database identifiers, etc. You need to find and to keep track of all this information quickly, cheaply, and understandable by your supervisor.

You would usually start laboratory work and confirm experimental findings described, but now you need to review the paper, to discuss conclusions, and to formulate follow-up studies.

Initial attempts lead you to Google and various redundant sources on the World Wide Web, and you find that a lot of information is updated very quickly. So your search must be easy to repeat in the future.

This workshop will help you:

find known information – the database record of the sequence with a particular accession number
search unknown information – mouse sequences corresponding to the listed human sequences
manage all this information – quickly, cheaply, understandably, and reproducibly

Finding back precise information

A bit of history will explain some terms and avoid further confusion.

In the 1980s, several databases started to collect sequence information: GenBank in the USA, EMBL in Europe and DDBJ in Japan for DNA; SwissProt in Switzerland and PIR in the USA for proteins; PDB in the USA for protein structures. Every database lists each entry (sequence or structure) with a unique and typical ‘database identifier’. To avoid confusion and facilitate data exchange, DNA and protein databases agreed to give an ‘accession number’ to each entry, a number that is unique but common in all databases. The consequence is: the ‘accession numbers’ almost never change, and they are almost always the same if you search GenBank or EMBL or DDBJ, or if you search UniProt or PIR. In contrast, ‘database identifiers’ change all the time.

In the 1990s, the Internet became extremely popular by the World Wide Web invented at the CERN. Suddenly databases were accessible (by SRS) and searchable (by BLAST) on Web pages. It became apparent that different biologists were naming similar genes with completely different names, and were giving similar names to completely different genes in other species. Commissions started to propose official gene names and symbols. The 1990s were also characterised by the development of high throughput techniques such as genome sequencing, and genomic databases such as Ensembl and UCSC appeared.

In the 2000s, gene expression arrays enter mainstream biology, and require databases that are hosting and presenting results from those experiments, such as GEO in the USA and ArrayExpress at the EBI in Europe. Geneticists run high throughput analyses and start to produce information on sequence variation, association with disease. OMIM stores this type of information in a biologist-friendly way.

Literature mining, high throughput investigation of protein interactions, measurement of metabolic fluxes, promoter modulation, etc. are starting to generate information that can be used in modelling simple biological systems. Integration and representation of this new type of integrated information is a major endeavour for the next decade in bioinformatics.

More information can be found at the following resources:

Google and Scholar Google
SRS: molecular information
Entrez Gene: molecular information
Ensembl or UCSC browser: genomic information
NCBI: information resource
EBI: information resource
OMIM: disease information

Google

System to search for keywords on the World Wide Web, and so much more…

by Google Inc

example:

Search for the paper of Wang et al. (Foekens is corresponding author) in the Lancet.

Download the Affymetrix relevant chip annotation information from the Affymetrix web site. tip: you will probably have to register in order to obtain the information.

Other search examples, which is the most convenient? Search for human kinases

Search for Escherichia coli proteins longer than 100 amino acids, good luck with Google

Scholar Google

System to search for keywords in scientific literature

by Google Inc

example:

Search for the Lancet paper of Wang et al. again.

What are Dr. Foekens’ publications, what has he been working on, what is his curriculum? tip: use Google, Scholar Google. Which is the most convenient?

Other search examples, search for human kinases

Search for Escherichia coli proteins longer than 100 amino acids; good luck

SRS

Sequence Retrieval System

System to search for keywords in sequence-related databases integrated to SRS

by Thure Etzold

Dutch SRS server:

example:

Search for the Lancet paper of Wang et al. again.

Search for a gene of interest in the Lancet list.

Search for Escherichia coli proteins longer than 100 amino acids

Search for sequences published by Jean-Marc Neefs

Entrez Gene

Automated curated system to search for genes and related information of (almost) completely sequenced genomes

by NCBI

http://ncbi.nlm.nih.gov/Entrez

example:

Search for a gene of interest in the Lancet list.

Search for Glutamate Transporters

Search for cirrhosis related genes.

Ensembl

Automated curated system for genomic information of (almost) completely sequenced genomes

by EBI and Sanger Centre

example:

Search for a gene of interest in the Lancet list.

Search for fish Glutamate Transporters

Search for a few Affymetrix identifiers from the Lancet paper if available.

UCSC Genome Browser

Automated curated system for genomic information of (almost) completely sequenced genomes

by University of California at Santa Cruz

http://genome.ucsc.edu/index.html

example:

Search for a gene of interest in the Lancet list.

Search for Glutamate Transporters

Search for a few Affymetrix identifiers from the Lancet paper if available.

NCBI

National Center for Biological Information

http://ncbi.nlm.nih.gov/

Integrated system to search for any sequence-related information

example:

Search for Glutamate Transporters and follow links from there

EBI

European Bioinformatics Institute

Integrated system to search for any sequence-related information

example:

Search for a gene of interest in the Lancet list.

Search for Glutamate Transporters and link out from there

OMIM

Online Mendelian Inheritance in Man

Hand-curated system to search for disease-related keywords and genes

by Victor McKusick

http://www.ncbi.nlm.nih.gov/sites/entrez?db=OMIM

example:

Search for a gene of interest in the Lancet list.

Search for cirrhosis related genes

Search for Glutamate Transporters in human and other species

Searching and fFinding unknown information

You have a few sequences listed. Is there any other related information?

More information can be found at the following resources:

BLAST: sequence similarity

CLUSTALW: sequence alignment

ExPASy: Sequence Analysis

BLAST

Basic Local Alignment Search Tool

Searches for sequence similar to a query protein or DNA sequence in any specified protein or DNA sequence database

by David Lipman

http://www.ncbi.nlm.nih.gov/BLAST/

Dutch BatchBlast:

CoPub mapper:

example:

From the gene of interest in the Lancet sequence, find similar sequences.

CLUSTALW

Cluster Alignment W

‘Align’, or display DNA or Protein sequences, such that individual sequences are placed horizontally, and residues in common positions between sequences are placed vertically. Amino acids ‘missing’ in shorter sequences are replaced by ‘gap’ symbols to ensure that all sequences in the alignment are equally long.

by Julie Thomson

example:

From the sequences obtained with BLAST, create a sequence alignment for mammalian sequences only.

Align human Glutamate Transporters

ExPASy

Expert Protein Analysis System to analyse protein sequences

by Gasteiger and Bairoch

example:

Find human Glutamate Transporter 1

Managing information

Use the following resources:

Microsoft Excel: manage tabular data
TextPad: manage test data
iRIDER: manage Web Resources

Excel tips

Ordering information. Make sure your table has headers for ALL columns, otherwise your data will mix up
Adding information. You can press CTRL+D to add the same information as the cell above; or CTRL+R to add the same information as the cell left. It also works with selections.
Summarizing information. Pivot tables are a fabulous tool to summarize large tables
Linking Out. The HYPERLINK function allows to calculate as many hyperlinks as you like from information in one or several tables. Left-clicking the cell opens the link immediately. The example below takes you step by step to creating URL in Excel.

We must have gene symbols in Excel column A, rows 1 to 76

Consider the formula:

There are 2 parts in this URL:

which is the core or the root or the general term of the URL, and

GSK3B

which is gene specific (gene symbol)

We could write the URL for every gene in every cell in column B. This takes time, cannot be changed easily, and is not 100% accurate

We can save time, become 100% accurate, and flexible. Write “GSK3B” in cell A1, and write the following in cell B1:

When we copy it to the next rows, the gene name changes in column B automatically.

This is still no clickable URL. To make it clickable in Excel, write the following in cell B1:

=HYPERLINK("

Now the URL is clickable, but quite ugly. To improve, write the following in cell B1:

=HYPERLINK(" "&A1)

The URL stays clickable, but now the text "Ensembl GSK3B" is shown.

When we copy it to the next rows, the gene name changes in column B automatically.

We can still do much better.

write the following in cell Z1:

write the following in cell B1:

=HYPERLINK($Z$1&A1,"Ensembl "&A1)

You will see no change, but the formula is now much smaller, and will only take one value: Z1 for the link, and all gene names

Please note the '$' signs, which fix the reference of the formula to cell Z1 in all cases

The advantage is that you need to make only one change when the URL changes

for example:

write the following in cell Z2:

write the following in cell B1:

=HYPERLINK($Z$1&A1,"Ensembl Human "&$A1)

gives you the human link

write the following in cell C1:

=HYPERLINK($Z$2&A1,"Ensembl Mouse "&$A1)

gives you the mouse link

you can copy B1 and C1 to all rows, and have 152 URLs defined immediately

Exercise: write hyperlinks in column D for rat (Rattus_norvegicus) links

Merging information. Combining INDEX and MATCH functions in INDEX(column_you_need, MATCH(cell_reference, column_you_search , 0)) will find information from the ‘column_you_need’ for the ‘cell_reference” and add it to your current data. We will practise a few examples.

Merging information in Excel

Excel is bad at merging data in a standard way. Merging rows will only keep the top row; merging columns will only keep the left column; merging files just does not work.

If you type

=MATCH(first_cell, column_A, 0)

in a cell, Excel will find the first row in 'column_A' with the same value as 'first_cell' and return the number of that row.

If you type

=INDEX(column_B, any_number)

in a cell, Excel will return the value of row 'any_number' in column 'column_B'.

This is powerful if you make Excel calculate this 'any_number' by the MATCH formula described above.

This results in:

=INDEX(column_B, MATCH(first_cell, column_A, 0))

Let us work with one example:

Copy the following simple table (tab-delimited text) in Excel

NrLetterWhat?NrColor

1a1red

2b3green

3c2yellow

4d5blue

5e4black

question is: what color are a, b, c, d, e?

Next to "a", type =INDEX(select the column with Color, MATCH(select cell in first column with Nr, select second column with Nr,0))

In the cell, the result should be "red"

Now be careful. As long as references are not fixed by adding '$' to column or rows, these will change when you copy the cells!

So fix references (add '$' where necessary) and copy cells where you want them

This also works between Excel workbooks or between worksheets. It looks impressive but the principle stays the same.

When you are satisfied with the merge results, always copy the new information as values.

Otherwise you will lose the information when the file changes or is deleted.

Now you are ready to make some Excel magic!

TextPad tips

TextPad has a number of unique interesting time saving features
Ordering information. Sorting text line by line allows deleting duplicate lines.
Adding information. Copy and paste works as usual but also as text block
Regular Expressions. Allow replacement of spaces by tabs, or very complex operations, line by line. Use POSIX settings
Linking Out. Hyperlinks in the text can be opened by pressing CTRL+G
Powerful Macros. Everything typed is recorded and can be played back several times. The disadvantage is that you cannot yet edit the saved macros.

iRIDER tips

iRIDER has a number of unique interesting time saving features
It is a tabbed browser that allows searching hundreds of links pasted in the application simultaneously. You can view the first results while the rest of the pages are downloaded. A selection in a Web page text followed by a right click will open all selected links.

The complex analysis on the next page will require a combination of the tools mentioned above.

For analysis of the Lancet paper, you can quickly find relevant information TODAY, without programming anything. Writing Excel formulas or TextPad macros is not REAL programming, but will get almost any mundane job done VERY quickly. Of course, if you need to perform more complicated data analysis in a more systematic way, and interfaced on the Web for other users, you will have to resort to programming.

Anyway, programming will use all of the techniques that have been outlined in this workshop: data retrieval, regular expression, data parsing, collection, dynamic URL generation, relational database queries, chaining, etc. Programming will add speed, presentation layers, allow wider application and encourage a systematic approach.

If you really, really want to start programming, use PERL for coding, R for analysis, MySQL for database and all World Wide Web resources for help. They are all for free.

I have brought a few books of interest for bioinformatics starters. If you need additional details and more references, please provide your e-mail contact and I will forward these to you.

Complex analysis for the Lancet paper

In the Lancet Paper, you have a list of accession numbers, gene symbols, Affymetrix identifiers. What are these about?

Get the summary information of table 3 in a table. Tip: use Acrobat Reader and Textpad to obtain a tab-delimited file and put the results in Excel. You cannot obtain these data in Excel immediately. Why not?

Locate the dataset discussed in the paper in the GEO database. Discuss the expression patterns for the 76 reported genes of interest. tip: use Google.

Obtain the protein sequences (if available) for the 76 reported genes of interest. tip: use the gene identifiers or the Affymetrix numbers as a start.

Is there any way to manage this: without programming a database? I think you can: Affymetrix data contains both probeset identifiers and gene symbols. You can search for gene symbols in SRS or NCBI or ExPASy, and get to protein sequences from there. You only need to put the sequences back to the existing table of 76 probesets in the correct order in Excel. We will work through this example step by step.