Final Technical Report

IV Latin American Course on Bioinformatics for Tropical Disease Research

January 16th-28th, 2006

COURSE OBJECTIVES

The main objective of this course was to teach the participants how to effectively use bioinformatics tools to exploit genome sequences of pathogens and to accelerate research in Tropical Diseases. In addition to the available Internet tools, the workshops presented the Linux operating system, Linux command line tools, some theoretical computer science background, and notions of Perl programming, in order to prepare the participants to also manage high-throughput analyses. Another important goal of the course was to promote the interaction of researchers within Latin America for information exchange and long-term development projects.

TRAINEES

The course had 20 participants from Latin America, of whom 18 had expertise in Tropical Diseases and 2 had a strong background in computer science. In addition, we had 2 extra trainees at no expense to the budget assigned by TDR. A complete list of selected participants is attached at the end of this document.

SELECTION CRITERIA

Participants were selected with a policy of creating a group with the highest possible research impact. To this end, utmost importance was given to a research curriculum related to tropical diseases, seniority in the research institution, experience with bioinformatics and letters of recommendation. An additional concern was to distribute the participants geographically, provided the previous criteria of academic and research excellence were not compromised. Eight participants were selected from Brazil and twelve from other countries of Latin America (Argentina, Bolivia, Colombia, Cuba, Guatemala, Mexico, Peru and Venezuela) in accordance with what was originally stated in the course proposal.

The selection committee was composed of Dr. Dinesh Gupta (ICGEB), Dr. Worachart Sirawaraporn (Mahidol University), Mr. Chuong Huynh (NCBI), Dr. Alan Durham and Dr. Arthur Gruber (Course Coordinators). Unfortunately, the evaluation of Dr. Dinesh Gupta did not arrive on time and was not used. The forms used by the committee were based on the forms we used to evaluate candidates in previous years. A copy of the evaluation sheet is available on the CD-ROM annexed to this report. Each candidate was awarded one point for each evaluator whose top 24 scores included him/her. An extra point was then awarded for each evaluator's “must be” list on which the candidate appeared (each evaluator indicated the candidates who, irrespective of their scores, would be a good addition to the course on an overall evaluation).
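The point scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tally_points`, the data layout (one ranked list and one “must be” set per evaluator) and the sample names are hypothetical, and the committee's actual forms and tie-breaking rules are not reproduced here.

```python
from collections import Counter

def tally_points(rankings, must_be, top_n=24):
    """Count selection points per candidate.

    rankings: dict mapping evaluator -> list of candidates, best score first
              (assumed layout, not the committee's actual form).
    must_be:  dict mapping evaluator -> set of candidates on that
              evaluator's "must be" list.
    """
    points = Counter()
    # One point per evaluator whose top_n list contains the candidate.
    for ranked in rankings.values():
        for candidate in ranked[:top_n]:
            points[candidate] += 1
    # One extra point per "must be" list on which the candidate appears.
    for picks in must_be.values():
        for candidate in set(picks):
            points[candidate] += 1
    return points

# Toy example with two evaluators and a top-2 cut-off instead of top 24.
example = tally_points(
    rankings={"ev1": ["ana", "bruno", "carla"], "ev2": ["bruno", "carla", "ana"]},
    must_be={"ev1": {"carla"}, "ev2": set()},
    top_n=2,
)
print(example.most_common())
```

Candidates can then be ordered by their point totals; how ties were resolved is not stated in this report.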

The acceptance of an unusually high number of candidates from Argentina and Colombia was supported by their curricula and by the fact that they were chosen from distinct research institutions, improving geographical coverage. The low acceptance of candidates from Peru can be attributed to the fact that those candidates had much less research experience than the ones from other countries.

The selection of the Brazilian participants followed the same criteria of curriculum and geographical distribution. There were 7 candidates selected from 4 different states of Brazil.

A detailed table summarizing the geographical distribution of all applicants and selected participants is attached to this report.

COURSE PROGRAM

The course comprised 106 hours distributed as follows: (A) Computer science fundamentals (23%); (B) Conferences (4%); (C) Data mining lectures and practical workshops (51%); (D) Free lab time for projects and study (16%); and (E) Project presentations, student introduction, closing ceremony (6%). The complete schedule of the course as well as other information is available at:

The conferences were open and freely attended by non-participants, whereas workshops were restricted to participants of the course.

This year the total number of hours of the course was increased by one day (9 hours). The goal was to give the students more time to work on their final project, and also to give them the opportunity to visit local research laboratories in case they were interested in future cooperation. At this time, it is too early to evaluate the impact of the visits, but the quality of the projects was noticeably higher than last year.

Module 1: Computer Science Fundamentals

The Computer Science Fundamentals module aimed to provide essential concepts of Computer Science. The final goal was twofold:

  1. To provide the participants with basic skills to improve their productivity in small day-to-day tasks such as converting files, assembling scripts for implementing small genome analysis pipelines using available software, understanding a database schema and building simple queries.
  2. To give the participants a basic understanding of Computer Science research.

In this spirit, it was important to present lectures with structured information and to provide workshops with hands-on experience. As a consequence, lectures in this module were taught in a special classroom that provided one computer for each student.

The module was divided into 3 sub-modules, each taught by a member of the faculty of the Computer Science Department:

  1. Unix fundamentals and tools (Dr. Marco Dimas Gubitoso, Dr. Alan Durham): the Linux/Unix operating systems, the Unix shell, the Emacs configurable editor, security in Unix (user accounts, file permissions), basic Unix utilities (find, grep, diff), redirecting input and output, using pipes, using the command line efficiently. Duration: 12 hours.
  2. Basic Perl programming (Dr. Alan Durham): introduction to the programming paradigm, notions of algorithm complexity, basic Perl commands (reading, printing, arithmetic expressions), conditional commands, loop commands, regular expressions (with emphasis on their use also to describe protein motifs), searching for a pattern, substituting patterns in text, using files. Duration: 12 hours.
  3. Introduction to databases, their design and query (Dr. João Eduardo Ferreira): what a database is and why it is necessary, the basics of database structure and design, understanding a database design description, querying a database with the SQL language. Duration: 4 hours.
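To illustrate the use of regular expressions to describe protein motifs mentioned in the Perl sub-module, the following sketch converts a PROSITE-style pattern into a regular expression and searches a sequence with it. It is written in Python rather than Perl, it is not taken from the course material, and the helper name `prosite_to_regex` and the toy sequence are hypothetical. Only the common subset of the PROSITE syntax is handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a regular expression.

    Handles only the common syntax subset: '-' separators, 'x' for any
    residue, [..] allowed residues, {..} forbidden residues, (n) or (n,m)
    repetitions, and '<' / '>' for the sequence ends.
    """
    regex = pattern.rstrip(".")
    regex = regex.replace("-", "")                       # element separators
    regex = regex.replace("x", ".")                      # any residue
    regex = regex.replace("{", "[^").replace("}", "]")   # forbidden residues
    regex = regex.replace("(", "{").replace(")", "}")    # repetition counts
    regex = regex.replace("<", "^").replace(">", "$")    # N- and C-terminus
    return regex

# N-glycosylation site pattern: Asn, anything but Pro, Ser or Thr, anything but Pro.
motif = prosite_to_regex("N-{P}-[ST]-{P}")
sequence = "MKNGSAANPTQ"  # toy sequence, not real data
hits = [(m.start(), m.group()) for m in re.finditer(motif, sequence)]
print(motif, hits)
```

In this toy sequence only the first asparagine starts a match, because the second one is followed by a proline, which the pattern forbids.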

In this module classes were mostly interactive, and the students worked individually on their assignments, each on his/her own computer, assisted by the instructors. The slow and introductory approach proved to be the right one. From the evaluations handed out in class it was clear that the students found the module important and helpful.

This year we increased the course load of Linux fundamentals, as it was our perception that in previous years lack of practice with the system was impairing the students' performance in the other workshops. In our view this was a very successful modification, as the students seemed much more at ease using the computer systems throughout the course. However, it was also noted that the Perl module would ideally be longer. Considering the time limitations of the course, this was not possible.

Module 2: Data Mining and Genome Annotation

The goal of this module was to provide an extensive overview of the tools freely available for supporting genome sequence analyses. The module followed the same scheme as last year, and included one new workshop: EST Clustering. Last year's “3D modeling and epitope mapping” workshop was replaced by the workshops entitled “Protein Architecture” and “Comparative Modelling for Proteins”. The workshops were organized to sequentially cover the path from sequencing to genome annotation, in order to facilitate the understanding of each step.

Workshop contents

Phred/Phrap/Consed (Arthur Gruber)

The first part of this sub-module comprised a lecture on the package Phred/Phrap/Consed, used for sequence assembly and finishing. In the second part, the students followed a hands-on tutorial on how to run the package using a provided Perl script and how to use Consed to inspect and edit the results.

Local and Global Alignment – A Theoretical View (Alan Durham)

In this lecture students were introduced to the concept of sequence alignment. The basic local, semi-global and global alignment algorithms were explained, as well as the scope of application of each algorithm. The need for heuristics to search large databases was explained and the basics of the BLAST algorithm were presented. Finally, the importance of scoring matrices was stressed.
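The difference between global and local alignment covered in this lecture can be sketched with the standard dynamic programming recurrences (Needleman-Wunsch for global, Smith-Waterman for local). The Python sketch below is not course material; the function name `alignment_score` and the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration, where a real analysis would use a scoring matrix.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Return the optimal global (default) or local alignment score.

    Scoring values are illustrative assumptions, not a substitution matrix.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment: never go below zero, track the best cell.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]

# Two toy sequences sharing only the short stretch "CGT".
print(alignment_score("AAACGTAAA", "CCCCGTCCC"))              # global score
print(alignment_score("AAACGTAAA", "CCCCGTCCC", local=True))  # local score
```

With these toy sequences the global score is negative because the flanking mismatches are all counted, while the local score reflects only the shared stretch. Both algorithms take time proportional to the product of the sequence lengths, which is the reason heuristics are needed for searching large databases.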

Blast from a biological standpoint (Arthur Gruber)

Students were introduced to the concepts of identity, similarity and homology of biological sequences, and the relationship among these terms. The need for rapid algorithms to perform similarity searches in large sequence databases and the motivation for these searches were discussed. Exact and heuristic algorithms for sequence similarity searching in large databases were presented, including some detail on specific software packages (BLAST, PSI-BLAST) and the parameters used in the searches. Students were introduced to scoring matrices and their importance in similarity searching. Issues concerning statistical significance, scores, taxonomy and common caveats were discussed.

In the practical part of this module students followed a hands-on tutorial and were taught how to use NCBI’s BLAST server through the Internet. The students were also trained to download data from public resources, construct local searchable databases and run BLAST on their own computers using local command line BLAST. Students then evaluated the results and computing time of searches with each of the 5 BLAST programs (blastn, blastx, blastp, tblastx, tblastn). Following the exercises, a series of questions regarding theoretical aspects of BLAST and its parameters was discussed.
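The five BLAST programs compared in the practical differ in the type of query and database they accept and in which of the two is translated before comparison. The Python sketch below summarizes that mapping; the dictionary and the helper name `candidate_programs` are illustrative and are not part of BLAST itself.

```python
# program -> (query type, database type, what is translated in six frames)
BLAST_PROGRAMS = {
    "blastn":  ("nucleotide", "nucleotide", "nothing"),
    "blastp":  ("protein",    "protein",    "nothing"),
    "blastx":  ("nucleotide", "protein",    "query"),
    "tblastn": ("protein",    "nucleotide", "database"),
    "tblastx": ("nucleotide", "nucleotide", "query and database"),
}

def candidate_programs(query_type, database_type):
    """List the BLAST programs that accept this query/database combination."""
    return sorted(
        name
        for name, (query, database, _) in BLAST_PROGRAMS.items()
        if query == query_type and database == database_type
    )

print(candidate_programs("nucleotide", "protein"))
print(candidate_programs("nucleotide", "nucleotide"))
```

The translated searches compare sequences at the protein level in all six reading frames, which helps explain the differences in computing time observed among the programs.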

NCBI Resources (Chuong Huynh)

Chuong Huynh's lecture described the U.S. National Center for Biotechnology Information (NCBI) and the educational resources available there. The lecture also covered how to use the NCBI Entrez databases, all freely available to the public, beginning with the various NCBI sequence databases for primary and derivative data, followed by the NCBI genomic resources and the literature databases. The lecture concluded with a walk-through of the latest genome analysis tool, NCBI Genome Workbench. The practical consisted of a self-study exercise on the material covered in the lecture. This included Entrez exercises on the global query using controlled vocabulary and limits in PubMed, Nucleotide (example: looking for the zebrafish prolactin), Taxonomy and Protein (MLH1 example); developing gene models from a genomic-mRNA alignment using Splign or Spidey; finding related sequences and structures using precomputed links, e.g. through BLINK (non-redundant protein neighbours); using the NCBI Genome Resources including UniGene, Entrez Gene and MapViewer (using genome maps); using genomic BLAST (e.g. the difference between using Megablast and regular blastn); identifying sequences by optimizing various parameters and matrices (e.g. looking for short sequences such as primers or motifs versus long sequences such as genomes); PSI-BLAST analysis (looking for distantly related proteins); and translated BLAST searches and mining polymorphisms.

Protein Databases and Conserved Domains (Chuong Huynh)

Chuong Huynh began the session by reviewing various concepts that the participants had found confusing. This was followed by a lecture on the various protein databases and how to use conserved domains and protein families in determining protein function and categorization. During the practical session, the participants identified and characterized sequence data from multidrug-resistant tuberculosis, and determined whether, and how, the sequenced bacteria were drug resistant. This provided a real-world example that many molecular diagnostics labs would find useful in their day-to-day practice.

EST Clustering (Arthur Gruber)

Students attended a lecture on EST clustering covering cDNA library construction, representational biases, possible contaminants and specific pipelines for processing ESTs. The differences between clustering and assembly were also discussed, along with several available approaches for EST clustering, including UniGene, StackPack and TIGR Gene Indices. The pros and cons of each method were presented and discussed.

In the practical part of this module, students followed a hands-on tutorial using TGICL (the TIGR Gene Indices Clustering Tools), one of the freely available and easy-to-use packages for EST clustering. An example dataset was provided and the students were able to run the program and visualize the results using TIGR’s CLVIEW program.

Multiple Alignment of Macromolecular Sequences (Sérgio Matiolli)

The problem of multiple alignment of macromolecular sequences was presented as a non-trivial extension of the pairwise alignment problem. Some strategies for constructing multiple alignments were presented, with an emphasis on the widely used hierarchical approach. In the practical part of the workshop, students learned to use the software CLUSTALX.

Phylogenetic Analysis with Macromolecular Data (Sérgio Matiolli)

The problem of phylogenetic reconstruction based on macromolecular sequences was presented in the light of the general properties of molecular evolution. Three main approaches to phylogenetic reconstruction were addressed. The geometric basis of distance-based methods was discussed. The maximum parsimony and maximum likelihood principles were explained together with their philosophical and statistical foundations. Resampling methods for assessing branch support were also presented. In the practical part the students learned to use the software Phylip for phylogenetic reconstruction and Treeview for viewing the trees generated by Phylip.
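Distance-based methods start from a matrix of pairwise distances between aligned sequences. As a minimal sketch (not the workshop's material, and not how Phylip is invoked), the Python code below computes the proportion of differing sites between two aligned sequences and applies the Jukes-Cantor correction d = -(3/4) ln(1 - 4p/3). The function names and the toy sequences are hypothetical, and skipping gapped columns is an assumption made for simplicity.

```python
import math

def p_distance(seq1, seq2):
    """Proportion of differing sites between two aligned sequences.

    Columns containing a gap ('-') in either sequence are skipped
    (an assumption made for this sketch).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(seq1.upper(), seq2.upper())
             if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    differences = sum(1 for x, y in pairs if x != y)
    return differences / len(pairs)

def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1 - 4 * p / 3)

p = p_distance("ACGTACGTAC", "ACGTACGTAA")  # toy aligned sequences
print(p, jukes_cantor(p))
```

The corrected distance is always slightly larger than the observed proportion, since it accounts for multiple substitutions at the same site.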

Comparative Protein Modelling (Paula Kuser Falcão)

With comparative or homology protein structure modeling it is possible to build a three-dimensional model for a protein of unknown structure based on one or more related proteins of known structure. The necessary conditions for obtaining a useful model (similarity between the target sequence and the template structures, and availability of a correct alignment between them) and all the steps in protein modeling were explained and demonstrated in the Protein Modeling workshop. For the practical part, the MODELLER program for the comparative approach to protein structure prediction was used (Comparative protein structure modeling of genes and genomes. Annu. Rev. Biophys. Biomol. Struct. 29, 291-325, 2000). The students were able to search for a template using the BLAST program against the PDB, align the target and template proteins, model the target sequence and visualize the modeled protein.

Visualization and Analysis of a Protein Structure

Diamond STING (The Diamond STING server. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W29-35) is a new version of the STING suite of programs for comprehensive analysis of the relationships between protein sequence, structure, function and stability. The STING program provides visualization and analysis of molecular sequence and structure for PDB files. SMS operates with a collection of both publicly available data (PDB, HSSP, Prosite) and its own data (contacts, interfaces, surface accessibility). The workshop was intended to show how to visualize and analyze a protein structure with the tools available in STING, for example, checking the contacts, looking at the Ramachandran plot, examining the conservation of the amino acids in the sequence, comparing with other PDB files, etc.

Introduction to Artemis (Arthur Gruber)

This workshop was intended to familiarize the students with the basic functions of Artemis. A short mitochondrial sequence of an apicomplexan parasite was used as a dataset for illustrating the most important Artemis commands. The various graphical plots within Artemis were demonstrated, including GC skew, GC frame plot, GC content and Karlin signature, among others.
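Two of the plots mentioned above, GC content and GC skew, are simple sliding-window statistics. The Python sketch below shows one way to compute them; it is not Artemis code, and the function name `gc_window_stats`, the window size and the toy sequence are assumptions chosen for illustration. GC skew is taken here as (G - C) / (G + C) per window.

```python
def gc_window_stats(sequence, window=100, step=100):
    """Return (start, GC content, GC skew) for each full window.

    GC content = (G + C) / window length
    GC skew    = (G - C) / (G + C), or 0.0 if the window has no G or C
    """
    seq = sequence.upper()
    stats = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        gc_content = (g + c) / window
        gc_skew = (g - c) / (g + c) if (g + c) else 0.0
        stats.append((start, gc_content, gc_skew))
    return stats

# Toy sequence: G-rich first half, C-rich second half.
for row in gc_window_stats("GGGCAAAACCCGTTTT", window=8, step=8):
    print(row)
```

Both toy windows have the same GC content, yet their skews have opposite signs, which is the kind of strand asymmetry a GC skew plot makes visible.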

The practical session consisted of a full annotation using several types of evidence, including ORF finding, similarity searches with Blastn and Blastx, compositional biases, codon usage biases, etc. Differences in DNA base content were explained using percentage GC content plots. Codon usage was explained and students were shown how to make their own codon usage tables from selected genes and use them to improve subsequent gene predictions within Artemis. The different aspects of annotation were presented, discussed and applied to this small genome, including the description of different features, gene names, ontology terms, etc.
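A codon usage table of the kind described above is, at its core, a count of codons over a set of selected coding sequences. The following Python sketch illustrates the idea only: it does not produce the file format Artemis expects, and the function name `codon_usage_per_thousand` and the toy gene are hypothetical.

```python
from collections import Counter

def codon_usage_per_thousand(genes):
    """Return codon -> frequency per 1000 codons over in-frame coding sequences.

    Each sequence is assumed to start in frame; trailing bases that do not
    complete a codon and codons with ambiguous bases are ignored.
    """
    counts = Counter()
    for gene in genes:
        seq = gene.upper()
        usable = len(seq) - len(seq) % 3
        for i in range(0, usable, 3):
            codon = seq[i:i + 3]
            if set(codon) <= set("ACGT"):
                counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: 1000 * n / total for codon, n in counts.items()}

# Toy gene: ATG GCT GCT TAA
print(codon_usage_per_thousand(["ATGGCTGCTTAA"]))
```

A table built from trusted genes of the organism can then be compared against candidate ORFs, since genuine genes tend to follow the organism's codon preferences.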

Comparative Genomics using ACT (Arthur Gruber)

Genomic synteny was explained and demonstrated using small mitochondrial genomes of three different apicomplexan parasites, illustrating differences of gene order and orientation, levels of similarity across the genomes, inversions, etc.

The students were shown how to set up an ACT session, by running Blastn and TBlastx to compare two sequences and formatting the output using MSPcrunch. The students learnt how synteny can be used to improve gene predictions and study chromosome biology.

Image Analysis from Microarray Data (Roberto Cesar Marcondes)