Computer Science 286Fall 2005

Bioinformatics Projects

The project will involve research in bioinformatics. You are encouraged to work on a project related to your interests. A list of suggested topics is attached to help you get started. You can pick a project from the list, modify a project to suit your interests, or invent your own. The key idea is to be creative either in developing a new algorithm or in implementing an existing one. Results, whether good or bad, should be compared with those obtained from existing bioinformatics tools or packages.

The project implementation may be developed for any platform. You can use C, C++, C#, Java, Matlab, or Perl. The World Wide Web contains many bioinformatics programs as well as the source code. You may use and modify such code provided appropriate acknowledgements and citations are made.

A two-page project proposal is due by the beginning of the lecture on Thursday, October 13, 2005. The proposal should give a clear description of the project and should contain absolutely no generalities and definitions. Clearly state what you are planning to do and explain how you plan to achieve it. Do not forget to specify the programming language you intend to use. It is important to get started as early as possible.

The projects are due at on Tuesday, November 29, 2005. Make sure to hand in both a technical project description with appendices and references and a disk or CD containing the source code. The approximate length of the programming project report should be 10 pages and 30 pages for non-programming projects. You do not need to include a hard copy of the source code.

A good project report will include the following:

  • Background of the problem from the literature search.
  • A clear definition of the problem.
  • An explanation and justification of methods of data analysis.
  • A description and justification of the data sources.
  • Analysis of the results and comparison with existing tools.
  • Conclusions based on the results.
  • Possible directions for future research.
  • Instructions on how to compile and execute your program, if applicable.
  • A full list of references.

List of Suggested Projects

A - Programming Projects

  1. DP Pairwise Comparison Algorithm

In this project, you should implement a dynamic programming algorithm that does pairwise comparison. Your program should allow the user to either use:

  • One penalty for gaps, or
  • Two gap penalties: one for starting a gap and one for extending a gap.

Your program should also have the following three options:

  • Local comparison
  • Global comparison
  • Semiglobal comparison

The program should allow the user to enter gap penalties.

Compare your program to an existing package.

[SM97] Setubal, J. and Meidanis J. Introduction to Computational Molecular Biology. PWS Publishing Company, 1997.

  1. K-Band DP for Pairwise Comparison

If two sequences are similar, the best alignments have their paths near the main diagonal. It is not necessary to fill the entire matrix to compute the optimal score and alignment. A narrow band around the main diagonal should suffice. In this project, you are to implement a dynamic programming algorithm that does pairwise comparison and uses the K-Band procedure [SM97]. Your program should allow the user to either use:

  • One penalty for gaps, or
  • Two gap penalties: one for starting a gap and one for extending a gap.

Your program should also have three options:

  • Local comparison
  • Global comparison
  • Semiglobal comparison

The program should allow the user to enter gap penalties.

Compare your program to an existing package.

[SM97] Setubal, J. and Medidanis J. Introduction to Computational Molecular Biology. PWS Publishing Company, 1997.

  1. Optimal Linear Space DP for Pairwise Comparison

In this project, you are to implement a dynamic programming algorithm that does pairwise comparison in linear space. The basic algorithm is of quadratic complexity. With respect to space, it is possible to improve the complexity from quadratic to linear. The algorithm is described in Section 3.3.1 “Space Saving” of Section 3.3: “Extensions of the Basic Algorithms” [SM97]. Your program should have three options:

  • Local comparison
  • Global comparison
  • Semiglobal comparison

Compare your program to an existing package.

[SM97] Setubal, J. and Meidanis J. Introduction to Computational Molecular Biology. PWS Publishing Company, 1997.

  1. The Original BLAST

Basic Local Alignment Search Tool (BLAST) was first published in 1990 [AGM+90]. This project consists in reading, understanding and implementing the algorithm presented in the article and comparing it to an existing package.

[AGM+90] Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman, D. Basic Local Alignment Search Tool. Journal of Molecular Biology, 215: 403-410; 1990.

  1. PSI-BLAST

Position Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) was first published in 1997 [AMS+97]. This project consists in reading, understanding and implementing PSI-BLAST as described in the article and comparing it to an existing package.

[AMS+97] Altschul, S., Madden, T., Schaffer, W., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25, 17: 3389-3402; 1997.

  1. FASTA and FASTAP (three)

W. Pearson and D. Lipman developed FASTA, which provides a rapid way of finding short stretches of similar sequences between a new sequence and any sequence in a database [PL88]. Pearson continued to improve the FASTA method for similarity searches in sequence databases [Pea90], [Pea96]. This project consists in choosing one of the three FAST algorithms from one of the three referenced articles, reading, understanding and implementing it as described in the article and comparing it to an existing package.

[PL88] Pearson, W. and Lipman, D. Improved tools for biological sequence comparisons. Proc. Natl. Acad. Sci. 85: 2444-2448; 1988.

[Pea90] Pearson, W. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 183: 63-98; 1990.

[Pea96] Pearson, W. Effective protein sequence comparison. Methods Enzymol. 266: 227-258; 1996.

  1. CLUSTAL W

CLUSTAL W is a commonly used package for multiple sequence alignment. It was published in 1994 [THG94]. This project consists in reading, understanding and implementing the algorithm presented in the article and comparing it to an existing package.

[THG94] Thompson, J., Higgins, D., and Gibson, D. CLUSTL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 4673-4680; 1994.

  1. Multiple Sequence Alignment and Genetic Algorithms

Dynamic programming approach to solve the MSA problem results in exponential time complexity. Genetic Algorithms are of considerable interest to researchers because they can find high scoring alignment as good as those found by other methods.

This project consists in implementing your own genetic algorithm for the MSA problem and comparing it to SAGA [NH96] or to [ZW97].

[NH96] Notredame, C and Higgins, D.G. Sequence Alignment by Genetic Algorithm (SAGA). Nucleic Acid Research, 1996, vol 24, 8, 1515-1524; 1996.

[ZW97] Zhang, C. and Wong, A. A Genetic Algorithm for Multiple Sequence Alignment. Comput. Appl. Bioscience, 13, 565-581; 1997.

[CW96] Corcoran, A.L., Wainwright, R.L. LIBGA: A User-friendly workbench for order-based genetic algorithm research; 1996.

  1. Multiple Sequence Alignment and Genetic Doping Algorithms

In Genetic Doping Algorithm, nothing is fixed. Everything, including the stochastic operators, depends on the context, which varies over time [Bus04]. Unlike traditional Genetic Algorithms, the probabilities of crossover and mutation vary from generation to generation. These numbers are interconnected, both depending on average fitness values of the population for each generation. It is very possible that genetic doping algorithms are more suitable for the multiple sequence alignment problem than traditional genetic algorithms.

This project consists in implementing your own genetic doping algorithm for the MSA problem and comparing it to SAGA [NH96] or to [ZW97].

[Bus04] Buscena, M. Genetic Doping Algorithm (GenD): theory and applications. Expert Systems, vol. 21, 2, May 2004.

[NH96] Notredame, C and Higgins, D.G. Sequence Alignment by Genetic Algorithm (SAGA). Nucleic Acid Research, 1996, vol 24, 8, 1515-1524; 1996.

[ZW97] Zhang, C. and Wong, A. A Genetic Algorithm for Multiple Sequence Alignment. Comput. Appl. Bioscience, 13, 565-581; 1997.

  1. Fragment Assembly (several)

With current technology it is impossible to directly sequence contiguous DNA stretches of more than a few hundred bases. Typically, several copies of random pieces of long DNA are cut. The task of sequencing DNA is called fragment assembly: which consists in reconstructing the original sequence from the fragments.

This project consists in choosing either the greedy algorithm described in Section 4.3.4 or one of the heuristics described in Section 4.4. [SM97]. Alternatively, you may choose a more recent approach described in [KS99]. Read, understand and implement the algorithm you chose and compare its performance to an existing package.

[KS99] Kim, S., and Segre, A., AMASS: A Structured Pattern Matching Approach to Shotgun Sequence Assembly. Journal of Computational Biology , 6(2), 1999, pp 163-186; 1999.

[SM97] Setubal, J. and Meidanis J. Introduction to Computational Molecular Biology. PWS Publishing Company, 1997.

10. Physical Mapping of DNA (several)

Physical mapping is the process of determining the location of certain markers (landmarks) of a DNA molecule. The markers are generally small but precisely defined sequences. The resulting maps are used as basis for DNA sequencing, and for the isolation and characterization of individual genes or other DNA regions of interest. For example, see Figure 5.1 [MS97, page 144].

This project consists in choosing one of the algorithms described in Sections 5.2, 5.3, 5.4, and 5.5, reading it, understanding it, implementing it and comparing its performance to an existing package.

[SM97] Setubal, J. and Medidanis J. Introduction to Computational Molecular Biology. PWS Publishing Company, 1997.

11. Phylogenetic Trees (several)

The reconstructing of phylogenetic trees is a general problem in biology. It is used in molecular biology to help understand the evolutionary relationships among proteins, for example.

This project consists in choosing one of the four algorithms mentioned below, choosing the appropriate referenced article(s), reading, understanding and implementing it as described in the article and comparing it to an existing package.

  • Phylogenetic Trees Based on Pairwise Distances [FD96]
  • Phylogenetic Trees Based on Neighbor Joining [SN87]
  • Phylogenetic Trees Based on Maximum Parsimony [Fel96]
  • Phylogenetic Trees Based on Maximum Likelihood Estimation [BT86], [Fel81].

[BT86] Bishop, M. and Thompson, E. Maximum likelihood alignment of DNA sequences. Journal of Molecular Biology. 190:159-165; 1986.

[FD96] Feng, D. and Doolittle, R. Progressive alignment of amino acid sequences and construction of phylogenetic trees from them. Methods Enzymol. 266; 1996.

[Fel81] Felsenstein, J. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution. 17:368-376; 1981.

[Fel96] Felsenstein, J. Inferring phylogeny from protein sequences by parsimony, distance and likelihood methods. Methods Enzymol. 266; 1996.

[SN87] Saitou, N. and Nei, M. The neighbor joining method: a new method for reconstructing phylogenetic trees. Molecular Biology Evolution; 4:406-425; 1987.

12. Gene Prediction (several)

Gene prediction consists in identifying regions of genomic DNA that encode proteins.

Some of the existing models that identify and distinguish coding regions from non-coding regions are based on:

  • Hidden Markov Model,
  • Neural Network,
  • Probabilistic model,
  • Linear discrimination analysis,
  • Decision tree classification,
  • Quadratic discriminant analysis,
  • Stochastic context free grammars.

This project consists in choosing one of the above techniques and implementing the prediction (search) algorithm, which will be able to search a given database for genes that do code for proteins.

Your algorithm should be compared to an existing package.

[BK97] Burge C and Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol 268: 78-94; 1997.

[Kro97] Krogh A: Two methods for improving performance of an HMM

and their application for gene-finding. Proc Int Conf Intell Syst Mol Biol 5: 179-186; 1997.

[Pre95] Prestridge, D. S. Predicting Pol II promoter sequences using transcription factor binding sites. J. Mol. Biol. 249: 923-932; 1995.

[Rab89] Rabiner, L. R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77, 257–285; 1989.

[RJ86] Rabiner, L. R. and Juang, B. H. “An Introduction to Hidden Markov Models,” IEEE ASSP Magazine, vol. 3, February 1986.

[Zha97] Zhang, M.Q. Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proc. Natl. Acad. Sci. 94: 565–568; 1997.

13. The Protein Prediction Problem (several)

The main goal in the protein prediction problem is to determine the three-dimensional structure of a protein based on its amino acid sequence. Recall that there are three levels to look at the protein-structure:

  • The primary structure is the sequence of amino acids in the chain i.e., a one-dimensional structure.
  • The secondary structure is the result of the folding of parts of the amino acid chain. The two most important secondary structures are the -helix, and the -strand.
  • The tertiary structure is the real 3-dimensional configuration of the protein under given environmental conditions (solvent, pH and temperature).

The tertiary structure decides the biochemical function of the protein. If the tertiary structure is changed, the protein normally looses its ability to perform whatever function it has, since this function depends on the geometrical shape of the active site in the interior of the molecule

This project consists in choosing an existing algorithm for protein prediction, implementing it, and comparing it to a package currently in use.

[CB00] Clote, P., and Backofen, R., Computational Molecular Biology: An Introduction, John Wiley and Sons, LTD; 2000.

14. The RNA Structure Prediction Problem (several)

Unlike DNA, which most frequently assumes its well-known double-helical conformation, the three-dimensional structure of single stranded RNA is determined by the sequence of nucleotides in much the same way the protein structure is determined by sequence. RNA structure, however, is less complex than protein structure and can be well characterized by identifying the location of commonly occurring secondary structure elements. [KR03]

This project consists in choosing an existing algorithm for RNA structure prediction, implementing it, and comparing it to a package currently in use.

[KR03] Krane, D. and Raymer, M. Fundamental Concepts of Bioinformatics; 2003.

15. The Clustering Gene Expression Problem (several)

Analysis of gene expression patterns can provide insight into relationships between a gene and its function. Clustering techniques applied to gene expression data partitions genes into clusters (groups) based on their expression patterns. Genes in the same cluster will have similar expression patterns, while genes in different clusters will have distinct expression patterns.

This project consists in choosing an existing clustering algorithm (such as CLICK [SS00]), implementing it, and comparing it to a package currently in use. [BSY99].

[BSY99] Ben-Dor, A. Shamir, R. and Yakhini, Z. Gene clustering expression patterns. Journal of Computational Biology, 6:281-297; 1999.

[SS00] Sharan, R., and Shamir, R. CLICK: A clustering algorithm with applications to gene expression analysis. Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, 307-316; 2000.

16. Comparative Genomics (several)

Comparative genomics is the analysis and comparison of genomes from different species. The purpose is to gain a better understanding of how species have evolved and to determine the function of genes and non-coding regions of the genome.

[MBS+00] C. Mayor, M. Brudno, J. R. Schwartz, A. Poliakov, E. M. Rubin, K. A. Frazer, L. Pachter, I. Dubchak. VISTA: Visualizing global DNA sequence alignments of arbitrary length. Bioinformatics, 16: 1046-1047; 2000.

[LOP+02] G. G. Loots, I. Ovcharenko, L. Pachter, I. Dubchak and E. M. Rubin. Comparative sequence-based approach to high-throughput discovery of functional regulatory elements. Genome Res., 12:832-839; 2002.

17. Thermostability and Preferential Amino Acid and Codon Usage

Most organisms grow at temperatures from 20 to 50ºC, but some prokaryotes, including Archaea and Bacteria, are capable of withstanding higher temperatures, from 60º to over 100ºC. Farias and Bonato [FB02] investigated the preferential usage of certain amino acids (AA) and codons in thermally adapted organisms, by comparative proteome analysis.

This project consists in writing a program that calculates the G+C% of the genome sequences, computes the average proportion of each AA in each genome, computes the E+K/Q+H ratio for each genome, and computes the codon usage of each AA in each genome.

Perform the analysis of the whole genome sequences mentioned in the article. Your program should be able to group all non-thermophylic (mesophylic) genomes into one category and hyperthermophylic and thermophilic genomes into a second category and to compute the required statistics.

[FB02] “Preferred codons and amino acid couples in hyperthermophiles” by S.T. Farias and C.M. Bonato. Genome Biology, 2002.

18. Genome signatures in prokaryotic and eukaryotic organisms

Each genome has a characteristic "signature" defined as the ratios between the observed dinucleotide frequencies and the frequencies expected if neighbors were chosen at random (dinucleotide relative abundances). The remarkable fact is that the signature is relatively constant throughout the genome; i.e., the patterns and levels of dinucleotide relative abundances of every 50-kb segment of the genome are about the same. Campbell, Mrazek and Karlin analyzed the signatures of different genomes in [CMK99]. More precisely, they compute the G+C% of the genome sequences, the dinucleotide relative abundances of complete genomes, the genomic signature profiles of organisms and also the genomic signature difference between pairs of organisms.

This project consists in implementing an algorithm that computes the G+C% of the genome sequences, the dinucleotide relative abundances of complete genomes, the genomic signature profiles of organisms and also the genomic signature difference between pairs of organisms. Apply your program to prokaryotic and eukaryote genome sequences and present your results in a format similar to Figure 1 and Figure 2 (see [CMK99].

[CMK99] “Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA” by A. Campbell, J. Mrazek, and S. Karlin. Proc Natl Acad Sci U S A. 1999 August 3; 96(16): 9184–9189.

19. Asymmetric substitution patterns

The analyses of the genomes of three prokaryotes, Escherichia coli, Bacillus subtilis, and Haemophilus influenzae, by Lobry [Lob96] revealed a new type of genomic compartmentalization of base frequencies. There was a departure from intrastrand equifrequency between A and T or between C and G, showing that the substitution patterns of the two strands of DNA were asymmetric. The positions of the boundaries between these compartments were found to coincide with the origin and terminus of chromosome replication.

Grigoriev [Gri98] developed a method of cumulative diagrams that shows that the nucleotide composition of a microbial chromosome changes at two points separated by about a half of its length. These points also coincide with sites of replication origin and terminus for all bacteria. The leading strand is found to contain more guanine than cytosine residues.