Course Syllabus Brandeis University

Division of Graduate Professional Studies

Rabb School of Continuing Studies

I. Course Information

1. Biological Data Mining and Modeling

2. 143RBIF-112-1DL

3. Distance Learning Course

Week 1 starts on Wednesday, Sep. 17, 2014. The course week runs Wednesday through Tuesday. The last week ends on November 25, 2014.

4. Instructor’s Contact Information

Madhu Natarajan, PhD

Please use email to arrange appointments.

5. Document Overview

This syllabus contains all relevant information about the course: its objectives and outcomes, the grading criteria, the texts and other materials of instruction, and the weekly topics, outcomes, assignments, and due dates.

Consider this your roadmap for the course. Please read through the syllabus carefully and feel free to share any questions that you may have. Please print a copy of this syllabus for reference.

6. Course Description

  • The development of new bioinformatics tools typically involves some form of data modeling, prediction or optimization. This course introduces various modeling and prediction techniques including linear and nonlinear regression, principal component analysis, support vector machines, self-organizing maps, neural networks, set enrichment, Bayesian networks, and model-based analysis.
  • This course is not intended to explore intricacies of analysis methods and/or algorithm development but to explore how to use different approaches to analyze biological data and extract some insight into biology.
  • The didactic part of this course is designed to introduce you to (a) various analysis techniques and methods, (b) tool kits for implementing these methods, and (c) some of the biological/experimental methods that provide the data for analysis using (a) and (b). Students will be introduced to examples of analysis from the scientific literature, and are actively encouraged to identify new examples/sources and bring these to the class for discussion. Students are also expected to contribute to weekly discussions, especially around the pros and cons of methods, identifying when methods fail, and how these translate into real-life expectations of the practicing bioinformatician. It is important to realize that distance learning does not imply learning in isolation: communication is crucial to success in a DL course and provides opportunities for self-exploration, collaboration with peers, and learning from your own “aha” moments when you ask probing questions. I look forward to our discussions during these ten weeks.
  • Prerequisites: Probability & Statistics; Proficiency in R programming, RBIF 111.

7. Materials of Instruction

a.Required Texts

  • Bioinformatics and Computational Biology Solutions Using R and Bioconductor, Eds. R. Gentleman, V. Carey, W. Huber, R. Irizarry, and S. Dudoit, Springer, 1st Edition, 2005. ISBN 0-387-25146-4. Students from previous years have pointed to links where portions of this book are available online. I cannot vouch for these links, but I would encourage students to look online to see if the authors have shared any of this information.

b. Required Software

  • R version 3.1.1 (see )

c. Optional Text(s) / Journals

  • The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Trevor Hastie, Robert Tibshirani, Jerome H. Friedman, Springer-Verlag, New York, 2009. This book is highly recommended. It makes for very dense reading, so I am not using it as a textbook, but it is a very useful reference manual for this course and for all practicing bioinformaticians.
  • An Introduction to Statistical Learning, Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, Springer-Verlag, 2013. This book shares lead authors with the previous one, but the reading material is a lot less dense. There are more examples with R that allow you to look at the implementation of the algorithms.
  • Handbook of Parametric and Nonparametric Statistical Procedures, D. J. Sheskin, Chapman & Hall/CRC 3rd Edition, 2003. ISBN-10 1584884401; ISBN-13 978-1584884408
  • Pattern Classification, R.O. Duda, P.E. Hart, and D.G. Stork, Wiley-Interscience 2nd Edition, 2000. ISBN 0-471-05669-3
  • A Handbook of Statistical Analyses Using R, B.S. Everitt and T. Hothorn, Chapman and Hall/CRC 1st Edition, 2006. ISBN 1-584-88539-4

d. Online Course Content

  • This is a Distance Learning (DL) course, which will be hosted at Brandeis’ LATTE site. The site contains the course syllabus, weekly topic notes, assignments, and discussion forums through which we will communicate during this course.

8. Overall Course Objectives

This course is intended to provide students with an understanding of:

  • What methods are commonly used for data analysis
  • How analysis results are interpreted in the context of drug discovery and development
  • How specific software tools are applied to data modeling

9. Overall Course Outcomes

At the end of the course, students will be able to:

  1. Find the appropriate method for common data analysis problems
  2. Have a sense of where to look if these methods are insufficient
  3. Be familiar with the application of commonly used software tools
  4. Be able to compose a meaningful report of the analysis

10. Course Grading Criteria

Percentages earned per assignment:

Percent / Component
N/A / Course questionnaire
50% / Homework problem sets (5 weeks)
30% / Discussion and online class participation (10 weeks).
20% / Final project

b. Grading Criteria for Discussions/Online Participation (100 raw points total per week, translating to 3% of the total course grade per week)

  • Per GPS guidelines, you must post on three different calendar days of the course week. Failure to do so will result in a 10-point deduction from your total participation points for the week.
  • There will be two discussion topics posted each week.
  • The 100 points for discussion each week are divided into two 35-point responses to instructor posts and one 30-point response to a peer post.
  • Exceptional posts are those that (for example):
    • Provide/include original analysis of the course material,
    • Provide/include analysis of the same methods on novel data sets,
    • Provide/include appended code that runs without errors,
    • Provide/include extrapolation and analysis of where methods successfully worked and where they did not,
    • Provide appropriate citation of references,
    • Are well-written (grammar/spelling).
  • Responses to peer posts will be graded on the same criteria, with the additional requirements that they clearly identify the original author and message being responded to, and provide novel insight beyond a simple “I agree.”
  • In layman’s terms, the response must clearly go beyond being the equivalent of a +1 or a “Like” post.
  • Any discussion disagreements on analysis, interpretation or results MUST be polite and constructive. This is a critical and absolute requirement.
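
As a rough illustration of the weekly participation arithmetic described above, the following is a minimal sketch in Python (used here only for illustration; it is not course software). The function name, the linear scaling from raw points to course percentage, and the floor at zero points are assumptions of this sketch, not official grading rules.

```python
# Hypothetical helper for illustration only; not the official grading tool.
INSTRUCTOR_RESPONSE_MAX = 35   # each of the two responses to instructor posts
PEER_RESPONSE_MAX = 30         # the one response to a peer post
REQUIRED_POSTING_DAYS = 3      # distinct calendar days per course week
POSTING_DAYS_PENALTY = 10      # raw points deducted if the requirement is missed
WEEKLY_COURSE_WEIGHT = 3.0     # percent of the total course grade per week


def weekly_participation(instructor_scores, peer_score, posting_days):
    """Return (raw_points, course_percent) for one week of discussion."""
    if len(instructor_scores) != 2:
        raise ValueError("expected scores for exactly two instructor posts")
    # Cap each component at its maximum so the weekly total cannot exceed 100.
    raw = sum(min(s, INSTRUCTOR_RESPONSE_MAX) for s in instructor_scores)
    raw += min(peer_score, PEER_RESPONSE_MAX)
    # Apply the deduction for posting on fewer than three calendar days.
    if posting_days < REQUIRED_POSTING_DAYS:
        raw -= POSTING_DAYS_PENALTY
    raw = max(raw, 0)  # assumption: weekly points do not go below zero
    # Assumption: raw points scale linearly to the 3% weekly weight.
    course_percent = raw * WEEKLY_COURSE_WEIGHT / 100
    return raw, course_percent


if __name__ == "__main__":
    # Full marks, posted on three days.
    print(weekly_participation([35, 35], 30, posting_days=3))
    # Partial marks, posted on only two days (10-point deduction applies).
    print(weekly_participation([30, 25], 20, posting_days=2))
```

Under these assumptions, a week with full marks on all three responses and posts on three calendar days earns the full 3% for that week.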

c. Make-up Policies

  • Any student who misses deadlines for homework submissions can make up grades by working on additional assignments. These can be in the form of extra credit associated with already assigned homework, or can be materials provided upon special request.

11. Academic Integrity

All students are expected to read and understand the guidelines posted in the Academic Honesty and Student Integrity website posted above. If any part of this is not clear, please contact your instructor immediately.

II. Course Information

Week 1 (Sep 17-23)

Introduction to biological data mining and modeling

  • On the differences between data mining and modeling
  • An introduction to regression
  • An introduction to model building
  • Understanding the predictive power of modeling
  • When models go wrong
  • Overview of the field of biological data mining - applications, challenges, future directions.

Homework 1 assigned

Week 2 (Sep 24-30)

Uncertainty in Biology – Causes, concerns, approaches to deal with uncertainty.

  • Introduction to data visualization
  • Introduction to normalization

Introduction to high throughput biology

  • High throughput technologies
  • What can we reliably measure and what can it tell us about the cell?
  • Target-based compound screening
  • Cell-based screening
  • High content screens
  • Large scale RNAi screens
  • Statistical analysis of screens, Z and Z’ factor, data visualization and integration.

Homework 1 due

Week 3 (Oct 1-7)

Unsupervised methods – Part I

  • Hierarchical clustering
  • Principal component analysis
  • Independent component analysis

Introduction to transcription data

  • RNA profiling technologies, experimental design
  • Data normalization
  • Application of clustering and dimension reduction methods

Homework 2 assigned

Week 4 (Oct 8-14)

Unsupervised methods – Part II

  • Unsupervised: hierarchical clustering, principal component analysis, independent component analysis
  • Set enrichment methods
  • Meta analysis of microarray data to build on identified patterns

Homework 2 due

Week 5 (Oct 15-21)

Supervised methods, model assessment and selection

  • Linear methods for regression and classification:
  • Linear discriminant analysis
  • Logistic regression
  • Naïve Bayes classifier
  • Nearest-neighbor method

Homework 3 assigned

Week 6 (Oct 22-28)

Supervised methods, model assessment and selection - Part II

  • Regression and classification trees
  • Neural networks
  • Support vector machines
  • Model assessment and selection: AIC, BIC, cross-validation

Homework 3 due

Final Project assigned

Week 7 (Oct 29-Nov 04)

Meta methods

  • Boosting trees
  • Model averaging and bagging
  • Random forest

Homework 4 assigned

Week 8 (Nov 05-11)

Integration and meta-analysis of high throughput datasets

  • Biological databases, set enrichment methods, text-mining

Proteomics

  • Review of proteomics technologies: 2D gels, mass spectrometry, protein arrays, 2-hybrid methods, post-translational modification detection.
  • Analysis of protein networks, network properties.

Protein pathways and their interaction

  • Protein pathway compendia
  • Pathway comparison metrics and applications
  • Data integration examples: relevance networks, machine learning.

Homework 4 due

Week 9 (Nov 12-18)

Principles of biological networks.

  • Reconstruction of networks - Graphical models:
  • Boolean networks
  • Co-regulation networks
  • Bayesian networks
  • Dynamic network inference

Homework 5 assigned

Week 10 (Nov 19-25)

Mechanistic modeling of biological systems.

  • Principles of mechanistic modeling: mass balance, chemical reaction systems, flux balance analysis
  • Deterministic and probabilistic modeling approaches to specific common biological problems.

Homework 5 due

Final Project due

III. Course Policies and Procedures

I. Late Policies

  • Discussion responses will be accepted late with a 5 (raw) point deduction per day.
  • Homework assignments will be accepted late with a 5 (raw) point deduction per day after the deadline.
  • Substantive responses to discussion posts will not be accepted after the deadlines.
  • Any student who misses deadlines for homework submissions can make up grades by working on additional assignments. These can be in the form of extra credits associated with already assigned homework, or can be de novo materials provided upon special request. Students cannot make up for missed discussion responses.

II. Work Expectations

  • Expect to spend about 2-4 hours per week reading the course material and anywhere from 4-8 hours doing homework, responding to discussion posts, etc.
  • Plan ahead to make sure your tasks are completed in a timely manner.
  • The final project will be an amalgamation of tasks accomplished throughout the course and will take an additional 4-24 hours of work.
  • I will post the deadlines and expectations for each week.
  • A cumulative list of all expectation deadlines will also be posted in Week 1.

III. Feedback

  • Homework and class participation grades will typically be posted within a week of completion of tasks.

IV. Confidentiality

  • In the course of the class, some of you may want to post examples of real data from your day jobs or other sources. Please remember that you must not share any information that is confidential, proprietary or in any way embargoed from public disclosure.
  • Please refrain from discussing your peers’ work or your interactions with peers outside the classroom.