Discovering Genetic Variations and Improving Lives with Windows High Performance Computing Clusters

Published: May 2003

Perlegen’s Bioinformatics organization uses a Microsoft Windows-based High Performance Computing cluster to analyze individual variations in human genome data. Perlegen provides a cost-effective way for drug companies to research and develop treatments for a variety of diseases, helping to improve the lives of millions of people who suffer from them.

Background

Perlegen Sciences is a privately held company founded in 2000 to conduct genetics research and develop therapeutic and diagnostic products that impact and improve people's lives. Perlegen has identified and validated millions of genetic variations in humans using high-density microarray technology. These variations, which occur in about 0.1% of the sequence that comprises human DNA, are known as single nucleotide polymorphisms, or SNPs (pronounced “snips”). They are responsible for the traits that distinguish one individual from another – including differences in disease susceptibility and variations in drug metabolism that can impact the effectiveness of therapeutic treatments. Many of today’s most debilitating and costly illnesses, including heart disease, diabetes, cancer, and migraines, have a significant genetic component. Perlegen’s technology will provide researchers with new insights into such diseases, and new tools for crafting effective therapies.

Perlegen combines this information about natural genetic variations with high-density microarray whole genome scans to compare millions of genetic variations in thousands of individuals at an unprecedented level of resolution. This makes whole genome scanning of patient populations a cost-effective tool for determining the genetic factors involved in disease and drug response.

Based on this technology platform, Perlegen has developed partnerships with leading pharmaceutical companies to conduct ongoing research into how variations in genes and regulatory sequences are associated with disease, drug response and other traits. Through these collaborations, Perlegen is accelerating the discovery and development of pharmaceutical and diagnostic products by enabling:

  • Discovery of novel potential drug targets and markers which predict drug response
  • Prioritization of drug targets for further development
  • Stratification of clinical trial participants for drug efficacy and side effect susceptibility
  • Expansion of use for drugs already on the market
  • Development of new pharmaceutical products and diagnostic tests

Solution

To conduct such studies, Perlegen first had to locate these variations in a representative human population. In its SNP discovery effort, Perlegen performed full genome scans of the DNA of 50 unique individuals – nearly ten times the genetic content analyzed by Celera and the Human Genome Project in generating the draft of the Human Genome. It did so using high-density oligonucleotide microarray wafers and a Microsoft® Windows® 2000-based compute cluster that processed the information on the wafers. Perlegen completed this activity in less than eighteen months – under budget and ahead of schedule.

With the discovery effort nearing completion and a robust data processing and analysis facility in place, the informatics team turned its attention to the development of an enhanced Laboratory Information Management System (LIMS) that would form the foundation for its genotyping and association studies.

Perlegen’s approach to the implementation of informatics systems that support both of these efforts is based on a development philosophy that includes:

  • Exploitation of commodity hardware and software components and low-cost distributed computing;
  • Use of industry-standard RDBMS technology and database-centric applications;
  • Use of current enterprise software development tools and methodologies; and
  • Development of highly leveraged applications employing both native client and web client architectures.

“This is the architectural blueprint that provides the context for Perlegen’s development efforts,” according to Bruce Moxon, Director of Bioinformatics at Perlegen. “The novel approaches and unprecedented scale of our operations led us to select development tools and implementation platforms that would allow us to best leverage rich software development frameworks and concentrate on the challenging problems before us.”

Database-centric application development is a key theme that differentiates Perlegen from many bioinformatics organizations and provides significant leverage in the day-to-day management and analysis of large datasets. The database-centric approach greatly facilitates real-time tracking, monitoring, and reporting – not just of laboratory activities, but also of the complex analytic tasks and their results.

“With simple [Microsoft] SQL queries and standard reporting tools, such as Microsoft Excel PivotTable® views, Perlegen is able to quickly obtain current and historical (trend) views of a wide range of operational metrics,” stated Pascual Starink, Manager of Laboratory Informatics at Perlegen. “This standardized approach to high throughput informatics enables us to provide a production-oriented set of services for our internal and external research partners.”
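As a rough illustration of the kind of operational query described above, the sketch below counts scan outcomes per day with a single SQL statement, producing the row-per-day, column-per-outcome shape that a PivotTable view would show. It is not Perlegen's schema: the `scan_log` table, its columns, the `daily_scan_pivot` function, and the sample rows are all hypothetical, and SQLite (bundled with Python) stands in for the production database.

```python
import sqlite3


def daily_scan_pivot(conn):
    """Return (day, completed_count, failed_count) rows, oldest day first."""
    return conn.execute(
        """
        SELECT day,
               SUM(CASE WHEN outcome = 'completed' THEN 1 ELSE 0 END),
               SUM(CASE WHEN outcome = 'failed' THEN 1 ELSE 0 END)
        FROM scan_log
        GROUP BY day
        ORDER BY day
        """
    ).fetchall()


# Hypothetical table and made-up sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scan_log (day TEXT, instrument TEXT, outcome TEXT)")
conn.executemany(
    "INSERT INTO scan_log VALUES (?, ?, ?)",
    [
        ("2003-01-06", "scanner-A", "completed"),
        ("2003-01-06", "scanner-B", "completed"),
        ("2003-01-06", "scanner-B", "failed"),
        ("2003-01-07", "scanner-A", "completed"),
    ],
)

for day, completed, failed in daily_scan_pivot(conn):
    print(day, completed, failed)
```

Because each day becomes one row, running the same query over a longer date range yields the historical (trend) view without any change to the query itself.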

Much of Perlegen’s analysis requires processing of very large datasets – typically with a wide range of data subsetting and reporting requirements. Perlegen employs techniques and tools that have been developed in support of commercial VLDB (very large database) and Data Warehousing and Mining systems, including dimensional modeling and parallel ETL approaches, to leverage the state of the art in large-scale data management.

“Increasingly, bioinformatics organizations in genomics and proteomics companies are realizing the value of using commercial tools and approaches to enterprise data management challenges,” says Mr. Moxon. “The leverage that such tools provide is critical in converting new technologies and protocols in early phase biotechnology companies into scalable, replicable production biology.”

Solution Details

Perlegen’s computing infrastructure was developed around the database-centric distributed computing model. This model utilizes scalable commercial relational database technology in conjunction with commodity Network Attached Storage (NAS) technology, Linux and Windows-based distributed computing “farms”, and Gbit-over-copper networking to effectively support the required range of computing activities. This allows Perlegen to incrementally scale its computational and data management infrastructure to meet the needs of its internal R&D teams, and of its growing set of partners and customers. Dell Intel-based compute nodes are configured into a centrally managed distributed computing farm that is used both to process Perlegen-generated wafer data and to provide more traditional sequence analysis and annotation capabilities. The rack-mount dual-processor nodes can be configured and managed remotely, providing a scalable computing infrastructure that affords capacity on demand.

Perlegen’s Laboratory Information Management System (LIMS) is used to acquire, track, manage, and monitor information associated with laboratory operations in its experimental studies. This includes: experiment scheduling; sample and reagent acquisition and tracking and management of inventories; instrument and environment operational monitoring and data collection; and chain-of-custody tracking and electronic enforcement of Standard Operating Procedures (SOPs). The LIMS includes modules supporting secure web-based remote data entry and data access of blinded study data, wireless handheld Pocket PCs with integrated barcode scanners, and a unique lab workflow engine that allows new activities and protocols to be quickly brought online. Microsoft Data Transformation Services (DTS) are used to automatically publish experimental data from the LIMS to a multi-terabyte Oracle analytic database. The LIMS was developed with the Microsoft .NET development environment, using Microsoft Visual C#® and the Microsoft SQL Server™ 2000 database. It is currently in operation in support of Perlegen’s Genotyping and Disease Association collaborations.

The processing of the microarray wafer data is a high throughput application, involving both massive datasets and extensive computation. The Windows-based compute farm is managed by Perlegen’s Production Computing Task Management System. This system schedules the compute tasks (Windows applications) that process the data on the compute cluster and store results to the analytic database. It employs a database-centric execution and monitoring component that affords policy-based prioritization, management by exception and immediate notification in case of application error, and cluster status and monitoring (including trending) using standard SQL-based reporting tools.
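To make the database-centric idea concrete, here is a minimal sketch of a task queue held in a relational table: tasks are claimed in priority order, failures are recorded so they can be reported by exception, and cluster status is a plain SQL aggregate. It is an assumption-laden illustration rather than a description of Perlegen's system: the `tasks` table, its columns, and every function name are hypothetical, SQLite stands in for the production database, and the claim step is written for a single worker (a real farm would need an atomic claim).

```python
import sqlite3


def create_queue(conn):
    """Create the hypothetical task table."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS tasks (
            id INTEGER PRIMARY KEY,
            command TEXT NOT NULL,
            priority INTEGER NOT NULL,
            status TEXT NOT NULL DEFAULT 'queued',
            error TEXT
        )
        """
    )


def submit(conn, command, priority):
    """Queue a compute task and return its id."""
    cur = conn.execute(
        "INSERT INTO tasks (command, priority) VALUES (?, ?)",
        (command, priority),
    )
    return cur.lastrowid


def claim_next(conn):
    """Mark the highest-priority queued task as running (oldest first on ties).

    Single-worker sketch: the SELECT and UPDATE are not atomic across
    concurrent workers.
    """
    row = conn.execute(
        "SELECT id, command FROM tasks WHERE status = 'queued' "
        "ORDER BY priority DESC, id ASC LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE tasks SET status = 'running' WHERE id = ?", (row[0],))
    return row


def finish(conn, task_id, error=None):
    """Record the outcome; a non-empty error marks the task as failed."""
    status = "failed" if error else "done"
    conn.execute(
        "UPDATE tasks SET status = ?, error = ? WHERE id = ?",
        (status, error, task_id),
    )


def failed_tasks(conn):
    """Management by exception: list only the tasks that need attention."""
    return conn.execute(
        "SELECT id, command, error FROM tasks WHERE status = 'failed' ORDER BY id"
    ).fetchall()


def status_report(conn):
    """Cluster status as a plain SQL aggregate: task count per status."""
    return dict(
        conn.execute("SELECT status, COUNT(*) FROM tasks GROUP BY status").fetchall()
    )
```

Keeping the queue in the database means the same reporting tools used for laboratory data can also answer questions such as how many tasks are queued or which ones failed, with no separate monitoring store.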

Benefits

As of January 2003, the system has been in operation for nearly eighteen months, processing and tracking over 100 terabytes of information. At peak, system throughput exceeded 500 GB a day, accomplished through the execution and management of over 15,000 daily computational tasks. During this time, nearly 6,000 high-density oligonucleotide arrays (wafers) were scanned and analyzed. Each of these wafers consists of 60 million individual DNA probes, and generates a little over 8 GB of raw data when scanned.
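The figures above can be related to one another with some back-of-the-envelope arithmetic. The sketch below uses the rounded values from the text ("nearly", "over", and "a little over" are treated as exact) and assumes decimal units (1 TB = 1,000 GB, 1 GB = 1,000 MB), so the results are approximations only.

```python
# Rounded figures taken from the text; decimal units are an assumption.
WAFERS = 6000
PROBES_PER_WAFER = 60_000_000
RAW_GB_PER_WAFER = 8
PEAK_GB_PER_DAY = 500
TASKS_PER_DAY = 15_000

# Raw scan data across all wafers, in terabytes.
raw_total_tb = WAFERS * RAW_GB_PER_WAFER / 1000

# Average raw bytes generated per DNA probe.
bytes_per_probe = RAW_GB_PER_WAFER * 1e9 / PROBES_PER_WAFER

# Average data volume handled per computational task at peak, in megabytes.
mb_per_task = PEAK_GB_PER_DAY * 1000 / TASKS_PER_DAY

print(f"raw scan data: about {raw_total_tb:.0f} TB")
print(f"raw data per probe: about {bytes_per_probe:.0f} bytes")
print(f"data per task at peak: about {mb_per_task:.0f} MB")
```

On these assumptions the raw scans account for roughly 48 TB, about half of the more than 100 terabytes the system processed and tracked.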

Perlegen has benefited greatly in its effort from the effective use of Microsoft technologies, including Windows compute clusters, SQL Server 2000, and the .NET development environment. Microsoft .NET is software for connecting people, information, systems, and devices. Some of the key benefits include:

  • Outstanding overall system reliability and availability
  • Low total cost of ownership (TCO)
  • Scalability to support growing business requirements
  • Rapid application development and deployment

“Perlegen’s informatics infrastructure has enabled unprecedented insight into human genetic variations,” states Perlegen’s Chief Information Officer, Greg Brandeau. “We will use this knowledge to explore the genetic cause of disease and drug response so that ultimately we can make a difference in people’s lives.”

Conclusion

Perlegen Sciences’ mission is to improve lives through better understanding of the molecular basis of disease and drug response. Perlegen’s approach employs whole genome scanning, a powerful technology that generates large amounts of data and requires sophisticated data management and analysis capabilities. Perlegen has been successful in meeting its aggressive research and business goals through the development of a world-class informatics infrastructure based on commercial information technologies – including key components from Microsoft.

For More Information

For more information about Microsoft products and services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Information Centre at (877) 568-2495. Customers who are deaf or hard-of-hearing can reach Microsoft text telephone (TTY/TDD) services at (800) 892-5234 in the United States or (905) 568-9641 in Canada. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information using the World Wide Web, go to:

© 2003 Microsoft Corporation. All rights reserved.

This case study is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.

Microsoft, PivotTable, Visual C#, and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.