I Data Intensive Science: Challenge and Opportunity

We propose a Center for Data Intensive Science (CDIS) that will provide the research foundation and crucial integrative development for the methodologies, algorithms, software, and technology tools required to enable revolutionary advances in data intensive science over the next twenty years. The Center’s focus will be on the challenging opportunities offered by immense, often heterogeneous and/or distributed data collections. Today, the largest of these is nearing a Petabyte (10^15 bytes) in size, with next-decade sizes of Exabytes (1,000 Petabytes) expected. Disciplines whose archives already approach this size are encountering major computational science and computer science challenges in extracting scientific insights and results. Disciplines that have avoided data-intensiveness are being stimulated to rethink basic methodological and data management strategies to achieve new levels of performance by exploiting such archives. Both cases implicitly involve major knowledge management issues.

The central issue is to unlock, efficiently, the implicit value of ultra-large datasets at a qualitatively higher level than that of a simple repository. Size alone is neither the profound challenge nor the opportunity. Ultra-large archives involve data of unprecedented complexity and information content. The problem is to combine data access, processing, information analysis, knowledge representation, and presentation with the associated problem of simultaneously scheduling (“co-scheduling”) the requisite computing, data-handling, and networking resources on an unprecedented scale and with a degree of national and worldwide distribution of resources not previously encountered. CDIS is founded on major expertise in Data Grids[1], new database technologies, knowledge representation methods and standards, pattern recognition and feature extraction methodologies, data representation and visualization, thin-client devices, and high performance computing, together with complementary expertise in important disciplinary research. Together these make it possible to treat the ensemble of data stores, computing resources, and networks in those scientific disciplines as a unified, managed system and to address an entirely new class of data-intensive scientific problems.
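
To make concrete what “co-scheduling” of computing and data resources entails, the sketch below pairs each analysis job with a site that can supply both the CPU capacity and the required input dataset, otherwise noting the data staging that must be scheduled first. The site names, job names, and greedy placement rule are illustrative assumptions, not a CDIS design.

```python
# Minimal co-scheduling sketch: place each job at a site that can supply both
# CPU capacity and the required dataset; otherwise note the needed staging.
# All names and the greedy policy are hypothetical illustrations.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    free_cpus: int
    local_datasets: set

@dataclass
class Job:
    name: str
    cpus: int
    dataset: str

def co_schedule(jobs, sites):
    """Greedily place each job; record a required transfer when data is not local."""
    plan = []
    for job in jobs:
        local = [s for s in sites
                 if job.dataset in s.local_datasets and s.free_cpus >= job.cpus]
        candidates = local or [s for s in sites if s.free_cpus >= job.cpus]
        if not candidates:
            plan.append((job.name, None, "deferred: no CPU capacity"))
            continue
        site = max(candidates, key=lambda s: s.free_cpus)
        site.free_cpus -= job.cpus
        action = "run on local data" if site in local else f"stage {job.dataset} first"
        plan.append((job.name, site.name, action))
    return plan

sites = [Site("site-A", 64, {"run12"}), Site("site-B", 128, {"run13"})]
jobs = [Job("fit-1", 32, "run12"), Job("fit-2", 96, "run12")]
for entry in co_schedule(jobs, sites):
    print(entry)
```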

The recent EU-US Workshop on Large Scientific Databases Report[2] summarized the situation: “It is not very difficult to put some scientific data into a file system and…work with it….But when the database grows, things get more difficult…” (pp. 12-13) Because ultra-large, complex archives are generated at substantial expense, the Report says that “Without effective ways to retrieve, analyze, and manipulate these data, that expense will not yield the benefits to society that we might expect.” (p. 4) The Report identified crucial research on “...the creation and use of large scientific databases” (p. 4). It recommended “...funding application driven, multi-disciplinary research...preferably directly connected to scientifically interesting research.” (p. 1; emphasis added) This Center will pursue that agenda.

Key problem areas are: magnitude and complexity of the collections; geographical dispersion (of the archives, associated software and hardware, and the scientific teams); representation and modeling of information content; discovery, analysis, representation, and management of relationship knowledge; and scalability and coupling of integrative technologies (databases, analysis and management systems, access devices, storage systems, grids and data-handling middleware, and networks). We emphasize that the unique characteristics of scientific information (e.g., continuous data, large objects, high degrees of uncertainty, often poorly understood relationships) mean that one cannot look to rapidly advancing commercial technologies to solve these problems.

This Center’s mission is to meet these challenges via a focused, systematic, multidisciplinary approach. CDIS will:
  (1) Relate the current challenges and limitations in data access, processing, and analysis to the underlying technologies, then propagate strategies and prototypical methods to deal with the problems at the leading edge (size and complexity; geographical spread; CPU, device I/O, network bandwidth and management, and similar limits; new algorithms, data structures, or feature analysis).
  (2) Carry on a research program to extend the range and scope of “solvable” data problems and to increase working efficiency and ease (e.g., through location and storage-medium independence). Problems at the leading edge will receive intensive treatment, so that the near-term outcomes are new scientific results plus improved capabilities for handling data-intensiveness.
  (3) Identify needs and solutions in data intensive problem solving that are common to several disciplinary specialties, with particular attention to methods for linking knowledge representations and information management.
  (4) Design, prototype, and deploy software tools to deal with those shared problems. CDIS also will describe and develop new methods and approaches to correlate feature extraction with knowledge structures. As appropriate, the software will be added to existing libraries or to a library of new open-standard methods and tools for large-scale data handling and analysis applicable to more than a single scientific field.
  (5) Apply new toolsets (e.g., the GriPhyN PVDG tools for Grid-enabling applications) to create or substantively enhance scientific environments for data handling, utilization, visualization, and collaboration, making large-scale, high-complexity data analysis more accessible to the scientific communities.
  (6) Provide educational and research opportunities for U.S. students and faculty and linkages among research institutions and industry to facilitate two-way transfer of research results.
Implicit issues include scalability and interoperability methods and standards, re-purposing of commercial data intensive application methods (e.g., data mining, portals, thin clients), and effective coupling of new developments in data-intensive specialties to science teaching.

Long-term exponential increases in data storage and transmission capacity, driven by price declines in media, make Petabyte local storage systems and Exabyte aggregate storage capacities likely for many science communities over the next decade. Aggregates are an especially challenging opportunity: they involve large-scale data migration among different media (with attendant timing, scheduling, and performance issues), complex internal structures, and relationships. Since I/O rates have not kept up with rapidly growing storage system sizes in recent years, data transport and its management as a “long transaction” will grow in importance, especially in the context of simultaneous transactions across a complex set of networks. Data-intensive science also is developing because scientific interaction increasingly involves geographical and temporal dispersion. The trend is to exploit these developments aggressively by generating and retaining large, heterogeneous aggregations of potentially or actually interesting data. Examples include observational data, video, audio, VR experiences, and simulations (both discipline-specific and of abstract distributed systems). In disciplines that have long been data-limited to the point of discarding most computed results (e.g., materials simulation and modeling), rethinking of methodologies to include pattern mining and hypothesis testing on vast collections of computed data (including fused archives and federated databases) has begun.
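
As an illustration of treating a data transfer as a “long transaction,” the sketch below performs a chunked copy and checkpoints its progress so that an interrupted transfer resumes at the last committed chunk rather than restarting. The paths, chunk size, and checkpoint format are hypothetical; a production Data Grid service would add integrity checks and coordination with storage and network schedulers.

```python
# Sketch of a chunked, resumable copy treated as a "long transaction":
# progress is checkpointed after each chunk so an interrupted transfer
# restarts where it stopped. Format and paths are illustrative only.
import json
import os

CHUNK = 64 * 1024 * 1024  # 64 MB committed per step (illustrative)

def resumable_copy(src, dst, state_file):
    """Copy src to dst in chunks, recording durable progress after each chunk."""
    done = 0
    if os.path.exists(state_file) and os.path.exists(dst):
        with open(state_file) as sf:
            done = json.load(sf)["bytes_done"]          # resume point
    mode = "r+b" if done else "wb"
    with open(src, "rb") as fin, open(dst, mode) as fout:
        fin.seek(done)
        fout.seek(done)
        while True:
            chunk = fin.read(CHUNK)
            if not chunk:
                break
            fout.write(chunk)
            fout.flush()
            done += len(chunk)
            with open(state_file, "w") as sf:            # "commit" the step
                json.dump({"bytes_done": done}, sf)
    if os.path.exists(state_file):
        os.remove(state_file)                            # transaction complete
```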

New storage capabilities alone are inadequate to realize the scientific potential of ultra-large, complex archives. “It is generally much easier to accumulate data …than to use it effectively. The goal must be to increase the productivity of the working scientist … [by providing] more intelligent and powerful tools with which to query and analyze the data and by helping to organize communities of scientists in their collective enterprises.” [3] The Report on a Future Information Infrastructure for the Physical Sciences,[4] though oriented more toward scientific literature, nevertheless called for a “…conceptual change that allows us to better integrate information types (e.g., text, data, images, animations) as well as information at various states of analysis (raw data, partially processed data, ..., analyzed data in varying degrees of analyses....). Such a conceptual change would facilitate the serendipity and insights that can be gained by dealing with these multiple information types and their interrelationships.” [Emphasis added.] Straightforward adoption of existing database technology is not viable.[5] Developed primarily in the business environment, today’s highly specialized database technology does not address such crucial questions as:

  • How can a community collaboratively create, manage, organize, catalog, annotate, analyze, process, visualize, and extract knowledge from a distributed Exabyte-scale aggregate of heterogeneous data?
  • How can an individual scientist or small workgroup make discoveries with a Petabyte of heterogeneous data? Presuming the data to have been created and organized properly, what methods and tools are essential to analyze, process, visualize, and extract knowledge optimally from such an aggregate?
  • How is the presupposed creation, proper organization, and analysis of Petabyte archives to be done with maximum benefit, least redundancy, and minimum diversion of resources from the fundamental science effort? How can recent work on standards to describe knowledge structures and their associations with information resources best be exploited for these tasks?
  • What would be the impact on algorithm design and performance, hence problem solution, for computational specialties (e.g., simulations) of using ultra-large archives of computed data (instead of discarding most of the computed data)? What about comparing with correspondingly large archives of experimental data?
  • How can recent progress in object-oriented software and databases, targeted at access to complex and diverse data types, be adapted to work efficiently on the Petabyte-to-Exabyte scale, across an ensemble of networks?
  • Which infrastructure traits (e.g., scalability, interoperability, new query processing modes, security, performance) are prerequisites to realizing the scientific potential of data-intensiveness and how should they be achieved?

Only a coordinated, multi-specialty attack will yield solutions that are transferable and lasting across disciplines. A useful prototype for lasting transferability is Netlib,[6] which provides documented, tested mathematical software for incorporation in a wide variety of applications, both scientific and commercial. For example, in linear algebra, Netlib-accessible software has long since relieved scientists in any discipline of having to write linear algebra code.
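
The Netlib point can be shown with a minimal example: rather than hand-coding Gaussian elimination, a scientist in any field calls a documented, tested routine (here via NumPy, whose linear algebra builds on the LAPACK software distributed through Netlib).

```python
# Reuse of tested numerical libraries instead of hand-written solvers.
# NumPy's linear algebra routines are backed by LAPACK, part of the lineage
# of documented, tested software distributed through Netlib.
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # coefficient matrix
b = np.array([1.0, 2.0])                 # right-hand side

x = np.linalg.solve(A, b)                # library-quality LU-based solve
assert np.allclose(A @ x, b)             # residual check
print(x)
```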

A new opportunity in physical and biological science is archive-based research. Called collection-based research in Ref. 2, this mode is familiar in the social sciences, e.g., analysis of census data over multiple decades, though not in the automated form envisioned here and in Ref. 2. The potential is to examine the archive (alone or in fusion with others) for insights and results not necessarily related to the basic motives for establishing the archive. Archive-based research requires that the algorithms and underlying rules for representing essential features of the collection be published, that the rules be deposited interoperably with the collection, and that there be a logic system that applies those rules within the context of a discipline-specific knowledge base. Reliable, standard methods to meet these requirements are unavailable at present for scientific data archives.
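
A minimal sketch of what “depositing the rules interoperably with the collection” might look like follows: feature-extraction rules are stored as declarative records alongside the data, and a small generic engine applies them. The rule schema, field names, and thresholds are invented for illustration.

```python
# Sketch: feature-extraction rules deposited alongside a collection as
# declarative records and applied by a small generic rule engine.
# The rule schema, field names, and thresholds are hypothetical.
import operator

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

# Rules that would be published and stored with the archive itself.
rules = [
    {"feature": "candidate_event", "field": "snr",       "op": ">",  "value": 8.0},
    {"feature": "low_noise",       "field": "rms_noise", "op": "<=", "value": 0.05},
]

def extract_features(record, rules):
    """Return the names of all features whose deposited rule matches the record."""
    return [r["feature"] for r in rules
            if OPS[r["op"]](record[r["field"]], r["value"])]

archive = [
    {"id": 1, "snr": 9.3, "rms_noise": 0.02},
    {"id": 2, "snr": 4.1, "rms_noise": 0.20},
]
for rec in archive:
    print(rec["id"], extract_features(rec, rules))
```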

II Data Intensive Disciplines: Shared Features

Data-intensiveness is a shared feature of many important, far-reaching research efforts on the fundamental forces of nature, biomolecules and biological systems, atmospheric and geophysical phenomena, the composition and structure of the universe, and advanced materials. A summary of key projects follows.

Physics & Materials: Gravitational wave searches detect gravitational waves from pulsars, supernovae, and events leading to black holes. Primary US examples are LIGO[7] and the recently proposed LISA (NASA-ESA)[8] space-based experiment. The data accumulation rate will be approximately 100-200 Terabytes per year by 2002. Planned upgrades will raise that to 500 Terabytes per year by the middle of the decade. The data are heterogeneous and often involve intensive calculations (high flops/byte). Digital full-sky surveys map and catalog the entire sky at high resolution. Terabyte-sized data sets exist already.[9] Federated, highly distributed, heterogeneous data archives are in prospect, leading to a vast, searchable, several-Petabyte database.[10] High energy and Nuclear Physics collision detectors produce a very dense, heterogeneous data stream. Today high-energy physics data storage rates are several hundred Terabytes per year per experiment.[11] The Large Hadron Collider[12] expects several Petabytes per year (in 2006), rising as beam intensities grow. Nuclear Physics experiments (Jefferson Lab, Relativistic Heavy Ion Collider) currently yield 10 to 20 MB/sec year round. Next-generation experiments involve 100 MB/s rates. The typical archive is heterogeneous across beam, background, trigger, target condition, calibration, and simulation data. High intensity X-ray scattering[13] from synchrotron sources probes the 3-D atomic-scale structure of advanced materials and biological molecules. An existing diffraction station on the Materials Research Collaborative Access Team (MRCAT) beamline at the Advanced Photon Source currently has data rates of Terabytes per month. Limited only by detector technology and sample positioning, this rate is expected to grow to over an Exabyte per month across different samples. Materials and molecular modeling[14] today involves wholesale data abandonment. Such modeling, of both biomolecules and novel materials, is notoriously compute-intensive since current strategies minimize data-intensiveness. Even so, only a tiny fraction of the computed data is analyzed, via a small number of scalar and vector functions plus qualitative visualization. A single study nevertheless can yield several Terabytes of heterogeneous data in a few months.
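
A rough conversion of the sustained rates quoted above into annual volumes conveys the scale involved; the short calculation below assumes year-round operation and is approximate.

```python
# Rough conversion of the sustained data rates quoted above into annual
# volumes (approximate: assumes continuous year-round operation).
SECONDS_PER_YEAR = 365 * 24 * 3600          # ~3.15e7 s

def annual_volume_tb(rate_mb_per_s):
    """Annual volume in Terabytes for a sustained rate in MB/s."""
    return rate_mb_per_s * SECONDS_PER_YEAR / 1e6

for label, rate in [("Nuclear physics today (20 MB/s)", 20),
                    ("Next-generation experiments (100 MB/s)", 100)]:
    print(f"{label}: ~{annual_volume_tb(rate):.0f} TB/year")
```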

Biology: In genomics, sequence databases are growing at an exponential rate. Useful information often depends upon heuristics to approximate the output of NP-hard algorithms.[15] With DNA microarray technology, every gene in an organism can be monitored for the first time. Microarray time-course studies can generate more than a Gigabyte of data per experiment. A completely new area of statistics and bioinformatics – analysis of DNA microarray experiments – is emerging. “Taking over where genomics leaves off”,[16] proteomics develops a comprehensive catalog of proteins expressed in individual cells, and then generates and tests hypotheses concerning their functions.[17] Additional data interrelate proteins into metabolic and regulatory pathways. The relation between the three-dimensional structure of proteins and their roles in these pathways requires analysis of yet another kind of database. Both are growing rapidly. For example, the Protein Data Bank (which holds a collection of X-ray and NMR biomolecular structures), currently at 100 GB, has been doubling in size every 3 years. Proteomics also exemplifies knowledge relationships that should be expressible via a knowledge base. Ontology mapping based upon knowledge bases is needed to correlate information among genomics databases, protein data banks, and molecular trajectory databases.[18] Three-dimensional scans of whole organisms are exemplified by the Human Brain Project[19] (time series of human brain 3D scans), yielding outputs of hundreds to thousands of Terabytes. The related NPACI Neuroscience thrust area[20] is supporting the federation of neuroscience databases of brain images, an example of the use of knowledge systems to define relationships between the physical and biological components of nervous systems.
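
What ontology-mediated correlation across such databases might look like can be sketched minimally as a cross-reference table joining records from a genomics database with a protein structure collection; all identifiers, record fields, and the mapping itself are invented for illustration.

```python
# Minimal sketch of ontology-mediated correlation between a genomics record
# set and a protein structure record set. All identifiers, fields, and the
# cross-reference mapping are invented for illustration only.
gene_records = {"GENE:0001": {"symbol": "abc1", "organism": "E. coli"}}
structure_records = {"PDB:1ABC": {"method": "x-ray", "resolution_A": 1.8}}

# Cross-references of the kind a knowledge base would maintain.
gene_to_structure = {"GENE:0001": ["PDB:1ABC"]}

def correlate(gene_id):
    """Join a gene annotation with any protein structures mapped to it."""
    gene = gene_records.get(gene_id)
    if gene is None:
        return None
    structures = [structure_records[s]
                  for s in gene_to_structure.get(gene_id, [])
                  if s in structure_records]
    return {"gene": gene, "structures": structures}

print(correlate("GENE:0001"))
```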

Earth sciences: Systematic satellite earth observations provide complete, multiple-wavelength observations to yield improved understanding of the Earth as an integrated system. The best-known example is the Earth Observing System,[21] which will record Petabytes of data in the early part of this decade. Next-generation climate models of the air-water-ice-land system at adequate spatial resolution will easily generate hundreds of Terabytes of data. A worldwide scientific community analyzes and interprets this output. A major challenge is robust extraction of features of interest, such as extreme events. The characterization rules for climatic features need to be represented in a knowledge base that is used to compare results from multiple climate models, with tight coupling between knowledge representations and scientific-data feature extraction.
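
A minimal sketch of coupling a characterization rule to feature extraction follows: a declarative rule (with an invented threshold and variable name) is applied to a gridded model field to flag candidate extreme events. Real rules would reside in a shared knowledge base and be applied uniformly across the output of multiple models.

```python
# Sketch: a declarative characterization rule for "extreme events" applied
# to a gridded climate-model field. The variable name, threshold, and rule
# format are invented; real rules would be drawn from a knowledge base.
import numpy as np

rule = {"feature": "extreme_precip", "variable": "precip_mm_day",
        "op": ">", "value": 100.0}

def extract(field, rule):
    """Return the indices of grid cells satisfying the characterization rule."""
    if rule["op"] == ">":
        mask = field > rule["value"]
    else:
        raise ValueError("operator not supported in this sketch")
    return np.argwhere(mask)

# Synthetic precipitation field standing in for one model's output grid.
precip = np.random.default_rng(0).gamma(shape=2.0, scale=20.0, size=(4, 5))
print(extract(precip, rule))
```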

Notably, the shared features of these data intensive specialties go beyond mere magnitude to essential characteristics: geographical dispersion, collaborative analysis, information density, heterogeneity and complexity (many types, multi-faceted associations among items and subsets, etc.), the need for knowledge representation and feature extraction based on often complex criteria, very large numbers of named collections, and the scheduling of jobs for which both CPU resources and data movement transactions must be accounted for. Implicit also is the transition from file storage to scientific databases to handle increased complexity. These intrinsic features generate shared interests: (a) automated and semi-automated techniques (e.g., agents) for detecting and extracting “significance” from complex, multi-modal, often continuous (not discrete as in commercial data mining) data; (b) methods for assembling information, and then knowledge, from multiple geographically distinct sources; (c) techniques by which massive datasets can be utilized as community resources and operated upon by the associated communities; and (d) integration of ontologies.

III Research Program

III.1 Directions and Objectives

Little organized effort has been invested in finding ways to exploit the opportunities offered by these shared features. Valuable methodologies, algorithms, and toolkits devised within one specialty often are customized for the specialty in which they originated and therefore go largely unknown to researchers in other areas. Inter-specialty progress also is blocked by the lack of any systematic, coordinated curriculum to educate students in this area. The situation is exacerbated by a lack of interchange with the Information Technology (IT) industry. IT solutions typically are unfamiliar in disciplinary research; hence, the potential for re-purposing (commercial to research) goes unexploited. Successes in disciplines other than Computer Science enter the business sector unsystematically at best. (Only in the specific area of data mining is there a dedicated large scientific database effort that includes the relationship to commercial applications.[22] In contrast, the treatment of large, complex scientific archives as envisioned in Ref. 2 and proposed here is applications-driven and broadly based.)