2004 Dagstuhl Workshop

Data Mining: The Next Generation

Rakesh Agrawal

Johann Christoph Freytag

Raghu Ramakrishnan

Introduction

Data Mining has enjoyed great popularity in recent years, with advances in both research and commercialization. The first generation of data mining research and development has yielded several commercially available systems, both stand-alone and integrated with database systems; produced scalable versions of algorithms for many classical data mining problems; and introduced novel pattern discovery problems.

In recent years, research has tended to be fragmented into several distinct pockets without a comprehensive framework. Researchers have continued to work largely within the parameters of their parent disciplines, building upon existing and distinct research methodologies. Even when they address a common problem (for example, how to cluster a dataset) they apply different techniques, different perspectives on what the important issues are, and different evaluation criteria. While different approaches can be complementary, and such a diversity is ultimately a strength of the field, better communication across disciplines is required if Data Mining is to forge a distinct identity with a core set of principles, perspectives, and challenges that differentiate it from each of the parent disciplines.

Further, while the amount and complexity of data continue to grow rapidly, and the task of distilling useful insight continues to be central, serious concerns have emerged about the social implications of data mining. Addressing these concerns will require advances in our theoretical understanding of the principles that underlie Data Mining algorithms, as well as an integrated approach to security and privacy in all phases of data management and analysis.

We believe that it is timely to bring together researchers from a variety of backgrounds to re-assess the current directions of the field, to identify critical problems that require attention, and to discuss ways to increase the flow of ideas across the different disciplines that Data Mining has brought together. We propose a workshop to foster such a discussion.

Workshop Theme

The success of Data Mining depends on many constituencies (e.g., academia, tool vendors, policy advocates and regulators), each with their own agendas and concerns, and some focus is desirable to ensure good interactions. We will focus the workshop on research directions, and specifically, directions that will lead to increased use of techniques and perspectives drawn from the different disciplines involved in KDD. The workshop participants will be asked to identify promising research problems for the next 5 years, using three criteria:

·  Is this problem real? Will the practice of data mining be significantly improved by advancing the state of the art?

·  Does the problem have sufficient depth and breadth to engage the research community?

·  Does the problem cut across boundaries of traditional disciplines like Database Systems, Machine Learning, and Statistics? Will it lead to increased collaborations and cross-disciplinary research?

Some candidate problems are listed below, and are intended to serve as a seed for further discussion:

1. Compositional Data Mining: Can we develop compositional approaches and optimization of multi-step mining "queries" to efficiently explore a large space of candidate models using high-level input from an analyst? The goal is to reduce the time taken to explore a large and complex dataset iteratively.

·  Examples of real applications that made use of more than one data mining operation. How was the composition achieved? How could it have been different? What was missing?

·  Illustrative examples of how compositional use of mining techniques can be useful.

·  Thoughts on primitive operations, algebra of composition, opportunities for optimization, incorporation of domain knowledge.

2. Query Centric vs. Data Centric Data Mining: Techniques arising in Database Systems are typically query centric, and seek to retrieve patterns from data that match patterns specified by a query. In contrast, techniques arising in Machine Learning and Statistics are typically data-driven, and seek to generate patterns or data descriptions that characterize (interesting or large) subsets of data.

·  Are the two approaches reconcilable? What could be the meeting grounds?

·  Examples of applications where the two approaches have been, or can be, used synergistically.

3. Designing for security and privacy: How can we enable effective mining while controlling access to data according to specific privacy and security policies?

·  What are the limits to what we can learn, given a set of governing policies?

·  What issues arise in mining across enterprises? In mining in a service-provider environment?

4. Tight integration of mining with relational database systems: How can we improve data mining environments to store data mining results and their provenance in a secure, searchable, sharable, scalable manner? Given a set of ongoing mining objectives, how should the data in a warehouse be organized, indexed, and archived?

·  Do we need to extend SQL to support mining operations? What is the appropriate granularity: whole operations such as clustering, or light-weight operations that can be used to implement clustering and other higher-level operations? Examples of the two approaches.

·  Do we need to extend SQL to store and reason about mining algorithms and derivations?

·  Design principles for mining environments.
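As one illustration of the first candidate problem, the idea of composing primitive mining operations into a multi-step "query" can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the helper names (compose, segment_by, frequent_items) and the toy basket data are hypothetical and invented for illustration, and grouping by a field stands in for a real clustering step.

```python
from collections import Counter, defaultdict
from functools import reduce


def compose(*steps):
    """Chain steps left to right: the output of one step feeds the next."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)


def segment_by(key):
    """Hypothetical primitive 1 (a stand-in for clustering): group records by a field."""
    def step(records):
        groups = defaultdict(list)
        for record in records:
            groups[record[key]].append(record)
        return dict(groups)
    return step


def frequent_items(field, min_support):
    """Hypothetical primitive 2: per group, keep items whose support meets min_support."""
    def step(groups):
        result = {}
        for name, records in groups.items():
            # Count each item at most once per record.
            counts = Counter(item for r in records for item in set(r[field]))
            result[name] = {
                item for item, n in counts.items() if n / len(records) >= min_support
            }
        return result
    return step


# Toy data, invented for illustration only.
baskets = [
    {"region": "north", "items": ["milk", "bread"]},
    {"region": "north", "items": ["milk", "eggs"]},
    {"region": "south", "items": ["tea", "bread"]},
    {"region": "south", "items": ["tea"]},
]

# A two-step mining "query": segment the data, then mine each segment.
query = compose(segment_by("region"), frequent_items("items", min_support=1.0))
result = query(baskets)
print(result)
```

Because each step is an ordinary function from data to data, an analyst could swap the segmentation step or change the support threshold without touching the rest of the query. Whether such a small algebra is the right set of primitives, and how a system might optimize across the steps, is exactly the kind of question raised above.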

Participants will be invited to make a case for other problems as well. However, the workshop will seek to discuss a small number (say, 3-4) of problems in depth. In addition, we hope that the workshop will lead to a better understanding of the structure of the field. KDD has brought together the machine learning, statistics, and database communities. Increasingly, other communities have also focused on mining activities. Examples include text, natural language, and multimedia mining. However, the sheer breadth of tasks and techniques has led to relatively little communication across the subgroups. Is this likely to continue as the norm? Are there useful synergies between these diverse groups?

The participant list includes well-known researchers as well as young scientists from both industry and academia. It is our hope that the seminar will improve the understanding of this rapidly growing and changing field, and stimulate new collaborations between the different communities.

Workshop Agenda

The workshop will run Monday through Friday. It will emphasize informal presentations and discussions, and will provide opportunities for participants to work on ideas in small groups.

Monday

All participants will make short presentations, explaining their backgrounds and recent research activity related to the Workshop, in order for everyone to get acquainted.

We will also solicit feedback on the specific problems and topics to be discussed during the remainder of the workshop.

Tuesday through Thursday

This will be the working period of the workshop, and will feature selected presentations in the mornings, followed by loosely structured panels and discussions in the afternoons. Evenings will be left open for small groups or individuals to work on their own.

Friday

The final day of the workshop will feature a morning plenary session in which we take stock of the workshop discussions, and determine an agenda for follow-up work. We expect that many ideas that arise during the workshop will need some discussion in preparation for extended collaborations, and so the afternoon will be left open for small group interactions.