Research Proposal: Know-how Data Mining & Knowledge Discovery

Constructing Rule-Based Knowledge for Data Mining

Ahmed Sameh

Department of Computer & Information Systems

Prince Sultan University,

P.O.Box 66833, Riyadh 11586

March, 2010

  1. Motivation

Present-day data mining tools are powerful but require significant expertise to apply effectively, so there is a clear need for a “Know-how” knowledge base in this field. The distinctive feature of the proposed Hub is the vertical stacking (integration) of traditional data mining methods with both statistical methods and visualization techniques. We believe that the iterative and interactive application of blends of algorithms and techniques from these three areas will provide better insight into, and capability for, analyzing the available data. Early commitment to particular models and techniques is not recommended: exploring and navigating the data without such commitment leads to a broader exploration of the solution space, and the subsequent convergence on, and exploitation of, particular models and techniques will then be better justified. Data mining algorithms and statistical analysis complement each other, since data mining tools are mainly about “hypothesis generation”, whereas statistical analysis is mainly about “hypothesis testing”. The iterative and interactive application of compatible techniques and methods from both fields will facilitate cross-checking, verification, validation, and cross-validation, and will thus be most beneficial for the “Know-how” knowledge base.
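The generate-then-test interplay described above can be sketched in a few lines. In this hypothetical example (the groups and data are invented for illustration), an exploratory “data mining” step generates a hypothesis by scanning the data for the group with the largest mean, and a confirmatory “statistical analysis” step tests that hypothesis with a permutation test:

```python
# Sketch of the generate-then-test loop: mining *generates* a
# hypothesis, statistics *tests* it. Data are synthetic.
import random
import statistics

random.seed(42)
groups = {
    "A": [random.gauss(5.0, 1.0) for _ in range(30)],
    "B": [random.gauss(5.6, 1.0) for _ in range(30)],
}

# Hypothesis generation (exploratory): pick the group with the larger mean.
best = max(groups, key=lambda g: statistics.mean(groups[g]))
other = "A" if best == "B" else "B"
observed = statistics.mean(groups[best]) - statistics.mean(groups[other])

# Hypothesis testing (confirmatory): permutation test of the mean difference.
pooled = groups["A"] + groups["B"]
n = len(groups[best])
trials = 2000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[:n]) - statistics.mean(pooled[n:])
    if diff >= observed:
        count += 1
p_value = count / trials

print(f"generated hypothesis: mean({best}) > mean({other})")
print(f"observed difference = {observed:.3f}, p = {p_value:.3f}")
```

Because the hypothesis was suggested by the same data it is tested on, a production version would test it on a held-out sample; the sketch only illustrates the division of labor between the two fields.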

For example, with respect to “Models and Patterns”, there are three well-known classes of models: prediction models (e.g. regression), probability distribution models (e.g. parametric and Markov models), and structured data models (e.g. time series, and transition-distribution models such as hidden Markov and spatial models). For all of these models, both data mining algorithms and statistical analysis algorithms exist to generate and test them.
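As a minimal illustration of the first model class, the sketch below fits a prediction model (simple linear regression) by ordinary least squares and then checks it with a standard goodness-of-fit statistic, using only the Python standard library. The data are synthetic placeholders:

```python
# Prediction model sketch: ordinary least squares fit of y on x,
# followed by R^2 as a statistical goodness-of-fit check.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]  # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form OLS slope and intercept.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def predict(x):
    return intercept + slope * x

# R^2: proportion of variance explained by the fitted model.
ss_res = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(f"y ~ {slope:.2f}x + {intercept:.2f}, R^2 = {r_squared:.3f}")
```

The same pattern (generate a model, then quantify its fit) carries over to the richer model classes listed above.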

As for “patterns”, there are two types: global patterns (e.g. clustering) and local patterns (e.g. outlier detection, bump hunting, scan statistics, and association rules). Again, both data mining algorithms and statistical analysis algorithms exist to generate and test these patterns.
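One of the local patterns listed above, outlier detection, can be sketched with a simple z-score rule: a mining pass computes each point's deviation from the sample mean, and a statistical threshold decides which points qualify. The data and the threshold of 2 standard deviations are illustrative choices, not prescriptions:

```python
# Local-pattern sketch: z-score outlier detection on synthetic data.
import statistics

values = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 25.0, 9.7]
mu = statistics.mean(values)
sigma = statistics.stdev(values)

# Flag points more than 2 sample standard deviations from the mean.
outliers = [v for v in values if abs(v - mu) / sigma > 2.0]
print("outliers:", outliers)  # the 25.0 reading stands out
```

Note that the extreme point inflates both the mean and the standard deviation (masking), which is why robust variants based on the median are often preferred in practice.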

The proposed Hub will make extensive use of this vertical stacking (integration) of traditional data mining methods with statistical methods and visualization techniques in developing a “Know-how” rule-based system.

Results of the Hub’s activities will provide the contexts within which the “Know-how” knowledge will be generated and added to the proposed “Know-how” knowledge-based system.

All the Hub’s activities will provide learning opportunities for developing and growing (acquiring) a “Know-how” knowledge base. The suggested system can be built using any of the following techniques: rule-based techniques, inductive techniques, symbol-manipulation techniques, case-based techniques, and/or qualitative techniques (such as model-based reasoning, temporal reasoning, and neural networks). It will have an inference engine that reasons over and searches for (composes) solutions. It will also have a learning (knowledge acquisition) module that allows the system to improve its performance through exposure to learning opportunities (contexts and decisions), and an explanation module that answers reasoning questions such as: Why, What if, What is, How, and Why not.
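If the rule-based option were chosen, the inference engine could follow the classic forward-chaining pattern, with a firing trace that directly supports “How” explanations. The rules and facts below are invented placeholders for illustration, not the proposal's actual knowledge base:

```python
# Minimal forward-chaining inference engine sketch. The recorded
# trace of rule firings answers the "How" explanation question.
rules = [
    # (rule name, antecedent facts, consequent fact)
    ("R1", {"numeric_target", "labeled_data"}, "use_regression"),
    ("R2", {"use_regression", "few_samples"}, "prefer_simple_model"),
]

def forward_chain(initial_facts):
    """Fire rules until no new facts can be derived."""
    facts = set(initial_facts)
    trace = []
    changed = True
    while changed:
        changed = False
        for name, antecedents, consequent in rules:
            if antecedents <= facts and consequent not in facts:
                facts.add(consequent)
                trace.append((name, consequent))
                changed = True
    return facts, trace

facts, trace = forward_chain({"numeric_target", "labeled_data", "few_samples"})
for name, conclusion in trace:  # "How" explanation: which rule derived what
    print(f"{name} derived {conclusion}")
```

“Why” and “What if” questions can be served from the same machinery: “Why” by showing the antecedents of a fired rule, “What if” by re-running the chain on a modified fact set.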

The “Know-how” knowledge base is a kind of decision support system for those who may not have the technical or strategic experience necessary to chart an effective roadmap for uncovering the valuable predictive insights hidden within their existing data. The “Know-how” knowledge base will provide:

  • How and where to get started with a specific data set
  • Causes of failure in the straightforward application of a specific tool, and how the pitfalls can be avoided
  • Relevant case studies that reveal the rewards of proper algorithm selection, proper design, and careful implementation when dealing with a specific data set
  • Why establishing an internal predictive modeling practice is within one’s reach, together with a roadmap for a specific data set
  • Tips, tricks, and techniques for preparing a specific data set, selecting methods, validating results, and gluing together appropriate data mining, statistical, and visualization methods (making use of the stacking approach)
  • Interactive guru session with explanations

  • Resources (meta-knowledge) and direction on how to move forward with confidence