From Data to Information

Block 3.2 DATA MINING

Course Goals

Course Description:

In this course the student becomes familiar with the main topics in Data Mining and its important role in current Computer Science. The course focuses mainly on algorithms, methods, and techniques for the representation and analysis of data and information.

Course Objectives:

To get a broad understanding of data mining and knowledge discovery in databases.
To understand major research issues and techniques in this new area and conduct research.
To be able to apply data mining tools to practical problems.

Course Organisation and Structure

Lecturers: Dr. E.N. Smirnov, Dr. R.L. Westra

Course Methodology:

The course consists of five components: a series of lectures, practical exercises, a project, student lectures, and a final exam. In the lectures the lecturers present the main theoretical aspects. In the practical exercises the students solve assignments; they may cooperate in groups and must submit a two-page paper within one week. During the project the students cooperate in groups to work out the project objectives. Depending on the students' preferences, the project is internal or external. An internal project is given for the data-mining course only and lasts one week. An external project is given for all three courses of Block 3.2 and lasts three weeks. In addition to the practical exercises and the project, the students collaborate in groups to present student lectures on specific advanced topics. Finally, the course ends with a closed-book exam.

Assessment and Exam

3.2.1 Course Exercises

At the beginning of this course the students form pairs. Each pair should register with the lecturers before the end of the first week. During the course these pairs receive 8 exercises. An exercise consists of theoretical or practical tasks which should be solved, e.g. with Weka, within one week after issue, and must be submitted to the lecturers by email as a written report (maximum 2 pages). Each report is graded as ‘insufficient’ (0 points), ‘moderate’ (0.125 points), or ‘sufficient’ (0.25 points).

3.2.3 Student Lectures

In this part of the course, each project group chooses one of eight more advanced topics in Data Mining and presents a lecture on it, based on a literature study. The lecture is presented by one group member; the other group members are obliged to answer the questions in the subsequent discussion. Furthermore, each group is obliged to ask one question at each of the other presentations. The grade for the student lectures is the average of the grades given by the lecturers and by the student groups. The student lecture can earn at most 1 point.

3.2.4 Final Course Exam

At the end of this course there is a closed-book exam of three hours. It contributes at most 10 points.

3.2.5 Project

Regardless of the type of project chosen (internal or external), the students form groups to work out the project objectives. At the end of the project period each group presents its results and submits a final report. Both the presentation and the report are evaluated by all the lecturers in the block. Together they can contribute at most 10 points.

3.2.6 Final Grade

The final grade of this course is composed of: 75% of the sum of the grades of course exercises, student lectures and final exam (with a maximum of 10 points), plus 25% of the grade of the project.
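
The weighting above can be sketched as a small calculation. This is only an illustration under one assumption: that "with a maximum of 10 points" means the sum of the three course components is capped at 10 before the 75% weight is applied. The function name and the example student are hypothetical.

```python
def final_grade(exercise_points, lecture_points, exam_points, project_points):
    """Combine the partial grades into the final course grade.

    Assumption: the sum of exercises, student lecture and exam is capped
    at 10 points before the 75% weight is applied.
    """
    course_part = min(10.0, sum(exercise_points) + lecture_points + exam_points)
    return 0.75 * course_part + 0.25 * project_points


# Hypothetical student: 8 exercise reports, one student lecture, exam, project.
exercises = [0.25, 0.25, 0.125, 0.25, 0.0, 0.25, 0.125, 0.25]  # sums to 1.5
grade = final_grade(exercises, lecture_points=0.5, exam_points=6.0,
                    project_points=8.0)
print(grade)
```

For this student the three course components sum to 8 points, so the cap does not apply.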

LECTURE 1: Introduction

Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996), From Data Mining to Knowledge Discovery in Databases:
Hand, D., Mannila, H., and Smyth, P. (2001), Principles of Data Mining, MIT Press, Cambridge, MA, USA

What is Data Mining?

data, information, knowledge
patterns, structures, models

The Usefulness of Data Mining

more and more very large databases, TB (terabytes) in size
N data points and K components (fields) per data point
not accessible to quick inspection
incompleteness, noise, faulty design
different kinds of numbers, alphanumeric data, meaningful fields
need to automate the analysis

Application Areas

astronomical databases
marketing/investment
telecommunications
industry
biogenetics

Historical Context

in statistics, ‘data mining’ has a negative connotation:
danger of overfitting and faulty generalisation
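
The danger of overfitting can be made concrete with a small sketch. The data set, the alternating "noise" of ±0.5 and all function names below are invented for illustration. A straight line and a degree-7 polynomial are fitted to eight points that lie around the line y = 2x; the polynomial reproduces the training points exactly but generalises far worse to new points.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b


def lagrange(xs, ys, x):
    """Value at x of the polynomial that passes through all (xs, ys)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def mse(predict, xs, ys):
    """Mean squared error of a model on a data set."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def truth(x):
    return 2.0 * x


# Training data: the true line plus alternating deviations of +/-0.5.
train_x = [i / 7 for i in range(8)]
train_y = [truth(x) + (0.5 if i % 2 == 0 else -0.5)
           for i, x in enumerate(train_x)]
# Test data: noise-free points halfway between the training points.
test_x = [(i + 0.5) / 7 for i in range(7)]
test_y = [truth(x) for x in test_x]

a, b = fit_line(train_x, train_y)


def line_model(x):
    return a + b * x


def poly_model(x):
    return lagrange(train_x, train_y, x)


for name, model in [("line", line_model), ("degree-7 polynomial", poly_model)]:
    print(name, "train MSE:", mse(model, train_x, train_y),
          "test MSE:", mse(model, test_x, test_y))
```

The polynomial has as many free parameters as there are training points, so a training error of zero says nothing about how well it describes the underlying structure.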

Data Mining Subdisciplines

Databases
Statistics
Knowledge Based Systems
High-performance computing
Data visualization
Pattern recognition
Machine learning

Data Mining Methods

clustering
classification (off-line and on-line)
(auto)regression
visualisation by means of optimal projections and PCA (principal component analysis)
discriminant analysis
decomposition
parametric modelling
non-parametric modelling

Components of Data Mining Algorithms

model representation
model evaluation
search/optimisation

Data Mining Algorithms

Decision trees/Rules
Nonlinear Regression and Classification
Example-based methods
and a whole lot more: NN, GA, ...

Data Mining and Statistics

when Statistics and when DM?
is DM a kind of Statistics?

Data Mining and AI

AI is instrumental in finding knowledge in large blocks of data

Mathematical Principles in Data Mining

Part I: Exploring Data Space

* Understanding and Visualizing Data Space

Provide tools to understand the basic structure in databases. This is done by probing and analysing metric structure in data-space, comprehensively visualizing data, and analysing global data structure by e.g. Principal Components Analysis and Multidimensional Scaling.
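
As a rough illustration of analysing global data structure, the sketch below finds the first principal component of a tiny data set by power iteration on the covariance matrix. The data points and function names are made up, and the fixed start vector is an assumption that works for this example but must not be orthogonal to the sought direction in general.

```python
import math


def covariance_matrix(points):
    """Sample covariance matrix of N points with K fields each."""
    n, k = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(k)]
    centered = [[p[d] - means[d] for d in range(k)] for p in points]
    return [[sum(row[a] * row[b] for row in centered) / (n - 1)
             for b in range(k)] for a in range(k)]


def first_principal_component(points, iterations=100):
    """Direction of largest variance, found by power iteration."""
    cov = covariance_matrix(points)
    k = len(cov)
    # Arbitrary start vector (assumption: not orthogonal to the answer).
    v = [1.0 / (d + 1) for d in range(k)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient: variance of the data along the unit vector v.
    variance = sum(v[a] * cov[a][b] * v[b]
                   for a in range(k) for b in range(k))
    total = sum(cov[d][d] for d in range(k))
    return v, variance / total


# Hypothetical data: points scattered closely around the line y = x.
points = [(0.1, -0.1), (0.9, 1.1), (2.1, 1.9), (2.9, 3.1), (4.1, 3.9)]
direction, explained = first_principal_component(points)
print("first principal direction:", [round(x, 3) for x in direction])
print("fraction of variance explained:", round(explained, 3))
```

Projecting the points onto this single direction keeps almost all of the variance, which is the idea behind using PCA to visualise high-dimensional data in a few dimensions.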

* Data Analysis and Uncertainty

Show the fundamental role of uncertainty in Data Mining. Understand the difference between uncertainty originating from statistical variation in the sensing process and uncertainty originating from imprecision in the semantic modelling. Provide frameworks and tools for modelling uncertainty, especially the frequentist and subjective/conditional frameworks.
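
The difference between the two frameworks can be illustrated with a minimal sketch. It assumes that the subjective framework is read as Bayesian estimation with a Beta prior, which is an interpretation and not something the course text specifies; the function names and counts are hypothetical.

```python
def frequentist_estimate(successes, trials):
    """Maximum-likelihood estimate: the observed relative frequency."""
    return successes / trials


def subjective_estimate(successes, trials, prior_a=1.0, prior_b=1.0):
    """Posterior mean of the success probability under a Beta prior.

    Beta(1, 1) is the uniform prior, i.e. no preference beforehand.
    """
    return (successes + prior_a) / (trials + prior_a + prior_b)


# Hypothetical observation: 3 successes in 3 trials.
print(frequentist_estimate(3, 3))
print(subjective_estimate(3, 3))
```

With only three observations the frequentist estimate claims certainty, while the posterior mean is pulled towards the prior; with many observations the two estimates approach each other.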

Part II: Finding Structure in Data Space

* Data Mining Algorithms & Scoring Functions

Provide a measure for fitting models and patterns to data. This enables the selection between competing models. Data Mining Algorithms are discussed in the parallel course.
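
A scoring function can be as simple as the sum of squared errors. The sketch below uses invented data and three hypothetical candidate models to show how such a score allows a choice between competing models; it is one possible scoring function, not necessarily the one used in the course.

```python
def sum_of_squared_errors(model, xs, ys):
    """Score of a model on the data: lower means a better fit."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical measurements.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

# Three competing candidate models (all invented for this example).
candidates = {
    "constant y = 5": lambda x: 5.0,
    "linear y = 1 + 2x": lambda x: 1.0 + 2.0 * x,
    "quadratic y = x^2": lambda x: x * x,
}

scores = {name: sum_of_squared_errors(model, xs, ys)
          for name, model in candidates.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(name, round(score, 2))
print("selected model:", best)
```

Note that a pure goodness-of-fit score always favours more flexible models, so practical scoring functions usually add a penalty for model complexity or are evaluated on held-out data.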

* Searching for Models and Patterns in Data Space

Describe the computational methods used for model and pattern fitting in data mining algorithms. Most emphasis is on search and optimisation methods, which are required to find the best fit between the model or pattern and the data. Special attention is devoted to parameter estimation under missing data using the maximum-likelihood EM algorithm.
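
A minimal sketch of the EM algorithm is given below for a one-dimensional mixture of two Gaussians. Here the "missing data" are the unknown component labels of the data points, which is an assumption about the kind of missing data meant. The data, starting values and function names are hypothetical, and the sketch does not guard against numerical underflow for points far from both components.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(data, start_means, iterations=50, min_var=1e-6):
    """EM for a 1-D mixture of two Gaussians; returns (means, variances, weights)."""
    means = list(start_means)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: expected membership (responsibility) of each point.
        resp = []
        for x in data:
            p = [weights[k] * gaussian_pdf(x, means[k], variances[k])
                 for k in range(2)]
            total = p[0] + p[1]
            resp.append([p[0] / total, p[1] / total])
        # M-step: re-estimate the parameters from the weighted data.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2
                      for r, x in zip(resp, data)) / nk
            variances[k] = max(min_var, var)  # keep the variance positive
            weights[k] = nk / len(data)
    return means, variances, weights


# Hypothetical data drawn around two centres, 1.0 and 5.0.
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
means, variances, weights = em_two_gaussians(data, start_means=(0.0, 6.0))
print("means:", [round(m, 3) for m in means])
print("weights:", [round(w, 3) for w in weights])
```

Each iteration alternates between filling in the missing labels in expectation and maximising the likelihood given those expectations, which is the general pattern of EM.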

Part III: Mathematical Modelling of Data Space

* Descriptive Models for Data Space

Present descriptive models in the context of Data Mining. Describe specific techniques and algorithms for fitting descriptive models to data. Main emphasis here is on probabilistic models.

* Clustering in Data Space

Discuss the role of data clustering within Data Mining. Show the relation of clustering to classification and search. Present a variety of paradigms for clustering data.
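
One of the simplest clustering paradigms is k-means, sketched below as an example; the course text does not say which paradigms are covered, so the choice is an assumption. The points and starting centroids are invented for illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, start_centroids, iterations=20):
    """Plain k-means; returns (centroids, clusters)."""
    centroids = [tuple(c) for c in start_centroids]
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: squared_distance(p, centroids[k]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = []
        for k, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data with two visually obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, start_centroids=[points[0], points[3]])
print("centroids:", centroids)
print("cluster sizes:", [len(c) for c in clusters])
```

The alternation between an assignment step and an update step is a search procedure that minimises the within-cluster squared distance, which links clustering to the search and optimisation methods of Part II.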

Examples

Astronomical Databases

Phylogenetic trees from DNA analysis

Example 1: Phylogenetic Trees

The last decade has witnessed a major, historic leap in biology and all related disciplines. The date of this event can be set almost exactly to November 1999, when the Human Genome Project (HGP) was declared completed. The HGP resulted in (almost) the entire human genome, consisting of about 3.3 × 10^9 base pairs (bp) of code that make up all of the approximately 35K human genes. Since then the genomes of many more animal and plant species have become available. For our purposes, we can consider the human genome as a huge database consisting of a single string of 3.3 × 10^9 characters from the set {C,G,A,T}.

This data constitutes the human ‘source code’. In principle, all ‘hardware’ characteristics, such as physiological and psychological features, can be deduced from it. In this block we concentrate on another aspect hidden in this information: the phylogenetic relations between species. The famous evolutionary biologist Dobzhansky once remarked that ‘nothing in biology makes sense except in the light of evolution’. This most certainly applies to the genome. Hidden in the data is the evolutionary history of the species. By systematically comparing several species with varying degrees of relatedness, we can reconstruct this evolutionary history.

For instance, consider a species that lived at a certain time in Earth's history. It is marked by a set of genes, each with a specific code (or rather, a statistical variation around the average). If this species for some reason becomes distributed over a variety of non-connected areas (e.g. islands, oases, mountainous regions), animals of the species will not be able to mate at random. In the course of time, due to the accumulation of random mutations, the genomes of the separated groups will increasingly differ. This results in the origin of sub-species, and eventually of new species. Comparing the genomes of the new species sheds light on the evolutionary history in three ways: we can draw a phylogenetic tree of the sub-species leading back to the ‘founder’ species; given the rate of mutation, we can estimate how long ago the founder species lived; and we can reconstruct the most probable genome of the founder species.
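
The comparison of genomes can be sketched on a toy scale. The sequences, species names and mutation rate below are all hypothetical, and the time estimate assumes a simple molecular clock in which both lineages accumulate mutations at the same constant rate and no site mutates twice.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def divergence_time(distance, rate_per_site_per_year):
    """Years since the common ancestor under a simple molecular clock.

    Both lineages mutate independently, hence the factor 2.
    """
    return distance / (2.0 * rate_per_site_per_year)


# Hypothetical aligned DNA fragments of three related species.
sequences = {
    "species_A": "ACGTACGTACGTACGTACGT",
    "species_B": "ACGTACGTACGTACGTACGA",
    "species_C": "ACGAACGTTCGTACCTACGA",
}

distances = {
    (a, b): p_distance(sequences[a], sequences[b])
    for a, b in combinations(sorted(sequences), 2)
}
closest = min(distances, key=distances.get)

for pair, d in distances.items():
    print(pair, d)
print("most closely related:", closest)

# Hypothetical mutation rate: 1e-9 substitutions per site per year.
years = divergence_time(distances[closest], 1e-9)
print("estimated years since common ancestor:", years)
```

Repeatedly joining the most closely related pair into a common ancestor is the basic idea behind distance-based methods for building a phylogenetic tree.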

Example 2: data mining in astronomy

Exercises:

What kind of structure do you find in the ASCII file DAMdataset1.mat? [on my web page:
