Block 3.2 DATA MINING
from data to information
Aim of the Course
Course Description:
This course familiarises the student with the main topics in Data Mining and its important role in current Computer Science. We focus mainly on algorithms, methods, and techniques for the representation and analysis of data and information.
Course Objectives:
- To get a broad understanding of data mining and knowledge discovery in databases.
- To understand major research issues and techniques in this new area and to conduct research in it.
- To be able to apply data mining tools to practical problems.
Organisation and Structure of the Course
Lecturers: Dr. E.N. Smirnov, Dr. R.L. Westra
Course Methodology:
The course consists of five components: a series of lectures, practical exercises, a project, student lectures, and a final exam. In the lectures the lecturers present the main theoretical aspects. In the practical exercises the students solve assignments. They may cooperate in groups and must submit a two-page paper within one week. During the project students cooperate in groups to work out the project objectives. Depending on the preferences of the students, the project can be internal or external. An internal project is given for the data-mining course only and lasts one week. An external project is given for all three courses of Block 3.2 and lasts three weeks. In addition to the practical exercises and the project, the students also collaborate in groups to present student lectures on specific advanced topics. Finally, the course ends with a closed-book exam.
Assessment and Exam
3.2.1 Course Exercises
At the beginning of this course the students will form pairs. Each pair should register with the lecturers before the end of the first week. During the course the pairs will receive 8 exercises. An exercise consists of theoretical or practical tasks that should be solved, e.g. with Weka, within one week of being issued, and must be submitted to the lecturers by email as a written report (maximum 2 pages). Each report is graded as ‘insufficient’ (0 points), ‘moderate’ (0.125 points), or ‘sufficient’ (0.25 points). The eight exercises can therefore earn at most 2 points in total.
3.2.3 Student Lectures
In this part of the course, the project groups choose one of eight more advanced topics in Data Mining. Each group presents a lecture on its chosen topic, based on a literature study. The lecture is presented by one group member, and the other group members are obliged to answer the questions in the subsequent discussion. Furthermore, each group is obliged to pose one question at each of the other presentations. The grade for the student lecture is the average of the grades given by the lecturers and by the student groups. The student lecture can earn at most 1 point.
3.2.4 Final Course Exam
At the end of this course there is a closed-book exam of three hours. It contributes at most 10 points.
3.2.5 Project
Regardless of the type of project chosen (internal or external), students form groups to work out the project objectives. At the end of the project period they present their results and submit a final report. Both the presentation and the report are evaluated by all the lecturers in the block. Together, they can contribute up to 10 points.
3.2.6 Final Grade
The final grade of this course is composed of: 75% of the sum of the grades of course exercises, student lectures and final exam (with a maximum of 10 points), plus 25% of the grade of the project.
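The grading rule can be worked through with a small calculation. The sketch below is one reading of the rule, not an official calculator: it assumes that the phrase "with a maximum of 10 points" caps the sum of exercises, student lecture and exam at 10, and that the project is graded on a 0 to 10 scale. The function name `final_grade` and the example scores are made up for illustration.

```python
def final_grade(exercise_points, lecture_points, exam_points, project_grade):
    """Combine the course components into a final grade (assumed 0-10 scale)."""
    # Exercises, student lecture and exam are summed; the sum is assumed
    # to be capped at 10 points.
    coursework = min(10.0, sum(exercise_points) + lecture_points + exam_points)
    # 75% of the coursework sum plus 25% of the project grade.
    return 0.75 * coursework + 0.25 * project_grade

# Hypothetical student: six 'sufficient' and two 'moderate' reports,
# 0.8 for the student lecture, 7.0 for the exam, 8.0 for the project.
exercises = [0.25] * 6 + [0.125] * 2
grade = final_grade(exercises, 0.8, 7.0, 8.0)
print(round(grade, 2))  # prints 9.16
```

Under this reading, a student with full marks on every component ends at exactly 10, because the coursework sum of 13 is capped before weighting.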
LECTURE 1: Introduction
- Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996), From Data Mining to Knowledge Discovery in Databases:
- Hand, D., Mannila, H., and Smyth, P. (2001), Principles of Data Mining, MIT Press, Cambridge, MA, USA
What is Data Mining?
- data → information → knowledge
- patterns, structures, models
The Use of Data Mining
- more and more very large databases, on the scale of TB (terabytes)
- N data points and K components (fields) per data point
- not accessible to quick inspection
- incompleteness, noise, flawed design
- different kinds of values: numbers, alphanumeric data, meaningful fields
- need to automate the analysis
Application Areas
- astronomical databases
- marketing/investment
- telecommunications
- industry
- biogenetics
Historical Context
- in statistics the term has a negative connotation:
- danger of overfitting and faulty generalisation
Data Mining Subdisciplines
- Databases
- Statistics
- Knowledge-based systems
- High-performance computing
- Data visualization
- Pattern recognition
- Machine learning
Data Mining Methods
- clustering
- classification (off-line and on-line)
- (auto)regression
- visualisation by means of optimal projections and PCA (principal component analysis)
- discriminant analysis
- decomposition
- parametric modelling
- non-parametric modelling
Components of Data Mining Algorithms
- model representation
- model evaluation
- search/optimisation
Data Mining Algorithms
- decision trees/rules
- nonlinear regression and classification
- example-based methods
- and many others: neural networks (NN), genetic algorithms (GA), ...
Data Mining and Statistics
- when to use Statistics and when to use DM?
- is DM a kind of Statistics?
Data Mining and AI
- AI is instrumental in finding knowledge in large blocks of data
Mathematical Principles in Data Mining
Part I: Exploring the Data Space
* Understanding and Visualizing Data Space
Provide tools to understand the basic structure in databases. This is done by probing and analysing metric structure in data-space, comprehensively visualizing data, and analysing global data structure by e.g. Principal Components Analysis and Multidimensional Scaling.
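As an illustration of analysing global data structure, the following minimal sketch computes a Principal Components Analysis through the singular value decomposition. It assumes NumPy is available; the data set is synthetic and made up for this example, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data set: N = 200 data points with K = 3 fields,
# stretched mostly along a single direction plus a little noise.
latent = rng.normal(size=(200, 1))
X = latent @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.normal(size=(200, 3))

Xc = X - X.mean(axis=0)                      # centre each field
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)              # fraction of variance per component
projected = Xc @ Vt[:2].T                    # coordinates on the first two components

print(explained)
```

The rows of `Vt` are the principal directions, so plotting the two columns of `projected` gives a two-dimensional view of the data that retains as much variance as any linear projection can.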
* Data Analysis and Uncertainty
Show the fundamental role of uncertainty in Data Mining. Understand the difference between uncertainty originating from statistical variation in the sensing process and uncertainty originating from imprecision in the semantic modelling. Provide frameworks and tools for modelling uncertainty, especially the frequentist and the subjective/conditional frameworks.
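The difference between the two frameworks can be seen in a very small example. The sketch below estimates a success probability from counts, once as a frequentist maximum likelihood estimate and once as the mean of a posterior distribution. It assumes that "subjective" refers to the Bayesian view, and it uses a Beta prior whose parameters `prior_a` and `prior_b` are hypothetical choices made for illustration.

```python
def frequentist_estimate(successes, trials):
    """Maximum likelihood estimate: the observed relative frequency."""
    return successes / trials

def bayesian_posterior_mean(successes, trials, prior_a=1.0, prior_b=1.0):
    """Posterior mean under a Beta(prior_a, prior_b) prior on the probability."""
    return (successes + prior_a) / (trials + prior_a + prior_b)

# Hypothetical observation: 3 successes in 10 trials.
mle = frequentist_estimate(3, 10)        # 3/10
post = bayesian_posterior_mean(3, 10)    # (3 + 1) / (10 + 2)
print(mle, post)
```

With few observations the prior pulls the estimate towards 0.5; as the number of trials grows, the two estimates converge.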
Part II: Finding Structure in Data Space
* Data Mining Algorithms & Scoring Functions
Provide a measure for fitting models and patterns to data. This enables the selection between competing models. Data Mining Algorithms are discussed in the parallel course.
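A scoring function in this sense can be sketched as follows. The example fits polynomials of increasing degree to a synthetic data set and scores each with a BIC-style criterion (fit quality plus a penalty per parameter). The choice of this particular score, the data, and the names `rss` and `bic_score` are assumptions made for illustration; the course may use other scoring functions. NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 60)
# Hypothetical data generated from a quadratic curve plus Gaussian noise.
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.2 * rng.normal(size=x.size)

def rss(degree):
    """Residual sum of squares of the least-squares polynomial fit."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return float(np.sum(residuals**2))

def bic_score(degree):
    """BIC-style score for Gaussian errors (up to a constant): lower is better."""
    n, k = x.size, degree + 1
    return n * np.log(rss(degree) / n) + k * np.log(n)

scores = {d: bic_score(d) for d in range(0, 6)}
best_degree = min(scores, key=scores.get)
print(scores, best_degree)
```

The residual sum of squares alone always favours the most complex model, which is why the score adds a complexity penalty before competing models are compared.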
* Searching for Models and Patterns in Data Space
Describe the computational methods used for model and pattern fitting in data mining algorithms. Most emphasis is on search and optimisation methods, which are required to find the best fit between the model or pattern and the data. Special attention is devoted to parameter estimation under missing data using the maximum likelihood EM algorithm.
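The EM algorithm can be illustrated on a standard textbook case: a mixture of two one-dimensional Gaussians, where the missing data are the unobserved group labels. This is a minimal sketch under those assumptions, not the specific setting treated in the lectures; the data are synthetic and all names are illustrative. NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: two groups whose labels are not observed.
data = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.0, 200)])

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Initial guesses for mixing weights, means and standard deviations.
weights = np.array([0.5, 0.5])
means = np.array([data.min(), data.max()])
stds = np.array([data.std(), data.std()])

log_likelihoods = []
for _ in range(100):
    # E-step: responsibility of each component for each data point.
    dens = weights * normal_pdf(data[:, None], means, stds)   # shape (N, 2)
    total = dens.sum(axis=1, keepdims=True)
    resp = dens / total
    log_likelihoods.append(float(np.log(total).sum()))
    # M-step: maximum likelihood re-estimation from the weighted data.
    nk = resp.sum(axis=0)
    weights = nk / data.size
    means = (resp * data[:, None]).sum(axis=0) / nk
    stds = np.sqrt((resp * (data[:, None] - means) ** 2).sum(axis=0) / nk)

print(weights, means, stds)
```

Each iteration cannot decrease the log-likelihood recorded in `log_likelihoods`, which is the property that makes EM a maximum likelihood method.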
Part III: Mathematical Modelling of Data Space
* Descriptive Models for Data Space
Present descriptive models in the context of Data Mining. Describe specific techniques and algorithms for fitting descriptive models to data. Main emphasis here is on probabilistic models.
* Clustering in Data Space
Discuss the role of data clustering within Data Mining. Show how clustering relates to classification and search. Present a variety of paradigms for clustering data.
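One of the simplest clustering paradigms is k-means. The sketch below is a bare-bones version given as an example of the idea; the text above does not single out this method. The data are synthetic, the function name `k_means` is illustrative, and NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: two well-separated groups of 50 points in the plane.
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(6.0, 0.5, size=(50, 2))])

def k_means(points, k, n_iter=20):
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: move each centroid to the mean of its points.
        for j in range(k):
            if np.any(labels == j):      # an empty cluster keeps its old centroid
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

labels, centroids = k_means(X, k=2)
print(centroids)
```

Unlike classification, no labels are given in advance: the algorithm searches for a partition of the data, which is where clustering meets search and optimisation.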
Examples
- Astronomical databases
- Phylogenetic trees from DNA analysis
Example 1: Phylogenetic Trees
The last decade has witnessed a major and historic leap in biology and all related disciplines. The date of this event can be set almost exactly at November 1999, when the Human Genome Project (HGP) was declared completed. The HGP resulted in (almost) the entire human genome, consisting of about 3.3 × 10^9 base pairs (bp) of code and containing all of the approximately 35K human genes. Since then the genomes of many more animal and plant species have become available. For our purposes, we can consider the human genome as a huge database consisting of a single string of 3.3 × 10^9 characters from the set {C,G,A,T}.
This data constitutes the human ‘source code’. From this data, in principle, all ‘hardware’ characteristics, such as physiological and psychological features, can be deduced. In this block we will concentrate on another aspect that is hidden in this information: the phylogenetic relations between species. The famous evolutionary biologist Dobzhansky once remarked that ‘Nothing in biology makes sense except in the light of evolution’. This most certainly applies to the genome. Hidden in the data is the evolutionary history of the species. By systematically comparing several species with varying degrees of relatedness, we can reconstruct this evolutionary history. For instance, consider a species that lived at a certain time in the history of the earth. It is marked by a set of genes, each with a specific code (or rather, a statistical variation around the average). If for some reason this species becomes distributed over a variety of non-connected areas (e.g. islands, oases, mountainous regions), animals of the species will not be able to mate at random. In the course of time, due to the accumulation of random mutations, the genomes of the separated groups will increasingly differ. This results in the origin of sub-species, and eventually of new species. Comparing the genomes of the new species sheds light on the evolutionary history in three ways: we can draw a phylogenetic tree of the sub-species leading back to the ‘founder’ species; given the rate of mutation, we can estimate how long ago the founder species lived; and we can reconstruct the most probable genome of the founder species.
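The first two steps can be sketched on toy data. The example below compares four short, made-up aligned DNA strings, builds a tree by repeatedly joining the closest groups (average-linkage clustering, commonly called UPGMA, chosen here for illustration and not named in the text), and estimates a divergence time under the assumption of a constant mutation rate. The sequences, the species names A to D and the rate `MU` are all hypothetical, and reconstructing the founder genome is not attempted.

```python
from itertools import combinations

# Hypothetical aligned DNA fragments for four made-up species.
sequences = {
    "A": "ACGTACGTACGTACGTACGT",
    "B": "ACGTACGTACGTACGTACGA",
    "C": "ACGTTCGTACGAACGTACCA",
    "D": "TCGTTCGAACGAACTTACCA",
}

def p_distance(s, t):
    """Fraction of aligned positions at which two sequences differ."""
    return sum(a != b for a, b in zip(s, t)) / len(s)

dist = {frozenset(pair): p_distance(sequences[pair[0]], sequences[pair[1]])
        for pair in combinations(sequences, 2)}

def cluster_distance(c1, c2):
    """Average distance over all pairs of leaves from the two clusters."""
    return sum(dist[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

# Repeatedly merge the two closest clusters; keys record the tree so far.
clusters = {name: (name,) for name in sequences}
while len(clusters) > 1:
    x, y = min(combinations(clusters, 2),
               key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]))
    clusters[f"({x},{y})"] = clusters.pop(x) + clusters.pop(y)
tree = next(iter(clusters))
print(tree)  # prints ((A,B),(C,D))

# Divergence time under a constant rate: both lineages accumulate changes,
# so distance = 2 * MU * time. MU is a hypothetical rate per site per year.
MU = 1e-9
t_ab = dist[frozenset(("A", "B"))] / (2 * MU)
print(t_ab)
```

The simple distance used here ignores repeated mutations at the same site, so for distantly related species it underestimates the true divergence.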
Example 2: Data Mining in Astronomy
Exercises:
- What kind of structure do you find in the ASCII file DAMdataset1.mat? [on my web page: