Data mining processes

The process of data mining comprises four consecutive sub-processes:

  • information and feature selection (data retrieval)
  • data processing (data cleansing)
  • discovery or mining (data analysis via numerical methods)
  • interpretation of findings (the interpretation of numerical results in ways that make it possible for relatively unskilled end users to understand core findings).

Software applications referred to as 'data mining tools' attempt to automate these four data mining processes.

Information and feature selection

The feature selection process involves isolating relevant information within the database system and concentrating analysis on promising features within this information.

In real-world situations, relevant information is often unknown before this process begins. With respect to the relationships being sought from the data, most information stored in an organization's database system is either partially relevant, irrelevant or redundant.

Data processing

Data processing can be thought of as two interactive steps: data retrieval or selection, and data preparation or pre-processing. Each of these steps should be controlled and automated by the data mining tool.

Data retrieval or selection

The key objective of data retrieval in data mining is to extract features that lead to the best predictive performance. Mapping raw data into features is not always a straightforward task, however, with the degree of complexity depending on the type of raw data retrieved. Figure 5.2 illustrates a typical retrieval model.

In the case of standard data, the retrieval process is simple. The data mining tool extracts the data in numerical or Boolean form from the data warehouse (we will talk about data warehouses in the later sections of this unit) by using a query language such as SQL. However, the data retrieval process is not such a straightforward exercise when the data mining tool is required to handle specialized data in the form of webpages and documents.
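
As a minimal sketch of this kind of retrieval, the following Python fragment pulls numerical data out of a hypothetical SQLite data warehouse with an ordinary SQL query; the file name, table and columns are assumptions made purely for illustration.

    import sqlite3

    # Connect to a hypothetical data warehouse held as a SQLite file.
    conn = sqlite3.connect("warehouse.db")

    # Retrieve numerical features for mining using a plain SQL query.
    query = """
        SELECT product_id, unit_price, units_sold
        FROM sales
        WHERE sale_year = 2023
    """
    rows = conn.execute(query).fetchall()   # list of (product_id, unit_price, units_sold) tuples
    conn.close()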

In recent years, research has focused on document retrieval from databases and the Internet. There is currently an increasing trend towards merging information retrieval and database functions. As such, support of text retrieval is becoming an important attribute of most data mining systems. There are two types of text retrieval:

  1. Boolean -- where queries are expressed as Boolean expressions and documents are retrieved based on their satisfaction of the Boolean expressions.
  2. Ranking -- where documents are ranked based on the statistical properties of the queries and the documents.

Commercial object-relational database management systems (DBMS) that accommodate text retrieval are currently available.
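
As an illustration of the first type, the sketch below performs Boolean retrieval over a tiny in-memory document collection; the documents and the query are invented for the example.

    # Minimal Boolean retrieval: a document matches if it satisfies the Boolean expression.
    documents = {
        1: "interest rates rise as bond markets weaken",
        2: "equity markets rally on strong earnings",
        3: "bond yields fall while equity volatility rises",
    }

    def matches(text, all_of, none_of=()):
        """True if text contains every term in all_of AND no term in none_of."""
        words = set(text.lower().split())
        return set(all_of) <= words and not (set(none_of) & words)

    # Query: bond AND NOT equity
    hits = [doc_id for doc_id, text in documents.items()
            if matches(text, all_of=["bond"], none_of=["equity"])]
    print(hits)   # -> [1]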

Data preparation or pre-processing

The data preparation process depends on the mining or discovery technique applied. In general, data preparation involves data transformation, reducing the effects of outliers and data normalization; it may also require the data mining tool to deal with issues such as missing data values and textual data.

A great deal of effort has been expended in analysing data with commercially available tools and in developing analysis tools that meet the requirements of particular applications. Despite the sophistication of these tools, problems usually exist with real-world data, so some form of data pre-processing is usually required before these data can be analysed intelligently.

This means that commercial or research data mining tools should provide data pre-processing facilities for use before or even during the actual numerical analysis of the data. Various objectives may exist when pre-processing data. In addition to solving data problems, such as eliminating corrupt data or replacing missing attribute values, changing the structure of data in order to prepare it for more efficient data analysis may also be useful. Very common examples of such pre-processing are the conversion of categorical data (low, medium, high) into numerical data (0, 0.5, 1, for instance), the normalization of numeric values, and the conversion of qualitative textual information into quantitative data.
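
A minimal sketch of the categorical-to-numerical conversion just described, using the low/medium/high mapping given in the text:

    # Map ordered categorical values onto the numerical scale used in the text.
    CATEGORY_MAP = {"low": 0.0, "medium": 0.5, "high": 1.0}

    raw_ratings = ["low", "high", "medium", "high"]
    numeric_ratings = [CATEGORY_MAP[r] for r in raw_ratings]
    print(numeric_ratings)   # -> [0.0, 1.0, 0.5, 1.0]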

Some additional details related to problem data that need to be handled in the data processing or pre-processing stage are as follows:

  • Data transformation

Data transformation involves converting the raw data retrieved from a data warehouse into a representation that describes useful features to the data mining tool.
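
As a simple illustration, the sketch below transforms a series of raw closing prices into a daily-return feature; the choice of feature is an assumed example, not one prescribed by the text.

    # Transform raw daily closing prices into a 'daily return' feature.
    prices = [100.0, 102.0, 101.0, 104.0]
    returns = [(p1 - p0) / p0 for p0, p1 in zip(prices, prices[1:])]
    print([round(r, 4) for r in returns])   # -> [0.02, -0.0098, 0.0297]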

  • Corrupted and noisy data

A data set may include attributes based on measurement or subjective judgements, both of which may give rise to errors. Non-systematic errors of this kind are usually referred to as 'noise'. Noise causes problems when generating descriptions and when using these descriptions in predictions.

  • Missing data

In many cases, data is not available for some small fraction of a data set. Estimates for these data values may be necessary to make the application of the selected mining or discovery algorithm viable. Data mining tools should provide facilities to deal with missing data.
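
One simple way of estimating such values, sketched below, is to substitute the mean of the observed values for the attribute; this is only one of many possible strategies.

    # Replace missing values (None) in a numeric attribute with the attribute mean.
    values = [3.2, None, 4.1, 3.8, None, 4.0]
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    imputed = [v if v is not None else mean for v in values]
    print(imputed)   # missing entries replaced by 3.775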

  • Outliers

Outliers are defined as extreme data points that lie outside the range of the data values being examined. Outliers are one of the most disruptive influences on quantitative models, and they are frequently attributed to erroneous data entry. The most obvious method of determining whether a data point is an outlier is to calculate its z-score, which represents the distance a data point lies above or below the sample mean, measured in units of standard deviations.
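
A minimal z-score check on made-up sample values; the |z| > 2 cut-off used here is a common rule of thumb rather than anything fixed by the text.

    import statistics

    data = [10.2, 9.8, 10.1, 10.4, 9.9, 35.0]    # 35.0 is the suspicious value
    mean = statistics.mean(data)
    std = statistics.stdev(data)                 # sample standard deviation

    # z-score: distance from the sample mean in units of standard deviations
    z_scores = [(x - mean) / std for x in data]
    outliers = [x for x, z in zip(data, z_scores) if abs(z) > 2]
    print(outliers)   # -> [35.0]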

  • Data normalization

Some mining or discovery methods require normalized data. For example, neural networks generally require that data be mapped into a range lying within defined limits, and the nearest neighbour method requires data to be normalized in order to avoid inappropriate overweighting of larger data values. Data mining tools must anticipate the need for normalization and automatically normalize data prior to their presentation to the algorithm used for the mining or discovery process. When a data set has been normalized and the discovery process is complete, the results are generally in normalized form, and the data mining tool must re-map them to their original space.
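
A minimal sketch of min-max normalization into the range [0, 1], together with the re-mapping of results back to the original space mentioned above; the values are invented.

    values = [12.0, 18.0, 15.0, 30.0]
    low, high = min(values), max(values)

    # Map each value into [0, 1] before presenting it to the mining algorithm.
    normalized = [(v - low) / (high - low) for v in values]

    # After the discovery process, map normalized results back to the original space.
    def denormalize(x):
        return low + x * (high - low)

    print(normalized)                   # -> [0.0, 0.333..., 0.1666..., 1.0]
    print(denormalize(normalized[3]))   # -> 30.0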

  • Textual data

As mentioned earlier, data mining may not only deal with numerical formats. Data may be in the form of text. The ideal data mining tool should be designed to convert qualitative information in the form of text into quantitative information on which conventional discovery or mining techniques can operate.

There is a long history in the text retrieval literature of using key word weighting to classify and rank documents. The development of this technique is quite mature; in fact, most of the search engines on the Internet make use of this technique.
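
A bare-bones sketch of keyword weighting for ranking documents: each document is scored by how often the query keywords occur in it. The documents and query are invented, and real search engines use far more refined weightings (e.g. TF-IDF).

    documents = {
        "doc_a": "interest rates and bond markets drive currency moves",
        "doc_b": "equity markets rally while bond markets are flat",
        "doc_c": "central bank raises interest rates as markets expect higher rates",
    }
    query = ["interest", "rates"]

    def score(text, keywords):
        words = text.lower().split()
        # Weight = number of occurrences of each query keyword in the document.
        return sum(words.count(k) for k in keywords)

    ranked = sorted(documents, key=lambda d: score(documents[d], query), reverse=True)
    print(ranked)   # documents ordered from most to least relevant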

  • Test set creation and out of sample testing

A data mining tool should be responsible for testing the correctness of discovered descriptions. This can be accomplished by the data mining tool splitting a data set into two sets: an in sample, used to construct descriptions; and an out of sample, used to test the accuracy of these descriptions.
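
A minimal sketch of such a split, holding back an assumed 25% of the records as the out-of-sample test set:

    import random

    records = list(range(100))          # stand-in for 100 data records
    random.seed(0)                      # reproducible shuffle
    random.shuffle(records)

    split = int(len(records) * 0.75)
    in_sample = records[:split]         # used to construct descriptions
    out_of_sample = records[split:]     # used to test their accuracy
    print(len(in_sample), len(out_of_sample))   # -> 75 25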

Discovery or mining

Central to the overall data mining process is the discovery or mining sub-process, which analyses retrieved data in order to identify the distinct relationships (if any) that exist within a given data set.

Unlike databases, data mining tools do more than just query data. In addition to the standard querying facilities they additionally incorporate numerical methods in the form of artificial intelligence technologies (e.g. neural networks and genetic algorithms), machine learning techniques (e.g. decision trees and rule induction) and conventional mathematical methods (such as those found in statistics and applied mathematics).

It is best to understand the kinds of knowledge extracted during mining by examining the type of questions that can be answered by data mining tools. Query tools can respond to questions such as, 'Which product is the most popular this year?' This query can be easily expressed and processed by finding the product with maximum sales. On the other hand, data mining tools are designed to answer questions such as, 'What are the factors that make product X popular?' In other words, data mining techniques are useful in the absence of known patterns in a database. They are used to supplement the information given in the data description, and are designed to assist in the discovery of commercially valuable knowledge lying hidden in vast corporate data resources.

For many years conventional mathematical techniques, artificial intelligence technologies (such as neural networks) and machine learning techniques (such as decision tree and propositional rule learning) have been used to identify data trends. While the complete list of mining techniques is almost endless, a number of techniques have gained particular favour with the data mining community in recent years. These techniques are often viewed as the cornerstone of the mining process and are increasingly being integrated into modern data mining tools.

Decision trees

Decision trees were originally developed for statisticians to automate the process of determining which fields in a database were actually useful, or more specifically, correlated to the particular problems they were trying to understand. Partially because of this history, decision tree algorithms tend to automate the entire process of hypothesis generation and validation much more completely and in a much more integrated way than other mining techniques. They are also particularly adept at handling raw data with little or no pre-processing. Decision trees are built from historical data and are therefore a form of supervised learning.

The general idea behind decision tree technologies is that once it is built, a decision tree can classify an observation into a finite number of classes. Consider the very simple example of a decision tree designed to formulate a stock recommendation, shown as Figure 5.3. This decision tree is constructed such that it first distinguishes whether a stock is cyclical, defensive or financial. For each of these stock categories a key determinant -- business confidence, equity market volatility or interest rates -- is then used to determine whether a stock is a buy or a sell.

Decision tree algorithms commonly induce sets of rules from a data set, either directly or via an induced tree. The idea of tree induction is to construct a decision tree from a set of observations. It is usual to do so by 'growing' the tree, i.e. by successively splitting leaves. Tree construction is easiest when there is an exact partition of the problem space -- i.e. one that classifies every observation correctly. The alternative, in which the distributions of observations from the classes overlap, is often called a 'noisy classification problem'. For the exact case, the tree must be grown until every observation is classified correctly.
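
As an illustration of tree induction, the sketch below grows a small tree with scikit-learn on a toy, made-up stock data set loosely modelled on the example of Figure 5.3; the features, labels and library choice are all assumptions, not part of the text.

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Toy observations: [stock_category, key_indicator_level]
    # category: 0 = cyclical, 1 = defensive, 2 = financial
    # indicator: the relevant determinant (confidence / volatility / rates), coded 0 = low, 1 = high
    X = [[0, 1], [0, 0], [1, 1], [1, 0], [2, 1], [2, 0]]
    y = ["buy", "sell", "sell", "buy", "sell", "buy"]   # supervised labels from historical data

    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(X, y)

    print(export_text(tree, feature_names=["category", "indicator"]))
    print(tree.predict([[0, 1]]))   # classify a new observation -> ['buy']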

In general, the strength of decision trees comes from their ability to create understandable models that are highly automated in their construction. Most techniques provide automated out of sample testing and incorporate specific embedded technology that prevents over-fitting.

Decision trees generally cope well with missing and noisy data. However, because predictions are made through a sequence of crisp decisions at nodes, dirty or noisy values at those nodes are well known for producing misleading results.

One of the most prominent limitations of decision trees is their lack of flexibility at nodes when dealing with numerical values. For example, suppose a decision tree's 'return' node tests return > 2% in order to classify a stock as a buy. The decision tree will classify two stocks as a buy even though one may have a return just above the threshold (2.1%) and the other may have a return well above it (200%).

Linear discrimination is one form of decision tree that does not suffer from a lack of flexibility. In linear discrimination, weighted contributions of attributes are combined arithmetically and compared to a threshold, so the corresponding descriptions of classes are thus arithmetic rather than logical. Despite its advantages, this type of decision tree algorithm has yet to appear in the form of a viable computer-based model.

A further weakness of decision trees is that decisions made at nodes may miss interactions among various continuous variables due to the non-linearity of the problem space. All decision trees are by definition linear models.

Probabilistic rules, which are described in the next section, are a good substitute for decision trees in problems that demand flexible numerical testing at nodes and the ability to cope with non-linearity in the problem space.

Rule induction

Rule induction is one of the major forms of mining and is perhaps the most common form of knowledge discovery in unsupervised learning systems. It is the form of mining that most people associate with data mining, namely 'mining' for gold in a vast database. Rule induction on a database can be a massive undertaking in which all possible patterns are systematically pulled out of the data and their accuracy and significance are then calculated, expressing the strength of each relationship and the likelihood of its recurrence.

There are two major types of rule induction: propositional and first-order logic. More advanced forms of rule induction include probabilistic rules and association rules.

Propositional and first-order (predicate) logic rules. Propositional rules are based on propositional calculus. To understand the formulation of these rules it is necessary first to understand their connection to a calculus. A calculus is a formal theory or formal system that has the following components:

  • a countable set of symbols or alphabetical characters
  • statement forms (or formulas), built by combining symbols and elementary statement forms (formulas)
  • a set of axioms, which are the initial formulas
  • rules for deriving further formulas from axioms or other formulas, known as the 'rules of inference'.

The variables within a database form the symbols or alphabetical characters of the system, and elementary statements about them form propositions. These propositions are joined by connectives, usually negation (NOT), conjunction (AND), disjunction (OR) and the conditional (IF … THEN). The connected propositions in the data set form propositional rules, which correspond to the axioms or formulas of the calculus. More simply, propositional rules are logical expressions of the form 'if X then Y' or 'if X and Y, then Z'.
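
A minimal sketch of how a propositional rule of the form 'if X and Y, then Z' might be represented and evaluated over a record; the attributes and the rule are invented for illustration.

    # A record is the set of propositions that are true for one observation.
    record = {"positive_return", "low_volatility"}

    # Rule: if positive_return AND low_volatility then buy
    rule = {"if_all": {"positive_return", "low_volatility"}, "then": "buy"}

    def fire(rule, record):
        """Return the conclusion if every condition of the rule holds for the record."""
        return rule["then"] if rule["if_all"] <= record else None

    print(fire(rule, record))   # -> buy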

In a similar manner to decision trees, most propositional rule-learning algorithms use a growing and pruning strategy. In general, learning algorithms start with an initial rule set and iteratively improve it using heuristic techniques. Another common approach is to generate a decision tree, convert it into an equivalent rule set and finally prune that rule set in order to eliminate any unnecessary rules.

Propositional rules are useful when formulating and manipulating propositions and reducing them to a specified form. They do not, however, have the power to break down the propositions and work within the internal structure of a proposition. Consider, for example:

  • Every positive return can be a buy.
  • XYZ has a positive return.
  • Consequently: XYZ can be a buy.

A propositional rule cannot accommodate the above because it does not break down the structure of the propositions. Hence, it cannot recognize that XYZ's positive return is an instance of 'every positive return'.

First-order rules apply predicate logic. Each rule is broken down into a predicate and the subjects of the predicate; for example, the predicate parent(X, Y) is true if the person denoted by the variable X is a parent of the person denoted by the variable Y.

In short, unlike first-order rules, propositional rules lack the ability to connect two objects. Thus propositional rules are considered less powerful than first-order logic rules.
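
A minimal sketch of the parent(X, Y) predicate evaluated over a small set of facts; the facts and the derived grandparent rule are invented for illustration.

    # Facts: (parent, child) pairs
    facts = {("alice", "bob"), ("bob", "carol")}

    def parent(x, y):
        """parent(X, Y) is true if X is recorded as a parent of Y."""
        return (x, y) in facts

    # Unlike a propositional rule, the predicate relates two objects,
    # so the same rule applies to any pair of individuals.
    def grandparent(x, z):
        return any(parent(x, y) and parent(y, z) for (y, _) in facts)

    print(parent("alice", "bob"))        # -> True
    print(grandparent("alice", "carol")) # -> True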

Systems that learn sets of rules have a number of desirable properties. Rule sets are relatively easy for people to understand, so they are the most comprehensible classification models. Rule-learning systems also usually outperform decision trees. The techniques for learning propositional rule sets can often be extended to the first-order case. However, one of the best-known limitations of propositional and first-order logic rules is their lack of flexibility when dealing with numerical values and their inability to cope with non-linear problems.

Probabilistic rules. Probabilistic rules are an extension of first-order rules with the ability to handle uncertain or weighted facts. To gain insight into probabilistic rules, consider the rule set below, which is aimed at predicting daily movements in currency exchange rates.

The total weight of each rule is given by the product of the specific weights of its keywords. For example, in the case of rule 2, the total weight of the rule is given by STERLING_ADD(T-1) × BOND_STRONG(T-2) = 0.5 × 0.7 = 0.35. The total weight of the rule set is calculated by adding the weights of the individual rules. Based on the weight of the rule set, a decision is made according to a threshold set by an expert user.
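
A minimal sketch of this weighting scheme; rule 2 and its weights are taken from the example above, while the other rule, its weight and the decision threshold are invented stand-ins.

    # Each rule is a list of (fact, weight) pairs; a rule's weight is the product
    # of the weights of its facts, and the rule-set weight is the sum over rules.
    rules = [
        [("DOLLAR_WEAK(T-1)", 0.6)],                                # invented rule 1
        [("STERLING_ADD(T-1)", 0.5), ("BOND_STRONG(T-2)", 0.7)],    # rule 2 from the text
    ]

    def rule_weight(rule):
        w = 1.0
        for _, weight in rule:
            w *= weight
        return w

    total = sum(rule_weight(r) for r in rules)      # 0.6 + 0.35 = 0.95
    THRESHOLD = 0.8                                 # assumed expert-set threshold
    decision = "rate expected to rise" if total >= THRESHOLD else "no signal"
    print(round(total, 2), decision)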

As probabilistic rules are extensions of first-order rules (whenever all the facts are either true or false, probabilistic rules degenerate to first order rules) they also inherit all the strengths of rule classifiers. That is, probabilistic rules provide the advantage of very comprehensible models and generate results relatively quickly.