Strategies of Data Mining
Data mining is an effective set of analysis tools and techniques used in the decision support process. However, misconceptions about the role that data mining plays in decision support solutions can lead to confusion about and misuse of these tools and techniques.
Introduction
Databases were developed with an emphasis on obtaining data; more data meant more information. Professionals trained in decision support analysis analyzed such data and discovered information in the form of patterns and rules hidden in the relationships between its various attributes. This assisted the business decision process by providing feedback on past business actions and helping to guide future decisions. However, the volume of captured data has grown to the point that there is too much data from which to discover information easily. For example, sampling, a technique designed to reduce the total amount of data to be analyzed for meaningful information, fails because even a minimally representative statistical sample can contain millions of records.
In the business world, the current emphasis on data warehouses and online analytical processing (OLAP) reflects this need to convert huge volumes of data into meaningful information. This information can then be converted into meaningful business actions, which provide more data to be converted into more information, and so on in a cyclical manner, creating a "closed loop" in the decision support process. Ideally, this "closed loop" behavior is the key to such decision support strategies, iteratively improving the efficacy of business decisions.
Unfortunately, because most businesses implement only the data warehouse and OLAP portions of this closed loop, they fail to secure true decision support. For example, a business can obtain customer demographic data and account data from an online transaction processing (OLTP) database, clean and transform the data, load the prepared data into a data warehouse, aggregate the data warehouse data into OLAP cubes for presentation, and then make such data available through data marts. None of this provides the necessary insight into why certain customers close their accounts or why certain accounts purchase certain services or products. Without this information, the business actions that attempt to reduce the number of closed accounts or improve sales of certain services or products can be ineffectual or even cause more harm than good.
It is frustrating to know that the information you want is available but only if the right questions are asked of the data warehouse or OLAP cube. The data mining tools in Microsoft® SQL Server™ 2000 Analysis Services provide a way for you to ask the right questions about data and, used with the right techniques, give you the tools needed to convert the hidden patterns and rules in such data into meaningful information.
Another use for data mining is supplying operational decision support. In the closed loop decision support approach, the time between the discovery of information and the resulting business decision can be weeks or months, and the approach is typically used to provide long-term business decision support. Operational decision support, by contrast, can happen in minutes and is used to provide short-term or immediate decision support on a very small set of cases, or even on a single case.
For example, a financial client application can provide real-time analysis for customer support representatives in a banking call center. The client application, by using a data mining model to analyze the demographic information of a prospective customer, can determine the best list of products to cross-sell to the customer. This form of data mining is becoming more and more common as standardized tools, such as Analysis Services, become more accessible to users.
What Is Data Mining?
Simply put, data mining is the process of exploring large quantities of data in order to discover meaningful information about the data, in the form of patterns and rules. In this process, various forms of analysis can be used to discern such patterns and rules in historical data for a given business scenario, and the information can then be stored as an abstract mathematical model of the historical data, referred to as a data mining model. After a data mining model is created, new data can be examined through the model to see if it fits a desired pattern or rule. From this information, actions can be taken to improve results in the given business scenario.
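The idea of storing patterns from historical data as a model, and then examining new data through that model, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the "tenure" and "closed" attributes, and the sample cases are hypothetical, and the conditional-frequency table used here is a deliberately simple stand-in, not one of the algorithms provided by Analysis Services.

```python
from collections import Counter, defaultdict


def train_model(cases, attribute, target):
    """Summarize historical cases as P(target value | attribute value)."""
    counts = defaultdict(Counter)
    for case in cases:
        counts[case[attribute]][case[target]] += 1
    model = {}
    for value, outcomes in counts.items():
        total = sum(outcomes.values())
        model[value] = {outcome: n / total for outcome, n in outcomes.items()}
    return model


def predict(model, case, attribute):
    """Return the most likely outcome and its probability for a new case."""
    distribution = model.get(case[attribute])
    if distribution is None:
        return None, 0.0  # attribute value never seen in the historical data
    outcome = max(distribution, key=distribution.get)
    return outcome, distribution[outcome]


# Hypothetical historical cases: did the account close?
historical = [
    {"tenure": "under_1_year", "closed": True},
    {"tenure": "under_1_year", "closed": True},
    {"tenure": "under_1_year", "closed": False},
    {"tenure": "over_5_years", "closed": False},
    {"tenure": "over_5_years", "closed": False},
]

model = train_model(historical, "tenure", "closed")

# A new case is examined through the model.
prediction = predict(model, {"tenure": "under_1_year"}, "tenure")
print(prediction)
```

The model here is nothing more than a table of observed frequencies, which is why its output should be read as an educated guess about the new case rather than a guaranteed result.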
Data mining is not a "black box" process in which the data miner simply builds a data mining model and watches as meaningful information appears. Although Analysis Services removes much of the mystery and complexity of the data mining process by providing data mining tools for creating and examining data mining models, these tools work best on well-prepared data to answer well-researched business scenarios. The GIGO (garbage in, garbage out) law applies more to data mining than to any other area in Analysis Services. Quite a bit of work, including research, selection, cleaning, enrichment, and transformation of data, must be performed first if data mining is to truly supply meaningful information.
Data mining and data warehouses complement each other. Well-designed data warehouses have already handled the data selection, cleaning, enrichment, and transformation steps that are also typically associated with data mining. Similarly, data mining improves the process of data warehousing: it reveals which data elements are more meaningful than others in terms of decision support, which in turn improves the data cleaning and transformation steps that are so crucial to good data warehousing practices.
Data mining does not guarantee the behavior of future data through the analysis of historical data. Instead, data mining is a guidance tool, used to provide insight into the trends inherent in historical information.
For example, a data warehouse, without OLAP or data mining, can easily answer the question, "How many products have been sold this year?" An OLAP cube using data warehouse data can answer the question, "What has been the difference in volume of gross sales for products for the last five years, broken down by product line and sales region?" more efficiently than the data warehouse itself. Both can deliver a solid, discrete answer based on historical data. However, questions such as "Which sales regions should be targeted for telemarketing instead of direct mail?" or "How likely is it that a particular product line would sell well, and in which sales regions?" are not easily answered through data warehouses or OLAP. These questions call for an educated guess about future trends. Data mining provides educated guesses, not answers, to such questions through analysis of existing historical data.
The difficulty typically encountered when using a data mining tool such as Analysis Services to create a data mining model is that too much emphasis is placed on obtaining a data mining model; very often, the model itself is treated as the end product. Although you can peruse the structure of a data mining model to understand more about the patterns and rules that constitute your historical data, the real power of data mining comes from using it as a predictive vehicle with current data. You can use the data mining model as a lens through which to view current data, with the ability to apply the patterns and rules stored in the model to predict trends in such data. The revealed information can then be used to make informed business decisions. Furthermore, the feedback from such decisions can then be compared against the predicted result of the data mining model to further improve the patterns and rules stored in the model itself, which can then be used to more accurately predict trends in new data, and so on.
A data mining model is not static; it is an opinion about data, and as with any opinion, its viewpoint can be altered as new, known data is introduced. Part of the "closed loop" approach to decision support is that all of the steps within the loop can be increasingly improved as more information is known, and that includes data mining models. Data mining models can be retrained with more and better data as it becomes available, further increasing the performance of such a model.
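A small Python sketch can show how a model's "opinion" shifts as new, known data is introduced. The CountModel class, the channel names, and the outcomes below are hypothetical and exist only for illustration; real data mining algorithms retrain in more sophisticated ways than adding to a table of counts.

```python
from collections import Counter, defaultdict


class CountModel:
    """Toy model: remembers how often each outcome followed each attribute value."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, cases):
        """Add known (attribute_value, outcome) cases to the model."""
        for attribute_value, outcome in cases:
            self.counts[attribute_value][outcome] += 1

    def predict(self, attribute_value):
        """Return the most frequent known outcome, or None if the value is unseen."""
        outcomes = self.counts.get(attribute_value)
        if not outcomes:
            return None
        return outcomes.most_common(1)[0][0]


model = CountModel()

# Initial training on a small set of historical cases.
model.train([
    ("direct_mail", "no_response"),
    ("direct_mail", "no_response"),
    ("direct_mail", "purchase"),
])
before = model.predict("direct_mail")

# Retraining with newer known results changes the model's opinion.
model.train([
    ("direct_mail", "purchase"),
    ("direct_mail", "purchase"),
    ("direct_mail", "purchase"),
])
after = model.predict("direct_mail")

print(before, "->", after)
```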
Closed Loop Data Mining
Closed loop data mining supports long-term business decisions by analyzing historical data to provide guidance not just on the immediate needs of business intelligence, but also on improving the entire decision support process.
The following diagram illustrates the analysis flow used in closed loop data mining.
In closed loop data mining, the analysis improves both the overall quality of data within the decision support process and the quality of long-term business decisions. Input for the data mining model is taken primarily from the data warehouse; Analysis Services also supports input from multidimensional data stores. The information gained from employing the data mining model is then used, either directly by improving data quality or indirectly by altering the business scenarios which supply data, to affect incoming data from the OLTP data store.
For example, one action involving closed loop data mining is the grooming and correction of data based on the patterns and rules discovered within data mining feedback. As mentioned earlier, many of the processes used to prepare data for data mining are also used by data warehousing solutions. Consequently, problems found in data during data mining generally reflect problems in the data in the data warehouse, and the feedback provided by data mining can improve data cleaning and transformation for the whole decision support process, including data warehousing and OLAP.
Closed loop data mining can take either a continuous view, in which data is continually analyzed against a data mining model to provide constant feedback on the decision support process, or a one-time view, in which a one-time result is generated and recommended actions are performed based on the provided feedback. Decisions involving closed loop data mining can take time, and time can affect the reliability of data mining model feedback. When constructing a data mining model for closed loop data mining, you should consider the time needed to act on information. Discovered information can become stale if acted on months after such information is reported.
Also, the one-time result process can be performed periodically, with predictive results stored for later analysis. This is one method of discovering significant attributes in data; if the predictive results differ widely from actual results over a certain period of time, the attributes used to construct the data mining model may be in question and can themselves be analyzed to discover relevance to actual data.
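The periodic comparison of stored predictions against actual results can be sketched as follows. The period labels, the sample values, and the 0.7 accuracy threshold are assumptions chosen for illustration; an appropriate threshold depends on the business scenario.

```python
def accuracy(predicted, actual):
    """Fraction of cases in which the stored prediction matched the actual result."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and of equal length")
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


def flag_drift(periods, threshold=0.7):
    """Return the labels of periods whose accuracy falls below the threshold."""
    return [
        label
        for label, predicted, actual in periods
        if accuracy(predicted, actual) < threshold
    ]


# Hypothetical stored predictions and the results later observed for each period.
periods = [
    ("2000-Q1", [True, False, False, True], [True, False, False, True]),
    ("2000-Q2", [True, False, False, True], [True, False, True, True]),
    ("2000-Q3", [True, False, False, True], [False, True, True, True]),
]

drifting = flag_drift(periods)
print(drifting)
```

A flagged period does not say which attribute is at fault; it only signals that the attributes used to construct the model deserve a closer look.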
Closed loop data mining can also supply the starting point for operational data mining; the same models used for closed loop data mining can also be used to support operational data mining.
Operational Data Mining
Operational data mining is the next step for many enterprise decision support solutions. Once closed loop data mining has progressed to the point where a consistent, reliable set of data mining models can be used to provide positive guidance to business decisions, this set of data mining models can now be used to provide immediate business decision support feedback in client applications.
The following diagram highlights the analysis flow of operational data mining.
As with closed loop data mining, input for the data mining model is taken from data warehousing and OLTP data stores. However, the data mining model is then used to perform immediate analysis on data entered by client applications. Either the user of the client application or the client application itself then acts upon the analysis information, with the resulting data being sent to the OLTP data store.
For example, financial applications may screen potential credit line customers by running the demographic information of a single customer, received by a customer service representative over the telephone, against a data mining model. For an existing customer, the model could be used to determine the likelihood of the customer purchasing other products the financial institution offers (a process known as cross-selling). For a new customer, the model could indicate the likelihood of the customer being a bad credit risk.
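The single-case analysis described above might look like the following Python sketch. Everything in it is hypothetical: the product names, the attribute names, and the purchase rates are invented for illustration, and averaging the matching rates is a deliberately naive scoring rule standing in for a real data mining model.

```python
# Hypothetical purchase rates per (attribute, value), as if derived from historical data.
PRODUCT_RATES = {
    "credit_card": {
        ("age_band", "18-30"): 0.40,
        ("homeowner", True): 0.10,
        ("homeowner", False): 0.30,
    },
    "home_equity_loan": {
        ("age_band", "18-30"): 0.05,
        ("homeowner", True): 0.55,
        ("homeowner", False): 0.00,
    },
    "savings_account": {
        ("age_band", "18-30"): 0.20,
        ("homeowner", True): 0.20,
        ("homeowner", False): 0.20,
    },
}


def rank_products(case, product_rates):
    """Score one customer case against every product and rank best first."""
    scores = {}
    for product, rates in product_rates.items():
        matched = [rates[item] for item in case.items() if item in rates]
        # Naive score: the mean of the rates for the attributes that matched.
        scores[product] = sum(matched) / len(matched) if matched else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Demographic information for a single customer, taken over the telephone.
caller = {"age_band": "18-30", "homeowner": True}
ranking = rank_products(caller, PRODUCT_RATES)

for product, score in ranking:
    print(f"{product}: {score:.2f}")
```

Because only one case is scored against a model that already exists, the work per call is small, which is what makes this style of analysis practical in a call center application.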
Operational data mining differs from the more conventional closed loop data mining approach because it does not necessarily act on data already gathered by a data warehousing or other archival storage system. Operational data mining can occur on a real-time basis, and can be supported as part of a custom client application to complement the decision support gathered through closed loop data mining.
Client-based data mining models, duplicated from server-based data mining models and trained using a standardized training case set, are an excellent approach for supporting operational data mining. For more information about how to construct client-based data mining models, see "Creating Data Mining Models" in this chapter.
The Data Mining Process
Analysis Services provides a set of easy-to-use, robust data mining tools. To make the best use of these tools, you should follow a consistent data mining process, such as the one outlined below:
• Data Selection
The process of locating and identifying data for data mining purposes.
• Data Cleaning
The process of inspecting data for physical inconsistencies, such as orphan records or required fields set to null, and logical inconsistencies, such as accounts with closing dates earlier than starting dates.
• Data Enrichment
The process of adding information to data, such as creating calculated fields or adding external data for data mining purposes.
• Data Transformation
The process of transforming data physically, such as changing the data types of fields, and logically, such as increasing or decreasing granularity, for data mining purposes.
• Training Case Set Preparation
The process of preparing a case set for data mining. This may include secondary transformation and extract query design.
• Data Mining Model Construction
The process of choosing a data mining model algorithm and tuning its parameters, then running the algorithm against the training case set to construct a data mining model.
• Data Mining Model Evaluation
The process of evaluating the created data mining model against a case set of test data, in which a second training data set, also called a holdout set, is viewed through the data mining model and the resulting predictive analysis is then compared against the actual results of the second training set to determine predictive accuracy.
• Data Mining Model Feedback
After the data mining model has been evaluated, the data mining model can be used to provide analysis of unknown data. The resulting analysis can be used to supply either operational or closed loop decision support.
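The evaluation step in the list above, in which a holdout set is viewed through the model and the predictions are compared against known results, can be sketched in Python. The function names, the "plan" and "churned" attributes, the 25 percent holdout fraction, and the majority-outcome model are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def split_cases(cases, holdout_fraction=0.25, seed=0):
    """Shuffle the cases and set aside a fraction of them as the holdout set."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    holdout_size = int(len(shuffled) * holdout_fraction)
    return shuffled[holdout_size:], shuffled[:holdout_size]


def train(cases, attribute, target):
    """Toy model: map each attribute value to its most frequent outcome."""
    by_value = defaultdict(Counter)
    for case in cases:
        by_value[case[attribute]][case[target]] += 1
    return {value: outcomes.most_common(1)[0][0] for value, outcomes in by_value.items()}


def evaluate(model, holdout, attribute, target):
    """Predictive accuracy: share of holdout cases the model predicts correctly."""
    if not holdout:
        raise ValueError("holdout set is empty")
    correct = sum(1 for case in holdout if model.get(case[attribute]) == case[target])
    return correct / len(holdout)


# Hypothetical case set in which the plan fully determines the outcome.
cases = [{"plan": "basic", "churned": True}] * 10 + [{"plan": "premium", "churned": False}] * 10

training_set, holdout_set = split_cases(cases)
model = train(training_set, "plan", "churned")
holdout_accuracy = evaluate(model, holdout_set, "plan", "churned")
print(len(training_set), len(holdout_set), holdout_accuracy)
```

The holdout cases are never shown to the model during construction, so the accuracy measured on them is a fairer estimate of how the model will behave on unknown data than accuracy measured on the training case set.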
If you are modeling data from a well-designed data warehouse, the first four steps are generally done for you as part of the process used to populate the data warehouse. However, even data warehousing data may need additional cleaning, enrichment, and transformation, because the data mining process takes a slightly different view of data than either data warehousing or OLAP processes.
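As an illustration of the kind of additional cleaning that may still be needed, the following Python sketch checks account records for the physical and logical inconsistencies named in the data cleaning step: required fields set to null, orphan records, and closing dates earlier than starting dates. The field names and sample records are hypothetical.

```python
from datetime import date


def find_inconsistencies(accounts, customer_ids, required_fields):
    """Return (account_id, problem) pairs for physical and logical inconsistencies."""
    problems = []
    for account in accounts:
        account_id = account.get("account_id")
        # Physical inconsistency: a required field set to null.
        for field in required_fields:
            if account.get(field) is None:
                problems.append((account_id, f"required field {field!r} is null"))
        # Physical inconsistency: an orphan record with no matching customer.
        customer_id = account.get("customer_id")
        if customer_id is not None and customer_id not in customer_ids:
            problems.append((account_id, "orphan record: unknown customer_id"))
        # Logical inconsistency: closing date earlier than starting date.
        opened, closed = account.get("opened"), account.get("closed")
        if opened is not None and closed is not None and closed < opened:
            problems.append((account_id, "closing date earlier than starting date"))
    return problems


# Hypothetical account records and known customer identifiers.
accounts = [
    {"account_id": 1, "customer_id": 10, "opened": date(1999, 1, 5), "closed": None},
    {"account_id": 2, "customer_id": 99, "opened": date(1999, 3, 1), "closed": None},
    {"account_id": 3, "customer_id": 11, "opened": date(2000, 6, 1), "closed": date(2000, 1, 15)},
    {"account_id": 4, "customer_id": 10, "opened": None, "closed": None},
]

problems = find_inconsistencies(accounts, {10, 11}, ("customer_id", "opened"))
for account_id, problem in problems:
    print(account_id, problem)
```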