Dr. Bjarne Berg DRAFT

DATA MINING ISSUES & USES

(CONDENSED ARTICLE SUMMARY)

Data mining is an emerging technology that has created enormous value for companies and organizations. However, it has been argued that most companies are still not fully exploiting the data assets within their domain. “Organizations may have terabytes of data, but very little information” is a common complaint. While the benefits of data mining are many, the following discussion centers primarily on business problems.

Data mining has often been viewed as the solution for all these ‘ills’. From a business perspective, one benefit of data mining is the ability to conduct market basket analysis: clustering techniques that demonstrate which products are sold as a group, based on characteristics drawn from dimensions such as time, demographics and sociographic attributes. Another data mining technique is macroeconomic analysis, correlating financial and operational data with economic conditions in society at large: do interest rates or unemployment levels have an impact on our sales, and if so, are there different levels of impact in certain regions or market segments?
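To make the market basket idea concrete, the following is a minimal Python sketch (the baskets and product names are hypothetical) that counts product co-occurrences and reports support, confidence and lift for each pair; a production tool would instead run a full association rule algorithm such as Apriori over millions of transactions.

```python
from itertools import combinations
from collections import Counter

# Hypothetical transaction data: each basket is a set of product IDs.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "diapers", "beer"},
    {"bread", "milk"},
    {"butter", "milk"},
]

n = len(baskets)
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

# Report support, confidence and lift for each co-occurring pair.
for (a, b), joint in pair_counts.items():
    support = joint / n                          # P(A and B)
    confidence = joint / item_counts[a]          # P(B | A)
    lift = confidence / (item_counts[b] / n)     # strength vs. pure chance
    print(f"{a} -> {b}: support={support:.2f}, "
          f"confidence={confidence:.2f}, lift={lift:.2f}")
```

A lift well above 1.0 flags pairs that sell together more often than chance would predict, which is the signal a merchandiser acts on.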

In addition, organizations can do product segmentation and customer segmentation based on sales volume, product sales and profitability. These segmentations can be based on real boundaries instead of arbitrary parameters (such as how many miles you fly on a plane): what truly drives profitability? Is it sales volume, shipment costs, the costs of refunds, credit and manufacturing, or is it reductions in setup costs and marketing? We can also use data mining to examine the relative strength of the relationships between these variables, and to determine where to concentrate our resources in building value trees that modify the company’s behavior. One can argue that if no behavior modification occurs, the data mining effort has only historical value and no real benefit to the organization.
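As a sketch of segmentation on real boundaries, the following uses scikit-learn’s KMeans to cluster customers on hypothetical profitability drivers rather than an arbitrary threshold; the column choices and values are illustrative assumptions, not a prescribed model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-customer profitability drivers (illustrative values):
# columns = [volume_sales, shipment_cost, refund_cost, setup_cost]
X = np.array([
    [120_000,  8_000,   500, 2_000],
    [ 45_000,  6_500, 2_200, 1_800],
    [310_000, 12_000,   900, 2_500],
    [ 52_000,  7_000, 2_500, 1_900],
    [290_000, 11_500,   700, 2_400],
])

# Standardize so no single driver dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# Segment customers on observed behavior, not an arbitrary cutoff.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_)   # cluster assignment per customer
```

The boundaries the algorithm finds come from the data itself, which is exactly the contrast with frequent-flyer-style tiers drawn at arbitrary mileage cutoffs.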

We can also build balanced scorecards for organizational units and the consolidated enterprise, based on the relative contribution of each variable to the profitability of the enterprise. These models can be leveraged in the creation of predictive models based on regression, decision trees, neural networks and several other powerful methods. The core benefit is the ability to produce better forecasts, budgets and plans, thereby creating a truly intelligent enterprise that extracts information from its vast amounts of data.
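A minimal sketch of two of the predictive methods named above, regression and a decision tree, fit on hypothetical macroeconomic and marketing inputs; a real forecasting effort would use far more history and out-of-sample validation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical history: [interest_rate_pct, unemployment_pct, marketing_spend]
X = np.array([
    [4.5, 5.0, 10_000],
    [4.0, 5.5, 12_000],
    [3.5, 6.0,  9_000],
    [3.0, 6.5, 11_000],
    [2.5, 7.0, 13_000],
])
y = np.array([500_000, 520_000, 480_000, 510_000, 540_000])  # quarterly sales

# Fit two of the methods named above on the same drivers.
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
reg = LinearRegression().fit(X, y)

# Forecast the next quarter under assumed economic conditions.
next_quarter = np.array([[2.0, 7.5, 12_500]])
print(tree.predict(next_quarter), reg.predict(next_quarter))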

A logical next step, once this has been established, is to create behavioral scores and models for targeted marketing to individuals, as well as predictive models for credit management, material consumption and cash flow management. An additional potential benefit of the behavioral models is the ability to detect behavioral changes among our customers and to adjust marketing and the product mix in a far more targeted and faster manner than most companies do today.
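One common way to build such a behavioral score, sketched here with an assumed logistic regression on made-up behavior features, is to model the probability of an event such as default or churn; the features and labels below are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical customer behavior features:
# [months_active, avg_monthly_spend, late_payments]
X = np.array([
    [24, 250, 0],
    [ 6,  80, 3],
    [36, 400, 0],
    [12, 120, 2],
    [48, 310, 1],
    [ 3,  60, 4],
])
y = np.array([0, 1, 0, 1, 0, 1])  # 1 = defaulted / churned

model = LogisticRegression(max_iter=1000).fit(X, y)

# The behavioral score: predicted probability of the event for a new customer.
new_customer = np.array([[18, 150, 1]])
print(model.predict_proba(new_customer)[0, 1])
```

Scoring the whole customer base periodically and watching for score drift is one way to detect the behavioral changes mentioned above.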

A potentially powerful use of data mining is the ability to create and leverage dynamic pricing models such as those used by airlines, hotels and car rental agencies. These are sophisticated dynamic models that look at inventory levels, seasonality and historical sales trends, based on volumetric and sales performance from past situations, to dynamically assign prices to individual product segments or products, thereby maximizing revenues and profits. In addition, the models can reduce prices when demand is soft and thereby take full advantage of real market conditions as they unfold, instead of relying on static pricing based on assumptions that may no longer hold true.
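The sketch below is a toy illustration of the dynamic pricing idea, not an airline-grade revenue management system: an assumed rule that scales a base price by the ratio of expected demand to remaining inventory, adjusted for seasonality and bounded by a floor and a ceiling.

```python
def dynamic_price(base_price, inventory, expected_demand, seasonality=1.0,
                  floor=0.7, ceiling=1.5):
    """Scale price up when demand outstrips inventory, down when it is soft.

    A toy pricing rule for illustration only: the ratio of expected
    demand to remaining inventory drives the adjustment, capped
    between a floor and a ceiling multiplier.
    """
    pressure = expected_demand / max(inventory, 1)   # > 1 means scarcity
    multiplier = min(max(pressure * seasonality, floor), ceiling)
    return round(base_price * multiplier, 2)

# Scarce inventory in high season: price rises toward the ceiling (150.0).
print(dynamic_price(100.0, inventory=20, expected_demand=35, seasonality=1.1))
# Soft demand in low season: price drops toward the floor (70.0).
print(dynamic_price(100.0, inventory=80, expected_demand=30, seasonality=0.9))
```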

The key to this type of dynamic pricing is the ability to take advantage of price elasticity in markets that previously may have been treated as a single homogeneous market, due to a lack of knowledge of how the segments behaved. For example, the lingerie chain Victoria’s Secret used to have a uniform product mix in each store. However, after some basic data mining of sales trends, the company discovered that in the Northern states sales were predominantly of black products, while the Southern states had a statistically significant preference for white products. Further examination found that those markets were highly price inelastic for those products. As a result, the company not only changed the product mix in the different markets, it also changed pricing based on the product preferences and the price elasticity. The core point is that the company had not even known that these markets were different, nor that the price sensitivity of the products differed in each market.
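Price elasticity itself is straightforward to compute once segment-level data exists; the following sketch uses the standard arc (midpoint) elasticity formula on hypothetical numbers, where an absolute value well below 1 indicates an inelastic segment like the one described above.

```python
def arc_elasticity(q1, q2, p1, p2):
    """Arc (midpoint) price elasticity of demand: %change in quantity
    divided by %change in price, each measured against the midpoint."""
    pct_dq = (q2 - q1) / ((q1 + q2) / 2)
    pct_dp = (p2 - p1) / ((p1 + p2) / 2)
    return pct_dq / pct_dp

# Hypothetical numbers: a 10% price increase barely moves unit sales,
# so |elasticity| is well below 1 and the segment is price inelastic.
print(arc_elasticity(q1=1000, q2=980, p1=20.00, p2=22.00))  # about -0.21
```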

A major issue with all these statistical models is the lack of clean, integrated data in a single location. Companies used to have hundreds of systems that captured different parts of their transactions, and the data was often sparsely populated with accurate values. Since the ERP revolution of the 1990s, many of these issues are better addressed, but significant data cleansing issues remain for longitudinal data and data from multiple source systems. Some data mining tools ‘solve’ this by simply substituting missing data elements with averages, while others are more sophisticated and can substitute the missing points with randomly generated data based on the mean and standard deviation of the existing data points in the sample or the population. The rationale is that the data that is present is usually rather accurate, but some data points are missing.
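Both substitution strategies are easy to illustrate; the sketch below (with made-up values) fills missing points first with the sample mean and then with random draws from a normal distribution fitted to the observed points, showing why mean-filling understates the data’s spread.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
s = pd.Series([12.0, np.nan, 15.0, 14.0, np.nan, 13.0, 16.0])

# Naive approach: substitute missing points with the sample mean.
mean_filled = s.fillna(s.mean())

# More sophisticated: draw replacements from a normal distribution
# fitted to the observed points, preserving the sample's variability.
observed = s.dropna()
draws = rng.normal(observed.mean(), observed.std(), size=s.isna().sum())
random_filled = s.copy()
random_filled[s.isna()] = draws

print(mean_filled.std(), random_filled.std())  # mean-fill shrinks the spread
```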

However, sometimes the problem is that the data cannot be trusted due to errors. In that case, the data can be assigned quality confidence levels, and the resulting findings of the statistical methods applied to the data sets must then be interpreted through the ‘lens’ of those quality confidence levels.
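One way to operationalize such confidence levels, sketched here as an assumption rather than a standard prescription, is to use them as observation weights so that low-confidence records influence a model less; scikit-learn’s sample_weight argument supports this directly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 8.1, 30.0])   # last record looks erroneous

# Hypothetical quality confidence per record, on a 0..1 scale; the
# suspect record gets little weight and so barely influences the fit.
quality = np.array([1.0, 1.0, 1.0, 1.0, 0.1])

trusted = LinearRegression().fit(X, y, sample_weight=quality)
naive = LinearRegression().fit(X, y)
print(trusted.coef_, naive.coef_)  # the unweighted slope is distorted
```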

The overarching issue with data mining is the need for a repository of detailed, granular data. The detail is needed because the variance of aggregated data tends to understate the true variability of the source data, thereby invalidating many of the assumptions of commonly used statistical methods (as the short simulation below illustrates). This low-level data often amounts to terabytes of granular records that carry significant maintenance costs. These issues are compounded by the need for robust statistical tools that often have very high software licensing costs, and by the need for highly skilled labor that commands premium salaries. The risk of using unskilled labor is that such analysts often do not understand which techniques are appropriate for a given problem, and even when the right technique is selected, they tend not to understand the assumptions of the technique and the risks to validity and reliability when those assumptions are violated. As a result, unskilled data miners tend to report irrelevant and insignificant findings due to a lack of statistical skills. Because of these issues and high costs, data mining at any significant level has so far been a tool primarily for large corporations engaged in high-volume transactional activities, or for micro- and macroeconomists at banks, insurance companies, universities and in government.
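The aggregation effect is easy to demonstrate: in the simulation below (with assumed daily sales figures), the standard deviation of monthly averages is roughly the daily standard deviation divided by the square root of the days per month, so models fit on aggregates see far less variability than truly exists.

```python
import numpy as np

rng = np.random.default_rng(0)

# A year of simulated daily sales with real day-to-day variability.
daily = rng.normal(loc=10_000, scale=2_000, size=365)

# Aggregating to monthly averages smooths that variability away:
# the std of the means shrinks by roughly sqrt(days per month).
monthly = daily[:360].reshape(12, 30).mean(axis=1)

print(f"daily std:   {daily.std():,.0f}")    # close to 2,000
print(f"monthly std: {monthly.std():,.0f}")  # far smaller
```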