CHAPTER 12

DATA MINING: KNOWING THE UNKNOWN

TEST YOUR UNDERSTANDING

1.  Why is DM a process and not an end in itself? Explain.

Although DM can produce knowledge and discover new patterns, it cannot extract meaning from them. Human intervention is still needed to interpret the results and act on them.

2.  Describe the differences and similarities among DM, machine learning, and business intelligence. How are they related?

·  Business intelligence (BI) is a global term for all processes, techniques, and tools that support business decision making based on information technology. The approaches can range from a simple spreadsheet to an advanced decision support system. Data mining is a component of BI.

·  The objective of data mining is to optimize the use of available data and reduce the risk of making wrong decisions. Data mining is a business process concerned with finding understandable knowledge from very large real-world databases. Statistics and machine learning are considered to be the analytical foundations upon which DM was developed.

·  Machine learning (ML) has focused on making computers learn things for themselves. Machine learning is the automation of the learning process that is a crucial function in any intelligent system. Its methodology includes learning from examples, reinforcement learning, and supervised or unsupervised learning. ML is a scientific discipline considered to be a sub-field of artificial intelligence.

3.  “DM can be thought of as a form of advanced statistical techniques.” Do you agree with this statement? Why or why not?

DM is not simply a form of advanced statistical techniques. DM does use statistical techniques to discover hidden facts contained in databases and to find patterns and subtle relationships, but its overall function is broader and more sophisticated, because it must also infer rules that allow the prediction of future results. Hence, statistical techniques are one of many tools that DM uses in performing its tasks.

4.  “DM is a tool to develop intelligent systems.” Define intelligence, explain how systems could have intelligent behavior, and discuss this statement.

According to the Oxford dictionary, intelligence is the power to learn, understand, and know. This definition applies to humans. As the processing power of computers grew, many scientists started to claim that computers could do anything human beings could do, and sometimes do it better or faster. Turing defined intelligent behavior of a system as the ability to imitate a human perfectly. No machine is able to pass this test. However, machines can now perform some intelligent tasks that help humans solve their problems. DM, for example, can extract hidden patterns from large sets of data. Humans cannot do this because of their poor computational efficiency. Even so, the knowledge that DM captures or discovers would remain useless without the direct intervention of humans to understand its meaning and take action.

5.  Describe the differences between OLAP and DM. When would you use each tool?

OLAP: Online analytical processing tools give the user the capability to perform multidimensional analysis of the data. This approach uses computing power and graphical interfaces to manipulate data easily and quickly at the convenience of the user. The focus is showing data along several dimensions. The manager should be able to drill down into the ultimate detail of a transaction and zoom up for a general view.

DM: Using a combination of machine learning, statistical analysis, modeling techniques, and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit risk analysis.
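As a minimal sketch of the OLAP side of this comparison, the following Python fragment shows a roll-up (the general view) and a drill-down (the detail behind one total) over a tiny fact table. The sales figures, the dimension names, and the `aggregate` helper are hypothetical and chosen only for illustration; a real OLAP tool would do this against a multidimensional store.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, quarter, amount).
SALES = [
    ("East", "Books", "Q1", 120.0),
    ("East", "Books", "Q2", 80.0),
    ("East", "Music", "Q1", 50.0),
    ("West", "Books", "Q1", 200.0),
    ("West", "Music", "Q2", 70.0),
]

# Position of each dimension inside a fact tuple.
DIMENSIONS = {"region": 0, "product": 1, "quarter": 2}


def aggregate(facts, dims, filters=None):
    """Sum the amount grouped by the chosen dimensions.

    `filters` maps a dimension name to a required value (a "slice").
    """
    filters = filters or {}
    totals = defaultdict(float)
    for row in facts:
        # Skip rows that fall outside the requested slice.
        if any(row[DIMENSIONS[d]] != v for d, v in filters.items()):
            continue
        key = tuple(row[DIMENSIONS[d]] for d in dims)
        totals[key] += row[3]
    return dict(totals)


# Roll up: the general view, one total per region.
by_region = aggregate(SALES, ["region"])

# Drill down: inside the East region, break the total out by product and quarter.
east_detail = aggregate(SALES, ["product", "quarter"], {"region": "East"})
```

Note that the user decides which dimensions to inspect; nothing here searches for patterns on its own, which is the gap DM is meant to fill.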

6.  What are the limitations of OLAP? How is DM able to overcome them?

OLAP has two limitations:

·  It does not find patterns automatically.

·  It does not have powerful analytical techniques.

DM overcomes these limitations by using a combination of machine learning, statistical analysis, modeling techniques, and database technology.
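To illustrate what "finding patterns automatically" can mean, here is a small Python sketch that scans market baskets for item pairs that occur together often and then measures the confidence of one rule. The baskets, the 0.4 support threshold, and the function names are assumptions made for this example; production DM tools use far more scalable algorithms.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets; each set is one transaction.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]


def frequent_pairs(baskets, min_support):
    """Return {pair: support} for item pairs that meet min_support."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


def confidence(baskets, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


pairs = frequent_pairs(BASKETS, min_support=0.4)
rule_conf = confidence(BASKETS, "butter", "bread")
```

Unlike an OLAP query, the analyst does not name the pairs in advance; the procedure enumerates candidates and keeps those that pass the threshold.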

7.  What is the role of DM in e-business?

DM applications for CRM are integrated with e-sales functions, in order to create the customer-centric firm. DM applications are the first line in understanding the customer and an integral key to segmenting the market.

8.  Describe, with examples, when you would use predictive DM and when you would use descriptive DM.

The goal of a descriptive DM task is to understand, explain, or discover relationships among data sets. It looks for similarity and dissimilarity in data. Segmenting customers into groups with similar purchasing habits is an example of descriptive DM. In contrast, a predictive task is concerned with future behavior. This task is time driven. Predicting company bankruptcy or customer response to a marketing campaign are examples of predictive DM.

9.  Explain how DM is used in the health sector and in the telecommunications industry.

In the health-care business: Keeping pace with the rate of technological and medical advancement provides a significant challenge. Cost is a constant issue in this ever-changing market. Early DM activities have focused on financially oriented applications. Predictive models have been applied to predict length of stay, total charges, and even mortality.

In the telecommunications industry: Keeping pace with the rate of technological change is a significant challenge for businesses throughout the telecommunications industry. In addition, deregulation is changing the business landscape, resulting in competition from a wide range of service providers. Finding and retaining customers is important to telecommunications providers. In addition to customer profiling, DM applications for subscription fraud and credit applications are used throughout the industry. Concerns about privacy and security are likely to result in DM applications targeted to these areas.

10.  Explain how companies are using DM to understand their customers’ behavior and predict their intentions.

Data mining (technologies and techniques for recognizing and tracking patterns within data) helps businesses sift through layers of seemingly unrelated data for meaningful relationships, so that they can anticipate, rather than simply react to, customer needs.

11.  Describe the major pitfalls faced by companies when implementing DM solutions.

Data-mining project managers commonly run into problems such as the following:

·  Insufficient understanding of business needs

·  Careless handling of data. Data mishandling errors include the following:

     ·  Over-quantifying data

     ·  Miscoding data

     ·  Analyzing without taking precautions against sampling errors

     ·  Loss of precision due to improper rounding of data values

     ·  Incorrectly handling missing values

·  Improperly validating the data-mining model

KNOWLEDGE EXERCISES

1.  Discuss what types of industries can best benefit from DM. Which ones cannot?

Hint: Think of the ones having the most transactions and accessible data.

The financial services, health-care, and telecommunications industries are among those that benefit most from DM, because they have many complex transactions and their data is readily accessible through the Internet, data warehouses, or financial reports. Agriculture is one business that needs DM, but because of missing information and fluctuating data it is not benefiting from DM applications. Industries that sell similar products (e.g., ice cream, beverages) do not require DM because their transactions are limited and simple.

2.  Statistical and DM applications both produce different results for management, even though they might use the same historical data. Discuss the similarities and differences in reporting capabilities.

The similarities between DM and statistical applications are: They both depend on formulating hypotheses and testing them, they discover hidden associations, and they can find unexpected patterns.

The differences are: In statistical applications the hypotheses are formulated manually, while in DM applications the hypotheses are generated automatically. DM applications also offer capabilities that statistical applications cannot provide, such as responding to extracted patterns, selecting the right actions, learning from past actions, and turning action into business value.

3.  A large online bank needs to mine data coming from many sources, including marketing, accounting, and customer databases. Discuss the best way to collect and prepare multi-source data.

The best way to collect data for the bank is through a geographical database that includes a relational database of all the bank's transactions (internal, such as purchasing, or external, such as relationships with clients) from different territories and geographical areas.

Data warehouses are also suitable places to hold a large amount of data from various sources. The data preparation stage includes the following tasks: evaluating data quality, handling missing data, processing outliers, normalizing data, and quantifying data. This helps in understanding the importance of some variables and the irrelevance of others, which in turn helps narrow the focus of the application.
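Three of the preparation tasks named above (handling missing data, processing outliers, and normalizing data) can be sketched in a few lines of Python. The balance figures are invented, and the specific rules used here (median fill, a cap at three times the median, min-max scaling) are only one reasonable set of choices assumed for this illustration, not a prescribed method.

```python
from statistics import median

# Hypothetical monthly balances from merged sources; None marks a missing value.
raw_balances = [1200.0, None, 950.0, 1100.0, 250000.0, 1050.0, None, 1000.0]


def fill_missing(values):
    """Replace missing entries with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def cap_outliers(values, factor=3.0):
    """Cap values above factor * median (a crude rule assumed for this sketch)."""
    limit = factor * median(values)
    return [min(v, limit) for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


filled = fill_missing(raw_balances)
capped = cap_outliers(filled)
normalized = min_max_normalize(capped)
```

The order matters in this sketch: missing values are filled first so that the outlier rule sees a complete column, and the outlier is capped before scaling so that one extreme balance does not compress every other value toward zero.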

4.  Minetise.com is an Internet company specializing in online banner ads. The company is developing an application that customizes a banner according to a customer’s historic profile. Discuss how DM can be used to develop such an application.

To develop such an application, the company must go through the virtuous DM cycle, starting with business understanding. The company must identify its purpose for using the application, understand the real benefit of such banners, and anticipate the problems it is most likely to encounter. The application should then define the profile of each customer. According to that profile, and based on historical data, a matching banner is identified.
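The last two steps (define a profile, then match a banner to it) might look like the following Python sketch. The banner texts, the category names, and the rule of picking the most-clicked category are all hypothetical; a real application would derive the profile with a trained DM model rather than a simple count.

```python
from collections import Counter

# Hypothetical banner inventory keyed by interest category.
BANNERS = {
    "travel": "Fly away this weekend",
    "sports": "New season gear is here",
    "finance": "Open a savings account today",
}
DEFAULT_BANNER = "Welcome to our site"


def build_profile(click_history):
    """Summarize a customer's history as click counts per category."""
    return Counter(click_history)


def choose_banner(profile):
    """Pick the banner for the most-clicked category that has a banner."""
    for category, _count in profile.most_common():
        if category in BANNERS:
            return BANNERS[category]
    # No usable history: fall back to a generic banner.
    return DEFAULT_BANNER


history = ["sports", "travel", "sports", "news", "sports", "travel"]
banner = choose_banner(build_profile(history))
```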

5.  Your manager is extremely worried about integration problems that might arise from implementing a DM application on your company’s SQL database. Some of the questions bothering him include the following: How will it integrate into the current computing environment? Will it work on our existing SQL database, or do we need anything else? How easily will the system work on our intranet? Discuss the problems and possible solutions to these questions. What other problems might your company face?

Analytical methods include querying and reporting data, data visualization, and data analysis. However, the statistical and machine learning techniques that depend primarily on SQL and other database applications are considered to be the analytical foundations upon which DM was developed. DM applications provide a global approach that integrates the conventional tools into a single process that leads to actionable knowledge. A DM application works directly on the SQL server and allows users to access information from different sources through client/server (intranet) or Web-based query systems. Some of the questions that need to be addressed are:

·  Will any SQL server work? Most of the new DM applications require the latest SQL server, and it can be installed easily.

·  Do we need a special type of knowledge workers and users? DM can provide the right environment to satisfy the requirements of all types of knowledge workers.

6.  Finance Trance is a stock brokerage firm. They are thinking of using DM in their customer services department. Suggest some uses and services they can offer. Also, discuss the DM tasks that are to be used.

Some of the services that can be offered are:

·  Portfolio screening: Using DM applications, Finance Trance can offer its clients a high-standard portfolio by scanning different companies' stock prices, dividends, historical earnings, etc., and building a portfolio from the best options. Neural networks are the appropriate DM technique for this service.

·  Currency exchange market fluctuations: Finance Trance can provide clients with a forecast of future currency exchange prices to help them seek an attractive return on investment. Neural networks are to be used for this service.

·  Loan application processing: Using DM applications, applicants will learn of their status in a short time. A classification tree is the technique to be used here.
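For the loan-processing service, the simplest possible classification tree is a single split, often called a decision stump. The Python sketch below learns one income threshold from past decisions. The history data is invented, and the sketch assumes that approval becomes more likely as income rises; a real classification tree would consider many attributes and several levels of splits.

```python
# Hypothetical past loan decisions: (annual_income_in_thousands, approved).
HISTORY = [(20, False), (28, False), (35, False), (42, True), (55, True), (70, True)]


def learn_stump(examples):
    """Find the income threshold that classifies the most examples correctly.

    The result is a one-level classification tree (a decision stump) that
    predicts approval when income >= threshold.
    """
    values = sorted({income for income, _ in examples})
    best_threshold, best_correct = None, -1
    # Candidate thresholds are the midpoints between adjacent income values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        correct = sum((income >= threshold) == approved for income, approved in examples)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def classify(threshold, income):
    """Apply the learned split to a new applicant."""
    return income >= threshold


threshold = learn_stump(HISTORY)
decision = classify(threshold, 50)
```

Because the rule is a single comparison, an applicant can be given an answer immediately, which is the benefit the bullet above describes.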

7.  An online bookstore has asked your company to develop a DM application to recommend books to customers. Your manager wants you to analyze how the company works and see what data you can pull from their data warehouses. How would you go about understanding the business and data available before starting the project? What part does this fulfill in the overall project?

This is the first stage of the virtuous DM cycle, called “Business Understanding,” together with data preparation. First, we must determine the problems faced by the firm. This involves analyzing the company’s customer base, market share, historical data about sales and revenues, payment methods, and other factors. The data can be retrieved from the company's own database of business transactions (money transfer, shipping, Web site registration, etc.). By completing this stage, we will have a clear idea of the important issues that the DM application must address.

8.  How could a mobile phone company use DM to lower customer churn? Can it use DM to increase variables such as product development speed, marketing effort, or even customer retention?

DM can help a mobile phone company lower customer churn in various ways. One can develop a DM model that predicts which customers are more likely to renew their services and which are more likely to churn. The model captures usage patterns and other important customer characteristics that can be used to identify satisfied and dissatisfied customers. It can identify the incentives to which customers respond best (more product features, extended guarantee period, etc.). Additionally, the model can reveal other problems affecting customer loyalty and suggest how to solve or avoid them.

9.  During the data preparation stage, a supermarket omitted certain data fields that were later shown to have significant adverse effects on the overall DM application. Which stages of the DM process will be affected? At which stage could this problem have been detected? How do you think the problem was detected?

Omitting significant data will affect all the subsequent stages: model building, action and decision, and evaluation. The problem will be detected at the model testing stage. At this stage, the model is tested against test criteria; if it fails, it is either rejected or its parameters are adjusted for further testing. The proper way to detect such problems is to go through individual records before mining the data, to get a feel for the information and to check that at least what we already know still appears in the data.
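The model testing step can be sketched in Python as a check of accuracy on held-out records against a test criterion. Everything specific here is hypothetical: the four test records, the `loyalty_card` and `basket_size` fields, the 0.75 criterion, and the two hand-written stand-in models (one that uses the omitted field and one that cannot).

```python
def accuracy(model, test_set):
    """Fraction of held-out records the model labels correctly."""
    hits = sum(model(features) == label for features, label in test_set)
    return hits / len(test_set)


def passes_test(model, test_set, min_accuracy):
    """Apply the test criterion: accept the model only at or above min_accuracy."""
    return accuracy(model, test_set) >= min_accuracy


# Hypothetical held-out supermarket records: (features, bought_promoted_item).
TEST_SET = [
    ({"loyalty_card": True, "basket_size": 12}, True),
    ({"loyalty_card": True, "basket_size": 3}, True),
    ({"loyalty_card": False, "basket_size": 15}, False),
    ({"loyalty_card": False, "basket_size": 2}, False),
]


def full_model(features):
    """Stand-in for a model that had access to the loyalty_card field."""
    return features["loyalty_card"]


def reduced_model(features):
    """Stand-in for a model built after loyalty_card was omitted."""
    return features["basket_size"] > 10


full_ok = passes_test(full_model, TEST_SET, min_accuracy=0.75)
reduced_ok = passes_test(reduced_model, TEST_SET, min_accuracy=0.75)
```

In this toy setup the model built without the omitted field fails the criterion, which is the signal that sends the team back to data preparation.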

10.  Design a survey to glean trends from several companies that are planning to develop DM applications. This survey should help clarify the role of executive managers, the characteristics of the planned project, and the return expected from it.

This mini-project should help students understand how companies are planning DM applications, who is making the decisions, and why.