Department of Information Management, NSYSU
Data mining and Knowledge Discovery
Final Project Report
Data mining Tool Application
On Car Evaluation
Advisor: Professor 黃三益
Students: M954020031 陳聖現
M954020033 王啟樵
M954020042 呂佳如
June 25, 2007
1. Introduction
Background and Motivation
Automobiles have always been a favorite of salaried workers hoping to start a family, second only to a house of their own. Many young people who like to go out would gladly become car owners, able to drive wherever they want in air-conditioned comfort after years of motorcycle rides through wind, rain, and scorching sun. In 2005, total car sales in Taiwan hit a record of 515 thousand units (according to an IEK-ITIS survey), the highest in the past ten years. This was mainly thanks to a sales boost in the third quarter, when major brands announced new and updated models, followed by price promotions that spurred consumers to replace their cars. However, with the rising prices of daily goods and gasoline, congestion in town and on highways, and a shortage of parking space, car sales are always a tough fight. Besides promotion wars, knowing what customers need and building cars they accept is vital. After all, consumers may no longer want large, comfortable cars, preferring small but smart cars nowadays.
Our team takes a data mining approach, based on classification, to analyze the relations between the attributes of a car and how customers evaluate it. We try to find out which attribute, or which combination of attributes, leads to a car's acceptance by customers.
Dataset and Data mining techniques
We use classification methods in data mining. The ID3 decision tree learning algorithm is applied to analyze the relation between the characteristics of cars on the market and the evaluations they receive. We try to find out which characteristic, or which combination of characteristics, determines whether consumers are satisfied with a car.
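At each node, ID3 picks the attribute whose split yields the largest information gain, i.e. the largest reduction in class-label entropy. A minimal sketch of this computation, using the dataset's attribute names but a hypothetical four-record toy sample rather than the real data:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(records, attr, target):
    """Entropy reduction obtained by splitting `records` on `attr`."""
    total = entropy([r[target] for r in records])
    n = len(records)
    remainder = 0.0
    for value in {r[attr] for r in records}:
        subset = [r[target] for r in records if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return total - remainder

# Toy records in the dataset's shape (values illustrative only).
toy = [
    {"safety": "low",  "size": "small", "class": "unacc"},
    {"safety": "low",  "size": "big",   "class": "unacc"},
    {"safety": "high", "size": "small", "class": "acc"},
    {"safety": "high", "size": "big",   "class": "acc"},
]
print(information_gain(toy, "safety", "class"))  # 1.0: safety alone separates the classes
print(information_gain(toy, "size", "class"))    # 0.0: size tells us nothing here
```

In this toy sample, safety has the maximal gain, so ID3 would place it at the root, which matches the root attribute of the tree Weka produced for the full dataset.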
The following introduction to the dataset is quoted from its source.
* Title: Car Evaluation Database
* Sources:
(a) Creator: Marko Bohanec
(b) Donors: Marko Bohanec
Blaz Zupan
(c) Date: June, 1997
* Past Usage:
The hierarchical decision model, from which this dataset is derived, was first presented in M. Bohanec and V. Rajkovic: Knowledge acquisition and explanation for multi-attribute decision making. In 8th Intl Workshop on Expert Systems and their Applications, Avignon, France. pages 59-78, 1988. Within machine-learning, this dataset was used for the evaluation of HINT (Hierarchy INduction Tool), which was proved to be able to completely reconstruct the original hierarchical model. This, together with a comparison with C4.5, is presented in B. Zupan, M. Bohanec, I. Bratko, J. Demsar: Machine learning by function decomposition. ICML-97, Nashville, TN. 1997 (to appear)
2. Data mining procedures
Ten data mining methodology steps:
Step One: Translate the Business Problem into a Data Mining Problem
Our business problem is “What kind of cars get good evaluations?”, which focuses on the characteristics that differently evaluated cars have. Our dataset contains several attributes of cars together with their evaluations. To translate the business problem into a data mining problem, we use the “Classification” method: setting the evaluation as the target attribute, we hope to find rules over the other attributes that lead to this classification. We expect to understand what consumers base their purchase decisions on. For example, is price or safety most important after all? Which combination of attributes do consumers love the most, and which do they hate the most? We therefore apply data mining techniques to this dataset to discover the decision strategy of consumers, the car buyers.
Step Two: Select Appropriate Data
What Is Available?
The dataset comes from the UCI Machine Learning Repository at the University of California, Irvine, Donald Bren School of Information and Computer Sciences. This website not only collects a variety of datasets but also organizes and documents them, so researchers can apply all kinds of pattern recognition or machine learning methods for classification and compare the results. The dataset was donated by Marko Bohanec and Blaz Zupan.
How Much Data Is Enough?
Data mining techniques are used to discover useful knowledge from volumes of data; generally, the more data we use, the better the result we can get. However, some scholars argue that a great deal of data does not guarantee better results than a small amount: since resources are limited, a larger sample imposes a heavier processing load and contains more exceptional cases during data mining tasks. The dataset our team chose has 1728 instances.
How Many Variables?
The dataset consists of 1728 instances, and each record contains seven attributes: buying price, maintenance price, number of doors, capacity in terms of persons to carry, size of the luggage boot, estimated safety of the car, and car acceptability. The car acceptability attribute is the class label, indicating the degree to which customers accept the car; the other attributes are treated as predictive variables.
What Must the Data Contain?
This dataset uses six predictive variables to predict the degree to which customers accept a car. These six variables can be grouped into three constructs.
The first is the price construct, which includes two factors: b_price and m_price. The second is the hardware construct, which contains three factors: the number of doors (door), capacity in terms of persons to carry (person), and the size of the luggage boot (size). The final construct is safety, with a single factor: the estimated safety of the car (safety). The six attributes above are all used for classification. The last attribute, car acceptability (class), defines the class; there are four classes: unacceptable (unacc), acceptable (acc), good (good), and very good (v-good).
Attribute name / Description / Domain
b_price / Buying price / v-high, high, med, low
m_price / Maintenance price / v-high, high, med, low
door / Number of doors / 2, 3, 4, 5-more
person / Passenger capacity / 2, 4, more
size / Luggage boot size / small, med, big
safety / Safety evaluation / low, med, high
class / Level of customer acceptance / unacc, acc, good, v-good
Step Three: Get to Know the Data
Examine Distributions
The attribute “b_price” is the buying price, and “m_price” is the maintenance cost. door is the number of doors, person is the passenger capacity, and size is the luggage boot capacity. safety is the safety evaluation. Lastly, class is the level of customer acceptance. All attributes are categorical, each with its own value domain.
Compare Values with Descriptions
Validate Assumptions
For most of these six attributes, even the worst category is still acceptable to some customers; for example, even when the luggage boot (size) is small, a car can still be classified as “good”. But two attributes are quite special: person and safety. Whenever person is 2, the class is always unacc; whenever the safety value is low, the class is always unacc. We may therefore suppose that these two attributes are very important to customers when they choose a car.
Ask Lots of Questions
From the above, we know these two attributes are important to customers, who do not compromise on them. The reason might be that customers find a car with only two seats not functional enough, and that they pay great attention to the safety of cars. After all, the value of life is beyond the value of money.
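These two observations can be checked mechanically. The sketch below assumes records represented as dictionaries keyed by our attribute names; the four sample records are illustrative, not drawn from the real file (the full 1728 records can be loaded into the same shape from UCI's car.data with csv.reader, fields ordered buying, maint, doors, persons, lug_boot, safety, class):

```python
def classes_where(records, attr, value):
    """Distinct class labels among records where the given attribute equals value."""
    return {r["class"] for r in records if r[attr] == value}

# Illustrative sample records only, not the actual dataset contents.
sample = [
    {"b_price": "vhigh", "person": "2",    "safety": "med",  "class": "unacc"},
    {"b_price": "med",   "person": "4",    "safety": "low",  "class": "unacc"},
    {"b_price": "low",   "person": "4",    "safety": "high", "class": "vgood"},
    {"b_price": "med",   "person": "more", "safety": "med",  "class": "acc"},
]
print(classes_where(sample, "person", "2"))    # {'unacc'}
print(classes_where(sample, "safety", "low"))  # {'unacc'}
```

Run against the full dataset, both calls should return only {'unacc'}, confirming that person = 2 and safety = low each force the unacc class.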
Step Four: Create a Model Set
Creating a Model Set for Prediction
We separate the dataset into two parts: one part is used as the training set to build the prediction model, and the other as the test set to measure the model's accuracy. We use the cross-validation method, so every record of the dataset serves in the training set for some folds and in the test set for exactly one fold.
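The split can be sketched as a generic k-fold partition of record indices; this is our own illustration, not Weka's internal implementation, and the helper names and seed are hypothetical:

```python
import random

def k_fold_indices(n, k=10, seed=1):
    """Shuffle record indices and partition them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n_records, k=10):
    """Yield (train, test) index lists; each record is tested exactly once."""
    folds = k_fold_indices(n_records, k)
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

splits = list(cross_validate(1728, k=10))
print(len(splits))                               # 10 train/test splits
print(sum(len(test) for _, test in splits))      # 1728: every record tested once
```

With k = 10, as in our Weka runs, each of the 1728 records appears in exactly one test fold while serving as training data in the other nine.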
Step Five: Fix Problems with the Data
Categorical Variables with Too Many Values
We don’t have this problem.
Numeric Variables with Skewed Distributions and Outliers
We don’t have this problem.
Missing Values
We don’t have this problem.
Values with Meanings That Change over Time
We don’t have this problem.
Inconsistent Data Encoding
We don’t have this problem.
Step Six: Transform Data to Bring Information to the Surface
Capture Trends
We don’t have this problem.
Create Ratios and Other Combinations of Variables
We don’t have this problem.
Convert Counts to Proportions
We don’t have this problem.
Step Seven: Build Models
The data mining method we used to build the model is classification. We chose weka.classifiers.trees.Id3 as our classifier, since it showed the best results. The run information is given below.
=== Run information ===
Scheme: weka.classifiers.trees.Id3
Relation: car
Instances: 1728
Attributes: 7
b_price
m_price
door
person
size
safety
class
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Id3
safety = low: unacc
safety = med
| person = 2.0: unacc
| person = 4.0
| | b_price = vhigh
| | | m_price = vhigh: unacc
| | | m_price = high: unacc
| | | m_price = med
| | | | size = small: unacc
| | | | size = med
| | | | | door = 2.0: unacc
| | | | | door = 3.0: unacc
| | | | | door = 4.0: acc
| | | | | door = 5more: acc
| | | | size = big: acc
| | | m_price = low
| | | | size = small: unacc
| | | | size = med
| | | | | door = 2.0: unacc
| | | | | door = 3.0: unacc
| | | | | door = 4.0: acc
| | | | | door = 5more: acc
| | | | size = big: acc
| | b_price = high
| | | size = small: unacc
| | | size = med
| | | | door = 2.0: unacc
| | | | door = 3.0: unacc
| | | | door = 4.0
| | | | | m_price = vhigh: unacc
| | | | | m_price = high: acc
| | | | | m_price = med: acc
| | | | | m_price = low: acc
| | | | door = 5more
| | | | | m_price = vhigh: unacc
| | | | | m_price = high: acc
| | | | | m_price = med: acc
| | | | | m_price = low: acc
| | | size = big
| | | | m_price = vhigh: unacc
| | | | m_price = high: acc
| | | | m_price = med: acc
| | | | m_price = low: acc
| | b_price = med
| | | m_price = vhigh
| | | | size = small: unacc
| | | | size = med
| | | | | door = 2.0: unacc
| | | | | door = 3.0: unacc
| | | | | door = 4.0: acc
| | | | | door = 5more: acc
| | | | size = big: acc
| | | m_price = high
| | | | size = small: unacc
| | | | size = med
| | | | | door = 2.0: unacc
| | | | | door = 3.0: unacc
| | | | | door = 4.0: acc
| | | | | door = 5more: acc
| | | | size = big: acc
| | | m_price = med: acc
| | | m_price = low
| | | | size = small: acc
| | | | size = med
| | | | | door = 2.0: acc
| | | | | door = 3.0: acc
| | | | | door = 4.0: good
| | | | | door = 5more: good
| | | | size = big: good
| | b_price = low
| | | m_price = vhigh
| | | | size = small: unacc
| | | | size = med
| | | | | door = 2.0: unacc
| | | | | door = 3.0: unacc
| | | | | door = 4.0: acc
| | | | | door = 5more: acc
| | | | size = big: acc
| | | m_price = high: acc
| | | m_price = med
| | | | size = small: acc
| | | | size = med
| | | | | door = 2.0: acc
| | | | | door = 3.0: acc
| | | | | door = 4.0: good
| | | | | door = 5more: good
| | | | size = big: good
| | | m_price = low
| | | | size = small: acc
| | | | size = med
| | | | | door = 2.0: acc
| | | | | door = 3.0: acc
| | | | | door = 4.0: good
| | | | | door = 5more: good
| | | | size = big: good
| person = more
| | b_price = vhigh
| | | m_price = vhigh: unacc
| | | m_price = high: unacc
| | | m_price = med
| | | | size = small: unacc
| | | | size = med
| | | | | door = 2.0: unacc