Department of Information Management, NSYSU
Data mining and Knowledge Discovery
Final Project Report
Data mining Tool Application
On Car Evaluation
Advisor: Professor 黃三益
Students: M954020031 陳聖現
M954020033 王啟樵
M954020042 呂佳如
June 25, 2007
1. Introduction
Background and Motivation
Automobiles have always been a favorite of salaried workers hoping to start a family, second only to a house of their own. Many young people who like to go out would gladly become car owners, able to drive wherever they want in air-conditioned comfort after years of motorcycle rides through wind, rain, and scorching sun. In 2005, total car sales in Taiwan hit a record of 515 thousand units (according to an IEK-ITIS survey), the highest in the past ten years. This was mainly thanks to a sales boost in the third quarter, when major brands announced new and updated models, followed by price promotions that spurred consumers to replace their cars. However, with the rising prices of daily goods and gasoline, congestion in town and on highways, and a shortage of parking space, car sales are always a tough fight. Besides promotion wars, knowing what customers need and building cars they accept is vital. After all, consumers may no longer want large, comfortable cars, preferring small but smart cars nowadays.
Our team takes a data mining approach, based on classification, to analyze the relations between the attributes of a car and how customers evaluate it. We try to find out which attribute, or which combination of attributes, leads to a car's acceptance by customers.
Dataset and Data mining techniques
We use classification methods in data mining. The ID3 decision tree learning algorithm is applied to analyze the relation between the characteristics of cars on the market and the evaluations they receive. We try to find out which characteristic, or which combination of characteristics, determines whether consumers are satisfied with a car.
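At each node, ID3 picks the attribute whose split yields the largest information gain, i.e. the largest reduction in class-label entropy. A minimal sketch of this computation, using the dataset's attribute names but a hypothetical four-record toy sample rather than the real data:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(records, attr, target):
    """Entropy reduction obtained by splitting `records` on `attr`."""
    total = entropy([r[target] for r in records])
    n = len(records)
    remainder = 0.0
    for value in {r[attr] for r in records}:
        subset = [r[target] for r in records if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return total - remainder

# Toy records in the dataset's shape (values illustrative only).
toy = [
    {"safety": "low",  "size": "small", "class": "unacc"},
    {"safety": "low",  "size": "big",   "class": "unacc"},
    {"safety": "high", "size": "small", "class": "acc"},
    {"safety": "high", "size": "big",   "class": "acc"},
]
print(information_gain(toy, "safety", "class"))  # 1.0: safety alone separates the classes
print(information_gain(toy, "size", "class"))    # 0.0: size tells us nothing here
```

In this toy sample, safety has the maximal gain, so ID3 would place it at the root, which matches the root attribute of the tree Weka produced for the full dataset.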
The following introduction to the dataset is quoted from its source.
* Title: Car Evaluation Database
* Sources:
(a) Creator: Marko Bohanec
(b) Donors: Marko Bohanec
Blaz Zupan
(c) Date: June, 1997
* Past Usage:
The hierarchical decision model, from which this dataset is derived, was first presented in M. Bohanec and V. Rajkovic: Knowledge acquisition and explanation for multi-attribute decision making. In 8th Intl Workshop on Expert Systems and their Applications, Avignon, France. pages 59-78, 1988. Within machine-learning, this dataset was used for the evaluation of HINT (Hierarchy INduction Tool), which was proved to be able to completely reconstruct the original hierarchical model. This, together with a comparison with C4.5, is presented in B. Zupan, M. Bohanec, I. Bratko, J. Demsar: Machine learning by function decomposition. ICML-97, Nashville, TN. 1997 (to appear)
2. Data mining procedures
Ten data mining methodology steps:
Step One: Translate the Business Problem into a Data Mining Problem
Our business problem is “What kind of cars get good evaluations?”, which focuses on the characteristics that differently evaluated cars have. Our dataset contains several attributes of cars together with their evaluations. To translate the business problem into a data mining problem, we use the “Classification” method: setting the evaluation as the target attribute, we hope to find rules over the other attributes that lead to this classification. We expect to understand what consumers base their purchase decisions on. For example, is price or safety most important after all? Which combination of attributes do consumers love the most, and which do they hate the most? We therefore apply data mining techniques to this dataset to discover the decision strategy of consumers, the car buyers.
Step Two: Select Appropriate Data
What Is Available?
The dataset comes from the UCI Machine Learning Repository at the University of California, Irvine, Donald Bren School of Information and Computer Sciences. This website not only collects a variety of datasets but also organizes and documents them, so researchers can apply all kinds of pattern recognition or machine learning methods for classification and compare the results. The dataset was donated by Marko Bohanec and Blaz Zupan.
How Much Data Is Enough?
Data mining techniques are used to discover useful knowledge from volumes of data; generally, the more data we use, the better the result we can get. However, some scholars argue that a great deal of data does not guarantee better results than a small amount: since resources are limited, a larger sample imposes a heavier processing load and contains more exceptional cases during data mining tasks. The dataset our team chose has 1728 instances.
How Many Variables?
The dataset consists of 1728 instances, and each record contains seven attributes: buying price, maintenance price, number of doors, capacity in terms of persons to carry, size of the luggage boot, estimated safety of the car, and car acceptability. The car acceptability attribute is the class label, indicating the degree to which customers accept the car; the other attributes are treated as predictive variables.
What Must the Data Contain?
This dataset uses six predictive variables to predict the degree to which customers accept a car. These six variables can be grouped into three constructs.
The first is the price construct, which includes two factors: b_price and m_price. The second is the hardware construct, which contains three factors: the number of doors (door), capacity in terms of persons to carry (person), and the size of the luggage boot (size). The final construct is safety, with a single factor: the estimated safety of the car (safety). The six attributes above are all used for classification. The last attribute, car acceptability (class), defines the class; there are four classes: unacceptable (unacc), acceptable (acc), good (good), and very good (v-good).
Attribute name / Description / Domain
b_price / Buying price / v-high, high, med, low
m_price / Maintenance price / v-high, high, med, low
door / Number of doors / 2, 3, 4, 5-more
person / Passenger capacity / 2, 4, more
size / Luggage boot size / small, med, big
safety / Safety evaluation / low, med, high
class / Level of customer acceptance / unacc, acc, good, v-good
Step Three: Get to Know the Data
Examine Distributions
The attribute “b_price” is the buying price, and “m_price” is the maintenance cost. door is the number of doors, person is the passenger capacity, and size is the luggage boot capacity. safety is the safety evaluation. Lastly, class is the level of customer acceptance. All attributes are categorical, each with its own value domain.
Compare Values with Descriptions
Validate Assumptions
For most of these six attributes, even the worst category is still acceptable to some customers; for example, even when the luggage boot (size) is small, a car can still be classified as “good”. But two attributes are quite special: person and safety. Whenever person is 2, the class is always unacc; whenever the safety value is low, the class is always unacc. We may therefore suppose that these two attributes are very important to customers when they choose a car.
Ask Lots of Questions
From the above, we know these two attributes are important to customers, who do not compromise on them. The reason might be that customers find a car with only two seats not functional enough, and that they pay great attention to the safety of cars. After all, the value of life is beyond the value of money.
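These two observations can be checked mechanically. The sketch below assumes records represented as dictionaries keyed by our attribute names; the four sample records are illustrative, not drawn from the real file (the full 1728 records can be loaded into the same shape from UCI's car.data with csv.reader, fields ordered buying, maint, doors, persons, lug_boot, safety, class):

```python
def classes_where(records, attr, value):
    """Distinct class labels among records where the given attribute equals value."""
    return {r["class"] for r in records if r[attr] == value}

# Illustrative sample records only, not the actual dataset contents.
sample = [
    {"b_price": "vhigh", "person": "2",    "safety": "med",  "class": "unacc"},
    {"b_price": "med",   "person": "4",    "safety": "low",  "class": "unacc"},
    {"b_price": "low",   "person": "4",    "safety": "high", "class": "vgood"},
    {"b_price": "med",   "person": "more", "safety": "med",  "class": "acc"},
]
print(classes_where(sample, "person", "2"))    # {'unacc'}
print(classes_where(sample, "safety", "low"))  # {'unacc'}
```

Run against the full dataset, both calls should return only {'unacc'}, confirming that person = 2 and safety = low each force the unacc class.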
Step Four: Create a Model Set
Creating a Model Set for Prediction
We separate the dataset into two parts: one part is used as the training set to build the prediction model, and the other as the test set to measure the model's accuracy. We use the cross-validation method, so every record of the dataset serves in the training set for some folds and in the test set for exactly one fold.
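The split can be sketched as a generic k-fold partition of record indices; this is our own illustration, not Weka's internal implementation, and the helper names and seed are hypothetical:

```python
import random

def k_fold_indices(n, k=10, seed=1):
    """Shuffle record indices and partition them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n_records, k=10):
    """Yield (train, test) index lists; each record is tested exactly once."""
    folds = k_fold_indices(n_records, k)
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

splits = list(cross_validate(1728, k=10))
print(len(splits))                               # 10 train/test splits
print(sum(len(test) for _, test in splits))      # 1728: every record tested once
```

With k = 10, as in our Weka runs, each of the 1728 records appears in exactly one test fold while serving as training data in the other nine.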
Step Five: Fix Problems with the Data
Categorical Variables with Too Many Values
We don’t have this problem.
Numeric Variables with Skewed Distributions and Outliers
We don’t have this problem.
Missing Values
We don’t have this problem.
Values with Meanings That Change over Time
We don’t have this problem.
Inconsistent Data Encoding
We don’t have this problem.
Step Six: Transform Data to Bring Information to the Surface
Capture Trends
We don’t have this problem.
Create Ratios and Other Combinations of Variables
We don’t have this problem.
Convert Counts to Proportions
We don’t have this problem.
Step Seven: Build Models
The data mining method we used to build the model is classification. We chose weka.classifiers.trees.Id3 as our classifier, since it showed the best results. The run information is given below.
=== Run information ===
Scheme: weka.classifiers.trees.Id3
Relation: car
Instances: 1728
Attributes: 7
b_price
m_price
door
person
size
safety
class
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Id3
safety = low: unacc
safety = med
| person = 2.0: unacc
| person = 4.0
| | b_price = vhigh
| | | m_price = vhigh: unacc
| | | m_price = high: unacc
| | | m_price = med
| | | | size = small: unacc
| | | | size = med
| | | | | door = 2.0: unacc
| | | | | door = 3.0: unacc
| | | | | door = 4.0: acc
| | | | | door = 5more: acc
| | | | size = big: acc
| | | m_price = low
| | | | size = small: unacc
| | | | size = med
| | | | | door = 2.0: unacc
| | | | | door = 3.0: unacc
| | | | | door = 4.0: acc
| | | | | door = 5more: acc
| | | | size = big: acc
| | b_price = high
| | | size = small: unacc
| | | size = med
| | | | door = 2.0: unacc
| | | | door = 3.0: unacc
| | | | door = 4.0
| | | | | m_price = vhigh: unacc
| | | | | m_price = high: acc
| | | | | m_price = med: acc
| | | | | m_price = low: acc
| | | | door = 5more
| | | | | m_price = vhigh: unacc
| | | | | m_price = high: acc
| | | | | m_price = med: acc
| | | | | m_price = low: acc
| | | size = big
| | | | m_price = vhigh: unacc
| | | | m_price = high: acc
| | | | m_price = med: acc
| | | | m_price = low: acc
| | b_price = med
| | | m_price = vhigh
| | | | size = small: unacc
| | | | size = med
| | | | | door = 2.0: unacc
| | | | | door = 3.0: unacc
| | | | | door = 4.0: acc
| | | | | door = 5more: acc
| | | | size = big: acc
| | | m_price = high
| | | | size = small: unacc
| | | | size = med
| | | | | door = 2.0: unacc
| | | | | door = 3.0: unacc
| | | | | door = 4.0: acc
| | | | | door = 5more: acc
| | | | size = big: acc
| | | m_price = med: acc
| | | m_price = low
| | | | size = small: acc
| | | | size = med
| | | | | door = 2.0: acc
| | | | | door = 3.0: acc
| | | | | door = 4.0: good
| | | | | door = 5more: good
| | | | size = big: good
| | b_price = low
| | | m_price = vhigh
| | | | size = small: unacc
| | | | size = med
| | | | | door = 2.0: unacc
| | | | | door = 3.0: unacc
| | | | | door = 4.0: acc
| | | | | door = 5more: acc
| | | | size = big: acc
| | | m_price = high: acc
| | | m_price = med
| | | | size = small: acc
| | | | size = med
| | | | | door = 2.0: acc
| | | | | door = 3.0: acc
| | | | | door = 4.0: good
| | | | | door = 5more: good
| | | | size = big: good
| | | m_price = low
| | | | size = small: acc
| | | | size = med
| | | | | door = 2.0: acc
| | | | | door = 3.0: acc
| | | | | door = 4.0: good
| | | | | door = 5more: good
| | | | size = big: good
| person = more
| | b_price = vhigh
| | | m_price = vhigh: unacc
| | | m_price = high: unacc
| | | m_price = med
| | | | size = small: unacc
| | | | size = med
| | | | | door = 2.0: unacc