Data Mining and Knowledge Discovery (KSE525)

Assignment #1 (March 22, 2016, due: April 5)

1. [10 points] These days, new types of data sets keep appearing, and many applications require or generate multiple types of data sets. Please think of such an application. For example, consider a social-hub application on mobile phones. This application may generate two types of data sets: graph data for social networking and trajectory data for location-based services. Your application does not have to exist in the real world; it can be an imaginary one. I would like to see your rough idea (you don't have to explain it in detail).

2. [10 points] Please prove that the variance is an algebraic measure. In addition, please write an example showing how the variance can be obtained from smaller subsets without recomputing it from the entire data set.
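As an illustration of the idea only (not a proof), the sketch below keeps three distributive aggregates per subset (count, sum, and sum of squares) and combines them to obtain the variance of the union. It assumes the population variance (dividing by N); the names `Partial`, `summarize`, `merge`, and `variance` and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    """Distributive aggregates kept for one subset (hypothetical helper)."""
    n: int      # number of values
    s: float    # sum of values
    ss: float   # sum of squared values


def summarize(values):
    """Compute the three aggregates for a single subset."""
    return Partial(len(values), sum(values), sum(v * v for v in values))


def merge(a, b):
    """Combine two subsets by adding their aggregates component-wise."""
    return Partial(a.n + b.n, a.s + b.s, a.ss + b.ss)


def variance(p):
    """Population variance (assumption: divide by N) from the aggregates."""
    mean = p.s / p.n
    return p.ss / p.n - mean * mean


# Made-up data split into two subsets.
part1 = [2.0, 4.0, 4.0]
part2 = [4.0, 5.0, 5.0, 7.0, 9.0]

var_merged = variance(merge(summarize(part1), summarize(part2)))
var_direct = variance(summarize(part1 + part2))
print(var_merged, var_direct)  # both are 4.0 for this data
```

Only the three aggregates of each subset are needed at merge time; the raw values are never revisited.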

3. [10 points] Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. (Do not use any software; do the math by yourself.)

1) What is the mean of the data? What is the median?

2) What is the mode of the data? Comment on the data's modality (e.g., bimodal).

3) Can you find the first quartile (Q1) and the third quartile (Q3) of the data? Please use the first method covered in class.

4) Give the five-number summary of the data.

5) Show a boxplot of the data.

4. [10 points] It is important to define or select similarity measures in data analysis. However, there is no commonly accepted subjective similarity measure. Results can vary depending on the similarity measures used. Nonetheless, seemingly different similarity measures may be equivalent after some transformation.

Suppose we have the following 2-D data set:

(1.5, 1.7)
(2, 1.9)
(1.6, 1.8)
(1.2, 1.5)
(1.5, 1.0)

1) Consider the data as 2-D data points. Given a new data point, as a query, rank the database points based on similarity with the query using Euclidean distance, Manhattan distance, supremum distance, and cosine similarity.

2) Normalize the data set to make the norm of each data point equal to 1. Use Euclidean distance on the transformed data to rank the data points.
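The four measures and the normalization step can be sketched as follows. This is a minimal illustration, not a model answer: the query point `query` is a hypothetical placeholder (substitute the one given with the assignment), and the labels `p1` to `p5` and all function names are assumptions introduced here.

```python
import math

# Data points from the problem; the labels p1..p5 are hypothetical.
points = {
    "p1": (1.5, 1.7),
    "p2": (2.0, 1.9),
    "p3": (1.6, 1.8),
    "p4": (1.2, 1.5),
    "p5": (1.5, 1.0),
}
query = (1.4, 1.6)  # hypothetical placeholder query point


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def supremum(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def rank(score, reverse=False):
    """Order point labels by score against the query.

    Distances sort ascending; pass reverse=True for similarities.
    """
    return sorted(points, key=lambda name: score(points[name], query),
                  reverse=reverse)


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.hypot(*v)
    return tuple(x / norm for x in v)


def rank_normalized_euclidean():
    """Euclidean ranking after normalizing both the data and the query."""
    q = normalize(query)
    return sorted(points,
                  key=lambda name: euclidean(normalize(points[name]), q))


print("Euclidean:", rank(euclidean))
print("Manhattan:", rank(manhattan))
print("Supremum:", rank(supremum))
print("Cosine:", rank(cosine_similarity, reverse=True))
print("Normalized Euclidean:", rank_normalized_euclidean())
```

With this placeholder query, some supremum distances are tied up to floating-point rounding, so the order of those points in the output is not meaningful. For unit vectors the squared Euclidean distance equals 2(1 - cos θ), so the normalized Euclidean ranking coincides with the cosine-similarity ranking.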

5. [10 points] Convert the following 2-dimensional points into 1-dimensional points by using the principal component analysis (PCA) technique. You can use any software (e.g., R, MATLAB) to calculate the covariance and the eigenvalues/eigenvectors.