Cluster Analysis (CLA) – also known as Q analysis, typology construction, classification analysis, or numerical taxonomy
- The primary objective of CLA is to classify objects into relatively homogeneous groups based on the set of variables considered. Objects within a group are relatively similar in terms of these variables and relatively dissimilar to objects in other groups.
- When used in this manner, CLA is the obverse of FA (factor analysis), in that it reduces the number of objects, not variables (as FA does), by grouping them into a much smaller number of clusters.
- CLA can also classify variables into relatively homogeneous groups based on the set of objects considered (the inverse of the above)
- CLA places similar observations into groups, trying to
- Minimize within-group variance (i.e. build groups with homogeneous contents)
and, at the same time,
- Maximize between-group variance (i.e. make the groups as distinct from one another as possible); see the sketch below
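To make these two criteria concrete, here is a minimal sketch in Python (toy data and hypothetical cluster labels; assumes NumPy) that decomposes the total sum of squares into its within-group and between-group parts:

```python
import numpy as np

def variance_decomposition(X, labels):
    """Split the total sum of squares into within- and between-group parts.

    X: (n_objects, n_variables) array; labels: cluster label per object.
    A good clustering has a small within-group and a large between-group part.
    """
    grand_mean = X.mean(axis=0)
    within, between = 0.0, 0.0
    for g in np.unique(labels):
        members = X[labels == g]
        centroid = members.mean(axis=0)
        within += ((members - centroid) ** 2).sum()
        between += len(members) * ((centroid - grand_mean) ** 2).sum()
    return within, between

# Toy data: two well-separated groups measured on two variables
X = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])
w, b = variance_decomposition(X, labels)
print(f"within-group SS = {w:.2f}, between-group SS = {b:.2f}")
```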
- CLA is different from MDA (Multiple Discriminant Analysis) in that groups are suggested by the data and not defined a priori
- CLA is purely descriptive and exploratory (in the same way as the MDS techniques are); it has no statistical foundation and imposes no requirements on the data (unlike FA, MDA, or Multiple Regression Analysis, which assume, e.g., normality, linearity, or homoscedasticity)
- The only concerns are:
- The sample should be representative (and free of outliers; they should be removed after proper screening)
- There should be no multicollinearity among the variables – correlated variables are effectively weighted more heavily, thereby receiving improper emphasis in the analysis (a quick check is sketched after this list)
- Naturally-occurring groups must exist in the data, but the analysis cannot confirm the validity of these groups.
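As a quick multicollinearity check, here is a hedged sketch (hypothetical variable names and random data; assumes pandas and statsmodels) that inspects pairwise correlations and variance inflation factors:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical clustering variables; substitute your own data frame.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=["benefit_price", "benefit_quality", "benefit_brand"])

# Pairwise correlations: very high values (say |r| > 0.8) flag redundancy.
print(df.corr().round(2))

# Variance inflation factors: values well above 10 are a common warning sign.
Xc = sm.add_constant(df).to_numpy()
for i, col in enumerate(df.columns, start=1):
    print(col, round(variance_inflation_factor(Xc, i), 2))
```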
- Purposes of CLA
- Data reduction (to assess structure), usually followed by other multivariate analysis.
- For example, to describe differences in consumers’ product usage behaviour, the consumers may first be clustered into groups, and then the differences among the groups may be examined using MDA (multiple discriminant analysis).
- Hypotheses development
- For example, a researcher may believe that attitudes toward the consumption of diet vs. regular soft drinks could be used to separate soft-drink consumers into separate groups. Hence, he/she can use CLA to classify soft-drink consumers by their attitudes about diet vs. regular soft drinks, and then profile the resulting clusters for demographic similarities and/or differences.
- Classification of objects or variables
- For example, used in marketing for:
- Segmenting the market (e.g. consumers may be clustered on the basis of benefits sought from the purchase of a product)
- Understanding buyer behaviours (e.g. the buying behaviour of each cluster may be examined separately)
- Identifying new product opportunities (e.g. brands in the same cluster compete more fiercely with each other than with brands in other clusters)
- Selecting test markets (e.g. by grouping cities into homogeneous clusters, it is possible to select comparable cities to test various marketing strategies)
- Steps in CLA
- Define the variables on which clustering of objects will be based
- Note: including even one or two irrelevant variables (let alone more) may distort the clustering solution!
- Results of CLA are only as good as the variables included in the analysis (each variable should have a specific reason for being included; if you cannot identify why a variable should be included in the analysis – exclude it).
- Remember: CLA has no means of differentiating relevant from irrelevant variables (it is the researcher who has to perform this task)
- Hint: Always examine the results and eliminate the variables that are not distinctive across the derived clusters, and then repeat the CLA.
- The variables should be selected based on past research, theory, or a consideration of the hypotheses being tested
- Select a similarity measure from one of the following:
- Correlational measures (analyze patterns across the variables)
- For each pair of objects, the correlation coefficient is calculated across all the variables
- Distance measures (analyze the proximity between objects across the variables)
- The Euclidean distance (or its square) – the most popular choice
- The Manhattan distance (or city-block distance)
- and many other distance measures (e.g. Chebychev distance, Minkowski power distance, Mahalanobis distance, cosine, chi-square (for count data))
1) Note: the choice of distance measure will have a great impact on the clustering solution; therefore, use different distances and compare the results (see the sketch after this list).
2) When variables have different scales (e.g. semantic differential 7-point rating scale, 9-point Likert-type scale, 5-point Likert-type scale) or are measured in vastly different units (e.g. percentages, dollar amounts, frequencies), they should be re-scaled in one of the following ways:
* Standardize each variable (i.e. from each variable value subtract the mean and then divide the difference by the standard deviation – this produces so-called Z scores). [This is the most widely used approach.]
* Divide the variable values only by their standard deviation
* Divide the variable values only by their mean
* Divide the variable values by their range
* Divide the variable values by their maximum
* Note: Not only the variables can (or should) be standardized; the objects (i.e. cases or respondents) may be standardized as well. Standardizing respondents (so-called “ipsatizing”) sometimes helps to remove so-called “response-style effects”.
- Association measures (for nominal or ordinal measurements).
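The following sketch (hypothetical ratings data; assumes NumPy and SciPy) standardizes the variables to Z scores and then compares Euclidean, squared Euclidean, and city-block distance matrices, as note 1) above recommends:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import zscore

# Hypothetical ratings of 4 respondents on 3 variables with very different
# scales; standardize the columns first so that no variable dominates the
# distances merely because of its units.
X = np.array([[7.0, 3.0, 55.0],
              [6.0, 4.0, 60.0],
              [2.0, 8.0, 20.0],
              [1.0, 9.0, 15.0]])
Z = zscore(X, axis=0, ddof=1)

for metric in ("euclidean", "sqeuclidean", "cityblock"):
    D = squareform(pdist(Z, metric=metric))  # full distance matrix
    print(metric, "\n", np.round(D, 2))
```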
- Select a clustering procedure
- Hierarchical clustering = a stepwise clustering procedure that successively combines or divides objects into clusters
- Agglomerative clustering (starts with each object in a separate cluster, and then clusters are formed by grouping objects into bigger and bigger clusters – like snowballs)
1) Linkage methods
(i) Single linkage algorithm (based on the minimum distance between the two closest points of two clusters)
(ii) Complete linkage algorithm (based on maximum distance between the two furthest points of two clusters)
(iii) Average linkage algorithm (based on the average of the distances between all pairs of objects from each of the clusters)
2) Variance methods
(i) Ward’s method (based on the squared Euclidean distances from each object to the cluster’s mean)
3) Centroid method (based on the distances between the clusters’ centroids, i.e. their means for all of the variables)
- Divisive clustering (just the opposite: starts with all the objects grouped in one cluster, which is then divided or split into smaller and smaller groups)
* Note: Agglomerative clustering is more popular than divisive clustering in marketing research
* Within agglomerative clustering, the most popular approaches are the average linkage and Ward’s methods; they have been shown to perform better than the other procedures.
* Squared Euclidean distances should be used with Ward’s and the centroid methods. (A sketch of the two recommended methods follows.)
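A minimal sketch of agglomerative clustering (synthetic data; assumes SciPy and matplotlib) that runs the two recommended methods and draws the dendrogram discussed later in these notes:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Synthetic standardized data: three loose groups of objects in two variables.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(10, 2))
               for c in ([0, 0], [3, 3], [0, 3])])

# SciPy's 'ward' linkage handles the (squared) Euclidean distances internally;
# 'average' linkage is the other commonly recommended choice.
for method in ("average", "ward"):
    Z = linkage(X, method=method)                    # agglomeration schedule
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree at 3 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes

dendrogram(Z)  # tree of successive merges (from the last, Ward, solution)
plt.title("Ward linkage dendrogram")
plt.show()
```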
- Non-hierarchical clustering (also referred to as K-means clustering)
- Sequential threshold method
- Parallel threshold method
- Optimizing partitioning method
- Disadvantages of non-hierarchical methods:
- The number of clusters must be specified in advance, and the selection of cluster centres (seed points) is arbitrary
- The clustering results may depend on how the centers are selected
- Advantages of non-hierarchical methods:
- Are better for large data sets
- Note: One may use hierarchical and non-hierarchical methods in tandem:
- First, an initial clustering solution is obtained with a hierarchical method, such as average linkage or Ward’s procedure
- Then, the number of clusters and the cluster centroids so obtained may be used as inputs to the optimizing partitioning method, as sketched below.
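A minimal sketch of this tandem approach (synthetic data; assumes SciPy and scikit-learn): a Ward solution suggests the number of clusters and supplies seed centroids, which then initialize an optimizing K-means pass:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(50, 2))
               for c in ([0, 0], [4, 4], [0, 4])])

# Step 1: hierarchical (Ward) solution gives tentative clusters and centroids.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")
seeds = np.vstack([X[labels == g].mean(axis=0) for g in np.unique(labels)])

# Step 2: the centroids seed an optimizing-partitioning (K-means) run,
# which refines the solution by reallocating objects between clusters.
km = KMeans(n_clusters=3, init=seeds, n_init=1).fit(X)
print(np.round(km.cluster_centers_, 2))
```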
- Decide on the number of clusters – there are no hard and fast rules, but the following guidelines are available:
- A number of clusters may be determined by theoretical, practical, or conceptual considerations. For example, if the purpose of clustering is to identify market segments, management may want a particular number of clusters.
- The relative sizes of the clusters may be taken into account for determining the number of clusters (e.g. if with six clusters, one of the clusters contains only one or two objects, one may decide to create only five clusters, thus increasing the size of each of them)
- In hierarchical clustering, the following outputs are obtained:
- The icicle plot (will be presented and explained in class)
- The agglomeration schedule
- The dendrogram
- The number of clusters can be determined based on the analysis of the above outputs (will be explained in class)
- In non-hierarchical clustering, the ratio of total within-group variance to between-group variance can be plotted against the number of clusters. The point at which an elbow (or sharp bend) occurs indicates an appropriate number of clusters (will be explained in class; a sketch follows below)
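A sketch of the elbow heuristic (synthetic data; assumes scikit-learn and matplotlib), using the within-cluster sum of squares, which K-means reports as inertia, as a simple proxy for the variance ratio described above:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(60, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

# Within-cluster sum of squares for k = 1..8; the "elbow" where the curve
# bends sharply suggests an appropriate number of clusters (here, k = 3).
ks = range(1, 9)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in ks]

plt.plot(list(ks), wss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.show()
```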
- Interpret the clusters
- involves assigning a name (label) to each cluster
- MDA may be applied to assist in labeling the clusters
- Validate and profile the clusters
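A profiling sketch (hypothetical variable names and random data; assumes pandas and scikit-learn): compare the mean of each clustering variable across the clusters, since variables that differ sharply between clusters suggest descriptive labels, and check the relative cluster sizes:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
df = pd.DataFrame(rng.normal(size=(90, 3)),
                  columns=["price_sensitivity", "quality_seeking", "brand_loyalty"])
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(df)

# Cluster profiles: per-cluster means of the clustering variables.
print(df.groupby("cluster").mean().round(2))
# Relative cluster sizes (very small clusters may warrant a different k).
print(df["cluster"].value_counts())
```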
Example. Open the file World90.xls and convert it to an SPSS data file
1. Standardize the variables:
- Analyze → Descriptive Statistics → Descriptives → Paste variables (consumpt, investme, govnt, rgdppea, rgdppw, rgdp88) → Click on “Save standardized values as variables” → Options (leave as is) → Continue → OK
- Click again on the file World90.sav (at the bottom of your screen)
- Notice that the standardized variables (old names with a letter Z in front) have been added after the last column of the initial data set
2. Assess multicollinearity and remove outliers (if any)
- Analyze → Regression → Linear → Paste variable “Zrgdp88” as Dependent → Paste all the remaining Z-variables as Independent → Statistics → Check all empty boxes → Set standard deviations to 2.5 (instead of 3) → Continue → Plots → Check Histogram and Normal probability plot boxes → Continue → Save → Check all empty boxes EXCEPT “Prediction Intervals”, “Save to New File”, and “Export model information to XML file” → Continue → OK (no Options)
- Analyze → Descriptive Statistics → Explore → Paste all Z-variables (except Zrgdp88) into Dependent List → Click Display (both) → Statistics → Continue (no change) → Plots (click on Histogram and Normality plots with tests; keep Factor levels together and Stem-and-leaf checked) → OK
- Analyze the outcome. This will be explained in class.
3. Start the clustering procedure
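For readers working outside SPSS, here is a rough Python equivalent of steps 1 and 2 above (a hedged sketch: it assumes World90.xls is readable by pandas and contains the variables listed earlier):

```python
import pandas as pd

# Assumes World90.xls contains the variables named in the SPSS steps above.
df = pd.read_excel("World90.xls")
cols = ["consumpt", "investme", "govnt", "rgdppea", "rgdppw", "rgdp88"]

# Step 1: standardize (the SPSS "Save standardized values as variables").
z = ((df[cols] - df[cols].mean()) / df[cols].std()).add_prefix("Z")
df = pd.concat([df, z], axis=1)

# Step 2: crude outlier screen mirroring the 2.5-standard-deviation rule.
df_clean = df[(z.abs() <= 2.5).all(axis=1)]

# Step 3: the cleaned Z-variables are now ready for the clustering procedure.
print(df_clean[z.columns].describe().round(2))
```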