Visual Analytics in Data Stability and Data Flow Analysis
Jianping Zhou, Georges Grinstein
ABSTRACT
Coupled visualization andanalysis, a.k.a. visual analytics, has become a major methodology for data analysisand exploration. Effective visual analysisof various clustering results and categorical information withina single graphical interactive display is highly recommended. We describe a specific integration of analysis and visualizationfor partition stabilityevaluation. Partitions are decompositions of a datasetinto a family of disjoint subsets. Theymay refer to the results ofclustering, of groupings of categorical dimensions, of binned numerical dimensions,ofpredetermined class labeling dimensions, or prior knowledge structured in mutuallyexclusive format (one data item associated with one and only one outcome).Partition or cluster stability analysis identifies near-optimal structure, buildsensembles, or conducts validation.
We developed a set of partition stability metrics and a new visualization tool, CComViz (Cluster ComparisonVisualization), for mutual comparison and evaluation of multiple partitioning of thesamedataset.We've added a novel layout algorithm for informatively rearranging theorder of the records and dimensions. With these techniques, we are able tovisualize data stability, data flow, density distribution and hierarchy, and datacorrelation at the record, at the group and at the dimension levels withina single graphical interactive display. We provide an example application ofCComViz.
1INTRODUCTION
Partition comparison and evaluation is a common task in comparative data analysis. In this context,partitions refer to the results ofclustering in flat format, of groupings of categorical dimensions, of binned numerical dimensions,ofpredetermined class labeling dimensions, or prior knowledge structured in mutuallyexclusive format (one data item associated with one and only one outcome).Partition stability analysis deals with comparison and evaluation of multiple partitions in order to identify near-optimal structure, buildensembles, or conduct validation.Partition stability is a measure regardingthe consistenceof records within clusters or data categories.Stable records characterize the main features of their belonging clusters while unstable records indicate outliers and anomalies. The results of partition stability analysis often answer questions like
- Which data items are strongly clustered together? Which data items are marginally clustered?
- Is the clustering result obtained by a method better than other methods?
- What is the optimal number of clusters over multiple clustering runs with different settings?
- How consistent, or stable, are data memberships over multiple partitions?
- How data flow starting from one partition to others?How data volumes change? What are their density distributions and hierarchical relationships?
The integration of analysis and visualizationfor partition stabilityevaluation has become a major methodology in exploratory data analysis. We have developed aset of partition stability metrics and a new visualization tool,CComViz (Cluster ComparisonVisualization), to carry out this methodology. We proposed a novel layout algorithm used in CComVizfor informatively rearranging therecord display order. With these techniques, we are able tovisualize data stability, data flow, density distribution and hierarchy, and datacorrelation at the record, at the cluster and at the dimension levels withina single graphical interactive display.
2BACKGROUND
Partition stability metrics characterize diverse partition stability analyses. In terms of distance, correlation or similaritymeasure between partitions or clusters to be compared, partition stability metrics are divided into partition-based metrics and cluster-based metrics. There are totally four levels of partition stability, which arerespectively calledrecord stability, group stability, dimension stability and dataset stability. Record stability measuresthe record’s tendency to follow a trend. Group stability measuresthe stability of a cluster or a partitioning group across different partitions. And dimension stability is overall difference of a partition from others. Finally, dataset stabilityreflects the dataset’s complexity indicatingthe chance of data pattern existence with underlying selection of partitions. Partition-based stability metrics include only metrics at the dimensionand the dataset levels, and are often used for estimating the number of clusters [1] [2][3][4].Some external and relative criteria[5] used in cluster analysis, including Rand index [6], Fowlkes-Mallows index [7] [8]and Hubert’s statistic [5], can be used to define partition-based stability metrics. Cluster-based stability metrics consist of metrics at all levels. In general, any cluster distance, correlation or similarity measures are options for defining cluster-based stability metrics. These options include F-measure[9], label distance, label correlation coefficient [11], cluster similarity [10]. More cluster distance measures can be found in the references [12] [13].
Information theory [14] is traditionally used to measure the amount of shared information between two variables. Strehl and Ghosh[15], Fred and Jain [16]adopted mutual information index from information theory and defined two consistency measures applied to a pair of partitions. These measures can be further used for partition-based stability metrics.A study using cluster-based stability metrics for perturbation-based cluster analysis was conducted by Kaze and Grinstein [11]. Perturbation-based cluster analysis is a kind of partition stability analysis, which investigates the response of various clustering methods to random noise added to input data in order to find out the most stable clustering through a large quantity of experiments. To support this technique, the authors defined stability metrics at the record, the cluster, and the dimension levels. The calculations of these stability measures are based on comparisons between the original (unperturbed) clustering and every perturbed clustering. Thesemetrics require the setting of number of clusters to be the samein every perturbation run, which equal to the number of clusters of the unperturbed clustering. In addition, the correspondent cluster pairs between unperturbed and perturbed clustering results, known as label correspondence problem [15] [18], need to be recognized. For perturbation-based cluster analysis, Bittner et al. [17] presented a partition-based stability analytics, called WARP (Weighted Average Discrepant Pairs). FOM (Figure of Merit) is another kind of partition stability analysis, proposed by Yeung et al.[22]. This technique is analogous to leave-one-out validation. In every run, one dimension is left out and serves as the validation dimension. Clustering is applied to the remaining dataset. And acomparison then takes place between the validation dimension and the clustering result. As every dimension is chosen as the validation dimension, the average similarity is able to measure the partition stability.
Visualization provides an intuitive and interactive means to help partition stability analysis. Traditionally, interactive comparison of multiple partitions, projected inmultiple visualization instances and arranged side by side, is a straightforward approach to perform partition stability analysis. However due to the limitations of computer screen space, as well as human short-term memory, the comparison with multiple graphic displays often cause efficiency problem. Apparently, a single graphic interactive display is a solution to address this problem. Chen et al.[1] proposed a tree graph to express the similarities and hierarchical grouping relationships of various clustering results.Expression Profiler [23] visualization package developed at EBI (European Bioinformatics Institute) presents a method for comparing various clustering results in either flat or hierarchical structures. For the comparison of two flat clustering results, a bipartite graph is used to express their relationships in [23]. Each side of the bipartite graph is a group of a clustering result. An algorithm is applied to rearrange the original cluster order, such that some clusters are merged into “superclusters” nodes and the number of crossing edges is minimized.After node arrangement and merge, the weight of each edge between two nodes is proportionalto the number of overlapped records in both nodes. SM (Stable Matrix) visualization tool,proposed by Cvek [24], visualizes a symmetric 2D matrix in heatmap. The element of this matrix is calculated by SM (i, j) =nij / N; where N is the total number of clustering results involved in the comparison, and nijis the number of clustering results which recognize thepair of records (i, j)staying in the same cluster. The record index order in SM is carefully arranged by using an algorithm, so that the record stability statuses and relationships are well visually presented.Parallel Sets [25] and Interactive Sankey Diagrams[26] are variants of parallel coordinates [27]. Unlike parallel coordinates, each record in these visualizations has a unique position on each axis regardless of its value. When record display order on each axis is appropriately arranged, these visualization tools work well for displaying data flows, density distributions and hierarchies of multiple variables. In order to reduce clutter and occlusion, a set of rules are defined in these tools based on grouping and sorting.
3PARTITION STABILITY METRICS
In order to perform mutual comparison and evaluation of multiple partitions, we developed a set of cluster-based partition stability metrics at the record, the cluster and the dimension levels. For illustrating the calculation of these metrics, we created a synthetic dataset, as shown in Table 1, which contains three categorical dimensions.
Table 1 Synthetic Dataset
Record Index / C1 / C2 / C30 / c11 / c22 / c31
1 / c12 / c22 / c32
2 / c11 / c21 / c33
3 / c11 / c22 / c31
4 / c11 / c22 / c33
5 / c12 / c22 / c33
6 / c11 / c21 / c33
7 / c11 / c21 / c33
8 / c12 / c21 / c32
9 / c13 / c22 / c31
10 / c13 / c22 / c32
11 / c12 / c21 / c33
12 / c13 / c22 / c33
13 / c13 / c22 / c32
14 / c13 / c21 / c32
15 / c13 / c21 / c31
3.1 Record Stability
For a record with index k and a given set of partitions N = {C1, C2, …, Cn}, cmF(k)represents a partitioning group of Cm and contains the record k, i.e. cmF(k)Cm,CmN, k cmF(k), the record stability for record k,RS(k),is defined.
In above equation, D(ca, cb) is called D measure, which may be a reciprocal distance, correlation or similarity measure between partitioning group ca and cb. This measure is symmetric, i.e. D(ca, cb) = D(cb, ca). As we reviewed in the background, a variety of metrics can serve as this measure. In this paper, we adopt the label correlation coefficient [11], which is defined as
Below is an example from the synthetic dataset to show the calculation.Considering 0 c11, 0 c22, 0 c31, the label correlation coefficient for record with index 0 is calculated.
Table2lists all RS values for the synthetic dataset.
Table 2 Record Stability Value
Record Index / RS0 / 0.44
1 / 0.41
2 / 0.55
3 / 0.44
4 / 0.47
5 / 0.36
6 / 0.55
7 / 0.55
8 / 0.39
9 / 0.48
10 / 0.51
11 / 0.44
12 / 0.36
13 / 0.51
14 / 0.4
15 / 0.3
Various D measures,although diverse definitions, share a common feature for which the higher the D measure value, the higher the proportion of records whose dimension values fall into the paired partitioning groups. As a result, these high proportional records form a pattern. So, if the average D measure value of multiple partitioning groups which a specific record falls into is high, we could conjecture this record has high tendency to follow a trend across the multiple partitions. In reverse situation, if no patterns are associated with a specific record, this record could be considered as an outlier or anomaly.
The range of RS depends on the range of D measure. In the case of corCl serving as the D measure, RS is between (0, 1]. Its value for a record will be 1 when all partitioning groups which the record falls into are the same.
RS value is a mutual comparison result. From a statistical point of view, this measure reflects this record’s stability feature in overall.
3.2Group stability
Group stability, GS, is the average record stability of a partitioning group.
Table 3contains GS results for each portioning group of the synthetic dataset.
Table 3 Partitioning group and group stability
Group / Members / GSc11 / {0,2,3,4,6,7} / 0.5
c12 / {1,5,8,11} / 0.4
c13 / {9,10,12,13,14,15} / 0.43
c21 / {2,6,7,8,11,13,14,15} / 0.45
c22 / {0,1,3,4,5,9,10,12} / 0.44
c31 / {0,3,9,15} / 0.42
c32 / {1,8,10,13,14} / 0.44
c33 / {2,4,5,6,7,11,12} / 0.47
3.3Relative DimensionStability
Given two partitions, Ca and Cb, their relative partition stability, RPS, is defined.
where m is the record count of the dataset. Due to the symmetric property of D measure, RPS is also symmetric.
3.4DimensionStability
Dimensionstabilityis defined.
When corCl serves as the D measure, RDS and DS values for the synthetic dataset are calculated and shown in Table 3.
Table 3 PRS and PS results
Partition / RPS (Ci,C1) / RPS (Ci,C2) / RPS (Ci,C3) / PS (Ci)C1 / 1 / 0.43 / 0.51 / 0.47
C2 / 0.43 / 1 / 0.45 / 0.44
C3 / 0.51 / 0.45 / 1 / 0.48
3.5DatasetStability
Dataset stability is defined.
In later section, we will seepartition stability metrics defined above are not only the measures of data feature but also the criteria for achieving visual aesthetics in our new developed visualization tool.
4CCOMVIZ VISUALIZATION AND LAYOUT ALGORITHM
In order to seek a visualization tool more suitable for partitionstability analysis, we developed CComViz (Cluster Comparison Visualization) visualization. CComVizutilizes the samedata representation model as that of Parallel Sets [25] and Interactive Sankey Diagrams[26]. Fig. 2 depicts how data are projected in CComVizusing the synthetic dataset. In this figure, the numbers within boxes represent record indices.
Like parallel coordinates [27], CComViz lays dimension axes out in parallel. But these axes are not coordinates. Every axis is an expression of individual record display order and represented as a set of continuous color bars. Each color bar corresponds to a partitioning group with its length scaled according to its density in dimension. In CComViz, data items are also represented as polylines that link their points on every axis. The point position of each data item on axis is determined by its rank in the record display order. Since each data item has a unique rank on axis, no more than one data item is projected to the same point on each axis (without considering the rounding of projection position to screen pixel). Therefore, no more than one data item is projected to the same polyline. This feature benefits in particular categorical dimensions in contrast to regular parallel coordinates, in which data items with the same category value are projected to the same point on axis. In CComViz, one of the most important visual properties is called hot dimension. The hot dimension, like C3 in Fig. 1, is a selected dimension by which all record lines are colored.Any active dimension can be selected as the hot dimension. Choosing different hot dimension provides a convenient way to view how records in a group of one dimension fall into groups of other dimensions. In this way, data flow, density distribution and hierarchy starting from different dimension can be easily explored.
Unlike parallel coordinates, CComViz focuses on achieving visual analytics rather thangeometrical data mapping. The observation of geometrical data properties in CComViz may be disabled, meaning that the position of record projection has no direct connection to its value.Although this may inadvertently impact on numerical dimensions, by means of data and tool linking mechanism, such as under the UVP (Universal Visualization Platform) [28], this disadvantage can be compensated through using other visualization tools. As a matter of fact, parallel coordinates and CComViz share many common characteristics and interfaces. Using both visualization tools in an interactive and linked environment can generate a strong and unified power for comparative cluster analysis.
In CComViz, the feature that no more than one record is projected to the same point on each axis avoids record overlap. But clutter due to crossing remains a concern. As seen in Fig. 1,without proper record display order rearrangement, the severe edge crossings make it difficult to observe data patterns and data relationships. This problem has to be solved. Otherwise, this data representation model is useless for most real-world datasets. Beyond that, from a visualization point of view, visually representing data with less clutter is not enough. Data representation should amplify human cognition [29]. Now that CComViz is intended to display data stability, data flow, density distribution and hierarchy across multiple partitions, the record and dimension display order needs to enhance readability of these data features so cognition can be amplified.
Information visualization is fundamentally dependent upon the properties of human perception. According to visualization study [29], grouping, sorting, and positioning are among the most effective ways to support human perceptual inference, enhance patterndetection, as well as reduce information search time and memory usage. Another influential factor in human perceptualinference is related to semantics. As CComViz is designed to distinguish stable and unstable records, we expect their visual appearances to be different. Relying on semantic associations, to make stable records look smooth and unstable records fluctuate can help with human perceptual inference.
In order to achieve the matching between data representation model and human perception model, we brought above concepts into a methodology in developing a sophisticated layout algorithm for record and dimension display order rearrangement in terms of selection of the hot dimension and partition stabilitymeasures. The mechanism behind the methodologyis based on consideration and utilization of user perception modeling, partition stability measurement, and interactive functions involved in the course of data exploration and pattern recognition. The methodology used in the layout algorithm makes CComViz an ideal tool to visualize data stability, data flow, density distribution and hierarchy, as well as data correlation across a number of partitions within a single graphical display.
In CComViz, the process for rearranging record and dimension display order is in a pipeline sequence. The computation of record display order rearrangement is based on dimension display order. Any change in adding, removing or reordering dimensions will cause a change in the record display order.
4.1 Dimension Display Order Rearrangement
For partition comparison and evaluation, it is important to identify distinctions and variations between similar partitions. The similarity is defined by different metrics, such as clustering criteria and partitioning consistency. Since similar partitions have a high degree of partitioning consistency, there should be fewer crossing lines if similar partitions are adjacent and the display orders of partitioning groups along their axes maintain a correspondent relationship (To be discussed later). Therefore, from visual aesthetics point of view, putting similar partitions together assists the observation of outliers or odd records while from human visual perception point of view, it facilitates comparison and perceptual inference.