Special Interest Activity 1 By Pallavi Patel

Data Visualization Techniques for Database

ABSTRACT

Many visualization techniques are available today. Demand for data visualization is high because a visual format conveys data efficiently. The goal of this document is to focus on the concept of data visualization for databases. Its main purpose is to explore 1D, 2D and 3D representations of XML and CSV data using GGOBI. The document also introduces the basic ideas behind various data visualization techniques.

TABLE OF CONTENTS

  1. Introduction and Motivation
  2. Data Visualization Techniques
  3. Data Processing
  4. Classification of Data Visualization Techniques
  5. Dynamic/Interaction Techniques
  6. About GGOBI
  7. History
  8. Features
  9. Getting your data in GGOBI
  10. Data Visualization in GGOBI
  11. Future Work
  12. References

1. INTRODUCTION AND MOTIVATION

Data are very valuable. We collect data from almost everything in our environment, and nowadays we use electronic devices to store and process them. For example, if we look at a spreadsheet of a company’s sales data, we cannot understand the data just by looking at it. But when we look at graphs or charts of the same sales details, they make sense at a glance. Organizations of all kinds hold huge databases; even for weather forecasting we collect a vast amount of data. With visualization techniques we can understand data more quickly and more efficiently.

The goals of visualization techniques are data analysis and data representation (Keahey, 1999). When we analyze data, we first generate hypotheses and then verify them against the data. We also predict the behavior of future data. To simplify data, we use numerical solutions such as data mining algorithms. But as the number of dimensions and variables increases, the data patterns become complicated. In this case the data can be represented visually. In this paper, I provide basic knowledge of data visualization techniques and some examples of data visualization.

I will show the use of visualization tools and conclude with a preview of the direction of further research on data visualization.

2. DATA VISUALIZATION TECHNIQUES

Data visualization is the mapping of data into a Cartesian space (Daniel). It means analyzing and searching a database to find potentially useful information (Daniel). Visualization allows the user to interact with the data by changing them and observing the resulting actions and reactions (Daniel). This gives better insight into the structure of the data (Daniel). Data visualization requires the involvement of a human who is able to explore knowledge from the database (Daniel). The greatest challenge in visualizing data is to find a good spatial representation in multiple dimensions. To interact with the data and gain information, we need to scale the data for the multidimensional representation; that is, we need variables against which the data can be coordinated in order to represent them. Some data, such as a company’s product quality or employee satisfaction, are difficult to represent. Data visualization techniques also make it possible to represent such data by examining and evaluating them in multiple dimensions.

2.1. Data Processing

Data processing for data visualization has two tasks: data formation and algorithms (Daniel). Data formation means preparing the data for visualization: it formats the data into data sets for processing. We can normalize and project the data for multidimensional representation. The purpose of the algorithms is to recognize patterns in the data. Data processing yields the information to be communicated.

The data processing techniques are as follows (Daniel):

1) Techniques for dimension reduction

In this technique, we determine the minimal subset of the data components that accounts for the main variation in the data.

2) Subsetting technique

In this technique, we either sample the database by determining a representative subset, or query the database by determining a subset fixed in advance.

3) Segmentation technique

In this technique, we create segments of the database based on an attribute value or range.

4) Aggregation technique

In this technique, we create aggregations based on attribute values or topological properties.
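The four techniques above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows are borrowed from the employee table used later in this document, all function names are hypothetical, and picking the columns with the largest variance is a deliberately simplified stand-in for a real dimension reduction method (variance depends on the scale of each column).

```python
import random
from collections import defaultdict
from statistics import mean, pvariance

# A few illustrative rows: (height, weight, salary, company)
ROWS = [
    (191, 131, 4000, "IBM"),
    (185, 134, 5000, "IBM"),
    (186, 129, 6060, "AIRTEL"),
    (174, 131, 3030, "AIRTEL"),
    (182, 130, 3525, "TOYOTO"),
    (184, 131, 6753, "TOYOTO"),
]
NUMERIC = {"height": 0, "weight": 1, "salary": 2}  # column name -> index


def top_variance_columns(rows, k):
    """Dimension reduction (simplified): keep the k numeric columns
    that carry the largest variance."""
    variances = {
        name: pvariance([row[i] for row in rows]) for name, i in NUMERIC.items()
    }
    return sorted(variances, key=variances.get, reverse=True)[:k]


def sample_rows(rows, n, seed=0):
    """Subsetting by sampling: draw n rows at random (seeded, so repeatable)."""
    return random.Random(seed).sample(rows, n)


def query_rows(rows, predicate):
    """Subsetting by querying: keep the rows that satisfy a fixed condition."""
    return [row for row in rows if predicate(row)]


def segment_by(rows, key):
    """Segmentation: split the rows into groups by an attribute value."""
    segments = defaultdict(list)
    for row in rows:
        segments[key(row)].append(row)
    return dict(segments)


def aggregate_mean(segments, column):
    """Aggregation: one mean value of the given column per segment."""
    return {name: mean(row[column] for row in rows) for name, rows in segments.items()}


if __name__ == "__main__":
    by_company = segment_by(ROWS, lambda row: row[3])
    print(top_variance_columns(ROWS, 2))
    print(aggregate_mean(by_company, NUMERIC["salary"]))
```

Segmentation followed by aggregation reduces the twenty-odd records of a table like this to one value per company, which is often what a bar chart actually displays.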

2.2. Classification of Data Visualization Techniques

There are many visualization techniques available today. We can classify them as follows (Daniel):

1) Geometric techniques are a large group of visualization methods.

The scatter plot matrix is a method for visualizing data in multiple views. It arranges scatter plots in an n x n matrix, where n is the number of dimensions. It represents the correlations and connections between variables so that we can compare the possible connections. (See figure-3)

Parallel coordinates is the most common method for geometric representation. In this method the dimensions are represented by equally spaced parallel lines. The axes are scaled so that the bottom of each axis stands for the lowest possible value and the top corresponds to the highest value. A data line is drawn for each record, joining the points on the axes where its values are located. (See figure-5)

Other geometric techniques include landscapes, projection pursuit, projection views, hyperslice, etc.

2) Icon-based techniques represent high-dimensional data using icons or glyphs, which are graphical representations of the data. They are very vivid, but they organize the representation poorly. For large amounts of data, it is very difficult to use icon-based methods efficiently. They provide a set of tools for visualizing data using stick figures, shape coding, color icons, tile bars, etc.

3) Pixel-based techniques provide various methods for visualizing data, such as recursive pattern, circle segments, and the spiral and axes techniques.

4) Hierarchical techniques enable the user to view data in a hierarchical format, in which the coordinates are nested recursively. Various methods are available for hierarchical representation, such as dimensional stacking, treemap, cone tree, worlds within worlds, etc.

5) Graph-based techniques include methods for representing data in 2D format. They represent the relation between variables on the horizontal axis and the vertical axis.

A line graph represents data using lines that connect the values. It includes straight-line charts, polyline charts and curved-line charts. (See figure-14)

A bar chart represents data using bars. (See figure-8)

6) Hybrid techniques provide methods that can use any combination of the above techniques. They provide more flexibility over databases with many types of data. Hybrid techniques are worth using because they provide efficient visualization of large-scale databases.
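The two geometric methods described under item 1 rest on simple bookkeeping, which the following Python sketch illustrates. It is not how any particular tool implements them; the function names are hypothetical, and min-max scaling is assumed as the way the axes are scaled in parallel coordinates.

```python
from itertools import product


def scatter_matrix_panels(names):
    """Scatter plot matrix: one (x variable, y variable) panel for each
    cell of the n x n matrix."""
    return list(product(names, repeat=2))


def scale_to_axes(records):
    """Parallel coordinates: min-max scale every dimension to [0, 1], so that
    0 is the bottom of an axis (lowest value) and 1 is the top (highest)."""
    columns = list(zip(*records))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for rec in records:
        scaled.append([
            # A constant column has no range; place it mid-axis.
            (value - lo) / (hi - lo) if hi != lo else 0.5
            for value, lo, hi in zip(rec, lows, highs)
        ])
    return scaled


def polyline(scaled_record):
    """One data line: (axis index, height on that axis) for equally spaced axes."""
    return list(enumerate(scaled_record))


if __name__ == "__main__":
    records = [(160, 115, 1000), (210, 143, 9776), (185, 129, 5388)]
    for rec in scale_to_axes(records):
        print(polyline(rec))
    print(len(scatter_matrix_panels(["height", "weight", "salary"])), "panels")
```

Each record becomes one polyline across the axes, so records with similar values on adjacent axes produce lines that run close together.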

2.3. Dynamic / Interaction Techniques

The goal of dynamic / interaction techniques is to provide dynamic visualization and interaction with the visualization for effective exploration of the data (Daniel). The dynamic / interaction techniques are as follows (Daniel).

1) Data-to-Visualization Mapping: In this technique, the data attributes are mapped dynamically or interactively to the parameters of the visualization. The parameters are as follows:

• x-, y-, and z-axes

• Color and size of icons, links, etc.

2) Projections: In this technique, the remaining parameters are visualized by projecting them into 2D or 3D. Automatically varying the projection results in an animation of the data.

3) Filtering (Selection, Querying): In this technique, subsets of the database are determined for the dynamic / interactive visualization. Selection means directly selecting the desired subset; querying means specifying the properties of the desired subset.

4) Linking & Brushing: This technique allows multiple visualizations of the same data. Interactive changes made in one visualization are automatically displayed in the other visualizations.

5) Zooming: This technique allows a large amount of data to be visualized in reduced form to show the overall behavior of the data.

6) Detail on Demand: With this technique, we can interactively obtain more details of the visualized data, such as the attribute values behind an icon or additional attribute values of the data.

In the next part of the document, I will focus on data representation using various data visualization techniques in GGOBI.
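Filtering and linking & brushing can be illustrated with a small data model, shown here as a Python sketch. It is a toy under stated assumptions, not GGOBI's implementation: the class LinkedViews and the functions select and query are hypothetical names, and a "view" is reduced to the pair of variables it plots.

```python
class LinkedViews:
    """Several views over the same records share one set of brushed cases,
    so a brush applied through any view shows up in all of them."""

    def __init__(self, records):
        self.records = records  # list of dicts, one per case
        self.views = {}         # view name -> (x variable, y variable)
        self.brushed = set()    # indices of the currently brushed cases

    def add_view(self, name, x, y):
        self.views[name] = (x, y)

    def brush(self, predicate):
        """Brush every case that satisfies the predicate."""
        self.brushed = {
            i for i, rec in enumerate(self.records) if predicate(rec)
        }

    def points(self, name):
        """Return (x, y, highlighted) for every case in the named view."""
        x, y = self.views[name]
        return [
            (rec[x], rec[y], i in self.brushed)
            for i, rec in enumerate(self.records)
        ]


def select(records, indices):
    """Filtering by selection: pick the desired cases directly."""
    return [records[i] for i in indices]


def query(records, **ranges):
    """Filtering by querying: keep cases whose values fall inside the
    given (low, high) range for each named variable."""
    return [
        rec for rec in records
        if all(lo <= rec[key] <= hi for key, (lo, hi) in ranges.items())
    ]


if __name__ == "__main__":
    recs = [
        {"height": 191, "weight": 131, "salary": 4000},
        {"height": 160, "weight": 118, "salary": 4500},
        {"height": 210, "weight": 140, "salary": 2522},
    ]
    views = LinkedViews(recs)
    views.add_view("height-weight", "height", "weight")
    views.add_view("height-salary", "height", "salary")
    views.brush(lambda rec: rec["height"] > 190)
    print(views.points("height-salary"))
```

Because the brushed set is stored once and consulted by every view, no view needs to know about the others; that shared state is the essence of linking.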

3. ABOUT GGOBI

GGOBI is a visualization program for exploring high-dimensional data. It provides highly dynamic and interactive graphics. It enables the user to understand and interact with data using 1D or 2D tours, scatter plots, bar charts, parallel coordinates plots, multiple views, tree views and line charts. It provides features such as brushing and identification. One important feature of GGOBI is its handling of missing values.

3.1. History of GGOBI

GGOBI is an extension of an earlier piece of software, XGOBI, which was developed at Bellcore. XGOBI in turn descended from Xdataviewer, which was written in Lisp by Andreas Buja, Catherine Hurley, John McDonald and Werner Stuetzle while at the University of Washington, Seattle. It was initially developed for viewing data matrices. The goal of GGOBI is to provide tools for viewing high-dimensional data. It handles various types of data, including non-spatial data (employee satisfaction etc.). GGOBI extends the features of XGOBI; it aims to provide functions for relating tables, catering to longitudinal measurements, etc.

3.2. Features

3.2.1. Brushing in linked plots is used to view the cases with low or high values on some variables and to examine the behavior of those variables in relation to other variables.

3.2.2. Touring in high dimensions is used to view the separation between clusters in high dimensions. We can view rotations of the data. It allows manual and automatic projection of the data.

3.2.3. GGOBI can be extended using plug-in modules. A display plug-in, associated with an R package, is used to create high-quality publication graphics.

3.2.4. Methods such as scatter plots, bar charts, spine plots and histograms, parallel coordinate plots, and scatter plot matrices are available in GGOBI.

3.2.5. Pan and zoom features are available.

3.2.6. It is a high-dimensional tool that includes drawing tools.

3.3. Getting your data in GGOBI

3.3.1 GGOBI input using XML format

GGOBI’s XML format supports a rich variety of data attributes and relationships, including missing values, encoding, records, levels, the type of each variable, and symbols. XML provides the markup and structure of the input data for the visualization. The validity and well-formedness of the data can be checked: validate the XML file first, and then use it to input data. It is very easy to define new DTDs to represent different inputs such as property or resource files, layout specifications, graph descriptions and descriptions of plots. The XML input also supports reading and parsing compressed files.

The format of the file is described by the DTD ggobi.dtd. Each file starts with the XML declarations that identify it as an XML file, the document type and its associated DTD.

<?xml version="1.0"?>

<!DOCTYPE ggobidata SYSTEM "ggobi.dtd">

<ggobidata count="2"> </ggobidata>

The ggobidata tag is the root tag of the document. The count attribute specifies how many data sets the document contains; here it indicates more than one.

<data name="employee">

The data tag contains the entries for a data set. Its name will appear in the title bar of the GGOBI windows.

<activeColorScheme name="YlGn 7">

This tag selects one of the 265 available color schemes through its name attribute. You can also define your own color format.

<description> </description>

This tag contains the description of the data set.

<variables count="4"> </variables>

The variables tag contains the variable definitions. Variables can be continuous or categorical.

Continuous real variables are included with the tag <realvariable> </realvariable>

Continuous integer variables use the tag <integervariable>

Categorical variables are included with the tag <categoricalvariable>

If your data values are numbers instead of strings, and all levels of interest are present in the data, then use the levels tag. For example, for a categorical variable employee_satisfaction, the code looks like this:

<categoricalvariable name="employee_satisfaction">

<levels count="3" />

</categoricalvariable>

By default the level values are 1, 2 and 3, and the level names may be L1, L2, etc. The levels can also be given explicit values and names:

<categoricalvariable name="employee_satisfaction">

<levels count="3">

<level value="0">low</level>

<level value="1">medium</level>

<level value="2">high</level>

</levels>

</categoricalvariable>
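To check that a levels block says what is intended, it can be read back with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on the employee_satisfaction example; the function name read_levels is hypothetical, and the count check is an assumption about what a careful reader of the file would want to verify, not a documented GGOBI behavior.

```python
import xml.etree.ElementTree as ET

SNIPPET = """
<categoricalvariable name="employee_satisfaction">
  <levels count="3">
    <level value="0">low</level>
    <level value="1">medium</level>
    <level value="2">high</level>
  </levels>
</categoricalvariable>
"""


def read_levels(xml_text):
    """Return the variable name and a {numeric value: level name} mapping."""
    variable = ET.fromstring(xml_text.strip())
    levels = variable.find("levels")
    mapping = {
        int(level.get("value")): level.text.strip()
        for level in levels.findall("level")
    }
    # The count attribute should agree with the number of level entries.
    if int(levels.get("count")) != len(mapping):
        raise ValueError("levels count does not match the number of <level> tags")
    return variable.get("name"), mapping


if __name__ == "__main__":
    print(read_levels(SNIPPET))
```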

To add data, the <records> tag is used; it contains the individual <record> entries.

<records count="4" color="4" glyphType="2" glyphSize="2" missingValue=".">

</records>

The records tag sets the default color, glyph type, glyph size and missing-value marker for its records. GlyphSize can range from 0 to 7, and glyphType from 0 to 6, where 0 represents a single-pixel point and 6 a filled circle.
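Putting the tags together, a small data set can be written out programmatically. The following Python sketch assembles a document from the elements described above (ggobidata, data, variables, realvariable, records, record). The function build_ggobi_xml and its parameters are hypothetical, only real-valued variables are handled, and the output has not been checked against GGOBI itself.

```python
import xml.etree.ElementTree as ET


def build_ggobi_xml(name, variables, rows, missing="NA"):
    """Build a minimal ggobidata document as a string.

    name      -- data set name (shown in the GGOBI title bar)
    variables -- list of variable names, all treated as real variables
    rows      -- list of tuples; None marks a missing value
    """
    root = ET.Element("ggobidata", count="1")
    data = ET.SubElement(root, "data", name=name)

    vars_el = ET.SubElement(data, "variables", count=str(len(variables)))
    for var_name in variables:
        ET.SubElement(vars_el, "realvariable", name=var_name)

    records = ET.SubElement(
        data, "records", count=str(len(rows)), missingValue=missing
    )
    for row in rows:
        record = ET.SubElement(records, "record")
        # Values are whitespace-separated; missing ones use the marker.
        record.text = " ".join(missing if v is None else str(v) for v in row)

    body = ET.tostring(root, encoding="unicode")
    header = '<?xml version="1.0"?>\n<!DOCTYPE ggobidata SYSTEM "ggobi.dtd">\n'
    return header + body


if __name__ == "__main__":
    print(build_ggobi_xml("employee", ["height", "weight"], [(191, 131), (185, None)]))
```

Generating the counts from the data, rather than typing them by hand, avoids the most common mistake in hand-written files: a count attribute that no longer matches the number of entries.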

Figure-1 shows the document ggobi.dtd.

Figure-1

<!ENTITY % glyphTypes " plus | x | oc | or | fc | fr | . ">
<!ENTITY % Boolean " true | T | TRUE | false | F | FALSE ">
<!ELEMENT description ( #PCDATA )>
<!ATTLIST description
source CDATA #IMPLIED>
<!ELEMENT variables ( variable|realvariable|categoricalvariable|integervariable|countervariable )*>
<!-- Allows the name to be the only entry -->
<!ELEMENT variable ( #PCDATA )>
<!ELEMENT realvariable ( (description)?,(quickHelp)? )>
<!ELEMENT categoricalvariable (description?, quickHelp?, levels, time?)?>
<!ELEMENT countervariable EMPTY>
<!ELEMENT integervariable (description?, quickHelp?, time?)>
<!ELEMENT level (#PCDATA)>
<!ATTLIST level
value CDATA #IMPLIED>
<!ELEMENT levels (level)+>
<!ATTLIST levels
count CDATA #REQUIRED>
<!ELEMENT quickHelp (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT records ( record )*>
<!ELEMENT record (#PCDATA|int|real|na|string)*>
<!ELEMENT ggobidata ((brush)?, (activeColorScheme)?, (data|edges)+)>
<!-- I don't see how a datad can have a colormap -->
<!ELEMENT data ((description)?, (colormap)?, (variables)?, records, (edges)?)>
<!ELEMENT color ( #PCDATA) >
<!-- id is a number or either of "fg" and "bg" -->
<!ATTLIST color
id CDATA #IMPLIED
r CDATA #IMPLIED
g CDATA #IMPLIED
b CDATA #IMPLIED
range CDATA #IMPLIED
name CDATA #IMPLIED>
<!ELEMENT edge EMPTY>
<!ELEMENT edges (data*, edge+)>
<!ATTLIST edges
name CDATA #IMPLIED
count CDATA #REQUIRED
data CDATA #IMPLIED>
<!ELEMENT activeColorScheme ( #PCDATA ) >
<!ATTLIST activeColorScheme
name CDATA #IMPLIED
file CDATA #IMPLIED>
<!ELEMENT colormap (color)+ >
<!ATTLIST colormap
size CDATA #REQUIRED
range CDATA #IMPLIED
file CDATA #IMPLIED
type CDATA "(xml | table)">
<!ELEMENT red (#PCDATA)>
<!ELEMENT green (#PCDATA)>
<!ELEMENT blue (#PCDATA)>
<!ATTLIST records
count CDATA #REQUIRED
missingValue CDATA #IMPLIED
color CDATA #IMPLIED
glyph CDATA #IMPLIED
glyphSize ( 0 | 1 | 2 | 3 | 4 | 5 ) #IMPLIED
glyphType ( %glyphTypes; ) #IMPLIED>
<!ATTLIST variables
count CDATA #REQUIRED>
<!ENTITY % VariableAttributes "name NMTOKEN #IMPLIED
nickname NMTOKEN #IMPLIED
transformName NMTOKEN #IMPLIED
missingValue CDATA #IMPLIED
time CDATA #IMPLIED
min CDATA #IMPLIED
max CDATA #IMPLIED">
<!ATTLIST variable
%VariableAttributes;>
<!ATTLIST integervariable
%VariableAttributes;>
<!ATTLIST countervariable
%VariableAttributes;>
<!ATTLIST realvariable
name CDATA #IMPLIED
nickname NMTOKEN #IMPLIED
quickHelp CDATA #IMPLIED
time CDATA #IMPLIED
transformName CDATA #IMPLIED
missingValue CDATA #IMPLIED
min CDATA #IMPLIED
max CDATA #IMPLIED>
<!ATTLIST categoricalvariable
name NMTOKEN #IMPLIED
nickname NMTOKEN #IMPLIED
quickHelp CDATA #IMPLIED
time CDATA #IMPLIED
transformName NMTOKEN #IMPLIED
missingValue CDATA #IMPLIED
min CDATA #IMPLIED
max CDATA #IMPLIED
levels CDATA "auto"
count CDATA #IMPLIED>
<!ATTLIST record
label CDATA #IMPLIED
id CDATA #IMPLIED
color CDATA #IMPLIED
glyph CDATA #IMPLIED
glyphSize ( 0 | 1 | 2 | 3 | 4 | 5 ) #IMPLIED
glyphType ( %glyphTypes; ) #IMPLIED
hidden ( %Boolean; ) "FALSE"
source CDATA #IMPLIED
destination CDATA #IMPLIED>
<!ATTLIST edge
source CDATA #REQUIRED
destination CDATA #REQUIRED
color CDATA #IMPLIED
width CDATA #IMPLIED>
<!ATTLIST data
name CDATA #IMPLIED
nickname NMTOKEN #IMPLIED
count CDATA #IMPLIED
missingValue CDATA #IMPLIED
color CDATA #IMPLIED
glyphSize ( 0 | 1 | 2 | 3 | 4 | 5 ) #IMPLIED
glyphType ( %glyphTypes; ) #IMPLIED>
<!-- to initialize the brushing color and glyph -->
<!ELEMENT brush ( #PCDATA )>
<!ATTLIST brush
color CDATA #IMPLIED
glyph CDATA #IMPLIED>
<!ATTLIST ggobidata
count CDATA #IMPLIED
ids (alpha) #IMPLIED>
<!ELEMENT real ( #PCDATA )>
<!ELEMENT int ( #PCDATA )>
<!ELEMENT string ( #PCDATA )>
<!ELEMENT na EMPTY>

3.3.2 GGOBI input using CSV format

GGOBI reads CSV (Comma-Separated Values) files, in which “,” separates the fields and a carriage return acts as the record delimiter. This is the format made popular by Excel; we can save an Excel file with the .csv extension. CSV files are extremely basic: they do not allow color and glyph specification, and they cannot be used for linking data sets or displaying graphs.

      Height  Weight  Salary  HR   DA  Bonus  Company
 1    191     131     4000    150  15  104    IBM
 2    185     134     5000    147  13  105    IBM
 3    200     137     6000    144  14  102    IBM
 4    173     127     3000    144  16  97     IBM
 5    171     118     7000    153  13  106    IBM
 6    160     118     4500    140  15  99     IBM
 7    188     134     5500    151  14  98     IBM
 8    186     129     6060    143  14  110    AIRTEL
 9    174     131     3030    144  14  116    AIRTEL
10    163     115     4020    142  15  95     AIRTEL
11    190     143     1000    141  13  99     AIRTEL
12    174     131     2500    150  15  105    AIRTEL
13    201     130     3400    148  13  110    AIRTEL
14    190     133     6570    154  15  106    AIRTEL
15    182     130     3525    147  14  105    TOYOTO
16    184     131     6753    137  14  95     TOYOTO
17    177     127     5686    134  15  105    TOYOTO
18    178     126     9776    157  14  116    TOYOTO
19    210     140     2522    149  13  107    TOYOTO
20    182     121     4363    147  13  111    TOYOTO
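As a rough illustration of what such a file looks like on disk, the Python sketch below reads three of the rows above from CSV text using the standard csv module. It assumes that the first, unnamed column holds the row label, as the table suggests; the function name read_employee_csv is hypothetical.

```python
import csv
import io

# Three rows of the employee table written as CSV text. The empty first
# header field is an assumption: it stands for the unnamed row-label column.
CSV_TEXT = """,Height,Weight,Salary,HR,DA,Bonus,Company
1,191,131,4000,150,15,104,IBM
8,186,129,6060,143,14,110,AIRTEL
15,182,130,3525,147,14,105,TOYOTO
"""


def read_employee_csv(text):
    """Return (variable names, rows); each row is a dict with a 'label' key."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    names = header[1:]  # skip the unnamed label column
    rows = []
    for fields in reader:
        if not fields:  # ignore blank lines
            continue
        row = {"label": fields[0]}
        for name, value in zip(names, fields[1:]):
            # Numeric fields become integers; text such as the company stays a string.
            row[name] = int(value) if value.isdigit() else value
        rows.append(row)
    return names, rows


if __name__ == "__main__":
    names, rows = read_employee_csv(CSV_TEXT)
    print(names)
    print(rows[0])
```

Note that nothing in the file says which columns are categorical or how a record should be drawn; that information has to come from the XML format.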

3.4. Data Visualization in GGOBI

Consider the document test.xml displayed in figure-2.

Figure-2

<!DOCTYPE ggobidata SYSTEM "ggobi.dtd">
<ggobidata>
<brush color="6" glyph="fc 3"/>
<data name="employee_detail">
<description source="ftp://...">
Data comparison between employee details of three different companies.
</description>
<variables count="7">
<realvariable name="hight" nickname="t1">
<description>Height of the employee
</description>
<quickHelp>Height of the employee</quickHelp>
</realvariable>
<realvariable name="weight" nickname="t2"/>
<realvariable name="salary"/>
<realvariable name="HR" nickname="a1"/>
<realvariable name="DA" nickname="a2"/>
<realvariable name="Bonus" nickname="a3"/>
<categoricalvariable name="Companies">
<levels count="3">
<level value="1"> IBM </level>
<level value="2"> Airtel </level>
<level value="3"> TOYOTO</level>
</levels>
</categoricalvariable>
</variables>
<records count="20" color="2" missingValue="NA">
<record label="IBM" color="2">
191 131 5300 150 15 104 1
</record>
<record label="IBM" color="2">
185 134 5000 147 13 105 1
</record>
<record label="IBM" color="2">
200 137 4546 144 14 102 1
</record>
<record label="IBM" color="2">
173 127 4636 144 16 97 1
</record>
<record label="IBM" color="2">
171 118 4976 153 13 106 1
</record>
<record label="IBM" color="2">
160 118 4788 140 15 99 1
</record>
<record label="IBM" color="2">
188 134 5000 151 14 98 1
</record>
<record label="Airtel" color="5">
186 129 5768 143 14 110 2
</record>
<record label="Airtel" color="5">
174 131 1200 144 14 116 2
</record>
<record label="Airtel" color="5">
163 115 3000 142 15 95 2
</record>
<record label="Airtel" color="5">
190 143 3245 141 13 99 2
</record>
<record label="Airtel" color="5">
174 131 4677 150 15 105 2
</record>
<record label="Airtel" color="5">
201 130 3552 148 13 110 2
</record>
<record label="Airtel" color="5">
190 133 5397 154 15 106 2
</record>
<record label="TOYOTO" color="1">
182 130 7000 147 14 105 3
</record>
<record label="TOYOTO" color="1">
184 131 2500 137 14 95 3
</record>
<record label="TOYOTO" color="1">
177 127 3678 134 15 105 3
</record>
<record label="TOYOTO" color="1">
178 126 5685 157 14 116 3
</record>
<record label="TOYOTO" color="1">
210 140 5487 149 13 107 3
</record>
<record label="TOYOTO" color="1">
182 121 5151 147 13 111 3
</record>
</records>
</data>
</ggobidata>
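Before loading a file such as test.xml into GGOBI, it can be useful to read it back and confirm that every record lines up with the declared variables. The Python sketch below does this for a cut-down document with three variables and three records in the same layout as Figure-2. The function name load_records is hypothetical, and the way categorical values are resolved here (looking the numeric code up in the levels) is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# A cut-down document in the layout of Figure-2 (three variables, three records).
DOC = """<ggobidata>
<data name="employee_detail">
<variables count="3">
<realvariable name="hight"/>
<realvariable name="salary"/>
<categoricalvariable name="Companies">
<levels count="3">
<level value="1"> IBM </level>
<level value="2"> Airtel </level>
<level value="3"> TOYOTO</level>
</levels>
</categoricalvariable>
</variables>
<records count="3" missingValue="NA">
<record label="IBM" color="2">191 5300 1</record>
<record label="Airtel" color="5">186 5768 2</record>
<record label="TOYOTO" color="1">182 NA 3</record>
</records>
</data>
</ggobidata>"""


def load_records(xml_text):
    """Return one dict per record, keyed by variable name."""
    data = ET.fromstring(xml_text).find("data")
    missing = data.find("records").get("missingValue")

    names, levels = [], {}
    for var in data.find("variables"):
        names.append(var.get("name"))
        if var.tag == "categoricalvariable":
            # Map each level's numeric code to its (trimmed) name.
            levels[var.get("name")] = {
                lv.get("value"): lv.text.strip() for lv in var.iter("level")
            }

    rows = []
    for rec in data.iter("record"):
        tokens = rec.text.split()
        if len(tokens) != len(names):
            raise ValueError("record does not match the declared variables")
        row = {}
        for name, token in zip(names, tokens):
            if token == missing:
                row[name] = None
            elif name in levels:
                row[name] = levels[name][token]
            else:
                row[name] = float(token)
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in load_records(DOC):
        print(row)
```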

Figure: 3 – Scatter plot display of the employee data