The Voluminous Amount of Data Now Available, and Ever Increasing, Requires New Techniques

Dissertation Idea Paper

Gregory A. Vaughn Sr.

Research Question

Is there a set of factors that provides association rules for the best combination of cube elements? Can a multidimensional data model that uses an adaptive piecewise constant approximation or linear regression to reduce sparsity satisfy these rules? Can these associations be visualized? Should the next level of modeling also incorporate the encapsulation concept of the object-oriented model to support recent trends in distributed computing?

Data Mining and Computing

Contrary to purely retrieval efforts, data mining “looks for relations and associations between phenomenon that are not known beforehand” [5]. Data mining (also known as Knowledge Discovery in Databases, KDD) has been defined as “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data” [6]. Data mining goes beyond statistical data analyses and computational algorithms and aims to be a major part of the business intelligence that supports business decisions.

Research databases, whether from business organizations, universities, or research centers, are usually created from transactional data. More often than not, these databases contain information that goes undetected and unused by the organizations that own and maintain them. KDD looks to unearth this hidden yet meaningful information. A question that naturally follows is whether there are best methods or practices for discovering this undiscovered wealth.

KDD is seen by some [13] as being a different activity from statistical analysis, one which takes into account “non-statistical” issues. “Examples of “non-statistical” issues in KDD include the following:

1. Data cleaning

What can be done to locate and ameliorate the pervasive problems of invalid or incomplete data?

2. “First cut” analysis

What can be done to automatically provide an initial assessment of the patterns and potentially useful or interesting knowledge in a database? The aim here is, realistically, to automate some of the basic work that is now done by skilled human analysts.

3. Hypothesis generation

What can be done to support, or even automate, the finding of plausible hypotheses in the data? Found hypotheses would, of course, need to be tested subsequently with statistical techniques, but where do you get “the contenders” in the first place?”

Some researchers feel that efforts should be focused on “the hypothesis generation problem for KDD. Because hypothesis space is generally quite large…, it is normally quite impossible to enumerate and investigate all the potentially interesting hypotheses.”

However, better results may be obtained by applying older statistical techniques to the first two issues before any meaningful hypothesis generation begins.

Information retrieval from large or small data repositories is often based on presupposed ideas about the data, with a plan for extracting information that is “exogenous from the extraction itself” [5]. An example might be a request from the FDA to a federally funded clinic for data on whether patients who received Metformin XR, Glipizide, and Actos in combination were afforded a better diabetic treatment regimen. The FDA may have determined that there is some relationship involving the administration of these drugs in combination and, based on the summarized results, may initiate further investigation of these patients. In such cases, data/attribute sparsity must be resolved (perhaps through linear regression prediction).

The tools and techniques involved in the data mining process take center stage. In the multi-dimensional modeling tools used to visualize the data to be mined, e.g. online analytical processing (OLAP) tools, the data usually appear in the form of a view, that is, a derived relation defined in terms of base relations [17]. A view is materialized by instantiating its tuples and storing them in the database. The benefit gained is access speed, especially when the results are the product of complex computations.

…………………In Progress

Multi-Dimensional Modeling

As described by Kimball (1997), “The dimensional model adheres to a discipline that incorporates the relational model with restrictions. The dimensional model is composed of a table called a fact table that has multi-part keys and a set of smaller tables that are called dimension tables. The dimension tables have a single-part primary key that relates to only one of the components of the multi-part keys in the fact table. This structure is known as the "star join" and dates back to the earliest days of relational databases” [15, 16].
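
The structure Kimball describes can be sketched with Python's built-in sqlite3 module. This is a minimal illustration only: the table names (dim_area, dim_product, fact_sales), the column names, and the figures are assumptions made for this sketch and do not come from the cited sources.

```python
import sqlite3

# Hypothetical star schema: one fact table with a multi-part key and two
# dimension tables with single-part primary keys. All names and values
# are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_area    (area_id INTEGER PRIMARY KEY, area_name TEXT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE fact_sales (
        area_id    INTEGER REFERENCES dim_area(area_id),
        product_id INTEGER REFERENCES dim_product(product_id),
        units      INTEGER,
        PRIMARY KEY (area_id, product_id)  -- the multi-part key
    );
    INSERT INTO dim_area VALUES (1, 'Brooklyn'), (2, 'Queens');
    INSERT INTO dim_product VALUES (1, 'Scotch'), (2, 'Vodka');
    INSERT INTO fact_sales VALUES (1, 1, 10), (1, 2, 5), (2, 1, 7), (2, 2, 3);
""")

# The "star join": the fact table is joined to each dimension table on
# exactly one component of its multi-part key.
star_join = conn.execute("""
    SELECT a.area_name, p.product_name, f.units
    FROM fact_sales AS f
    JOIN dim_area    AS a ON a.area_id = f.area_id
    JOIN dim_product AS p ON p.product_id = f.product_id
    ORDER BY a.area_name, p.product_name
""").fetchall()

for row in star_join:
    print(row)
```

Each dimension table relates to only one component of the fact table's key, which is what gives the schema its star shape.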

Multi-dimensional modeling presents data “as facts with associated numerical measures, dimensional tables as mentioned above, or as textual dimensions of the facts” [10]. In the case of treatment for a given disease, dosage and frequency would be measures, while laboratory/drug company or regional location would form the dimensions. Researchers at SAP define multidimensional modeling in terms of the goals to be achieved [14].

“The overarching goals of multi-dimensional models are:

To present information to the end-user in a way that corresponds to his normal understanding of his business, i.e. to show the KPIs, key figures or facts from the different perspectives that influence them (sales organization, product/material or time). In other words, to deliver structured information that the end-user can easily navigate by using any possible combination of business terms to illustrate the behavior of the KPIs.

To offer the basis for a physical implementation that the software recognizes (the OLAP engine), thus allowing a program to easily access the data required.

The Multi-Dimensional Model (MDM) has been introduced in order to achieve the first. The most popular physical implementation of multi-dimensional models on relational database system-based data warehouses is the Star schema implementation. SAP BW uses the Star schema approach and extends it to support integration within the data warehouse, to offer easy handling and allow high performance solutions”.

The steps necessary to accomplish the modeling include:

Complete understanding of the underlying processes that generate the data
Create a desired schema
Create a cube description

In progress ---- According to Hacid and Sattler, herein lies the strength of multidimensional databases.

Multi-Dimensional Data (Data Cubes)

The voluminous amount of data now available, and ever increasing, requires new techniques for discovery of information for decision making. The goal of the data spelunker is to find unusual patterns that may yield previously unrecognized information. Traditional methods have used techniques that focus on data in a two-dimensional plane. Current methods involving data cubes offer a new view of data that may afford many more decision-making opportunities. The relational model of data storage and retrieval is the standard of the day, but tables and rows by their very nature limit the dimensionality that may naturally exist in the data.

“Data cubes are multidimensional extensions of 2-D tables, just as in geometry a cube is a three-dimensional extension of a square. The word cube brings to mind a 3-D object, and we can think of a 3-D data cube as being a set of similarly structured 2-D tables stacked on top of one another” [1]. Data cubes can be constructed with many more dimensions while still affording single-dimension indexing and query; they provide additional views of the data and consequently many more decision points.

Multi-dimensional databases with data cubes, instead of presenting data to the user in the form of tables, present it in a form that can be manipulated by operators that cut out pieces from large cubes, change the granularity of dimensions, and turn cubes [4].
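
The three kinds of operators mentioned above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the formalism of [4]: the function names slice_cube, roll_up, and pivot, the dictionary representation of the cube, and all member names and figures are hypothetical.

```python
from collections import defaultdict

# Illustrative cube: each cell maps (quarter, area, product) -> units sold.
# Member names and figures are invented for this sketch.
DIMS = ("quarter", "area", "product")
cube = {
    ("Q1", "Brooklyn", "Scotch"): 4,
    ("Q1", "Queens", "Scotch"): 2,
    ("Q2", "Brooklyn", "Scotch"): 6,
    ("Q2", "Brooklyn", "Vodka"): 3,
    ("Q2", "Queens", "Vodka"): 5,
}

def slice_cube(cube, dim, member):
    """Cut out the piece of the cube where one dimension equals one member."""
    i = DIMS.index(dim)
    return {key: value for key, value in cube.items() if key[i] == member}

def roll_up(cube, dim, mapping):
    """Change granularity: map members of one dimension to coarser parents."""
    i = DIMS.index(dim)
    out = defaultdict(int)
    for key, value in cube.items():
        coarse_key = key[:i] + (mapping[key[i]],) + key[i + 1:]
        out[coarse_key] += value
    return dict(out)

def pivot(cube, order):
    """Turn the cube by reordering its dimensions, e.g. order=(2, 1, 0)."""
    return {tuple(key[i] for i in order): value for key, value in cube.items()}

brooklyn = slice_cube(cube, "area", "Brooklyn")
by_half = roll_up(cube, "quarter", {"Q1": "H1", "Q2": "H1"})
turned = pivot(cube, (2, 1, 0))
```

Here by_half merges both quarters into a single half-year member, so cells that differ only by quarter are summed together.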

Sample Table 1.

Given three dimensions X, Y, and Z, let X = a particular year of sales (2004), let Y = an area of sales, and let Z = a particular product.

X = year (2004)

Y = area (Brooklyn, Queens, Bronx, Manhattan)

Z = product (Scotch, Bourbon, Cognac, Vodka)

The cube is a set of cells, and a cell represents the association of a measure with one member in each dimension. A cube representing X, Y, and Z would look like the following:

With this kind of multidimensional representation, data can be viewed along each of the dimensions and aggregates derived for each dimension, e.g. (sales is an assumed name for the measure column):

SELECT X, SUM(sales) FROM DataCube
GROUP BY X;

SELECT Y, SUM(sales) FROM DataCube
GROUP BY Y;

SELECT Z, SUM(sales) FROM DataCube
GROUP BY Z;

SELECT X, Y, Z, SUM(sales) FROM DataCube
GROUP BY GROUPING SETS ((X), (Y), (Z));

or some variant.
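
The per-dimension aggregates described above can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the table name data_cube, the measure column sales, and the figures are invented, and because SQLite does not support GROUPING SETS, the combined query is emulated with UNION ALL.

```python
import sqlite3

# Hypothetical cube stored as a flat table; names and figures are assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_cube (x INTEGER, y TEXT, z TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO data_cube VALUES (?, ?, ?, ?)",
    [
        (2004, "Brooklyn", "Scotch", 10),
        (2004, "Brooklyn", "Vodka", 5),
        (2004, "Queens", "Scotch", 7),
        (2004, "Bronx", "Cognac", 2),
    ],
)

def totals_by(column):
    """Return {member: total sales} for one dimension column."""
    # The column name is interpolated into SQL, so allow only known names.
    if column not in ("x", "y", "z"):
        raise ValueError(f"unknown dimension: {column}")
    cursor = conn.execute(
        f"SELECT {column}, SUM(sales) FROM data_cube GROUP BY {column}"
    )
    return dict(cursor.fetchall())

by_x = totals_by("x")
by_y = totals_by("y")
by_z = totals_by("z")

# Emulation of GROUP BY GROUPING SETS ((x), (y), (z)): one grouping per
# dimension, concatenated into a single result set.
grouping_sets = conn.execute("""
    SELECT 'x' AS dim, CAST(x AS TEXT) AS member, SUM(sales)
    FROM data_cube GROUP BY x
    UNION ALL
    SELECT 'y', y, SUM(sales) FROM data_cube GROUP BY y
    UNION ALL
    SELECT 'z', z, SUM(sales) FROM data_cube GROUP BY z
""").fetchall()
```

On a database that implements GROUPING SETS, the UNION ALL query could be replaced by a single GROUP BY GROUPING SETS clause.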

[3] “There are inherent features of the multidimensional model that make it an appropriate environment for business intelligence. The multidimensional model:

Enforces referential integrity. Each dimension member is unique and cannot be NA. If a measure has three dimensions, then each data value of that measure must be qualified by a member of each dimension.
Promotes consistency. Dimensions are maintained as separate workspace objects and are shared by measures.
Preserves the order of data. Each dimension has a default status list, which contains all of its members in the order they are stored. The default status list is always the same unless it is purposefully altered by adding, deleting, or moving members. Within a session, the user can change the selection and order of the status list; this is called the current status list. The current status list remains the same until the user purposefully alters it by adding, removing, or changing the order of its members.

Because the order of dimension members is consistent and known, the selection of members can be relative. For example, the function call

lag (sales, 12, month) compares the sales values of all months in the current status list against sales from a year ago (that is, 12 time periods earlier in the default status list for the month dimension).

Presents data as fully solved. Applications do not need to define calculations. Because of the combination of power and ease-of-use of the OLAP DML, the analytic workspace can be prepared so that the data is presented as fully solved to the application.
Manages calculated members and measures transparently. Users can define their own dimension members (often called custom aggregates), that function identically to the other dimension members and can be used transparently in any calculation. Similarly, users can define their own measures and assign values to them using any of the methods available in the OLAP DML. Throughout the session, these additions behave identically to the dimension members and objects originally provided in the workspace. Users can save their changes from one session to the next with a single DML command.”
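
The relative selection in the lag example quoted above depends only on the dimension members having a stable order. A minimal Python analogue follows; it is not OLAP DML, and the function lag, the list sales, and the figures are assumptions made for illustration.

```python
def lag(values, n):
    """For each position, return the value n periods earlier (None if absent)."""
    return [values[i - n] if i >= n else None for i in range(len(values))]

# Hypothetical monthly sales for 24 consecutive months, held in the fixed
# order of the month dimension. The figures are invented.
sales = [100 + 5 * month for month in range(24)]

# Sales from a year ago, aligned with the current months.
year_ago = lag(sales, 12)

# Year-over-year change; undefined for the first 12 months.
change = [
    None if previous is None else current - previous
    for current, previous in zip(sales, year_ago)
]
```

Because the comparison is positional, it is only meaningful when the member order is consistent and known, which is the property the quoted passage emphasizes.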

The process for this type of mining is constant, as outlined by Gray et al. [2]: 1) formulating a query that extracts relevant data from a large database; 2) extracting the aggregated data from the database into a file or table; 3) visualizing the results in a graphical way; and 4) analyzing the results and formulating a new query.

Materialized Views

A relatively new data structure, data cubes are far more complex than their earlier, purely two-dimensional relational siblings, and it is that much more difficult to fathom them and extract meaningful information from them. For this reason, materialized views of this complex data provide a better means of access and decision reporting. “A materialized view (summary table) can be thought of as a special kind of view, which physically exists inside the database, it can contain joins and or aggregates and exists to improve query execution time by pre-calculating expensive joins and aggregation operations prior to execution” [3].
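
The idea of a summary table that physically stores a pre-calculated aggregate can be sketched with Python's built-in sqlite3 module. SQLite has no materialized view statement, so this sketch substitutes an ordinary table built with CREATE TABLE ... AS SELECT and refreshed by hand. The table names, the refresh function, and the figures are assumptions for illustration and are not Oracle's mechanism.

```python
import sqlite3

# Hypothetical base table; names and figures are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (area TEXT, product TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Brooklyn", "Scotch", 10), ("Brooklyn", "Vodka", 5), ("Queens", "Scotch", 7)],
)

# Stand-in for a materialized view: the aggregate is computed once and stored.
conn.execute("""
    CREATE TABLE sales_by_area AS
    SELECT area, SUM(units) AS total_units FROM sales GROUP BY area
""")

def read_summary():
    """Read the stored aggregate without touching the base table."""
    return dict(conn.execute("SELECT area, total_units FROM sales_by_area"))

def refresh():
    """Recompute the stored aggregate from the base table."""
    conn.execute("DELETE FROM sales_by_area")
    conn.execute("""
        INSERT INTO sales_by_area
        SELECT area, SUM(units) FROM sales GROUP BY area
    """)

summary = read_summary()

# A change to the base table is not visible in the summary until a refresh.
conn.execute("INSERT INTO sales VALUES ('Queens', 'Vodka', 4)")
stale = read_summary()
refresh()
fresh = read_summary()
```

The trade-off shown here is the usual one: reads avoid the join and aggregation cost, at the price of keeping the stored result up to date.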

In progress ………

Past Research

In progress ……..

Why Use Regression Analysis?

A number of studies have used regression techniques to derive a predictive model for single or multiple response variables on the basis of one or more of the other variables describing columnar data entries, and have found the technique to produce less error than other available methods [8, 9, 10].
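
As a rough illustration of how regression might reduce sparsity, the sketch below fits an ordinary least squares line to the observed cells along one dimension and uses it to estimate the empty ones. It is a simplified stand-in, not the method of the cited studies: the function fit_line, the list cells, and the values are assumptions, and the data are chosen to be exactly linear so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical measure values along one cube dimension; None marks an
# empty (sparse) cell. The values are invented for this sketch.
cells = [2.0, 4.0, None, 8.0, None, 12.0]

# Fit only on the cells that hold a value.
known = [(i, v) for i, v in enumerate(cells) if v is not None]
a, b = fit_line([i for i, _ in known], [v for _, v in known])

# Keep observed values; replace empty cells with the regression estimate.
filled = [v if v is not None else a + b * i for i, v in enumerate(cells)]
```

With real data the estimates would carry error, so any filled cell should be flagged as predicted rather than observed.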

This Research

This study will test the efficiency of this model in discovering new associative or correlational information versus a more traditional method.

Objectives - To assess the predictability of type 2 diabetes onset from undiscovered physio/environmental predicates.

- To determine if association rule mining can discover strong association or correlation relationships between predicates [7].
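
Whether a rule between predicates counts as strong is conventionally judged by its support and confidence. The sketch below computes both for one candidate rule. The records and the predicate names are entirely invented for illustration and are not clinical data or findings of this study.

```python
# Hypothetical patient records, each a set of predicates that hold for
# that patient. These records are made up for this sketch.
records = [
    {"obese", "sedentary", "diabetic"},
    {"obese", "diabetic"},
    {"obese", "sedentary"},
    {"sedentary"},
    {"obese", "sedentary", "diabetic"},
]

def support(itemset):
    """Fraction of records that contain every predicate in itemset."""
    return sum(1 for record in records if itemset <= record) / len(records)

def confidence(antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

# Candidate rule: {obese, sedentary} => {diabetic}
antecedent = {"obese", "sedentary"}
consequent = {"diabetic"}
rule_support = support(antecedent | consequent)
rule_confidence = confidence(antecedent, consequent)

# A rule is usually called strong when both measures meet chosen minimum
# thresholds; the thresholds here are arbitrary.
is_strong = rule_support >= 0.3 and rule_confidence >= 0.6
```

In this toy data the rule holds in two of five records and in two of the three records that match the antecedent.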

Hypotheses - 1) The multidimensional model/technique will disclose new relationships and subsequently new predicates for detecting and treating diabetes.

References

Russell Kay, “Data Cubes”, Computerworld, March 29, 2004.
J. Gray, et al., “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals”, Data Mining and Knowledge Discovery 1, 29-53 (1997), Kluwer Academic Publishers.
“Oracle9i Materialized Views”, An Oracle White Paper, May 2001.
M. S. Hacid and U. Sattler, “Modeling Multidimensional Databases: A Formal Object-Centered Approach”, Proc. of the Sixth European Conference on Information Systems (ECIS98), 1998.
Paolo Giudici, “Applied Data Mining: Statistical Methods for Business and Industry”, John Wiley and Sons, 2003.
W. Frawley, G. Piatetsky-Shapiro, and C. Matheus, “Knowledge Discovery in Databases: An Overview”, AI Magazine, Fall 1992, pp. 213-228.
Hua Zhu, “On-Line Analytical Mining of Association Rules”, Thesis, Simon Fraser University, 1998.
Daniel Barbara and Mark Sullivan, “Quasi-Cubes: A space efficient way to support approximate multidimensional databases”, 1998.
S. Abad-Mota, “Approximate Query Processing with Summary Tables in Statistical Databases”, in Proceedings of the 3rd Int’l Conference on Extending Database Technology, Vienna, Austria, March 1992.
Paolo Giudici, “Applied Data Mining: Statistical Methods for Business and Industry”, John Wiley and Sons Ltd, 2003.
Torben Bach Pedersen and Christian S. Jensen, “Multidimensional Database Technology”, Dec 2001, Aalborg University, IEEE Distributed Systems Online, computer.org/dsonline
K. Balachandran, J. Buzydlowski, G. Dworman, S.O. Kimbrough, T. Shafer, and W. Vachula, “MOTC: An Interactive Aid for Multidimensional Hypothesis Generation”.

“Multi-Dimensional Modeling with BW”, ASAP for BW Accelerator, Business Information Warehouse, SAP America Inc. and SAP AG.

15. Kimball, Ralph. "A Dimensional Modeling Manifesto", DBMS. 10(9). 1997 Aug.

16. "Star Schemas and STARjoin™ Technology", A Red Brick Systems White Paper.

17. Ashish Gupta and Inderpal Singh Mumick, “What is the Data Warehousing Problem? (Are Materialized Views the Answer)”, VLDB 1996: 602, ww.sigmod.org/vldb/conf/1996/P602.PDF


1. Date, C. J. "A Fruitful Union", Computerworld. 27(24): 130. 1994 Jun 14.

Raden, Neil. "Modeling the Data Warehouse", Manuscript of an article by Neil Raden that was excerpted in the January 29, 1996 issue of Information Week,