Overview of Various Technical Aspects in SDWH

in partnership with

Title: / Overview of various technical aspects in SDWH
WP: / 3 / Deliverable: / 3.4
Version: / 3.0 (final) / Date: / September 2013
Author: / Valerij Žavoronok / NSI: / Statistics Lithuania

ESS - NET

On micro data linking and data warehousing
in production OF BUSINESS STATISTICS

Overview of various technical aspects in SDWH

Lina Amšiejūtė, Maksim Lata,Valerij Žavoronok

Lithuania

Summary

1.Introduction

2.Access layer

3.Interpretation and Data Analysis layer

4.Integration layer

5.Source Layer

6.Common models and approaches

7.Conclusion

1.Introduction

This review is intended as an overview of software packages existing on the market or developed on request in NSIs in order to describe the solutions that would meet NSI needs, implement SDWH concept and provide the necessary functionality for each SDWH level given in 3.1 deliverable.

In a generic SDWH system we identified four functional layers, starting from the most detailed bottom level up to the top of the SDWH architecture where conceptual level is placed. The ground level corresponds to the area where the process starts, while the top of the pile is where the data warehousing process finishes. There are:

IV - access layer intended for the final presentation, dissemination and delivery of the information that we need. In this layer the data organization must support automatic dissemination systems and free analysts but always statistical information is macro data.

III - interpretation and data analysis layer is specifically for statisticians and enables any data analysis, data mining and support to design production processes or data reuse.

II - integration layer is where all operational activities needed for any statistical production process are carried out. The integration process includes micro data record linkage and metadata handling which can be realized before or after any review or editing activities, in function of the statistical process.

I - source layer is the level in which we locate all the activities related to storing and managing internal (surveys) or external (archives) raw data sources.

This reflects a conceptual organization in which we consider the first two levels as operational IT infrastructures and the last two layers as the effective data warehouse. The first two levels have to support processes with a high numbers of transactional updates while the last two layers are for data manipulations. Depends on its functionalities we consider the use of different tools to achieve goals.

2.Access layer

The principal purpose of data warehouse is to provide information to its users for strategic decision-making. These users interact with the warehouse throughout Access layer using end user access tools. The examples of some of the end user access tools can be:

Specialised Business Intelligence Toolsfor data access

Business intelligence tools are a type of software that is designed to retrieve, analyse and report data. This broad definition includes everything from Reporting and Query Tools, Application Development Tools to Visual Analytics Software, Navigational Tools(OLAP viewers). The main makers of business intelligence tools are:

Oracle
Microsoft
SAS Institute
SAP
Tableau
IBM Cognos
QlikView

Office Automation Tools (for regular productivity and collaboration instruments)

By Office automation tools we understand all software programs which make it possible to meet office needs. In particular, an office suite therefore usually contains following software programs: word processing, a spreadsheet, a presentation tool, a database, a scheduler.One of the most common office automation tools around:

Microsoft Office
Corel WordPerfect
iWork
IBM‘s Lotus SmartSuite
OpenOffice (open source/freeware).

Graphics and Publishing tools

Graphics and publishing tools are tools with ability to create one or more infographics from a provided data set or to visualize information.There are a vast variety of tools and software to create any kind of information graphics, depending on the organizations needs:

PSPP

PSPP is a free software application for analysis of sampled data, intended as a free alternative for IBM SPSS Statistics. It has a graphical user interface and conventional command-line interface. It is written in C, uses GNU Scientific Library for its mathematical routines, and plotutils for generating graphs. This software provides a basic set of capabilities: frequencies, cross-tabs comparison of means (T-tests and one-way ANOVA); linear regression, reliability (Cronbach's Alpha, not failure or Weibull), and re-ordering data, non-parametric tests, factor analysis and more.At the user's choice, statistical output and graphics are done in ASCII, PDF, PostScript or HTML formats. A limited range of statistical graphs can be produced, such as histograms, pie-charts and np-charts.PSPP can import Gnumeric, OpenDocument and Excel spreadsheets, Postgres databases, comma-separated values- and ASCII-files. It can export files in the SPSS 'portable' and 'system' file formats and to ASCII files. Some of the libraries used by PSPP can be accessed programmatically; PSPP-Perl provides an interface to the libraries used by PSPP.

SAS is a most known integrated system of software products provided by SAS Institute Inc., which enables programmers to perform: information retrieval and data management, report writing and graphics, statistical analysis and data mining, forecasting, Operations research and project management, Quality improvement, Applications development, Data warehousing (extract, transform, load), Platform independent and remote computing. SAS is driven by SAS programs, which define a sequence of operations to be performed on data stored as tables. Although non-programmer graphical user interfaces to SAS exist (such as the SAS Enterprise Guide), these GUIs are most often merely a front-end that automates or facilitates the generation of SAS programs. The functionalities of SAS components are intended to be accessed via application programming interfaces, in the form of statements and procedures.SAS Library Engines and Remote Library. SAS has an extensive SQL procedure, allowing SQL programmers to use the system with little additional knowledge.SAS runs on IBM mainframes, Unix, Linux, OpenVMS Alpha, and Microsoft Windows. SAS consists of a number of components which organizations can separately license and install as required.

SPSS

SPSS Statistics is a software package used for statistical analysis, officially named "IBM SPSS Statistics". Companion products in the same family are used for survey authoring and deployment (IBM SPSS Data Collection), data mining (IBM SPSS Modeler), text analytics, and collaboration and deployment (batch and automated scoring services).SPSS is among the most widely used programs for statistical analysis in social science. It is used by market researchers, health researchers, survey companies, government, education researchers, marketing organizations and others.

The many features of SPSS are accessible via pull-down menus or can be programmed with a proprietary 4GL command syntax language. Command syntax programming has the benefits of reproducibility, simplifying repetitive tasks, and handling complex data manipulations and analyses. Additionally, some complex applications can only be programmed in syntax and are not accessible through the menu structure. The pull-down menu interface also generates command syntax; this can be displayed in the output, although the default settings have to be changed to make the syntax visible to the user. They can also be pasted into a syntax file using the "paste" button present in each menu. Programs can be run interactively or unattended, using the supplied Production Job Facility. Additionally a "macro" language can be used to write command language subroutines and a Python programmability extension can access the information in the data dictionary and data and dynamically build command syntax programs. The Python programmability extension, introduced in SPSS 14, replaced the less functional SAX Basic "scripts" for most purposes, although SaxBasic remains available. In addition, the Python extension allows SPSS to run any of the statistics in the free software package R. From version 14 onwards SPSS can be driven externally by a Python or a VB.NET program using supplied "plug-ins".SPSS can read and write data from ASCII text files (including hierarchical files), other statistics packages, spreadsheets and databases. SPSS can read and write to external relational database tables via ODBC and SQL.Statistical output is to a proprietary file format (*.spv file, supporting pivot tables) for which, in addition to the in-package viewer, a stand-alone reader can be downloaded. The proprietary output can be exported to text or Microsoft Word, PDF, Excel, and other formats. Alternatively, output can be captured as data (using the OMS command), as text, tab-delimited text, PDF, XLS, HTML, XML, SPSS dataset or a variety of graphic image formats (JPEG, PNG, BMP and EMF).

Stata

Stata is a general-purpose statistical software package created by StataCorp. It is used by many businesses and academic institutions around the world. Stata's capabilities include data management, statistical analysis, graphics, simulations, and custom programming.Stata has always emphasized a command-line interface, which facilitates replicable analyses. Starting with version 8.0, however, Stata has included a graphical user interface which uses menus and dialog boxes to give access to nearly all built-in commands. This generates code which is always displayed, easing the transition to the command line interface and more flexible scripting language. The dataset can be viewed or edited in spreadsheet format. From version 11 on, other commands can be executed while the data browser or editor is opened.Stata can import data in a variety of formats. This includes ASCII data formats (such as CSV or databank formats) and spreadsheet formats (including various Excel formats).Stata's proprietary file formats are platform independent, so users of different operating systems can easily exchange datasets and programs.

Statistical Lab

The computer program Statistical Lab (Statistiklabor) is an explorative and interactive toolbox for statistical analysis and visualization of data. It supports educational applications of statistics in business sciences, economics, social sciences and humanities. The program is developed and constantly advanced by the Center for Digital Systems of the Free University of Berlin. Their website states that the source code is available to private users under the GPL. Simple or complex statistical problems can be simulated, edited and solved individually with the Statistical Lab. It can be extended by using external libraries. Via these libraries, it can also be adapted to individual and local demands like specific target groups. The versatile graphical diagrams allow demonstrative visualization of underlying data.The Statistical Lab is the successor of Statistik interaktiv!. In contrast to the commercial SPSS the Statistical Lab is didactically driven. It is focused on providing facilities for users with little statistical experience. It combines data frames, contingency tables, random numbers, matrices in a user friendly virtual worksheet. This worksheet allows users to explore the possibilities of calculations, analysis, simulations and manipulation of data.For mathematical calculations, the Statistical Lab uses the Engine R, which is a free implementation of the language S Plus. The R-Project is constantly being improved by worldwide community of Developers.

STATISTICA

STATISTICA is a suite of analytics software products and solutions provided by StatSoft. The software includes an array of data analysis, data management, data visualization, and data mining procedures; as well as a variety of predictive modeling, clustering, classification, and exploratory techniques. Additional techniques are available through integration with the free, open source R programming environment. Different packages of analytical techniques are available in six product lines: Desktop, Data Mining, Enterprise, Web-Based, Connectivity and Data Integration Solutions, and Power Solutions.

STATISTICA includes analytic and exploratory graphs in addition to standard 2- and 3-dimensional graphs. Brushing actions (interactive labeling, marking, and data exclusion) allow for investigation of outliers and exploratory data analysis.Operation of the software typically involves loading a table of data and applying statistical functions from pull-down menus or (in versions starting from 9.0) from the ribbon bar. The menus then prompt for the variables to be included and the type of analysis required. It is not necessary to type command prompts. Each analysis may include graphical or tabular output and is stored in a separate workbook.

Web services tools (M2M)

Stylus Studio

Stylus Studio has many different components like a powerful Web Service Call Composer that enables you to locate and invoke Web service methods directly from within Stylus Studio XML IDE. Stylus Studio‘s Web Service Call composer supports all of the core Web service technologies like Web Service Description Language (WSDL), Simple Object Access Protocol (SOAP), and Universal Description Discovery and Integration (UDDI) – and is an ideal Web services tool for testing Web services, inspecting WSDL files, generating SOAP envelopes, and automating or accelerating many other common XML development tasks encountered when developing Web service enabled applications. Also has a powerful schema-aware WSDL editor, which can greatly simplify your work with Web Services and the Web Service Description Language (WSDL) – an XML format for describing network services as a set of endpoints operating on messages containing either document-oriented or procedure-oriented information. Stylus Studio's WSDL editor supports working with WSDL files, making editing WSDL files and validating them a breeze.

Microsoft Visual Studio

Microsoft Visual Studio contains a bunch of dedicated tools for creating and supporting web services, such as: Web Services Description Language Tool which generates code for XML Web services and XML Web services clients from Web Services Description Language (WSDL) contract files, XML Schema Definition (XSD) schema files, and .discomap discovery documents; Web Services Discovery Tool – discovers the URLs of XML Web services located on a Web server, and saves documents related to each XML Web service on a local disk. Soapsuds Tool helps you compile client applications that communicate with XML Web services using a technique called remoting.

Apache Axis

Apache Axis is an open source, XML based Web service framework. It consists of a Java and a C++ implementation of the SOAP server, and various utilities (WSIF, SOAP UDDI, Ivory, Caucho Hessian, Caucho Burlap, Metro, Xfire, Gomba, Crispyand etc.) and APIs for generating and deploying Web service applications. Using Apache Axis, developers can create interoperable, distributed computing applications. Axis is developed under the auspices of the Apache Software Foundation.

3.Interpretation and Data Analysis layer

The interpretation and data analysis layer is specifically for statisticians and would enable any data manipulation or unstructured activities. In this layer expert users can carry out data mining or design new statistical strategies.

Statistical Data Mining Tools

The overall goal of the data mining tools is to extract information from a data set and transform it into an understandable structure for further use. Aside from the main goal of the data mining tools they should also be capable to visualisatedata/information, which was extracted in data mining process. Because of this feature a lot of tools from this category have been already covered in Graphics and Publishing tools section, such as:

IBM SPSS Modeler (data mining software provided by IBM)
SAS Enterprise Miner (data mining software provided by the SAS Institute)
STATISTICA Data Miner (data mining software provided by StatSoft)
and etc.

This list of statistical data mining tools can be increased by adding some other very popular and powerful commercial data mining tools, such as:

Angoss Knowledge Studio(data mining tool provided by Angoss)
Clarabridge (enterprise class text analytics solution)
E-NI (e-mining, e-monitor) (data mining tool based on temporal pattern)
KXEN Modeler(data mining tool provided by KXEN)
LIONsolver (an integrated software application for data mining, business intelligence, and modeling that implements the Learning and Intelligent OptimizatioN (LION) approach)
Microsoft Analysis Services(data mining software provided by Microsoft)
Oracle Data Mining (data mining software by Oracle)

One of data mining tools widely used among statisticians and data miners is open source software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS

R (programming language and environment for statistical computing and graphics)

R is an implementation of the S programming language combined with lexical scoping semantics inspired by Scheme. R is a GNU project. The source code for the R software environment is written primarily in C, Fortran, and R. R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems. R uses a command line interface; however, several graphical user interfaces are available for use with R. R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others. R is easily extensible through functions and extensions, and the R community is noted for its active contributions in terms of packages. There are some important differences, but much code written for S runs unaltered. Many of R's standard functions are written in R itself, which makes it easy for users to follow the algorithmic choices made. For computationally intensive tasks, C, C++, and Fortran code can be linked and called at run time. Advanced users can write C or Java code to manipulate R objects directly. R is highly extensible through the use of user-submitted packages for specific functions or specific areas of study. Due to its S heritage, R has stronger object-oriented programming facilities than most statistical computing languages. Extending R is also eased by its permissive lexical scoping rules.Another strength of R is static graphics, which can produce publication-quality graphs, including mathematical symbols. Dynamic and interactive graphics are available through additional packages.R has its own LaTeX-like documentation format, which is used to supply comprehensive documentation, both on-line in a number of formats and in hard copy.R functionality has been made accessible from several scripting languages such as Python (by the RPyinterface package), Perl (by the Statistics: R module), and Ruby (with the rsruby rubygem). PL/R can be used alongside, or instead of, the PL/pgSQL scripting language in the PostgreSQL and Greenplum database management system. Scripting in R itself is possible via littler as well as via Rscript.Other major commercial software systems supporting connections to or integration with R include: SPSS, STATISTICA and SAS.