1 The SDIL Smart Data Testbed

Authors: Prof. Dr. Michael Beigl, Prof. Dr. Bernhardt Neumair, Till Riedel, Nico Schlitter, KIT/Smart Data Innovation Lab

The Smart Data Innovation Lab (SDIL) offers big data researchers unique access to a large variety of big data and in-memory technologies. Industry and science collaborate closely in order to find hidden value in big data and generate smart data. Projects focus on the strategic research areas of Industrie 4.0, Energy, Smart Cities, and Medicine.

SDIL bridges the gap between cutting-edge research and industrial big data applications. The main goal of the SDIL is to accelerate innovation cycles using smart data approaches. In order to close today's gap between academic research and industry problems through a data-driven innovation cycle, the SDIL provides extensive support to all collaborative research projects free of charge.

Figure 1: The SDIL Innovation Cycle

1.1 Platform

The hardware and software provided by the SDIL platform enable researchers to perform their analytics on unique, state-of-the-art systems without, for example, acquiring separate licenses or dealing with complicated cost structures. Industrial data providers get the chance to analyze their data together with an academic partner in a fully secured on-premises environment.

Figure 2: The SDIL Platform

1.1.1 SAP HANA

SAP HANA is a platform that allows customers to explore and analyze large volumes of data in real-time, create flexible analytic models, and develop and deploy real-time applications. The SAP HANA in-memory appliance is available on the SDIL Platform.

In addition, we installed the Application Function Library (AFL) on the HANA instances. The AFL is a collection of pre-delivered, commonly used business, predictive, and other algorithms for use in projects or solutions that run on SAP HANA. These algorithms can be leveraged directly in development projects, speeding them up by avoiding the need to write complex custom algorithms. AFL operations also offer very fast performance, as AFL functions run in the core of the SAP HANA in-memory database. The AFL package includes:

  • The Predictive Analysis Library (PAL) is a set of functions in the AFL. It contains pre-built, parameter-driven, commonly used algorithms primarily related to predictive analysis and data mining, e.g., K-Means, association analysis, C4.5 decision trees, multiple linear regression, and exponential smoothing (see the sketch after this list). Please refer to the official SAP HANA PAL user guide for further information (SAP HANA PAL Library Documentation).
  • The Business Function Library (BFL) is a set of functions in the AFL. It contains pre-built, parameter-driven, commonly used algorithms primarily related to the analysis of financial data. Please refer to the official SAP HANA BFL user guide for further information (SAP HANA BFL Library Documentation).
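
As an illustration of how PAL functions can be called from project code, the following minimal sketch uses SAP's hana-ml Python client (assumed to be installed; hana-ml 2.x style API) to run PAL K-Means on a HANA table. Host, port, credentials, schema, and table names are placeholders, not SDIL defaults.

    # Minimal sketch: PAL K-Means via SAP's hana-ml Python client (2.x API).
    # Host, port, credentials, schema, and table names below are hypothetical.
    from hana_ml.dataframe import ConnectionContext
    from hana_ml.algorithms.pal.clustering import KMeans

    conn = ConnectionContext(address="hana.sdil.example", port=30015,
                             user="PROJECT_USER", password="...")

    # Wrap an existing table as a HANA DataFrame; all computation stays
    # inside the in-memory database.
    df = conn.table("SENSOR_FEATURES", schema="PROJECT_SCHEMA")

    # Run PAL K-Means with 4 clusters and collect the cluster assignments.
    km = KMeans(n_clusters=4, max_iter=100)
    assignments = km.fit_predict(df, key="ID")
    print(assignments.collect().head())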

Figure 3: Hardware and software configuration for the SAP HANA System

1.1.2 Terracotta BigMemory Max

Terracotta BigMemory Max is an in-memory data management platform for real-time big data applications developed by Software AG. It supports a distributed in-memory data-storage topology, which enables the sharing of data among multiple caches and in-memory data stores across multiple JVMs. BigMemory Max uses a Terracotta Server Array to manage data that is shared by multiple application nodes in a cluster. Furthermore, the use of off-heap memory enables Java applications to leverage virtually all available RAM for in-memory data storage without causing garbage collection pauses.

The BigMemory Max kit is installed and available on the SDIL Platform, where a single active Terracotta Server is configured and running. The server manages Terracotta clients, coordinates shared objects, and persists data. Terracotta clients run on application servers along with the applications being clustered by Terracotta. The data is held on the remote server, with a subset of recently used data held in each application node.

1.1.3 IBM Open Platform with Hadoop and Spark

The SDIL Platform is running a Hadoop cluster with Spark that can be used to perform analytics following the map-reduce paradigm.
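
As a minimal illustration of the map-reduce paradigm on this cluster, the following PySpark sketch implements the classic word count; the HDFS input path is a placeholder.

    # Map-reduce word count in PySpark; the HDFS path is hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCount").getOrCreate()

    lines = spark.sparkContext.textFile("hdfs:///user/project/input.txt")
    counts = (lines.flatMap(lambda line: line.split())  # map: split into words
                   .map(lambda word: (word, 1))         # map: (word, 1) pairs
                   .reduceByKey(lambda a, b: a + b))    # reduce: sum per word

    for word, count in counts.take(10):
        print(word, count)

    spark.stop()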

IBM SPSS Modeler

In addition, we provide specialized tools that build upon Hadoop for further analytics. IBM SPSS Modeler is a data mining and text analytics software application. It provides a range of advanced algorithms and techniques, including text and entity analytics, decision management, and optimization, in order to build predictive models and conduct a wide range of data analysis tasks. (ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/16.0/en/modelerusersguide_book.pdf)

IBM SPSS Analytic Server

In order to start an analysis stream on the IBM SPSS Modeler Server, one first needs to import the data. The IBM SPSS Modeler Server provides a number of ways to transfer data into analytic streams: via files (CSV, JSON, XML, and other common formats), using a DB2 database server, or using the SPSS Analytic Server.
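
Such import steps can also be automated with SPSS Modeler's embedded (Jython-based) scripting interface. The following hedged sketch creates a delimited-file source node and previews the imported records in a table node; the node types and property names follow the Modeler scripting guide, while the file path and node positions are hypothetical.

    # Sketch of SPSS Modeler (Jython) scripting: import a CSV file and
    # preview it. The file path is hypothetical.
    import modeler.api

    stream = modeler.script.stream()

    # Delimited-file source node pointing at an uploaded CSV file.
    source = stream.createAt("variablefile", "CSV Source", 96, 96)
    source.setPropertyValue("full_filename", "/data/project/measurements.csv")

    # Table output node to preview the imported records.
    table = stream.createAt("table", "Preview", 288, 96)
    stream.link(source, table)

    # Execute all terminal nodes in the stream.
    results = []
    stream.runAll(results)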

Figure 4: Hardware and software configuration for the IBM Watson System

1.1.4 Virtualization and Resource Allocation

HTCondor

In order to use the SDIL resources efficiently and to avoid interference between users, we make use of the HTCondor batch system. This system takes care of resource management and guarantees that users get exclusive access to the requested resources. A program runs and returns once it is finished; while it is running, it consumes memory (RAM) and CPU. If many users run many programs, the total available memory might not be sufficient and a program, or even the whole compute server, might crash. A batch system avoids such overload and crashes: users specify which computing task they would like to perform and what resources this task requires. This so-called job is submitted to the batch system, which executes it as soon as the requested resources become available. Users can get an overview of their submitted and running jobs via an API, and can additionally be informed by email when a job is finished.
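
The following minimal sketch shows how such a job could be submitted from Python via the htcondor bindings (version 2 API, HTCondor 9+); the executable, resource requests, and file names are hypothetical and depend on the concrete SDIL configuration.

    # Sketch: submit a batch job through the HTCondor Python bindings.
    # Executable, resource requests, and file names are hypothetical.
    import htcondor

    job = htcondor.Submit({
        "executable": "/home/user/run_analysis.sh",  # the computing task
        "request_cpus": "4",                         # resources required
        "request_memory": "16GB",
        "output": "analysis.out",
        "error": "analysis.err",
        "log": "analysis.log",
        "notification": "Complete",                  # email when finished
    })

    schedd = htcondor.Schedd()     # connect to the local scheduler
    result = schedd.submit(job)    # queue the job
    print("Submitted as cluster", result.cluster())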

1.2 Communities

SDIL provides access to experts and domain-specific skills within its Data Innovation Communities, fostering the exchange of project results. The communities further provide opportunities for open innovation and bilateral matchmaking between industrial partners and academic institutions.

1.2.1 Data Innovation Community “Industrie 4.0”

Industrie 4.0 is a powerful driver of large data growth and is directly connected with the “Internet of Things”. Through the Web, real and virtual worlds grow together to form the Internet of Things. In production, machines as well as production lines and warehousing systems are increasingly capable of exchanging information on their own, triggering actions and controlling each other. The aim is to significantly improve processes in the areas of development and construction, manufacturing and service. This fourth industrial revolution represents the linking of industrial manufacturing and information technology, creating a new level of efficiency and effectiveness. Industrie 4.0 creates new information spaces linking ERP systems, databases, the Internet and real-time information from production facilities, supply chains and products.

The Data Innovation Community “Industrie 4.0” wants to explore important data-driven aspects of the fourth industrial revolution, such as proactive service and maintenance of production resources or finding anomalies in production processes.

The Data Innovation Community “Industrie 4.0” addresses all companies and research institutions interested in conducting joint research with regard to these aspects. This includes user companies as well as companies from the automation and IT industries.

1.2.2 Data Innovation Community “Energy”

The energy industry is facing fundamental changes. The move towards renewable energies; the EU stipulation to install smart meters; the development of new, customer-centred business models: all these changes combine to form entirely new challenges for IT infrastructure of the energy industry. By analysing comprehensive data, both structured and unstructured, e.g. data generated by mobile device apps, web portals or social media, utility companies will be able to optimise their business processes and develop new business models. A case in point: Big Data analyses enable better consumption forecasts so that energy providers will be able to better manage and control their energy purchases on the energy markets. Thanks to Big Data, consumption rate models can be better tailored towards specific user groups, and unhappy customers can be identified more quickly – allowing for measures aimed at ensuring higher customer retention.

The Data Innovation Community “Energy” wants to explore important data-driven aspects in the area of energy, such as the demand-driven fine-tuning of consumption rate models based on smart meter generated data.

The Data Innovation Community “Energy” addresses all companies and research institutions interested in conducting joint research with regard to these aspects. This includes energy industry user companies as well as companies from the automation and IT industries.

1.2.3 Data Innovation Community “Smart Cities”

Urban development and traffic management are also areas where Big Data analyses open up entirely new possibilities. By means of integrated transport communication solutions and intelligent traffic management systems, traffic in fast-growing, densely populated urban areas can be managed better. In cities, immense masses of data are generated by subway trains, buses, taxis and traffic cameras, just to name a few. The existing IT environment hardly allows for making forecasts or even extended data analyses in order to play through different traffic and transport scenarios. But that is the only way to improve the respective services and further urban planning. Once information can be analysed in real-time, correctly interpreted and put into context with historical data, traffic jams and dangerous situations can be identified at an early stage, leading to a significant decrease in traffic volume, emissions and driving time.

The Data Innovation Community “Smart Cities” wants to explore important data-driven aspects of urban life, such as traffic control, but also waste disposal or disaster control.

The Data Innovation Community “Smart Cities” addresses all companies and research institutions interested in conducting joint research with regard to these aspects, but also public bodies. This includes user companies as well as companies from the automation and IT industries.

1.2.4 Data Innovation Community “Personalised Medicine”

Modern medicine, too, generates ever larger quantities of data. Reasons for this include higher-resolution data from state-of-the-art diagnostic methods like magnetic resonance imaging (MRI), IT-controlled medical technology, comprehensive medical documentation and the ever more detailed knowledge about the human genome. A case in point: personalised cancer therapy, where software is increasingly used to take terabytes of clinical, molecular and medication data in diverse formats and distil from them effective treatment options for each individual patient in real-time, in order to significantly improve treatment results.

Within the Data Innovation Community “Personalised Medicine”, important data-driven aspects of personalised medicine are to be explored, such as the need-driven care of patients, IT-controlled medical technology or even web-based patient care.

The Data Innovation Community “Personalised Medicine” addresses all companies and research institutions interested in conducting joint research with regard to these aspects. This includes industry user companies and clinics but also companies from the automation and IT industries.

1.3 Legal, security and curation as cross-cutting activities

Template agreements and processes ensure fast project initiation with maximum legal security, tailored to the common technological platform. A standardized process allows anyone to set up a new collaborative project at SDIL within two weeks.

Once partners have successfully registered for the SDIL service, they can upload and work with their data on the SDIL Platform. Data providers can upload their data using the SFTP or SCP protocols. All users get a dedicated private home directory for their files. For projects involving multiple users, a project directory is available which is only accessible to the project members.
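
A minimal sketch of such an upload with the paramiko Python library is shown below; host name, user name, and paths are placeholders for the credentials issued during registration.

    # Sketch: upload a dataset to the project directory over SFTP.
    # Host, user, and paths are placeholders; assumes key-based login.
    import paramiko

    client = paramiko.SSHClient()
    client.load_system_host_keys()
    client.connect("login.sdil.example", username="project_user")

    sftp = client.open_sftp()
    sftp.put("local_dataset.csv", "/project/mydata/local_dataset.csv")
    sftp.close()
    client.close()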

The SDIL platform is protected by several layers of firewalls. Access to the platform is only possible via dedicated login machines and only for users who were approved beforehand in our identity management system. The hardware itself is operated in a segregated server room with a dedicated access control system. Any data processing takes place in compliance with German data protection rules and regulations. Data sources are only accessible if such access was expressly granted by the data provider in advance. To protect against data loss, we perform frequent encrypted backups to our tape library. All data is deleted from the platform after the project has finished.

The SDIL guarantees a sustainable investment to all partners by curating industrial data sources, best practices, and code artifacts that are contributed on a fair-share basis. Furthermore, it actively includes open data and open source developments to augment the unique industrial-grade solutions provided within the platform.
