Copyright © 2015, Capgemini. All Rights Reserved. Confidential for Open Group use only.

This is an unapproved Standards Draft, Subject to Change.

Open Group Standard

Open Business Data Lake (O-BDL) Conceptual Framework

The following material is the intellectual property of Capgemini and is a fast-track submission.

This material should be treated as Open Group Confidential.

Copyright © 2016 Capgemini

All rights reserved.

Confidential for Open Group use only.

Open Group Standard

Open Business Data Lake (O-BDL) Conceptual Framework

ISBN: TBA

Document Number: TBA

Published by The Open Group, Month 2016.

Comments relating to the material contained in this document may be submitted to:

The Open Group, Apex Plaza, Forbury Road, Reading, Berkshire, RG1 1AX, United Kingdom

or by electronic mail to:

Contents

1.1 Objective

1.2 Overview

1.3 Conformance

1.4 Normative References

1.5 Terminology

1.6 Future Directions

2.1 Analytics

2.2 Batch, Micro Batch

2.3 Big Data

2.4 Business Ecosystem

2.5 Enterprise Data Warehouse (EDW)

2.6 Knowledge

2.7 Master Data

2.8 Master Data Management (MDM)

2.9 Metadata

2.10 System of Record (SoR)

2.11 System of Engagement (SoE)

2.12 System of Automation (SoA)

2.13 System of Insight (SoI)

2.14 Platform

2.15 Real-Time Response

2.16 Near Real-Time Response

2.17 Interactive Response

2.18 Structured Data

2.19 Semi-Structured Data

2.20 Unstructured Data

3.1 O-BDL Concept of Operation

3.2 Relevant Business Scenarios for an O-BDL (Informative)

3.2.1 Enterprise Data Warehouse (EDW) Off-Load

3.2.2 Discovery Platform

3.2.3 Big Data Apps

3.2.4 Data-Driven Enterprise

3.2.5 Ecosystem of Data-Driven Enterprises

4.1 Data-Related Concepts

4.1.1 Data

4.1.2 Metadata

4.1.3 Event

4.1.4 Stream

4.1.5 Insight

4.2 Ingestion-Related Concepts

4.2.1 Batch Ingestion

4.2.2 Real-Time Ingestion

4.2.3 Micro-Batch Ingestion

4.2.4 Metadata Generation

4.3 Processing-Related Concepts

4.3.1 Lambda Architecture

4.3.2 Batch Processing Workflow

4.3.3 Analytics

4.3.4 Analytics Engine

4.3.5 Real-Time Processing

4.3.6 Business Compartments

4.3.7 Actions – Service Layer

4.3.8 Existing IS Landscape

4.4 Unified Data Management

4.4.1 Reference Data Management (RDM)

4.4.2 Audit and Policy Management

4.4.3 Information Security

4.5 Unified Operations

4.5.1 System Monitoring

4.5.2 System Management

A.1 The Open Platform 3.0

A.2 The TOGAF Standard

A.3 The ArchiMate Standard

A.4 The IT4IT Standard

A.5 The O-DEF Standard


Preface

The Open Group

The Open Group is a global consortium that enables the achievement of business objectives through IT standards. With more than 500 member organizations, The Open Group has a diverse membership that spans all sectors of the IT community – customers, systems and solutions suppliers, tool vendors, integrators, and consultants, as well as academics and researchers – to:

  • Capture, understand, and address current and emerging requirements, and establish policies and share best practices
  • Facilitate interoperability, develop consensus, and evolve and integrate specifications and open source technologies
  • Offer a comprehensive set of services to enhance the operational efficiency of consortia
  • Operate the industry’s premier certification service

Further information on The Open Group is available at

The Open Group publishes a wide range of technical documentation, most of which is focused on development of Open Group Standards and Guides, but which also includes white papers, technical studies, certification and testing documentation, and business titles. Full details and a catalog are available at

Readers should note that updates – in the form of Corrigenda – may apply to any publication. This information is published at

This Document

This document is the Conceptual Framework for the Open Business Data Lake (O-BDL), an Open Group standard. It was initially submitted by Capgemini and then further developed and approved by The Open Group.

Trademarks

ArchiMate®, DirecNet®, Making Standards Work®, OpenPegasus®, The Open Group®, TOGAF®, UNIX®, and the Open Brand (“X” logo) are registered trademarks and Boundaryless Information Flow™, Build with Integrity Buy with Confidence™, Dependability Through Assuredness™, FACE™, IT4IT™, Open Platform 3.0™, Open Trusted Technology Provider™, and the Open “O” logo and The Open Group Certification logo are trademarks of The Open Group in the United States and other countries.

Capgemini is a trading name used by the Capgemini Group of companies which includes Capgemini Technology Services SAS, a company registered in France.

COBIT® is a registered trademark of ISACA, registered in the United States and other countries.

Hadoop® is a registered trademark and Apache™, Apache Spark™, and MapReduce™ are trademarks of the Apache Software Foundation in the United States and other countries.

Walmart® is a registered trademark of Walmart.

All other brands, company, and product names are used for identification purposes only and may be trademarks that are the sole property of their respective owners.

Acknowledgements

The Open Group gratefully acknowledges the contribution of the following people in the development of this standard:

  • All the global Pivotal and Capgemini teams who originally crafted the Business Data Lake concepts, especially Steve Jones, Paul Gittins, Lee Brown, Pramod Taneja, John D. McKinney, <names from Pivotal to be added>.
  • The Big Data teams at Capgemini Aerospace & Defence who contributed to and reviewed the initial submission of this document, especially Cédric Cormont, Sébastien Guilloux, Eric Lecorgne, Alexandre Diaz, Adrien Calvayrac, and Pascal Gillet.
  • The people who reviewed and helped to improve this document: Kary Framling, Dave Lounsbury, Andrew Josey, Ed Roberts, Tim Vincent, Mandy Chessell, Kelvin Laurence, Edwin Anand, Tejpal (TJ) Virdi, and Stuart Boardman.

The members of the Big Data group of the Open Platform 3.0™ Forum, a forum of The Open Group, who participated in the development of this document were as follows:

  • Carlos Ferraro Cavallini, Salesforce.com
  • Olivier Flebus, Capgemini
  • Chris Harding, The Open Group, Forum Director for Open Platform 3.0
  • Seshu Madabhushi, Tata Consultancy Services
  • Sudhir Singapuram, IBM
  • Ken Street, Conexiam
  • Robert Weisman, Build The Vision Inc.
  • Mandy Chessell, IBM

Referenced Documents

The following documents are referenced in this standard. These references are informative.

(Please note that the links below are good at the time of writing but cannot be guaranteed for the future.)

[1] An Information Architecture Vision: Moving from Data Rich to Information Smart, White Paper (W132), April 2013, published by The Open Group; refer to:

[2] ArchiMate® 2.1 Specification, an Open Group Standard (C13L), December 2013, published by The Open Group; refer to:

[3] COBIT® 5, Information Systems Audit and Control Association (ISACA); refer to:

[4] Data Management Association (DAMA): Dictionary of Data Management; refer to:

[5] Data Management Association (DAMA): Guide to the Data Management Body of Knowledge (DMBOK) V1; refer to:

[6] Thomas Davenport and Jeanne Harris: Competing on Analytics: The New Science of Winning, Harvard Business School Press, 2007.

[7] Dursun Delen: From Real-World Data Mining: Applied Business Analytics and Decision-Making, Pearson FT Press, December 2014.

[8] ISO/IEC 2382-1:1993: Information Technology – Vocabulary – Part 1: Fundamental Terms and Definitions.

[9] Open Data Element Framework (O-DEF), an Open Group Standard (due early 2016).

Note to Reviewers: The O-DEF will be updated as appropriate before publication.

[10] Viktor Mayer-Schoenberger and Kenneth Cukier: Big Data: A Revolution that will Transform how we Live, Work, and Think, Houghton Mifflin Harcourt, ISBN: 978-0-544-00269, 2013.

[11] Viktor Mayer-Schoenberger and Kenneth Cukier: The Rise of Big Data: How it is Changing the Way we Think About the World, published in Foreign Affairs, May/June 2013.

[12] Andrew McAfee and Erik Brynjolfsson: Big Data: The Management Revolution, Harvard Business Review, October 2012.

[13] NASCIO: Do you think? Or do you Know? Part II: The EA Value Chain, the Strategic Intent Domain, and Principles, September 2010.

[14] The Open Group IT4IT™ Reference Architecture, Version 2.0, an Open Group Standard (C155), October 2015, published by The Open Group; refer to:

[15] TOGAF® Version 9.1, an Open Group Standard (G116), December 2011, published by The Open Group; refer to:

[16] Andrew McAfee and Erik Brynjolfsson: Big Data: The Management Revolution, HBR, 2012; refer to:


1 Introduction

1.1 Objective

This document provides key concepts for the Open Business Data Lake (O-BDL), as a first step towards a Reference Architecture. By describing a set of key concepts and other re-usable artifacts and guidance, it is intended to help organizations leverage new, disruptive “Big Data” solutions and set up an associated “data-centric” strategy for increased performance and competitiveness.

1.2 Overview

An Open Business Data Lake is an enterprise capability that is enabled by Big Data technologies. Its objective is to consolidate and preserve enterprise data sets along the data lifecycle, such that data can be strategically consumed by applying suitable structural definitions and applying transformations as needed.

An Open Business Data Lake helps consolidate data today, such that it can be transformed tomorrow. It contrasts with a Data Warehouse by deferring schema definitions to the time of usage, rather than fixing them at the time of storage; this is enabled by the technology capability of schema-on-read. By separating the concerns of storage and usage, a data lake enables the business to define strategic digital initiatives with flexibility, knowing that data can be put to effective use when required.
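The deferral of schema definitions to the time of usage can be illustrated with a minimal Python sketch. It is not part of the O-BDL conceptual framework: the record layout, the field names, and the helper `read_with_schema` are hypothetical, chosen only to show raw records being stored untouched while different structures are applied when the data is read.

```python
import json

# Raw records are stored exactly as received; no schema is enforced at
# storage time. The field names below are illustrative assumptions.
raw_store = [
    '{"customer": "C1", "amount": "19.90", "channel": "web"}',
    '{"customer": "C2", "amount": "5.00"}',
    '{"customer": "C1", "amount": "7.10", "channel": "store", "note": "gift"}',
]


def read_with_schema(raw_records, schema):
    """Apply a schema (field name -> converter) when the data is read.

    Fields missing from a record become None; fields not named in the
    schema are ignored but remain available in the raw store.
    """
    for line in raw_records:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else None
            for field, convert in schema.items()
        }


# Two consumers apply different structures to the same stored data.
billing_view = list(read_with_schema(raw_store, {"customer": str, "amount": float}))
channel_view = list(read_with_schema(raw_store, {"customer": str, "channel": str}))
```

In this sketch a new consumer only needs to supply a new schema; nothing already stored has to be reloaded or restructured.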

The Open Business Data Lake is a particularly relevant solution for the Big Data Analytics services addressed by the ongoing development of standards in the Open Platform 3.0™ Forum, a forum of The Open Group.

The O-BDL conceptual framework is described at the enterprise level. This means it provides both technological and business-organizational content. The technology that has emerged from the transformation of Internet businesses can benefit almost every enterprise (and Ecosystem), but it also comes with a new, specific mindset that has to be addressed at the enterprise level.

The content of the O-BDL conceptual framework has been selected to be relevant for any industry. It intentionally does not integrate sector-specific constraints or principles. Thus, it does not address the specific digital strategy objectives that may be enabled on an O-BDL. While an O-BDL will help you to gain insights from diverse enterprise data sets, this document does not prescribe the new digital services that can potentially be built on top of it.

1.3 Conformance

For the purposes of this standard, no conformance requirements apply.

1.4 Normative References

None.

1.5 Terminology

For the purposes of the O-BDL standard, the following terminology definitions apply:

Can: Describes a possible feature or behavior available to the user or application.

May: Describes a feature or behavior that is optional. To avoid ambiguity, the opposite of “may” is expressed as “need not”, instead of “may not”.

Shall: Describes a feature or behavior that is a requirement. To avoid ambiguity, do not use “must” as an alternative to “shall”.

Shall not: Describes a feature or behavior that is an absolute prohibition.

Should: Describes a feature or behavior that is recommended but not required.

Will: Same meaning as “shall”; “shall” is the preferred term.

1.6 Future Directions

The technology that will implement the Open Business Data Lake is evolving very rapidly (counting in months, not years). For instance, the Apache Hadoop platform went through a major shift when it introduced the capability to integrate multiple processing engines, and not only Hadoop MapReduce.[1]

An architecture at this level should not, however, be constrained by a single technology implementation. Other platforms are emerging today that can support all or part of the O-BDL. The resulting architecture needs to be implementable, but not just by one technology.

Thus, the O-BDL conceptual framework may be updated to reflect this kind of major change or opportunity. At the time of its first release, it is planned that the conceptual framework described in this document will be further developed (notably by adding architecture principles) in the near future.

Nevertheless, as an enterprise-level platform concept, an O-BDL is designed to be stable enough for organizations to embrace change. Openness is key here. In a volatile world, highly robust architectures are required. An O-BDL is one of them.

2 Definitions

This chapter gathers definitions that are connected with and/or relevant to the O-BDL standard.

Definitions for the O-BDL core concepts are provided in Chapter 4.

For the purposes of this standard, the following terms and definitions apply. Merriam-Webster's Collegiate Dictionary should be referenced for terms not defined in this section.

2.1 Analytics

Analytics facilitates the realization of business objectives through reporting of data to analyze trends, creating predictive models for forecasting, and optimizing business processes for enhanced performance.[2]

Analytics can also be defined as: “the extensive use of data, statistical and quantitative analysis, explanatory and predictive models, and fact-based management to drive decisions and actions”.[3]

2.2 Batch, Micro Batch

Batch processing takes as input large datasets or large groups of events that are presented as a package, usually every hour, day, or month.

Micro Batch processing takes as input a group of events as they are presented in frequent, compact packages, usually every few seconds or few minutes.
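The difference between the two modes is essentially the size of the time window used to package events. The following minimal Python sketch illustrates this under stated assumptions: events are represented as hypothetical `(timestamp_seconds, payload)` pairs, and `group_into_windows` is an illustrative helper, not a function of any particular platform.

```python
from collections import defaultdict


def group_into_windows(events, window_seconds):
    """Group (timestamp_seconds, payload) events into fixed-size windows.

    The same grouping logic serves both modes; only the window length
    differs (hours or days for batch, seconds or minutes for micro batch).
    """
    windows = defaultdict(list)
    for timestamp, payload in events:
        window_start = (timestamp // window_seconds) * window_seconds
        windows[window_start].append(payload)
    return dict(sorted(windows.items()))


# Hypothetical events: (seconds since the start of the day, payload).
events = [(2, "a"), (7, "b"), (65, "c"), (3700, "d")]

micro_batches = group_into_windows(events, window_seconds=60)     # one package per minute
hourly_batches = group_into_windows(events, window_seconds=3600)  # one package per hour
```

With a one-minute window the four events fall into three small packages; with a one-hour window they fall into two larger ones.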

2.3 Big Data

The term “Big Data”[4] refers to the large amounts of data available to enterprises for use, in particular for Analytics. The data is characterized by:[5]

  • Volume – referring to the sheer amount of data available (e.g., Walmart collected 2.5 petabytes from its customers every hour in 2012[6]).
  • Velocity – the speed of data creation is accelerating (e.g., in 2012, 2.5 exabytes of information were created every day, with the amount doubling every 40 months[7]).
  • Variety – the sources and types of data are varied (Structured, Semi-structured, and Unstructured), be it images, digital phone messages, or the like. The quality of data is also varied, and the vocabulary and reference data are inconsistent.
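To make the Velocity figure concrete, the short Python sketch below computes what a 40-month doubling period would imply if growth continued at a steady exponential rate. That steady rate is an assumption made here for illustration only; the source states the 2012 figure and the doubling period, not any projection. The function name is hypothetical.

```python
def projected_daily_volume_eb(months_elapsed, base_eb=2.5, doubling_months=40):
    """Daily data volume in exabytes after `months_elapsed` months.

    Assumes steady exponential growth from the 2012 figure (an
    illustrative assumption, not a forecast from the source).
    """
    return base_eb * 2 ** (months_elapsed / doubling_months)


after_one_doubling = projected_daily_volume_eb(40)   # 40 months after 2012
after_two_doublings = projected_daily_volume_eb(80)  # 80 months after 2012
```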

From a more technical perspective,[8] the Big Data phenomenon is a function of the integrated availability of information from traditional information systems, Supervisory Control and Data Acquisition (SCADA) systems (e.g., electrical grid), and social media. This capability has been made possible by:

  • The advances of technology in the past five to ten years, where improvements in computing power and storage have made the processing and integration of the data feasible[9]
  • A shift in mindset about how data could be used[10]
  • The conversion of analog systems (e.g., telephones) into digital ones (e.g., Voice over Internet Protocol (VoIP)), creating new information assets that have to be managed

Figure 1: Big Data - An Architecture Perspective

From a business perspective, Big Data refers to getting used to handling large amounts of data that is “messy” (i.e., varying degrees of quality) and “giving up our quest to discover the cause of things, in return for accepting correlations”[11] (i.e., focus on correlations rather than causation).

This integration opens up critical infrastructure protection challenges derived from the ability to access SCADA systems through new channels such as social media.[12]

Certain industry and government organizations, notably in power generation and defense, have long coped with many aspects of Big Data, but the implication is that organizations are moving on from an emphasis on transaction processing towards more decision support and Analytics.

2.4 Business Ecosystem

A Business Ecosystem is a set of enterprises that collaborate in an open, agile way, pursuing common and/or complementary business goals.

At a certain level, one of the objectives of the Open Platform 3.0 standard, the O-BDL standard, and, more generally, all open platforms is to foster the creation and development of such Ecosystems.

2.5 Enterprise Data Warehouse (EDW)

An Enterprise Data Warehouse (EDW) is a storage architecture designed to hold data extracted from transaction systems, operational data stores, and external sources. The warehouse then combines that data in a consistent data representation that can be further aggregated and summarized in different formats suitable for enterprise-wide data analysis and reporting for predefined business needs.

The five components of an EDW are:

  • Production data sources
  • Data extraction and conversion
  • The EDW database management system
  • Data warehouse administration
  • Business Intelligence (BI) tools

An EDW contains data arranged into abstracted subject areas, with time-variant versions of the same records and a level of data grain or detail that makes it useful across two or more different types of analysis; it is most often deployed in a form tending towards third normal form. A data mart contains similarly time-variant and subject-oriented data, but its relationships imply dimensional use of the data, with facts kept distinctly separate from dimension data, which makes a data mart more appropriate for a single category of analysis.[13]
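The dimensional arrangement of a data mart, with facts kept separate from dimension data, can be sketched in a few lines of Python. This is an illustrative example only: the table contents, the key names, and the helper `revenue_by` are hypothetical and are not drawn from this standard.

```python
from collections import defaultdict

# Dimension data: descriptive attributes, keyed by a surrogate key.
# All names and values are hypothetical.
product_dim = {
    1: {"name": "Laptop", "category": "Hardware"},
    2: {"name": "Mouse", "category": "Hardware"},
    3: {"name": "Antivirus", "category": "Software"},
}

# Fact data: numeric measures plus keys that point at dimension rows.
sales_facts = [
    {"product_key": 1, "month": "2015-01", "revenue": 1200.0},
    {"product_key": 2, "month": "2015-01", "revenue": 25.0},
    {"product_key": 3, "month": "2015-01", "revenue": 40.0},
    {"product_key": 2, "month": "2015-02", "revenue": 50.0},
]


def revenue_by(attribute, facts, dimension):
    """Sum the revenue measure grouped by one dimension attribute."""
    totals = defaultdict(float)
    for fact in facts:
        totals[dimension[fact["product_key"]][attribute]] += fact["revenue"]
    return dict(totals)


revenue_by_category = revenue_by("category", sales_facts, product_dim)
```

Because the facts carry only keys and measures, the same fact rows can be regrouped by any attribute of the dimension, which suits the single category of analysis that a data mart serves.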