NIST Big Data

Reference Architecture

DRAFT

Version 1.2

Reference Architecture Subgroup

NIST Big Data Working Group (NBD-WG)

December 2013

Version / Date / Changes abstract / References / Editor

0.1 / 9/6/13 / Outline and diagram / M0202v1 / Orit

0.2 / 9/11/13 / Text from input docs / M0142v3, M0189v1, M0015v1, M0126v4, M0231v1 / Orit

0.3 / 9/18/13 / Additions to the diagram based on a weekly conference call; clarification to the RA model based on email conversation; intro text added / M0230v2, M0239v1 / Orit

0.4 / 9/22/13 / Security and Management shown as fabrics around all blocks; Transformation sub-blocks shown with the same color; editorial changes to Section 3 (as an input to the roadmap) / Weekly conf call, M0230v3 / Orit

0.5 / 9/24/13 / Multiple RA views for discussion (open the Figure as an object); Appendices A and C; additions to the Executive Summary; new: Management Lifecycle / M0243v2, M0247v1, M0249v1; input from Gary Mazzaferro / Orit

0.6 / 9/25/13 / Editorial changes; interfaces description; diagram update / M0252v1 / Orit

0.7 / 9/26/13 / Diagram update; new: Deployment appendix / M0251v1 / Orit

1.0 / 9/29/13 / Security Fabric description; high-level requirements; BD Application Provider revision / Input from Arnab Roy; input from the Requirements subgroup; M0258v1 / Orit

1.1 / 10/15/13 / Editorial changes / Including the “System Management” baseline text / Felix Njeh

1.1 / 11/22/13 / Clarifications based on the mailing list discussions / Replies to Bob Marcus / Orit

1.2 / 12/8/13 / Editorial changes / Pw Carey, Nancy & Orit

Table of Contents

Executive Summary 4

1 Introduction 4

1.1 Background 4

1.2 Objectives 5

1.3 How This Report Was Produced 6

1.4 Structure of This Report 6

1.5 Big Data General Requirements 7

2 Conceptual Model 8

3 Main Components 10

3.1 Data Provider 10

3.2 Big Data Application Provider 11

3.3 Big Data Framework Provider 14

3.4 Data Consumer 15

3.5 System Orchestrator 15

4 Management 16

4.1 System Management 16

4.2 Lifecycle Management 16

5 Security and Privacy 18

6 Big Data Taxonomy 18

Appendix A: Terms and Definitions 18

Appendix B: Acronyms 19

Appendix C: References 20

Appendix D: Deployment Considerations 21

1 Big Data Framework Provider 21

1.1 Traditional On-Premise Frameworks 21

1.2 Cloud Service Providers 22

1.2.1 Cloud Service Component 22

1.2.2 Resource Abstraction & Control 22

1.2.3 Physical Resources 23

Executive Summary

1  Introduction

1.1  Background

Big Data is the common term used to describe the deluge of data in our networked, digitized, sensor-laden, information-driven world. The availability of vast data resources carries the potential to answer questions previously out of reach: How do we reliably detect a potential pandemic early enough to intervene? Can we predict new materials with advanced properties before these materials have ever been synthesized? How can we reverse the current advantage of the attacker over the defender in guarding against cybersecurity threats?

Within this context, on 29 March 2012, the White House announced the Big Data Research and Development Initiative [1]. The initiative’s goals were to help accelerate the pace of discovery in science and engineering, strengthen national security, and transform teaching and learning by improving our ability to extract knowledge and insights from large and complex collections of digital data.

Six Federal departments and their agencies announced more than $200 million in commitments – spread across 80+ projects – that aimed to significantly improve the tools and techniques needed to access, organize, and glean discoveries from huge volumes of digital data. The initiative also challenged industry, research universities, and non-profits to join with the Federal government to make the most of the opportunities created by Big Data.

Despite the widespread agreement on the opportunities inherent to Big Data, a lack of consensus on some important, fundamental questions continues to confuse potential users and hold back progress. What are the attributes that define Big Data solutions? How is Big Data different from the traditional data environments and related applications we have encountered thus far? What are the essential characteristics of Big Data environments? How do these environments integrate with currently deployed architectures? What are the central scientific, technological, and standardization challenges that need to be addressed to accelerate the deployment of robust Big Data solutions?

The NIST Big Data program was formally launched on 13 June 2012 to help answer some of the questions surrounding Big Data and to support the federal government’s effort to incorporate Big Data as a replacement for, or enhancement to, traditional data analysis systems and models where appropriate.

[Editor’s Note: Need some transition verbiage here. How did the first conference lead to the BD-PWG?]

On 19 June 2013, NIST hosted the Big Data Public Working Group (BD-PWG) kickoff meeting to begin addressing those questions. The Group was charged with developing a consensus definition, taxonomy, reference architecture, and technology roadmap for Big Data that can be embraced by all sectors.

These efforts will help define and prioritize requirements for interoperability, portability, usability and reusability, and extendibility for Big Data analytic techniques and technology infrastructure in order to facilitate the adoption of Big Data.

The aim is to create vendor-neutral, technology- and infrastructure-agnostic deliverables that enable Big Data stakeholders to pick and choose the best analytics tools for their processing and visualization requirements on the most suitable computing platforms and clusters, while allowing value-added functionality from Big Data service providers and the flow of data between stakeholders in a cohesive and secure manner.

Within the BD-PWG, the following working groups were chartered in order to provide a technically oriented strategy and standards-based guidance for the federal Big Data implementation effort:

·  Definitions and Taxonomies

·  General Requirements

·  Security and Privacy Requirements

·  Reference Architectures

·  Technology Roadmap

1.2  Objectives

In general terms, a reference architecture provides “an authoritative source of information about a specific subject area that guides and constrains the instantiations of multiple architectures and solutions”[2]. Reference architectures generally serve as a reference foundation for solution architectures and may also be used for comparison and alignment purposes.

The broad goal of the Reference Architecture working group is to develop an open Big Data reference architecture that:

·  Provides a common language for the various stakeholders

·  Encourages adherence to common standards, specifications, and patterns

·  Provides consistency of implementation of technology to solve similar problem sets

The reference architecture is intended to facilitate understanding of the operational intricacies of Big Data. It does not represent the system architecture of a specific Big Data system; instead, it is a tool for describing, discussing, and developing system-specific architectures using a common frame of reference.

It provides a generic high-level conceptual model that is an effective tool for discussing the requirements, structures, and operations inherent to Big Data. The model is not tied to any specific vendor products, services or reference implementation, nor does it define prescriptive solutions that inhibit innovation.

The design of the NIST Big Data reference architecture serves the following objectives:

·  To illustrate and understand the various Big Data components, processes, and systems, in the context of an overall Big Data conceptual model;

·  To provide a technical reference for U.S. Government departments, agencies and other consumers to understand, discuss, categorize and compare Big Data solutions; and

·  To facilitate the analysis of candidate standards for interoperability, portability, reusability, and extendibility.

The design of the Big Data reference architecture does not address the following:

·  Detailed specifications for any organization’s operational systems;

·  Detailed specifications of information exchanges or services; or

·  Recommendations or standards for integration of infrastructure products.

It is important to note that at this time, the Big Data reference architecture is not complete. Many sections of this document are still under development.

1.3  How This Report Was Produced

The approach for developing this document involved four steps:

1.  The first step was to announce a Big Data Reference Architecture Working Group open to the public in order to attract and solicit a wide array of subject matter experts and stakeholders in government, industry, and academia.

2.  The second step was to gather available Big Data architectures and materials representing various stakeholders, different data types, and different use cases.

3.  The third step was to examine and analyze the collected Big Data material to better understand existing concepts of Big Data, what it is used for, its goals, objectives, characteristics, and key elements, and then document these using the Big Data taxonomies model.

4.  The fourth step was to develop an open reference architecture based on the analysis of Big Data material and the inputs from the other NIST Big Data working groups.

1.4  Structure of This Report

The remainder of this document is organized as follows:

Section 1.5 lists the high-level requirements relevant to the design of the Reference Architecture.

Section 2 presents the conceptual model: a generic Big Data system composed of technology-agnostic functional blocks interconnected by interoperability surfaces.

Section 3 describes the main components of the generic system.

Section 4 describes the system and data management considerations.

Section 5 addresses security and privacy.

Section 6 contains the Big Data taxonomy.

Appendix A lists the terms and definitions appearing in the taxonomy.

Appendix B contains the acronyms used in this document.

Appendix C lists the references used in the document.

Appendix D discusses deployment considerations for the Big Data Framework Provider.

1.5  Big Data General Requirements

Requirements extraction involves a two-step process. The first step is to extract specific requirements based on each application’s characteristics, which include detailed information on:

(a) data sources (data size, file formats, rate of growth, at rest or in motion, etc.),

(b) data lifecycle management (curation, conversion, quality check, pre-analytic processing, etc.),

(c) data transformation (data fusion, analytics),

(d) big data framework and infrastructure (software tools, platform tools, hardware resources such as storage and networking), and

(e) data usage (processed results in text, table, visual, and other formats).

The second step is to aggregate each application’s specific requirements into high-level generalized requirements that are vendor-neutral and technology-agnostic. For the complete use case characteristics and requirements analysis, please refer to Big Data Use Cases and Requirements (M0245).
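To make the second step more concrete, the sketch below (Python; illustrative only, with hypothetical class and field names that are not part of any NBD-WG deliverable) shows how application-specific requirements, grouped by the five characteristic categories above, might be collected and then aggregated by category before being generalized into vendor-neutral statements.

    from collections import defaultdict
    from dataclasses import dataclass

    # The five categories mirror items (a)-(e) above.
    CATEGORIES = ("data_sources", "lifecycle_management", "transformation",
                  "framework_infrastructure", "data_usage")

    @dataclass
    class AppRequirement:
        use_case: str   # e.g., a use case name drawn from M0245
        category: str   # one of CATEGORIES
        statement: str  # application-specific (possibly vendor-specific) requirement

    def aggregate(requirements):
        """Step 2: group specific requirements by category so they can be
        generalized into vendor-neutral, technology-agnostic statements."""
        grouped = defaultdict(list)
        for req in requirements:
            if req.category not in CATEGORIES:
                raise ValueError(f"unknown category: {req.category}")
            grouped[req.category].append((req.use_case, req.statement))
        return dict(grouped)

    # Minimal usage example with made-up inputs.
    specific = [
        AppRequirement("Census-Archive", "data_sources", "ingest a large archive of scanned records"),
        AppRequirement("Netflow-Analysis", "data_sources", "collect streaming flow records"),
    ]
    for category, items in aggregate(specific).items():
        print(category, "->", len(items), "specific requirements to generalize")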

The following are the high-level Big Data general requirements.

Data Provider Requirements (DPR)

DPR-1: Needs to support reliable real-time, asynchronous, streaming, and batch processing to collect data from centralized, distributed, and cloud data sources, sensors, or instruments (illustrated in the sketch following this list).

DPR-2: Needs to support slow, bursty, and high-throughput data transmission between data sources and computing clusters.

DPR-3: Needs to support diversified data content, including structured and unstructured text, documents, graphs, web sites, geospatial, compressed, timed, spatial, multimedia, simulation, and instrument (i.e., system management and monitoring) data.
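As a minimal illustration of DPR-1 and DPR-2, the sketch below shows a single collection interface that accepts both finite (batch) and unbounded (streaming) sources and flushes records in bounded batches to absorb bursty arrival rates. The function and record names are hypothetical; a production collector would write to a cluster, queue, or object store rather than an in-memory list.

    import time
    from typing import Iterable, Iterator

    def collect(source: Iterable[dict], sink: list, max_batch: int = 100) -> None:
        """Drain records from a batch or streaming source into a sink,
        flushing in bounded batches to absorb bursty arrival rates (DPR-2)."""
        buffer = []
        for record in source:
            buffer.append(record)
            if len(buffer) >= max_batch:
                sink.extend(buffer)   # in practice: write to a cluster or queue
                buffer.clear()
        sink.extend(buffer)           # flush the tail of a finite (batch) source

    def simulated_stream(n: int) -> Iterator[dict]:
        """Stand-in for a sensor or instrument feed."""
        for i in range(n):
            yield {"seq": i, "ts": time.time()}

    store: list = []
    collect(simulated_stream(250), store, max_batch=64)
    print(len(store), "records collected")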

Big Data Application Provider Requirements (APR)

APR-1: Needs to support diversified compute-intensive analytic processing and machine learning techniques

APR-2: Needs to support batch and real time analytic processing

APR-3: Needs to support processing large diversified data content and modeling

APR-4: Needs to support processing data in motion (streaming, fetching new content, data tracking, traceability, data change management, data boundaries, etc.)
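A minimal sketch of what APR-2 and APR-4 imply in practice: the same analytic (here, a simple mean) computed once over data at rest and incrementally over data in motion. The class and function names are illustrative assumptions, not prescribed interfaces.

    def batch_mean(values: list) -> float:
        """Batch analytic over a complete, stored dataset (APR-2, batch)."""
        return sum(values) / len(values)

    class StreamingMean:
        """Incremental analytic: state is updated per arriving record
        (APR-2 real time, APR-4 data in motion)."""
        def __init__(self) -> None:
            self.count = 0
            self.total = 0.0
        def update(self, x: float) -> float:
            self.count += 1
            self.total += x
            return self.total / self.count

    data = [3.0, 5.0, 7.0, 9.0]
    s = StreamingMean()
    streamed = [s.update(x) for x in data]
    assert abs(streamed[-1] - batch_mean(data)) < 1e-9
    print("batch:", batch_mean(data), "final streaming value:", streamed[-1])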

Big Data Framework Provider Requirements (FPR)

FPR-1: Needs to support legacy software and advanced software packages (software)

FPR-2: Needs to support legacy and advanced computing platforms (platform)

FPR-3: Needs to support legacy and advanced distributed computing clusters, co-processors, I/O processing (infrastructure)

FPR-4: Needs to support advanced networks (e.g., Software Defined Networks) and elastic data transmission, including fiber, cable, and wireless networks; LAN, WAN, MAN, and Wi-Fi (networking)

FPR-5: Needs to support legacy, large, virtual and advanced distributed data storage (storage)

FPR-6: Needs to support legacy and advanced programming executable, applications, tools, utilities, and libraries (software)
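The framework requirements above can be read as a checklist of capabilities that a provider exposes across its software, platform, infrastructure, networking, and storage layers. The sketch below is one hypothetical, technology-agnostic way to declare that mapping; the structure and example values are illustrative, not a normative schema.

    # Hypothetical declaration of a framework stack, keyed by layer,
    # with the FPR items each layer is meant to satisfy.
    framework_capabilities = {
        "software":       {"fpr": ["FPR-1", "FPR-6"], "examples": ["legacy packages", "analytics libraries"]},
        "platform":       {"fpr": ["FPR-2"], "examples": ["legacy platforms", "advanced computing platforms"]},
        "infrastructure": {"fpr": ["FPR-3"], "examples": ["distributed clusters", "co-processors", "I/O processing"]},
        "networking":     {"fpr": ["FPR-4"], "examples": ["software-defined networks", "LAN/WAN/MAN", "Wi-Fi"]},
        "storage":        {"fpr": ["FPR-5"], "examples": ["virtual storage", "distributed data stores"]},
    }

    def covered_requirements(capabilities: dict) -> list:
        """List the FPR identifiers covered by a declared framework stack."""
        return sorted({r for layer in capabilities.values() for r in layer["fpr"]})

    print(covered_requirements(framework_capabilities))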

Data Consumer Requirements (DCR)

DCR-1: Needs to support fast searches (~0.1 seconds) over processed data with high relevance, accuracy, and recall (see the sketch following this list)

DCR-2: Needs to support diversified output file formats for visualization, rendering, and reporting

DCR-3: Needs to support visual layout for results presentation

DCR-4: Needs to support a rich user interface for access using browsers and visualization tools

DCR-5: Needs to support high-resolution, multi-dimensional, layered data visualization

DCR-6: Needs to support streaming results to clients
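DCR-1 relies on standard information-retrieval measures. As a reminder of how relevance and recall are evaluated, the following sketch computes precision and recall for a returned result set; the document identifiers are made up for the example.

    def precision_recall(returned: set, relevant: set) -> tuple:
        """Precision: fraction of returned results that are relevant.
        Recall: fraction of all relevant items that were returned."""
        hits = len(returned & relevant)
        precision = hits / len(returned) if returned else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    returned = {"doc-1", "doc-2", "doc-3", "doc-9"}
    relevant = {"doc-1", "doc-3", "doc-4"}
    p, r = precision_recall(returned, relevant)
    print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67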

Security & Privacy Requirements (SPR)

SPR-1: Needs to protect and preserve the security and privacy of sensitive data

SPR-2: Needs to support multi-tenant, multi-level, policy-driven sandboxing, access control, and authentication on protected data, in line with accepted GRC (Governance, Risk & Compliance) and CIA (Confidentiality, Integrity & Availability) best practices
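As a toy illustration of the policy-driven, multi-tenant, multi-level access control that SPR-2 calls for, the sketch below grants access only when a matching policy covers the tenant, role, action, and data sensitivity level. Real deployments would rely on an established RBAC/ABAC engine; every name and level here is hypothetical.

    POLICIES = [
        # Each policy: tenant, role, sensitivity ceiling, allowed actions.
        {"tenant": "agency-a", "role": "analyst", "max_level": 2, "actions": {"read"}},
        {"tenant": "agency-a", "role": "curator", "max_level": 3, "actions": {"read", "write"}},
    ]

    def is_allowed(tenant: str, role: str, action: str, data_level: int) -> bool:
        """Grant access only if some policy matches the tenant, role, and action,
        and the data's sensitivity level does not exceed the policy ceiling."""
        return any(
            p["tenant"] == tenant and p["role"] == role
            and action in p["actions"] and data_level <= p["max_level"]
            for p in POLICIES
        )

    print(is_allowed("agency-a", "analyst", "read", data_level=2))   # True
    print(is_allowed("agency-a", "analyst", "write", data_level=1))  # False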

Lifecycle Management Requirements (LMR)

LMR-1: Needs to support data quality curation, including pre-processing, data clustering, classification, reduction, and format transformation

LMR-2: Needs to support dynamic updates on data, user profiles, and links

LMR-3: Needs to support data lifecycle and long-term preservation policy including data provenance
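The provenance aspect of LMR-3 can be illustrated with a minimal append-only log in which every curation or transformation event records what was done, by which agent, and when. The field names below are assumptions made for illustration, not a normative provenance schema.

    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class ProvenanceLog:
        dataset_id: str
        events: list = field(default_factory=list)

        def record(self, activity: str, agent: str) -> None:
            """Append one provenance event for this dataset."""
            self.events.append({
                "activity": activity,  # e.g., "format transformation" (cf. LMR-1)
                "agent": agent,        # tool or person responsible
                "at": datetime.now(timezone.utc).isoformat(),
            })

    log = ProvenanceLog("use-case-17/raw-sensor-feed")
    log.record("quality curation: outlier removal", "curation-pipeline-v2")
    log.record("format transformation: CSV to Parquet", "etl-job-42")
    print(len(log.events), "provenance events for", log.dataset_id)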