NIST Big Data

Security and Privacy Requirements

Version 0.4

September 24, 2013

Security & Privacy Subgroup

NIST Big Data Working Group (NBD-WG)

September, 2013

Executive Summary

1 Introduction

1.1 Background

1.2 Objectives

1.3 How This Report Was Produced

1.4 Structure of This Report

2 Big Data Security and Privacy

2.1 Introduction

2.2 Scope

2.3 Actors

2.4 Classification and Discussion of Topics

3 Use Cases

3.1 Retail/Marketing

3.1.1 Scenario 1: Modern Day Consumerism

3.1.2 Scenario 2: Nielsen Homescan

3.1.3 Scenario 3: Web Traffic Analytics

3.2 Healthcare

3.2.1 Scenario 1: Health Information Exchange

Security:

i. Light-weight but secure off-cloud encryption: The ability to perform light-weight but secure off-cloud encryption of an EHR that can reside in any container, ranging from a browser to an enterprise server, and that leverages strong symmetric cryptography.

ii. Homomorphic Encryption.

iii. Applied Cryptography: Tight reductions, realistic threat models, and efficient techniques.

Privacy:

i. Differential Privacy: Techniques for guaranteeing against inappropriate leakage of PII.

ii. HIPAA

Current research: Homomorphic Encryption, Off-cloud Encryption.

3.2.2 Scenario 2: Genetic Privacy

3.2.3 Scenario 3: Pharma Clinical Trial Data Sharing [3]

3.3 Cyber-security

3.3.1 Scenario

3.4 Government

3.4.1 Scenario 1: Military (Unmanned Vehicle sensor data)

3.4.2 Scenario 2: Education (“Common Core” Student Performance Reporting)

3.5 Industrial: Aviation

3.5.1 Scenario

4 Abstraction of Requirements

4.1 Privacy of data

4.2 Provenance of data

4.3 System Health

5 Internal Security Practices

5.1 Internal Access control rules for general industry

6 Taxonomy of Security and Privacy Topics

6.1 Taxonomy of Technical Topics

6.2 Privacy

6.3 Provenance

6.4 System Health

7 Security Reference Architecture

7.1 Architectural Component: Interface of Sources Transformation

7.2 Architectural Component: Interface of Transformation Uses

7.3 Architectural Component: Interface of Transformation Data Infrastructure

7.4 Architectural Component: Internal to Data Infrastructure

7.5 Architectural Component: General

8 Mapping Use Cases to Reference Architecture

8.1 Cargo Shipping

(Image from William Miller)

8.1.1 Sources Transformation:

8.1.2 Transformation Uses:

8.1.3 Transformation Data Infrastructure:

8.1.4 Data Infrastructure:

8.1.5 General:

8.2 Nielsen Homescan

8.2.1 General Description of the Industry / Use Case:

8.2.2 Mapping to the Security Reference Architecture:

8.3 Pharma Clinical Trial Data Sharing

8.3.1 General Description of the Industry / Use Case:

Under an industry trade group proposal, clinical trial data for new drugs will be shared outside intra-enterprise warehouses. Regulatory submissions commonly exceed “millions of pages.”

8.3.2 Mapping to the Security Reference Architecture:

8.4 Large Network Cybersecurity SIEM

8.4.1 General Description of the Industry / Use Case:

8.4.2 Mapping to the Security Reference Architecture:

8.5 Consumer Digital Media Usage

8.5.1 General Description of the Industry / Use Case:

Content owners license data for usage by consumers through presentation portals (e.g., Netflix, iTunes). Usage data is Big Data, including demographics at the user level and patterns of use such as play sequence, recommendations, and content navigation.

8.5.2 Mapping to the Security Reference Architecture:

8.6 Unmanned Military Vehicle Sensor Systems

8.6.1 General Description of the Industry / Use Case:

Unmanned vehicles (“drones”) and their onboard sensors (e.g., streamed video) can produce petabytes of data that must be stored in nonstandard formats. Refer to the DISA large data object contract for exabytes in the DoD private cloud.

8.6.2 Mapping to the Security Reference Architecture:

8.7 Common Core K-12 Student Reporting

8.7.1 General Description of the Industry / Use Case:

8.7.2 Mapping to the Security Reference Architecture:

8.8 Web Traffic Analytics

8.8.1 General Description of the Industry / Use Case:

Visit-level webserver logs are high-granularity and voluminous. Web logs are correlated with other sources, including page content (buttons, text, navigation events) and marketing events such as campaigns and media classification.

8.8.2 Mapping to the Security Reference Architecture:

8.9 Health Information Exchange

8.9.1 General Description of the Industry / Use Case:

8.9.2 Mapping to the Security Reference Architecture:

9 References

Executive Summary

1 Introduction

1.1 Background

There is broad agreement among commercial, academic, and government leaders about the remarkable potential of “Big Data” to spark innovation, fuel commerce, and drive progress. Big Data is the term used to describe the deluge of data in our networked, digitized, sensor-laden, information-driven world. The availability of vast data resources carries the potential to answer questions previously out of reach, such as: How do we reliably detect a potential pandemic early enough to intervene? Can we predict new materials with advanced properties before these materials have ever been synthesized? How can we reverse the current advantage of the attacker over the defender in guarding against cybersecurity threats?

However, there is also broad agreement on the ability of Big Data to overwhelm traditional approaches. The rate at which data volumes, speeds, and complexity are growing is outpacing scientific and technological advances in data analytics, management, transport, and more.

Despite the widespread agreement on the opportunities and current limitations of Big Data, a lack of consensus on some important, fundamental questions is confusing potential users and holding back progress. What are the attributes that define Big Data solutions? How is Big Data different from the traditional data environments and related applications that we have encountered thus far? What are the essential characteristics of Big Data environments? How do these environments integrate with currently deployed architectures? What are the central scientific, technological, and standardization challenges that need to be addressed to accelerate the deployment of robust Big Data solutions?

At the NIST Cloud and Big Data Forum held on January 15-17, 2013, the community strongly recommended that NIST create a public working group for the development of a Big Data Technology Roadmap. This roadmap will help to define and prioritize requirements for interoperability, portability, reusability, and extensibility for Big Data usage, analytic techniques, and technology infrastructure in order to support secure and effective adoption of Big Data.

On June 19, 2013, the NIST Big Data Public Working Group (NBD-PWG) was launched with overwhelming participation from industry, academia, and government across the nation. The scope of the NBD-PWG is to form a community of interest from all sectors, including industry, academia, and government, with the goal of developing a consensus in definitions, taxonomies, secure reference architectures, and a technology roadmap. Such a consensus would create a vendor-neutral, technology- and infrastructure-agnostic framework that would enable Big Data stakeholders to pick and choose the best analytics tools for their processing and visualization requirements on the most suitable computing platform and cluster, while allowing Big Data service providers to add value.

The NBD-PWG has so far created five subgroups: Definitions and Taxonomies, Use Case and Requirements, Security and Privacy, Reference Architecture, and Technology Roadmap. These subgroups will help to develop the following set of preliminary consensus working drafts by September 27, 2013:

  1. Big Data Definitions
  2. Big Data Taxonomies
  3. Big Data Requirements
  4. Big Data Security and Privacy Requirements
  5. Big Data Reference Architectures White Paper Survey
  6. Big Data Reference Architectures
  7. Big Data Security and Privacy Reference Architectures
  8. Big Data Technology Roadmap

Due to time constraints and dependencies between subgroups, the NBD-PWG hosted weekly two-hour teleconference meetings, Monday through Friday, for the respective subgroups. Every three weeks, the NBD-PWG called a joint meeting for progress reports and document updates from the five subgroups. In between, subgroup co-chairs met for two hours to synchronize their respective activities and to identify issues and solutions.

1.2 Objectives

Scope

The focus of the NBD-PWG Security and Privacy Subgroup is to form a community of interest from industry, academia, and government, with the goal of developing a consensus secure reference architecture to handle security and privacy issues across all stakeholders. This includes gaining an understanding of what standards are available or under development, as well as identifying which key organizations are working on these standards.

Tasks

  • Gather input from all stakeholders regarding security and privacy concerns in Big Data processing, storage, and services.
  • Analyze and prioritize a list of challenging security and privacy requirements that may delay or prevent adoption of Big Data deployment.
  • Develop a Security and Privacy Reference Architecture that supplements the general Big Data Reference Architecture.

Deliverables

  1. Produce a working draft of the Big Data Security and Privacy Requirements Document
  2. Produce a working draft of the Big Data Security and Privacy Reference Architecture

1.3 How This Report Was Produced

[Wo will put more thoughts into this section; each subgroup will be slightly different]

1.4 Structure of This Report

[Subgroup will come up with this section and hope it will be consistent with other subgroups’ approach]

2 Big Data Security and Privacy

2.1 Introduction

To standardize on a security reference architecture, there is a need to leverage specialists in diverse application and technology domains. Application domains include Health Care, Drug Discovery, Finance, Corporate, and Government. Examples of scenarios within these application domains include Health Exchanges, Clinical Trials, Mergers and Acquisitions, Device Telemetry, and International Anti-Piracy. Technology domains include identity, authorization, audit, network and device security, and federation across trust boundaries. When cloud service providers and federation across trust boundaries are leveraged, technology domains such as cryptography are needed to augment traditional security. Just as it would be difficult to align schemas and protocols within implementations of a technical domain, it would be quite difficult to standardize, align, or even provide a mapping between the terms used across these application and technology domains. To communicate effectively across domains, there is a need to socialize the usage of terms related to security and compliance. This must be done in a manner that minimizes the barrier to entry for any specialist to participate, while minimizing the dilution of the working group’s core focus.

Within a vertical, and often across verticals, there tends to be a common vocabulary because of the underpinning of GRC (Governance, Risk Management and Compliance) requirements, which include a degree of litigation preparedness. However, there are disparate regulatory bodies across sovereign boundaries, and often the entities that certify are distinct from those that audit or adjudicate. The technologies to be leveraged are quite diverse and include devices and services for perimeters, networks, and devices. While these techniques and solutions have grown organically, there is a trend toward consistent blueprints for Policy Authoring, Decision and Enforcement, and for federation of demands, claims, and obligations across organizations. Clouds and Federation tend to complicate things for both the Application and the Technology domains, because compliance requirements did not presuppose technological advances, and security mechanisms are often a previous generation of enterprise solutions that are being repurposed inappropriately for a radically different threat model. In the legal domain, for instance, the classification of data can change when it moves from an enterprise data center to the data center of a service provider; hence, the laws pertaining to discovery can be interpreted in a manner that would be detrimental to litigants. This is further complicated by the fact that participants and service providers can reside in distinct legal jurisdictions with inconsistent civil procedure.

In compliance domains, even if all participants were to have state-of-the-art security, it is possible that they would be in violation of compliance requirements if they did not federate identity, authorization, and audits in a manner that would enable each party to address obligations such as audits, alerts, document lifecycle management, and others. When compliance requirements are mingled with technologies for security, things can get quite complicated. It is not uncommon to encounter situations where an ostensible compliance requirement causes a security requirement to be compromised, and vice versa.

Therefore, it is imperative to enable working across specializations, but to do this in a manner that is lightweight and does not require a specialist to have any significant knowledge of other domains. There is a need to communicate terms, but also the broader intent, along with opportunities and risks.

There has been an explosion of Big Data applications and solutions, but there has been no significant effort at understanding the business or GRC requirements of participants in order to do business at scale in the clouds. Hence, there has been no significant effort at capturing and communicating essential terms and requirements across application and technology domains. Infrastructure has largely been designed and deployed without sufficient prior consideration of cloud security requirements. Given the difficulty of retrofitting security mechanisms, there is growing interest in leveraging cryptographic techniques that can provide content-level security, such as format-preserving encryption. In addition, recent news about government surveillance could catalyze changes in how clouds are used; it is likely that infrastructure will fragment along sovereign and compliance boundaries, making it even more necessary to leverage cryptographic techniques.

It is becoming increasingly likely that Big Data without Big Data Security will be just silos of fragmented data. Therefore, it is increasingly important to work effectively and efficiently across specializations to address opportunities and reduce risks, so that we can define a reference architecture that is sovereign-friendly and cloud-effective. The immediate need is to efficiently communicate terms and intent.

2.2 Scope

Firstly, a distinction needs to be made between fault tolerance and security. Fault tolerance is resistance to unintended accidents, while security is resistance to malicious actions. Secondly, we need to understand how Big Data security concerns arise out of the defining characteristics of Big Data and how they differ from traditional security concerns.

  1. Big Data is gathered from diverse end-points, so there are more types of actors than just Providers and Consumers, viz. Data Owners: for example, mobile users, social network users, and so on.
  2. Data aggregation and dissemination have to be done securely and inside the context of a formal, understandable framework. This should be part of the contract that has to be provided to Data Owners.
  3. Availability of data to Data Consumers is an important aspect of Big Data. Availability can be maliciously affected by Denial of Service (DoS) attacks.
  4. Searching and filtering of data is important, since not all of the massive amount of data needs to be accessed. What are the capabilities provided by the Provider in this respect?
  5. The balance between privacy and utility needs to be thoroughly analyzed. Big Data is most useful when it can be analyzed for information.
  6. However, privacy would restrict the form and availability of data to analytics technologies.
  7. Since there is a separation between Data Owner, Provider, and Data Consumer, the integrity of data coming from end-points has to be ensured. Data poisoning has to be ruled out.
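
As an illustration of item 7, the following is a minimal sketch of one way a Provider might check that a record arriving from an end-point has not been altered in transit, using a keyed message authentication code (HMAC-SHA256) from the Python standard library. The key, the record fields, and the function names are hypothetical and are not taken from any working group material. The sketch assumes the key has already been shared securely with the end-point; it does not address a Data Owner who deliberately submits poisoned data.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret provisioned to one end-point; key management is out of scope.
ENDPOINT_KEY = b"example-key-not-for-production"


def sign_record(record: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_record(record: dict, tag: str, key: bytes) -> bool:
    """Recompute the tag and compare it to the received tag in constant time."""
    return hmac.compare_digest(sign_record(record, key), tag)


# Hypothetical sensor reading produced at an end-point.
reading = {"device_id": "sensor-17", "value": 21.5}
tag = sign_record(reading, ENDPOINT_KEY)

# A copy of the reading that was modified after signing.
tampered = dict(reading, value=99.9)

print(verify_record(reading, tag, ENDPOINT_KEY))
print(verify_record(tampered, tag, ENDPOINT_KEY))
```

A tag of this kind only shows that the record matches what the key holder signed. Detecting records that are validly signed but malicious would require additional measures, such as plausibility checks or provenance tracking.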

2.3 Actors

A person has relationships with many applications and sources of information in a big data system.

We describe a number of instances to exemplify this assertion:

1. A retail organization refers to a person who “may” buy goods or services as a consumer before the purchase and as a customer after the purchase.

2. A retail organization may use a social media platform as a channel for its online store.

3. The person may be a patron at a food and beverage organization for as few as none and as many as 3 before a warning may need to be triggered.

4. A person has a customer relationship with a financial organization in either prepaid or personal banking services.

5. A person may have a car or auto loan with a different or the same financial institution.

6. A person may have a home loan with a different or the same bank as a personal bank, or each may be a different organization for the same person.

7. A person may be “the insured” on health, life, auto, homeowner’s, or renter’s insurance.

8. A person may be the beneficiary or future insured person by an employer payroll deduction through a payroll service in the private sector or an employment development department in the public sector.

9. A person has been educated by many or few educational organizations in either public or private schools for the first 15-20 years of their childhood making the right decisions.

10. A person may be an employee, temporary worker, contractor, or another third-party employee for one or more companies or businesses.

11. A person may be underage and have special conditions around the collection of data.

2.4 Classification and Discussion of Topics

The set of topics was initially adapted from the scope of the CSA BDWG charter, organized according to the classification in [1]. Security and Privacy concerns are classified in four categories:

  1. Infrastructure Security
  2. Data Privacy
  3. Data Management
  4. Integrity and Reactive Security

In this section, we describe the topics in detail, along with community discussion around them.

Infrastructure Security

  1. Review of technologies and frameworks that have been primarily developed for performance, scalability, and availability (e.g., Apache Hadoop, MPP databases).
  2. High-availability
  3. Security against Denial-of-Service (DoS) attacks.

Data Privacy

  1. Impact of social data revolution on security and privacy of big data implementations.
  2. Unknowns of Innovation: When a perpetrator, abuser, or stalker misuses technology to target and harm a victim, there are various criminal and civil charges that might be applied to ensure accountability and promote victim safety. There are a number of U.S. federal and state/territory/tribal laws that might apply. To support the safety and privacy of victims, it is important to take technology-facilitated abuse and stalking seriously. This includes assessing all ways that technology is being misused to perpetrate harm, and considering all charges that could or should be applied.
  3. Identify laws that address violence and abuse. Identify where they explicitly or implicitly include the use of technology and electronic communications:
     1. Stalking and cyberstalking (felony menacing via electronic surveillance, etc.)
     2. Harassment, threats, assault
     3. Domestic violence, dating violence, sexual violence, sexual exploitation
     4. Sexting and child pornography: electronic transmission of harmful information to minors, providing obscene material to a minor, inappropriate images of minors, lascivious intent
     5. Bullying and cyberbullying
     6. Child abuse
  4. Identify possible charges related to technology, communications, privacy, and confidentiality:
     1. Unauthorized access, unauthorized recording/taping, illegal interception of electronic communications, illegal monitoring of communications, surveillance, eavesdropping, wiretapping, unlawful party to call
     2. Computer and Internet crimes: fraud, network intrusion
     3. Identity theft, impersonation, pretexting
     4. Financial fraud, telecommunications fraud
     5. Privacy violations: Reasonable expectation lawful purposes
     6. Consumer protection laws
     7. Violation of no contact, protection, and restraining orders
     8. Finding Laws To Charge Perpetrators Who Misuse Technology. Defamatory libel, slander, economic or reputational harms, privacy torts
     9. Burglary, criminal trespass, reckless endangerment, disorderly conduct, mischief, obstruction of justice.
  5. Data-centric security to protect data no matter where it is stored or accessed in the cloud
     1. For example, attribute-based encryption, format-preserving encryption
  6. Big data privacy and governance [1]
     1. Data discovery and classification
     2. (Flexible) policy management for accessing and controlling the data.
        1. For example, language framework for big data policies
     3. Data masking technologies: anonymization, rounding, truncation, hashing, differential privacy
        1. It is important to consider how these approaches degrade performance or hinder delivery altogether. Often these solutions are proposed and then cause an outage at the time of the release, forcing the removal of the option.
     4. Data monitoring
     5. Compliance with regulations such as HIPAA, EU data protection regulations, APEC Cross-Border Privacy Rules (CBPR) requirements, and country-specific regulations
     6. Regional data stores enable regional laws to be enforced
     7. Cyber-security Executive order 1998 - assumed data and information would remain within the region.
     8. People-centered design makes the assumption that private sector stakeholders are operating ethically and respecting the freedoms and liberties of all Americans. [2]
     9. Class action and civil suits are highly likely, given the significant increase in the volume of threats with Big Data.
     10. People before profit must be revisited to understand the large number of Executive Orders overlooked.
     11. People before profit must be revisited to understand the large number of domestic laws overlooked.
     12. Indigenous and Aboriginal people and the privacy of all associated vectors and variables must be excluded from any big data store; in any case, a person must opt in rather than opt out.
     13. All tribal land is an exclusion from any image capture and video streaming or capture.
     14. Human Rights
     15. Government access to data and freedom of expression concerns
     16. Some believe people in general are not nearly as concerned about freedom of expression as they are about misuse or the inability to govern private sector use. [3]
     17. Cisco’s Internet of Everything is directly dependent on Big Data; as shown in the survey summary, the greatest concern respondents chose was “threats to data (loss) and fear for physical safety.”
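
To make the data masking technologies named in the list above more concrete (truncation, rounding, hashing, differential privacy), the following is a minimal sketch using only the Python standard library. All function names, the sample values, and the key are hypothetical, and the parameter choices (digits kept, rounding step, epsilon) are illustrative assumptions rather than recommendations. The hashing example uses a keyed hash (HMAC), which is one common variant, and the differential privacy example uses the Laplace mechanism for a counting query with sensitivity 1.

```python
import hashlib
import hmac
import random


def truncate_zip(zip_code: str, keep: int = 3) -> str:
    """Truncation: keep only the leading characters and mask the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def round_to(value: float, step: int = 10) -> int:
    """Rounding: coarsen a numeric value to the nearest multiple of step."""
    return int(round(value / step) * step)


def pseudonymize(identifier: str, key: bytes) -> str:
    """Keyed hashing: replace an identifier with an HMAC-SHA256 digest."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def laplace_noise(scale: float, rng: random.Random) -> float:
    """The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale)."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Laplace mechanism for a counting query, assuming sensitivity 1."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical values for illustration only.
MASKING_KEY = b"example-key-not-for-production"
rng = random.Random()

print(truncate_zip("20899"))
print(round_to(34))
print(pseudonymize("patient-0001", MASKING_KEY))
print(noisy_count(1200, epsilon=0.5, rng=rng))
```

Each technique trades utility for privacy in a different way: truncation and rounding coarsen individual values, keyed hashing hides identifiers while keeping records linkable, and the Laplace mechanism perturbs aggregate results. None of these on its own guarantees anonymity, and Python's `random` module is not a cryptographically secure source of noise, so a production system would need a vetted implementation.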