Draft 0.1; 30 Sept 2004

Open Science Grid

Security Incident Handling and Response Guide

Document Log
Issue / Date / Author / Comment
0.1 / 30 Sept 2004 / Doug Pearson / Draft release to the Activity Group
0.2 / 7 Sept 2004 / Doug Pearson / Draft release to the OSG Workshop

i. Document Development Milestones:

September 6, 2004: An abbreviated version of the full Guide will contain completed recommendations for the establishment and maintenance of contact lists and communications methods; preliminary recommendations for containment and notification methods; and an outline of additional content to be developed. Recommendations regarding the additional content development process and schedule will be made. The abbreviated version will be presented to OSG and iVDGL committees on September 6, for review and discussion during the Sept 9-10 OSG Workshop.

November, 2004: Progress in content development, especially aiming for harmonization with EGEE efforts, in preparation for the Second EGEE Conference, November 22-26, 2004.

February 2005: Guidelines developed and processes and services implemented as necessary for OSG-0.

ii. Credits:

This document was developed through the work of the OSG Security Incident Handling Activity Group[1], including members Bob Cowles (SLAC), Mark Green (U Buffalo), Michael Helm (ESnet/LBNL), Doug Olson (LBNL), Doug Pearson (IU/REN-ISAC), Dane Skow (Fermilab), Tom Throwe (BNL), and Von Welch (NCSA); and with background developed through the prior works of Yuri Demchenko (University of Amsterdam).

iii. Contact:

Comments, questions, etc. may be referred through the OSG Security Incident Handling Activity Group chair, Doug Pearson <>.

iv. Document Work in Progress:

In addition to the specific call-outs in the document for work in progress, the following enhancements, additions, and changes are in progress:

  • Consideration needs to be given to the nature of OSG as a framework for the cooperation of grids, and as a grid itself, i.e. OSG-0. Currently the document is heavily slanted to OSG as a framework. One approach to the duality may be to create two documents: one as a concept of operations for grids that participate in the framework and the central services and processes required to facilitate the cooperation, and another to serve as the specific guide to OSG-0.
  • Need to define the minimum set of requirements for OSG-0 and identify the implementation.
  • Slant the processes more towards coordination rather than centralized command and control.

1.Introduction

2.The purpose of this document

3.Definitions

4.Incident taxonomy and levels of support

5.Policies

5.1.Reporting and responding to Grid incidents:

5.2.Guidance to the sharing and disclosure of sensitive data

6.Organizational structure

7.Supporting resources

7.1.Mailing lists

8.Process

8.1.Discovery and reporting

8.2.Triage

8.3.Containment

8.4.Initial notification

8.5.Analysis and response

8.6.Tracking and progress notification

8.7.Escalation

8.8.Reporting

8.9.Public relations

8.10.Post-incident analysis

9.Communications support

9.1.Contact lists

9.2.Normal communication channels

9.3.Secure communications

9.4.Phone bridge

10.Web site

11.Periodic reporting

12.Relationships to other entities

13.Outreach

14.Security Operations Center

15.Effectiveness evaluation process

16.Information disclosure guidelines

17.Incident reporting formats

18.Local processes and tools to support incident response

19.Guidance to middleware and grid service developers

20.Relationships

21.Relevant and related standards and practices

22.Useful References and Other Works

1.Introduction

The cyberspace defined by Grids transcends organizational boundaries. An operative space requires that participants develop new forms of cooperation with respect to policies, resources, operations, and security. Although the Grid doesn't create fundamentally new cyber security risks, it does serve to amplify risks and creates a broader scope of impact for incidents. User identity and authorizations are extended throughout the multi-organizational space. Large numbers of homogeneous systems scattered across organizations are presented to the authenticated user. These and other aspects provide fertile ground for the rapid spread of security incidents and expose an institution to risks commensurate with the security practices of its collaborators. Additionally, the high profile and vast resources of Grids make them attractive hacking targets, whether for notoriety or as a platform for other attacks.

The character of the security vulnerabilities and risks presented by Grid cyberspace provides a rationale for strong coordination among the Grid participants for cyber security incident response.

2.The purpose of this document

The Open Science Grid Consortium[2] is an umbrella for guidance and support to various independent Grid efforts, seeking to expand and enable the use of common grid infrastructure and shared resources for the benefit of scientific applications.

This document is targeted at Grids participating in the OSG structure, and is relevant to harmonization with other national and international efforts. The document was developed with an eye to the US PPDG[3], iVDGL[4], and TeraGrid[5] communities, and the European LCG[6] and EGEE[7] efforts.

The purpose of this document is to guide the development and maintenance of a common capability for handling and response to cyber security incidents on Grids. The capability will be established through (1) common policies and processes, (2) common organizational structures, (3) cross-organizational relationships, (4) common communications methods, and (5) a modicum of centrally-provided services and processes.

The vision articulated in this document is not to establish a centralized OSG incident handling and response organization, but to establish a concept of operations for individual Grid security efforts, permitting a harmonization and collaboration in the collective Grid space.

Ultimately the purpose behind the development of this document is to reduce the incidence, severity, and exposure of Grids to cyber security incidents and to reduce the exposures to institutions and their systems posed by the Grid.

3.Definitions

A cyber security incident is any real or suspected event that poses a breach of explicit or commonly-held security policies and practices, and that poses a real or potential threat to the integrity of services, resources, infrastructure, or identities. Some typical classes of incidents are computer intrusions, denial-of-service attacks, and worm/virus infections.

Although the Grid doesn't create fundamentally new cyber security risks or classes of incidents, it does serve to amplify risks and creates a broader scope of the impact of incidents.

4.Incident taxonomy and levels of support

Need work here

5.Policies

5.1.Reporting and responding to Grid incidents:

Grid Site Charters, Agreements, and other policy documents that guide overall site participation are the explicit sources of site requirements for security incident handling and response. Ideally, those charters and agreements will refer directly to this document for incident handling and response requirements and procedures. The charters and agreements should state that:

  • Grid participants MUST follow the guidelines and practices established in the OSG Security Incident Handling and Response Guide.
  • Grid participants MUST report incidents that have an impact on or relationship to Grid resources, services, or identities. Reports MUST be made for incidents with potential impact[*] as well as incidents with known impact.
  • Grid participants MUST respond to incidents where local systems or resources are presenting a threat to Grid security. Response is guided by the OSG Security Incident Handling and Response Guide, section 8.3, Containment, and more fully through the methods of section 8, Process.

5.2.Guidance to the sharing and disclosure of sensitive data

5.2.1.Privacy and security concerns

Information exchanged during the investigation of an incident may include data about individuals or groups and behaviors that is not meant to be public information (e.g. contact telephone numbers). Additionally, some application areas have stringent requirements about how their data is to be handled due to legal and ethical considerations (e.g. biomedical data). A site handling incident information MUST treat it with security appropriate to the sensitivity of the data involved.

5.2.2.Sharing of processed incident information

Incident information MUST be transferred from one site to another only through secure means sufficient to prevent casual eavesdropping by attackers. Additional protections may be required if the information being exchanged is of the nature described above.

All media contacts relating to an OSG security incident MUST be handled by the OSG Public Relations contacts. All interviews or information passed to the media MUST be through OSG Public Relations except through prior arrangement.

5.2.3.Handling of shared log, netflow, and other supporting data

Need work here [What should we say? It’s probably not suitable for actual inclusion in the ticketing system, so that system would at best contain a pointer to it. Can the GOC be responsible for maintaining that repository of supporting information associated with an incident? How else do we provide for the protection of the data in cases where there might be prosecution? What about physical evidence – how do we handle that? What data handling procedures do we need to be sure that logs, etc. haven’t been altered?]

6.Organizational structure

The operational organizational structure comprises the following components:

Security Contacts. Every Grid participant (user, service, or resource) must have assigned Security Contact(s). Institutional approaches may vary significantly. Security Contacts may report from a variety of organizational units, such as system administrators, security engineers, or network engineers. The purview of Security Contacts may be an entire institution or a single department. Where possible, the Security Contacts should include 24x7 support desks, in addition to, but not replacing, the appointment of specific individuals. A list of Security Contacts is maintained by the Security Operations Center (SOC).

A body of incident handling and response technical experts. The body is a self-organized collection of volunteer technical experts who are available to provide response and remediation advice in the event of large or technically complex incidents. The expert body is organized and available as a mailing list maintained by the SOC.

Ad hoc Incident Response Teams. Depending upon the severity, complexity, and scope of an incident, response may require the ad hoc formation of an Incident Response Team. The team will be formed under the auspices of the SOC and will be composed of Security Contacts, site systems administrators, Grid software specialists, operations centers, etc. as appropriate to the incident. The team leader is responsible for coordinating with the SOC for supporting services, and for maintaining a flow of information regarding incident status to the SOC.

The Security Operations Center (SOC). Individual Grids may choose to establish individual centers or partner with other Grid efforts in joint efforts. The SOC function will often be provided by the Grid Operations Center (GOC). The SOC is responsible for organizing and coordinating large-scale, multi-institutional response efforts; tracking and reporting on security incidents; monitoring information flows, e.g. closed security lists and REN-ISAC alerts, for threats to the served community; and providing supporting services (see section 14, Security Operations Center).

A Security Operations Advisory Group is established with representative participation of the served community, to advise the development and practice of the Security Operations Center.

Cross SOC Coordination (XSOC). The Security Operations Centers of Grids operating under OSG participate in a Cross-SOC Coordination body. The body acts to share information across Grid stovepipes regarding incidents, vulnerabilities, exploits, attacks, practices, tools, etc.

7.Supporting resources

7.1.Mailing lists

Three mailing lists support incident reporting, analysis, and response. In the following, xxx.yyy is to be replaced with the respective Grid, e.g. ivdgl.org or ppdg.net.

The standard grid support processes and Grid Operations Center procedures are employed by end-users and individuals other than Security Contacts to report incidents. The GOC monitors and supports these reporting methods, and relays incident reports to INCIDENT-SEC-L.

INCIDENT-SEC-L is a closed list composed of the Security Contacts at all sites – including Grid Operations Centers and Security Operations Centers. Only list members may post to the list. The list is intended solely for initial incident reporting, not for incident discussion.

INCIDENT-DISCUSS-L is a closed list composed of the same members as INCIDENT-SEC-L. Only members may post to the list. The list is intended for discussion of reported incidents.

The reason for differentiating INCIDENT-SEC-L and INCIDENT-DISCUSS-L is to allow alerting mechanisms to be driven by the presence of a new message in INCIDENT-SEC-L. Communications on both lists MUST be encrypted.
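As an illustration of how an alerting mechanism might key off the reporting list, the following is a minimal sketch only. It assumes that message encryption covers the body while the To and Cc headers remain readable, and that the list address has the local part incident-sec-l; the function name should_alert and the constant ALERT_LIST are hypothetical and not part of any OSG service.

```python
from email import message_from_string
from email.utils import getaddresses

# Assumption: the local part of the reporting list address; the domain
# differs per Grid, so only the local part is compared.
ALERT_LIST = "incident-sec-l"


def should_alert(raw_message: str) -> bool:
    """Return True when a message is addressed to the reporting list.

    Hypothetical helper: only the To and Cc headers are inspected, so the
    check works even when the message body is encrypted.
    """
    msg = message_from_string(raw_message)
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    for _name, address in getaddresses(headers):
        local_part = address.lower().split("@")[0]
        if local_part == ALERT_LIST:
            return True
    return False
```

A message addressed to the discussion list would not trigger an alert under this sketch, which reflects the separation of the two lists.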

8.Process

The processes for incident handling and response are:

  1. Discovery and reporting
  2. Triage
  3. Containment
  4. Initial notification
  5. Analysis and response
  6. Tracking and progress notification
  7. Escalation
  8. Reporting
  9. Public relations
  10. Post-incident analysis

8.1.Discovery and reporting

Incidents will be discovered through a variety of means, including users, system administrators, and engineers; operations center monitoring of infrastructure, services, and resources; monitoring of intelligence channels such as FIRST and REN-ISAC; and reports from peers of the aforementioned entities.

If an incident is discovered locally, such as by a user or systems administrator, the local incident reporting process should be used, AND the discovering party should make certain to inform the local incident response team that (1) the incident involves the Grid, and (2) the incident must be reported to INCIDENT-SEC-L according to guidelines at [webpage]. If the initial assessment by the local incident response team reveals a real or potential impact on, or relationship to, Grid resources, services, or identities, the Security Contact must immediately report the incident to INCIDENT-SEC-L.

If an incident is discovered locally, such as by a user or systems administrator, AND a local incident reporting process cannot be engaged or the Security Contact informed, for instance due to occurrence outside of normal business hours, the discovering party should report the incident directly to the Grid security incident handling community via [webpage], and follow up with local incident reporting procedures when possible.

If an incident is discovered by Security Contacts, grid operations centers, or security operations centers, the discovering party should immediately report the incident to INCIDENT-SEC-L.

In all cases the SOC will monitor incident reports and will assume the responsibility to ensure that involved sites are aware of the incident.
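The reporting paths described in this section can be summarized as a small decision function. This is a sketch under stated assumptions, not an OSG tool: the names Discoverer and first_report_steps are hypothetical, and the returned strings simply paraphrase the text above.

```python
from enum import Enum
from typing import List


class Discoverer(Enum):
    # Hypothetical categories of discovering party, taken from the text.
    USER_OR_ADMIN = "user or systems administrator"
    SECURITY_CONTACT = "Security Contact"
    OPERATIONS_CENTER = "grid or security operations center"


def first_report_steps(discoverer: Discoverer,
                       local_process_reachable: bool = True) -> List[str]:
    """Return the ordered reporting steps for the discovering party."""
    if discoverer in (Discoverer.SECURITY_CONTACT, Discoverer.OPERATIONS_CENTER):
        return ["report to INCIDENT-SEC-L immediately"]
    if local_process_reachable:
        return [
            "use the local incident reporting process",
            "tell the local incident response team that the Grid is involved",
            "tell the local incident response team to report to INCIDENT-SEC-L",
        ]
    # Local process or Security Contact cannot be reached, e.g. out of hours.
    return [
        "report directly to the Grid security incident handling community",
        "follow up with local incident reporting procedures when possible",
    ]
```

For example, first_report_steps(Discoverer.USER_OR_ADMIN, local_process_reachable=False) yields the direct-report path followed by the local follow-up.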

Need template for reporting.

8.2.Triage

8.2.1.Verify the incoming incident reports

Need work here

8.2.2.Assign a severity classification

Through an initial assessment of the incident, assign a severity classification according to:

High:

The incident could lead to exploitation of the trust fabric, i.e. user and host identities, or
the incident could lead to instability of the overall Grid, or
a denial-of-service is in progress against all replicas of a given Grid service.

Medium:

The incident affects an instance of a Grid service, but Grid stability is not at risk, or
a denial-of-service affects one replica of a given Grid service, or
a local attack compromised a privileged user account.

Low:

A local attack compromised an individual user's non-privileged credentials, or
a denial-of-service attack or compromise affects only local grid resources.
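The classification above can be read as "take the highest level whose criteria match". The following is a minimal sketch of that reading; the field names are hypothetical, and treating Low as the fallback when no High or Medium criterion matches is an assumption, not something the Guide states.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


@dataclass
class IncidentAssessment:
    """Hypothetical triage findings; field names are illustrative only."""
    trust_fabric_at_risk: bool = False      # user or host identities could be exploited
    grid_stability_at_risk: bool = False    # overall Grid could become unstable
    dos_all_replicas: bool = False          # DoS against all replicas of a Grid service
    service_instance_affected: bool = False # one instance hit, Grid stability not at risk
    dos_one_replica: bool = False           # DoS against one replica of a Grid service
    privileged_account_compromised: bool = False


def classify(a: IncidentAssessment) -> Severity:
    """Return the highest severity whose criteria match.

    Assumption: anything matching no High or Medium criterion is Low.
    """
    if a.trust_fabric_at_risk or a.grid_stability_at_risk or a.dos_all_replicas:
        return Severity.HIGH
    if (a.service_instance_affected or a.dos_one_replica
            or a.privileged_account_compromised):
        return Severity.MEDIUM
    return Severity.LOW
```

Checking the High criteria first means that an incident matching both a Medium and a High criterion is classified High.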

8.2.3.Engage response activities

Need work here

8.3.Containment

There are three areas of concern for containment of an attack: (1) preventing further spread of the attack through local services/resources; (2) preventing further attacks from external grid services/resources; and (3) protecting the grid from attacks sourced at a different site. For this discussion, we will assume the local site already has procedures in place to handle (1); however, it should be validated as part of the site registration process.

8.3.1.Protection from attacks through the grid

Attacks originating from the grid might be coming from (1) a grid service hosted at the local site; (2) a grid service hosted at a remote site; (3) a shared authentication (group account where some other process possibly at some other site has handled the authentication and authorization of the user to request this resource/service); and (4) a single grid user. As a general matter, the level of response must take into account a number of factors:

  • Has the resource/service been compromised, or is it just under attack?
  • What kind of attack is it: DOS, user compromise, or privileged user compromise?
  • How important is the resource/service locally?
  • How important is the resource/service to the operation of the grid?
  • How important is the resource/service to various Virtual Organizations?

As an operational principle for the site, the normal response should be to err on the side of caution and block access from the grid during the initial stages of dealing with an intrusion – only opening access as is prudent and justified, without extraordinary risk. Having this policy results in two beneficial effects: (1) it gives sites more freedom of action and more confidence they can act to protect themselves without bringing down the wrath of the grid community; and (2) it will hopefully result in more redundancy of services, better failover, and applications that are more robust to outages in various parts of the grid (by putting the responsibility on the middleware and application developers to design a more failsafe environment).

Sites MUST inform the grid operations center of actions they take affecting grid resources/services.

A grid service/resource hosted at the local site MUST have an interface allowing it to be disabled and SHOULD be able to inform central scheduling and monitors that it is entering a disabled state. It is assumed that local site policies will handle containment issues from locally hosted resources/services to the rest of their infrastructure.

For a grid service/resource hosted at a remote site, an interface MUST be provided to local services and resources to block requests or access from the remote service. For a compromise-style attack, blocking authorizations at the appropriate level is probably sufficient. Queries from remote monitors and schedulers SHOULD be told that access is blocked at the appropriate level. For DOS-style attacks, lower-level protocol blocking is likely to be necessary, but there SHOULD still be a way to inform schedulers and monitors that access is being blocked.
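To make the interface requirements above more concrete, here is a minimal sketch of what a disable/block interface might look like. It is an illustration under assumptions, not an existing middleware API: the class ContainableService, its methods, and the notify callback (standing in for whatever channel reaches central schedulers and monitors) are all hypothetical.

```python
from typing import Callable, Optional, Set


class ContainableService:
    """Hypothetical wrapper around a grid service; not an existing middleware API."""

    def __init__(self, name: str,
                 notify: Optional[Callable[[str], None]] = None):
        self.name = name
        self.enabled = True
        self.blocked_remotes: Set[str] = set()
        # notify stands in for the channel to schedulers and monitors.
        self._notify = notify or (lambda message: None)

    def disable(self, reason: str) -> None:
        """Disable the locally hosted service and announce the disabled state."""
        self.enabled = False
        self._notify(f"{self.name}: disabled ({reason})")

    def block_remote(self, remote: str, reason: str) -> None:
        """Block requests from one remote service and announce the block."""
        self.blocked_remotes.add(remote)
        self._notify(f"{self.name}: blocking {remote} ({reason})")

    def handle_request(self, remote: str) -> str:
        """Answer explicitly when access is blocked instead of failing silently."""
        if not self.enabled:
            return "disabled"
        if remote in self.blocked_remotes:
            return "blocked"
        return "ok"
```

Returning an explicit "disabled" or "blocked" answer is one way to let remote monitors and schedulers learn that access is being refused on purpose. Lower-level protocol blocking for DOS-style attacks would sit outside such a class, for example at the network layer.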