Pennsylvania
Department of Public Welfare
Bureau of Information Systems
Event Notification Process Guideline
Version 2.0
November 10, 2005
Table of Contents
Introduction
Purpose
Objectives
Benefits
Definitions
Event
Stakeholder
Impacted Stakeholder
Troubleshooter
Overview of the Event Notification Process
Identification/Definition
Assessment/Resolution
Close-out
Step 1 – Event Identification/Definition
Receive the Event
Document the Event Details and Systems Affected
Step 2 – Event Assessment/Resolution
Notify the Troubleshooters of the Event
Conduct a Quick Assessment of the Event
Notify the Impacted Stakeholders of the Event
Continue to Resolve the Event
Update the Infrastructure Operations Section
Update the Impacted Stakeholders
Is the Event Resolved?
Step 3 – Event Close-out
Notify the Impacted Stakeholders the Event is Resolved
Outputs
Roles and Responsibilities
Appendix A: Event Notification Process Map
Appendix B: Event Notification Log
Appendix C: Event Notification Form
Appendix D: Post Mortem
Document Change Log
Event Notification Process Guideline
Introduction
This document defines the process for distributing an Event Notification to impacted stakeholders for any interruption affecting the server and mainframe production environments. An event is a problem that significantly impacts the Information Technology (IT) services provided by the Department of Public Welfare (DPW).
Purpose
An Event Notification determines how and under what conditions an event is reported. This document details the documentation, tracking, and notification process for disruptions that impact access to the DPW Production Environment. Examples of disruptions that trigger these events include a failed system health check, a call from the CIS hotline, and planned hardware maintenance for an application. These events can come from any organizational unit within DPW.
Objectives
The specific objectives of the Event Notification process are to:
- Document and track events from initial discovery through resolution
- Identify the stakeholders impacted
- Provide a means of communicating accurate and timely information to stakeholders
- Provide updates and event resolution information
Benefits
The following benefits accrue from implementing an event notification process:
- Provides a single point of contact for all event communications
- Provides a standard communication mechanism for all events
- Provides a standardized, documented procedure that can be followed repeatedly
- Provides a centralized event notification log
Definitions
Event
An event is a significant problem, occurrence, or happening.
Stakeholder
A stakeholder is an individual or organizational unit with a vested interest in a project. This individual or organizational unit may positively or negatively influence the event, or may be positively or negatively influenced by the event’s outcome.
Impacted Stakeholder
An impacted stakeholder is an individual or organizational unit affected by an event. Impacted stakeholders may include: BIS Technical Staff, BIS Development Staff, BIS Database Staff, DPW Program Offices, and Contractors.
Troubleshooter
A troubleshooter is a skilled individual or organizational unit responsible for diagnosing, isolating, and resolving the event.
Overview of the Event Notification Process
The three steps of the Event Notification process are shown below:
Identification/Definition
The Infrastructure Operations Section is notified by email, phone, or in person when an event has occurred or has to be scheduled. The event’s details are documented on a form. The form includes the initial notification received and the descriptive information surrounding the event. An entry is made in the event notification log to track the event. The Infrastructure Operations Section determines which organizational unit is assigned to resolve the event. If it is unable to determine which organizational unit is assigned to resolve the event, notification is sent to a predefined list of troubleshooters and the Infrastructure Operations Section begins its role as communication liaison for the event.
Assessment/Resolution
The Infrastructure Operations Section determines if the event is planned or unplanned and alerts the appropriate troubleshooters about the event. The Infrastructure Operations Section starts a fifteen minute clock to give the troubleshooters time to gather any extra details on the event before announcing it to the impacted stakeholders. The Infrastructure Operations Section is the centralized contact for initially communicating the event and providing any updates between the troubleshooters and the impacted stakeholders.
Close-out
When the Infrastructure Operations Section is notified that the event is resolved, the troubleshooter creates a final resolution notice announcing the event’s conclusion to the impacted stakeholders. The Infrastructure Operations Section bundles all of the event’s documentation in preparation to conduct a post mortem on the event. The post mortem reviews the event to determine its cause and how to prevent it in the future. The post mortem also identifies any lessons learned.
Step 1 – Event Identification/Definition
Receive the Event
The Infrastructure Operations Section is notified when an application, a server, or system outage has occurred. Notification occurs via an email message to PW, DataCenter, a phone call at 717-772-7153, or in person.
Document the Event Details and Systems Affected
The Infrastructure Operations Section enters the details surrounding the event on a form. This form includes the initial contact person reporting the information and a description of the event. This form also contains the related activities which occurred prior to the event and the system where the event happened.
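The details captured on the form can be pictured as a simple record that is appended to the event notification log. The following is a minimal sketch only: the class name, field names, and sample values are hypothetical and are not part of the DPW form or log.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class EventRecord:
    """One completed event form (all field names are hypothetical)."""
    reported_by: str        # initial contact person reporting the event
    description: str        # description of the event
    prior_activities: str   # related activities that occurred before the event
    affected_system: str    # system where the event happened
    received_at: datetime = field(default_factory=datetime.now)


def log_event(log: List[EventRecord], record: EventRecord) -> int:
    """Append the record to the log and return its 1-based entry number."""
    log.append(record)
    return len(log)


event_log: List[EventRecord] = []
entry_number = log_event(
    event_log,
    EventRecord(
        reported_by="J. Doe (example caller)",
        description="Users cannot log in to the application",
        prior_activities="Scheduled patching the previous evening",
        affected_system="Example application server",
    ),
)
```

Keeping the form fields together in one record means the same information can later be bundled for the post mortem without re-entering it.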
Step 2 – Event Assessment/Resolution
Notify the Troubleshooters of the Event
The Infrastructure Operations Section alerts the troubleshooters about the event and begins the fifteen minute reporting clock. This clock provides the troubleshooters the opportunity to assess the event’s impact. The Infrastructure Operations Section alerts the stakeholders of the problem within fifteen minutes of receiving the initial notification announcing the event. The Infrastructure Operations Section transmits any information communicated in the initial notification to the troubleshooters. The troubleshooters start evaluating the event based on the information in the initial notification.
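The fifteen minute reporting clock amounts to a deadline calculation. The sketch below assumes the clock is measured from the time the initial notification is received; the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

REPORTING_WINDOW = timedelta(minutes=15)  # the fifteen minute reporting clock


def stakeholder_deadline(initial_notification: datetime) -> datetime:
    """Latest time by which the impacted stakeholders should be alerted."""
    return initial_notification + REPORTING_WINDOW


def time_remaining(initial_notification: datetime, now: datetime) -> timedelta:
    """Time left on the clock; zero once the window has elapsed."""
    remaining = stakeholder_deadline(initial_notification) - now
    return max(remaining, timedelta(0))


received = datetime(2005, 11, 10, 9, 0)
deadline = stakeholder_deadline(received)
left = time_remaining(received, datetime(2005, 11, 10, 9, 10))
```

For a notification received at 9:00, the deadline is 9:15, and five minutes remain on the clock at 9:10.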
Conduct a Quick Assessment of the Event
This quick assessment determines how long the event will take to resolve. The event’s impact increases depending on the number of systems affected and the magnitude of the agency’s downtime. The troubleshooter informs the Infrastructure Operations Section of the estimated time to resolve the event and its impact. This quick assessment generally takes fifteen minutes or less.
Notify the Impacted Stakeholders of the Event
The Infrastructure Operations Section personnel relay the details of the event, the systems affected, the impact, and the estimated time to finish the work to all of the impacted stakeholders. The troubleshooter creates the message and the Infrastructure Operations Section distributes it to the impacted stakeholders.
Continue to Resolve the Event
The troubleshooters work continuously to resolve the event. The initial step involves examining the details surrounding the event. This research enables them to calmly and methodically work to reach the event’s resolution.
Update the Infrastructure Operations Section
The troubleshooters periodically update the status of their activities with the Infrastructure Operations Section. These updates describe what took place after the last update and what remains to resolve the event. The updates also include any changes to the estimated resolution time and the next update release.
Update the Impacted Stakeholders
The Infrastructure Operations Section sends structured messages to the impacted stakeholders, keeping them apprised of the situation and of how long it will take to resolve the event. These messages are geared to nontechnical users, are short, and provide details about the expected time of resolution and when the next update will be issued. The troubleshooter crafts the message. The Infrastructure Operations Section distributes the message to the impacted stakeholders.
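A structured message of this kind could be assembled from a fixed set of fields. The following is a sketch under assumptions: the message layout, the function name, and the sample values are hypothetical, since this guideline does not prescribe a message template.

```python
from datetime import datetime


def build_update_message(system: str, status: str,
                         expected_resolution: datetime,
                         next_update: datetime) -> str:
    """Compose a short, nontechnical status update (layout is hypothetical)."""
    fmt = "%m/%d/%y %H:%M"  # 24-hour clock keeps the output locale independent
    return "\n".join([
        f"Event update: {system}",
        f"Status: {status}",
        f"Expected resolution: {expected_resolution.strftime(fmt)}",
        f"Next update: {next_update.strftime(fmt)}",
    ])


message = build_update_message(
    system="Example application",
    status="Troubleshooters are working on the problem.",
    expected_resolution=datetime(2005, 11, 10, 11, 0),
    next_update=datetime(2005, 11, 10, 10, 0),
)
print(message)
```

Using the same four lines in every update lets stakeholders find the expected resolution time and the next update time in the same place each time.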
Is the Event Resolved?
The Infrastructure Operations Section determines if the event has been resolved based on the troubleshooters’ latest update. This decision point allows the event to reenter the process or to exit the process because of resolution.
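This decision point behaves like a loop: each troubleshooter update is relayed to the impacted stakeholders until an update reports the event resolved, at which point the process exits to close-out. The sketch below is illustrative only; the update structure (a dictionary with "message" and "resolved" keys), the message prefixes, and the function name are hypothetical.

```python
from typing import Callable, Iterable, List


def run_assessment_resolution(updates: Iterable[dict],
                              send_to_stakeholders: Callable[[str], None]) -> int:
    """Relay updates until one reports the event resolved.

    Returns the number of interim updates relayed before close-out.
    """
    relayed = 0
    for update in updates:
        if update["resolved"]:
            # Decision point: resolved, so exit to Step 3 (close-out).
            send_to_stakeholders("RESOLVED: " + update["message"])
            return relayed
        # Not resolved: relay the update and re-enter the process.
        send_to_stakeholders("UPDATE: " + update["message"])
        relayed += 1
    raise RuntimeError("updates ended before the event was resolved")


sent: List[str] = []
count = run_assessment_resolution(
    [
        {"message": "Restarting the affected server", "resolved": False},
        {"message": "Service verified after restart", "resolved": False},
        {"message": "System is operational", "resolved": True},
    ],
    sent.append,
)
```

In this example two interim updates are relayed before the final resolution message is sent.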
Step 3 – Event Close-out
Notify the Impacted Stakeholders the Event is Resolved
When the Infrastructure Operations Section is notified that the event is resolved, the impacted stakeholders are notified that the system is operational. The troubleshooter creates this final message. The event’s documentation is bundled in preparation for conducting a post mortem.
Outputs
The work products associated with the event are archived for future historical reference. These work products include the event notification log; any documents, procedures, or standards that are modified while working to resolve the event; and all communications related to the event. The post mortem and its supporting documentation are also outputs of this process.
Roles and Responsibilities
The following table lists the Event Notification roles and responsibilities:
ROLE / RESPONSIBILITIES
Infrastructure Operations Section /
- Receives and documents the details surrounding the event
- Communicates updates between the troubleshooters and the impacted stakeholders
Troubleshooters /
- Conducts quick assessments on unplanned events
- Resolves the event
Appendix A: Event Notification Process Map
This is the Event Notification process map. It graphically depicts an event’s progression starting with its discovery and ending with notifying everyone of the event’s resolution.
Appendix B: Event Notification Log
This is a blank copy of the Event Notification log. The Infrastructure Operations Section completes and submits it to the troubleshooters responsible for the event in question.
Event Notification Information / Response
Reported By: / Caller’s Name:
Organization:
Phone:
Date/Time Initial Call Received
System working in when event occurred
Screen, Menu, Application, Mainframe, Server, IP Address, Router, Switch, etc.
Attempted the same action more than once
If yes, how many times
Same error received each time
Exact error message
Others in area able to perform this action?
Yes or No
Were other transactions attempted in this system?
If NO, please do so, so that it can be determined whether the entire system is affected or the problem is transaction specific
If YES, was the other transaction successful
If known, is anyone else attempting to troubleshoot this event
Follow up date/time (if applicable)
Resolution Information: / Date/Time:
How problem was Resolved:
Appendix C: Event Notification Form
Downtime is a scheduled or unplanned event that deviates from standard activities or normal operating conditions. The user completes the following form to report interruptions as they occur. Fields marked with an asterisk (*) are mandatory and must be completed by the person reporting the event.
Save this form to your hard drive before using it, so that the email portion of the form can be viewed.
After the form is saved, click on the “Send a Copy” icon above.
The email message appears. Fill in the form and, when it is complete, send the document.
Event Notification
Brief description of symptoms: / Provide known outage or impairment information.
*
Cause of Outage/Impairment: / Provide what caused the problem.
*
The Outage Affected: / What Applications are affected?
What Hardware platforms are affected?
What Services are affected?
Etc…..
*
Who is currently working on the problem: / TELCOM, DTE, DADD, DIMO, Network, Server, etc…..
Outage was resolved at: / Event end time
Trouble Ticket Number (if opened) / Remedy ticket number, either CTC, DPH, or Intellimark
Additional Comments: / Enter any additional information
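A check for the mandatory fields could look like the sketch below. It assumes that the three fields followed by an asterisk in the form above (symptoms, cause, and what the outage affected) are the mandatory ones; the dictionary keys and the function name are hypothetical.

```python
from typing import Dict, List

# Assumption: the three fields followed by an asterisk in the form are
# the mandatory ones. The key names are hypothetical.
MANDATORY_FIELDS = ("symptoms", "cause", "affected")


def missing_mandatory_fields(form: Dict[str, str]) -> List[str]:
    """Return the mandatory fields that are absent or blank."""
    return [name for name in MANDATORY_FIELDS
            if not form.get(name, "").strip()]


draft = {
    "symptoms": "Application screens time out",
    "cause": "",
    "affected": "Example application; example server",
    "working_on": "Server",
}
missing = missing_mandatory_fields(draft)
```

Here the draft would be returned to the reporter because the cause field is blank.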
Appendix D: Post Mortem
When the event is resolved, its documentation is bundled for the post mortem. The post mortem reviews the event, its causes, and the standard operating procedures for dealing with the event. There are many means of conducting the review. Examples include a post mortem report written and circulated for concurrence, a meeting with the troubleshooters and impacted stakeholders, or a series of meetings with the troubleshooters and impacted stakeholders. This review results in lessons learned, which are archived for future reference. To help prevent the event from recurring, the standard operating procedures will be amended.
Document Change Log
Change Date / Version / CR # / Change Description / Author and Organization
07/13/04 / 1.0 / New process documentation / CSSS Process Unit
10/25/04 / 1.0 / Change OIS to BIS / CSSS Process Unit
11/10/05 / 2.0 / Revision of event notification form and organization name changes / CSSS Process Unit