Implementing System Center Operations Manager 2007 at Microsoft

Technical White Paper

Published: March 2007

Contents

Executive Summary

Introduction

Challenges of Using Microsoft Operations Manager 2005 for Monitoring

Benefits of Using System Center Operations Manager 2007 for Monitoring

Microsoft IT Monitoring Solution

Planning the System Center Operations Manager 2007 Environment

Designing the System Center Operations Manager 2007 Infrastructure

Deploying System Center Operations Manager 2007

Best Practices

Conclusion

For More Information

Executive Summary

Microsoft Information Technology (Microsoft IT) is responsible for monitoring the health of nearly 11,000 servers that are running over 2,000 corporate applications. Until the development of Microsoft® System Center Operations Manager (OpsMgr) 2007, Microsoft IT used several individual installations of Microsoft Operations Manager (MOM) 2005 for monitoring. Using multiple installations of MOM 2005 sometimes resulted in duplication of effort and redundant data. This technical white paper describes how Microsoft IT created a centralized monitoring solution by deploying OpsMgr 2007 in its corporate network environment.

This paper assumes that readers are technical decision makers and IT professionals who are already familiar with the Microsoft Windows Server® operating system and MOM 2005. It describes the planning, deployment, and operations of OpsMgr 2007 at Microsoft. Though this paper provides guidance through best practices based on Microsoft IT's early-adopter experiences, it is not intended to serve as an instructional guide. Each environment is unique, and each organization should consider its situation and needs before planning to deploy any product or solution.

Note: For security reasons, the sample names of domains, internal resources, organizations, and internally developed security file names used in this paper do not represent real resource names used within Microsoft and are for illustration purposes only.

Introduction

The Microsoft server environment consists of roughly 7,950 production servers, 2,700 pre-production servers, and another 50 development, staging, and test servers. Microsoft IT is responsible for monitoring all of them and the more than 2,000 line-of-business applications that they host, many of which are critical to the operations of Microsoft. The previous solution, based on MOM 2005, was a collection of many autonomous management groups. Because MOM 2005 lacked granular rights delegation, several different internal support groups managed those groups separately. In addition, many application support teams and lab managers deployed their own MOM 2005 infrastructures to monitor their applications and lab server hardware. At one point, more than 160 active MOM 2005 management groups existed in the Microsoft IT environment.

To streamline operations and the infrastructure, Microsoft IT needed a centralized monitoring solution. With the deployment of OpsMgr 2007, Microsoft IT made a fundamental shift toward centralizing the way it provided monitoring services to organizational groups.

Challenges of Using Microsoft Operations Manager 2005 for Monitoring

Although the MOM 2005 implementation gave the individual teams substantial control and agility over their monitoring, it also presented several problems that Microsoft IT needed to consider during the design of the OpsMgr 2007 environment:

  • There was a significant duplication of effort to provide fault monitoring. Each management group required both hardware and people to maintain the infrastructure. Multiple management groups often monitored the same asset. This meant that one failure could engage multiple teams.
  • Even with all the autonomous management groups, gaps existed in the service offering. Most lab environments lacked hardware monitoring, and several applications were not being monitored.
  • There was no comprehensive, holistic view of the health of the Microsoft IT network environment. Network, hardware, Microsoft SQL Server™, and application monitoring existed in separate systems.
  • Many smaller groups depended on a single expert to manage their MOM 2005 implementation. If that person left, the team often lost its monitoring expertise.

Benefits of Using System Center Operations Manager 2007 for Monitoring

The OpsMgr 2007 deployment provided several benefits for Microsoft IT:

  • The introduction of role-based security, together with improvements to scalability and Run As accounts, removed the roadblocks that had previously kept Microsoft IT from using MOM 2005 as a centralized monitoring platform shared by many teams within Microsoft. The centralized design enabled Microsoft IT to eliminate monitoring fragmentation and to better utilize resources.
  • By using the Distributed Application Management Wizard, Microsoft IT has been able to rapidly develop management packs for custom applications.
  • Microsoft IT invested a substantial amount of time and energy into creating and tuning custom management packs for MOM 2005. Use of the management pack conversion tools in OpsMgr 2007 enabled Microsoft IT to retain and carry forward the work that the team had previously invested in its management packs.
  • OpsMgr 2007 management packs are based on Service Modeling Language (SML), which is a part of the Dynamic Systems Initiative (DSI). As a result, Microsoft IT has been able to use existing management packs and develop new ones based on health models. Health models in turn use classes to provide a more dynamic, relational representation of the state of a system, service, or application.
  • OpsMgr 2007 provides a more robust monitoring service offering due to improvements in product scale and support of clustering key infrastructure roles, as well as log shipping for disaster recovery of the Operational database.
  • OpsMgr 2007 significantly reduced agent deployment efforts by using Microsoft Systems Management Server (SMS) packages. Agents can discover their configuration automatically from the Active Directory® directory service.
  • The introduction of OpsMgr 2007 Audit Collection provided an efficient means of collecting and storing security event logs.
  • OpsMgr 2007 includes a full software development kit (SDK) that gives Microsoft IT a complete interface for automating administrative tasks.

Note: For more information about server roles in OpsMgr 2007, visit .

Microsoft IT Monitoring Solution

OpsMgr 2007 gave Microsoft IT an opportunity to reassess its service offering and to completely rebuild its monitoring infrastructure. Based on the needs of the organization and the existing customers of the monitoring solution, Microsoft IT developed a new offering, running on a new infrastructure, that provides parity with the earlier service that was based on MOM 2005.

The application has undergone significant improvements to many of the features introduced in previous versions of Operations Manager, and it offers a number of new features that improve its capabilities in areas such as security auditing, event log collection, and desktop monitoring.

To build and deploy the new monitoring solution running on OpsMgr 2007, Microsoft IT used the following phases:

1. Planning the OpsMgr 2007 environment

2. Designing the OpsMgr 2007 infrastructure

3. Deploying OpsMgr 2007

Planning the System Center Operations Manager 2007 Environment

Prior to any planning work, the engineers on the monitoring team within Microsoft IT held a series of meetings with representatives from their existing customer base, as well as a set of prospective monitoring customers, to understand what each group needed from the monitoring solution. The result of these meetings was a detailed list of requirements, ranging from functionality that absolutely must exist to functionality that was optional. From this list, Microsoft IT was able to begin planning both the deployment and the aspects of the service offering that would meet those requirements.

Informed by the requirements gathered, Microsoft IT approached the design of the OpsMgr 2007 environment with three main objectives:

  • Provide a complete monitoring service offering.
  • Provide a centralized and holistic monitoring platform.
  • Reduce resources devoted to monitoring.

Provide a Complete Monitoring Service Offering

With the MOM 2005 deployment, Microsoft IT had deployed monitoring agents only to production servers. Although this sufficed for the purposes of just monitoring the production aspects of the data centers, Microsoft IT's customers had become increasingly interested in expanding their monitoring to non-production systems. Because of that interest, Microsoft IT opted to expand the agent base with OpsMgr 2007 to all servers in Microsoft data centers. This expansion fulfilled customer requirements and served as the foundation for providing a complete monitoring service for the future.

Provide a Centralized and Holistic Monitoring Platform

With MOM 2005, a number of different groups within Microsoft opted to deploy their own management groups to monitor their specific servers or applications. Although this gave individual support groups a substantial amount of control and agility over their monitoring needs, it required more resources than were necessary for monitoring. More importantly, it prevented Microsoft IT from viewing the health of the entire environment in a comprehensive manner.

With OpsMgr 2007, Microsoft IT planned, from the beginning, to design an infrastructure that would be capable of the necessary scale while at the same time providing flexibility and control to each support group. An additional benefit to this design is that all management packs would be functioning together in the same management group, giving all users a holistic view of devices, applications, or services from hardware to software.

Reduce Resources Devoted to Monitoring

With the expansion of the scale and the comprehensiveness of the monitoring deployment, the Microsoft IT goal was to reduce the resources required to provide monitoring overall. The resource savings were made possible through hardware consolidation and centralized monitoring, which improved the productivity of the monitoring team.

The key focus of reducing hardware costs for Microsoft IT came in the form of consolidation. Early in the planning phases, through conversations with other teams within Microsoft, Microsoft IT determined that if it could provide an infrastructure capable of scaling appropriately and a service offering that was complete and flexible enough, the other teams running their own infrastructures would abandon their systems in favor of running their monitoring on a centrally managed platform.

In the area of reducing people resources, the focus was twofold:

  • Freeing people from maintaining their own monitoring solution
  • Using new OpsMgr 2007 features to reduce existing support costs

The first half of the objective is really a byproduct of the consolidation. As support teams merge their monitoring into Microsoft IT's centralized monitoring platform, the people on those teams who have traditionally focused on monitoring will be able to spend more of their time on their team's normal workload. Some examples of where Microsoft IT has met the second half of the objective are:

  • Coupled with using SMS for deploying agents, Microsoft IT uses Active Directory integration so that agents can automatically discover their monitoring configuration when the agent health service starts.
  • Microsoft IT is using an agent health remediation management pack to automate repairs on unhealthy agents.
  • Microsoft IT is using the OpsMgr 2007 SDK as an interface for other forms of custom-built automation.
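
The Active Directory integration described in the first example can be pictured with a small conceptual model. The sketch below is not the OpsMgr 2007 SDK or the actual Active Directory schema. The `AssignmentRule` type, the `discover_assignment` function, the name patterns, and the server and management group names are all hypothetical and serve only to illustrate an agent looking up its assignment from published rules when its health service starts.

```python
from fnmatch import fnmatch
from typing import List, NamedTuple, Optional


class AssignmentRule(NamedTuple):
    """Hypothetical stand-in for an assignment published to the directory."""
    name_pattern: str        # which computer names the rule applies to
    management_group: str    # management group the agent should join
    management_server: str   # management server the agent should report to


def discover_assignment(
    computer_name: str, published_rules: List[AssignmentRule]
) -> Optional[AssignmentRule]:
    """Return the first published rule that matches the computer name."""
    for rule in published_rules:
        # Lowercase both sides so matching is case-insensitive on every OS.
        if fnmatch(computer_name.lower(), rule.name_pattern.lower()):
            return rule
    return None  # no rule published for this computer


# Illustrative data only; these names do not come from the paper.
rules = [
    AssignmentRule("sql-*", "MG-Example-1", "ms01.example.test"),
    AssignmentRule("*", "MG-Example-2", "ms02.example.test"),
]

sql_assignment = discover_assignment("SQL-042", rules)
web_assignment = discover_assignment("web-7", rules)
```

In this model, an agent deployed through a software distribution package needs no per-server configuration: the package is identical everywhere, and the lookup at startup decides where the agent reports.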

Designing the System Center Operations Manager 2007 Infrastructure

As with previous Operations Manager deployments, Microsoft IT built two distinct infrastructures, one for the pre-production environment and another for the production environment. The following sections describe the server roles, the details about each environment, the hardware specification that Microsoft IT developed for each server role, and the measures that Microsoft IT has implemented for redundancy and disaster recovery.

Server Roles in System Center Operations Manager 2007

The various OpsMgr 2007 server roles are:

  • The Operational database server. The Operational database is a Structured Query Language (SQL) database hosted on a Microsoft SQL Server 2005 Service Pack 1 (SP1) server. This database houses all of the configuration data for a management group and contains all data generated by the systems in the management group. The database must be online at all times for the management group to function.
  • The root management server. The root management server is a new server role in the OpsMgr 2007 infrastructure. The root management server is responsible for hosting the SDK and configuration services and is the central coordinator of health monitoring. The SDK service is the single point in a management group that responds to SDK requests. This includes all operations consoles and connectors. The Config service is the source of authority for all configurations in a management group. All OpsMgr 2007 servers and agents ultimately get their configuration from the Config service.
  • Management servers. The management server role has not changed significantly from MOM 2005. These servers are responsible for management of agents, which includes sending configuration data to the agents, and receiving operational data from the agents and forwarding that data to the Operational database. In some instances, the management servers are also responsible for writing data to a data warehouse.
  • Gateway servers. The gateway server is a new role in OpsMgr 2007 that functions as a proxy between one or more management servers and a set of agents. In the presence of a firewall or the lack of a two-way trust between a set of agents and one or more management servers, a gateway server can be deployed to act as an intermediary.
  • Audit collection servers. OpsMgr 2007 introduces the Audit Collection feature set. These features focus specifically on collecting security events from monitored computers and storing that data in a central repository for alerting and forensics purposes. Three distinct roles are related to OpsMgr 2007 Audit Collection: the forwarder, the collector, and the database. The forwarder, a component of the OpsMgr 2007 agent, provides real-time monitoring of the security log on managed computers. The forwarder encrypts and transmits security events to the collector, a component of the OpsMgr 2007 management server. The collector makes security events available to Windows Management Instrumentation (WMI) subscribers in real time and maintains that data in the audit collection database.
  • Data warehouse and reporting server. The data warehouse and reporting server hosts a database that the reporting component of OpsMgr 2007 uses.
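
The forwarder, collector, and database roles of Audit Collection form a simple pipeline, which the following sketch models in a few lines. It is a conceptual illustration only, not product code: the class names, the subscriber callback mechanism, and the sample event values are assumptions made for this example, and the encryption that the real forwarder applies in transit is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class SecurityEvent:
    computer: str
    event_id: int
    message: str


@dataclass
class AuditDatabase:
    """Stands in for the audit collection database."""
    rows: List[SecurityEvent] = field(default_factory=list)

    def store(self, event: SecurityEvent) -> None:
        self.rows.append(event)


@dataclass
class Collector:
    """Receives events, notifies real-time subscribers, then persists them."""
    database: AuditDatabase
    subscribers: List[Callable[[SecurityEvent], None]] = field(default_factory=list)

    def receive(self, event: SecurityEvent) -> None:
        for notify in self.subscribers:
            notify(event)             # real-time delivery to subscribers
        self.database.store(event)    # long-term storage for forensics


@dataclass
class Forwarder:
    """Runs on a managed computer and sends security events to a collector."""
    computer: str
    collector: Collector

    def forward(self, event_id: int, message: str) -> None:
        # The real forwarder encrypts events in transit; omitted in this model.
        self.collector.receive(SecurityEvent(self.computer, event_id, message))


database = AuditDatabase()
seen_in_real_time: List[SecurityEvent] = []
collector = Collector(database, subscribers=[seen_in_real_time.append])

# Illustrative event values, not taken from the paper.
Forwarder("server01", collector).forward(529, "example logon failure")
```

The point of the model is the ordering: each event reaches live subscribers and the central store through the same collector, so alerting and forensics work from identical data.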

Pre-production Environment

The purpose of Microsoft IT's pre-production environment is to act as a scaled-down version of the production environment that is as complete as possible. Microsoft IT uses the pre-production environment primarily for evaluation purposes. Evaluations range from working with the Operations Manager product group for testing new builds or features, to stressing product scale, to staging a proposed change prior to its official release into production. Figure 1 shows the design of a pre-production management group for Microsoft IT.

Figure 1. Management group in pre-production environment

The key aspects of the pre-production deployment are:

  • The deployment consists of two management groups to enable Microsoft IT to set up a tiered infrastructure.
  • The mid-tier management group manages all agents (approximately 3,500 agents at the time of this writing).
  • The OpsMgr 2007 data warehouse and reporting are installed in the mid-tier management group.
  • The agent base spans multiple Active Directory domains and forests.
  • The majority of the agents managed in pre-production are production servers within Microsoft data centers.
  • Agents are multi-homed to this management group and the production management group to enable them to send monitoring data to multiple locations.
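
Multi-homing, as described in the last item above, means that one agent reports the same monitoring data to more than one management group. The sketch below is a minimal conceptual model of that behavior. The `MultiHomedAgent` class, the `inbox` dictionary that stands in for each management group's incoming data, and the sample names are all hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, List, Tuple

# Maps a management group name to the (agent, data item) pairs it has received.
Inbox = DefaultDict[str, List[Tuple[str, str]]]


class MultiHomedAgent:
    """An agent that belongs to several management groups at once."""

    def __init__(self, name: str, management_groups: Iterable[str]) -> None:
        self.name = name
        self.management_groups = list(management_groups)

    def report(self, item: str, inbox: Inbox) -> None:
        # Every management group the agent is homed to gets its own copy.
        for group in self.management_groups:
            inbox[group].append((self.name, item))


inbox: Inbox = defaultdict(list)
agent = MultiHomedAgent("server01", ["pre-production", "production"])
agent.report("example performance sample", inbox)
```

Because each group receives its own copy, a change staged in pre-production can be evaluated against real server data without affecting what the production management group collects.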

Production Environment

As the name implies, the production environment provides live monitoring of Microsoft data-center servers. As shown in Figure 2, the Microsoft IT production environment consists of three management groups. Two of the management groups (shown on the left side of the figure) exist in the corporate intranet. The third management group (shown on the right side of the figure) exists to provide monitoring for systems in networks and domains on the perimeter network (also known as DMZ, demilitarized zone, and screened subnet).

Figure 2. System Center Operations Manager 2007 production environment

The key design aspects of the production environment are as follows:

  • The environment consists of three management groups to allow for scalability and to span across diverse network and domain spaces.
  • The management groups have been tiered so that all users can view alert data from a single console. The first management group is connected to the other two management groups to form these tiered relationships.
  • Each management group manages a similar number of agents (totaling approximately 11,000 agents at the time of this writing).
  • Microsoft IT has deployed a single data warehouse, which receives operational data from all three management groups.
  • The agent base spans multiple Active Directory domains and forests.
  • Microsoft IT is using a three-node cluster in each management group to host the Operational database and root management server roles. One node is dedicated to hosting the Operational database, and one node is dedicated to hosting the root management server. The third node is capable of having either resource failed over to it.
  • Microsoft IT has deployed a failover database server into each management group. This SQL Server–based server is located remotely from the rest of the management group and contains the up-to-date contents of the Operational database shipped to it via log shipping.
  • Each management group contains at least two management servers dedicated to agent management.
  • In the third management group, gateway servers were deployed into remote network locations to manage agents in those spaces. Because the gateway servers communicate back to management servers in separate networks and separate forests, certificate-based authentication is used between the management servers and the gateway servers.
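
The three-node cluster arrangement in the list above (two active nodes, each dedicated to one role, and a third node that can accept either role) can be illustrated with a short model. This is a sketch of the placement logic only, under the assumption that a failed node's role moves to whichever node is idle. The `fail_over` function and the node names are hypothetical and do not represent the Windows clustering API.

```python
from typing import Dict, Optional

# Maps a node name to the role it hosts, or None if the node is idle.
Placement = Dict[str, Optional[str]]


def fail_over(placement: Placement, failed_node: str) -> Placement:
    """Return the placement after failed_node goes offline.

    The role hosted on the failed node moves to an idle node. If no idle
    node remains, the role cannot be placed and an error is raised.
    """
    if failed_node not in placement:
        raise KeyError(f"unknown node: {failed_node}")

    # Copy the surviving nodes so the caller's placement is left untouched.
    survivors: Placement = {
        node: role for node, role in placement.items() if node != failed_node
    }
    orphaned_role = placement[failed_node]
    if orphaned_role is None:
        return survivors  # an idle node failed; nothing needs to move

    for node, role in survivors.items():
        if role is None:
            survivors[node] = orphaned_role
            return survivors
    raise RuntimeError(f"no idle node available for {orphaned_role!r}")


initial: Placement = {
    "node1": "Operational database",
    "node2": "root management server",
    "node3": None,  # standby node that can accept either role
}
after_first_failure = fail_over(initial, "node1")
```

In this model the design tolerates the loss of any single node. A second failure before the first node is repaired leaves one role without a host, which is one reason a separate, remotely located failover database server fed by log shipping is a useful complement.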

Hardware Specification for Infrastructure Servers

Table 1 lists the platform specifications that Microsoft IT is using for each server role type. To determine hardware specifications, Microsoft IT started by studying the resource utilization of the MOM 2005 servers in the production environment. Microsoft IT used the CPU, memory, and disk utilization statistics as a baseline and compared that baseline to the current hardware profiles that were available. After purchasing the appropriate hardware, Microsoft IT verified performance during beta testing. Microsoft IT's experiences required minor revisions to some disk configurations, but Microsoft IT discovered that, for the most part, the hardware originally procured was more than sufficient for OpsMgr 2007.