Microsoft BackOffice Server 4.5 Reliability Checklist

White Paper

Abstract

This paper provides a high-level checklist for tuning the reliability of the Microsoft® BackOffice® Server 4.5 suite. It is not a tutorial on the overall operation of BackOffice Server 4.5, nor does it go into elaborate detail on the individual reliability-tuning variables for each BackOffice Server component. After reading this paper, however, you should be better prepared to tackle BackOffice Server reliability tuning for your organization.

Published: December 1999

Contents

Introduction

Understand Causes of Failures

Devise Measurements to Monitor Reliability

Deploy Reliable Hardware and Drivers

Deploy Reliable Networks

Establish Operations Procedures

Prevent Power Loss

Install Fault-Tolerant Storage

Use Windows NT Server Reliability Features

Consider Cluster Servers

Consider Offsite Storage

Continually Monitor and Tune Performance

Microsoft WinRAS Project

Appendix A: SQL Server 7.0 and Exchange Server 5.5 Reliability

SQL Server 7.0

SQL Server 7.0 and Exchange Server 5.5 on the Same Physical Server

Files and Filegroups in SQL Server 7.0

Growing Data Files with Autogrow

Managing the Transaction Log

General Backup Recommendations

Exchange Server 5.5

Restoring Data

Preparing the Recovery Machine: Hot Spares

Appendix B: Windows NT Server Configurations

Services

Event Logging

Ease of Use

Page File

Time Sync

Crash Dump/Debug

Kernel Mode

User Mode

Service Pack

Hot Fixes

Symbol Subset

First-Tier Domain Controllers

SYN-ATTACK Fixes

Other

Planning: Pitfalls to Avoid

Appendix C: RAID Levels

Appendix D: Drive Layout

Drive Space Categories

Operating System

Swap Space

Applications

Data

Log Files

Logical Drive Partitioning

Disk Drive Layout

For more information see

Introduction

Every element of the environment can potentially affect the reliability of your Microsoft® BackOffice® Server 4.5 system. BackOffice Server and Windows NT® Server include tools that encourage reliability. However, there is no substitute for smart planning, careful deployment, and diligent attention to your environment, its components, and their complex relationships.

This paper provides steps for achieving maximum reliability:

Understand the various causes of failures.
Devise measurements to track your system's reliability and detect memory leaks.
Deploy reliable and certified hardware and software platforms.
Deploy reliable and tested networks.
Establish operations procedures that support good system reliability.
Safeguard against power loss.
Install fault-tolerant storage.
Use Windows NT Server reliability features, including those in Windows NT Server 4.0 Service Pack 4 or higher.
Consider setting up cluster servers.
Stay up-to-date on security issues.
Continually monitor and tune performance.

Understand Causes of Failures

A number of factors can cause your BackOffice Server reliability to drop below 100 percent. In a recent survey by Cahners Instat Group, Information Services (IS) managers and executives identified the following root causes of downtime: software failure (including both operating system and application failures), hardware failure, operator or procedural error, and environmental failures. Causes cited by Cahners Instat Group can be further broken down to include:

Storage failures
Processor or memory failures
Other hardware failures
Network outages or hacker attacks
Software failures (applications, device drivers, memory leaks)
Improper configuration and planning
Environmental hazards
Operator error during usage

Devise Measurements to Monitor Reliability

Microsoft recommends using some form of automated procedure to monitor the availability and reliability of servers. Automated monitoring enables identification of failure conditions and potential problems. Monitoring can also help reduce the time required to recover from a failure.

Monitoring is important because operators must know a failure has occurred before they can begin restoring service.

Additionally, some customers monitor performance characteristics of their servers to spot usage trends. This allows customers to identify the conditions that contribute to system failure and take action to prevent those conditions from occurring. Monitoring strategies implemented by each customer vary depending on the type of service the customer provides.
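As an illustration of spotting a usage trend, the following minimal sketch fits a straight line to periodic memory readings and flags steady growth, one common sign of a memory leak. The function names, the sample data, and the 1 MB-per-hour threshold are hypothetical and are not part of BackOffice Server or Windows NT Server. In practice the readings would come from whatever monitoring tool you already use.

```python
from statistics import mean


def leak_slope(samples):
    """Return the least-squares slope of (hours, megabytes) samples, in MB per hour."""
    xs = [hours for hours, _ in samples]
    ys = [megabytes for _, megabytes in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    denom = sum((x - x_bar) ** 2 for x in xs)
    if denom == 0:
        raise ValueError("need samples taken at two or more distinct times")
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / denom


def looks_like_leak(samples, threshold_mb_per_hour=1.0):
    """Flag a process whose memory use grows faster than the (assumed) threshold."""
    return leak_slope(samples) > threshold_mb_per_hour


# Hypothetical readings: (hours since start, process memory in MB).
steady = [(0, 120), (1, 122), (2, 119), (3, 121)]
growing = [(0, 120), (1, 126), (2, 131), (3, 138)]

print(looks_like_leak(steady))   # fluctuates around a constant level
print(looks_like_leak(growing))  # climbs by several MB every hour
```

A fitted slope is less sensitive to a single noisy reading than a comparison of the first and last samples, which is why it is used here. A suitable threshold depends on the workload.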

For case studies that highlight customers that have successfully deployed a reliability monitoring solution, see the Microsoft Windows NT Product Group High Availability Deployment Guide at

Deploy Reliable Hardware and Drivers

Reliable hardware platforms help ensure the reliability of BackOffice Server. Platforms have different characteristics; you must choose a platform that works best in your environment and with the operating system, applications, and network you plan to use.

Use hardware listed in the Microsoft Hardware Compatibility List (HCL); all these components have been thoroughly tested for their real-world reliability running Windows NT Server and BackOffice Server. Using untested hardware or neglecting quality assurance procedures for hardware deployments may compromise reliability. To consult the HCL, go to
Hardware failures occur most often in mechanical parts such as fans, disks, or removable storage media. Failure in one component can induce failure in another. Make sure you provide the airflow and cooling that moving parts require.
Choose third-party device drivers that pass the Microsoft Windows® Hardware Quality Labs (WHQL) tests, and are thus certified to perform reliably in a Windows NT Server-based environment.

Deploy Reliable Networks

In distributed systems, the performance and reliability of the underlying network contribute significantly to the performance and reliability of the entire system.

Networks have multiple layers, each of whose topology and design can affect reliability. Too often, businesses view the network as a monolithic entity and fail to examine each layer’s contribution to the whole, the coupling between system and network, or the interface between applications and the operating system.

Installing multiple network interface cards (NICs) can enhance the reliability of critical network servers. A host configured with more than one network interface in this way is called a multihomed host. Using dual NICs in each server to separate network segments can increase reliability and performance. However, you must be aware of several issues to configure a multihomed host properly for Windows NT Server. For more information, consult the knowledge base article entitled "Multihomed Issues with Windows NT" (article ID Q181774), available from

Establish Operations Procedures

Operations procedures that work in an informal personal computer (PC) environment do not necessarily work with Windows NT Server-based systems. The Windows NT Resource Kit documents many features of Windows NT that are essential for a reliability-conscious Information Technology (IT) manager to understand. For example, an IT manager should understand how changes to the system configuration change the registry. View the resource kit through the MSDN library at

The following important operations procedures will optimize BackOffice Server system availability:

Avoid unnecessary changes to the configuration and environment.
Follow a formal capacity-planning process to avoid expensive errors such as deploying the wrong technology at the wrong time or in the wrong way.
Avoid having any single point of failure.
Adhere to the Hardware Compatibility List (see Deploy Reliable Hardware and Drivers).
Test before going operational.
Install a backup Windows NT directory and back up the software on each production server to allow for quick file repair in the event of damage. Perform backups regularly and completely, and test them.
Run the RDISK utility each time you make hardware changes or make changes to the server's disk configuration.
Reconstruct the emergency repair disks when the system hardware configuration changes, particularly when new hardware is added.
Deploy proper data protection mechanisms, especially Redundant Array of Inexpensive Disks (RAID) systems.
Restrict logical and physical access to servers.
Do not leave diskettes or compact discs in drives, or else disable those drives. Stray media can cause an automatic reboot to fail during a recovery operation.
Monitor the system event log regularly to detect failures and potential failures of systems.
Keep an operations manual for every mission-critical environment.
Analyze those failures that do occur.
Understand the risks and benefits of system upgrades and service packs.
Deploy a mirrored test server first.

In addition, make sure you understand and use the Microsoft Logo Program. The Microsoft Windows 95/Windows 98 and Windows NT Logo Program and the Designed for Microsoft BackOffice Logo Program are the two most important system software compatibility programs.

A Designed for Windows NT Logo means that a particular product has been independently tested, that it conforms to widely accepted usability criteria, and that it has minimal conflicts with other applications and devices when deployed on a Windows NT-based system.

The Designed for BackOffice Logo Program has more demanding test requirements, focusing exclusively on customer requirements for robustness and integration of the product in question with the BackOffice family and Windows NT.

Prevent Power Loss

A disaster-recovery study by Contingency Planning Research found that power loss caused 27 percent of data-center disasters in which there was actual data loss in addition to loss of service. This figure includes power outages due to environmental disasters such as snowstorms, tornadoes, and hurricanes. To minimize these disasters, invest in an uninterruptible power supply (UPS). Windows NT Server includes built-in support for UPS.

Install Fault-Tolerant Storage

RAID technology minimizes data loss due to problems accessing a hard disk. RAID is a fault-tolerant disk configuration in which part of the physical storage capacity contains redundant information about the data stored on the disks. The redundant information enables regeneration of the data if one of the disks or the access path to it fails, or if a sector on the disk cannot be read. (See Appendix C for more information on RAID Levels.)

Install RAID Level 1 for the operating system and log files.
Install RAID Level 5 or RAID 0+1 for data.
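The regeneration described above can be illustrated with the XOR parity scheme that RAID Level 5 uses. The sketch below is a simplification under stated assumptions: it treats each disk as a single short block, keeps parity in one place rather than rotating it across disks, and uses hypothetical function names and sample data.

```python
from functools import reduce


def parity(blocks):
    """XOR equal-length blocks byte by byte to produce a parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(surviving_blocks, parity_block):
    """Regenerate the one missing block from the survivors and the parity block."""
    return parity(surviving_blocks + [parity_block])


# Hypothetical contents of three data disks (one block each).
data = [b"ACCT", b"LOGS", b"MAIL"]
parity_block = parity(data)

# Simulate losing the second disk, then regenerate its contents.
lost = 1
survivors = data[:lost] + data[lost + 1:]
recovered = rebuild(survivors, parity_block)
print(recovered == data[lost])
```

Because XOR is its own inverse, combining the surviving blocks with the parity block cancels everything except the missing block. The same property means a second simultaneous disk failure cannot be recovered from this parity alone.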

Use Windows NT Server Reliability Features

Windows NT Server offers a variety of operating system technologies that can enhance the reliability of your entire system, including BackOffice Server. These technologies include the following:

Uninterruptible power: The UPS system software component in Windows NT Server can be configured to detect and warn of impending power failure. The built-in UPS functionality in Windows NT takes advantage of the features that many UPS systems provide, ensuring the integrity of data on the system and allowing the computer system to be shut down in a controlled manner if a power failure outlasts UPS batteries.
Fault-tolerant storage: See information in the previous section.
Recoverable file systems: The Windows NT File System (NTFS) uses intelligent caching and allows recovery of metadata, such as file size, after a disk failure. However, the actual contents of the file may be unrecoverable.
Distributed file system: Microsoft Distributed File System (DFS) for Windows NT Server is a network server component that makes it easier to find and manage data on a network, improves data availability, and enables load balancing.
Reliable system services: Windows NT Server 4.0 and Windows NT Server 4.0, Enterprise Edition, support Microsoft Transaction Server and Microsoft Message Queue Server. Applications can use these reliable system services to deliver high availability to end users. Microsoft Transaction Server is a robust run-time environment for deploying high-performance, online transaction-processing applications. Microsoft Message Queue Server provides a high-performance, asynchronous messaging infrastructure.

Consider Cluster Servers

A cluster is a group of two or more computers that can operate as a single system. Although composed of multiple nodes, clusters appear to clients as a single unit.

Clusters can extend system availability through redundancy. A service running on a failed node can be restarted on the remaining node with minimal or no down time. Windows NT Server 4.0, Enterprise Edition, supports two-node clustering. For more information, visit
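To give a rough sense of how redundancy extends availability, the sketch below applies the standard formula for components in parallel. It is an idealized estimate, not a measured result: it assumes that nodes fail independently and that failover is instantaneous, neither of which holds exactly in a real cluster, and the function name and the 99 percent figure are illustrative.

```python
def combined_availability(node_availability, nodes=2):
    """Availability of a cluster that is up whenever at least one node is up.

    Assumes independent node failures and instantaneous failover.
    """
    return 1 - (1 - node_availability) ** nodes


single = 0.99  # hypothetical availability of one server
print(f"one node:  {single:.4%}")
print(f"two nodes: {combined_availability(single):.4%}")
```

Shared components such as power, storage, or the network can fail both nodes at once, so actual cluster availability is lower than this estimate suggests.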

Consider Offsite Storage

Several third parties offer offsite storage solutions. For a list of such providers, refer to the Microsoft BackOffice Server 4.5 Solutions Guide (

Subscribe to the Microsoft Security Notification Service and receive security bulletins by e-mail (

Continually Monitor and Tune Performance

After deployment of any environment or component within that environment, administrators should continue to evaluate infrastructure requirements and the appropriateness of their solution.

For most business-critical applications, planning and deploying reliable systems makes good economic sense — although doing so carries some associated costs. The following table summarizes recommendations and best practices for reliable and highly available deployments:

Table 2: Best Practices Summary

Practice / Good (99%) / Better (99.9%) / Best (99.99%)
Hardware Selection / Use only Windows NT HCL-certified hardware. / Use only BackOffice Logo hardware with high availability features. / Use only Microsoft Cluster Server validated configurations.
Software Selection / Use only Designed for Windows NT Logo software. / Use only Designed for Microsoft BackOffice Logo software. / Use Designed for Microsoft BackOffice Logo software and Microsoft Cluster Server-aware applications.
Storage Solution / Use RAID 0+1 for data disks, RAID 1 for log disks. / Use hardware solutions for enhanced performance and recoverability of data disks (e.g., use RAID 5). / In addition, use multiple disk controllers with redundant data paths.
Random Access Memory / Use only error-detecting memory with parity. / Use only memory with error-correcting code (ECC) for enhanced memory error detection and correction. / In addition, use only systems that support ECC memory for L2 cache segments.
Configuration Management / Make fewer than three changes per month. / Make fewer than three changes per quarter. / Make fewer than three changes per year.
Notification / Monitor system logs regularly (weekly, for example) or when problems occur. / Monitor all system logs daily. / Monitor all system logs at least daily, and institute procedures for automatic notification when warnings or error conditions occur.
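The availability targets in the table translate into a yearly downtime budget. The short calculation below is a sketch that assumes a 365-day year and counts all downtime, planned or unplanned, against the target. The function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def downtime_hours_per_year(availability_percent):
    """Hours per year a system may be down and still meet the availability target."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for label, percent in [("Good", 99.0), ("Better", 99.9), ("Best", 99.99)]:
    hours = downtime_hours_per_year(percent)
    print(f"{label:6} {percent}% -> {hours:.2f} hours/year")
```

Under these assumptions the three levels allow about 87.6 hours, 8.76 hours, and roughly 53 minutes of downtime per year, respectively.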

Microsoft WinRAS Project

Microsoft has initiated the Windows Reliability, Availability, Serviceability (WinRAS) Project for Windows NT Server- and BackOffice Server-based environments. The objectives of the Microsoft team working on this initiative are to:

Improve platform reliability.
Improve clustering and other distributed solutions for high availability.
Improve ease of service and maintenance through collaboration with partners.

The WinRAS Project offers immediate customer benefits through Windows NT Service Pack 4, which contains 1,200 bug fixes and a number of other content and quality improvements such as Year-2000 compliance, Internet security enhancements, a Security Configuration Editor, and new hardware support.

In the future, the WinRAS Project will deliver significant benefits via Windows 2000, including the following:

Eliminates 75 operating system reboot scenarios as well as some application reboot scenarios.
Offers numerous driver improvements, such as a driver verifier, to overcome a major source of reliability problems.
Offers "safe mode boot" (which loads minimal drivers) and command-line boot options.
Adds kernel-mode write protection to prevent bugs from overwriting read-only kernel code and data.
Adds kernel-only crash dumps, which provide faster dumps on systems with large amounts of random access memory (RAM).
Enhances the file system (for example, Windows 2000 Server includes user disk quotas for NTFS and the ability to dynamically grow NTFS partitions).
Provides a job object, which is another way to protect against application memory leaks.
Offers faster Chkdsk than Windows NT 4.0 Service Pack 4.

Appendix A: SQL Server 7.0 and Exchange Server 5.5 Reliability

Microsoft SQL Server™ and Microsoft Exchange Server are so essential to the server environment that improving their reliability will improve the reliability of the system.

SQL Server 7.0

Microsoft SQL Server 7.0 plays an important role in your BackOffice Server reliability. It affects core database operations by reducing the configuration and tuning required to implement and run database applications. SQL Server 7.0 minimizes the requirement for database expertise by providing features such as on-demand memory, on-demand disk, and dynamic tuning of configuration parameters. Many users can now successfully implement a database application without knowledge of the internal architecture of the database system. (A small percentage of high-end applications require more detailed knowledge.)

SQL Server 7.0 and Exchange Server 5.5 on the Same Physical Server

If you run both SQL Server 7.0 and Exchange Server 5.5 Service Pack 2 on a computer running BackOffice Server 4.5, you must increase the minimum dynamic memory setting for SQL Server 7.0 from the default of zero to a value of at least 32 megabytes (MB).

It may be necessary to set the minimum memory for SQL Server higher than 32 MB to support SQL Server's processing load. This setting determines the memory used by SQL Server when Exchange Server is running and under load. In this environment, the maximum dynamic memory setting for SQL Server will not be reached. SQL Server and Exchange Server administrators should determine an amount of memory to be allocated to SQL Server that will optimize the overall performance of both applications and then set the SQL Server minimum memory option to the required value. If the SQL Server database is supporting a third-party application, you may have to consult the application's documentation or vendor to find out how much memory SQL Server requires to support the application processing load.