Scaling Up Your Data Warehouse with SQL Server 2008 R2

SQL Server Technical Article

Writers: Eric N. Hanson, Kevin Cox, Alejandro Hernandez Saenz, Boris Baryshnikov, Joachim Hammer, Roman Schindlauer, and Grant Dickinson of Microsoft Corporation. Gunter Zink of HP.

Technical Reviewer: Eric N. Hanson, Microsoft Corporation

Editor: Diana Steinmetz

Published: June, 2008; Updated: December, 2010

Applies To: SQL Server 2008, SQL Server 2008 R2

Summary:

SQL Server 2008 introduced many new functional and performance improvements for data warehousing, and SQL Server 2008 R2 includes all of these and more. This paper discusses how to use SQL Server 2008 R2 to get great performance as your data warehouse scales up. We present lessons learned during extensive internal data warehouse testing on a 64-core HP Integrity Superdome during the development of the SQL Server 2008 release, and via production experience with large-scale SQL Server customers. Our testing indicates that many customers can expect their performance to nearly double on the same hardware they are currently using, merely by upgrading to SQL Server 2008 R2 from SQL Server 2005 or earlier and compressing their fact tables. We cover techniques to improve manageability and performance at high scale, encompassing data loading (extract, transform, load), query processing, partitioning, index maintenance, indexed view (aggregate) management, and backup and restore.

Audience: Data Warehouse architects, developers, and DBAs.

Copyright

This is a preliminary document and may be changed substantially prior to final commercial release of the software described herein.

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

Unless otherwise noted, the example companies, organizations, products, domain names, e-mail addresses, logos, people, places and events depicted herein are fictitious, and no association with any real company, organization, product, domain name, email address, logo, person, place or event is intended or should be inferred.

© 2010 Microsoft Corporation. All rights reserved.

Microsoft, C#, SQL Server, Visual Basic, Windows, and Windows Server are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.

The names of actual companies and products mentioned herein may be the trademarks of their respective owners.

Filename: ScaleUpDWinSQL2008.docx

Table of Contents

Introduction

Benefits of Scale-Up Solutions for Data Warehousing

Reliability and Availability Reduce Unplanned Downtime

Best Practices for Building a Scale-Up Data Warehouse Solution

Database Design

Schema Design

Physical Database Design

Hardware Selection

Processors

Memory

Software Selection

SQL Server Settings

Query Design

Query Tuning

ETL

Pipeline Performance

Lookup Performance

ETL Features in the SQL Server Engine

Change Data Capture

Merge

Minimally Logged Insert

Periodic Maintenance

Index Maintenance

Statistics Maintenance

Backup

Internal Scale Testing During SQL Server 2008 Development

Workload

Hardware Configuration

Performance Results

Conclusion

References

Introduction

For high performance, simplicity of management, and reliability, scale-up data warehouse configurations based on Microsoft® SQL Server® 2008 R2 offer a very attractive solution for your enterprise data warehouse needs. Scale-up data warehouse solutions are built on a single shared-memory multiprocessor (SMP) computer system. This paper describes how you can effectively create, manage, and query your enterprise data warehouse with SQL Server 2008 R2 when using a large SMP.

The goals for a data warehouse solution are typically the following, although they may vary somewhat in different environments:

  • Support rapid response time for queries that summarize large volumes of data.
  • Support efficient drill-through queries that extract moderate volumes of detail data for further analysis.
  • Support rapid initial data loading.
  • Support efficient incremental loading (for example, load 24 hours of data in a one-hour window).
  • Support fast maintenance operations (backup, restore, loading, indexing, compression, bulk removal of old data).
  • Given that the above goals are met, limit total cost of ownership (TCO):
      • Minimize the database administrator (DBA) time required during setup and for normal operations.
      • Keep hardware costs reasonable.

Data warehouse designs are most commonly based on a star or snowflake schema, also known as a dimensional model. This approach has advantages for query performance and makes it easy for business users to understand the data. Some parts of this paper assume basic knowledge of star-schema-based data modeling. Ralph Kimball and Margy Ross provide an excellent reference on dimensional modeling[i]. Mundy et al. provide another good reference on applying dimensional modeling specifically with Microsoft products[ii].

Data warehouses are also sometimes based on normalized schemas. An example of this is a schema containing one-to-many relationships between two large tables, such as Orders and LineItems from an order-entry system. One reason people use this type of schema is that it sometimes mirrors the schema of the online transaction processing (OLTP) system that is the primary source of the data. This makes it easier to load the data warehouse simply by accumulating more history from the OLTP system, with few or no other transformations. If historical data needs to be updated, a normalized schema can make this easier. A common approach is to build star-schema-based subject area data marts for reporting from a normalized enterprise data warehouse used for integrating data and storing large volumes of fine-grained historical data. Whether you are using a star schema or a normalized schema for your data warehouse, the ideas covered in this white paper apply. Because the star schema approach is common and well-defined, we assume it is the approach being used unless stated otherwise.

Benefits of Scale-Up Solutions for Data Warehousing

For purposes of this discussion, we consider a “scale-up” system to be any data warehouse machine based on a single SMP with 4 or more processors. These include 4- and 8-processor SQL Server Fast Track Data Warehouse[iii] configurations, as well as larger SMP systems such as the HP Integrity Superdome. Scale-up-based solutions running SQL Server 2008 R2 can provide an excellent combination of rich programmability, broad application compatibility, high performance, scalability, management simplicity, high reliability, and reasonable total hardware purchase cost.

An alternative to scaling up is to scale out using a SQL Server Parallel Data Warehouse[iv] (PDW) configuration. PDW can support tremendous scalability, high-performance loading and updates, and fast query processing, all at lower hardware cost than the highest-end scale-up systems. However, the other SQL Server editions currently support a broader set of programming capabilities and language features than PDW V1, and integrate with a broader set of third-party applications. For those customers who want to use these features, and who otherwise want full compatibility with SQL Server, scaling up is attractive.

Reliability and Availability Reduce Unplanned Downtime

High-end scale-up systems typically offer very high reliability. For example, the HP Integrity Superdome offers added RAS (Reliability, Availability, and Serviceability) capabilities. The HP Integrity product line supports the ability to logically remove a single failed processor or memory module (DIMM) from the system automatically, until it can be replaced by a technician; the system can operate even with the failure of up to half of the processors. Similarly, automatic error detection, error correction, error isolation, and recovery from memory failures are supported. The memory systems of computers such as the HP Integrity Superdome offer sophisticated error correction, making them tolerant of transient memory faults and hard faults in individual memory components. High-end scale-up systems also support hot-add of memory: added memory becomes immediately visible to all the processors of the scale-up system, and it is not necessary to restart the server or SQL Server.

Scale-up-based systems are often used together with storage area networks (SANs), such as the HP XP24000. SANs also simplify some management operations compared to the direct-attached storage used in scale-out approaches. For example, SANs typically support operations such as:

  • Very fast copying of database files, which is useful as part of the extract, transform, load (ETL) process.
  • Block-level replication of entire volumes for disaster recovery.
  • Copy-on-write snapshot of a database from the production environment to use for many purposes.

These operations can make it much simpler to manage your data warehouse and they work well in a scale-up computing system.

Best Practices for Building a Scale-Up Data Warehouse Solution

A number of best practices will help you get the best performance and the simplest and most cost-effective system management using SQL Server 2008 R2 in a scale-up configuration. These best practices cover the following:

  • Database design
  • Schema design
  • Physical design
  • Hardware selection and configuration
  • Software selection and configuration
  • Data loading
  • Query specification and tuning

Some of these best practices are independent of the type of hardware being used. Others are more closely related to the use of a scale-up configuration. We focus here on the best practices that are most relevant to scale-up, although we touch on other best practices briefly as well. This paper focuses mainly on software techniques associated with SQL Server 2008 R2. For other best practices for using SQL Server on large SMPs from HP, with an emphasis on hardware capabilities and configuration, you may wish to consult one of several detailed references on the subject from HP[v].

Database Design

This section covers best practices for designing schemas and for the physical design of a database, such as disk layout, indexes, compression, and summary aggregates.

Schema Design

Database schema design (logical design) can have a major impact on the performance of queries and updates. For best performance, and for ease of understanding the data and writing queries, we recommend using a star schema (also known as dimensional modeling[vi]) if that is feasible in your data warehouse environment. SQL Server 2008 Enterprise Edition has special star join query optimizations that take advantage of this type of schema[vii]. If you prefer a normalized schema (typically third normal form[viii]), you can still get excellent query performance, but certain types of star join optimizations may not be possible for your queries.

The data types you choose for columns have a performance impact. Use integer (4-byte) surrogate key columns to join your fact table to your dimension tables in a star schema. These compress well and support fast comparisons during joins. For decimal measure values, use the money data type if possible, because operations on it are faster than for general numeric or decimal types. Avoid the use of long fields of any type in your fact table unless it is essential to the behavior of your application. This helps keep your fact table smaller and usually speeds up query processing and data loading.
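As a sketch of these data type recommendations, a fact table might be declared as follows (the table and column names here are illustrative, not from any particular sample database):

```sql
-- Hypothetical fact table: 4-byte integer surrogate keys joining to
-- dimension tables, and a money column for the decimal measure.
CREATE TABLE dbo.FactSales (
    DateKey      int   NOT NULL,  -- YYYYMMDD surrogate key into the date dimension
    CustomerKey  int   NOT NULL,  -- surrogate key into DimCustomer
    ProductKey   int   NOT NULL,  -- surrogate key into DimProduct
    Quantity     int   NOT NULL,
    SalesAmount  money NOT NULL   -- money: faster arithmetic than general decimal
);
```

Note the absence of wide character or variable-length columns; keeping the fact table narrow reduces I/O for scans, loads, and index maintenance.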

Physical Database Design

After you choose your logical database design, how you physically structure your data warehouse data has a major performance impact. Physical design considerations include partitioning, indexing, creation of summary aggregates, compression, layout of tables and partitions on storage devices, and configuration of tempdb. We highly recommend that you follow the guidelines for creating LUNs, files, filegroups, tables, indexes, and loading data, given in the most current Fast Track architecture guide[ix]. The guide goes into significantly more detail than we give here.

Table Partitioning

With the use of large SMPs for scale-up, it becomes feasible to manage and query very large tables, even tables with billions of rows. With tables of this size, it is important to be able to break them down into manageable-sized chunks for system management operations so that these operations complete in a reasonable time. The table and index partitioning feature in SQL Server Enterprise Edition fills this need. For a data warehouse fact table beyond 50 GB, it is recommended that you partition it by time, such as by week. This allows you to bulk purge a week of data at a time with a metadata-only operation. It can also improve the performance of loading, and of managing summary aggregates maintained using indexed views, as discussed later.

Note: SQL Server table partitioning is separate and distinct from the hardware server partitioning supported by the HP Integrity Superdome.
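A weekly partitioning scheme of the kind described above can be sketched as follows. The boundary values and object names are examples only; in practice you would create one boundary per week across your retention window:

```sql
-- Hypothetical weekly partitioning on an integer YYYYMMDD date key.
-- RANGE RIGHT: each boundary value starts a new partition (a new week).
CREATE PARTITION FUNCTION pfWeekly (int)
AS RANGE RIGHT FOR VALUES (20101004, 20101011, 20101018, 20101025);

-- Map all partitions to one filegroup for simplicity; production designs
-- often spread partitions across filegroups per the Fast Track guidance.
CREATE PARTITION SCHEME psWeekly
AS PARTITION pfWeekly ALL TO ([PRIMARY]);

-- Bulk-purging the oldest week is then a metadata-only operation, e.g.:
-- ALTER TABLE dbo.FactSales SWITCH PARTITION 1 TO dbo.FactSalesArchive;
```

Because SWITCH changes only metadata, purging a week completes in seconds regardless of how many rows the partition holds.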

Index Design

A basic index design that is often effective for a star schema in SQL Server is to:

  1. Create an integer date key in the format of YYYYMMDD as the surrogate key for both the date dimension and a foreign key in the fact table[1]. Create a clustered index on the date key column on the fact table. With the date key in this format, you can express date range filters conveniently and explicitly on the fact table. As we will discuss later, this in turn can give you better parallel query speed in some cases.
  2. Create nonclustered indexes on the fact table on the surrogate keys of the most frequently used dimensions (this is optional – see the Fast Track guide for more details).
  3. Create nonclustered indexes on large dimension tables (such as a Customer dimension with millions of rows) on the columns you search on frequently in those tables.

If you use partitioning to support a sliding-window scenario, partition the fact table and all its indexes on the date key. If your table partition width is one day, there is no need for the clustered index on the date key of the fact table in this situation. Instead, create a clustered index on the most frequently used dimension key column of the fact table besides the date key.
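The three steps above can be sketched with the following statements (all table, column, and index names are illustrative):

```sql
-- 1. Clustered index on the integer YYYYMMDD date key of the fact table,
--    so date range filters translate into efficient range scans.
CREATE CLUSTERED INDEX cix_FactSales_DateKey
    ON dbo.FactSales (DateKey);

-- 2. Optional nonclustered indexes on the most frequently used
--    dimension surrogate keys in the fact table.
CREATE NONCLUSTERED INDEX ix_FactSales_CustomerKey
    ON dbo.FactSales (CustomerKey);

-- 3. Nonclustered indexes on frequently searched columns of large
--    dimension tables, such as a multimillion-row customer dimension.
CREATE NONCLUSTERED INDEX ix_DimCustomer_Name
    ON dbo.DimCustomer (LastName, FirstName);
```

With the date key stored as YYYYMMDD, a filter such as WHERE DateKey BETWEEN 20100101 AND 20100131 is an explicit range predicate on the clustering key of the fact table.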

Design of Summary Aggregates

Experienced data warehouse developers have long known that one of the best ways to get good data warehouse query performance is to use summary aggregates. These summarize data in the database into sets of rows that are typically much smaller than the fact table. If most of your queries can be answered from these aggregates, which are much smaller than the raw event data, query performance can be improved by an order of magnitude. This in turn can make it possible to meet your users’ performance requirements on a smaller hardware platform, or provide better quality service to your users, or both.

The primary forms of summary aggregates you can use with SQL Server are:

  • SQL Server indexed views
  • User-defined summary tables

Consider using indexed views[x] or user-defined summary tables[xi] to accelerate your common aggregate queries. Indexed views can be used to maintain summary aggregates automatically. In general, the SQL Server 2008 R2 query processor can automatically use an indexed view to solve an aggregate query matching the structure of the view. If the SQL Server query processor is not able to use an indexed view to solve a query that actually could be solved using the indexed view, you can manually rewrite the query to reference the indexed view by name, and use the NOEXPAND hint on the reference to the indexed view in your query.

Indexed views are automatically maintained by SQL Server when you update your database. If you need to be able to incrementally update your database, such as to correct errors, and have the aggregates be updated automatically, this makes indexed views a good choice. A significant improvement introduced in SQL Server 2008 is partition-aligned indexed views[xii]. This feature enables you to switch partitions of tables in and out even if those tables have indexed views defined on them. This eliminates the need to drop and rebuild indexed views during your ETL process, and it can greatly speed up ETL, or make it feasible to use indexed views on partitioned tables when it was not feasible before.
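A minimal indexed-view sketch for a daily sales aggregate follows (object names are illustrative; note that indexed views require SCHEMABINDING and, when GROUP BY is used, a COUNT_BIG(*) column):

```sql
-- Define the aggregate view with schema binding.
CREATE VIEW dbo.vSalesByDay
WITH SCHEMABINDING
AS
SELECT DateKey,
       SUM(SalesAmount) AS TotalSales,
       COUNT_BIG(*)     AS RowCnt   -- required in an indexed view with GROUP BY
FROM dbo.FactSales
GROUP BY DateKey;
GO

-- Creating a unique clustered index materializes the view; SQL Server
-- now maintains the aggregate automatically as the fact table changes.
CREATE UNIQUE CLUSTERED INDEX cix_vSalesByDay
    ON dbo.vSalesByDay (DateKey);
GO

-- If the optimizer does not pick the view automatically, reference it
-- by name with the NOEXPAND hint, as described above.
SELECT DateKey, TotalSales
FROM dbo.vSalesByDay WITH (NOEXPAND)
WHERE DateKey BETWEEN 20100101 AND 20100131;
```

A query of this shape reads the small aggregate rather than scanning the fact table, which is where the order-of-magnitude speedups mentioned above come from.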