Optimizing Distributed Database Design for Analytics Platform System

Summary: This reference architecture is intended to provide proven practices for designing and implementing a distributed data warehouse with the Microsoft Analytics Platform System. The APS appliance is designed using a shared-nothing, massively parallel processing architecture, capable of loading and processing very large data volumes in an efficient manner. The contents of this paper are derived from a series of real-world deployments of the APS appliance for Microsoft customers.

Writers: Michael Hlobil, Ryan Mich

Technical Reviewers: Charles Feddersen, Mary Long, Brian Mitchell

Published: June 2015

Applies to: Analytics Platform System

Copyright

This document is provided “as-is”. Information and views expressed in this document, including URL and other Internet Web site references, may change without notice. You bear the risk of using it.

Some examples depicted herein are provided for illustration only and are fictitious. No real association or connection is intended or should be inferred.

This document does not provide you with any legal rights to any intellectual property in any Microsoft product. You may copy and use this document for your internal, reference purposes.

© 2014 Microsoft. All rights reserved.

Contents

Introduction

Introduction to Analytics Platform System

Overview of Analytics Platform System (APS)

Shared Nothing Architecture of APS

APS Components

MPP Engine

Data Movement Service (DMS)

SQL Server Databases

Value Proposition of APS

Handled Scenarios

Greenfield DWH Design

Migration of an Existing DWH

Reference Architecture for a Distributed Database / Datamart Design

Overview of the Design Components

Database

APS Table Types

Clustered Indexes

Non-Clustered Indexes

Clustered Columnstore Index

Partitioning

Capabilities

Recommendations for Partitioning

Data Loading with Partition Switching

Statistics

Recommendations for Statistics

Conclusion

Introduction

In this whitepaper, we describe a reference architecture for the design of the Analytics Platform System (APS). There are important decisions that have to be made in the early stages of solution design, and this white paper provides guidance on those design decisions, potentially reducing the time to market of your data warehouse (DWH) solution.

The following two implementation scenarios are described in this paper:

  • Greenfield Data Warehouse implementation
    o Design and develop a logical data model and implement the resulting dimension models as tables in APS.
  • Migration of an existing Data Warehouse to APS
    o Redesign of an existing symmetric multi-processor (SMP) solution, such as SQL Server, to a distributed architecture for APS.

This paper describes a number of design components and considerations for a distributed data warehouse, and provides explanations on why certain design patterns are optimal for this scale-out platform.

Introduction to Analytics Platform System

Overview of Analytics Platform System (APS)

APS is a massively parallel processing (MPP) data warehousing and analytics platform appliance built for processing large volumes of relational and semi-structured data. It provides seamless integration to Hadoop and Azure Blob Storage via Polybase (an integration layer included in APS), which enables non-relational storage to interoperate with relational storage through a unified language, SQL, and query optimizer.

APS ships in an appliance form factor comprised of factory built and configured hardware with the APS software preinstalled. One benefit of this appliance model is the ability to quickly add incremental resources that enable near linear performance gains. This is accomplished by adding scale units to the appliance, which consist of servers and storage pre-tuned for optimal performance.

The APS appliance is designed to support two different regions. The mandatory region is the parallel data warehouse (PDW) region, which supports relational storage and is an MPP cluster of SQL Server instances tuned for data warehouse workloads. The optional region supports the Microsoft distribution of Hadoop, known as HDInsight. This whitepaper focuses exclusively on the design considerations for the MPP RDBMS, the PDW region.

APS is generally used as the central data store for a modern data warehouse architecture. It parallelizes and distributes the processing across multiple SMP (Symmetric-Multi-Processing) compute nodes, each running an instance of Microsoft SQL Server. SQL Server Parallel Data Warehouse is only available as part of Microsoft’s Analytics Platform System (APS) appliance.

The implementation of Massively Parallel Processing (MPP) in APS is the coordinated processing of a single task by multiple processors, each working on a different part of the task, with each processor using its own operating system (OS), memory, and disk. The nodes within the MPP appliance communicate with each other over a high-speed InfiniBand network.

Symmetric Multi-Processing (SMP/NUMA) is the primary architecture employed in servers. An SMP architecture is a tightly coupled multiprocessor system, where processors share a single copy of the operating system (OS) and resources that often include a common bus, memory, and an I/O system. A typical single server with multi-core processors, locally attached storage, and the Microsoft Windows OS is an example of an SMP server.

Shared Nothing Architecture of APS

APS follows the shared-nothing architecture; each processor has its own set of disks. Data in a table can be “distributed” across nodes, such that each node has a subset of the rows from the table in the database. Each node is then responsible for processing only the rows on its own disks.

In addition, every node maintains its own lock table and buffer pool, eliminating the need for complicated locking and software or hardware consistency mechanisms. Because a shared-nothing architecture does not typically suffer from severe bus or resource contention, it can be scaled massively.

Figure 1: Region PDW

APS Components

MPP Engine

The MPP Engine runs on the control node. It is the brain of the SQL Server Parallel Data Warehouse (PDW) and delivers the Massively Parallel Processing (MPP) capabilities. It generates the parallel query execution plan and coordinates the parallel query execution across the compute nodes. It also stores metadata and configuration data for the PDW region.

Data Movement Service (DMS)

The data movement service (DMS) moves data between compute nodes and between the compute nodes and the control node. It bridges the shared-nothing nodes whenever a query requires data that resides on more than one of them.

SQL Server Databases

Each compute node runs an instance of SQL Server to process queries and manage user data.

Value Proposition of APS

An APS appliance consists of one or more physical server racks and is designed to be moved into an existing customer datacenter as a complete unit.

Buying an APS appliance has the following advantages in the overall data warehouse life cycle:

Reduced project cost / time to market

  • Project duration is significantly minimized due to higher automation of processes in the appliance.
  • The appliance is preconfigured and ready for use.

Reduced operational cost

  • Tooling for monitoring, high availability, and failover is available out of the box.
  • The scalability of the appliance is managed by adding resources.

Reduced database administration efforts

  • For DBAs, APS provides a reduced and documented set of administrative tasks for keeping the solution at peak performance.

Handled Scenarios

The following scenarios are based on the definition of OLAP or a data warehouse (DWH):

  • A data warehouse is a copy of transaction data specifically structured for query and analysis.

The two most commonly discussed models when designing a DWH are the Kimball model and the Inmon model.

In Kimball’s dimensional design approach, the datamarts supporting reports and analysis are created first. The data sources and rules to populate this dimensional model are then identified. In contrast, Inmon’s approach defines a normalized data model first; the dimensional datamarts that support reports and analysis are then created from the data warehouse. In this white paper, the focus is on the Kimball dimensional model, as this layer exists in both design approaches and is where the most query performance can be gained.

Greenfield DWH Design

For a greenfield data warehouse project, a logical data model is developed for the different subject area data marts in accordance with the building blocks described by Kimball. The resulting dimensional model is implemented as tables in APS.

Migration of an Existing DWH

The migration of an existing DWH can be done in the following ways.

  • 1:1 Migration – The existing data model and schema are moved directly to the APS PDW region with little or no change to the model.
  • Redesign – The data model is redesigned and the application is re-architected following SQL Server PDW best practices. This can reduce overall complexity and make the data warehouse flexible enough to answer any question at any time.
  • Evolution – The evolutionary approach is an enhancement of the two approaches above, allowing you to take advantage of the capabilities of the new target platform.

In this white paper, we will describe approaches applicable for all migration scenarios.

Reference Architecture for a Distributed Database / Datamart Design

Overview of the Design Components

When optimizing a design for PDW, various components need to be considered. These include databases, tables, indexes, and partitions. We will describe design and runtime recommendations for each of these.

Database

When a PDW database is created, it is created on each compute node and on the control node of the appliance. The database on the control node is called a shell database because it holds only the metadata of the database and no user data. Data is stored only on the compute nodes. The appliance automatically places the database files on the appropriate disks for optimum performance, so a file specification and other file-related parameters are not needed. The only parameters specified when creating a database are:

REPLICATED_SIZE

The amount of storage in GB allocated for replicated tables. This is the amount of storage allocated on a per compute node basis. For example, if the REPLICATED_SIZE is 10 GB, 10 GB will be allocated on each compute node for replicated tables. If the appliance has eight compute nodes, a total of 80 GB will be allocated for replicated tables (8 * 10 = 80).

DISTRIBUTED_SIZE

The amount of data in GB allocated for distributed tables. This is the amount of data allocated over the entire appliance. Per compute node, the disk space allocated equals the DISTRIBUTED_SIZE divided by the number of compute nodes in the appliance. For example, if the DISTRIBUTED_SIZE is 800 GB and the appliance has eight compute nodes, each compute node has 100 GB allocated for distributed table storage (800 / 8 = 100).

LOG_SIZE

The amount of data in GB allocated for the transaction log. This is the amount of data allocated over the entire appliance. Like the DISTRIBUTED_SIZE parameter, the disk space allocated per compute node equals the LOG_SIZE divided by the number of compute nodes in the appliance.

AUTOGROW

An optional parameter that specifies if the REPLICATED_SIZE, DISTRIBUTED_SIZE, and LOG_SIZE is fixed or if the appliance will automatically increase the disk allocation for the database as needed until all of the physical disk space in the appliance is consumed. The default value for AUTOGROW is OFF.
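
These parameters are supplied in the WITH clause of the CREATE DATABASE statement. The following is a minimal sketch using the example figures above on an eight-node appliance; the database name and sizes are illustrative only:

  -- Illustrative only: pre-sized PDW database with autogrow disabled.
  -- REPLICATED_SIZE is allocated per compute node; DISTRIBUTED_SIZE and
  -- LOG_SIZE are allocated across the whole appliance.
  CREATE DATABASE SalesDW
  WITH
  (
      AUTOGROW = OFF,          -- keep the database pre-sized (the default)
      REPLICATED_SIZE = 10,    -- 10 GB per compute node for replicated tables
      DISTRIBUTED_SIZE = 800,  -- 800 GB across the appliance (100 GB per node on 8 nodes)
      LOG_SIZE = 100           -- 100 GB of transaction log across the appliance
  );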

Design Time Recommendations for Databases

Sizing

The size of the database should be based on the calculated sizes of the replicated and distributed tables. In addition, it is recommended that database growth be estimated for two years and included with the initial database size calculation. This is the baseline for the size of the database. Additional space is also needed for the following reasons:

  • To make a copy of the largest distributed table if needed (e.g. for testing different distribution columns)
  • To have space for at least a few extra partitions if you are using partitioning
  • To store a copy of a table during an index rebuild
  • To account for deleted rows in a clustered columnstore index (deleted rows are not physically deleted until the index is rebuilt)

These simple guidelines will provide sufficient insight into sizing the database.

Autogrow

It is recommended that the database is created large enough before loading data, and that autogrow is turned off. Frequent database resizing can impact data loading and query performance due to database fragmentation.

APS Table Types

SQL Server Parallel Data Warehouse (PDW) is a Massively Parallel Processing (MPP) appliance that follows the shared-nothing architecture. This means that the data is spread across the compute nodes in order to benefit from the storage and query processing of the MPP architecture (the divide and conquer paradigm). SQL Server PDW provides two options to define how the data is distributed: distributed and replicated tables.

Distributed Table

A distributed table is a table in which all rows have been spread across the SQL Server PDW compute nodes based upon a row hash function. Each row of the table is placed on a single distribution as assigned by a deterministic hash algorithm taking as input the value contained within the defined distribution column. The following diagram depicts how rows would typically be stored within a distributed table.

Figure 2: Distributed Table

Each SQL Server PDW compute node has eight distributions and each distribution is stored on its own set of disks, therefore ring-fencing the I/O resources. Distributed tables are what gives SQL Server PDW the ability to scale out the processing of a query across multiple compute nodes. Each distributed table has one column which is designated as the distribution column. This is the column that SQL Server PDW uses to assign a distributed table row to a distribution.
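
To illustrate, a distributed table is declared by naming the distribution column in the table's WITH clause. The table and column names below are hypothetical; the pattern is the PDW CREATE TABLE syntax for a hash-distributed table:

  -- Hypothetical fact table distributed on CustomerKey.
  -- Each row is hashed on CustomerKey and stored in exactly one
  -- distribution across the compute nodes.
  CREATE TABLE dbo.FactSales
  (
      SaleKey      BIGINT        NOT NULL,
      CustomerKey  INT           NOT NULL,
      DateKey      INT           NOT NULL,
      SalesAmount  DECIMAL(18,2) NOT NULL
  )
  WITH
  (
      DISTRIBUTION = HASH (CustomerKey)
  );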

Design Time Recommendations for Distributed Tables

There are performance considerations for the selection of a distribution column:

  • Data Skew
  • Distinctness
  • Types of queries run on the appliance

When selecting a distribution column, there are performance considerations, one of which is data skew. Skew occurs when the rows of a distributed table are not spread uniformly across the distributions. When a query relies on a skewed distributed table, the smaller distributions complete quickly, but the query must still wait for the larger distributions to finish. A parallel query therefore performs only as fast as its slowest distribution, so it is essential to avoid data skew. Where data skew is unavoidable, PDW can tolerate 20 - 30% skew between the distributions with minimal impact to the queries. A quick way to check an existing table for skew is shown in the sketch below; the sections that follow describe how to find the optimal column for distribution.
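
The following is a minimal check, assuming a distributed table named dbo.FactSales already exists (the name is hypothetical). DBCC PDW_SHOWSPACEUSED reports rows and space used per distribution, which makes uneven row counts easy to spot:

  -- Reports row count and space used for each distribution of the table.
  -- A large spread in row counts between distributions indicates data skew.
  DBCC PDW_SHOWSPACEUSED ("dbo.FactSales");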

Distribution Key

The first step in designing the data model is to identify the optimal distribution; otherwise, the benefits of the Massively Parallel Processing (MPP) architecture will not be realized. Therefore, always begin by designing the distributed tables. Designing the data model around the optimal distribution key minimizes data movement between compute nodes and supports the divide and conquer paradigm.

Rules for Identifying the Distribution Column

Selecting a good distribution column is an important aspect of maximizing the benefits of the Massively Parallel Processing (MPP) architecture. This is possibly the most important design choice you will make in SQL Server Parallel Data Warehouse (PDW). The principal criteria to consider when selecting a distribution column for a table are the following:

  • Access
  • Distribution
  • Volatility

The ideal distribution key is one that meets all three criteria. In reality, you are often faced with trading one off against another.

  • Select a distribution column that is also used within query join conditions –
    Consider a distribution column that is also one of the join conditions between two distributed tables within the queries being executed. This will improve query performance by removing the need to move data, making the query join execution distribution local and fulfilling the divide and conquer paradigm (see the sketch after this list).
  • Select a distribution column that is also aggregation compatible –
    Consider a distribution column that is also commonly used within the GROUP BY clause within the queries being executed. The order in which the GROUP BY statement is written doesn’t matter as long as the distribution column is used. This will improve query performance by removing the need to move data to the control node for a two-step aggregation. Instead, the GROUP BY operation will execute locally on each distribution, fulfilling the divide and conquer paradigm.
  • Select a distribution column that is frequently used in COUNT DISTINCT’s –
    Consider a distribution column that is commonly used within a COUNT DISTINCT function. This will improve query performance by removing the need to redistribute data via a SHUFFLE MOVE operation, making the query COUNT DISTINCT function execute distribution local.
  • Select a distribution column that provides an even data distribution –
    Consider a distribution column that can provide an even number of rows per distribution, therefore balancing the resource requirements. To achieve this, look for a distribution column that provides a large number of distinct values, i.e. at least 10 times the number of table distributions. It is also important to check if the selected column leads to skew. Data skew, as already mentioned above, occurs when the rows of a distributed table are not spread uniformly across all of the distributions.
  • Select a distribution column that rarely, if ever changes value –
    PDW will not allow you to change the value of a distribution column for a given row by using an UPDATE statement, because changing the value of a distribution column for a given row will almost certainly mean the row will be moved to a different distribution. It is therefore recommended that you select a distribution column which rarely requires the value to be modified.
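
As a sketch of the join and aggregation guidelines above, the two hypothetical tables below are distributed on the column they are joined on. Matching rows hash to the same distribution, so the join and the GROUP BY (which includes the distribution column) can be resolved locally on each distribution without a data movement operation; all names are illustrative only:

  -- Both tables are distributed on CustomerKey, the join column.
  CREATE TABLE dbo.DimCustomer
  (
      CustomerKey  INT           NOT NULL,
      CustomerName NVARCHAR(100) NOT NULL
  )
  WITH ( DISTRIBUTION = HASH (CustomerKey) );

  CREATE TABLE dbo.FactOrders
  (
      OrderKey     BIGINT        NOT NULL,
      CustomerKey  INT           NOT NULL,
      OrderAmount  DECIMAL(18,2) NOT NULL
  )
  WITH ( DISTRIBUTION = HASH (CustomerKey) );

  -- The join condition and the GROUP BY both use the distribution column,
  -- so each distribution can compute its share of the result locally.
  SELECT c.CustomerKey, c.CustomerName, SUM(f.OrderAmount) AS TotalAmount
  FROM dbo.FactOrders AS f
  JOIN dbo.DimCustomer AS c
      ON f.CustomerKey = c.CustomerKey
  GROUP BY c.CustomerKey, c.CustomerName;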

On some occasions, the choice of a distribution column may not be immediately obvious, and certain query patterns may not perform well when the distribution column is chosen using the general guidelines above. In these one-off scenarios, exercising some creativity in the physical database design can lead to performance gains. The following sections illustrate a few examples of these techniques that may not be readily apparent.