TdBench Support for TPC-DS and TPC-H

TdBench Support for TPC-DS and TPC-H Benchmarks

TdBench supports generating data and defining tables for TPC-DS and TPC-H standard benchmarks. It also generates a set of starter scripts for testing the queries.

Option 4 of the tdb.sh script brings up the following menu:

TPC-DS is the de-facto industry standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems. It includes the following tables:

call_center / customer_address / income_band / ship_mode / warehouse
catalog_page / customer_demographics / inventory / store / web_page
catalog_returns / date_dim / item / store_returns / web_returns
catalog_sales / dbgen_version / promotion / store_sales / web_sales
customer / household_demographics / reason / time_dim / web_site

The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries. It includes the following tables:

CUSTOMER / PART
LINEITEM / PARTSUPP
NATION / REGION
ORDERS / SUPPLIER

For each benchmark there are 2 tool selections:

  • %_1_gen will execute the tpc data generator to create data in the ./tools/tpcds or ./tools/tpch directories.
  • %_2_define will logon via BTEQ to define the tables in the database you’ve defined and generate scripts to load the data, perform a serial and a workload test.

Both the TPC-DS and TPC-H scripts have the same parameters and setup requirements. The following is the help output from the tpcds_1_gen script:

The following is a session for generating a 1 GB sized benchmark. It took about 3 minutes to run on a 16 core Xeon processor running at 2.53 GHz:

The following is the resulting tables from the tpcds_1_gen at 1 GB:

Filename / Rows / Bytes
call_center.dat_1_2 / 6 / 1,891
catalog_page.dat_1_2 / 11,718 / 1,631,792
catalog_returns.dat_1_2 / 144,067 / 21,378,371
catalog_sales.dat_1_2 / 1,441,548 / 295,910,384
customer_address.dat_1_2 / 50,000 / 5,502,165
customer.dat_1_2 / 100,000 / 13,209,372
customer_demographics.dat_1_2 / 960,400 / 39,811,295
customer_demographics.dat_2_2 / 960,400 / 40,848,801
date_dim.dat_1_2 / 73,049 / 10,317,438
dbgen_version.dat_1_2 / 1 / 105
household_demographics.dat_1_2 / 7,200 / 151,653
income_band.dat_1_2 / 20 / 328
inventory.dat_1_2 / 5,872,500 / 118,206,952
inventory.dat_2_2 / 5,872,500 / 118,213,187
item.dat_1_2 / 18,000 / 5,051,899
promotion.dat_1_2 / 300 / 37,233
reason.dat_1_2 / 35 / 1,339
ship_mode.dat_1_2 / 20 / 1,113
store.dat_1_2 / 12 / 3,155
store_returns.dat_1_2 / 287,514 / 32,710,005
store_sales.dat_1_2 / 2,880,404 / 388,445,409
time_dim.dat_1_2 / 86,400 / 5,107,780
warehouse.dat_1_2 / 5 / 585
web_page.dat_1_2 / 60 / 5,776
web_returns.dat_1_2 / 71,763 / 9,806,236
web_sales.dat_1_2 / 719,384 / 146,877,674
web_site.dat_1_2 / 30 / 8,771
Total Bytes / 1,253,240,709
1.17 GB

The generate for 1 GB of TPC-H data took about 30 seconds on the same 16 core Xeon server at 2.53 GHz. The following is the resulting tables from the tpch_1_gen at 1 GB:

Filename / Rows / Bytes
customer.tbl / 150,000 / 24,346,144
lineitem.tbl / 6,001,215 / 759,863,287
nation.tbl / 25 / 2,224
orders.tbl / 1,500,000 / 171,952,161
partsupp.tbl / 800,000 / 118,984,616
part.tbl / 200,000 / 24,135,125
region.tbl / 5 / 389
supplier.tbl / 10,000 / 1,409,184
Total Bytes / 1,100,693,130
1.03 GB

The help from tpcds_2_define and tpch_2_define are nearly identical. Here is the tpcds_2_define help:

Generated Load Script:

The scripts/______load.cmd is executed under tdbench.sh with the EXEC command. It has the following contents:

Explanation:

  • The define statement provides a name (tpcdsload) and a title for the run
  • The queue statement puts 1 execution of tools/tptload.sh into q1 for each line in the parmfile. For OS statements, the parameters automatically are added to the end of the statement. No markers are needed (e.g. ;1, :2, … )
  • The worker statement initiates 2 workers to load data in parallel
  • The run statement initiates execution of the work in q1 to process until all tables are loaded

The first lines of the parameter file are:

There are 3 parameters on each line. The parameters:

  • The first provides a wild card search for a list of input files since the TPC-DS loader will create multiple files for each table.
  • The second gives a fully qualified table name
  • The third provides a job name for the tpc load job. (Note that job names must be unique or one job could be taken as a restart of another active job.

Generated Query Scripts:

A serial and workload script is generated by tpcds_2_define and tpch_2_define. The structure of the scripts is identical between the two benchmarks.

The Serial Script:

The queue statement will cause the databasename doug_tpch to be substituted for :1 which is used in the provided TPC-DS/TPC-H scripts as part of the tablename reference. One worker is specified using the control user’s logon and there is no limit to the duration of the run, so all queries will execute 1 time.

The Workload Script:

The workload script is similar to the serial script, except there are 5 workers logged on to Teradata with the control user’s logon ID and the test will only run for 15 minutes. At the end of 15 minutes, a kill will be issued to stop any queries still executing.

Copyright 2016 © Teradata CorporationPage 1