Cloud SQL Shootout: TPC-H on Redshift & Athena (or Hive)

Today's business analyst demands SQL-like access to Big Data™. Your task today is to design a SQL-in-the-cloud data warehouse system. You will compare AWS Athena (or Apache Hive) and AWS Redshift, a hosted version of the parallel database system Actian Matrix (probably better known under its previous name, ParAccel). As a benchmark we are going to use TPC-H, an industry-standard benchmark for analytical SQL systems.

Three data sets with scale factors 1, 10, and 300 have already been created for you and uploaded to S3. TPC-H scale factor 300 means the largest table, "lineitem", has 1,799,989,091 records. (Scale factor 300 is omitted for budget reasons.)

Preparation for Redshift

Install SQL Workbench and the Redshift JDBC driver

Installation instructions:

The Redshift JDBC driver can be downloaded from

This time, you will need your AWS access credentials. Create a new access key:

Go to "Security Credentials" in the console

Note down your access key id and secret access key somewhere you will find them again.

You can also simply download the key file which contains this information.

Redshift Startup

We strongly recommend that you read the rest of this tutorial before starting the cluster. Redshift clusters are quite costly; please keep their runtime minimal.

Go to the AWS Redshift console and click the "Launch Cluster" button

For now, we will create a single-node cluster for testing purposes.

Make sure you note the username and password; you will need them later. The database name can remain empty.

Select a single node cluster for now

Additional configuration can all remain on defaults

Then launch

Go to the clusters dashboard; you will see your cluster launching

Wait until the cluster is available

Click the cluster name ("bads-test1" here) to show its details

It might show a warning due to the firewall not allowing access to the JDBC port. Click the warning sign and then "Edit security group"

Edit the security group on the "inbound" tab to allow Redshift connections from anywhere:

Afterwards, the warning should be gone.

If connecting with SQL Workbench (see below) does not work --- either after the above changes, or even though you never got the above "No Inbound Permissions" warning in the first place --- please try the following in the EC2 console to open up access after all.

Then, use the displayed JDBC URL to connect using SQL Workbench.

You will be greeted by the Query/Query result screen. Run a simple query to check everything works: SELECT 42;

Redshift Schema / Data loading

Run the following queries in SQL Workbench to create and load the tables (a plain text file is available at

CREATE TABLE region (r_regionkey INT NOT NULL, r_name VARCHAR(25) NOT NULL, r_comment VARCHAR(152) NOT NULL, PRIMARY KEY (r_regionkey)) ;

CREATE TABLE nation (n_nationkey INT NOT NULL, n_name VARCHAR(25) NOT NULL, n_regionkey INT NOT NULL, n_comment VARCHAR(152) NOT NULL, PRIMARY KEY (n_nationkey)) ;

CREATE TABLE supplier (s_suppkey INT NOT NULL, s_name VARCHAR(25) NOT NULL, s_address VARCHAR(40) NOT NULL, s_nationkey INT NOT NULL, s_phone VARCHAR(15) NOT NULL, s_acctbal DECIMAL(15,2) NOT NULL, s_comment VARCHAR(101) NOT NULL, PRIMARY KEY (s_suppkey)) ;

CREATE TABLE customer (c_custkey INT NOT NULL, c_name VARCHAR(25) NOT NULL, c_address VARCHAR(40) NOT NULL, c_nationkey INT NOT NULL, c_phone VARCHAR(15) NOT NULL, c_acctbal DECIMAL(15,2) NOT NULL, c_mktsegment VARCHAR(10) NOT NULL, c_comment VARCHAR(117) NOT NULL, PRIMARY KEY (c_custkey)) ;

CREATE TABLE part (p_partkey INT NOT NULL, p_name VARCHAR(55) NOT NULL, p_mfgr VARCHAR(25) NOT NULL, p_brand VARCHAR(10) NOT NULL, p_type VARCHAR(25) NOT NULL, p_size INT NOT NULL, p_container VARCHAR(10) NOT NULL, p_retailprice DECIMAL(15,2) NOT NULL, p_comment VARCHAR(23) NOT NULL, PRIMARY KEY (p_partkey)) ;

CREATE TABLE partsupp (ps_partkey INT NOT NULL, ps_suppkey INT NOT NULL, ps_availqty INT NOT NULL, ps_supplycost DECIMAL(15,2) NOT NULL, ps_comment VARCHAR(199) NOT NULL, PRIMARY KEY (ps_partkey, ps_suppkey), FOREIGN KEY (ps_partkey) REFERENCES part (p_partkey), FOREIGN KEY (ps_suppkey) REFERENCES supplier (s_suppkey)) ;

CREATE TABLE orders (o_orderkey INT NOT NULL, o_custkey INT NOT NULL, o_orderstatus VARCHAR(1) NOT NULL, o_totalprice DECIMAL(15,2) NOT NULL, o_orderdate DATE NOT NULL, o_orderpriority VARCHAR(15) NOT NULL, o_clerk VARCHAR(15) NOT NULL, o_shippriority INT NOT NULL, o_comment VARCHAR(79) NOT NULL, PRIMARY KEY (o_orderkey)) ;

CREATE TABLE lineitem (l_orderkey INT NOT NULL, l_partkey INT NOT NULL, l_suppkey INT NOT NULL, l_linenumber INT NOT NULL, l_quantity INTEGER NOT NULL, l_extendedprice DECIMAL(15,2) NOT NULL, l_discount DECIMAL(15,2) NOT NULL, l_tax DECIMAL(15,2) NOT NULL, l_returnflag VARCHAR(1) NOT NULL, l_linestatus VARCHAR(1) NOT NULL, l_shipdate DATE NOT NULL, l_commitdate DATE NOT NULL, l_receiptdate DATE NOT NULL, l_shipinstruct VARCHAR(25) NOT NULL, l_shipmode VARCHAR(10) NOT NULL, l_comment VARCHAR(44) NOT NULL, PRIMARY KEY (l_orderkey,l_linenumber)) ;

COMMIT;

-- In the remainder, replace XXXXX / YYYYY with your access key / secret access key!

copy region from 's3://tpch-bads-data/sf1/region/' delimiter '|' gzip maxerror 210 credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';

copy nation from 's3://tpch-bads-data/sf1/nation/' delimiter '|' gzip maxerror 210 credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';

copy customer from 's3://tpch-bads-data/sf1/customer/' delimiter '|' gzip maxerror 210 credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';

copy orders from 's3://tpch-bads-data/sf1/orders/' delimiter '|' gzip maxerror 210 credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';

copy lineitem from 's3://tpch-bads-data/sf1/lineitem/' delimiter '|' gzip maxerror 210 credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';

copy part from 's3://tpch-bads-data/sf1/part/' delimiter '|' gzip maxerror 210 credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';

copy partsupp from 's3://tpch-bads-data/sf1/partsupp/' delimiter '|' gzip maxerror 210 credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';

copy supplier from 's3://tpch-bads-data/sf1/supplier/' delimiter '|' gzip maxerror 210 credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';

COMMIT;

Replace XXXXX / YYYYY with your access key / secret access key!

You can observe your cluster working by going to the "Queries" Tab in the cluster details on the Redshift console.
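If a COPY aborts or loads fewer rows than expected, Redshift records the problematic rows in the system table stl_load_errors. A quick way to inspect the most recent problems (optional, purely a debugging aid):

SELECT starttime, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;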

Make sure the data is loaded by running a SELECT COUNT(*) FROM <table> for all the loaded tables after the COMMIT (a plain text file is available at

SELECT count(*) from region;

SELECT count(*) from nation;

SELECT count(*) from supplier;

SELECT count(*) from customer;

SELECT count(*) from part;

SELECT count(*) from partsupp;

SELECT count(*) from orders;

SELECT count(*) from lineitem;
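If you prefer a single result over eight separate ones, the same counts can also be collected in one query (purely a convenience sketch; it assumes all eight tables above exist):

-- All row counts in one result set
SELECT 'region' AS tbl, count(*) AS cnt FROM region
UNION ALL SELECT 'nation', count(*) FROM nation
UNION ALL SELECT 'supplier', count(*) FROM supplier
UNION ALL SELECT 'customer', count(*) FROM customer
UNION ALL SELECT 'part', count(*) FROM part
UNION ALL SELECT 'partsupp', count(*) FROM partsupp
UNION ALL SELECT 'orders', count(*) FROM orders
UNION ALL SELECT 'lineitem', count(*) FROM lineitem;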

Run the following queries and note their runtime.

Redshift TPC-H Query 1

(a plain text file is available at

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
    l_shipdate <= date '1998-12-01' - interval '108' day
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus;

Redshift TPC-H Query 5

(a plain text file is available at

select
    n_name,
    sum(l_extendedprice * (1 - l_discount)) as revenue
from
    customer,
    orders,
    lineitem,
    supplier,
    nation,
    region
where
    c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and l_suppkey = s_suppkey
    and c_nationkey = s_nationkey
    and s_nationkey = n_nationkey
    and n_regionkey = r_regionkey
    and r_name = 'MIDDLE EAST'
group by
    n_name
order by
    revenue desc;

Redshift Shutting Down

After you are done with your queries, shut down your cluster

Athena Data "Loading"

Go to the Athena Web Interface

Create a database for you to use: Enter the query

CREATE DATABASE bd_student_XX;

(replace the database name with your user name), click "Run Query", and then make sure it's selected on the left side.

Create and "load" the tables (no data is actually loaded!): paste the following statements ONE BY ONE into the query window (what does each one do?!) and execute them. (a plain text file is available at

CREATE EXTERNAL TABLE customer(
    C_CustKey int,
    C_Name varchar(64),
    C_Address varchar(64),
    C_NationKey int,
    C_Phone varchar(64),
    C_AcctBal decimal(13, 2),
    C_MktSegment varchar(64),
    C_Comment varchar(120),
    skip varchar(64)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION 's3://tpch-bads-data/sf1/customer/';

CREATE EXTERNAL TABLE lineitem(
    L_OrderKey int,
    L_PartKey int,
    L_SuppKey int,
    L_LineNumber int,
    L_Quantity int,
    L_ExtendedPrice decimal(13, 2),
    L_Discount decimal(13, 2),
    L_Tax decimal(13, 2),
    L_ReturnFlag varchar(64),
    L_LineStatus varchar(64),
    L_ShipDate date,
    L_CommitDate date,
    L_ReceiptDate date,
    L_ShipInstruct varchar(64),
    L_ShipMode varchar(64),
    L_Comment varchar(64),
    skip varchar(64)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION 's3://tpch-bads-data/sf1/lineitem/';

CREATE EXTERNAL TABLE nation(
    N_NationKey int,
    N_Name varchar(64),
    N_RegionKey int,
    N_Comment varchar(160),
    skip varchar(64)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION 's3://tpch-bads-data/sf1/nation/';

CREATE EXTERNAL TABLE orders(
    O_OrderKey int,
    O_CustKey int,
    O_OrderStatus varchar(64),
    O_TotalPrice decimal(13, 2),
    O_OrderDate date,
    O_OrderPriority varchar(15),
    O_Clerk varchar(64),
    O_ShipPriority int,
    O_Comment varchar(80),
    skip varchar(64)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION 's3://tpch-bads-data/sf1/orders/';

CREATE EXTERNAL TABLE part(
    P_PartKey int,
    P_Name varchar(64),
    P_Mfgr varchar(64),
    P_Brand varchar(64),
    P_Type varchar(64),
    P_Size int,
    P_Container varchar(64),
    P_RetailPrice decimal(13, 2),
    P_Comment varchar(64),
    skip varchar(64)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION 's3://tpch-bads-data/sf1/part/';

CREATE EXTERNAL TABLE partsupp(
    PS_PartKey int,
    PS_SuppKey int,
    PS_AvailQty int,
    PS_SupplyCost decimal(13, 2),
    PS_Comment varchar(200),
    skip varchar(64)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION 's3://tpch-bads-data/sf1/partsupp/';

CREATE EXTERNAL TABLE region(
    R_RegionKey int,
    R_Name varchar(64),
    R_Comment varchar(160),
    skip varchar(64)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION 's3://tpch-bads-data/sf1/region/';

CREATE EXTERNAL TABLE supplier(
    S_SuppKey int,
    S_Name varchar(64),
    S_Address varchar(64),
    S_NationKey int,
    S_Phone varchar(18),
    S_AcctBal decimal(13, 2),
    S_Comment varchar(105),
    skip varchar(64)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION 's3://tpch-bads-data/sf1/supplier/';

Make sure the tables have been created, i.e., they appear in the tables list and all contain data. You can check that they contain data by running the following queries (ONE BY ONE)

(a plain text file is available at

SELECT count(*) from region;

SELECT count(*) from nation;

SELECT count(*) from supplier;

SELECT count(*) from customer;

SELECT count(*) from part;

SELECT count(*) from partsupp;

SELECT count(*) from orders;

SELECT count(*) from lineitem;

Please note whether, and if so how, the execution times of these SELECT COUNT(*) queries differ between Redshift and Athena. If they do, any idea why?

Run the following queries

Athena TPC-H Query 1

(a plain text file is available at

SELECT
    L_RETURNFLAG, L_LINESTATUS,
    SUM(L_QUANTITY), SUM(L_EXTENDEDPRICE),
    SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)),
    SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)),
    AVG(L_QUANTITY), AVG(L_EXTENDEDPRICE), AVG(L_DISCOUNT),
    COUNT(1)
FROM
    lineitem
WHERE
    L_SHIPDATE <= CAST('1998-09-02' AS DATE)
GROUP BY L_RETURNFLAG, L_LINESTATUS
ORDER BY L_RETURNFLAG, L_LINESTATUS;

Athena TPC-H Query 5

(a plain text file is available at

select
    n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
from
    customer c join
    ( select n_name, l_extendedprice, l_discount, s_nationkey, o_custkey from orders o join
      ( select n_name, l_extendedprice, l_discount, l_orderkey, s_nationkey from lineitem l join
        ( select n_name, s_suppkey, s_nationkey from supplier s join
          ( select n_name, n_nationkey
            from nation n join region r
            on n.n_regionkey = r.r_regionkey and r.r_name = 'MIDDLE EAST'
          ) n1 on s.s_nationkey = n1.n_nationkey
        ) s1 on l.l_suppkey = s1.s_suppkey
      ) l1 on l1.l_orderkey = o.o_orderkey
    ) o1
    on c.c_nationkey = o1.s_nationkey and c.c_custkey = o1.o_custkey
group by n_name
order by revenue desc;

Observe the query runtime and amount of data read.

Hive Schema / Data loading

Start up an EMR cluster (all defaults, 2 nodes; remember the Hue security group)

Access Hue's Hive query editor

Set S3 Credentials for Hive

Set the first key to fs.s3n.awsAccessKeyId and its value to your S3 access key id

Set the second key to fs.s3n.awsSecretAccessKey and its value to your S3 secret access key
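If you prefer to set these from the Hive query editor instead of Hue's settings dialog, the equivalent session-level statements should be (a sketch; replace XXXXX / YYYYY with your keys as before):

SET fs.s3n.awsAccessKeyId=XXXXX;
SET fs.s3n.awsSecretAccessKey=YYYYY;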

Create and "load" the tables (again, no data is actually loaded…) (a plain text file is available at

CREATE EXTERNAL TABLE customer(
    C_CustKey int,
    C_Name varchar(64),
    C_Address varchar(64),
    C_NationKey int,
    C_Phone varchar(64),
    C_AcctBal decimal(13, 2),
    C_MktSegment varchar(64),
    C_Comment varchar(120),
    skip varchar(64)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION 's3n://tpch-bads-data/sf1/customer/';

CREATE EXTERNAL TABLE lineitem(
    L_OrderKey int,
    L_PartKey int,
    L_SuppKey int,
    L_LineNumber int,
    L_Quantity int,
    L_ExtendedPrice decimal(13, 2),
    L_Discount decimal(13, 2),
    L_Tax decimal(13, 2),
    L_ReturnFlag varchar(64),
    L_LineStatus varchar(64),
    L_ShipDate date,
    L_CommitDate date,
    L_ReceiptDate date,
    L_ShipInstruct varchar(64),
    L_ShipMode varchar(64),
    L_Comment varchar(64),
    skip varchar(64)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION 's3n://tpch-bads-data/sf1/lineitem/';

CREATE EXTERNAL TABLE nation(
    N_NationKey int,
    N_Name varchar(64),
    N_RegionKey int,
    N_Comment varchar(160),
    skip varchar(64)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION 's3n://tpch-bads-data/sf1/nation/';

CREATE EXTERNAL TABLE orders(
    O_OrderKey int,
    O_CustKey int,
    O_OrderStatus varchar(64),
    O_TotalPrice decimal(13, 2),
    O_OrderDate date,
    O_OrderPriority varchar(15),
    O_Clerk varchar(64),
    O_ShipPriority int,
    O_Comment varchar(80),
    skip varchar(64)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION 's3n://tpch-bads-data/sf1/orders/';

CREATE EXTERNAL TABLE part(
    P_PartKey int,
    P_Name varchar(64),
    P_Mfgr varchar(64),
    P_Brand varchar(64),
    P_Type varchar(64),
    P_Size int,
    P_Container varchar(64),
    P_RetailPrice decimal(13, 2),
    P_Comment varchar(64),
    skip varchar(64)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION 's3n://tpch-bads-data/sf1/part/';

CREATE EXTERNAL TABLE partsupp(
    PS_PartKey int,
    PS_SuppKey int,
    PS_AvailQty int,
    PS_SupplyCost decimal(13, 2),
    PS_Comment varchar(200),
    skip varchar(64)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION 's3n://tpch-bads-data/sf1/partsupp/';

CREATE EXTERNAL TABLE region(
    R_RegionKey int,
    R_Name varchar(64),
    R_Comment varchar(160),
    skip varchar(64)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION 's3n://tpch-bads-data/sf1/region/';

CREATE EXTERNAL TABLE supplier(
    S_SuppKey int,
    S_Name varchar(64),
    S_Address varchar(64),
    S_NationKey int,
    S_Phone varchar(18),
    S_AcctBal decimal(13, 2),
    S_Comment varchar(105),
    skip varchar(64)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION 's3n://tpch-bads-data/sf1/supplier/';

Again, you might want to make sure the data is accessible by running a SELECT COUNT(*) FROM <table> for all the tables (a plain text file is available at

SELECT count(*) from region;

SELECT count(*) from nation;

SELECT count(*) from supplier;

SELECT count(*) from customer;

SELECT count(*) from part;

SELECT count(*) from partsupp;

SELECT count(*) from orders;

SELECT count(*) from lineitem;

Please note whether, and if so how, the execution times of these SELECT COUNT(*) queries differ between Redshift and Hive. If they do, any idea why?

Hive TPC-H Query 1

(a plain text file is available at

SELECT
    L_RETURNFLAG, L_LINESTATUS,
    SUM(L_QUANTITY), SUM(L_EXTENDEDPRICE),
    SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)),
    SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)),
    AVG(L_QUANTITY), AVG(L_EXTENDEDPRICE), AVG(L_DISCOUNT),
    COUNT(1)
FROM
    lineitem
WHERE
    L_SHIPDATE <= '1998-09-02'
GROUP BY L_RETURNFLAG, L_LINESTATUS
ORDER BY L_RETURNFLAG, L_LINESTATUS;

Hive TPC-H Query 5

(a plain text file is available at

select
    n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
from
    customer c join
    ( select n_name, l_extendedprice, l_discount, s_nationkey, o_custkey from orders o join
      ( select n_name, l_extendedprice, l_discount, l_orderkey, s_nationkey from lineitem l join
        ( select n_name, s_suppkey, s_nationkey from supplier s join
          ( select n_name, n_nationkey
            from nation n join region r
            on n.n_regionkey = r.r_regionkey and r.r_name = 'MIDDLE EAST'
          ) n1 on s.s_nationkey = n1.n_nationkey
        ) s1 on l.l_suppkey = s1.s_suppkey
      ) l1 on l1.l_orderkey = o.o_orderkey
    ) o1
    on c.c_nationkey = o1.s_nationkey and c.c_custkey = o1.o_custkey
group by n_name
order by revenue desc;

Hive/EMR Shutting Down

After you are done with your queries, shut down your cluster.

(Non-cloud) alternative: MonetDB

In case one of the above cloud approaches does not work for you in due time (report why not!), or in case you are just curious, you can use MonetDB ( (on your own laptop/desktop/workstation!) as an alternative. See the MonetDB website at and for instructions on how to download and install the latest release of MonetDB (Jul2015-SP4) on your system, for a general tutorial on how to use it, and for documentation on how to connect SQL Workbench (which we used with Redshift above) to MonetDB. However, you do not necessarily need to use SQL Workbench or any other graphical client interface; simply using the MonetDB-provided textual / console-based `mclient` is enough for this exercise.

You find TPC-H data as compressed CSV files for scale factors 1, 3, and 10 (i.e., sizes 1 GB, 3 GB, 10 GB) at . Please download all files for the scale factor(s) you want to try into a directory on your machine (use a separate directory per scale factor), and note the full absolute path to that directory. There is no need to unpack / decompress these files, as MonetDB can bulk-load data directly from compressed CSV files.

Then start the MonetDB server (mserver) as per the instructions on the MonetDB website (see above for links).

At you also find the SQL scripts for MonetDB to create the database schema (tables), load the data (NOTE: in load_data.sql you need to replace "_MyDataPath_" with the full absolute path to the directory where you downloaded the data files!), run TPC-H queries 1 & 5, create foreign keys, and drop the tables again.
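For orientation: the statements in load_data.sql are MonetDB COPY INTO commands, and a typical line looks roughly like the following sketch (the path and file name here are made up; the provided script is authoritative):

COPY INTO region FROM '/home/me/tpch/sf1/region.tbl.gz' USING DELIMITERS '|', '\n' NULL AS '';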

Run these scripts (or their content) via mclient, SQL Workbench, or your favorite SQL client.

Tasks:

●Perform the following tasks first using scale factor (SF) 1 (1 GB dataset); once everything works fine with SF-1, repeat with SF-10 (10 GB dataset); for the latter, replace /tpch-bads-data/sf1/ with /tpch-bads-data/sf10/ in the above load scripts. Remember to drop and re-create your tables before loading another dataset (or create and use a separate database for each dataset), for example as sketched below.
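A minimal sketch (repeat for all eight tables, in each system):

DROP TABLE IF EXISTS lineitem;
DROP TABLE IF EXISTS orders;
-- ... and so on for the remaining tables. In Redshift, a table referenced by a
-- foreign key may require DROP TABLE ... CASCADE, or drop the referencing table first.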

●Load the TPC-H dataset in two systems (Redshift and Athena (or Hive))

○Observe and record the time each system takes to load each table.

●Run each query (TPC-H queries 1 & 5) at least three times in each system.

○Observe and record the time it takes each system to complete each query.

○Hint: Look at the loads/queries tab in the Redshift console to monitor progress.

●Report both loading and querying times in your report --- at least as tables, but preferably using a graphical visualisation you consider suitable --- and discuss whether and why they do (not) differ between systems.

●Include query plans (EXPLAIN) in your report.
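Prefixing a query with EXPLAIN returns its plan in Redshift, Athena, and Hive alike. For example, on Redshift (shown with an abbreviated variant of Query 1; for your report, put the full query text after the EXPLAIN keyword):

EXPLAIN
SELECT l_returnflag, l_linestatus, count(*) AS count_order
FROM lineitem
WHERE l_shipdate <= date '1998-12-01' - interval '108' day
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;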

●Run the queries on a 10-node cluster and the scale factor 300 data in prefix s3://tpch-bads-data/sf300/

○Pending data availability. (Scale factor 300 is omitted for budget reasons.)

●Bonus 1: Re-save the data in a format that Athena (or Hive) reads faster, such as Parquet, and measure again. See
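One possible way to do this in Athena is a CTAS statement that writes Parquet into a bucket you own (the bucket and table names below are placeholders, not something provided for the course):

CREATE TABLE lineitem_parquet
WITH (format = 'PARQUET',
      external_location = 's3://your-own-bucket/tpch/sf1/lineitem-parquet/')
AS SELECT * FROM lineitem;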

●Bonus 2: Using the schema and the queries, design and explain a partitioning scheme for Redshift and Athena (or Hive) that optimizes the runtime of the two queries. Run the queries again, measure and extend your report accordingly. Check query plans to look for the impact of partitioning changes.

○Redshift partitioning guide:

○Hive partitioning guide:

○Hint: Look at individual joins, and the size of the tables involved.

○Hint: For Hive, consider moving the data from S3 into HDFS using the CREATE TABLE AS SELECT … method. Multiple schemas can help here.
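As a starting point on the Redshift side, distribution and sort keys can be assigned when (re-)creating a table. The following sketch co-locates orders and lineitem on their join key and sorts lineitem by ship date; the *_dist table names are placeholders, and whether this actually helps is exactly what you should measure:

CREATE TABLE orders_dist DISTKEY(o_orderkey) SORTKEY(o_orderdate)
AS SELECT * FROM orders;

CREATE TABLE lineitem_dist DISTKEY(l_orderkey) SORTKEY(l_shipdate)
AS SELECT * FROM lineitem;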

Hint: Primary and Foreign keys for TPC-H

(a plain text file is available at

ALTER TABLE region ADD CONSTRAINT regionkey PRIMARY KEY (r_regionkey) ;

ALTER TABLE nation ADD CONSTRAINT nationkey PRIMARY KEY (n_nationkey) ;

ALTER TABLE supplier ADD CONSTRAINT suppkey PRIMARY KEY (s_suppkey) ;

ALTER TABLE customer ADD CONSTRAINT custkey PRIMARY KEY (c_custkey) ;

ALTER TABLE part ADD CONSTRAINT partkey PRIMARY KEY (p_partkey) ;

ALTER TABLE partsupp ADD CONSTRAINT partsuppkey PRIMARY KEY (ps_partkey, ps_suppkey) ;
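The list above appears to be cut off; the remaining primary keys (orders, lineitem) and the foreign keys follow the same pattern, e.g. (constraint names are arbitrary):

ALTER TABLE orders ADD CONSTRAINT orderkey PRIMARY KEY (o_orderkey) ;

ALTER TABLE lineitem ADD CONSTRAINT lineitemkey PRIMARY KEY (l_orderkey, l_linenumber) ;

ALTER TABLE nation ADD CONSTRAINT nation_region_fk FOREIGN KEY (n_regionkey) REFERENCES region (r_regionkey) ;

ALTER TABLE customer ADD CONSTRAINT customer_nation_fk FOREIGN KEY (c_nationkey) REFERENCES nation (n_nationkey) ;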