HES/HPA Data Warehouse
LowLevel Design(DRAFT)
Version 0.42
Contents
1Document Control
1.1About this document
1.2Changes made to this document
1.3Purpose of this document
1.4Audience for this document
2Documentation Roadmap
3Data Model
3.1ERWIN models
3.2Key design decisions
3.2.1Star schema:
3.2.2Querying
3.2.3Shared FACT table:
3.2.4Grain
3.2.5Many-to-many relationships
3.2.6Nulls
3.2.7Data model naming standards
4Physical Database Design
4.1Key references
4.2Key design decisions.
4.2.1Environment
4.2.2Partitioning strategy
4.2.3Surrogate keys
4.2.4Files, Filegroups and RAID
4.2.5Index strategy
4.2.6Other key decisions
4.3Auditing
4.4Database administration
4.4.1Data loading
5Additional Databases
5.1HES Repository database
5.1.1Database structures
5.2Combination drugs, dual packs and macro id
6Views and Aggregate Tables
7Data Loading
A.References
B.Glossary
C.Approval for this document
1Document Control
1.1About this document
Author / David OwlerDistribution / HES/HPA project team
Published / 19Oct 2011
File location /
Status / Version 0.41
Confidentiality / This document is confidential and must not be distributed outside of IMS Health.
Copyright / © 2011 IMS Health Incorporated or its affiliates. All rights reserved.
1.2Changes made to this document
Version / Sections changed or created / Author or contributor name / Date of change / Role / Change summary0.1 / All / David Owler / 11-Sep-11 / Designer / Document Created
0.2 / Most / David Owler / 30 Nov / Additional Databases section moved to HLD document.
0.3, 0.4 / Several / David Owler / 22 Dec 2011, 4 Jan 2012 / Incorporate main suggestions from Graham Ferguson. Principally, improving the document links.
Added the field mapping document – ‘HES data to DWH field mappings v0.4.xlsx’ link
1.3Purpose of this document
The purpose of this document is topresent key design decisions for the HES HAP DWH. This document should be used in conjunction with other low level design documentation – see below in next section
present the detailed design for the HES DWH and to highlight the key design decisions. Once approved, this document will be used for the development of the HES DWH.
1.4Audience for this document
This document has been written for:
- HES/HPA Project Core Team
- Technical Architecture Department
- Development Team
2Documentation RoadmapOverview
This is a companion document to the High Level Design (HLD) document. The HLD covers scope, terms of reference, assumptions – please refer to this document for coverage of these.
The HLD document, despite its name, covers several topics in quite a bit of detail.
The low level design (LLD) document complements the HLD and provides additional, more detailed specifications, for the key components.
The LLD will contain references to detailed specifications and will also document key design decisions.
The LLD document will not repeat information contained in the HLD so please use both documents in conjunction.
Deliverablesinclude:
- Logical/physical data models for the DWH and Repository databases
- ETL documentation for HES reference data
- ETL documentation for IMS reference data
- Data mapping documentation, mapping the field specification form the NHS IC to the HES data model
Logical/physical data model
Physical DWH design
IMS reference data loader (ETL)
HES data loader (ETL)
3Data Model
3.1ERWIN models
Fully documented models are located in the folder:
SeeSee latest ERWIN model and associated repor
HES HPA logical physical data model.erwin
HES HPA Repository logical physical.erwin t
3.2Key design decisions
3.2.1Star schema:
This is discussed in the HLD.
3.2.2Querying
No requirement for a cube in thise first release. Data Analysts will be using SQL as the main querying tool to extract data into other systems such as SAS for example.
3.2.3Shared FACT table:
A single FACT table is implemented covering the 3 data streams – inpatient, outpatient and A&E. This was done to make it easier to track a patient across each streams, rather than joining conformed dimensions across multiple FACT tables.
3.2.4Grain
Discussed in the HLD.
The incoming data measures are at the level of the episode, attendance and visit across the 3 streams – inpatient outpatient and A&E streams respectively.
Please refer to the field specification document for definitions of the measures:
IMS HES fields required_290911_2.xlsx
The users are only interested in querying data at this level of granularity.
There are several multi-value fields within this level of granularity.
These are:
- Operations
- Diagnoses
- A&E classification schemes
- Multiple drugs associated with a single episode.
- Multiple births for an episode
The HLD discusses the issue of double counting data when querying some of these many-many associations and FACT data.
The multi-valued data above was not incorporated into the FACT table: this was done to clearly separate the facts/measures at episode level to try and avoid potential double-counting issues.
Bridge tables are joined on the primary key of the FACT table.
3.2.5Many-to-many relationships
Many-many relationships for the above data are resolved by a bridge table between the dimension and the FACT table.
Note that for the diagnosis and surgical procedure dimensions, the main diagnosis or procedure is indicated in the bridge table.
The Surgical Procedure-to-FACT bridge table is called Surgical Procedure FACT since it has a fact (operation date).
Other bridge tables that contain no fact are given the suffix ‘Association’ e.g. ‘Diagnosis Medical Event FACT Association’.
3.2.6Nulls
Null values may be present in the DWH tables due to the following:
- Initially, IMS will receive nulls for ‘sensitive’ fields such as regional, practioner, and organisation fields. When sufficient Trusts are signed up, IMS will receivedpseudonymised values in these fields. (These null values will limit some of the analyses that can be carried out in the first instance).
- Shared FACT table will result in null values for some measures and FK fields
- Nulls in incoming data
Decision: nulls will be stored in the database as nulls rather than as an unknown member.
Note that there are sometimes separate special codes in the HES data for unknown, unavailable, ‘not relevant’. The HES data will be loaded ‘as is’ and not manipulated in any way.
3.2.7Data model naming standards
The principles of the IMS naming standards have been applied to the HES logical data model.
Recommendation is to implement the logical names in the physical design (as was done for Data Customisation database).
Data Analysts will write queries against the database tables directly to extract data. It is therefore very important that the tables are ‘business user’ friendly.
Views and aggregate tables may be created to make it easier for users to search the DWH. This is discussed futher below.
UPDATE: In a formal review of the HES data model, the Data Analysts stated that they are comfortable with the shorter physical field names. The users observed that regardless of what nomenclature we use, the users would still need to relate each field back to the HES data dictionary anyway, and so the use of these physical names will not present any extra effort.
A project decision was subsequently taken to use IMS standard physical names for the HES DWH as generated by ERWIN.
4Data Mapping
The following document, maps the field specifications requested by the business to the fields in the DWH:
In folder
Refer to HES data to DWH field mappings.xlsx
45Physical Database Design
5.1Key references
- Best Practices for Data Warehousing with SQL Server 2008 R2 (Mark Whitehorn et al)
- The Microsoft Data Warehouse Toolkit: With SQL Server 2008 R2 and the Microsoft Business Intelligence Toolset (Kimball)
- Various Partitioning papers by Kimberley Tripp and others
4.25.2Key design decisions.
4.2.15.2.1Environment
RAID5 to be used for the user databases but a RAID10 partition could be used for the log file and for tempdb.
Hardware-based RAID should be implemented.
Tempdb: multiple database files may be specified for the tempdb (1 file per logical CPU). This is inline with recent Microsoft recommendations. This could be placed on the RAID10 array but tempdb can grow to a large size during normal operation, so care is required here.
Tempdb should be allocated sufficient space.
A SAN has been ordered for the Bunker which should alleviate disk space issues.
4.2.25.2.2Partitioning strategy
The FACT table should be partitioned by (event) date. We discussed whether this was the best approach given that many studies would be longitudinal over time. However, other studies will probably be based on latest data and these would benefit from partitioning.
Each partition should be placed on its own filegroup.
Partitioning should be managed inside the database rather than as part of the ETL. A scheduled job could be setup to do this.
Tables and their indexes should be placed on the same filegroup.
SQL Server 2008 is partition-aware in terms of querying.
4.2.2.15.2.2.1Partitioning column
A column called Event Date Period Id is defined on the FACT table. This is either the episode start date, appointment date or the A&E attendance date.
Event Date must not be null.
4.2.35.2.3Surrogate keys
Meaningless surrogate keys should will be used for all tables, except for possibly one – the Period dimension table.
In the paper, ‘Best Practices for Data Warehousing with SQL Server 2008 R2 (Mark Whitehorn et al)’, there is a recommendation that the integer Date key be implemented in the physical design as a meaningful surrogate key – e.g. 20111025 for significantly performance benefits in SQL Server.
Several colleagues within the Technical Architecture team commented that using a meaningful date key, is good design and implementation practice within the Microsoft BI stack.
Please note that many of the measures in the HES HPA DWH FACT table are dates, so this approach could really help with performance in a large DWH.
A meaningful ‘date’ surrogate key would also be useful in the implementation of the partitioning strategy – see above.
I’ve discussed this with the Data Architect.
4.2.45.2.4Files, Filegroups and RAID
RAID will stripe data across all of the physical devices automatically giving performance benefits due to I/O balancing across multiple disks.
Having multiple database files does seem to improve performance in a RAID configuration. It seems that having multiple database files, is beneficial principally for manageability – for example, allowing piecemeal restore and partial availability of a file.Since all of the HES data needs to be available for longitudinal studies (i.e. no archiving of data), this consideration may not apply to the HES DWH.
Initially, a single We should not over-complicate the design of multiple database filescontaining multiple filegroups will be specified in the DWH, since only partial DBA support may be available post-launch.
4.2.55.2.5Index strategy
Data analysts will query the DWH tables in order to produce data extracts for downstream packages such as SAS. Indexes should be designed to support the searches being carried out.
Good database design practice should always be followed; in particular the following should be implemented:
- Every table should have a primary key constraint defined
- Foreign key columns should have indexes defined on them.
- Columns used frequently in searching should be indexed.
4.2.65.2.6Other key decisions
- Initial database size: this should be set to a sufficiently high value in line with database size estimates in order to avoid automatic file growth. Auto-growth should be switched on additionally.
- Use SIMPLE recover model to improve throughput during data loading – this also keeps the log file size small.
- A full backup should be taken prior to a data load commencing.This is done to enable us to go back to the state of the DWH prior to a load commencing, should this be required in the event of a critical load failure. It is envisaged however, that simply re-loading the latest data or restarting a data load from the point of failure should correct most load failures
- Use page compression which can markedly improve performance
- Foreign Key (FK) and Unique (FK) Constraints: these should always be defined as they are an important guard against data validation errors during data loading. However, FK constraints should be disabled and indexes dropped during data loading. FKs and Indexes should be reinstated after the data load.
- Index fill-factors should be set to 100%. The DWH will be read-only for normal use (but this is reset just prior to a data load commencing). Note this can lead to page splitting during data load. Index maintenance needs to be carried out as part of the data load.
- The clustered index on the FACT table should be on the event date column (this is either the episode start date, appointment date, or the A&E arrival date. The column should not be NULL.
- Sound database practices should always be followed.
4.35.3Auditing
Covered in the HLD. New auditing features in SQL Server 2008 Enterprise Edition are to be used for auditing purposes. It is important to audit actions and capture SQL queries run against the data.
4.45.4Database administration
The following will need to be setup in both development and production environments:
- Weekly maintenance plan
- Database backup strategies (full/differential) – In line with DBA practices of the European Database Services team, a full database will be performed just prior to each data load and additionally, once a week.
4.4.15.4.1Data loading
As part of the data load process, the following tasks should be carried out:
- A full database backup should be taken prior to the data load commencing
- Database set to read/write prior to data loading
- Following a successful data load, index maintenance should be carried out
- Post-load, an UPDATE STATISTICS step should be carried out
- DWH should be set to read-only
A single SQL Server Agent job can be defined incorporate these steps into the data load process.
Note that
.
56Additional Databases
Please refer to the HES Database Overview diagram that describes the roles of the other databases in the HES application.
The other database types are:
- Staging databases – temporary store of incoming IMS reference data and HES data. Data is initialised just before a new data load.
- Config databases – ETL parameters, error logs etc
- HES Repository database – persisted store of HES reference data and drug link information.
.
5.16.1HES Repository database
Key features:
- Permanent store of HES reference data such as OPCS codes, ICD10 and a variety of other HES reference lists. The data feeds from the NHS IC invariably supply codes but not code descriptions.
- Descriptions for HES reference data is available from the HES Data Dictionary and will need to loaded into the HES Repository database as an initial task.
- The reference data tables are used as lookups during data loading.
- Any new codes identified during data loading needs to be highlighted. A check staging step can identify missing descriptions for new codes. At the current time, the decision is that we should not load data until all code descriptions are updated in the Repository database.
- All reference lists should be based on latest value – see HLD for full discussion of this.
- Codes and their descriptions should be managed in the HES Repository. This should be managed as DBA task – no user interface application will be provided for this. However, a stored procedure may be provided for routine additions/updates of new codes and their descriptions.
- The drug file data received from the NHS IC should be persisted in the HES Repository database to provide a historical record of drug linkage data that can be checked in case of any data linkage type queries.
- Ardentia may be able to supply many of the initial reference data lists for us from their system.
5.1.16.1.1Database structures
The following table structure should be cloned for all managed reference table lists.
Column / Data type / Nullable / DescriptionIdentifier (integer) not null / Int, identity / Not null / Primary key, clustered index
Code / Varchar(20) / Not null / Reference list code
Description / Varchar(500) / null / Code description
Parent Identifier null / int / null / For hierarchical coding lists, this links to the parent level
Added Date / Datetime / Not null / Date that the row is added. Automatically assigned
Use SQL Server CTEs to derive the hierarchies and the level.
For hierarchical reference data, codes and their descriptions are held once and propagated to the denormalised table structure in the DWH – e.g. consider ICD10 and OPCS4 codes.
The drug file should be stored in a table similar to how it is received from the NHS IC:
Drug Linkage
Column / Data Type / Nullable / DescriptionDrug Linkage Identifier int, not null / Bigint, identity / Not null / Primary key, clustered index
FCC Code / Varchar(10) / null / FCC code
Drug Dispensed Date Time / Datetime / null / Date the drug is dispensed according to the Hosptial Pharmacy system
IMS Specialty Code / Varchar(10) / Null / From KF
Drug Quantity Amount / Numeric (10,4) / null / Drug amount from KF reference bridge file
Added Date / datetime / Not null / Date that the row is added. Automatically assigned
5.26.2Combination drugs, dual packs and macro id
The KF system records instances where packs are split up in the Hospital Pharmacy and a unique drug mix is created for administering to patients.
For combination drugs, KF will deliver the macro id (or duals id)to the NHS IC in the KF reference bridging file instead of the IMS FCC code.
The Bunker will receive a ‘duals’ mapping file from KF that maps macro ids to the constituent FCC codes.
As previously described, the Bunker will receive a drug file from the NHS IC as part of the regular quarterly deliverables.
The macro id is 10 characters in length, while the FCC code is 6 characters in length, thus enabling identification of ‘dual’ or combination drugs.
In the bunker, the macro id will be switched to the constituent packs.
The drug quantity amount value will be 1 for combination drugs in the drug file.