Analysing HES data session one
Analysing Hospital Episode Statistics (HES)
Practical Session 1
Available to download from:
Saturday, 31 January 2009
Written by: Robin Beaumont with the help of Steve Price
Status: V1.0
Contents
1.Acknowledgements / Contacts......
2.Required Resources......
3.Aims of the session......
4.Starting SPSS and loading a data file......
5.The SPSS Data Editor Window......
6.The fictitious Trauma & Orthopaedic data set......
6.1How can I tell what fields there are in the dataset?......
6.2How do I find out about the codes in the various fields?......
6.2.1International codes used in the dataset ICD9 and OPCS4......
6.3Seeing Value labels in Data view......
6.4A strategy for analysing data......
6.4.1How many records are there?......
6.4.2How many records are there for each consultant?......
6.5What type of patient does each consultant have and are they different?......
6.5.1How do I compare proportions/ counts across consultants?......
6.5.2How do I exclude certain cases from an analysis?......
6.5.3How do I compare age distributions of patients across consultants?......
6.5.4Preparing the ground for a Length Of Stay (LOS) analysis......
6.5.5How do I find the number of records left after applying a filter?......
6.5.6How do I compare Lengths Of Stay (LOS) across consultants......
7.How can HES help in a small clinical research project?......
8.Summary......
1.Acknowledgements / Contacts
Please see the document Introduction to HES at
2.Required Resources
- SPSS
- tando1.sav file saved locally so that you can load it into SPSS
- The HES Book – Provides a description of each fieldin the 1994 dataset, which is used in this handout
- Current HES Data Dictionary
Certain fields in the dataset make use of internationally recognised clinical codes (OPCS and ICD)
- OPCS-4 codes for the dataset have been provided in a file. To find a particular OPCS4 code use:
- ICD9 codes for the dataset have been provided in a file.
3.Aims of the session
This session has several aims which can be thought of as learning outcomes for yourself. I would hope that by the end of the session you would have gained:
- A working knowledge of the structure of the main fields and codes found in the HES data set
- Knowledge of the process one goes through when confronted with a large data set
- Skills in using SPSS for basic analysis of HES
- Realise how HES can help with clinical research projects
- It is important to realise that this session should only be considered to be a basic introduction to HES.
- The workbook takes the form of a series of problems, each presentedas a question which you will attempt to answer.
Before you start this workbook you should have worked through the introductory document:
Introduction to NHS Information Systems and Hospital Episode Statistics (HES)
available from
4.Starting SPSS and loading a data file
Start up SPSS, either by installing it locally on your machine or by using a copy on a remote server (for example using VPN or a desktop anywhere facility)
Once SPSS starts you are asked what you would like to do. Double left mouse click on the option “More Files . . “ Picture opposite.
This will bring up the open file Dialogue box (shown below). Find the file named tando1.sav and load this datafile into SPSS
You have now loaded the dataset you are going to analyse in this session into SPSS. Depending upon how SPSS is set up you will now see two or three windows:
1. The Data Editor which has two tabs allowing you two views:
Data View in which you can see the raw data, where you and can add, edit or delete data
Variable View where you can inspect and edit the names and characteristics of the variables (fields).
2. A output window where you get your results – more about that latter.
The screen shot on the next page shows the Data Editor window in Data view.
5.The SPSS Data Editor Window
6.The fictitious Trauma & Orthopaedic data set
The dataset we have loaded consists of 2,653 records from a fictitious Trauma and Orthopaedic department. The data was collected over the period of a year from the 31st of May 1994. The data is a sample of the year and does not contain all the records. However it has been selected in such a way as to reflect reality.
When you are first presented with a large file it always feels fairly daunting. However the secret is to break the data up into manageable chunks. This is usually achieved firstly by carrying out various procedures which describe the data in a more concise, digestible way. However before you can start to do this it is necessary to become familiar with the fields, along with the codes each may take, in the dataset. Lets consider some basic questions.
6.1How can I tell what fields there are in the dataset?
Choose the menu option File -> Display Data File Information. This produces a list of the fields in the output window which you can then print out.
6.2How do I find out about the codes in the various fields?
Traditionally code details for fields were kept in a paper document called a coding frame which was a type of data dictionary. These are still used in questionnaire and database design, often being one of the most important documents. If a field only has a few codes these can be defined in the SPSS data set. An example of this is given below.
One of the fields in the dataset is called ADMIMETH. This field provides details of the types of admission, each value in the field represents a particular type of admission. To find out what the values mean follow the steps below in the diagram.
You should notice that for the variable ADMIMETH several codes (i.e. 'labels') have been defined. These are nationally agreed codes for that particular field. In fact all the fields in the HES dataset have nationally (or even internationally) agreed standards which includes defining a set of codes for each. You will have been given a document listing a few of these. Alternatively you can go to and select the field you want more information about. Notice that this web site has the up to date codes (Auguest 2007), whereas our dataset is using codes from 1994, thirteen years ago!
Exercise 1
Go to the above web site and compare the current codes for ADMIMETH with those used in our dataset (using “The HES Book”). Is there much difference? Is this a good or bad thing?
6.2.1International codes used in the dataset ICD9 and OPCS4
The dataset also uses the following internationally agreed codes:
ICD
OPCS
Exercise
1.If you do not know about these coding / classification systems Spend some time doing a google search to find out about them, you might like to discuss your findings on the discussion board.
You will have discovered that over the years different versions of these codes have been developed, we are up to ICD-10 now whereas OPCS 4 has had a much slower development.
This is reflected in our 1994 HES dataset which makes use of ICD9 and OPCS4, whereas the current HES data makes use of ICD10 and OPCS4.3 (?).
6.3Seeing Value labels in Data view
You can also see the various value labels (codes) for ADMIMETH. This field provides details of the types of admission, each value in the field represents a particular type of admission. To find out what the values mean follow the steps below in the diagram
We so far have just investigated one field, the following two exercises will help you become more familiar with some other fields in the dataset.
To complete the exercise you will need both access to the website mentioned above and a list of the HES fields used in 1994 which has been provided in another document for you.
Exercise
- What is the Diag_1 field? What standard type of coding does the field use? How does this compare with the Diag_nn field now used.
- What is the purpose of the ADMIMETH field?
3. What type of code does the OPER_1 field use? What does this tell you? What do you think of the codes? ? How does this compare with the OPER_nn field now used.
4 Explain the STARTAGE field. How does the field store information about children who are less than one year old? Has the method of coding changed over the thirteen years
5. In the sex field what are the values one would expect to find? Has this changed over the last thirteen years?
I'm sure several of you are already thinking of additional fields you would like to have available. The important thing to realise is that the data presented here is already collected nationally. You should therefore build your own requirements from this foundation rather than reinventing the wheel. Why go through the process of re-collecting data that is already available.
6.4A strategy for analysing data
The classic method of hypothesis testing and cursory data analysis based upon one or more significance tests is of limited usefulness when it comes to this type of data analysis. Often one only begins to realise what both the possibilities and limitations of a dataset are after a large amount of exploration has taken place. The process of getting to understand the dataset should be the first stage in any type of analysis. This is often called Exploratory Data Analysis (EDA).
Exploratory data analysis involves looking at the data using both graphical and tabular techniques. Histograms, boxplots and scattergraphs of different variable(s) often for particular groups of records (e.g. a diagnosis or age range etc.) allow immediate visual 'eyeballing' of the data. They aid understanding along with summary statistics such as the mean, median and mode to describe the centre of each field in the dataset along with the range, interquartile range and standard deviation for a description of the spread of values for each field encountered. We will use these techniques to look at the data.
6.4.1How many records are there?
The easiest way to find this out is to scroll to the bottom of the file using the scroll bar on the right hand ride of the data window. Another way is to use the menu option Analyse -> reports -> Case Summaries, and follow ther instructions below
.
In the output window you will then see:
Case Processing Summary
CasesIncluded / Excluded / Total
N / Percent / N / Percent / N / Percent
admidate / 2653 / 100.0% / 0 / .0% / 2653 / 100.0%
6.4.2How many records are there for each consultant?
To find out the number of records (i.e. consultant episodes) each consultant has you need to carry out what is called in SPSS, a frequency count, for each consultant ID in the CONSC* field.
Exercise:
Choose the Menu option Analyze -> Descriptive Statistics -> frequencies then add the CONSC* field to the variables box. You can request a bar chart , histogram or pie chart of the data by clicking on the charts button. Select the bar chart option then click the continue button and finally click 'OK' on the frequencies dialogue box. The appropriate results will be shown in the output window, in various formats depending upon which version of SPSS you have.
The bar chart shown opposite is not particularly wonderful but it serves the purpose. The chart can be fully edited within SPSS or you can copy it to a document and partially edit it in that. A table is also produced, of which an example is given below.
Clearly consultant '1' has over twice as mainly episodes as the other two in the same time period. There may be many reasons for this. Possibly the other two consultants work part time, or in another specialty, consultant '1' may be more careful with his patients in some way or has fitter patients. Considering organisational factors Consultant '1' may be allocated far more beds or have a larger team of supporting clinicians. Some of these questions can be answered looking further at the data while others require additional information collecting. This type of debate 'with the data' is one of the exciting aspects of examining the data in this way - EDA - in that is often creates a large amount of very valuable debate and encourages further more detailed research. The following sections attempt to demonstrate this process for the problem raised above namely why does consultant 1 have twice as many episodes as the others. It basically comes down to examining resource/demand factors.
6.5What type of patient does each consultant have and are they different?
Many of the possible causes, described in the above paragraph, for the discrepancy between consultants can not be investigated in the data set.
However some of the demand factors can be investigated including the possibility that the discrepancy is the result of differences in the type of patient seen. Before this can be investigated however it is necessary to consider what one actually means by 'type of patient' along with how and if it can be measured. It seems reasonable to me to consider that 'type of patient' in this context can be measured by the patients age, diagnosis, procedures carried out and length of stay (equating this approximately to illness severity). Arguably a severity index, including a functional assessment score, would be more useful but we will make use of what we have.
From the dataset we can investigate each of the following possible causes:
- Sex differences - One consultant may have a different sex proportion to the others.
- Age distributions - One consultant may take younger patients
- Length of stay - The consultants may have different lengths of stay
- Diagnosis - The consultants may have a different mix of diagnoses
- Procedures - The consultants may carry out different procedures.
Although more sophisticated data analysis techniques could consider several of these factors together we will investigate each in turn as this demonstrates the process more clearly.
It should also be pointed out that the dataset contains other fields that may help to shed light on the problem which we will ignore in this analysis.
6.5.1How do I compare proportions/ counts across consultants?
Lets start with the number of patients of each sex. We will produce a table showing the actual and expected values for each of the consultants and carry out a Chi square test at the same time. This test tells us if the observed differences in the proportions across consultants is due to sampling (i.e. random variability), or is it beyond what we are prepared to accept as random.
An aside - What is Sampling Error
Imagine that you have a jar with 10,0000 marbles in it half red, half white. If you randomly take out 10 marbles in the long run you will get 50% white and 50% red balls but each time you take 10 marbles from the jar you will frequently get uneven numbers of the different colours, in fact you probably will occasionally get almost all the marbles of the same colour, this variability is the sampling error that we are taking account of with statistical tests such as the Chi Square test.
Choose the menu option Analyze -> Descriptive Statistics -> Crosstabs
You will then be presented with the following crosstabs dialogue box:
Move CONSC* into the rows box and SEX into the column(s) box.
To format appropriately the information in each of the tables' cells click on the 'Cells' button. This brings up the Crosstab: Cell display dialogue box Click the Expected counts and the No adjustmentsoptions as shown below:
When you have set the options click the 'Continue' button in the 'Crosstabs: Cell Display' dialogue box. You should be back in the crosstabs dialogue box now. Select the 'Statistics' button and choose the Chi square option in the 'Crosstabs: Statistics' dialogue box as shown below; again press the 'Continue' button after you have set the option:
You are now ready to produce the required table. To do this click on the 'OK' button in the 'Crosstabs' dialogue box.You will then obtain the following set of results:
consc * sex Crosstabulation
sex / Total1.00 / 2.00
consc / 1.00 / Count / 700 / 709 / 1409
Expected Count / 676.9 / 732.1 / 1409.0
2.00 / Count / 314 / 357 / 671
Expected Count / 322.3 / 348.7 / 671.0
3.00 / Count / 260 / 312 / 572
Expected Count / 274.8 / 297.2 / 572.0
Total / Count / 1274 / 1378 / 2652
Expected Count / 1274.0 / 1378.0 / 2652.0
It would appear from the first table that the observed and expected counts (calculated by assuming that the proportions are the same for each consultant) are very similar, But how can we quantify our belief in this similarity? In other words are the observed small differences due to random sampling or is there an actual difference between the counts for each of the consultants?
The way we do this is to assume that there is actually no difference other than that due to random samplingand then work out a probability of obtaining a result for a particular test statistic based on our data, in this case the test statistic is the Pearson Chi-square statistic taking into account that there is no real difference (i.e. no difference other than that due to random sampling). This belief that our observed differences between proportions in males/females between consultants is purely due to random sampling is called the null hypothesis.