Data warehousing – defining outputs and functions
Consider first data appearing from nowhere:
Datasets go into the data warehouse (DWH) as does the business register (BR). The DWH in this picture holds data and allows it to be extracted. The DWH in this model is a technical facility. The DWH is used to produce aggregate statistics and microdatasets, which are used for further analysis.
Now consider the structure of the DWH in more detail:
For the purposes of this discussion we can summarise the DWH as having
- a staging area where inputs are cleaned/integrated
- a working area where outputs can be produced
The DWH also includes snapshots of the business register (possibly several of them) but these are datasets in the DWH like any other; that is, they can be analysed, and can be combined with other datasets to produce outputs. Hence they are coloured the same, above. the snapshots held in the DWH for analysis are not updated in the DWH.
Now consider the outputs from the DWH in more detail:
In this picture we note that one of the outputs from the DWH is data extracts. These are technically identical as the ‘microdata’ we noted might be produced for further analysis, but conceptually we call these something different because we want to identify these types of output as providing information to help the statistical system.
Take the Labour Force Survey as an example. By ‘microdata’ we could mean a de-identified dataset intended for release to researchers under licence. A ‘data extract’ could be identified microdata used for a follow up survey.
As another example, look at the register snapshops. ‘Microdata’ would be anonymised until record information for analysis. ‘Data extracts’ could be a specific identified sample to collect data from. However, before we take that leap forward, consider where the datasets come from:
Data could either come from (a) a sample, selected and then surveyed (b) administrative data (c) both.
For arguments sake here we have assumed that the business register is fed entirely by admin data. But the business register is also potentially updated from survey responses. How does this relate to the DWH – should there not be a two-way link?
The answer is no, because we’ve already agreed that one of the outputs from the DWH is a data extract – which could be information from a survey. What we then need are rules on how to update the register:
So, the BR does not have a two-way relationship with the DWH. The BR generates snapshots which go into the DWH like any other dataset. The DWH generates data extracts, which might be useful for updating the BR, but might not. The question of how the extracts are used is not part of the functionality of this technical facility called the DWH. The BR updating process is independent of the functioning of the DWH – all that matter is that the necessary extracts can be produced.
So we now have an almost complete view of the statistical process – but only almost. Where do the samples come from? Again, the answer is to consider the use of DWH outputs:
As for the BR, consider that the DWH produces data extracts. The fact that this extract is an identified sampling frame rather than a dataset for more analysis is irrelevant to the running of the DWH. The DWH just needs to be able to produce an extract; and the survey managers just need an extract with relevant information, not any idea of how the data is produced.
Where does this leave the ‘data warehouse’? The DWH above was described as a ‘technical’ facility. Now let us consider what might be the domain of the ‘statistical data warehouse’ and how this fits into the scope of the ESSNet:
The shaded areas on the above diagram could be consider the ‘statistical data warehouse’ (SDWH): that is, the SDWH comprises
- technical facilities for storing and processing data, receiving data in and producing outputs in a flexible way
- rules for updating the sources for the DWH
- rules for generating samples
- definitions necessary to achieve those sample/source generation
- the data flow model
Finally, what is the difference between ‘statistical data warehouse’ and the ‘data warehouse architecture? I would argue that this is the difference between how and what:
- statistical data warehouse: what are we doing and why?
- data warehouse architecture: how do we do this?
but that both of these cover the same domain.