DDI – SRG

Memo

To: Structural Reform Group

From: Jostein R

CC:

Date: 20.01.2005

Re: Nested categories proposal

Scope

The aim of this document is to discuss issues related to the description of hierarchical dimensions (variables or classifications) in the DDI, against the background of the proposal to reintroduce nested categories as a supplement to the current category group method.

Background

Even the very first versions of the DDI, which were designed solely to describe micro-data, included a concept of hierarchical variables. This was, however, a fairly limited concept that let the user classify a list of categories into single-level groups (by means of a catgrygrp-element). It was meant to provide some support for variables that represent classifications (industry classifications etc.), but it allowed only a single level of groups above the categories (leaves).

Before the release of 1.0 the concept was extended to allow one catgrygrp-element to include another catgrygrp-element (using IDREFS to describe the relationships). This provided the very first mechanism in the DDI to describe multi-level hierarchical dimensions. However, the scope of this was still micro-data.

With the introduction of aggregated data support in the DDI, another way of describing hierarchical dimensions was introduced. This was nested categories, using standard XML nesting of catgry-elements to describe the hierarchical relationships (instead of catgrygrps and IDREFs).

The effect of this was that for a period of time we had two alternative and relatively useful ways of describing hierarchical dimensions. This was seen as confusing, so before the release of version 2.0 a decision was made that one of the methods had to go. The end result of the discussion was that nested categories were taken out of the specification and the catgrygrp-element was extended with several attributes to make it more useful for describing complex hierarchies. These were attributes like levelno and name, which make it easier to identify and describe levels, and compl and excl, which allow richer relationships to be described.

It is this decision that is the background for the current proposal to reintroduce nested categories (with some extension) and to add some recommendations concerning the appropriate use of the two methods.

Arguments used in favour of the two methods

There are several sets of arguments that have been used to justify the coexistence of two alternative methods to describe hierarchical dimensions in the DDI. It is necessary to analyse the validity of these arguments in order to make the right decision. For readability I will use the terms “category groups” and “nested categories” to describe the two methods.

A: It is impossible to describe n-level (n>2) hierarchies with the existing category group method

This is the main argument put forward in the proposal to justify the reintroduction of nested categories. As far as I can see, it is based on a misunderstanding of what can be done with the existing category group construct. The ability to include catgrygrps in other catgrygrps by means of IDREFs (introduced before the release of 1.0) is still valid and can be used to describe hierarchies of any depth.

This can be verified by tagging up the example used in the proposal by means of the existing category group method:

Example:

-Management, professional and related occupations (Catgry C1)
    -Management occupations (Catgry C2)
        -Top executives (Catgry C3)
        -Financial managers (Catgry C4)
    -Business and financial operations occupations (Catgry C5)
    -Computer and mathematical occupations (Catgry C6)
    -Architecture and engineering occupations (Catgry C7)
        -Architects (Catgry C8)
        -Engineers (Catgry C9)
    -Legal occupations (Catgry C10)
    -Education, training and library occupations (Catgry C11)
        -Teachers (Catgry C12)
        -Librarians (Catgry C13)

Tagging:

<catgryGrp ID="G1" catgryGrp="G2 G3 G4" catgry="C5 C6 C10" levelno="1">
  <labl>Management, professional and related occupations</labl>
</catgryGrp>

<catgryGrp ID="G2" catgry="C3 C4" levelno="2">
  <labl>Management occupations</labl>
</catgryGrp>

<catgryGrp ID="G3" catgry="C8 C9" levelno="2">
  <labl>Architecture and engineering occupations</labl>
</catgryGrp>

<catgryGrp ID="G4" catgry="C12 C13" levelno="2">
  <labl>Education, training and library occupations</labl>
</catgryGrp>

<catgry ID="C3">
  <labl>Top executives</labl>
</catgry>

<catgry ID="C4">
  <labl>Financial managers</labl>
</catgry>

<catgry ID="C5">
  <labl>Business and financial operations occupations</labl>
</catgry>

(and so on for the leaf-nodes C6, C8, C9, C10, C12 and C13)

In this example the top-level group (G1) includes 3 other groups and 3 categories. The 3 remaining groups include only categories.

All leaf-nodes are tagged as categories even though they might reside at a different distance (measured as n-of-levels) from the top-node.

All parent nodes are tagged as category-groups even if they might reside on the same level as other leaf-nodes.

Note that the DDI will allow all nodes, category groups as well as categories, to include category statistics. There is no semantic difference between category groups and categories in this respect.

The only semantic difference between a category group and a category is that the former has children.

Note that it is impossible to assign a level number and name to leaf-nodes (categories). This is a weakness in the current spec. The specification is, however, complete enough for a software process to construct the hierarchy.
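As an illustration of how a software process could construct the hierarchy from the group references, here is a minimal Python sketch. It assumes the groups are wrapped in a hypothetical var element, that the ID lists are separated by whitespace, and that the attribute names are those written in this memo (they have not been checked against the DTD). The function names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Fragment based on the example above. The wrapping <var> element is an
# assumption; attribute names follow the memo, not a verified DTD.
XML = """
<var>
  <catgryGrp ID="G1" catgryGrp="G2 G3 G4" catgry="C5 C6 C10" levelno="1"/>
  <catgryGrp ID="G2" catgry="C3 C4" levelno="2"/>
  <catgryGrp ID="G3" catgry="C8 C9" levelno="2"/>
  <catgryGrp ID="G4" catgry="C12 C13" levelno="2"/>
</var>
"""

def build_children(xml_text):
    """Map each group ID to the IDs it references (groups first, then categories)."""
    root = ET.fromstring(xml_text)
    children = {}
    for grp in root.iter("catgryGrp"):
        refs = grp.get("catgryGrp", "").split() + grp.get("catgry", "").split()
        children[grp.get("ID")] = refs
    return children

def find_roots(children):
    """Groups that no other group refers to are the top nodes."""
    referenced = {ref for refs in children.values() for ref in refs}
    return [gid for gid in children if gid not in referenced]

def depth_of(node, children, depth=1):
    """Return {node_id: depth} for node and everything below it."""
    result = {node: depth}
    for child in children.get(node, []):
        result.update(depth_of(child, children, depth + 1))
    return result

children = build_children(XML)
roots = find_roots(children)
depths = depth_of(roots[0], children)
print(roots)                       # ['G1']
print(depths["C5"], depths["C3"])  # 2 3
```

The leaf categories C5 and C3 end up at different depths, which is the situation described above: the tree can be rebuilt, but the leaf nodes themselves carry no level number or name.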

There is another deficiency that should also be noted. Category groups do not have a catValu-element. For categories this element is normally used as an identifier that can associate the category with a specific data value in a micro-data file or a particular dimension-coordinate in a cube or multidimensional file. Without a catValu-element it is difficult to connect the description of the dimension to the data.

B: Category groups should be used when you have conceptual classifications of categories and where these classifications are not represented in the data. Only categories that have a representation in the data should be called categories; any higher-level classification of these should be called category groups.

This is a secondary but still important argument used in the current proposal.

It should first be mentioned that any classification is ”conceptual” – it describes how concepts (and instances of concepts) are related. The argument here is, however, that only concepts that are represented in the data (physically) should be called real-life categories. All other concepts that represent groupings or aggregations of these categories should be called category groups.

The example given is micro-data where physical data only exists for the most fine-grained categories, but where it can still make sense to aggregate these categories (software-wise) to higher levels of aggregation. The proposal is that in these situations we should use flat category-lists to describe the “real categories” and category groups to describe how they could be aggregated.

This argument makes some sense (at least if you allow for category group hierarchies). Using category groups for this purpose will, among other things, make it easier to store multiple recoding/aggregation structures for a single variable (example: an age variable recorded in the data as single years, but with one or more associated age-group hierarchies). This is a very useful concept supported by a few software systems, among others CSPro. (Note that this will require a small extension to the category group method.) Multiple aggregation hierarchies can, on the other hand, never be described by nested categories.
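The age example can be sketched in a few lines of Python. This is only an illustration under assumptions: the data values, the grouping names ("broad", "school") and the cut points are all hypothetical, and the sketch says nothing about how the groupings would be tagged in the DDI.

```python
from collections import Counter

# Hypothetical micro-data: one age in single years per respondent.
ages = [3, 17, 18, 34, 35, 67, 70]

# Two alternative groupings of the same single-year categories.
# Labels and cut points are illustrative only.
GROUPINGS = {
    "broad": [("0-17", 0, 17), ("18-64", 18, 64), ("65+", 65, 150)],
    "school": [("0-5", 0, 5), ("6-17", 6, 17), ("18+", 18, 150)],
}

def aggregate(values, grouping):
    """Count values per group; each value falls into the first matching range."""
    counts = Counter()
    for value in values:
        for label, low, high in grouping:
            if low <= value <= high:
                counts[label] += 1
                break
    return dict(counts)

broad = aggregate(ages, GROUPINGS["broad"])
school = aggregate(ages, GROUPINGS["school"])
print(broad)   # {'0-17': 2, '18-64': 3, '65+': 2}
print(school)  # {'0-5': 1, '6-17': 1, '18+': 5}
```

The data are stored once, at the single-year level, while each grouping is a separate structure that refers to the same categories.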

However, it should also be mentioned that even for micro-data there will be situations where several levels of a hierarchical dimension are represented with physical data. An example of this is a single classification of occupations (that conceptually forms a nice hierarchy) where some individuals are classified according to the most fine-grained level, whereas others (due to lack of detailed information) are classified according to higher levels of the hierarchy. In situations like this it would not make sense to apply the proposed distinction between categories and category groups.

A variant of the same argument has been put forward in the discussions in the SRG (Wendy’s line of argument). This is related to requirements deriving from the description of existing tables (printed or digital), which will sometimes have data for higher levels of a hierarchy and at other times not. The proposed solution in this context is to use the category element for all categories represented by data and the category group element for higher-level aggregates where data is missing (please correct me if I am wrong, Wendy).

The problem with this argument is that the decision to populate the higher aggregation levels in a table (like sub-totals and totals) with data is a pragmatic design or lay-out decision made by the publisher of the table, a decision which does not necessarily have to be related to the logical properties of the table.

An example: A table based on counts (like population counts) does not need pre-aggregated numbers for higher levels (they can all be calculated at display-time by a relevant software process). However, when a table like this is printed or published as a digital table by a data producing unit, the aggregates will often be added for reading purposes. A table based on rates (like average income, % unemployed etc.) is logically different. Higher-level aggregates cannot be calculated from lower-level data (the data is non-additive) and will normally be added to the dataset as well as to the printed table.

Given our desire to remove display or design considerations from our metadata model, this argument seems to be undermined. We might even argue that the requirement can be met with a hasdata-attribute (default=yes) added to either the category-element or the category group element (whatever method we decide to stick with).

This brings us to another potential argument for having two different ways of describing hierarchical dimensions in the DDI: the logical difference between additive and non-additive tables/cubes. This is the argument I brought forward in the mail just before Xmas:

C: There is a logical difference between additive and non-additive cubes that might justify two different methods:

Additive cubes are traditional OLAP-cubes that only include counts. Counts can always and easily be aggregated by a software process to higher levels of a hierarchy through a standard roll-up process (the only logical exception being aggregation of stock-data across time). OLAP cubes will thus normally be populated with data only at the lowest level of granularity – all other aggregates being produced at run-time by the OLAP-engine.

Non-additive cubes, on the other hand, are cubes where the measure variables are calculated numbers like rates, percentages, indexes etc. These numbers cannot normally be aggregated to a higher level in the hierarchy by a software process unless all the components needed to calculate the higher-level numbers are present in the cube along with a specification of the calculation formula. The latter might be feasible for simple variables like voter turnout (requiring data about votes cast and eligible voters to be stored in the cube). For more complex data, especially international tables, where the data components come from a variety of sources and even the compute formula might differ across categories, this is simply not achievable. For this reason, non-additive cubes will normally be fully populated with data all the way up to the top level of the various hierarchies.
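The difference can be made concrete with the voter turnout example. The following Python sketch uses hypothetical figures for two districts that roll up to one region; the names and numbers are assumptions chosen for illustration only.

```python
# Hypothetical figures for two districts that roll up to one region.
districts = {
    "A": {"votes_cast": 900, "eligible": 1000},
    "B": {"votes_cast": 2000, "eligible": 4000},
}

def roll_up(cells, measure):
    """Additive measure: the parent value is the sum of the child values."""
    return sum(cell[measure] for cell in cells.values())

def turnout(votes_cast, eligible):
    """Non-additive measure: a rate calculated from two additive components."""
    return votes_cast / eligible

# Counts can be rolled up directly.
region_votes = roll_up(districts, "votes_cast")    # 2900
region_eligible = roll_up(districts, "eligible")   # 5000

# The rate for the region must be recalculated from the rolled-up components.
region_turnout = turnout(region_votes, region_eligible)   # 0.58

# Averaging the district rates gives a different (and wrong) figure,
# roughly 0.7, because the districts differ in size.
naive_average = sum(turnout(**cell) for cell in districts.values()) / len(districts)
```

Without the two components and the formula, the region value of 0.58 cannot be derived from the district rates of 0.9 and 0.5 alone, which is why such cubes are normally populated at every level.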

It might be argued that additive cubes are best described by the category group method, whereas non-additive cubes are best described by nested categories. However, on inspecting the differences between the two data models in further detail, even this argument seems to fall apart.

Firstly, in most real-life OLAP-like applications even additive data will be pre-aggregated to the higher levels of the hierarchy for efficiency reasons (to offload the OLAP engine at run-time). This is normally done at publishing time. There are thus no real differences between leaf-nodes and parent nodes that justify the use of two different elements.

Secondly, and more importantly, additivity is a logical property of a measure and not of a cube (or its dimensions). Given that the ncube element supports multiple measures (in line with real-life multidimensional data structures), we will easily have situations where a single cube consists of additive as well as non-additive measures (like a table including the number of unemployed (a count) as well as the unemployment rate). Additivity can thus not be used as an argument to justify the coexistence of two different ways of describing hierarchical dimensions.

An argument which I believe has never been brought forward, but which should nonetheless be analysed, is the difference between ragged and regular hierarchies.

D: Ragged hierarchies represent a more complex model than regular hierarchies and might justify the use of a different metadata construct

A ragged hierarchy is a hierarchy where the leaf-nodes reside at different distances from the top-node (measured as n-of-levels), and where nodes that belong to the same conceptual level (like cities in a geographical hierarchy) might reside at different distances from the top-node. In other words, this is an unbalanced tree where levels might be missing for certain branches.

The hierarchy from the nested category proposal described under argument A above is thus a simple example of a ragged hierarchy. So how are the two methods able to handle these complex hierarchies?

As far as I can see, the hierarchical relationships of a ragged hierarchy can be fully described by both methods.

However, the category group method seems to be better equipped to add semantics to the various levels of a ragged hierarchy. By this I mean that a city can be assigned to the level of cities even though its distance to the top-node (measured as n-of-levels) might be different from that of other cities. The additional problem remains, of course, that the leaf-nodes do not have level semantics in this model.

The nested category proposal also includes a new way of describing the levels of the hierarchy, by introducing a new catlevel element that numbers the levels from top to leaf. This proposal will simply not be able to handle ragged hierarchies where nodes belonging to the same conceptual level might have different distances from the top.
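The contrast between levels counted from the top and levels assigned explicitly can be sketched in Python. The geography below is hypothetical, and the sketch assumes that a level name can be attached to every node, including leaf nodes, which (as noted above) the current category group method does not allow for categories.

```python
# Hypothetical ragged geography: City B sits directly under the country,
# with no region level on its branch.
PARENT = {
    "Country": None,
    "Region North": "Country",
    "City A": "Region North",
    "City B": "Country",
}

# Explicit level assignment per node (assumed to be available for all nodes).
LEVEL_NAME = {
    "Country": "country",
    "Region North": "region",
    "City A": "city",
    "City B": "city",
}

def depth(node):
    """Distance from the top node, counting the top node as level 1."""
    level = 1
    while PARENT[node] is not None:
        node = PARENT[node]
        level += 1
    return level

def nodes_by_depth(level):
    """Levels numbered from top to leaf: group nodes by their depth."""
    return sorted(name for name in PARENT if depth(name) == level)

def nodes_by_level_name(level_name):
    """Levels assigned explicitly: group nodes by their declared level."""
    return sorted(name for name, value in LEVEL_NAME.items() if value == level_name)

print(nodes_by_depth(2))            # ['City B', 'Region North']
print(nodes_by_level_name("city"))  # ['City A', 'City B']
```

Numbering from the top places City B on the same level as Region North, whereas an explicit assignment keeps both cities on the city level regardless of depth.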

For completeness, we should also add the following argument to the discussion:

E: In some hierarchies a single node might belong to multiple parents. This adds complexity and might justify the use of a different metadata construct.

Nodes (typically leaf-nodes, but sometimes also parent-nodes) might in some cases belong to more than one parent. Examples of this can be found in standard disease classifications, where a single diagnosis can belong to more than one higher-level diagnosis. It can also be found in product classifications where, as an example, a specific cell phone can be classified as a phone as well as a digital camera.
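The cell phone example can be sketched as a reference-based structure in Python. The group and product identifiers are hypothetical. The sketch only shows that, when parents refer to children by ID, one node can be listed under several parents while being defined once; under strict XML nesting an element has exactly one parent, so the same node would have to be repeated.

```python
# Hypothetical product classification; identifiers are illustrative only.
GROUP_MEMBERS = {
    "phones": ["camera_phone", "landline"],
    "digital_cameras": ["camera_phone", "compact_camera"],
}

def parents_of(node, group_members):
    """All groups that list the node as a member."""
    return sorted(group for group, members in group_members.items() if node in members)

def distinct_nodes(group_members):
    """Each member counted once, however many groups refer to it."""
    return sorted({member for members in group_members.values() for member in members})

memberships = sum(len(members) for members in GROUP_MEMBERS.values())

print(parents_of("camera_phone", GROUP_MEMBERS))  # ['digital_cameras', 'phones']
print(len(distinct_nodes(GROUP_MEMBERS)), memberships)  # 3 4
```

Three distinct nodes account for four memberships, the extra one being the node with two parents.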