Sharing and Re-Use of Classification Systems: the Need for a Common Data Model

Aida Slavic

University College London, U.K.,

Maria Ines Cordeiro

Art Library, Calouste Gulbenkian Foundation, Lisbon,

Abstract: Classifications can help to overcome difficulties in information retrieval of heterogeneous and multilingual collections for which linguistic and free text searching is not sufficient or applicable. However, problems with the machine readability of classification systems hinder their wider use and full exploitation. The authors focus on issues of automation of analytico-synthetic classification systems such as Universal Decimal Classification (UDC), Bliss Bibliographic Classification (BC2) and Broad System of Ordering (BSO). 'Analytico-synthetic' means classification systems that offer the possibility of building compound index/search terms and that lend themselves to post-coordinate searching.

INTRODUCTION

Classification systems can support adaptation of subject browsing to specific users’ profiles, a specific collection’s content or particular needs (e.g., the choice of hierarchical presentation) or they can be used to support alternative subject access when used to map between different indexing languages (Hodge, 2001). Besides the usual support for linear subject browsing, many portals and gateways are looking into classifications to underpin different interactive information retrieval techniques. The specific advantage of using faceted, i.e., analytico-synthetic classifications for resource discovery on the Internet was, for instance, put forward by Ellis & Vasconcelos (1999, 2000), Devadason et al. (2002) and Devadason (2003). Analytico-synthetic classifications can enable post-coordinate searching by providing the structure for combining concepts (entities), their characteristics (attributes) and their associations (relationships) in friendly ways that do not require users' previous knowledge of the system. In modern IR, for example, classification systems have been exploited to build semantic tools for both term control and graphical subject representation.

General library classification systems provide vocabularies with established hierarchical, collateral and coordinated relationships, linking subjects that are distributed across the whole universe of knowledge in highly formalized ways. It is a characteristic of library classifications that they serve to convey an 'objective' view of knowledge, based on given scientific or academic consensus. This is why they are often regarded as too 'rigid' to accommodate new knowledge. Analytico-synthetic classifications, however, are equipped to deal with 'rigidity' as they are usually based on facet analysis, which provides mechanisms to accommodate new concepts, express complex subjects and establish relationships across disciplines.

THE PROBLEM WITH DATA STANDARDS

When a classification system is made available electronically as simple text[1] and not as structured, machine-readable data, it is very difficult to fully manage and control concept hierarchies and vocabulary facets and sub-facets, which are important structural units of any analytico-synthetic classification. This is a major drawback for the full automation of 'classmark building' in the process of indexing and for the exploitation of classification in IR for search expansion, facet browsing, post-coordinate searching etc. If the definition of adequate data representation is left to each specific application and implementation, solutions are both costly and limited. Most importantly, a localized approach does not solve the problem of exchange and re-use of classifications, which is paramount for networked information services. The way forward is to standardize the data structures used to automate classification systems, so that they can share a common data vehicle. UDC, for example, pioneered distribution in an electronic structured format, yet the format is proprietary and thus requires specific knowledge and conversion routines to integrate the data in different systems [2].

A MARC format for classification data has been available since 1992 (see Guenther, 1992; USMARC format for classification data, 1997; MARC 21 Concise Format for Classification Data, 2003). However, as it was designed for two particular enumerative systems (Library of Congress Classification – LCC and Dewey Decimal Classification – DDC), the underlying data model did not consider aspects that are specific to the structure, principles and rules of analytico-synthetic classifications. The usability of this format has been restricted to systems that implement the MARC 21 family of formats and that have a simple enumerative hierarchical structure similar to LCC or DDC. The format thus has limitations for the encoding of features and functionalities particular to classifications such as UDC or BC2. Although any analytico-synthetic classification can be used in a simple enumerative way, this results in a loss of the functionality that is linked to facet analysis and concept synthesis. These aspects were recently discussed during the preparation of a UNIMARC Format for Classification Data (see Concise UNIMARC Format for Classification Data, 2000). In both cases, MARC 21 and UNIMARC, the objectives have been limited to the authority control function of class numbers as they are presently used in bibliographic systems. Here the classmarks have been treated merely as text strings, i.e., without taking into account the need for machine readability of the semantic structure of complex and compound class numbers. This is why the majority of library OPACs are still unable to make full use of classification data to improve subject access functionalities.

Addressing the problem of data structures for classification systems requires, as a first important step, an analysis of the commonalities and differences of classification systems at the different levels to which a common data standard would apply. This stage is essential in devising a common data model. Only then can a standardized transport and exchange format, with all its details, be successfully developed.

Figure 1: The levels of classification systems that need to be mapped in order to create common standards for their automated data exchange

THE CASE OF ANALYTICO-SYNTHETIC CLASSIFICATIONS

The main characteristics of analytico-synthetic classifications rely on the structure that keeps concepts in a logical set of mutually exclusive hierarchies (facets) and that provides the logic for their sorting (filing) and for the combination of their elements (citation order). Instead of offering a range of ready-made subject classes defined a priori, analytico-synthetic classification offers concepts/units of vocabulary that may be freely combined in the indexing stage as well as in the retrieval stage. This concept-based approach was first used in the Universal Decimal Classification in 1905 and it was later 're-created' in 1930 into a logical and sophisticated system built on the postulates of facet analysis by Ranganathan. The theory of facet analysis was accepted, further developed and became widely used by the UK Classification Research Group (UK CRG).

The term faceted is often used as a synonym for all analytico-synthetic classifications. In the theory of library classifications, however, faceted is used in a very strict sense and only when associated with classifications based on theoretical postulates of fundamental facet categories: "personality-matter-energy-space-time" (Ranganathan) or "Things (Products)-Kinds-Parts-Materials-Properties-Operations-Agents" (UK CRG). In some systems, however, the term 'faceted' is used to describe any kind of vocabulary that consists of facets of mutually exclusive categories that may not be directly related to the theoretical postulates mentioned above. Typical examples are thesauri such as the AAT [3] or faceted vocabularies of different kinds used in commercial portals and information gateways. There are also good examples of analytico-synthetic classification systems that do not strictly comply with Ranganathan's or the CRG's postulates, as is the case with the UDC.

The focus of this paper is on the functional intelligence of analytico-synthetic vocabulary which enables different classes of concepts to be related, indexed and searched using different attributes or relationships, provided that they can be encoded and machine processed as such. For analytico-synthetic classifications such as UDC, BC2 and BSO it is possible to identify the following common structural features (see Figure 2):

- the organization of vocabularies of subject fields/disciplines into facets and subfacets, so that concepts can be combined to produce an unlimited number of compound subjects (e.g. agent+operation+product, agent+property)

- the isolation/separation of the vocabulary of generally applicable concepts (also known as common isolates) such as time, place, form, persons, languages etc., which serve to specify any given subject

- the existence of codes for relationships (also called role operators) that can be used to link concepts from different disciplines in building complex subject expressions (influence, comparison, application, bias etc.)

- the existence of rules for the ordering of facets in pre-combined expressions (citation order)

Figure 2: Macrostructure of analytico-synthetic classifications
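
The four structural features listed above could be captured in a small data model. The following is a minimal sketch in Python, not a proposed standard: the names Concept, Scheme, add and cite are hypothetical, and the facet names in the usage note are borrowed from the UK CRG categories purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Concept:
    notation: str   # classmark fragment as printed in the schedule
    caption: str    # verbal description of the concept
    facet: str      # name of the facet the concept belongs to


@dataclass
class Scheme:
    """Hypothetical container for the macrostructure of one classification."""
    name: str
    # Feature 1: subject vocabulary organized into facets.
    facets: Dict[str, List[Concept]] = field(default_factory=dict)
    # Feature 2: generally applicable concepts (time, place, form, ...).
    common_auxiliaries: Dict[str, List[Concept]] = field(default_factory=dict)
    # Feature 3: codes for relationships between subjects, keyed by symbol.
    relation_operators: Dict[str, str] = field(default_factory=dict)
    # Feature 4: order in which facets are cited in a compound expression.
    citation_order: List[str] = field(default_factory=list)

    def add(self, concept: Concept) -> None:
        """File a concept under its facet."""
        self.facets.setdefault(concept.facet, []).append(concept)

    def cite(self, concepts: List[Concept]) -> List[Concept]:
        """Arrange the concepts of a compound subject in citation order."""
        rank = {facet: i for i, facet in enumerate(self.citation_order)}
        return sorted(concepts, key=lambda c: rank[c.facet])
```

With a scheme whose citation_order is ["Things", "Operations", "Agents"], calling cite on concepts supplied in any order returns them with the Things concept first and the Agents concept last. A concept whose facet is missing from citation_order raises a KeyError, which in this sketch doubles as a crude consistency check.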

An important quality of these classifications is that a reasonable number of general rules can be established for the entire system and can be used to handle, manage, validate and control concept combinations. However, the systems differ in how far they exploit these possibilities. Two extremes are UDC, which has general rules that allow the synthesis of defined vocabularies on all levels, and BC2, whose rules allow combinations of facets within a single subject but restrict combination with general concepts and offer no possibility of freely combining subjects from different disciplines. For automation, i.e., for establishing machine-readable conventions, these aspects are important, and the degree of difficulty depends on whether the underpinning rules are consistent throughout a system and whether the types and attributes of the elements of vocabulary are made explicit or not.
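
System-wide rules of this kind lend themselves to being declared as data and checked mechanically. The sketch below, in Python, reduces a scheme's synthesis rules to two boolean switches; this reduction, and all names used (Component, SynthesisRules, validate), are simplifying assumptions made for illustration, and the actual rules of UDC or BC2 are considerably richer.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Component:
    notation: str
    discipline: str    # subject field the concept comes from
    is_general: bool   # True for generally applicable (common) concepts


@dataclass(frozen=True)
class SynthesisRules:
    # Hypothetical, highly simplified scheme-level switches.
    allow_cross_discipline: bool
    allow_general_concepts: bool


def validate(components: List[Component], rules: SynthesisRules) -> List[str]:
    """Return the rule violations; an empty list means the combination is valid."""
    problems = []
    disciplines = {c.discipline for c in components if not c.is_general}
    if len(disciplines) > 1 and not rules.allow_cross_discipline:
        problems.append("combines subjects from different disciplines")
    if any(c.is_general for c in components) and not rules.allow_general_concepts:
        problems.append("combines with generally applicable concepts")
    return problems
```

A fully permissive profile, SynthesisRules(True, True), accepts any combination, in the spirit of the first extreme described above. A profile with allow_cross_discipline=False rejects a compound whose components come from two disciplines, in the spirit of the second.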

ISSUES OF DATA REPRESENTATION

Although analytico-synthetic classifications may have common structural elements, as explained above, classification systems deploy different levels of formality in recording, expressing and presenting their structures. Representation of concepts in a classification can be done in a highly formalized and predictable way so that the semantic and syntactic relationships are made obvious. This is achieved through so-called 'expressive' notations, which can denote not only the position of a concept in a hierarchy but also the position of a hierarchy with respect to facets. If the notational system is expressive, the number of representational symbols is coextensive with the concept hierarchy (the more specific the concept, the longer the notation). If the notational system uses facet indicators, synthesized classmarks will contain a facet indicator for each concept in the combination. Because of this, an expressive notation tends to be longer, but also more structured and more complex. This is illustrated in Figure 3, with BC2 representing an analytico-synthetic classification with non-expressive notation, while UDC shows the most typical characteristics of expressive notation.

BC2 SYNTHESIS                UDC SYNTHESIS
PB                           2-2        [spec. auxiliary table]
PBK     Sacred books         2-24       Sacred books
…                            …
PIJ     Vedic religion       231        Vedism
PIJ BK  Sacred books         231-24     Sacred books
PIJ C   Samhitas             231-242    Samhitas
PIJ D   Trayi Vidya          231-242.2  Trayi Vidya

Figure 3: Comparison of notation for BC2 and UDC

The class hierarchy Sacred books - Samhitas - Trayi Vidya is not explicit in BC2 notation, whereas it is fully recorded in UDC. In BC2, two classmarks (from two facets) are synthesized by omitting the two letters BK in front of C and D to make the notation shorter (the general rule is to omit only the first letter when two classes are combined). Also, there is no indicator that the BC2 classmark is a combination. In UDC, Sacred books is listed in the facet Evidences of religion, which is formalised as a special auxiliary table in class 2 Religion that starts with the indicator -2. The indicator of combination is '-' (hyphen), which precedes every combination with facets from special auxiliary tables, and this rule is applicable to the entire system.
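
Because the hyphen indicator and the expressive notation are regular, a program can recover both the components and the hierarchy from a UDC classmark itself. The following minimal Python sketch handles only the pattern shown in Figure 3, a main number followed by one hyphen-introduced special auxiliary. The function names are hypothetical, and two assumptions are made: that each added digit marks one step down the hierarchy, and that the dot is punctuation placed after every third digit. The many other symbols of full UDC notation are not handled.

```python
from typing import List, Tuple


def split_classmark(classmark: str) -> Tuple[str, str]:
    """Split at the first hyphen into (main number, special auxiliary)."""
    main, sep, aux = classmark.partition("-")
    return main, (sep + aux if sep else "")


def broader_auxiliaries(aux: str) -> List[str]:
    """List the auxiliary and its broader classes, most specific first.

    Assumes each added digit is one hierarchical step and that the dot
    is punctuation only, re-inserted after every third digit.
    """
    digits = aux.lstrip("-").replace(".", "")
    chain = []
    for length in range(len(digits), 0, -1):
        prefix = digits[:length]
        grouped = ".".join(prefix[i:i + 3] for i in range(0, len(prefix), 3))
        chain.append("-" + grouped)
    return chain
```

For the classmark 231-242.2, split_classmark returns ("231", "-242.2"), and broader_auxiliaries("-242.2") returns ["-242.2", "-242", "-24", "-2"], which corresponds to Trayi Vidya, Samhitas, Sacred books and the special auxiliary table in Figure 3. No comparable truncation works for the BC2 classmark PIJ D, since its notation does not record the hierarchy.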

In BC2 the primary requirement for notation is to have an "ordinal value", as its only purpose is to "maintain the order of classes in a mechanical fashion", while it seems appropriate "to allocate notation purely with a view to its maximum effectiveness in visual scanning and to assume that any problems that this raises for machine use are best left to the programmers to resolve" (Mills & Broughton, 1977: 46). BSO shares the same thinking behind its notational representation, but its compilers have admitted that this approach is meant for "human use only" and may not be suitable for automation: "For computer search all these matters must be explicit in notation. Such a fully expressive notation would be far too long and complex for direct use by human beings" (Coates & Lloyd & Simandl, 1979: 49). UDC notation, on the other hand, is often said to be too long and complicated, as it records semantic and syntactic indicators (similar to Colon Classification), a feature that may be cumbersome for shelf ordering but that is valuable for machine processing and the automation of number building.

SOME BASIC RECOMMENDATIONS

From the compositional characteristics of classification systems, as highlighted in this paper, it is clear that not all the data required for automation is always present in notation, and that some forms of representation are more amenable to translation into machine-readable form than others. Some fully faceted classification systems, such as BC2 and BSO, are very difficult to automate or to implement for browsing and post-coordinate searching, because the available data does not contain all the information necessary for straightforward automation and for use in any way other than simple linear ordering.

The central, and most obvious, requirement for automation of analytico-synthetic classifications is the full declaration and encoding of each compositional element of a synthesized notation. Such a requirement has already been noted by several authors (cf. Wajenberg, 1983; Gödert, 1991; Gopinath & Prasad, 1994; Madalli & Prasad, 2002; Pollitt & Tinker, 2000). This requirement corresponds to a fundamental principle in automation: the independence and integrity of data elements. More specifically, there is the need for: i) clear encoding of facet types and their rules, to support IR and sorting functions; ii) consistency of facet data identification and content, to support verification and validation procedures; iii) encoding of semantic relationships, to support procedures based on semantic inheritance (i.e. notation-independent encoding of the class hierarchy).
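
Requirements i) to iii) can be illustrated with a record in which the facet of every class is declared and the hierarchy is stored independently of the notation. The sketch below is written in Python and is not an existing exchange format: ClassRecord, check and narrower are hypothetical names, and the facet label "sacred-texts" is an assumption. The classmarks and captions are those of the BC2 column in Figure 3, where the notation alone does not reveal the hierarchy.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class ClassRecord:
    notation: str            # classmark as displayed, e.g. "PIJ C"
    caption: str
    facet: str               # (i) explicit facet type
    broader: Optional[str]   # (iii) parent class, independent of notation


def check(records: Dict[str, ClassRecord], known_facets: Set[str]) -> List[str]:
    """(ii) Report records with an undeclared facet or a dangling parent."""
    errors = []
    for r in records.values():
        if r.facet not in known_facets:
            errors.append(f"{r.notation}: unknown facet {r.facet!r}")
        if r.broader is not None and r.broader not in records:
            errors.append(f"{r.notation}: missing broader class {r.broader!r}")
    return errors


def narrower(records: Dict[str, ClassRecord], notation: str) -> List[str]:
    """All classes that inherit from `notation`, found via the broader links."""
    found = []
    for r in records.values():
        parent, seen = r.broader, set()
        while parent is not None and parent not in seen:
            seen.add(parent)  # guards against accidental cycles in the data
            if parent == notation:
                found.append(r.notation)
                break
            parent = records[parent].broader if parent in records else None
    return found


# Classmarks and captions from Figure 3; the facet label is an assumption.
records = {r.notation: r for r in [
    ClassRecord("PIJ BK", "Sacred books", "sacred-texts", None),
    ClassRecord("PIJ C", "Samhitas", "sacred-texts", "PIJ BK"),
    ClassRecord("PIJ D", "Trayi Vidya", "sacred-texts", "PIJ C"),
]}
```

Here narrower(records, "PIJ BK") returns ["PIJ C", "PIJ D"], so a search on Sacred books can be expanded to Samhitas and Trayi Vidya even though the three classmarks share no hierarchical pattern. The same lookup on notations treated as plain text strings would have nothing to follow.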

Common understanding of these issues would enable the construction of a conceptual model against which a standard data structure for analytico-synthetic classifications could be defined. Following the above requirements, such a standard implies sophisticated data structures, i.e., carrying all the information needed to support the intelligence behind the schedules, not just the text of the schedules themselves. This is far from being current practice, as the most common approach has been treating notations as text strings. A common data standard would significantly help to distribute, share, manage and exploit classifications to their full potential. Besides contributing to the shareability and reuse of classifications, such a standard would also help in automating classification schemes such as BC2 and BSO, or other special faceted classifications that have been scarcely used in information retrieval systems.

REFERENCES:

COATES, E.; LLOYD, G.; SIMANDL, D. (1979). The BSO manual: the development, rationale and use of the Broad System of Ordering. The Hague: FID, 1979.

Concise UNIMARC Format for Classification Data (2000). IFLA, PUC, 31 October. Available at:

DEVADASON, F. J. (2003) "Faceted indexing application for organizing accessing Internet resources", Subject retrieval in a networked environment: proceedings of the IFLA Satellite Meeting held in Dublin, OH, 14-16 August 2001. Edited by I.C. McIlwaine. München: K G Saur, 2003. (UBCIM Publications - New Series 25), 147-159.

DEVADASON, F. J. et al. (2002) "Faceted Indexing Based System for Organizing and Accessing Internet Resources", Knowledge Organization, 29 (2) 2002, 65-77.

ELLIS, D.; VASCONCELOS, A. (1999) “Ranganathan and the Net: using facet analysis to search and organise the World Wide Web”. Aslib Proceedings. 51 (1) 1999, 3-10.

ELLIS, D.; VASCONCELOS, A. (2000) "The relevance of facet analysis for World Wide Web subject organization and searching", Internet searching and indexing: the subject approach. Edited by A. R. Thomas, J. R. Shearer. New York etc: The Haworth Information Press, 2000, 97-114.

GOEDERT, W. (1991) “Facet classification in online retrieval”. International Classification, 18 (2) 1991: 98-105.

GOPINATH, M. A.; PRASAD, A. R. D. (1994) "A knowledge representation model for analytico-synthetic classification", Knowledge organization and quality management: proceedings of the Third International ISKO Conference, 20-24 June 1994, Copenhagen, Denmark. Edited by H. Albrechtsen, S. Oernager. Frankfurt/Main: Indeks Verlag, 1994. (Advances in knowledge organization 4), 320-327.

GUENTHER, R. S. (1992) "The USMARC format for classification data: development and implementation", Classification research for knowledge representation and organization: proceedings of the 5th International Study Conference on Classification Research, Toronto, Canada, 24-28 June 1991. Edited by N. J. Williamson, M. Hudon. Amsterdam: Elsevier Science Publishers: FID, 1992, 235-245.

MADALLI, D. P.; PRASAD, A. R. D. (2002) "VYASA: a knowledge representation system for automatic maintenance of analytico-synthetic scheme", Challenges in knowledge representation and organization for the 21st century: integration of knowledge across boundaries: proceedings of the Seventh International ISKO Conference, 10-13 July 2002, Granada, Spain. Edited by María J. López-Huertas with the assistance of Francisco J. Muñoz-Fernández. Würzburg: Ergon Verlag, 2002. (Advances in Knowledge Organization 8), 113-119.

MARC 21 Concise Format for Classification Data (2003) Concise edition. Library of Congress, 2003.

MILLS, J.; BROUGHTON, V. (1977). Bliss Bibliographic Classification: introduction and auxiliary schedules. 2nd ed. London etc.: Butterworths, 1977.

POLLITT, S.; TINKER, A. J. (2000) "Enhanced view-based searching through the decomposition of Dewey Decimal Classification Codes", Dynamism and stability of knowledge organization: proceedings of the Sixth International ISKO Conference, 10-13 July 2000, Toronto, Canada. Edited by Clare Beghtol, Lynne C. Howarth, Nancy J. Williamson. Würzburg: Ergon Verlag, 2000. (Advances in knowledge organization 7), 288-294.

USMARC format for classification data: including guidelines for content designation (1997). Cumulation of 1990 edition and Update No. 1 95 interfiled. Prepared by Network Development and MARC Standards Office. Washington, DC: Library of Congress, Cataloging Distribution Service, 1997.

WAJENBERG, A. S. (1983) "MARC Coding of DDC for Subject Retrieval", Information Technology and Libraries, 2 (3) 1983, 246-251.

[1] Such is the case with BC2 and BSO. For information on BC2 see The BSO is available online at

[2] The UDC Master Reference (MRF), available from the UDC Consortium (

[3] Art and Architecture Thesaurus (