Testing FRBR against the National Bibliographic Database

The Functional Requirements for Bibliographic Records (FRBR) is a model in which the bibliographic universe is viewed and described in context. Entities, their attributes and the relationships that operate between them are defined, offering a mechanism through which the catalogue can be structured to help the user interpret and navigate between the four primary entities of works, expressions, manifestations and items. FRBR also addresses additional groups of entities, such as the persons or organisations responsible for intellectual and artistic output and the subjects of that output (i.e. concepts, objects, events or places). To date, however, most initiatives or implementations of the FRBR model have addressed what are known as the Group 1 entities, which reflect the actual intellectual or artistic product or output.

FRBR is not a standard or specification. The FRBR study report provides a conceptual model that agencies should evaluate and implement in a way that best enables their user communities to find, identify, select and obtain material. The report acknowledges areas that warrant further review, such as ‘seriality’, the impact of digital formats, and extending the data model to accommodate entities associated with authority records.

Significant work has been done by North America’s largest bibliographic agencies, OCLC, RLG and the Library of Congress, in implementing FRBR. Each agency has undertaken research projects to consider the utility of applying FRBR to its own databases. Each agency has specific goals, but all have been motivated by a common agenda: to assist users to sift through the myriad of information resources available. This is being achieved by adapting FRBR to simplify record retrieval and aggregate result sets into manageable clusters, testing the feasibility of implementing FRBR in a large bibliographic catalogue, and exploring how FRBR can assist in integrating the traditional catalogue with the web environment, making it more relevant to library users.

Aim

This work stemmed from a relatively broad inquiry into the applicability of FRBR within Australia’s bibliographic network, primarily embodied within the National Bibliographic Database (NBD). The scope of this inquiry included:

-  quantifying the results of applying FRBR to the NBD

-  determining future strategies for improvements to the NBD which are informed by exploiting the FRBR model

This paper addresses the first of these scope statements and considers the results of applying FRBR to the NBD using existing algorithms and tools such as the OCLC FRBR Work-Set Algorithm and the Library of Congress FRBR Display Tool.

Methodology

Tests

Two tests were devised: the first used the freely available Library of Congress FRBR Display Tool (http://www.loc.gov/marc/marc-functional-analysis/tool.html), and the second was designed in anticipation of the OCLC Work-Set Algorithm becoming available. While the algorithm itself was obtained, it was not encoded in any software or tool that could be used to test NBD data.

The first test was to identify the 100 most commonly occurring uniform title and name/title index entries in the NBD, extract the linked bibliographic records as a result set and feed these through the LC Display Tool. From the output, the number of records per work could be extrapolated, missed items qualified and any material type trends in the result set identified. The complete set of bibliographic records, as well as sample subsets of records pertaining to single works, were processed with the LC Display Tool in order to compare the effects on and across specific genres.

An attempt was made to identify the most commonly occurring Australian name/title and title headings. The presence of an Australian content indicator flag in a bibliographic record did not equate to the associated work being Australian. The only practical way to achieve this was to identify headings linked to authority records identified as ‘Australian’ through control field coding (i.e. 008/10=’z’).
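As a minimal sketch of that check, assuming the relevant authority records are available as a MARC21 file and using the pymarc library (the file name and the use of pymarc are assumptions; the actual identification ran against the NBD itself):

```python
from pymarc import MARCReader

def is_australian_authority(record):
    """True if the authority 008/10 is coded 'z', the control-field
    convention used here to flag 'Australian' headings."""
    for field in record.get_fields('008'):
        return len(field.data) > 10 and field.data[10] == 'z'
    return False

# 'authorities.mrc' is a hypothetical extract of authority records.
with open('authorities.mrc', 'rb') as fh:
    for record in MARCReader(fh):
        if is_australian_authority(record):
            for heading in record.get_fields('100', '110', '111', '130'):
                print(heading.value())
```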

Owing to the limited availability of a fully encoded alternative algorithm and of a Unicode platform, the second test has yet to proceed. This test would expand the methodology to a larger result set (i.e. the entire NBD), identifying the top 1000 results as a means of comparing NBD work-to-record ratios against those reported for WorldCat (1.5 manifestations per work) and the RLG Union Catalog (3 manifestations per work). The test would use the OCLC algorithm and be deployed on a Unicode-enabled platform to assess the impact of including CJK records.

T1: 100 most commonly occurring names and titles

Commonly occurring name and uniform title headings were defined as those individual headings with the greatest number of bibliographic records attached. The MARC fields included tags 130, 240, 243 and 730 for uniform titles, and any tag 100, 110, 111, 700, 710 or 711 with a $t subfield for name/titles.
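As an illustration of how such a ranking might be derived from raw records, the sketch below counts headings built from the tags just listed, using the pymarc library (the library choice and file name are assumptions; the actual counts were taken from the NBD’s heading indexes):

```python
from collections import Counter
from pymarc import MARCReader

UNIFORM_TITLE_TAGS = ('130', '240', '243', '730')
NAME_TAGS = ('100', '110', '111', '700', '710', '711')

def heading_keys(record):
    """Yield candidate work headings from one bibliographic record."""
    for field in record.get_fields(*UNIFORM_TITLE_TAGS):
        yield 'UT: ' + field.value()
    for field in record.get_fields(*NAME_TAGS):
        # Name/title headings are only those name fields carrying a $t subfield.
        if field.get_subfields('t'):
            yield 'NT: ' + field.value()

counts = Counter()
with open('nbd_sample.mrc', 'rb') as fh:      # hypothetical extract of NBD records
    for record in MARCReader(fh):
        counts.update(heading_keys(record))

for heading, n in counts.most_common(100):
    print(n, heading)
```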

The first list of commonly occurring name and uniform title headings yielded problematic results, with a number of inappropriate work headings ranked in the top 100 (Attachment 1 has selections from this first list). The most consistent problems reflected the impact of series in the title indexes and of headings established as collocating devices. As works in a series can number into the thousands, this type of index entry can significantly distort the ranking of popular headings.

This first list was refined by applying global exclusion criteria, i.e. automated SQL scripts that looked for common conditions on which to exclude a heading. This resulted in a more accurate and meaningful name/title list, though the approach was less successful with uniform titles (Attachment 2 has selections from the final list). The following exclusion criteria were applied to name/title and uniform title headings; a sketch of equivalent filtering logic follows the table.

Name/Title criteria

Criterion: Omit headings coded as series
Example:
410 *20 ºaUnited Nations. ºtDocument
830 ºaUnited States.ºbCongress.ºbSenate
Result: As expected

Criterion: Omit headings with text=’Thesis’ or ‘Theses’
Example:
721 ºaMonash UniversityºaThesis
810 ºaUniversity of TasmaniaºaTheses
Result: As expected

Criterion: Omit headings linked to bibliographic records that also contain a series tag
Example:
AN 227689 has heading:
721 2 ºaSouth Australian Institute of Technology. ºbSchool of Social Studies. ºtField Education Project A,
but also has series statement:
490 1 ºaReport / School of Social Studies. Field Education Project A ; ºv1
Result: Removed Shakespeare; this criterion had to be abandoned

Uniform title criteria

Criterion: Omit headings coded as series
Example:
440 ºaSmall world
Result: As expected

Criterion: Omit headings with text=’radio’ or ‘program’ or ‘television’
Example:
130 ºaRadio theater (1984)
730 ºaAdventures in good music (Radio program)
Result: As expected

Criterion: Omit headings linked to bibliographic records that also contain a series tag
Example:
130 ºaBetter homes and gardens
730 ºaSunset
Result: As expected

Criterion: Omit headings linked to bibliographic records coded as serials (LDR/07=’s’)
Example:
730 ºaKluwer Online journals
Result: As expected
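The production filters were SQL scripts run against the NBD, but the intent of the criteria can be sketched as record-level logic. The following is a loose Python rendering using pymarc; the tag lists and string tests are illustrative approximations of the conditions tabled above, not the actual scripts:

```python
def exclude_heading(field, record):
    """Return True if a candidate heading should be dropped, mirroring the
    spirit of the global exclusion criteria tabled above."""
    text = field.value()

    # Omit headings coded as series (the examples above show 410/440/810/830 forms).
    if field.tag in ('410', '440', '810', '830'):
        return True

    # Omit name/title headings containing 'Thesis' or 'Theses'.
    if 'Thesis' in text or 'Theses' in text:
        return True

    # Omit uniform titles containing 'radio', 'program' or 'television'.
    if field.tag in ('130', '730') and any(
            word in text.lower() for word in ('radio', 'program', 'television')):
        return True

    # Omit headings on records that also carry a series statement.
    # (For name/titles this criterion removed Shakespeare and had to be abandoned.)
    if record.get_fields('440', '490', '800', '810', '830'):
        return True

    # Omit headings on records coded as serials (LDR/07 = 's').
    leader = str(record.leader)
    return len(leader) > 7 and leader[7] == 's'
```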

Issues in identifying commonly occurring names and titles

Both lists had to be further refined by nominating specific name/titles to be manually excluded prior to extracting a dataset. While manual compilation was possible with a limited set of works, it would be difficult to sustain in a full production environment.

For example, ‘United Nations Association of Australia. Media peace awards’ had to be dropped from the list as it failed to meet the global exclusion criteria and there was no way to apply criteria that would reliably eliminate just this heading or those of a similar nature.

The same problem occurred more frequently for title headings. For example, ‘Dr. Demento (1986)’ did not meet the global exclusion criteria: the heading was not coded as a series heading or a serial, and the attached bibliographic records did not contain a series statement. Additional exclusion criteria, such as material type (LDR/06=’i’), would have dropped other legitimate titles. Other options were considered but discarded as they added complexity and could not be proven to be 100% reliable; for example, another criterion could have been to exclude any heading whose attached records contain a 6XX $v subfield with the text ‘radio’. It proved simpler to manually identify unwanted titles.

Duplicate or near-duplicate headings were present. Based on a sampling of the first 50 name/title and first 40 uniform title headings, there was an average of 5 duplicate headings for each name/title and 2.4 duplicate headings for each title. The highest counts were 13 duplicate headings for a name/title and 6 for a title heading. In most instances, however, this did not affect the ranking of commonly occurring headings or the aggregations produced by the Display Tool, so the causes of duplication could be ignored for the purposes of the LC FRBR tool; a future option could include normalising and aligning headings against an authority file prior to the identification of works. In a few instances duplicate headings did dramatically affect ranking, with the number of bibliographic records split relatively evenly between two identical or nearly identical terms (refer to Attachment 2 and the near-identical headings for ‘Bible. English. Authorised’ and ‘Bible. English. Authorized’, and the identical headings for Chopin and Liszt).

A cataloguer’s decision to enter a work under either a title or a name/title heading can also affect the ranking. For example, the heading ‘Aesop’s fables’ has significantly more manifestation entries than the alternative headings ‘Aesop. Aesop’s Fables’ or ‘Aesop. Fables’. This is problematic in that the identification of common works, and the algorithms that cluster them, do not look across these heading or index boundaries.

Initially only a list of 25 Australian name/title headings was extracted (Attachment 3 contains the full list) of which only 5 represented works of interest to this study.

Issues with the Library of Congress FRBR algorithm and display tool

Conversion to XML could be an issue. The RedLightGreen project had challenges in this area with the XML DTD because of the number of locally defined tags deployed in the RLG database. The NBD uses almost no local tags, but it may need to consider adapting an existing DTD (such as the MARC XML DTD) to accommodate data from external sources (overseas and local), and possibly also to support obsolete data within the NBD if superseded MARC elements are not defined in the current MARC XML DTD.

The tool is very sensitive to data errors within NBD records, particularly in the transformation from MARC to XML; for example, subfield delimiters (1F) or punctuation (CCA6) occurring mid-field or in coded data fields caused the tool to abend. It would be preferable for records that failed validation to be written to an error file and for the tool to continue processing the remaining records. Common problems could then be self-corrected (automatic substitution of values).
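That behaviour can be approximated outside the tool by pre-validating the file and parking failing records before conversion. A minimal sketch, assuming pymarc is used for an upstream MARC-to-MARCXML pass (the Display Tool performs its own conversion, so this is only a stand-in, and the file names are hypothetical):

```python
from pymarc import MARCReader, record_to_xml

with open('nbd_sample.mrc', 'rb') as marc_in, \
     open('nbd_clean.xml', 'wb') as xml_out, \
     open('nbd_errors.mrc', 'wb') as err_out:
    for record in MARCReader(marc_in):
        if record is None:        # pymarc may yield None for records it cannot parse at all
            continue
        try:
            # Stray subfield delimiters (0x1F) inside coded/control fields were
            # one of the conditions that caused the Display Tool to abend.
            for field in record.fields:
                if field.is_control_field() and '\x1f' in field.data:
                    raise ValueError('stray subfield delimiter')
            xml = record_to_xml(record)
            xml_out.write(xml if isinstance(xml, bytes) else xml.encode('utf-8'))
            xml_out.write(b'\n')
        except Exception:
            err_out.write(record.as_marc())   # park the record in an error file and continue
```

A real run would also wrap the clean output in a MARCXML collection element before handing it to the tool.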

There is doubt as to whether the tool itself and/or the resolution of data errors will scale to FRBRise the entire NBD. It was not possible to test this given the technical and resource constraints of this investigation. A Unix environment may be required to clean up files that fail validation, as a hex editor and standalone PCs will hit file size and memory limits.

The FRBR tool does not generate statistics upon completion of processing. It was possible to work around this by employing functionality within an XML editor, which provides an occurrence count for each MODS element in the file. Even then, the statistics for the MODS <manifestation> element had to be interpreted carefully, as they do not reflect the actual number of manifestations owing to the MODS syntax structure: <manifestation> is a parent or wrapper element for each <imprint> element, so the correct number of manifestations was harvested from the latter.
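The same counts can be scripted rather than read off an XML editor. A small sketch using Python’s standard XML parser, matching elements by local name; the element names (‘work’, ‘manifestation’, ‘imprint’) follow the description above and the output file name is hypothetical:

```python
import xml.etree.ElementTree as ET
from collections import Counter

def local_name(tag):
    """Strip any XML namespace, e.g. '{http://...}imprint' -> 'imprint'."""
    return tag.rsplit('}', 1)[-1]

counts = Counter()
for _event, elem in ET.iterparse('frbr_output.xml'):
    counts[local_name(elem.tag)] += 1
    elem.clear()                        # keep memory use flat on large output files

works = counts.get('work', 0)
imprints = counts.get('imprint', 0)     # one per manifestation; <manifestation> is only a wrapper
print(f"works: {works}, imprints (manifestations): {imprints}")
if works:
    # Comparable to the WorldCat (1.5) and RLG Union Catalog (3) manifestations-per-work ratios.
    print(f"manifestations per work: {imprints / works:.2f}")
```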

The sorting rules imposed by the tool lead to curious and sometimes misleading displays. For example:

-  Works with an author main entry filed alphabetically, but all before any works with a title main entry. This presented problems where cataloguers had made different choices of main entry.

-  Expressions carry text labels but are sorted alphabetically by the code value in LDR/06, so ‘text’ (LDR/06=a) files before ‘sound recording’ (LDR/06=j), and ‘sound recording’ files before ‘software, multimedia’ (LDR/06=m). A small sketch after this list illustrates the effect.

-  Manifestations were sorted under each expression chronologically by date, in descending order. Where manifestations shared the same date, no further sort rule was applied.

-  Some leader codes were displayed as their raw code value, possibly a display conversion error, e.g. Form: LDR/06=o
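To illustrate the second point, sorting on the displayed label rather than the underlying LDR/06 code gives a different, and arguably less surprising, order. The mapping below is deliberately abbreviated to the three labels quoted above:

```python
# Abbreviated LDR/06 code-to-label mapping, limited to the labels quoted above.
LDR06_LABELS = {"a": "text", "j": "sound recording", "m": "software, multimedia"}

expression_codes = ["j", "m", "a"]

by_code = sorted(expression_codes)                                   # a, j, m (the tool's order)
by_label = sorted(expression_codes, key=lambda c: LDR06_LABELS[c])   # sorted on the label text

print([LDR06_LABELS[c] for c in by_code])    # ['text', 'sound recording', 'software, multimedia']
print([LDR06_LABELS[c] for c in by_label])   # ['software, multimedia', 'sound recording', 'text']
```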

The algorithm appears to generate apparently identical works, possibly due to a fault or an as yet misunderstood feature of the algorithm. The explanatory text does not reveal any clues for interpreting this behaviour. The examples below should have generated two manifestations under one work if the match were performed only on the 1XX and 240 fields (as stated in the algorithm):

W1
Author: Schubert, Franz, 1797-1828
Work: Songs. Selections
Form: notated music - German
Edition: Original-Ausg.
Title: Lieder für eine Singstimme mit Pianofortebegleitung. Band IV
Statement of responsibility: Franz Schubert ; revidiert von Max Friedlaender
Imprint: C.F. Peters, [19--?]

W2
Author: Schubert, Franz, 1797-1828
Work: Songs. Selections
Form: notated music - German
Edition: Original-Ausgabe.
Title: Lieder für eine Singstimme mit Pianofortebegleitung
Statement of responsibility: Franz Schubert ; nach den ersten drucken revidiert von Max Friedlaender
Imprint: C.F. Peters, [19--]

The algorithm does not differentiate between musical and non-musical sound recordings for display purposes. Both forms are assigned the same label, ‘sound recording’, leaving the user to determine which is applicable, presumably from the physical description area, though this can be inconclusive.

Algorithms

The two algorithms that were easiest to assess were the LC FRBR Display Tool algorithm and the OCLC Work-Set Algorithm.

The most noticeable difference between the two is that the LC FRBR Display Tool seeks to identify and display works, expressions and manifestations whereas the OCLC Work-Set algorithm is only designed to identify a work-set.

The other key differentiator is that the OCLC algorithm exploits data in authority files. It does so by mapping variant forms of names in the bibliographic record to the form authorised in the authority file, according to certain rules: for example, where an exact match fails or yields multiple matches, attaching to the form used most frequently (which approximates the form with the most bibliographic records attached), or ignoring birth and death dates to allow a variant form of name to match. All authority and work-set keys are normalised according to the NACO rules prior to matching.

There are many similarities in the data elements that both algorithms exploit when constructing the author and title parts of the work keys. For example, both use the same subfield codes when constructing the author part, though there is less overlap in the construction of the title part (generally, fewer subfields are used in the OCLC algorithm).
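To make the key construction concrete, the sketch below builds a simple author/title work key from a bibliographic record. The normalisation is a crude stand-in for NACO normalisation, and the field and subfield choices are illustrative rather than a restatement of either algorithm:

```python
import re
import unicodedata

def normalise(text):
    """Crude stand-in for NACO normalisation: strip diacritics and most
    punctuation, collapse whitespace and fold case."""
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r'[^\w\s]', ' ', text)
    return ' '.join(text.lower().split())

def work_key(record):
    """Build an author/title work key from a pymarc Record: author part from
    the first 1XX name field, title part from 240 if present, else 245 $a.
    Field and subfield choices here are illustrative only."""
    author = next((f.value() for f in record.get_fields('100', '110', '111')), '')
    title = next((f.value() for f in record.get_fields('240')), '')
    if not title:
        title = next((' '.join(f.get_subfields('a')) for f in record.get_fields('245')), '')
    return normalise(author) + '/' + normalise(title)
```

Keys built this way would, on the data shown, give the two Schubert records above the same key and collapse them into a single work. Spelling variants such as ‘Authorised’ and ‘Authorized’ would still yield different keys under any purely string-based normalisation, which is where the authority-file mapping in the OCLC approach helps.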