Data mining MARC to find FRBR?
Finnish Norwegian project
ELAG, Rome, 17.4.2002
Eeva Murtomaa, Helsinki University Library
One year ago, at the beginning of March 2001, there were only questions to be answered. These questions related to the implementation of the IFLA (International Federation of Library Associations and Institutions) FRBR model (Functional Requirements for Bibliographic Records) in bibliographic systems.
Knut Hegna from the University of Oslo Library and Eeva Murtomaa from the Helsinki University Library started a project to find some answers.
Our first question was: could we find the FRBR entities work, expression, and manifestation in the MARC records included in the Finnish and Norwegian national bibliographies and in BIBSYS (a library system serving most university and college libraries in Norway)? But this was not enough. We were also curious to see what kinds of problems we face when looking at MARC records, or at search results and displays, in the light of FRBR, and how to design better hit lists based on the FRBR structure.
After one year we had two main answers to our first question: yes and no. Yes, because the FRBR model is to some extent present in the MARC record, and it can partly be found by a computer program as well. Bibliographic records are usually created at the manifestation level. This means that we can identify and separate the elements describing works and manifestations in the bibliographic records. Even data describing the expression can be found, depending on the level and quality of description. In addition, we realized that the relationships between the description and the main and added entries, as well as subject descriptors, would help in the identification process.
Example 1 (work, expression, manifestation):
• 001f521254 001f521255
• 008 941208s1946 fi j f 008941208s1946 se j f c
• 015 $a f521254 015 $a f521255
• 041 $a swe 041 $a swe
• 080 $a 839.79-3(024.7) 080 $a 839.79-3(024.7)
• 1001 $a Jansson $h Tove 1001 $a Jansson $h Tove
• 2452 $a Kometjakten 2452 $a Kometjakten
• $d Tove Jansson $d Tove Jansson
• 260 $a Helsingfors 260 $a Norrköping
• $b Söderström $c 1946 $b Sörlin $c 1946
• 300 $a [2], 179 s. $c 8:o 300 $a [2], 179 s. $c 8:o
Example 2:
• 008 8700909s1967 gb j 008 940214s1991 us j f c
• 015 f688998 021 $a 0-374-41331-2 $c nid.
• 0411 $a eng $c swe 0411 $a eng $c swe
• 1001 $a Jansson $h Tove 1001 $a Jansson $h Tove
• 241 $a Kometjakten 241 $a Kometjakten
• 2452 $a Comet in Moominland 245 $a Comet in Moominland $d Tove
$d written and ill. by Tove Jansson Jansson $e translated by Elizabeth
$e translated by Elizabeth Portch Portch
• 250 $a [New ed.]
• 260 $a Harmondsworth 260 $a [New York, Ny] $b Farrar
• $b Penguin books $c 1967 Straus, and Giroux $c 1991
• 300 $a 157 s. $b kuv. 300 $a 192 s. $b kuv. $c 19 cm
• 490 $a A Puffin book 490 $a A Sunburst book
• 555 $a 5. Impr. London &
• New York: Benn & Walk, 1970
• 70011$a Portch $h Elizabeth 70011 $a Portch $h Elizabeth
Example 3:
• 001f521254 001f521255
• 008 941208s1946 fi j f 008941208s1946 se j f c
• 015 $a f521254 015 $a f521255
• 041 $a swe 041 $a fin $c swe
• 080 $a 839.79-3(024.7) 080 $a 839.79-3(024.7)
• 1001 $a Jansson $h Tove 1001 $a Jansson $h Tove
• 2452 $a Kometjakten 2452 $a Takaisin Muumilaaksoon
• $d Tove Jansson $d Tove Jansson
• 260 $a Helsingfors 260 $a Porvoo $a Hki $a Juva
• $b Söderström $c 1946 $b WSOY $c 1988
300 $a [2], 179 s. $c 8:o 300 $a 250 s. $b kuv. $c 30 cm
500 $a Alkuteokset: Kometjakten ;
• Trollkarlens hatt …
• 505 Sisältö: Muumipeikko ja pyrstötähti ; ... 70011 $a Järvinen $h Liisa
Example 4 (identifying the manifestation):
• *001887091044X
• *008941107s1994 it j c
• *021 $a 88-7091-044-X $c nid.
• *0411 $a ita $c swe
• *080 $a 839.79-3
• *1001 $a Jansson $h Tove $c 1914-
• *241 $a Resa med lätt bagage
• *2452 $a Viaggio con bagaglio leggero $d Tove Jansson $e introduzione di Carmen Giorgetti Cima $e [traduzione dallo svedese di Carmen Giorgetti Cima]
• *260 $a Milano $b Iperborea $c 1994
• *300 $a 187, [1] s. $c 20 cm
• *490 $a Iperborea $v 44 $y 0044
• *70021$a Giorgetti Cima $h Carmen
Example 5
008880325 no esp
*02000 $a 82-991075-2-0 $b h. $c Nkr 60.00
*04110 $a espnor
*08200 $a 839.822[S]
*10010 $a Ibsen, Henrik $d 1828-1906
*24510 $a Puphejmo (1879) $c Henrik Ibsen ; tradukis:
Odd Tangerud ; lingve kontrolita de
Esperantista Verkista Asocio (EVA)
*26000 $a Hokksund $b Eldonejo Odd Tangerud $c 1987
*26900 $a [Drammen] : Tangen-trykk
*30000 $a [1], 57 s. $c 24 cm
*50000 $a Originaltittel: Et dukkehjem. -
Originalutgave: København : Gyldendal, 1879
*99100 $a Tangerud, Odd
Why was the answer also no? We realized that the cataloguing rules are designed for card catalogues and printed bibliographies, not for displays based on the FRBR model. The central information is often recorded in a way more suitable for the human mind and eye than for a computer.
However, this was not the end of the questions. We had to find out what kinds of problems we meet when looking at the MARC records, or at the structure of hit lists and displays, in the light of FRBR. What should the hit lists look like?
Methodology
Our goal was to collocate similarities and to analyse differences. We therefore created strings for identifying works, expressions and manifestations by mapping the FRBR attributes associated with the entities work, expression, and manifestation to the elements included in the national MARC fields and subfields. Only attributes and relationships of high or moderate value for identifying the entities were taken into consideration. The idea was to bring together identical strings and to separate different strings.
Table 1:
Table 2:
Our examples consisted mainly of single works, or collections of works, for which a single person is responsible. From the tables above you can see that, for identifying the work, we looked at the original titles or uniform titles. The title of the work and the relation to the person(s) responsible were used as attributes identifying the works.
With these attributes we could collocate the identical works existing in several records and differentiate them from other works. When this was done, we used other data to differentiate and collocate different expressions.
At the expression level, the language of the expression, supplemented with the entity responsible for the expression (usually the translator), was selected. For identifying the manifestation, the following elements from the MARC records were used: title, publisher, date and extent of the carrier.
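The identifying strings described above can be sketched roughly as follows. This is only an illustration, not the programme used in the project: the `Record` class, its field names, the `norm` helper and the exact choice of MARC fields in the comments are assumptions made for the example. The sample values are taken from Examples 1 and 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical, simplified view of one bibliographic record."""
    author: str               # main entry (field 100)
    title: str                # title of the manifestation (field 245)
    publisher: str            # field 260 $b
    year: str                 # field 260 $c
    extent: str               # field 300 $a
    language: str             # language of the text (field 041 $a)
    original_title: str = ""  # field 241; empty when the record has none
    translator: str = ""      # added entry (field 700); empty when absent


def norm(value: str) -> str:
    """Lower-case and collapse whitespace so that identical strings compare equal."""
    return " ".join(value.lower().split())


def work_key(r: Record) -> tuple:
    # Work: person responsible + original title (fall back to the title proper).
    return (norm(r.author), norm(r.original_title or r.title))


def expression_key(r: Record) -> tuple:
    # Expression: work + language + entity responsible for the expression.
    return work_key(r) + (norm(r.language), norm(r.translator))


def manifestation_key(r: Record) -> tuple:
    # Manifestation: expression + title, publisher, date and extent.
    return expression_key(r) + (
        norm(r.title), norm(r.publisher), norm(r.year), norm(r.extent)
    )


swe_helsingfors = Record("Jansson, Tove", "Kometjakten", "Söderström", "1946",
                         "[2], 179 s.", "swe")
swe_norrkoping = Record("Jansson, Tove", "Kometjakten", "Sörlin", "1946",
                        "[2], 179 s.", "swe")
eng_penguin = Record("Jansson, Tove", "Comet in Moominland", "Penguin books",
                     "1967", "157 s.", "eng",
                     original_title="Kometjakten",
                     translator="Portch, Elizabeth")

for rec in (swe_helsingfors, swe_norrkoping, eng_penguin):
    print(work_key(rec), expression_key(rec)[2:], rec.publisher)
```

In this sketch the three records share one work string, the two Swedish records share one expression string, and all three manifestation strings differ.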
Some results of our examples:
Results of using the work reduction procedure on records from the Norwegian (n) and the Finnish (f) national bibliographies.
In the table, the "Number of records" is the number of records in which the author appears as main or added entry. The "Number of work lines" is the number of work identifiers extracted from these records by the programme. The last column gives the "Number of unique works" of the author. Looking at the variation in the numbers, we have to realize that there are some discrepancies between reality and the numbers given in the table.
What kind of problems?
Of course the results of the study depend on the quality of cataloguing. There are inconsistencies in the logic of cataloguing caused by historical or individual differences. In addition, there are other reasons for "lying statistics", such as:
- records which lack some information or in which the information is wrong
- records without original or uniform titles, or unidentifiable titles
- misprints and spelling differences in the original titles
- inaccurate cataloguing
- inconsistent registration of collections (several works in the same manifestation)
- relationships, which are usually expressed in natural language (usually as notes)
- lack of qualifying information (roles) in added entries (700 $e)
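A small sketch may show how such problems inflate the number of work lines compared with the number of unique works. The function `work_lines` and the toy records are assumptions made for illustration; the misprinted title and the record without an original title are invented and do not come from the bibliographies studied.

```python
from collections import Counter


def work_lines(records):
    """Count records per extracted work identifier (author, original title)."""
    return Counter(
        (author.strip().lower(), title.strip().lower())
        for author, title in records
    )


# (author, original title) pairs; the last two are hypothetical problem cases.
records = [
    ("Jansson, Tove", "Kometjakten"),
    ("Jansson, Tove", "Kometjakten"),
    ("Jansson, Tove", "Kometjagten"),  # invented misprint in the original title
    ("Jansson, Tove", ""),             # invented record lacking an original title
]

lines = work_lines(records)
print("records:", len(records))
print("work lines:", len(lines))
for key, count in lines.items():
    print(count, key)
```

All four toy records describe the same work, yet a purely string-based comparison reports three work lines. Arriving at the number of unique works then requires either manual inspection or cleaner data.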
User interfaces
One of our questions concerned the significance of the FRBR structure for hit lists and displays. We thought that the hit list should be in line with the search and the search results. In addition, we supposed that the search results for the works of a single person should be arranged in alphabetical or chronological order, or perhaps according to the function of the person related to the work, expression or manifestation.
For designing the hit lists and displays we looked at the attributes/elements of the entities that are important in the selection process in the FRBR. It seems that on the work level the title and the relation to the creator are most important. On the expression level the language and the relation to the person responsible for the expression should be taken into consideration. Finally, important elements on the manifestation level are the statements of responsibility, edition, publisher, and date.
Card catalogues
In the card catalogue the hit list is presented by overlapping cards. The headings are composed of work title and person responsible for the work. The expressions are represented by the language of the expression and the title of manifestation. The original title is given first, and after that other titles in alphabetical order. At the end of the title of expression the number of manifestations of this expression is given.
The manifestations are sorted according to the publishing year.
Tree structure
Here the work titles are sorted alphabetically according to the original title. The number at the end of each title indicates the number of expressions belonging to this work. The work nodes are expandable with the expressions as leaves. The expressions are sorted alphabetically according to the languages of the original titles.
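A tree-structured hit list of this kind could be built along the following lines. This is a minimal sketch under stated assumptions: the functions `build_tree` and `render`, the tuple layout of the input and the plain-text indentation are all hypothetical, and expressions are simply sorted by language code. The sample values come from Examples 1, 2 and 4.

```python
from collections import defaultdict


def build_tree(records):
    """Group manifestations under work title -> expression language."""
    tree = defaultdict(lambda: defaultdict(list))
    for work_title, language, manifestation in records:
        tree[work_title][language].append(manifestation)
    return tree


def render(tree):
    """Return the hit list as indented text lines: work, expression, manifestation."""
    lines = []
    for work in sorted(tree):
        expressions = tree[work]
        lines.append(f"{work} ({len(expressions)})")  # number of expressions
        for language in sorted(expressions):
            manifestations = sorted(expressions[language])  # by year first
            lines.append(f"  {language} ({len(manifestations)})")
            for year, title, publisher in manifestations:
                lines.append(f"    {year} {title} ({publisher})")
    return lines


# (work title, language of expression, (year, title, publisher))
sample = [
    ("Kometjakten", "swe", ("1946", "Kometjakten", "Söderström")),
    ("Kometjakten", "swe", ("1946", "Kometjakten", "Sörlin")),
    ("Kometjakten", "eng", ("1967", "Comet in Moominland", "Penguin books")),
    ("Kometjakten", "eng", ("1991", "Comet in Moominland",
                            "Farrar, Straus and Giroux")),
    ("Resa med lätt bagage", "ita", ("1994", "Viaggio con bagaglio leggero",
                                     "Iperborea")),
]

hit_list = render(build_tree(sample))
print("\n".join(hit_list))
```

Sorting the manifestation tuples puts the publishing year first, which matches the chronological ordering of manifestations used in the card catalogue display.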
Conclusions
First of all, I would like to stress the importance of good cataloguing. If bibliographic records are created logically, it is possible to manipulate the data by computer.
That's why we have to ask ourselves: how, why, and for whom are we cataloguing?
Our investigation showed that the importance of authority data and of language codes should be stressed in cataloguing. We noticed that our analysis would have been easier if original titles had been recorded in a more consistent way, e.g. in separate, repeatable fields.
With the help of authority files we can give our customers the possibility to navigate in the bibliographic universe. Besides authority files of names (of persons, corporate bodies and series) and subject descriptors, there is a need for work authorities for collocating the same work under one heading.
With the help of language codes we can identify a manifestation as a translation. Language codes are also important attributes for identifying different expressions of a single work.
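As a rough illustration, a language field such as those in the examples can be checked for a translation as follows. The sketch assumes, as the example records suggest, that subfield $a of field 041 carries the language of the text and $c the language of the original. The functions `parse_subfields` and `language_info` are hypothetical helpers, not part of any MARC library.

```python
import re


def parse_subfields(field: str):
    """Split a string such as '$a eng $c swe' into (code, value) pairs."""
    return [
        (code, value.strip())
        for code, value in re.findall(r"\$(\w)([^$]*)", field)
    ]


def language_info(field_041: str) -> dict:
    """Extract text and original languages; assumes $a = text, $c = original."""
    subfields = parse_subfields(field_041)
    text = [value for code, value in subfields if code == "a"]
    original = [value for code, value in subfields if code == "c"]
    return {
        "text": text,
        "original": original,
        "is_translation": bool(original),
    }


for field in ("$a swe", "$a eng $c swe", "$a ita $c swe"):
    print(field, "->", language_info(field))
```

The language of the text can then serve as part of the expression string, while the presence of an original language marks the manifestation as a translation.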
During the project, the role of the functions became more and more important. The functions or roles of a person or corporate body are usually indicated in the description only. In an environment structured according to the FRBR model, a function statement in the main or added entry field would be very helpful. The search systems and the design of hit lists could make good use of the function statements. In addition, our users could benefit from the function statement in their bibliographical navigation. That's why we suggested that the functions should not be optional.
Suggestions for continuing work
After finishing the project, many aspects are still open for further investigation.
One important topic concerns relationships, which provide additional information for the user in making connections between the entity found and other entities that are related to that entity. Relationships give new aspects for display designers. They also offer new ways to create navigation possibilities for the user.
We have to find out what kinds of relationships we can find in the descriptions. Unfortunately, relationships are mainly indicated in the description, and textual information is very much language dependent. In the near future this problem may be partly solved by different kinds of identification numbers, which link different entities with each other. In addition, the role of coded information, subject headings, classification and authority files is worth examining.
