Validation System Documentation / 11/08/2018

Validation System

A Tool for Providing Information for Use in the Validation Process

Author: L.Gillam

Table of Contents

1. INTRODUCTION

2. RUNNING THE VALIDATION SYSTEM

2.1. Main Validation System Dialog

2.1.1. Selecting Files

2.2. Further Tasks

2.2.1. Choose Statistical Nucleates

2.2.2. Generate Collocations

2.2.3. Generate Orthographic Cues

1.Introduction

The Validation System combines the strengths of several terminology extraction methods to give the user a view of how terms are being used within texts. A combination of orthographic and statistical methods enables the creation of a coordinate term set in which similar terms are grouped. The grouping is made on the basis of lexical similarity.

Document version 0.1

Software prototype to be made available.

At the University, the prototype is run using the ‘Interval proto’ button from the Quirk menu, or by typing INTERVAL at a Zen window command prompt.

2.Running the Validation System

The system consists of a number of tasks which can be run as determined by the user in order to produce evidence for terminology. It is possible, for example, to find evidence for terms from Termbases via the Interval Interchange Format (IIF) and textual analysis. Textual analysis can also be used to extract and compare terms found in texts.

3.Main Validation System Dialog

The main dialog consists of 4 Buttons and 2 Listboxes. Each of the Buttons controls a specific action. These actions are defined in the table below:

Button / Function / Dependencies
Appends the selected task in the Available Tasks Listbox to the Chosen Tasks Listbox
Transfers the selected task in the Chosen Tasks Listbox back to the Available Tasks Listbox
Exit / Terminates the program
Prepare Tasks / Generates the Task-based application, based on the order of the chosen tasks

4.Selecting Files

Prior to the setup for the individual tasks, two generic tasks need to be performed. Firstly, a file or set of files has to be chosen for the analysis. Secondly, the user can choose a set of defaults for information to retain during the analysis.

When the ‘Prepare Tasks’ button is first pressed, a file dialog is displayed allowing the initial selection of a file. Selecting a file results in the path to this file being displayed in the Validation System::Select Files for Tasks dialog.

The user can Add Files to the list, use Drop Selection to drop a selected (highlighted) file, or use Drop All to remove all files from the list.

As we will see, the ‘<’ and ‘>’ buttons appear in many of the screens. These buttons enable the user to move between the tasks if, for example, there are items that they wish to change for a particular analysis.

Once the appropriate files have been selected, pressing the button displays the Validation System::Set Generic Options dialog.

5.Generic Options

Currently, this allows the user to choose whether to perform a case-sensitive or case-insensitive analysis, and whether punctuation should be retained in the analysis phases. If the Keep Case checkbox is checked, the analysis will be case-sensitive. Similarly, if the Keep Punctuation checkbox is checked, punctuation will be kept in the results.
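The effect of these two options on the text can be pictured with a minimal sketch. The function name tokenise and its parameters are hypothetical, and the tokenisation rule (words may contain internal hyphens or apostrophes, each punctuation mark is a token of its own) is an assumption made for illustration; the prototype may tokenise differently.

```python
import re

def tokenise(text, keep_case=False, keep_punctuation=False):
    """Split text into tokens, honouring the two generic options (illustrative only)."""
    if not keep_case:
        # Case-insensitive analysis: fold everything to lower case.
        text = text.lower()
    # A token is either a word (possibly with internal hyphens or apostrophes)
    # or a single punctuation character.
    tokens = re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)
    if not keep_punctuation:
        # Drop tokens that do not start with a word character.
        tokens = [t for t in tokens if re.match(r"\w", t)]
    return tokens
```

With both options switched off, "The Head-end, again." would be reduced to the tokens the, head-end and again; with both switched on, the capitalisation, the comma and the full stop are retained.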

Once these default screens have been displayed and the button has been pressed, the tasks in the Chosen Tasks listbox (Validation System::Choose Tasks) will be presented and executed in the order chosen.

(Because the tool is a prototype, the behaviour of certain task orderings is not yet known.)

Task descriptions

Task Name / Function
Work from Markup / Allows selection of certain term items for searching from IIF
Select Searchable Resources / none yet
Choose Statistical Nucleates / Creates ‘weird words’ by comparing chosen files against an existing frequency list
Generate Orthographic Cues / Creates phrases from the text that occur between specific items of syntax (boundaries)
Generate Collocational Cues / Creates n-grams based on the statistical properties of certain nucleating words
Compare Cues / Compares and groups items generated from the analysis.

6.Further Tasks

Three of the five available tasks are concerned with the textual analysis. For these tasks (Choose Statistical Nucleates, Generate Orthographic Cues and Generate Collocational Cues), the files that were chosen in the initial Select Files for Tasks dialog are used in conjunction with the options chosen in the Set Generic Options dialog.

7.Work from Markup

This task enables the selection of certain terms with which to work, from a set of marked-up texts. Currently, this allows IIF data (that is, data in the Interval Interchange Format) to be used with its DTD (Document Type Definition, which describes the layout of the IIF) to obtain items in specific languages, which can subsequently be matched in texts.

To Work from Markup, it is essential that both a DTD definition file and an IIF data file have been loaded. The Selected DTD Elements are then used in combination with the selected language to retrieve the required term information. In the example, ‘iif&entry&term&termstr’ and ‘iif&entry&equi&termstr’ have been selected for English, so all the English term and equi entries will be used to locate evidence. The same mechanism would subsequently allow def entries to be selected for use in future tasks. Once the IIF data elements have been obtained, they can be searched for directly (KWIC), or further tasks can be performed to enable data comparisons to be made.

Traversal of the DTD Elements is done using the Up and Down buttons in the dialog. The Select Terms button allows the DTD data for the selected elements to be selected for further work:

In order to complete the Work from Markup task, some terms must have been chosen via the Select Terms button, that is, placed in the right-hand side of the dialog, as “head-end” and “head end” are.

8.Choose Statistical Nucleates

This task enables the simple selection of a set of nucleating words. These nucleating words could then be used as seeds for the collocation tool, or output directly. This analysis is achieved by comparing the frequency of the words in the selected texts with the frequency of their occurrence in a corpus of general language. By constraining values of both frequency and ‘weirdness’, a list of candidate tokens is produced.

Changing the values of weirdness and frequency will alter the group of tokens being returned. The Find File button allows the user to select a frequency list for comparison with the frequency list that is produced as part of this task.
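As a rough illustration of how the two constraints interact, the sketch below computes a ‘weirdness’ value for each word and keeps those words that pass both thresholds. The exact formula used by the prototype is not given here; the sketch assumes that weirdness is the ratio of a word's relative frequency in the selected texts to its relative frequency in the general-language list, with add-one smoothing for words missing from that list. The function names and default thresholds are hypothetical.

```python
from collections import Counter

def weirdness_scores(text_tokens, general_freq, general_total):
    """Return {word: (frequency, weirdness)} for the selected texts.

    Assumption: weirdness = relative frequency in the texts divided by
    relative frequency in the general-language list.
    """
    text_freq = Counter(text_tokens)
    text_total = sum(text_freq.values())
    scores = {}
    for word, freq in text_freq.items():
        rel_text = freq / text_total
        # Add-one smoothing (an assumption) avoids division by zero
        # for words that do not occur in the general-language list.
        rel_general = (general_freq.get(word, 0) + 1) / (general_total + 1)
        scores[word] = (freq, rel_text / rel_general)
    return scores

def choose_nucleates(text_tokens, general_freq, general_total,
                     min_freq=2, min_weirdness=10.0):
    """Keep words that satisfy both the frequency and the weirdness constraint."""
    scores = weirdness_scores(text_tokens, general_freq, general_total)
    return sorted(word for word, (freq, weird) in scores.items()
                  if freq >= min_freq and weird >= min_weirdness)
```

Raising min_freq removes rare words from the candidate list, while raising min_weirdness removes words that are also common in general language, such as ‘the’.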

9.Generate Collocations

Collocation patterns (cf Smadja) are generated automatically from a set of ‘nucleating words’ that are set up in the Lexica dialog. Pressing the Lexica button displays a dialog similar to the one below:

This dialog shows two text input areas: one for nucleating words and one for words that are not to be considered as collocates. For example, collocates of ‘the’ (English) may not be of interest, so we Ignore them. The Reset button clears the contents of each of these areas. The Lists button allows an existing list to be loaded and the Save button allows the current list to be saved to a file.
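The roles of the two word lists can be sketched as follows. This is only the counting step: for every occurrence of a nucleating word, the neighbouring tokens are counted by their relative position, and Ignore words are skipped. The window size of five tokens, the function name and the shape of the result are assumptions, and the statistical filtering that turns such counts into collocation patterns is not shown.

```python
from collections import Counter

def collocate_counts(tokens, nucleates, ignore, window=5):
    """Count (nucleate, collocate, offset) triples within a window of each nucleate."""
    nucleates = set(nucleates)
    ignore = set(ignore)
    counts = Counter()
    for i, token in enumerate(tokens):
        if token not in nucleates:
            continue
        first = max(0, i - window)
        last = min(len(tokens), i + window + 1)
        for j in range(first, last):
            # Skip the nucleate itself and any word on the Ignore list.
            if j == i or tokens[j] in ignore:
                continue
            # The offset is negative for words to the left, positive to the right.
            counts[(token, tokens[j], j - i)] += 1
    return counts
```

For a text in which ‘catalytic’ regularly appears immediately before the nucleate ‘converter’, the triple (converter, catalytic, -1) would receive a high count, which is the kind of evidence from which an n-gram such as ‘catalytic converter’ can be proposed.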

Notes

Depending on the task ordering, certain suggestions may be made as to the words to be used. If the Choose Statistical Nucleates task is carried out prior to this task, those words which matched the required constraints will be suggested as Nucleates in the lexica dialog.

Any words placed in the Ignore part of the Lexica dialog will be suggested as Boundaries in the Lexica dialog of the Generate Orthographic Cues task if that task is performed subsequently. The reverse also holds: if Generate Orthographic Cues is performed first, its Boundaries will be suggested as Ignore words for this task.

10.Generate Orthographic Cues

Orthographic cues are generated by treating certain tokens as phrasal boundaries. By collecting the tokens that occur between these phrasal boundaries, we can generate a second set of terms to complement the set generated by purely statistical methods. Pressing the Multi-word terms only button restricts the results to terms made up of more than one word. The boundaries, and the phrases we wish to ignore, are set in the Lexica dialog obtained by pressing the Lexica button.
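The idea of collecting tokens between boundaries can be sketched in a few lines. The function name and parameters are hypothetical, the list of phrases to ignore is left out, and the sketch assumes the text has already been split into tokens.

```python
def orthographic_cues(tokens, boundaries, multi_word_only=False):
    """Collect the phrases that occur between boundary tokens."""
    boundaries = set(boundaries)
    phrases = []
    current = []
    # The trailing None is a sentinel that flushes the final phrase.
    for token in list(tokens) + [None]:
        if token is None or token in boundaries:
            if current and (len(current) > 1 or not multi_word_only):
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(token)
    return phrases
```

With the boundaries the, is and to, the tokens of "the catalytic converter is fitted to the exhaust" yield the phrases catalytic converter, fitted and exhaust; with multi_word_only set, only catalytic converter remains.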

Notes

Depending on the task ordering, if the Generate Collocational Cues task is chosen before this one, any Ignore words chosen in its Lexica dialog will be displayed as suggested Boundary words here. If this task appears before Generate Collocational Cues, the Boundary words will become suggested Ignore words for that task. Both sets can be edited.


11.Compare Cues

When the required tasks have been undertaken, the results can be either combined (default) or compared and built into ‘term sets’. The comparisons can be done in a number of ways.

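A simple comparison between two terms counts the words that occur in both terms. If the first term has w1 words, of which s1 also occur in the second term, and the second term has w2 words, of which s2 also occur in the first, the similarity score is (s1 + s2) / (w1 + w2).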

For example:

Term / No. words / No. similar / Score ( ( s1 + s2 ) / (w1 + w2 ) )
catalytic converter / (w1 =) 2 / (s1 =) 2 / 0.6666
4-way honeycomb catalytic converter / (w2 =) 4 / (s2 =) 2

The two terms have a similarity score of 0.6666. This scoring could also be carried out on the first N characters of each word, which would increase the probability of, for example, (English) plurals being counted together. A second method is to take N-character patterns. For example, taking every 3-character pattern within each word of the two terms:

Term / Patterns / No. patterns / No. similar / Score ( ( s1 + s2 ) / (w1 + w2 ) )
catalytic converter / cat ata tal aly lyt yti tic con onv nve ver ert rte ter / (w1 =) 14 / (s1 =) 14 / 0.7368
4-way honeycomb catalytic converter / 4-w -wa way hon one ney eyc yco com omb cat ata tal aly lyt yti tic con onv nve ver ert rte ter / (w2 =) 24 / (s2 =) 14

This gives the two terms a similarity score of 28 / 38 = 0.7368.
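Both scoring methods can be expressed with one function that works on lists of ‘units’, where a unit is either a whole word or an N-character pattern. The sketch below is one reading of the two worked examples, not the prototype's own code: the function names are hypothetical, terms are lower-cased before comparison, patterns never cross a word boundary, and words shorter than N are assumed to contribute no patterns.

```python
def word_units(term):
    """Split a term into lower-cased words."""
    return term.lower().split()

def ngram_units(term, n=3):
    """Return every n-character pattern found within each word of the term."""
    units = []
    for word in term.lower().split():
        # Patterns stay inside a word; words shorter than n yield nothing.
        units.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return units

def similarity(units1, units2):
    """Score two unit lists as (s1 + s2) / (w1 + w2)."""
    set1, set2 = set(units1), set(units2)
    s1 = sum(1 for u in units1 if u in set2)  # units of term 1 also found in term 2
    s2 = sum(1 for u in units2 if u in set1)  # units of term 2 also found in term 1
    total = len(units1) + len(units2)         # w1 + w2
    return (s1 + s2) / total if total else 0.0

a = "catalytic converter"
b = "4-way honeycomb catalytic converter"
word_score = similarity(word_units(a), word_units(b))        # (2 + 2) / (2 + 4)
pattern_score = similarity(ngram_units(a), ngram_units(b))   # (14 + 14) / (14 + 24)
```

Here word_score and pattern_score correspond to the values 4 / 6 and 28 / 38 from the two tables.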

Using this type of information, we can create a ‘cluster’ of terms around a single chosen term by taking, for example, those which have a similarity score above 0.65 (or above 0.8 as shown in the example dialog below).
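A cluster of this kind can be sketched by scoring every candidate against the chosen term and keeping those above the threshold. The function names are hypothetical, the word-based score is used for brevity, and the candidate list is taken from the drugs example discussed in this section purely for illustration; it is not output from the prototype.

```python
def word_overlap_score(term1, term2):
    """Word-based similarity score (s1 + s2) / (w1 + w2), ignoring case."""
    words1, words2 = term1.lower().split(), term2.lower().split()
    set1, set2 = set(words1), set(words2)
    s1 = sum(1 for w in words1 if w in set2)
    s2 = sum(1 for w in words2 if w in set1)
    total = len(words1) + len(words2)
    return (s1 + s2) / total if total else 0.0

def cluster_around(seed, candidates, threshold=0.65):
    """Return the candidates whose score against the seed is above the threshold."""
    return [term for term in candidates
            if term != seed and word_overlap_score(seed, term) > threshold]

candidates = [
    "1920 Dangerous Drugs Act",
    "Dangerous Drugs Act 1920",
    "Drugs Act",
    "1971 Drugs Act",
]
strict = cluster_around("Dangerous Drugs Act", candidates, threshold=0.8)
loose = cluster_around("Dangerous Drugs Act", candidates, threshold=0.65)
```

With the threshold at 0.8, only the two variants containing 1920 (score 6 / 7) are kept; Drugs Act scores exactly 4 / 5 and 1971 Drugs Act scores 4 / 6, so both join the cluster only at the lower threshold of 0.65.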

Using these tasks in combination, along with two files about the use of drugs, the following set of coordinated terms was produced.

The results can be improved by removing other tokens from the tasks, but the example already illustrates one particular case of coordinated usage.

In this case, Dangerous Drugs Act, 1920 Dangerous Drugs Act (partially visible) and Dangerous Drugs Act 1920 seem to be used synonymously within the documents. These are distinguished from a further entry, Drugs Act, which closely matches 1971 Drugs Act. These results will be of use in subsequent tasks.

Important Note: The nature of the algorithm used in this task makes the task very computationally intensive.

A further task is available that will allow interfacing between the various other tools. Primarily, this allows the user to create a KWIC (Key Word In Context) analysis for the various term candidates. It will eventually allow an interface to the Tracker system via the creation of a profile.

Pressing the Generate KWIC results button generates the KWIC results.

New KWIC format

KWIC information now contains details about term, frequency and task, plus the KWIC data and a reference to file/line number. This information will be useful in generating the final report from the system. It can also be used, by selecting the reference (e.g. highlighting the 0_38) with the mouse, to show the approximate location within the text of the located item.
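The pairing of context and reference can be sketched as below. The reference format is an assumption based on the 0_38 example, read here as file index followed by line number; the function name, the context width and the square brackets around the located item are also hypothetical, and the term, frequency and task details are omitted.

```python
def kwic(term, files, width=30):
    """Return (reference, context) pairs for each occurrence of term.

    files is a list of files, each given as a list of text lines.
    """
    results = []
    needle = term.lower()
    if not needle:
        return results
    for file_index, lines in enumerate(files):
        for line_number, line in enumerate(lines, start=1):
            lowered = line.lower()
            start = lowered.find(needle)
            while start != -1:
                end = start + len(needle)
                left = line[max(0, start - width):start]
                right = line[end:end + width]
                # Assumed reference format: <file index>_<line number>.
                reference = f"{file_index}_{line_number}"
                results.append((reference, f"{left}[{line[start:end]}]{right}"))
                start = lowered.find(needle, end)
    return results
```

Each reference identifies the file and line in which the item was found, which is what allows the approximate location of the item to be shown when the reference is selected.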

12.Technical details

The system is designed so that a certain amount of information can be supplied automatically. The options currently available are stored in a valtasks.ini file, which is placed in the user’s home directory on the first use of the tool. This allows each user to set default values for many of the tasks, thus reducing the amount of work required during a usage session. It also allows certain default files to be set so that, for example, Ignore words in the Generate Collocational Cues task can be set automatically for each session.

Localisation of the system for certain languages will be straightforward, although this will be governed to some extent by the Operating System on each machine.