January 23, 1999

Gr8 In4mation: An Orthographic Survey of Numbers Online

Primacy of numbers

(1) All computer representations of text are fundamentally numbers

(2) Numbers are part of search systems: numbering sets, reporting set size, limiting by years
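A minimal Python sketch of point (1), using only built-in functions. It shows that the characters of "Gr8" are stored as code points, so the digit in the middle is no more of a number to the machine than the letters around it. The variable names are illustrative only.

```python
# Each character in a string is stored as a number (its code point).
title = "Gr8"
code_points = [ord(ch) for ch in title]
print(code_points)        # [71, 114, 56]

# The digit character "8" is code point 56, not the integer 8.
print(ord("8") == 8)      # False
print(int("8") == 8)      # True: turning text into a number is an explicit step
```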

"People in the data processing community have gotten used to viewing things in a highly simplistic way, dictated by the kind of tools they have at their disposal. And this may suggest another wonderful irony. People are awed by the sophistication and complexity of computers, and tend to assume that such things are beyond their comprehension. But that view is entirely backwards! The thing that makes computers so hard to deal with is not their complexity, but their utter simplicity." (Kent, 1978, p. viii)

** Establish database systems at the beginning and give the survey results in brackets throughout the essay.

These are bibliographic databases – focus is not on databases of numbers.

Parts of a bibliographic record that are numeric:

>toxline
>datastar

YF Field (Publication Year): You can search by publication year by specifying the entire year (e.g., 1991), or by using only the last two digits of the year (e.g., 91). In the YF field, the year appears as four digits. Note that there are currently over 3,000 records with publication year = 0000 (i.e., the publication year is unknown). Records go back at least as far as 1932. So by 2032, there may be some confusion when entering '32' to indicate the publication year.
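The two-digit ambiguity can be sketched in a few lines of Python. The function name and the default range are assumptions for illustration: the range simply spans the earliest year mentioned above (1932) to the year in which the note expects trouble (2032). Nothing here reflects how the host system actually expands a two-digit year.

```python
def candidate_years(two_digit, first_year=1932, last_year=2032):
    """Return every four-digit year in [first_year, last_year] that ends in two_digit."""
    suffix = int(two_digit)
    return [y for y in range(first_year, last_year + 1) if y % 100 == suffix]

print(candidate_years("91"))  # [1991]
print(candidate_years("32"))  # [1932, 2032]
```

Once the file holds records from both ends of the range, a two-digit entry such as 32 no longer identifies a single year.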

The Y2K problem as a representation of numbers problem

“The Bugs in Your Future” Wired January 1999, 7.01, pp. 76 – 77

January 1, 1999 – programs that use “99” as a sentinel value (for example, to indicate that no year value was available for a given database entry) start treating everyday dates as special cases.

August 22, 1999 – GPS software rolls over its week counter for the first time. .. with the system dating from January 5, 1980, the rollover has never been tested live before.

September 9, 1999 – End-of-File Bug (Part 1) Programs that use “9999” as an end-of-file marker may mistake the date 9/9/99 as an end of file.

January 1, 2000 – Even if 85% of Y2K-prone applications are fixed, about 1.7 million will still fail next New Year’s Day.

September 8, 2001 – End-of-file Bug (Part 2) Unix programs using 999,999,999 as an end-of-file marker confuse the data with the date.
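Two of the dates in this list can be checked with Python's standard datetime module. The sketch assumes the usual Unix epoch (seconds counted from 1970-01-01 UTC) and the GPS epoch of 00:00 UTC on 6 January 1980 (the evening of January 5 in U.S. time zones), with a week counter that wraps every 1024 weeks. Leap seconds are ignored, so the results are calendar dates rather than exact instants.

```python
from datetime import datetime, timedelta, timezone

# Unix time counts seconds from 1970-01-01 00:00:00 UTC.
nine_nines = datetime.fromtimestamp(999_999_999, tz=timezone.utc)
print(nine_nines.isoformat())               # 2001-09-09T01:46:39+00:00

# The same instant in a zone four hours behind UTC (U.S. Eastern Daylight Time).
edt = timezone(timedelta(hours=-4))
print(nine_nines.astimezone(edt).date())    # 2001-09-08

# A 10-bit week counter wraps after 2**10 = 1024 weeks.
gps_epoch = datetime(1980, 1, 6, tzinfo=timezone.utc)
first_rollover = gps_epoch + timedelta(weeks=2**10)
print(first_rollover.date())                # 1999-08-22
```

The time-zone step suggests why the article dates the second end-of-file bug to September 8: in U.S. time zones the count reaches nine nines on the evening before the UTC date changes.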

Numbers as identifiers

"A person might be identified by a social security number, employee number, membership number in various organizations, military service number, various account or policy numbers (strictly speaking, these latter don't identify him, but something he's related to; on the other hand, you might also say that about a social security number). A department may have a name (Accounting) and a number (Z99). A book has a title, a Library of Congress number, an ISBN (International Standard Book Number), not to mention various Dewey decimal identifiers in local library catalogs. And each copy of a book may have an 'accession number', assigned locally by a library for their overall inventory management." (Kent, 1978, p.43).

Mapping of the number system of one database to another: Standard Industrial codes. Dialog’s Map Command.

Patent databases?

** Importance of tokenization and normalization. Numbers and punctuation push normalization routines to the limit.

The various ways of Writing Numbers

How to write a number: 1, 1.0, one, I, unity, 3 –2, etc.

Forms of words describing numbers

“…billion, trillion and such jocular coinages as jillion, skillion, zillion…” Hurford, 1987, p. 44

How does one pluralize a number? 1s

Adding an ‘s’ for a plural form changes the number/letter name of a tool or weapon

dialog

defense newsletters

Another example (here I am looking for articles about the U.S. Navy's ES-3 aircraft):

? s ES()3/ti

This fails to capture a record which has the title, "ES-3s Play Increasing Role in Carrier Operations."

This is not a huge problem, for more often than not, both singular and plural forms exist in the record; I did however find some instances where either only the singular or only the plural form existed in the record. This is also worthy of mention because to someone familiar with these and other aircraft, adding the "s" to the search argument doesn't really make sense (at least initially). It is one thing to see the plural form in text, quite another to search for an "HH-65AS" - there is no such thing. The letter designators at the end have meaning, they indicate the "model" or "version" of the aircraft, and it requires a shift in thinking so as to accommodate the computer.
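A rough Python sketch of why the plural slips through. It assumes a simplified indexing rule (split on anything that is not a letter or digit, then upper-case) and a simple consecutive-token test for the () operator. The real Dialog rules are not documented here, so treat both functions as hypothetical stand-ins.

```python
import re

def tokenize(text):
    """Split on anything that is not a letter or digit, and upper-case the pieces."""
    return [t.upper() for t in re.findall(r"[A-Za-z0-9]+", text)]

def adjacent(tokens, phrase):
    """True if the phrase tokens occur consecutively, in order, in tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

title = tokenize("ES-3s Play Increasing Role in Carrier Operations.")
print(title[:2])                       # ['ES', '3S']
print(adjacent(title, ["ES", "3"]))    # False: the indexed token is 3S, not 3
print(adjacent(title, ["ES", "3S"]))   # True
```

Under this rule the plural "s" fuses with the digit into a new token, which is why the searcher has to ask for something that looks like a different model designator.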

How to search for a negative number?

Ancient forms of numbers/written numbers

There are no numerals to worry about at least. All numbers in the King James Bible appear as words. The form taken by these words (beyond the aforementioned lack of hyphens), however, can throw off the searcher unfamiliar with the period conventions. To take a famous example, if you were looking for references to the "Number of the Beast" and simply entered the query:

?s six(w)hundred(w1)sixty(w)six

You'd return zero hits. Finding these references will require either a much-modified search or the knowledge that the period form of this number is "six hundred threescore and six". Questions concerning the configuration taken by numbers are made more difficult by the fact that modern and archaic forms are used interchangeably. Alongside the 91 uses of the word 'threescore' are 13 of 'sixty' and the 35 'fourscore's alternate with 3 'eighty's etc. Just a point to keep in mind when searching for Biblical references to a number (a subject of great interest for certain types).
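Since both forms occur, a searcher may want to generate the archaic phrasing from the modern one and search for either. The Python sketch below covers only the two pairs mentioned above (sixty/threescore and eighty/fourscore) and assumes that a following unit is joined with "and", as in the example. The function name is hypothetical.

```python
ARCHAIC = {"sixty": "threescore", "eighty": "fourscore"}  # only the pairs noted above

def archaic_variant(phrase):
    """Rewrite a modern number phrase using 'score' forms where one is known."""
    words = phrase.lower().split()
    out = []
    for i, word in enumerate(words):
        if word in ARCHAIC:
            out.append(ARCHAIC[word])
            if i + 1 < len(words):      # a unit follows: join it with "and"
                out.append("and")
        else:
            out.append(word)
    return " ".join(out)

print(archaic_variant("six hundred sixty six"))  # six hundred threescore and six
print(archaic_variant("eighty"))                 # fourscore
```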

“But the use of an alternative notation for numerals is seldom, if ever, obligatory, and conventional orthographic forms exist. 365 can be written out three hundred and sixty five. The alternative notation can be seen as an efficient shorthand for the longer forms, although it is no doubt significant that such shorthands are especially common for numeral expressions. But there are other shorthands, such as e.g., i.e., &, +, @, £, =,%, in quite common use.” Hurford, 1987, p. 5

“It is interesting to note that the number 2 is never (standardly) named by an expression like one plus one, although the number 11 is, not surprisingly, often expressed as something like ten plus one.” Hurford, 1987, p. 8.

“…that the French numeral system has the remains of a 20-based system in the expression quatre-vingts. And then someone else will usually chip in with the information that in parts of French-speaking Belgium and Switzerland a purely decimal system with septante, octante, and nonante is found.” Hurford, 1987, p. 15.

Tokenizing and Normalizing Numbers

Query tokenization can make the target impossible to match, or parts of a number can conflict with parts of the query syntax.

Normalization of numbers strips them of meaning

geobase

Searching for numbers is difficult. It is impossible to determine what punctuation is between numbers.

A search for 0()0()0()0 brought the following results:

0-0-0-0

(0.0(0.0-5.0,P<0.05)

percentage of nickel Ni 0.0(0.0-0.1)

site densities 0.0, 0.0, 0.01

A search for 1()1()1()1 brought:

ratios of 1:0, 3:1, 1:1, 1:3, and 0:1
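The hits above are consistent with an index that throws away punctuation and keeps only runs of letters and digits. The Python sketch below assumes exactly that rule (it is a guess, not the documented behavior of the file) and checks each retrieved string for four adjacent tokens.

```python
import re

def tokenize(text):
    """Keep runs of letters and digits; every other character is a break."""
    return re.findall(r"[A-Za-z0-9]+", text)

def has_run(tokens, phrase):
    """True if the phrase tokens occur consecutively in tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

zero_hits = [
    "0-0-0-0",
    "(0.0(0.0-5.0,P<0.05)",
    "percentage of nickel Ni 0.0(0.0-0.1)",
    "site densities 0.0, 0.0, 0.01",
]
for hit in zero_hits:
    print(has_run(tokenize(hit), ["0"] * 4), hit)   # True for all four

ratio_hit = "ratios of 1:0, 3:1, 1:1, 1:3, and 0:1"
print(has_run(tokenize(ratio_hit), ["1"] * 4))      # True
```

Hyphens, decimal points, parentheses, commas, and colons all collapse to the same break, so the query cannot tell a score from a measurement from a ratio.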

Period as decimal point in a number. Also as a period in a number form such as vol. 21

mantis
datastar

The following abstract phrase is more troublesome:

AB ..."Vol. 21, No. 1, pp. 40-44..." ("No" is a stopword in this database.) 00008212.an

1_: vol adj '21'

Cannot be found. In this case the period abbreviating volume is recognized as ending a sentence.
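One way to picture this failure, as a Python sketch. It assumes that the indexer treats a period followed by whitespace as a sentence boundary and that adjacency is only satisfied within one sentence. Both rules are inferred from the behavior described above and are not taken from DataStar documentation.

```python
import re

def index_positions(text):
    """Return (sentence_no, word_no, token) triples.
    Assumption: a period followed by whitespace ends a sentence."""
    entries = []
    for s_no, sentence in enumerate(re.split(r"\.\s+", text)):
        for w_no, tok in enumerate(re.findall(r"[A-Za-z0-9]+", sentence)):
            entries.append((s_no, w_no, tok.lower()))
    return entries

def adj(entries, first, second):
    """True if `second` directly follows `first` within the same sentence."""
    seen = {(s, w) for s, w, t in entries if t == first}
    return any((s, w - 1) in seen for s, w, t in entries if t == second)

entries = index_positions("Vol. 21, No. 1, pp. 40-44")
print(adj(entries, "vol", "21"))  # False: the period after Vol starts a new sentence
print(adj(entries, "40", "44"))   # True: the hyphen is only a word break
```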

European numbers vs American numbers

(art literature international)

The use of a superscript in the name of the art movement art n makes for a challenging item to search for. Trying for adjacency with the query:

?s art()n

There were no results. Expanding to look at the basic index around such a search resulted in the monstrous term:

1 ART ) N, INC., ARTISTS' GROUP

Going backwards and shooting queries at the index gave only error messages stating that the "parentheses do not balance."

>datastar
>dissertation abstracts

The UMI print edition of Dissertation Abstracts has allowed upper and lower cases, subscript and superscript elements, italics, script letters, and other symbols in scientific notation since 1989. UMI editors transcribe certain combinations of these elements in various ways for consistency and clarity and explain these conventions in the special guide, "Information on Dissertations in the Sciences," included in the print edition. The DataStar version does seem to respect some of these transcriptions. However, DataStar itself also adds another layer of transcription conventions that make some combinations unsearchable.

For example, UMI distinguishes lower case letters from upper case letters in mathematical equations by changing lower case to upper case, but enclosing it with apostrophes. The UMI transcription of the equation, X = X(x,t) is X = X (`X',T). This does retrieve 84 records with the argument x adj 'x' adj t, but DataStar would ignore the quote marks and all upper and lower cases. So the entry in the DataStar version loses the Dissertation Abstracts refinements to distinguish upper and lower cases where these are significant.

However, in the abstracts DataStar displays its transcription codes (mixtures of the delimiter $, back slashes, curly brackets, and other symbols) along with the actual transcribed content. This creates "words" such as $q(x)$ and strings such as the equation ${\rm log(\mu)} = {\bf x\sp{T}\beta }, $. Figuring out which of the DataStar versions of a desired expression is reproduced in a given abstract is beyond the patience of most searchers.

--- dates with ? for uncertainty --- the ? conflict with dialog’s prompt

>biography master index
Numbers: Punctuation is stripped out of numbers with the exception of hyphens. (We did not find an example of ampersands embedded within numbers and therefore make no claims on this.) As a result, decimal points and commas should be excluded from queries. For example, a record containing the birth and death dates of a particular individual may look like this:

>Smith, Henry 1545?-1649.

(We suspect the question-mark indicates indexer uncertainty.) Queries finding this record would not include the question-mark as the question-mark is not indexed. A question-mark in a query would either be interpreted as a truncation, e.g. 47=>f 1545?, or a wild-card, e.g. 48=>f 1545?-1649.
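The clash between the indexed text and the query syntax can be sketched in Python. The semantics assumed here are a guess for illustration: a trailing question mark is open truncation, and an embedded question mark stands for exactly one character. The function names are hypothetical.

```python
import re

def index_term(text):
    """Strip punctuation except hyphens, as described above."""
    return re.sub(r"[^A-Za-z0-9\- ]", "", text)

def query_to_regex(query):
    """Assumed semantics: trailing '?' = truncation, embedded '?' = one character."""
    truncated = query.endswith("?")
    body = query[:-1] if truncated else query
    pattern = "".join("." if ch == "?" else re.escape(ch) for ch in body)
    return re.compile(pattern + (".*" if truncated else ""))

term = index_term("1545?-1649.")
print(term)  # 1545-1649
for q in ["1545-1649", "1545?", "1545?-1649"]:
    print(q, bool(query_to_regex(q).fullmatch(term)))
# 1545-1649 True
# 1545? True
# 1545?-1649 False
```

Under these assumptions, copying the date exactly as displayed is the one form that fails, because the wildcard demands a character that the index no longer contains.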

Roman numerals

dialog
(quotation database)

This was a relatively harmless error, but a more insidious orthographic error is the difference between a citation for "bk.I" in which the first book is represented by the Roman numeral (essentially, the letter "I") and a citation for "bk.1" in which the first book is represented by the Arabic numeral. The use of Roman numerals (I, II, III, etc.) and Arabic numerals (1, 2, 3, etc.) in the descriptions of books, volumes, parts, chapters, etc. seems to be about equal: most often, the first category is a Roman numeral, followed by Arabic ("bk.I, vol.1, ch.1"). However, this pattern does not always hold true and, in fact, sometimes no descriptions are used at all (I, 12, 150). It's necessary to know how the source is described if you're searching based on these criteria. A search for

s bk()i/nt

would not find a record in which the Note field included the phrase "bk.1."
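If the index normalized these citations, both forms would retrieve the same records. The Python sketch below is one hypothetical way to do that: it converts a Roman numeral to an integer and rewrites it only when it directly follows one of a few abbreviations (bk., vol., ch., pt.). The abbreviation list is an assumption, and nothing suggests Dialog does anything like this.

```python
import re

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(numeral):
    """Convert a Roman numeral, subtracting a symbol that precedes a larger one."""
    values = [ROMAN[ch] for ch in numeral.upper()]
    total = 0
    for i, v in enumerate(values):
        if i + 1 < len(values) and v < values[i + 1]:
            total -= v
        else:
            total += v
    return total

def normalize_citation(text):
    """Rewrite e.g. 'bk.I' as 'bk.1'; numerals elsewhere are left alone."""
    def repl(match):
        return f"{match.group(1)}{roman_to_int(match.group(2))}"
    return re.sub(r"\b((?:bk|vol|ch|pt)\.)([IVXLCDM]+)\b", repl, text,
                  flags=re.IGNORECASE)

print(normalize_citation("bk.I, vol.1, ch.1"))  # bk.1, vol.1, ch.1
print(normalize_citation("bk.XIV"))             # bk.14
```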

Confusion of numbers/words = zero and “Oh”

agricola

Another number-related problem in Dialog will undoubtedly stump even the most prepared database user. Scanning the basic index reveals that "0" (the number zero) sometimes stands in the place of "O" (the letter O):

E16 1 0BSERVATIONS

E23 2 0CTOBER

E48 2 0LD
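Entries like these are easy to flag mechanically. The Python sketch below uses a deliberately narrow, assumed rule: a term is suspect if it is a zero followed only by capital letters. Real index terms would need a dictionary check before any correction, so the repair step is purely illustrative.

```python
import re

def suspect_zero_for_o(term):
    """Flag index terms that are the digit 0 followed only by capital letters."""
    return re.fullmatch(r"0[A-Z]+", term) is not None

def repaired(term):
    """Swap the leading zero for the letter O when the term looks suspect."""
    return "O" + term[1:] if suspect_zero_for_o(term) else term

for term in ["0BSERVATIONS", "0CTOBER", "0LD", "0.5MM", "007"]:
    print(term, "->", repaired(term))
# 0BSERVATIONS -> OBSERVATIONS
# 0CTOBER -> OCTOBER
# 0LD -> OLD
# 0.5MM -> 0.5MM
# 007 -> 007
```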

>ei compendex
7. In the CN field (as in the ID field), periods are honored. However, this isn't necessarily true in other fields.
?s 804.2
S25 2 804.2
------

>419-008
>foodline
>datastar
>numbers, punctuated

>numbers, formulas

>punctuation, comma

Interestingly, a search for '0' resulted in 44,358 records and a search for '000' resulted in 21 records. Let me qualify the preceding sentence by stating that this phenomenon is interesting to humans only, not computers. The fact that the number zero could be searched using one, two, three or more "zero characters" seems silly from a human perspective because, simply, zero is zero. The thought might never occur to a novice searcher to look for numbers in this fashion. Of course, to a computer, `0' and `000' are quite different and distinct words.

1_:'000' results in 21 records

One citation (FOST Accession #0000384609) included the phrase "20 000" and another citation (FOST Accession #0000372387) contained the phrase "Petrothene NA 214-000 Resin." Therefore, a search for "000" may result in numbers or parts of formulas or titles.

2_:`250' adj `000' results in 1 record

This citation (FOST Accession #0000357359) includes the phrase "250 000." One might conclude that this type of search will work for any number larger than 999. However, a search for:

3_:`14' adj `000' results in 0 records

4_:`14000' results in 26 records

Of these records, one included the word "14000" (without a comma) while another included the word "14,000" (with a comma). (FOST Accession #0000418881 and #0000387345) Searching for exact numbers in the FOST database has proven to be quite arbitrary.
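Given this arbitrariness, a searcher probably has to try every surface form of a number. A small Python sketch that generates the three forms seen in these records (unpunctuated, comma-grouped, and space-grouped). The function name is hypothetical, and how each form must then be phrased as a query (for example with adj between the groups) still depends on the file.

```python
def number_variants(n):
    """Return the written forms of integer n seen in the examples above."""
    plain = str(n)
    with_commas = f"{n:,}"
    with_spaces = with_commas.replace(",", " ")
    # dict.fromkeys removes duplicates while keeping order (n < 1000 has one form)
    return list(dict.fromkeys([plain, with_commas, with_spaces]))

print(number_variants(14000))   # ['14000', '14,000', '14 000']
print(number_variants(250000))  # ['250000', '250,000', '250 000']
print(number_variants(250))     # ['250']
```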

Numbers treated as phrases

>datastar
>pais
Sometimes, though, the period is retained:
PAIS 16_: a.14
AN 961108500 961216.

SD (Sales no. E.96.II.A.14) (UNCTAD/DTCI/32).

The SD (Series Description) field seems to contain many, many examples of numbers and letters joined with punctuation; how these are indexed is rather unpredictable. DataStar claims to drop all punctuation except decimal points in the middle of numbers; this obviously is not always true. Decimal points at the beginning of numbers also seem to be indexed, even when immediately preceded by a letter; and hyphens are sometimes indexed.

>102-024
>agricola
>dialog
>numbers, punctuated
>fields, different rules across
>punctuation, retained in field

Note that a searcher might be confused further if she notices that number punctuation is retained in hard phrases:

?s '0.70 disease ratio'/id

S6 1 '0.70 DISEASE RATIO'/ID

>411-005
>book review index
>dialog
?S "1,2,3"/TI
S53 0 "1,2,3"/TI
?S 1, 2, 3
S55 5 1, 2, 3

Numerical aspects of Dialog and Datastar themselves

- Calibrating truncation or wildcards

- Making back references to sets. Numbers to label search sets. Note that DataStar defaults to a set label and, if that fails, to a text number.

Creation of forms and crosstabulations online

Numbers as corporate names

>112-005
>world reporter

>dialog

>names, corporate

>names, with embedded numbers

>numbers, spelled out

>punctuation, hyphen

Numbers can be searched in Dialog, but they may appear as numbers or they may appear spelled out as words:

? s 7 () 11/co

30 7/CO

4 11/CO

S1 0 7 () 11/CO

? s 7 () eleven/co

30 7/CO

33 ELEVEN/CO

S2 29 7 () ELEVEN/CO

? s seven () 11/co

105 SEVEN/CO

4 11/CO

S3 0 SEVEN () 11/CO

? s seven () eleven/co

105 SEVEN/CO

33 ELEVEN/CO

S4 4 SEVEN () ELEVEN/CO

This company also has an embedded hyphen in its name, which gets stripped out in word-indexing but reinstated in phrase-indexing:

? e co=7-eleven

Ref Items Index-term

E1 1 CO=50 OFF STORES INCORPORATED

E2 2 CO=600 GROUP PUBLIC LIMITED COMPANY

E3 0 *CO=7-ELEVEN

E4 29 CO=7-ELEVEN CO.

E5 6 CO=7TH LEVEL INCORPORATED

? e co=seven-eleven

Ref Items Index-term

E1 84 CO=SEVEN NETWORK LIMITED

E2 3 CO=SEVEN SEAS PETROLEUM CORPN

E3 0 *CO=SEVEN-ELEVEN

E4 4 CO=SEVEN-ELEVEN JAPAN CO LIMITED

E5 1 CO=SEVEN-UP BOTTLING CO PLC (NIGERIA)
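The four adjacency searches tried earlier in this section can be generated mechanically rather than typed one by one. The Python sketch below assumes a small hand-made table of spellings (only the forms actually tried above) and builds every combination in the same command syntax. The function and table names are hypothetical.

```python
from itertools import product

# Only the digit/word pairs tried in the searches above.
SPELLINGS = {"7": ["7", "seven"], "11": ["11", "eleven"]}

def adjacency_queries(parts, field="co"):
    """Build one adjacency search per combination of spellings."""
    options = [SPELLINGS.get(part, [part]) for part in parts]
    return [f"s {' () '.join(combo)}/{field}" for combo in product(*options)]

for query in adjacency_queries(["7", "11"]):
    print(query)
# s 7 () 11/co
# s 7 () eleven/co
# s seven () 11/co
# s seven () eleven/co
```

Only two of the four combinations retrieved anything in the transcript above, which is the argument for trying them all.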

Numbers inside words

A10tion - attention
Gr8 – great

“H4ck1ng for g1rl13z” at http:

Dialog example of 4-6 is broken into two words.

>423-006
>dialog
>art literature international
>words, spelling errors
>numbers, plus letters - punctuated
>punctuation, retained in field
>punctuation, apostrophe
An example of a typo being indexed is the string of terms below:

E5 1 THE'70S

The embedded single quote is part of the string of characters and needs to be used when searching for that thing the computer thinks is a word.

?s "the'70s"

S19 1 "THE'70S"

This is the title that the string is pulling from: "Bookmaking in the'70s; redefining the artist's book."

>103-029
>art literature international
>dialog
>numbers, punctuated
>punctuation, retained
>punctuation, stripped and used to break words

>punctuation, hyphen

>punctuation, colon

>numbers, standard

Numbers are treated the same as words in Dialog database 191, and are broken on the same punctuation. Numbers do not have to be literalized, unless they can be confused with search statements.

**The phrase 6:9-11 (as in the Bible verse Revelation 6:9-11) retains the hyphen, but the colon is stripped out and an adjacency operator must be used.

?s 6()9-11

Record #0126780
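The rule observed for this file can be written as a one-line tokenizer in Python. The rule is an assumption drawn from this single example: a hyphen stays inside a token, and any other punctuation breaks it. Other files evidently break on the hyphen as well.

```python
import re

def tokenize(text):
    """Assumed rule for this file: hyphens stay inside a token, other punctuation splits it."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)

indexed = tokenize("6:9-11")
print(indexed)                         # ['6', '9-11']

# The working query names the same two tokens, joined by the adjacency operator.
print("6()9-11".split("()") == indexed)  # True
```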

Numbers beginning words

Sorting

(dissertation abstracts)

Scanning Dialog's basic index and Datastar's dictionary file reveals numerous index entries that have a low probability of ever being targeted by a query. Here, for example, is the result of a root command in Datastar starting at "0":

0.0L

0.02V

0.2-2-3-32REV

0.30S

0.33PER

0.5MM

"0.33PER" refers to thirty-three cents (per unit), "0.5MM" refers to a precise metric measurement, "0.02V" refers to voltage, and so on. It might be interesting to try to determine what percentage of the Dialog and Datastar index space is wasted, how much network bandwidth is wasted, and how much searcher time is wasted in dealing with the indexes as they now stand.

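Most of the entries in the root listing above are a number fused to a unit or word. A Python sketch of how an indexer might split them back apart. The pattern is an assumption that covers only the simple "decimal number plus capital letters" shape, and entries that do not fit are returned as None rather than guessed at.

```python
import re

# A decimal number immediately followed by capital letters, and nothing else.
ENTRY = re.compile(r"(\d+(?:\.\d+)?)([A-Z]+)")

def split_entry(term):
    """Return (number, suffix) for terms like '0.02V', or None if the shape differs."""
    match = ENTRY.fullmatch(term)
    return (float(match.group(1)), match.group(2)) if match else None

for term in ["0.0L", "0.02V", "0.2-2-3-32REV", "0.30S", "0.33PER", "0.5MM"]:
    print(term, split_entry(term))
# 0.0L (0.0, 'L')
# 0.02V (0.02, 'V')
# 0.2-2-3-32REV None
# 0.30S (0.3, 'S')
# 0.33PER (0.33, 'PER')
# 0.5MM (0.5, 'MM')
```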
>301-008
>numbers, homonymic use
>words, spelling errors
This record shows two different spellings for the same man. The TI field spells his name "2pac" while the SU field spells his name "Tupac."
NO: BBIO94018201
AU: Hamilton, Kendall.

TI: Double trouble for 2pac.

SO: Newsweek v. 124 (Dec. 12 '94) p. 62-3

PH: p. 62-3 : pors.

IS: 0028-9604

PB: H. W. Wilson Co.

PL: United States

PD: 1994

RT: art

AC: biography

SU: Shakur, Tupac, rap musician and actor.

>301-002

>numbers, and letters

>numbers, homonymic usage

Numbers as Text

Numbers as words

2 - to

4 - for

8 - ate

4 2sday night - for Tuesday night

Here is an entertaining record containing a word made up of a number, a hyphen, and letters:

NO: BBIO85000760

AU: Donahue, Deirdre.; Kelley, Jack.; Schindehette, Susan.

TI: In the golden afterglow 10-acious Mary Lou Retton attacks the rest of

her life.

Numbers ending words

Numbers as words

Dialog can use single numbers such as 4, but deconstructs compound numbers such as 1,234

“In Beeptalk, The Words Add Up,” NY Times Wed April 29, 1998.

Numbers as money

>109-040
>datastar
>pais
ROOT *L$
R1 1 DOC *L0.25
R2 2 DOCS *L0.28
R3 6 DOCS *L0.3
The '*L' seems to denote English pounds.

Numbers as time

publishing house on the Internet. Its name? "00h00", or "Zero Heure", because he knows very well that he is taking off into new, uncharted territory.