IN350 Exam Key – Fall 2000 – open-book exam.

(This is only a general guideline to the answers; full answers would be more
precise. Alternative answers with good explanations were also acceptable.)


Zipf's law - the frequency of occurrence of some event is a function
of its rank, where the frequency distribution is a power-law
function and the exponent is close to unity (that is, 1). That
is, the frequency of occurrence of words in a document follows
this function: it decreases roughly in inverse proportion to the
rank order of the word. To designers this means a few
hundred words make up about 50% of the text. Stopword lists
can then be used to make indexes smaller. To users, this means
faster response.
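The "few hundred words cover about half the text" claim can be checked with a short calculation. This is a minimal sketch, assuming an idealized Zipf distribution with exponent exactly 1 and a vocabulary of 100,000 distinct words; both numbers and the function names are illustrative assumptions, not values from the course text.

```python
def zipf_frequencies(vocab_size, exponent=1.0):
    """Relative frequency of the word at each rank 1..vocab_size,
    assuming frequency is proportional to 1 / rank**exponent."""
    weights = [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def words_needed_for_coverage(freqs, coverage):
    """Smallest number of top-ranked words whose frequencies sum to `coverage`."""
    running = 0.0
    for count, freq in enumerate(freqs, start=1):
        running += freq
        if running >= coverage:
            return count
    return len(freqs)


# Assumed vocabulary size; real collections vary widely.
freqs = zipf_frequencies(vocab_size=100_000)
top_words = words_needed_for_coverage(freqs, coverage=0.5)
print("words covering half the text:", top_words)
```

Under these assumptions the count comes out in the hundreds, which is why a short stopword list can remove a large share of index entries.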

Heaps' law - the number of distinct words (the vocabulary) in a document
grows sublinearly with increasing size of text. (It looks like the Zipf
distribution transposed.) To designers, most words are found in the first
volume of an encyclopedia, for example, so the index will not grow so fast
after a certain base sample. Heaps' law also implies that there are
more unique shorter words than longer ones (by word length),
but the average word length of all words in the text is
constant because a greater number of shorter words
occur in the text than longer ones.

To users, the law is significant because it shows there is a bound
on how much compression you can get (affecting response times).
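Heaps' law is usually written V = K * n^beta, where n is the text size in words and V is the vocabulary size. The sketch below only evaluates that formula; the constants K = 30 and beta = 0.5 are assumed for illustration (typical values depend on the collection), and the function name is hypothetical.

```python
def heaps_vocabulary(text_size, k=30.0, beta=0.5):
    """Estimated number of distinct words in a text of `text_size` words,
    using V = k * n**beta. The defaults k and beta are assumed values."""
    return k * text_size ** beta


# Each tenfold increase in text adds far fewer than ten times the new words.
for n in (10**5, 10**6, 10**7):
    print(n, round(heaps_vocabulary(n)))

# Doubling the text multiplies the vocabulary by only 2**beta.
growth_on_doubling = heaps_vocabulary(2_000_000) / heaps_vocabulary(1_000_000)
```

With beta below 1 the vocabulary, and so the index dictionary, keeps growing but ever more slowly relative to the text.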


This was an open-book, open-note exam, and the book has a chart and
discussion describing the main compression techniques. The alternatives
are a. Arithmetic, b. Character Huffman, c. Word Huffman, d. Ziv-Lempel.
The best answer would be c. Word Huffman, but students can argue otherwise
if they like, and we accept valid, well-reasoned alternatives. A chart in the
book is as follows.

                           a    b    c    d
Compression ratio          vg   p    vg   g
Compression speed          s    f    f    vf
Decompression speed        s    f    vf   vf
Memory space               l    l    h    m
Compressed pattern match   n    y    y    y
Random access              n    y    y    n

vg = very good, p = poor, g = good, s = slow, f = fast, vf = very fast,
l = low, h = high, m = medium, n = no, y = yes

For effective operation in an IR environment a compression method

should satisfy the following requirements: good compression ratio,

fast coding, fast decoding, fast random access without the need

to decode from the beginning, direct searching without the need

to decompress the compressed text. (They should discuss how their

choice deals with these.)

Compression ratio: 2 bits per character is vg, 30-45% compression

ratio is good, over 45% is poor.

Compression speed: arithmetic coding is complex and therefore slow. The
Huffman methods need two passes, so they are not as dynamic as Ziv-Lempel,
but they are not far behind.

Decompression speed: word Huffman and Ziv-Lempel are equally fast. Word Huffman is

more efficient than character huffman. arithmetic is again slow from


Memory space: depends on the size of the vocabulary and of the text held in
tables containing strings. Word methods take more space than character
methods. Ziv-Lempel saves on repeated strings. Arithmetic does
not use words.

Compressed pattern matching (direct access to compressed text): see the
table. Huffman allows decompression to start anywhere. Arithmetic
and Ziv-Lempel must start from the beginning. Word Huffman also has methods
for searching on compressed text.

Some research algorithms for Ziv-Lempel allow searching on compressed
text. That is 2x faster than decompressing and then searching, but
slower than searching on decompressed text.

Huffman codes are the method of choice in full-text retrieval, where both
speed and random access are important.
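To make the word Huffman idea concrete, here is a minimal sketch that treats whole words as the symbols, builds a Huffman code from their frequencies, and round-trips a sentence. It is an illustration only: it uses bit strings rather than packed bytes, ignores separators and punctuation, and the function names and sample sentence are assumptions, not the book's algorithm.

```python
import heapq
from collections import Counter
from itertools import count


def word_huffman_codes(words):
    """Build a prefix-free code (word -> bit string) from word frequencies."""
    freqs = Counter(words)
    tiebreak = count()  # unique counter so heap entries never compare dicts
    heap = [(f, next(tiebreak), {w: ""}) for w, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # a one-word vocabulary still needs a 1-bit code
        return {w: "0" for w in heap[0][2]}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


def decode(bits, codes):
    """Decode a bit string by matching code words from left to right."""
    by_code = {c: w for w, c in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in by_code:
            out.append(by_code[current])
            current = ""
    return out


words = "to be or not to be that is the question to be".split()
codes = word_huffman_codes(words)
encoded = "".join(codes[w] for w in words)
decoded = decode(encoded, codes)
```

Because each word maps to one fixed code word, a query word can be encoded once and searched for in the compressed text, and decoding can begin at any code word boundary, which is the random access property the table credits to the Huffman methods.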


Write an index. This one below works.

CREATE INDEX standards_idx1

ON standards (abstract etx_doc_ops) USING etx(



STOPWORD_LIST= 'my_stopwords',



IN sblobspace;

See there are 8 lines. Line 3 is optional. It is to support Pattern

searching which is not used in the example select statement.

Line 4 is required for our example. Here the phrase support can


Line 5 and 6 are optional. The example does not use stopwords.

Q3b. Explain output.

The search engine returns hits for documents that contain the

phrase with all the words in the clue, and in the exact order.

No misspellings or transpositions in words are allowed. There

is no pattern matching in the select statement. They do not have

to draw the output table. But it does return fields group_no,

s_name and abstract. In the abstract field there is a file name

that is identified by its full local file system path name.

It is a pointer to the file that contains the phrase and not

the actual document.

Q3c. Explain why sql-statement is not supported in AppPages.

(I said answer in a sentence or two, so they can just have the

spirit of this response.)

The abstract column does not actually contain the search text
but a pointer of data type LLD_Locator to the operating system
file specified by the Insert statement. The etx index contains
the search text stripped of all formatting information. The AppPage
Web DataBlade Module does not access the etx index. Maybe the
index is internal to the local file system and is not made
available to the Web DataBlade module. (By analogy,
my files on the hsm file system are not available on the web;
only the files under my html directory are.)


D = 2.1 log10(8x10^8) = 2.1(8.903) = 18.7, or about 19

D = 2.1 log10(8x10^10) = 2.1(10.903) = 22.9, or about 23

D = the number of clicks between any 2 documents on the internet
if the number of documents is N.
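The two calculations above can be reproduced in a few lines. This sketch assumes the formula as the key uses it, D = 2.1 * log10(N); the function name is hypothetical.

```python
import math


def expected_clicks(num_documents, coefficient=2.1):
    """Average number of clicks D between two documents, D = coefficient * log10(N)."""
    return coefficient * math.log10(num_documents)


d_now = expected_clicks(8e8)      # 8 x 10^8 documents
d_grown = expected_clicks(8e10)   # 100 times more documents
print(round(d_now, 1), round(d_now))
print(round(d_grown, 1), round(d_grown))
```

The point of the exercise is that a hundredfold increase in documents adds only about 4 clicks, since D grows with the logarithm of N.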

To narrow a hit list there are different suggestions for handling

queries and steps that could be taken.

Starting points for a query type:

specific query - look in an encyclopedia, use a library.

broad query - start with web directories.

vague query - use web search engines, refine query on the relevant


Steps: 1. teach the user to make better queries, teach them about boolean logic,

2. use web directories to start on a subject,

3. use ranking and refine searches to narrow search,

4. use metasearch engines to compare results from search engines.

Q5. This question can be answered by filling in the tables A and B.

For table A, Interpolated precision is exactly the same as the precision

column, 100, 67, 60, 57, 36. The 3pt average is 217/3=72.33

The R-Precision is 3/5= .6 (That is there are 5 relevant documents.

By the time you get to the 5th rank position you have encountered

3 relevant documents. Thus, 3/5.)

For table B, Interpolated precision is: 100, 50, 50, 50, 50. 3pt

average = 200/3 = 66.67 (a little worse than A.)

The R-Precision is 1/5=.2 (That is much worse than A.)
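The figures for both tables can be recomputed with a short script. The exam tables themselves are not reproduced in this key, so the rankings below are assumptions: relevant documents at ranks 1, 3, 5, 7 and 14 reproduce the precision column quoted for table A, and the ranking used for table B is just one hypothetical ordering that yields the quoted interpolated values. The three recall levels (20%, 50%, 80%) are also assumed.

```python
def recall_precision_points(relevant_ranks):
    """(recall, precision) at the rank of each relevant document."""
    ranks = sorted(relevant_ranks)
    total = len(ranks)
    return [(i / total, i / rank) for i, rank in enumerate(ranks, start=1)]


def interpolated_precision(points, recall_level):
    """Highest precision observed at any recall >= recall_level."""
    candidates = [p for r, p in points if r >= recall_level - 1e-9]
    return max(candidates) if candidates else 0.0


def three_point_average(points, levels=(0.2, 0.5, 0.8)):
    """Mean interpolated precision at three recall levels (levels are assumed)."""
    return sum(interpolated_precision(points, lv) for lv in levels) / len(levels)


def r_precision(relevant_ranks):
    """Fraction of the top R results that are relevant, R = number of relevant docs."""
    r = len(relevant_ranks)
    return sum(1 for rank in relevant_ranks if rank <= r) / r


table_a = [1, 3, 5, 7, 14]   # assumed ranks of the 5 relevant documents
table_b = [1, 7, 8, 9, 10]   # hypothetical ranking matching the key's values

avg_a = three_point_average(recall_precision_points(table_a))
avg_b = three_point_average(recall_precision_points(table_b))
print(round(100 * avg_a, 2), r_precision(table_a))
print(round(100 * avg_b, 2), r_precision(table_b))
```

The average for table A comes out a few hundredths above the key's 72.33 because the key sums the rounded percentages 100, 60 and 57, while the script keeps 4/7 unrounded.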

Q6. They need to draw the tables. I cannot draw them here.

There should be a line between the Order-Number and Order-Number in

tables 1 and 2. There should be a line between the Part-Number and

Part-Number of tables 2 and 4. There should be a line between

Supplier-Number and Supplier-Number of tables 3 and 4. Table 2 is

labeled the fact table. Tables 1, 3, 4 are each labeled dimension

tables. The primary key is both Order-Number and Part-Number in

Table 2 only. The foreign keys are Order-Number in table 1,

Part-Number in table 4. Supplier-Number in table 3 can only be

searched from Table 4, so it is a foreign key to table 4, but

not to the fact table.
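The key relationships described above can be sketched as table definitions. This is only an illustration using Python's built-in sqlite3 module as a stand-in database; the table names, the non-key columns, and the sample rows are all hypothetical, and only the Order-Number, Part-Number and Supplier-Number links come from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Table 1: dimension table for orders
    CREATE TABLE orders (order_number INTEGER PRIMARY KEY, order_date TEXT);
    -- Table 3: dimension table for suppliers, reachable only through parts
    CREATE TABLE suppliers (supplier_number INTEGER PRIMARY KEY, supplier_name TEXT);
    -- Table 4: dimension table for parts, with a foreign key to suppliers
    CREATE TABLE parts (
        part_number INTEGER PRIMARY KEY,
        part_name TEXT,
        supplier_number INTEGER REFERENCES suppliers(supplier_number));
    -- Table 2: fact table with the composite primary key
    CREATE TABLE order_lines (
        order_number INTEGER REFERENCES orders(order_number),
        part_number INTEGER REFERENCES parts(part_number),
        quantity INTEGER,
        PRIMARY KEY (order_number, part_number));

    INSERT INTO orders VALUES (1, '2000-11-01');
    INSERT INTO suppliers VALUES (10, 'Acme'), (20, 'Globex');
    INSERT INTO parts VALUES (100, 'bolt', 10), (200, 'nut', 20);
    INSERT INTO order_lines VALUES (1, 100, 5), (1, 200, 3);
""")

# Reaching the supplier from the fact table takes two joins (fact -> parts -> suppliers).
rows = conn.execute("""
    SELECT s.supplier_name, SUM(ol.quantity)
    FROM order_lines AS ol
    JOIN parts AS p ON p.part_number = ol.part_number
    JOIN suppliers AS s ON s.supplier_number = p.supplier_number
    GROUP BY s.supplier_name
    ORDER BY s.supplier_name
""").fetchall()
print(rows)
```

The fact table carries no supplier column, which keeps it narrow, but any query by supplier has to go through the parts table.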

What is the advantage of the snowflake schema? Smaller fact table.

What is the disadvantage of the snowflake schema? poor performance on



The on-trend in-stock strategies are for optimizing selection and

availability while minimizing absolute inventory levels. They

are also called "quick response merchandise management" and are

intended to improve the efficiency of the retail demand chain.

The participants in the chain try to improve management of

obtaining materials (procurement), inventory, and distribution.

In the Retail supply chain:

Customer Asset Management - draws data and sales from the customer.

It can be at the POS terminal, also with any contact with the

customer (when they view a web page, or call). They use the information

to manage customer needs proactively.

Integrated logistics - is managing the flow of physical goods from

suppliers to the customer. This is management of production planning, procurement,

and inventory.

Agile manufacturing - is managing the manufacturing process to ensure

low production costs. It allows finished goods to be made to order using

real-time sales and configuration information collected from the customer.

In the Microsoft case they had a problem that their distribution facility

was in Seattle and most of their customers on the east coast of the US.

So it took a week to get goods to customers. They did several things:

1. they outsourced consumer product production to a turnkey software

producer that was a reliable partner-supplier, good at acquiring resources.

Products supplied to distribution center went from 6 weeks to 7 days.

This was part of improving agile manufacturing and integrating the Retail

supply chain.

2. they outsourced the distribution plant and moved it from Seattle to
Indianapolis (middle of the country) to cut delivery from distribution center
to customers from 7 days to 2 days. This was part of customer asset management.

3. they created a returns and overruns center in
Toledo (also in the central US). This was also an outsourcing of a warehouse
function, which allowed Microsoft to contract and expand the size of
this function as needed. It was also part of better management of the Retail
supply chain and better customer service.

4. Microsoft installed a demand forecasting system that took in sales data
by SKU (stock keeping unit) and compared it with
inventory levels. This was an integrated logistics system. This they
did themselves and it was a key part of integrating the Retail supply chain.

The result was short production lead times from product to customer, and they could
leave their production schedule open until one week before a product was made.
So they are better able to make only what is demanded.


A Continuous Process Manufacturer produces products where there is no
interruption in the production process, as in oil production. Process
manufacturers have a big investment in their manufacturing plant and
equipment, they use diverse packaging to sell the product (think
of telecom bandwidth services), and they have a hard time attributing
costs and profits to a specific product line.

For EDI notes see web notes from class 12 and 13. I expect students to

use their reasoning and some of the book and note facts to answer this one.

They can go across notes to identify: goals, participants, systems to

match their example. They are the same as those in Retail Supply Chain


The book also states,

1. Major issues in EDI document exchange - order processing, demand forecasting,
customer information sharing, sales recording system, stock recording system,
electronic message formats.
Also mention the documents exchanged - quotations, purchase orders, change orders,
bills, receiving advice, invoices.

2. Technology issues - software used; translation software to EDI formats was
expensive to write. Now packaged solutions exist. Also, applications can be
based on a common exchange format - XML/EDI.

3. Important technology alternatives include - intranets, bar coding,
XML application support, security for intranets and internet retail sites
(technology limitations below).

4. Security - see class notes: can use SET/SSL, firewalls, encryption, certificates.

5. Expected improvements to participants - reduce paperwork, improve quality, reduce

inventory, better information available for decision making, audit


6. Limitations with EDI, why its adoption lags - high cost of software development,
limited access (partners had to be VAN members before; now internet technology
is used), rigid rules to set up partnerships (protocols; now XML will make it
easier), partial solutions (purchase orders, but not electronic funds
transfer; solutions must be integrated using intranets).