IN350 Exam Key – Fall 2000 – this was an open book exam.
(This is only a general guideline to the answers; full-credit answers would have
been more precise. Alternative answers with good explanations were also acceptable.)
Q1.
Zipf's law - the frequency of occurrence of some event is a function of its
rank, where the frequency distribution is a power-law function whose exponent
is close to unity (that is, 1). In other words, the frequency of occurrence of
words in a document follows this function: frequency falls off as a power of
the word's rank (roughly, frequency is proportional to 1/rank). To designers
this means a few hundred words make up about 50% of the text, so stopword
lists can be used to make indexes smaller. To users, the payoff is faster
response.
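For reference, a small Python sketch (not from the book or the exam) of how one
could check this: rank the words of a document by frequency and compare each
observed frequency with the Zipf prediction f(1)/rank. The file name sample.txt
is only a placeholder.

# Sketch: compare observed word frequencies with the Zipf prediction f(rank) = f(1) / rank.
from collections import Counter

def zipf_table(text, top=8):
    counts = Counter(text.lower().split()).most_common(top)
    f1 = counts[0][1]                     # frequency of the rank-1 word
    for rank, (word, freq) in enumerate(counts, start=1):
        predicted = f1 / rank             # Zipf's law with exponent close to 1
        print(f"{rank:>3} {word:<12} observed={freq:<5} predicted~{predicted:.1f}")

if __name__ == "__main__":
    # Any reasonably long document works; the path is just a placeholder.
    with open("sample.txt") as f:
        zipf_table(f.read())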
Heaps' law - the number of distinct words (the vocabulary) grows sublinearly
with increasing size of the text, roughly as a power law V = K n^b with b less
than 1. (It looks like the Zipf distribution transposed.) To designers this
means most distinct words are already found in the first volume of an
encyclopedia, for example, so the index will not grow very fast after a certain
base sample. Heaps' law is also related to word length: there are more unique
short words than long ones, and the average word length over all words in the
text stays roughly constant because short words occur in the text much more
often than long ones.
To users, the law is significant because it shows there is a bound
on how much compression you can get (affecting response times).
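Again for reference, a minimal sketch of the Heaps-law estimate; the constants
K = 20 and beta = 0.5 are made-up illustrative values (in practice they are
fitted to the collection), so only the shape of the growth matters here.

# Heaps' law sketch: vocabulary size V(n) = K * n**beta for a text of n words.
# K and beta are assumed, illustrative constants, not values from the book.
def heaps_vocabulary(n_words, K=20, beta=0.5):
    return K * n_words ** beta

if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000, 10_000_000):
        print(f"{n:>12,} words of text -> about {heaps_vocabulary(n):>10,.0f} distinct words")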
Q2.
This is an open book/notes exam, and there is a chart and discussion in the
book that describes the main compression techniques. The alternatives are
a. Arithmetic, b. Character Huffman, c. Word Huffman, d. Ziv-Lempel.
The best answer is c. Word Huffman, but students can argue otherwise if they
like, and we accept valid, well-reasoned answers. The chart in the book is as
follows.
                              a. Arithmetic   b. Char Huffman   c. Word Huffman   d. Ziv-Lempel
Compression ratio             very good       poor              very good         good
Compression speed             slow            fast              fast              very fast
Decompression speed           slow            fast              very fast         very fast
Memory space                  low             low               high              medium
Compressed pattern matching   no              yes               yes               yes
Random access                 no              yes               yes               no
For effective operation in an IR environment a compression method
should satisfy the following requirements: good compression ratio,
fast coding, fast decoding, fast random access without the need
to decode from the beginning, direct searching without the need
to decompress the compressed text. (They should discuss how their
choice deals with these.)
Compression ratio: about 2 bits per character (roughly 25% of the original
size) is very good, a 30-45% compression ratio is good, and over 45% is poor.
Compression speed: arithmetic coding is complex and therefore slow; Huffman
needs two passes, so it is not as dynamic as Ziv-Lempel, but it is not far
behind.
Decompression speed: word Huffman and Ziv-Lempel are equally fast, and word
Huffman is more efficient than character Huffman; arithmetic is again slow
because of its complexity.
Memory space: depends on the size of the vocabulary and of the tables
containing strings. Word-based methods take more space than character-based
methods; Ziv-Lempel saves space on repeated strings; arithmetic coding does
not use words.
Compressed pattern matching and random access: see the table. Huffman coding
allows decompression to start anywhere in the text; arithmetic and Ziv-Lempel
must start from the beginning. Word Huffman also has methods for searching
directly on the compressed text.
Some research algorithms for Ziv-Lempel allow searching the compressed text;
this is about twice as fast as decompressing and then searching, but slower
than searching the uncompressed text.
Huffman codes are the method of choice in full-text retrieval where both speed
and random access are important.
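For reference, a minimal Python sketch of the word Huffman idea (this is not
the book's byte-oriented scheme, just the core tree construction): whole words
are the symbols, so the most frequent words get the shortest codes.

# Word-based Huffman sketch: build prefix codes over words rather than characters.
import heapq
from collections import Counter

def word_huffman_codes(text):
    freqs = Counter(text.split())
    # Heap entries: (frequency, tie-breaker, {word: code-so-far}); the tie-breaker
    # keeps tuple comparison away from the dicts when frequencies are equal.
    heap = [(f, i, {w: ""}) for i, (w, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate one-word vocabulary
        return {w: "0" for w in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)      # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {w: "0" + code for w, code in c1.items()}
        merged.update({w: "1" + code for w, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

if __name__ == "__main__":
    sample = "to be or not to be that is the question so to be it is"
    for word, code in sorted(word_huffman_codes(sample).items(), key=lambda kv: (len(kv[1]), kv[0])):
        print(f"{word:<10} {code}")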
Q3a.
Write an index. The one below works.
CREATE INDEX standards_idx1
ON standards (abstract etx_doc_ops) USING etx(
WORD_SUPPORT='PATTERN',
PHRASE_SUPPORT='MEDIUM',
STOPWORD_LIST= 'my_stopwords',
INCLUDE_STOPWORDS='TRUE'
)
IN sblobspace;
Note there are 8 lines. Line 3 is optional; it supports pattern searching,
which is not used in the example select statement.
Line 4 is required for our example; here the phrase support can equal MEDIUM
or MAXIMUM.
Lines 5 and 6 are optional, since the example does not use stopwords.
Q3b. Explain the output.
The search engine returns hits for documents that contain the phrase, with all
the words in the clue and in the exact order. No misspellings or transpositions
of words are allowed, and there is no pattern matching in the select statement.
Students do not have to draw the output table, but it does return the fields
group_no, s_name, and abstract. The abstract field contains a file name
identified by its full local file system path; it is a pointer to the file
that contains the phrase, not the actual document.
Q3c. Explain why the SQL statement is not supported in AppPages.
(I said answer in a sentence or two, so they only need the spirit of this
response.)
The abstract column does not actually contain the search text
but a pointer of data type LLD_Locator to the operating system
file specified by the Insert statement. The etx index contains
the search text stripped of all formatting information. The AppPage
Web DataBlade module does not access the etx index; perhaps the
index is internal to the local file system and is not made
available to the Web DataBlade module. (Using an analogy,
my files on the hsm file system are not available on the web;
only the files under my html directory are.)
Q4.
D = 2.1 log(8x10^8) = 2.1(8.903) = 18.7, or about 19
D = 2.1 log(8x10^10) = 2.1(10.903) = 22.9, or about 23
D is the expected number of clicks between any 2 documents on the internet
when the number of documents is N (the log is base 10).
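For reference, the same arithmetic in a few lines of Python (the constant 2.1
and the base-10 logarithm are taken from the formula as used above):

# Quick check of the estimate D = 2.1 * log10(N) for N documents.
import math

for n_documents in (8e8, 8e10):
    d = 2.1 * math.log10(n_documents)
    print(f"N = {n_documents:.0e}: D = {d:.1f}, about {round(d)} clicks")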
To narrow a hit list there are different suggestions for handling
queries and steps that can be taken.
Starting points by query type:
specific query - look in an encyclopedia, use a library.
broad query - start with web directories.
vague query - use web search engines, refine the query on the relevant
answers.
Steps: 1. teach the user to make better queries, teach them about Boolean logic,
2. use web directories to start on a subject,
3. use ranking and refine searches to narrow the search,
4. use metasearch engines to compare results from search engines.
Q5. This question can be answered by filling in tables A and B.
For table A, the interpolated precision is exactly the same as the precision
column: 100, 67, 60, 57, 36. The 3-point average (interpolated precision at
20%, 50%, and 80% recall) is (100+60+57)/3 = 217/3 = 72.33.
The R-precision is 3/5 = .6. (That is, there are 5 relevant documents; by the
time you get to the 5th rank position you have encountered 3 relevant
documents, thus 3/5.)
For table B, the interpolated precision is 100, 50, 50, 50, 50. The 3-point
average is (100+50+50)/3 = 200/3 = 66.67 (a little worse than A).
The R-precision is 1/5 = .2 (much worse than A).
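For reference, a Python sketch of how these measures are computed. The ranking
below is an assumption that happens to reproduce table A's precision column
(relevant documents assumed at ranks 1, 3, 5, 7, and 14); the actual exam
table may list different ranks.

# Sketch of the Q5 measures over an assumed ranking for table A.
def precision_at_relevant(relevant_ranks):
    # Precision measured each time another relevant document is retrieved.
    return [i / rank for i, rank in enumerate(relevant_ranks, start=1)]

def interpolated(precisions):
    # Interpolated precision at a recall level = max precision at that level or beyond.
    return [max(precisions[i:]) for i in range(len(precisions))]

def three_point_average(interp, recall_levels=(20, 40, 60, 80, 100)):
    # Average of interpolated precision at 20%, 50%, and 80% recall.
    def at(level):
        return max(p for r, p in zip(recall_levels, interp) if r >= level)
    return (at(20) + at(50) + at(80)) / 3

def r_precision(relevant_ranks, num_relevant):
    # Precision after the first R retrieved documents, R = number of relevant documents.
    return sum(1 for r in relevant_ranks if r <= num_relevant) / num_relevant

table_a_ranks = [1, 3, 5, 7, 14]                   # assumed, not from the exam sheet
p = precision_at_relevant(table_a_ranks)
print([round(100 * x) for x in p])                 # [100, 67, 60, 57, 36]
print(round(100 * three_point_average(interpolated(p)), 2))  # 72.38 (72.33 in the key, which rounds to whole percents first)
print(r_precision(table_a_ranks, 5))               # 0.6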
Q6. Students need to draw the tables; I cannot draw them here.
There should be a line between Order-Number in table 1 and Order-Number in
table 2, a line between Part-Number in table 2 and Part-Number in table 4, and
a line between Supplier-Number in table 3 and Supplier-Number in table 4.
Table 2 is labeled the fact table; tables 1, 3, and 4 are each labeled
dimension tables. The primary key of table 2 (and only table 2) is the
combination of Order-Number and Part-Number. In the fact table, Order-Number
is a foreign key to table 1 and Part-Number is a foreign key to table 4.
Supplier-Number can only be searched from table 4, so it is a foreign key in
table 4 (referencing table 3) and does not appear in the fact table.
What is the advantage of the snowflake schema? A smaller fact table.
What is the disadvantage of the snowflake schema? Poorer performance when
browsing, because reaching the supplier data requires an extra join through
the Part table.
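For reference, a small Python sketch with made-up rows showing the join path
the snowflake schema forces: supplier information is reached from the fact
table only through the Part dimension, which is why browsing is slower than in
a pure star schema.

# Made-up rows illustrating the snowflake join path fact -> Part -> Supplier.
fact_order_part = [                       # Table 2 (fact): PK is (order_number, part_number)
    {"order_number": 1, "part_number": 10, "quantity": 5},
    {"order_number": 2, "part_number": 11, "quantity": 2},
]
part = {                                  # Table 4 (dimension): carries the supplier foreign key
    10: {"part_name": "widget", "supplier_number": 100},
    11: {"part_name": "gadget", "supplier_number": 101},
}
supplier = {                              # Table 3 (dimension): reachable only through Part
    100: {"supplier_name": "Acme"},
    101: {"supplier_name": "Globex"},
}

# "Which supplier filled each order line?" needs two joins, not one.
for row in fact_order_part:
    p = part[row["part_number"]]
    s = supplier[p["supplier_number"]]
    print(row["order_number"], p["part_name"], s["supplier_name"])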
Q7.
The on-trend in-stock strategies are for optimizing selection and
availability while minimizing absolute inventory levels. They
are also called "quick response merchandise management" and are
intended to improve the efficiency of the retail demand chain.
The participants in the chain try to improve management of
obtaining materials (procurement), inventory, and distribution.
In the Retail supply chain:
Customer Asset Management - draws sales and other data from the customer. This
can be at the POS terminal, or through any other contact with the customer
(when they view a web page, or call). The information is used to manage
customer needs proactively.
Integrated logistics - managing the flow of physical goods from suppliers to
the customer. This is management of production planning, procurement, and
inventory.
Agile manufacturing - managing the manufacturing process to ensure low
production costs. It allows finished goods to be made to order using real-time
sales and configuration information collected from the customer.
In the Microsoft case, the problem was that their distribution facility was in
Seattle while most of their customers were on the east coast of the US, so it
took a week to get goods to customers. They did several things:
1. They outsourced consumer product production to a turnkey software producer
that was a reliable partner-supplier, good at acquiring resources. The time to
supply products to the distribution center went from 6 weeks to 7 days. This
was part of improving agile manufacturing and integrating the retail supply
chain.
2. They outsourced the distribution plant and moved it from Seattle to
Indianapolis (the middle of the country) to cut delivery from the distribution
center to customers from 7 days to 2 days. This was part of customer asset
management.
3. They created a returns and overruns center in Toledo (also central US).
This was also an outsourcing of a warehouse function, which allowed Microsoft
to contract and expand the size of this function as needed. It was also part
of better management of the retail supply chain and better customer service.
4. Microsoft installed a demand forecasting system that took in SKU (stock
keeping unit) sales data and compared it with inventory levels. This was an
integrated logistics system. This they did themselves, and it was a key part
of integrating the retail supply chain.
The result was short production lead times to the customer, AND they could
leave their production schedule open until one week before a product was made.
So they are better able to make only what is demanded.
Q8.
A continuous process manufacturer produces products where there is no
interruption in the production process, as in oil production. Process
manufacturers have a big investment in their manufacturing plant and
equipment, they use diverse packaging to sell the product (think of telecom
bandwidth services), and they have a hard time attributing costs and profits
to a specific product line.
For EDI notes, see the web notes from classes 12 and 13. I expect students to
use their reasoning and some of the book and note facts to answer this one.
They can go across the notes to identify goals, participants, and systems to
match their example; these are the same as those in retail supply chain
management.
The book also states:
1. Major issues in EDI document exchange - order processing, demand
forecasting, customer information sharing, sales recording systems, stock
recording systems, electronic message formats.
Also mention the documents exchanged - quotations, purchase orders, change
orders, bills, receiving advice, invoices.
2. Technology issues: the software used, translation software to EDI formats,
was expensive to write; now packaged solutions exist. Also, applications can
be based on a common exchange format - XML/EDI.
3. Important technology alternatives include intranets, bar coding, XML
application support, and security for intranets and internet retail sites
(technology limitations below).
4. Security - see class notes: can use SET/SSL, firewalls, encryption, certificates.
5. Expected improvements for participants - reduced paperwork, improved
quality, reduced inventory, better information available for decision making,
audit trails.
6. Limitations of EDI, why its adoption lags - high cost of software
development; limited access (previously partners had to be VAN members, now
internet technology is used); rigid rules to set up partnerships (protocols;
XML will now make this easier); partial solutions (purchase orders, but not
electronic funds transfer; solutions must be integrated using intranets).