EECS 767 Information Retrieval

EECS 767 – Information Retrieval

Midterm Exam – Spring 2004

Name:______

You may consult your notes and/or your textbook. This is an 80 minute, in class exam. If there is information missing in any of the question formulations, state your assumptions and proceed with your answer. Calculators are not allowed.

The exam is out of a total 100 points.

1) (25 points) Inverted Files

(i) (10 points) Consider the following file format for an inverted file:

Dictionary file: Word (20 chars) FirstPosting (int) NumDocs (int)

Posting file: DocID (int) Weight (float)

Assume that ints and floats both require 4 bytes, that there are 100,000 documents in the collection, 250,000 unique words, and that the average word appears in 100 documents.

How big would the following files be:

Dictionary File:

Postings File:

(ii) (7 points) Describe one technique to enhance indexing to support efficient phrase processing at retrieval time. What would the new file formats be?

Description (2 points):

Files and record formats (5 points):

(iii) (8 points) For each file in (b), calculate how big the new files would be.

Additional Assumptions (if necessary):

File Sizes:

2) (25 points) Pruning the Postings

For a small document collection, all postings for a given word are usually stored. However, it is not scalable for large document collections.

What are the two primary problems created by storing all postings for large document collections? Describe a modified postings file structure that would solve the identified problems. What changes, if any, to the sort-based index algorithm would need to be made to create this new type of file? What changes, if any, to retrieve would be needed?.

(i) (4 points) Two Problems:

Problem 1:

Problem 2:

(ii) (10 points) File Structure that Solves both Problems:

(5 points) Description:

(5 points) Small Example (before and after):

(iii) (5 points) Changes required to Sort-Based Indexing:

(iv) (3 points) Changes (if any) required to Retrieve:

(v) (3 points) New Problem Created by using the above File Structure:

3. (18 points) The following multiple choice questions are worth 3 points each. Circle the most accurate answer.

(i) A TRIE is a good data structure to store the dictionary because:

(a) it doesn’t use much memory

(b) it is balanced

(d) it is faster than a hashtable

(ii) When processing only the top postings for each query term, which type of queries require the most postings to be used:

(a) Single word queries

(b) All Boolean queries

(d) They all need the same number of postings used

(iii) In the vector space model, the dimensionality of the query vector is:

(a) N – the number of documents in the collection

(b) m – the number of unique words in the vocabulary

(d) q – the number of unique words in the query

(iv) Stopwords are primarily removed to

(a) Decrease the size of the lexicon (dict file).

(b) Decrease the size of the postings file.

(d) Increase retrieval speed.

(v) A heap is a good choice for the accumulator during retrieve when

(a) we want to store information about a large number of results.

(b) we want to present results in sorted order to the user.

(d) we want to increase retrieve efficiency.

(vi) Incremental indexing requires that

(a) we can grow the dict file dynamically

(b) we can grow the post file dynamically

(d) all of the above

4. (18 points) The following fill in the blank questions are worth 3 points.

(i) One way we can reduce the size of the nodes in a TRIE is to

______.

(ii) One way we can reduce the number of nodes in a TRIE is to

______.

(iii) The vector space model calculates the similarity between documents and queries using the ______.

(iv) The two methods by which data is transferred from a <FORM> to a Web server are

______.

(v) If there are 10 relevant documents in a collection and the user receives 20 results of which 5 are relevant, they have achieved a recall of ______percent and a precision of ______percent.

5. (14 points) The following True/False questions are worth 2 points each.

(i) T F Cgi processes are always owned by their author when they run.

(ii) T F Storing only the highest weighted N postings for each word has no effect on retrieval accuracy.

(iii) T F The vector space model assumes that the dimensions are independent.

(iv) T F A TRIE can be used to support queries with wildcards anywhere in a query term.

(v) T F Stemming is used to increase recall.

(vi) T F If all documents are approximately the same length, we do not need to normalize.

(vii) T F Boolean search engines are easy for end users to understand and use.