Although the Band Does Come from Seattle, They Do Not Sound Like Nirvana

1. Why is it sensible not to use the highest frequency words in a corpus as indices? What about the lowest frequency?

2. Chris was creating a music recommendation system. It represents reviews of bands as tf-idf vectors and then uses cosine similarity to find bands similar to ones that the user likes. The system ran into a problem: it was recommending both Band-A and Band-B to the same person, although Chris only wanted it to recommend Band-A. Here are the descriptions of the two bands:

Band-A: Although the band does come from Seattle, they do not sound like Nirvana.
Band-B: Although the band does not come from Seattle, they do sound like Nirvana.

Why did the system behave this way? Describe any general recommendation technique that could recommend Band-A but not Band-B.
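To experiment with the setup described in this question, the following is a minimal sketch, not Chris's actual system. It assumes a simple tokenizer (lowercase, letters only) and uses raw term counts as a stand-in for tf-idf weights; the names `term_counts` and `cosine` are hypothetical helpers introduced for illustration.

```python
import math
import re
from collections import Counter

BAND_A = "Although the band does come from Seattle, they do not sound like Nirvana."
BAND_B = "Although the band does not come from Seattle, they do sound like Nirvana."


def term_counts(text):
    """Bag-of-words vector: lowercase the text and count each alphabetic token.

    Assumption: raw counts stand in for tf-idf weights to keep the sketch short.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[term] * v[term] for term in u if term in v)
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vec_a = term_counts(BAND_A)
vec_b = term_counts(BAND_B)
similarity = cosine(vec_a, vec_b)

print(sorted(vec_a.items()))
print(sorted(vec_b.items()))
print(similarity)
```

Comparing the two printed vectors, and then the similarity score, is a useful starting point for answering the first part of the question.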

3. Why does it matter how the term weights of indices in a document are normalized? Would normalization be more important in a system that retrieves encyclopedia articles or web pages?

4. How is relevance feedback used to form a subsequent query? Describe one method.

5. Describe how an inverted index is created and how it is used in retrieval. In what order are elements stored for efficiency of retrieval?

6. Why do more IR systems use cosine similarity instead of Euclidean distance as a measure of similarity between documents?

7. Agree or disagree: As computer technologies make it possible to retrieve the full text of documents, the information retrieval problem will reduce to the standard database problem. Defend your answer.

8. Agree or disagree: Global WWW retrieval services (e.g., DEC's Alta Vista service) have effectively solved the Information Retrieval problem. Defend your answer.

9. Pick a sports team or hobby that interests you. Find an authority for that topic. Find a hub. Argue that the site is a hub or an authority based on the number of links to or from that site. Hint: in Google the query “link: will find web sites that link to

10. For the web page that you identified as an authority, copy the first title and first 25 words here. (If there are fewer than 25 words, pick another web page.)

Which three words have the highest weights as indices? Describe the formula you are using for term weighting and the meaning of each variable. You will probably need to make some assumptions about the total number of documents on the web. Clearly state any assumptions. You will also need to estimate the frequency of the words. If you search for a word in Google, it will tell you the number of documents it appears in. (For example, “pig” occurs in 1,770,000 documents while “dog” occurs in 5,220,000.)
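As an illustration of the arithmetic involved, here is a small sketch using one common tf-idf variant, weight = tf × log(N / df). The variant itself, the constant `ASSUMED_TOTAL_DOCS`, and the function names are assumptions made for this example; only the two document frequencies come from the text above. Your own answer may use a different formula or a different estimate of N.

```python
import math

# Assumption for illustration only: a guess at the number of documents on the web.
ASSUMED_TOTAL_DOCS = 3_000_000_000

# Document frequencies quoted in the question.
DOC_FREQ = {"pig": 1_770_000, "dog": 5_220_000}


def idf(doc_freq, total_docs=ASSUMED_TOTAL_DOCS):
    """Inverse document frequency: log(N / df), natural log."""
    return math.log(total_docs / doc_freq)


def tf_idf(term_freq, doc_freq, total_docs=ASSUMED_TOTAL_DOCS):
    """Term weight: occurrences in the document (tf) times idf."""
    return term_freq * idf(doc_freq, total_docs)


# Weight of each word if it appears once in the 25-word excerpt.
for word, df in DOC_FREQ.items():
    print(word, round(tf_idf(1, df), 3))
```

Under this variant, a word that appears in fewer documents receives a higher weight for the same term frequency, so “pig” outweighs “dog” here regardless of the value assumed for N.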

11. At one point, search engines only indexed documents that were in English. Later, they indexed documents in all languages. If we assume that no document in a language other than English is relevant to an English-speaking person, discuss how adding documents in other languages may affect the precision and recall of English queries. Discuss a best-case scenario, a worst-case scenario, and a typical scenario. Pick a word in English that has a different meaning in another language and use it to illustrate a query (e.g., “mare”, but you cannot use “mare” as your example because I just did).

12. Would a support vector machine or a collaborative recommendation system do better at finding Haiku that you find appealing? Why? Would latent semantic analysis improve the result of either method?

13. Suppose your company has a web site. The webmaster who created the site asks if he can add a link from the company’s home page to his personal home page. Discuss how this one link will change the PageRank and “hubness” of the company’s home page (assuming no other changes are made to the structure of the Internet).

14. Why is “point alienation” a more useful measure than the “sliding ratio” in evaluating large deployed information retrieval systems?

15. Describe at least three ways that the information retrieval task differs from the information classification task.

16. Google seems to work fairly well at searching web pages by finding the most important web sites on a topic. However, its new service for searching news uses the same algorithm but doesn’t do very well at identifying the most important news stories. Why does the same algorithm work well on web pages but not on news? Do you think this algorithm would work better on classified ads or scientific papers? As one example, the search for “shuttle” in the news right after the accident found the following stories: