Class 9 - Search and Retrieval

Exercise overview

In classes 1-7 we explored information systems with a focus on the role of structured and semi-structured data (e.g. web pages, metadata schemas and relational databases) in supporting the information lifecycle. We identified different types of metadata that support aspects of the lifecycle, including descriptive (e.g. the discovery phase), administrative (e.g. the acquisition and appraisal phases), technical (e.g. the preservation phase) and structural (e.g. the ingest/management phases). We found that these types of metadata are essential in enabling long-term management and preservation of resources.

While we learned about the value of metadata, we also found that manual metadata creation techniques are not always ideal and may not scale as needed in certain situations. In addition to this problem of scale, there is a widespread school of research and practice that asserts that descriptive metadata is poorly suited for certain types of discovery, also called "Information Retrieval" or IR. In this class we explore alternative approaches to IR from both the systems-processing perspective and the user-engagement perspective. In doing so we will build on our understanding of information seeking and use models by thinking about new types of discovery.

Suggested readings

  1. Mitchell, E. (2015). Chapter 7 in Metadata Standards and Web Services in Libraries, Archives, and Museums. Libraries Unlimited. Santa Barbara, CA.
  2. Watch: How Search Works
  3. Watch: The Evolution of Search
  4. Kernighan, B. (2011). D is for Digital. Chapter 4: Algorithms.
  5. Read: How Search Works.
  6. Pirolli, P. (2009). Powers of Ten.

Optional Readings

  1. Read/Skim: Michael Lesk, The Seven Ages of Information Retrieval, Conference for the 50th Anniversary of As We May Think, 1995.
  2. Baeza-Yates, Ricardo and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison Wesley Longman, 1999, Chapter 1
  3. Watch: Kenning Arlitsch's talk about SEO in libraries (last talk in the first block of speakers)
  4. Explore:
  5. Read/skim:

What is Information Retrieval?

This week we are watching a short video on Google Search, browsing a companion Google Search website, learning about algorithms and finding out more about alternatives to metadata-based search. Let's start by watching the "How Search Works" and "The Evolution of Search" videos and browsing the "How Search Works" website.

Question 1. Using these resources as well as your own Google searches, define the following terms:

  1. Computer Index
  2. PageRank
  3. Algorithm
  4. Free-text Search
  5. Universal Search
  6. Real-time or Instant Search

While indexes exist in database design, and are very important to database performance, we did not discuss them in depth in Class 7. Indexes come in multiple forms and serve multiple goals, but in short they all exist to facilitate access to a large dataset. Indexes accomplish this by taking a slice of data and re-sorting it (e.g. indexing all of the occurrences of a word in a document, sorting words in alphabetic order). Indexes and approaches to indexing are one of the building blocks of IR. On top of indexes, IR systems use search algorithms to find, process, and display information in unique ways to the searcher.
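
To make the idea concrete, here is a minimal sketch of an inverted index in Python (the documents and variable names are invented for illustration; real IR systems add tokenization, stemming, positional data and much more):

    # A minimal inverted index sketch: map each word to the set of
    # documents (by id) in which it appears. Documents are invented examples.
    from collections import defaultdict

    documents = {
        1: "the quick brown fox jumped over the fence",
        2: "the lazy dog slept by the fence",
        3: "a quick dog chased the fox",
    }

    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)

    # Looking up a word is now a fast dictionary access rather than a
    # scan of every document.
    print(sorted(index["fox"]))    # -> [1, 3]
    print(sorted(index["fence"]))  # -> [1, 2]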

IR is a broad field that encompasses the entire process of document processing, index creation, algorithm application and document presentation, situated within the context of a search. The following figure shows a sample model of IR and the relationships between resources, the search process and the document presentation process. This process is broken into three broad areas: Predict, Nominate and Choose.

Predict

In the Predict cycle, documents are processed and indexes are created that help an IR system make informed guesses about which documents are related to one another. This can be accomplished via techniques like PageRank, term frequency/inverse document frequency (TF-IDF) or n-gram indexing (more on these methods later), or by other means. The prediction process is largely a back-office, pre-processing activity (e.g. systems complete this work in anticipation of a search).
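
As a rough illustration of this kind of pre-processing, the sketch below computes simple TF-IDF weights over a handful of invented toy documents; it shows the basic idea rather than how any production search engine implements it:

    # Toy TF-IDF sketch: weight terms that are frequent within one document
    # but rare across the whole collection. Documents are invented examples.
    import math

    docs = {
        "d1": "tom sawyer and huckleberry finn",
        "d2": "tom sawyer abroad",
        "d3": "life on the mississippi",
    }
    tokenized = {name: text.split() for name, text in docs.items()}
    total_docs = len(tokenized)

    def tf_idf(term, doc_name):
        words = tokenized[doc_name]
        tf = words.count(term) / len(words)                    # term frequency in this document
        df = sum(1 for w in tokenized.values() if term in w)   # documents containing the term
        idf = math.log(total_docs / df) if df else 0.0         # inverse document frequency
        return tf * idf

    print(round(tf_idf("sawyer", "d1"), 3))       # appears in two documents -> lower weight
    print(round(tf_idf("mississippi", "d3"), 3))  # appears in one document -> higher weight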

Nominate

The Nominate process is comparable to the search process that we have explored in previous classes. Using a combination of words, images, vocabularies or other search inputs, the system applies algorithms to determine the best match for a user's query. This process likely involves relevance ranking, in which the system predicts the most relevant documents and pushes them towards the top of the results. Google implements ranking using a number of factors, including personal/social identity (e.g. Google will show results related to you or your friends first), resource ranking with PageRank (e.g. the main website of UMD gets listed first because it is a central hub of links) and timeliness (e.g. using real-time search, Google prioritizes news and other current results).
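
To give a flavor of how a link-based signal like PageRank can be computed, here is a minimal power-iteration sketch over an invented four-page link graph; real ranking systems combine many more signals and refinements:

    # A minimal PageRank-style power iteration over an invented link graph;
    # pages that receive links from many (important) pages score higher.
    links = {              # page -> pages it links out to
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }

    damping = 0.85
    pages = list(links)
    rank = {p: 1 / len(pages) for p in pages}

    for _ in range(50):    # iterate until the scores settle
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank

    # "C" comes out on top because the most pages link to it.
    print(sorted(rank.items(), key=lambda item: -item[1]))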

Relevance ranking has evolved quickly over the last twenty years and is increasingly the preferred way to order results. At the same time, relevance ranking is not always the best sorting approach.

Question 2. Can you think of some areas in which relevance ranking is not the ideal approach to sorting search results?

Choose

In the resource selection process, users are highly engaged with the system: scanning results, engaging in sensemaking and other information seeking behaviors to evaluate how well a resource fits their information need, and ultimately selecting documents for use. Here too, the information product delivered may vary even when the same source documents are used. For example, if we are seeking information about books we want to be presented with a list of texts, while if we are seeking information about an idea and its presence across multiple texts we may want to see a concordance that shows our search terms in context (also known as Keyword in Context or KWIC).
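
As a small illustration of the keyword-in-context idea, the sketch below prints each occurrence of a search term with a few words of surrounding context (the kwic helper and sample text are invented for this example):

    # A tiny keyword-in-context (KWIC) sketch: print each hit for a term
    # with a few words of context on either side.
    def kwic(text, term, window=3):
        words = text.split()
        lines = []
        for i, word in enumerate(words):
            if word.lower().strip(".,") == term.lower():
                left = " ".join(words[max(0, i - window):i])
                right = " ".join(words[i + 1:i + 1 + window])
                lines.append(f"{left} [{word}] {right}")
        return lines

    sample = ("Tom Sawyer and Huckleberry Finn met on the river. "
              "The river carried them south.")
    for line in kwic(sample, "river"):
        print(line)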

Structured data vs. full-text or digital object Information Retrieval

In the structured data world, search and retrieval is based on the idea that our data is highly structured, predictable and conforms to well-understood boundaries. For example, if we are supporting search of books and other library materials using traditional library metadata (e.g. MARC), we can expect that our subject headings will conform to LCSH and we can rely on other metadata fields (e.g. title, author, publication date) to support specific search functions. If, in contrast, we decided to find books based just on the full content of the text with no supporting metadata, we would not be able to make such assumptions.

Let's build our understanding of different approaches to indexing and IR by exploring several discovery systems, three of them Google-based. Our overarching question is: "Which of Mark Twain's books proved to be most popular over time? How can we measure this popularity? Have these rankings changed over time? Which book is most popular today?" For feasibility we will limit our search to the following books: The Innocents Abroad, The Adventures of Huckleberry Finn, The Adventures of Tom Sawyer, A Connecticut Yankee in King Arthur's Court, Roughing It, Letters from the Earth, The Prince and the Pauper and Life on the Mississippi.

In order to answer these questions we are going to explore several information systems. As you explore each system, look for answers to these questions and think about what type of index was required to facilitate the search and whether the index is based on metadata or "free text." You should also take note of which index or search engine is best suited to answering each question.

Types of IR systems

Google Search

The regular Google search interface indexes the web. There is a lot to say about this resource, but I expect we are largely familiar with it. Try a few searches in Google related to each question. A pro tip for Google: look for the "Search Tools" button at the top of the screen, just under the search box. These search tools give you access to some filtering options.

Question 3. What type of index or IR system is most prevalent in this discovery environment (e.g. metadata or full-text based)?

Question 4. What search terms or strategies proved to be most useful in this database?

Google Books

Google Books is an index created by a large-scale scanning and metadata harvesting operation initiated by Google in the early 2000s. Google Books indexes both metadata (e.g. title, author, publication date) and the full text of books. It uses a page-preview approach to show users where in a book their search terms occurred.

Question 5. What type of index or IR system is most prevalent in this discovery environment (e.g. metadata or full-text based)?

Question 6. What search terms or strategies proved to be most useful in this database?

HathiTrust

HathiTrust is a library-run cooperative organization that shares the scanned books and OCR data from the Google scanning project. The main objective of HathiTrust is to provide an archive of scanned books for libraries. One product of this archive is a searchable, faceted-index discovery system. In some cases (e.g. when a book is out of copyright) the digital full text is made available.

Question 7. What type of index or IR system is most prevalent in this discovery environment (e.g. metadata or full-text based)?

Question 8. What search terms or strategies proved to be most useful in this database?

GoodReads

GoodReads is a social book cataloging and reading platform. It aggregates bibliographic metadata and social recommendations from readers, serving as both a resource discovery tool and a community engagement space.

Question 9. What type of index or IR system is most prevalent in this discovery environment (e.g. metadata or full-text based)?

Question 10. What search terms or strategies proved to be most useful in this database?

Google Ngram Viewer

The Google Ngram Viewer is a specialized slice, or index, of the Google Books project. An n-gram is an index structure that refers to a combination of words related by their proximity to one another. The letter "n" refers to a variable that can be any whole number (e.g. 1, 2, 3). N-grams are often referred to according to the number of words that are indexed together. For example, in the sentence "The quick brown fox jumped over the fence" an index of two-word combinations (or "bi-grams") would include "The quick," "quick brown," "brown fox," "fox jumped" and so forth. Tri-grams are indexes of three words together (e.g. "The quick brown"). N-gram indexes are a new take on phrase searching as applied to full-text resources at a large scale.
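
A minimal sketch of how n-grams can be generated from text, using the example sentence above (the ngrams helper is invented for illustration):

    # Generate n-grams (adjacent word sequences) from a sentence.
    def ngrams(text, n):
        words = text.split()
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    sentence = "The quick brown fox jumped over the fence"
    print(ngrams(sentence, 2))  # bi-grams:  ['The quick', 'quick brown', ...]
    print(ngrams(sentence, 3))  # tri-grams: ['The quick brown', 'quick brown fox', ...]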

Step 1: Searching the Google Ngram Viewer can be conceptually somewhat difficult, so I recommend you follow the short tutorial below:

  1. Go to the Google Ngram Viewer.
  2. In the Search box type (without the quotes) "Adventures of Huckleberry Finn, Adventures of Tom Sawyer."
  3. Check the case-insensitive box and click the "Search lots of books" button.
  4. You will see a graph displayed (see below) that shows the relative occurrence of these n-grams across the entire corpus of books.
  5. You should notice that we searched for 4-grams (i.e. four-word phrases), but you can mix and match n-grams in a single search. You should also notice that we separate our n-grams with commas. One technical detail: the maximum phrase length you can search for is five words, so you may need to keep this in mind as you search Google.

Question 11. What type of index or IR system is most prevalent in this discovery environment (e.g. metadata or full-text based)?

Question 12. What search terms or strategies proved to be most useful in this database?

Searching and finding

Using these systems, try your hand at answering our questions. Don't be shy about looking for documentation or other sites!


Question | Type of index (e.g. free-text / metadata) | Best search and resource to answer the question | Your findings
Which of Mark Twain's books proved to be most popular over time? | | |
How are rankings of popularity different (e.g. what do they measure, what data sources do they use)? | | |
Which book is most popular today? | | |
Where can you get an electronic copy of each book? | | |


Evaluating search results: Precision vs. Recall

In deciding which systems worked best for the questions we were asking, you likely made qualitative judgments about relevance. You may have dismissed some systems based on the initial page of results, or you may have found that specialized or unique search strategies helped you identify better results. This process of evaluating relevance is often expressed as "precision vs. recall" in the IR community.

Broadly stated, precision is related to how much of what a search retrieves is actually relevant. In other words, precision helps you ask the question "How much of what was found is relevant?" An example of a high-precision search is a known-item search in an online catalog by title. In this case you know the title of the book and the index to use (e.g. the title index). The search results are highly precise - the catalog either has the book or it does not. In a well-structured and predictable search environment, high precision probably best fulfills your information need.

In contrast, recall pertains to how much of the relevant material in an index a search actually retrieves. Google web search is an example of a high-recall search; result sets often contain tens of thousands of results! Where precision asks how much of what was found is relevant, recall asks "How much of what was relevant was found?" High recall helps users find the best resource for a fuzzy information need. A good example is a search for a website or product where you remember qualities of the product (like its function, color or price) but not its name. In a fuzzy search world where we do not always know what we are looking for, high recall is most likely preferred over high precision because we want to expand the number of records we look at.

Precision and recall can be thought of in terms of two intersecting sets of documents: the set of all relevant documents in an index and the set of documents retrieved by a search. The intersection of these two sets is the set of relevant documents that were actually retrieved. Precision and recall are each expressed as a ratio with a minimum value of 0 and a maximum value of 1, which means we can also think about them as percentages (e.g. 100%).

Precision and recall can be expressed mathematically as:

  1. Precision = # of relevant records retrieved / (# relevant retrieved + # irrelevant records retrieved)
  2. Recall = # of relevant records retrieved / (# relevant retrieved + # relevant not retrieved)
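
The same two formulas as a short Python sketch; the counts in the example are placeholders rather than the numbers used in the questions below:

    # Precision and recall from raw counts (the values below are placeholders).
    def precision(relevant_retrieved, irrelevant_retrieved):
        return relevant_retrieved / (relevant_retrieved + irrelevant_retrieved)

    def recall(relevant_retrieved, relevant_not_retrieved):
        return relevant_retrieved / (relevant_retrieved + relevant_not_retrieved)

    # Example: 8 relevant and 12 irrelevant documents retrieved,
    # 2 relevant documents missed.
    print(precision(8, 12))  # 0.4
    print(recall(8, 2))      # 0.8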

Ideally, information systems offer high recall (e.g. all relevant results are retrieved) as well as high precision (e.g. a high ratio of relevant to irrelevant results). In the real world this is difficult if not impossible. In fact, as precision approaches 1 (or 100%), recall tends to approach 0; conversely, as recall approaches 1 (or 100%), precision tends to approach 0.

Question 13. You have an index containing 100 documents, 10 of which are relevant to a given search and 90 of which are not. Your search retrieves 5 relevant documents and 30 irrelevant documents. Calculate the precision and recall ratios for this search.

Question 14. Suppose you tweaked your indexing or your search and managed to retrieve all 10 relevant documents, but at the same time returned 50 irrelevant documents. Calculate your precision and recall.

Question 15. Assuming that you would rather return all of the relevant documents than miss any, what techniques might your IR system need to implement to make the results more useful?

Recall and precision are just two of the measures used in system evaluation. In addition, there are a number of affective measures, such as user satisfaction or happiness, and user-generated measures, such as rate of re-use, number of searches needed to locate a resource, or user judgments about the "best" resource.

Summary

In this class we explored types of indexing and information retrieval systems as we considered the differences between metadata-based, free-text and social/real-time information retrieval systems. We learned about a key measure in IR - precision vs. recall - and became acquainted with how to calculate both. In doing so we just scratched the surface of IR. If you were intrigued by some of the search features in Google, you may want to try out their "Power Searching" course. If other aspects of IR intrigued you, I encourage you to explore the optional readings for this week and explore more information retrieval systems.
