Google Search Improvements

Introduction

This document covers functionality that the university has specified it wishes from a search engine, and discusses whether the Google Search Appliance has that functionality. Where relevant, it provides guidelines on getting started with that functionality.

These guidelines focus on solutions that are configured once in the appliance and then apply to all queries; there is a separate document (“Google Custom Search API”) that describes configuration on a per-query basis.

This document is not intended to address the feasibility of solutions (especially in terms of configuration time/complexity), only to describe possible approaches. It should however be noted that manually tweaking search algorithms should be considered secondary to improving the quality of the source data where possible. Better source data improves the results for all searches, not just the University of Edinburgh’s own internal search, and is likely to be more maintainable in the longer term.

Integrating External Searches (Contacts)

The current search page also performs searches on the e-mail contacts and telephone databases, and presents those results in separate tabs on the page. These searches could be integrated into the main results through use of OneBox. This provides the query to an external search web service, which returns results as an XML document for inclusion in the page.

Search Algorithm for Contact Data

There is also scope for improving the search algorithms on these external databases. At the moment the search is done only on the last term in the search query; for example, a search entered as “Ross Nicoll” results in a search for “Nicoll” against the external databases. This returns contact results including (listed in the order that they appear in the search results):

Despite the presence of an entry exactly matching the name searched for, it’s the 4th entry in that list. Even more puzzling is the listing of “Toni Collis”, who is presumably included due to a partial match against the middle of the name (this can be seen as the capitalised characters in “toNI COLLis”).

This search can likely be improved by splitting the full names of contacts in the directory into their component names, and producing an index of those names. Searches would then be performed independently against each word in the query term, and the results merged while maintaining a count of the number of times a result appears in the individual searches. Ranking the results on that count should then bring very close matches to the top. As a worked example:

Query term is “Ross Nicoll”
Query is broken into “Ross” and “Nicoll”
A search is performed for entries matching “Ross”
A search is performed for entries matching “Nicoll”
The two sets of results are merged, and the number of matches for each entry is counted. The one entry for “Ross Nicoll” matches both queries, while all other entries match only one query.
Results are ranked by match count, then alphabetically. “Ross Nicoll” is now at the top of the list, with other variations on the name listed below (in case of a mistake during search entry).
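The steps above can be sketched in a few lines of Python. This is only a minimal illustration, assuming an in-memory list of names; the function names and the sample directory entries are hypothetical and are not taken from the existing contacts search.

```python
from collections import Counter


def build_name_index(contacts):
    """Map each lower-cased name component to the set of contacts containing it."""
    index = {}
    for full_name in contacts:
        for part in full_name.lower().split():
            index.setdefault(part, set()).add(full_name)
    return index


def search_contacts(query, index):
    """Search each query word independently, then rank by match count, then name."""
    counts = Counter()
    for term in set(query.lower().split()):
        for full_name in index.get(term, ()):
            counts[full_name] += 1
    # Highest match count first; ties are broken alphabetically.
    return sorted(counts, key=lambda name: (-counts[name], name))


# Hypothetical directory entries, for illustration only.
contacts = ["Alan Nicoll", "Ross Nicoll", "Ross Smith", "Toni Collis"]
index = build_name_index(contacts)
results = search_contacts("Ross Nicoll", index)
# results == ["Ross Nicoll", "Alan Nicoll", "Ross Smith"]
```

Because the index is built on whole name components, “Toni Collis” no longer appears for this query at all.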

This could be further improved by using a “sounds-like” index generated with an algorithm such as Soundex or Metaphone. This would make it easier to match variations of names such as “Nicoll”, “Nicol”, “Nicholl”, etc.
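As an illustration of how a sounds-like key groups these spellings, the following is a small sketch of the common American Soundex variant (first letter plus three digits). It is not a proposal for the exact algorithm to adopt, and a real deployment might prefer an existing, tested library implementation.

```python
# Digit codes for American Soundex; vowels, h, w and y carry no code.
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        _CODES[letter] = digit


def soundex(name):
    """Return the four-character Soundex key for a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for c in letters[1:]:
        code = _CODES.get(c, "")
        if code and code != previous:
            key += code
        if c not in "hw":
            # h and w do not separate repeated codes; vowels do.
            previous = code
    # Pad with zeros and truncate to one letter plus three digits.
    return (key + "000")[:4]


keys = {name: soundex(name) for name in ("Nicoll", "Nicol", "Nicholl")}
# All three spellings share the key "N240".
```

Indexing contacts by this key as well as by the literal name component would let a query for any one of these spellings retrieve the others.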

Partial name matches can be useful, but require more care. These typically involve matches to the start or end of a name: for example, “Alex” may be a nickname for “Alexandra” or “Alexander”, and surname prefixes may be missed (so “Nicoll” may want to match “McNicoll” as well). As a general rule, however, a search term should not match the middle of a name (as in the case of “Toni Collis” matching “Nicoll” above). A sounds-like algorithm would also resolve many of these cases more efficiently than a partial match algorithm.
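A rule of this kind could be sketched as follows. The function names, the minimum term length of three characters, and the sample names other than those quoted above are assumptions made for the example.

```python
def partial_name_match(term, name_part, min_length=3):
    """True if term matches the start or end of a name component, never the middle."""
    term, name_part = term.lower(), name_part.lower()
    if len(term) < min_length:
        # Very short terms would match too many names to be useful.
        return False
    return name_part.startswith(term) or name_part.endswith(term)


def contact_matches(term, full_name):
    """Check the term against each component of the name separately."""
    return any(partial_name_match(term, part) for part in full_name.split())


# Hypothetical names, for illustration only.
prefix_hit = contact_matches("Alex", "Alexandra Smith")   # start of a name
suffix_hit = contact_matches("Nicoll", "Ewan McNicoll")   # end of a name
middle_hit = contact_matches("Nicoll", "Toni Collis")     # spans two components
```

Checking each name component separately is what prevents a term from matching across the space in “Toni Collis”.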

Search Databases

In the longer term, it is likely that this data should be integrated into IDM to reduce the number of different sources for the data. It may also be desirable to move the underlying database used for searching to a NoSQL database such as MongoDB, to allow for easier scalability of the search database architecture. This would be intended as a cached copy for quick access rather than as golden source data, so the increased complexity of managing data integrity in a NoSQL database is not a significant problem (the data can be refreshed from the golden source if damaged).

External Metadata

There are cases where the metadata describing a page’s contents cannot be included in-line with the page itself, for example where the page is proxied through a CMS (Polopoly) and that CMS strips out/replaces metadata as part of its processing.

In these cases metadata can be added to documents from an external data feed. This metadata could be pulled from the unproxied version of the pages, written into a data feed and passed to the Google Search Appliance for it to associate with the proxied versions of the pages.

The biggest problem with this approach would be the need to map the URLs of unproxied pages to the addresses of their proxied versions. Polopoly maintains a list of these mappings itself, and if that list could be extracted programmatically it would resolve this issue. If not, there would be significant duplication of effort in maintaining the list in two places.
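The mapping step can be sketched independently of the feed format. The example below assumes the URL mappings have already been exported as a simple dictionary; the function name, URLs and metadata fields are hypothetical. It re-keys the collected metadata by proxied URL and reports any pages that have no mapping, which is where a manually maintained list would need attention.

```python
def rekey_metadata(metadata_by_source_url, url_map):
    """Re-key metadata from unproxied URLs to proxied URLs.

    Returns (mapped, unmapped): metadata keyed by proxied URL, and the
    sorted list of unproxied URLs that have no known mapping.
    """
    mapped, unmapped = {}, []
    for source_url, metadata in metadata_by_source_url.items():
        proxied_url = url_map.get(source_url)
        if proxied_url is None:
            unmapped.append(source_url)
        else:
            mapped[proxied_url] = metadata
    return mapped, sorted(unmapped)


# Hypothetical URLs and metadata, for illustration only.
url_map = {
    "http://origin.example.org/courses/a": "http://www.example.org/courses/a",
}
metadata = {
    "http://origin.example.org/courses/a": {"status": "active"},
    "http://origin.example.org/courses/b": {"status": "inactive"},
}
mapped, unmapped = rekey_metadata(metadata, url_map)
```

The `mapped` records are what would be written into the data feed; the `unmapped` list gives a cheap check on how complete the mapping is.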

Query Autocomplete

Automatic query suggestion/completion is not provided as part of the Google Search Appliance by default; however, it is available as an add-on. This still appears to be relatively experimental, and requires a significant degree of setup to add the new front-end (and software to drive it), ensure the suggestions are useful, etc.

Query Suggestion/Widening

The search appliance has functionality both for suggesting alternative searches, and for widening a search that has been performed.

Search suggestions are useful in cases where it’s unclear whether it is appropriate to perform the alternative search. This would be useful for suggesting alternative names for entities being searched for; one example provided by Google is a search for “turntables” that prompts the user:

You could also try: Acme Portable Turntable

Query expansion, on the other hand, adds additional query terms to the existing search. For example, a student searching for “divinity” may additionally be presented with results for “theology”. Synonyms can be specified both as a custom list for the appliance and from Google-provided defaults.
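Conceptually, expansion with a custom synonym list behaves like the sketch below. This is not the appliance’s own configuration format or implementation; the function name and the dictionary layout are assumptions used only to show the effect on a query.

```python
def expand_query(query, synonyms):
    """Return the query terms plus any synonyms, in order and without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term, *synonyms.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


# Hypothetical custom synonym list, using the example from the text.
synonyms = {"divinity": ["theology"]}
terms = expand_query("divinity courses", synonyms)
# terms == ["divinity", "theology", "courses"]
```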

Result Influencing (Weighting/Boost and Bury)

The search appliance can modify search result ranking based on a number of criteria, including metadata included within the content itself, and the location of the content.

For example if it was considered desirable to demote results which relate to courses that are no longer run, the state of the course could be included in the page metadata and the search appliance instructed to check for the presence of that metadata and use it to lower the ranking of the page.

In contrast, if it was desirable to increase the ranking of academics’ pages (for example to show academics whose research interests match a query) in preference to a page listing events on a topic, the appliance could be instructed to raise the ranking of pages located within the academic staff directory/directories.
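The effect of the two cases above can be illustrated with a small re-ranking sketch. It does not describe how the appliance implements result biasing internally; the scores, weights, metadata name (“course-status”) and URLs are all hypothetical.

```python
def adjusted_score(result, metadata_weights, url_weights):
    """Multiply a result's base score by every weight whose rule it matches."""
    score = result["score"]
    for (name, value), weight in metadata_weights.items():
        if result.get("meta", {}).get(name) == value:
            score *= weight
    for prefix, weight in url_weights.items():
        if result["url"].startswith(prefix):
            score *= weight
    return score


def rerank(results, metadata_weights, url_weights):
    """Order results by adjusted score, highest first."""
    return sorted(
        results,
        key=lambda r: adjusted_score(r, metadata_weights, url_weights),
        reverse=True,
    )


# Hypothetical results and rules, for illustration only.
results = [
    {"url": "http://www.example.org/courses/old", "score": 0.9,
     "meta": {"course-status": "inactive"}},
    {"url": "http://www.example.org/events/talk", "score": 0.8, "meta": {}},
    {"url": "http://www.example.org/staff/jane", "score": 0.6, "meta": {}},
]
metadata_weights = {("course-status", "inactive"): 0.25}   # bury inactive courses
url_weights = {"http://www.example.org/staff/": 1.5}       # boost staff pages
ranked_urls = [r["url"] for r in rerank(results, metadata_weights, url_weights)]
```

With these weights the staff page moves to the top, the events page stays in the middle, and the inactive course drops to the bottom despite having the highest base score.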

Metadata would normally be added into the document as “meta” tags, inserted by the content management system (Polopoly, Drupal or otherwise). Example tags could include (these are independent examples and not intended to be put on a page together):

These examples would, in order, be used for:

A course that is no longer active, and therefore should be “buried” in the search results. In counterpoint an “active” course might be weighted very highly in the results.
A page reflecting research material, which might be weighted as higher than non-research material.
A page for an academic member of staff, which might be weighted below their research pages.

Result Spotlighting

For results that are considered of particular importance, it is possible to show alternative search results outside the main results list. This could be appropriate, for example, for staff contact details, courses, etc. The functionality for doing this is referred to as “OneBox”, and an example can be seen in the St Andrews search, where related links in Moodle are presented in their own section at the top of the page.

Separation of Media Types

Results can be filtered by type, so for example if a student wished to search purely for videos there could be an alternative search page that only displays video-like file types.
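The idea of a video-only results page can be illustrated by filtering result URLs on their file extension. This sketch is independent of how the appliance itself applies the filter, and the list of extensions treated as “video-like” is an assumption.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed set of video-like extensions; adjust to the formats actually in use.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi", ".wmv", ".flv"}


def is_video(url):
    """True if the URL path ends in one of the video-like extensions."""
    path = urlparse(url).path
    return PurePosixPath(path).suffix.lower() in VIDEO_EXTENSIONS


def only_videos(urls):
    """Keep only the URLs that point at video-like files."""
    return [url for url in urls if is_video(url)]


# Hypothetical result URLs, for illustration only.
videos = only_videos([
    "http://www.example.org/media/lecture.MP4?session=1",
    "http://www.example.org/media/notes.pdf",
])
```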

Separation of Content by Content Attributes

Results can be filtered, for example to exclude out-of-date or sensitive information from being displayed as a search result. This can be done on various criteria; in this case use of a meta-tag in the content would be an appropriate solution.