The Technology of Connectors
(for powerful federated search)
Many complex technology components are necessary to make a strong federated search platform, such as an Aggregator to collect, filter and rank results; a User Interface to help the user interact with the results set; and a Clustering Engine to group related results into clusters. Perhaps the most misunderstood, yet most important, component of federated search is the Connector.
In general terms, Connectors make it possible to talk to other data sources. More specifically, a Connector does a number of things to make federated search possible:
1. Reformat the search query: Search engines differ, and most have their own unique commands for running a search. A Connector must therefore take the search term or terms submitted by the end-user and translate them into a form the source accepts.
2. Submit proper authentication: Much of the deep web is behind a password prompt, firewall or other barrier which limits access. A Connector must therefore know how to “sign in” to submit a search to many sources.
3. Extract the results: Once a query is submitted to a source, the results must be received by the Connector. This may be more difficult than it sounds, since sources often implement unique options for showing results (for example: number per page, sort options, and level of detail). The Connector needs to be smart enough to know how to extract the right results from the source it interfaces with.
4. Parse the results: Once the results set is obtained, the Connector must “read” the results and make sure the proper meta-data is extracted from the results sets. For example, different sources will display date, title, author and other information in different ways. The Connector must be aware of how the source outputs the results, and parse through the results, so the Aggregator can treat all results from all sources in the same way.
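The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name ExampleConnector, the X-Password header, the " AND " query syntax and the "title|author|date" response layout are all invented for an imaginary source, and do not describe any real product or publisher.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Result:
    """Normalized record that a (hypothetical) Aggregator would consume."""
    title: str
    author: str
    date: str


class ExampleConnector:
    """Illustrative Connector for an imaginary source; not a real product API."""

    def __init__(self, fetch: Callable[[str, Dict[str, str]], str], password: str):
        self.fetch = fetch        # stand-in for the network call to the source
        self.password = password

    def reformat_query(self, terms: List[str]) -> str:
        # Step 1: this imaginary source wants terms joined with " AND ".
        return " AND ".join(terms)

    def authenticate(self) -> Dict[str, str]:
        # Step 2: this imaginary source expects a credential header.
        return {"X-Password": self.password}

    def extract(self, query: str) -> str:
        # Step 3: submit the query and receive the raw response text.
        return self.fetch(query, self.authenticate())

    def parse(self, raw: str) -> List[Result]:
        # Step 4: this imaginary source returns "title|author|date" lines.
        results = []
        for line in raw.splitlines():
            if not line.strip():
                continue
            title, author, date = (part.strip() for part in line.split("|"))
            results.append(Result(title, author, date))
        return results

    def search(self, terms: List[str]) -> List[Result]:
        return self.parse(self.extract(self.reformat_query(terms)))


def fake_source(query: str, headers: Dict[str, str]) -> str:
    # Stand-in for a remote source; returns nothing without the credential.
    if headers.get("X-Password") != "secret":
        return ""
    return "Deep Web Searching | Smith, John | 2009-01-15\n"
```

Calling ExampleConnector(fake_source, "secret").search(["deep", "web"]) yields one normalized Result. A real Connector would replace fake_source with an actual request to the source and would need a parser matched to that source's output.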
There are several ways Connectors “talk to sources” to obtain search results. The most popular are:
1. Screen scraping: Screen scraping is the technique of simulating user input to a particular source, and literally “reading” the results that source returns, as a web browser would (i.e., over the HTTP protocol).
2. Z39.50: Z39.50 is an older specification in use by many search engines, especially the integrated library systems (ILS) and catalog systems used by libraries. While Z39.50 is a powerful specification, it does pose a number of challenges for federated search. Deep Web Technologies has extended the popular YAZ Proxy service to deal with these challenges.
3. API / XML Gateways: API stands for “Application Programming Interface.” This is a newer and more popular way to interface with sources. A publisher or source that provides an API usually makes it possible to obtain accurate meta-data and faster searches.
4. API / XML Standards: Any publisher that supports an existing or emerging standard for their API / XML Gateway will not only gain all the normal benefits of a documented Gateway but will often enjoy “out of the box” connector support from day one. Some existing standards are SRU, SRW, and OpenSearch.
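To show why a standard makes “out of the box” support practical, here is a minimal sketch of building request URLs for two of the standards named above. It assumes the common OpenSearch template parameters (searchTerms, startIndex, count) and the SRU 1.1 searchRetrieve parameters; the function names and the example.org addresses are placeholders, not real endpoints.

```python
from urllib.parse import quote, urlencode


def opensearch_url(template: str, terms: str, start: int = 1, count: int = 10) -> str:
    """Fill the common OpenSearch URL template parameters."""
    values = {
        "searchTerms": quote(terms),
        "startIndex": str(start),
        "count": str(count),
    }
    url = template
    for name, value in values.items():
        # Template parameters may be required ({name}) or optional ({name?}).
        url = url.replace("{" + name + "?}", value).replace("{" + name + "}", value)
    return url


def sru_url(base: str, cql_query: str, start: int = 1, maximum: int = 10) -> str:
    """Build an SRU searchRetrieve request URL carrying a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "startRecord": start,
        "maximumRecords": maximum,
    }
    return base + "?" + urlencode(params)


# Placeholder endpoints, for illustration only.
print(opensearch_url("http://example.org/search?q={searchTerms}&n={count?}", "deep web"))
print(sru_url("http://example.org/sru", 'dc.title = "deep web"'))
```

Because every compliant source accepts the same parameters, one generic Connector of this kind could in principle serve many publishers, with only the base address changing.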
Connectors must deal with a variety of challenges to provide outstanding federated search:
1. Logging into or accessing restricted sources.
2. Dealing with sources that do not rank their results sets.
3. Slow or unresponsive sources.
4. Obtaining enough results, especially when results may be located on more than one page (in screen scraping situations).
5. Avoiding overloading a source by submitting too many searches.
6. Parsing complex HTML and AJAX interfaces.
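Challenges 3, 4 and 5 interact: a Connector wants more pages of results, but must not hammer the source or wait forever. The following is a rough sketch of one way to balance them, assuming a hypothetical fetch_page(page_number) function that returns a list of results and raises TimeoutError when the source is too slow. The default interval and page limit are arbitrary illustrative values.

```python
import time
from typing import Callable, List, Optional


def collect_results(
    fetch_page: Callable[[int], List[str]],
    wanted: int,
    min_interval: float = 1.0,
    max_pages: int = 5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Gather up to `wanted` results across pages without hammering the source."""
    results: List[str] = []
    last_request: Optional[float] = None
    for page in range(1, max_pages + 1):
        if last_request is not None:
            # Challenge 5: keep requests at least min_interval seconds apart.
            wait = min_interval - (clock() - last_request)
            if wait > 0:
                sleep(wait)
        last_request = clock()
        try:
            items = fetch_page(page)
        except TimeoutError:
            # Challenge 3: a slow source ends collection; keep what we have.
            break
        if not items:
            break  # the source has no more pages
        results.extend(items)
        if len(results) >= wanted:
            break  # Challenge 4: enough results gathered
    return results[:wanted]
```

The clock and sleep parameters exist only so the pacing can be tested without real delays; in normal use the defaults apply.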
Not all Connectors are the same!
1. Dumbing Down: Does a Connector take advantage of the capabilities of a source, or is it built to the “least common denominator” across all sources?
2. Advanced Searching: Does a Connector support advanced searching functionality, such as searching on fields like author, title, keywords and date ranges? Can it intelligently handle sources that do not support some or all fielded searches?
3. Normalized Author Fields: Does a Connector normalize author input and search on author fields differently depending on the source being searched? For example, an author search for John Smith might be submitted as John Smith to one source, as Smith, John to another, and as Smith, J to a third.
4. Full Boolean Searching: Does a Connector support AND, OR and NOT, when supported by the source, or does it simply make assumptions and hope for the best?
5. Exact Wording or Phrase Searching: Does a Connector support phrase searching, when supported by the source?
6. Wild-Card Searching: Does a Connector support wild-card searching, when supported by the source?
7. Authentication Methods: There are many authentication methods in use on the Internet, and any publisher using an authentication method unsupported by a Connector is unavailable to the federated search. Will the Connector support session IDs, cookies, SSL encryption or any of the myriad others out there?
8. Contextual Sorting: Will a Connector utilize different sorting options at the source depending on the search criteria of the end user? For example, when the newest results from a source are desired, will the Connector retrieve results sorted by date rather than relevance?
9. Normalized Date Fields: There is an amazing variety of ways to write dates. Does the Connector normalize date formats so dates from disparate sources can be accurately sorted and compared?
10. Flexible Technology: Our Connectors are written in JRuby, Java and XML, which means that there are very few sources for which Deep Web Technologies cannot create a Connector.
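Author and date normalization (items 3 and 9 above) are easy to picture with a small example. The sketch below is written in Python for brevity and is not the JRuby or Java code mentioned above. It assumes simple "First Last" names, English month names, and US month/day ordering for slash-separated dates; the function names and the list of formats are illustrative choices only.

```python
from datetime import date, datetime
from typing import Dict, Optional


def author_variants(name: str) -> Dict[str, str]:
    """Render a 'First Last' name in three styles (middle names are dropped)."""
    parts = name.split()
    if not parts:
        raise ValueError("author name is empty")
    first, last = parts[0], parts[-1]
    return {
        "natural": f"{first} {last}",      # John Smith
        "inverted": f"{last}, {first}",    # Smith, John
        "initial": f"{last}, {first[0]}",  # Smith, J
    }


# Formats tried in order; an assumed, deliberately short list.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%B %d, %Y", "%Y")


def normalize_date(text: str) -> Optional[date]:
    """Return a comparable date, or None when no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

A Connector would pick the author variant its source expects when sending a query, and would apply something like normalize_date to incoming results so that dates from all sources sort together. A year-only value such as "2009" is mapped to January 1 of that year in this sketch, which is a simplification.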
Not all federated search solutions are the same. Aside from the differences in Connector features and benefits, as enumerated above, also consider the following:
1. Incremental Results: Can you receive results incrementally, meaning the federated search solution returns results from fast sources right away and (optionally) merges in results from slower sources as they arrive? In the age of Google, end-users expect immediate results, and incremental results help satisfy the hunger for fast results.
2. Alerts: Does your federated search solution enable end-users to create daily alerts for full Boolean searches against their favorite sources?
3. Smart Queuing: Does your federated search solution manage sources by automatically disabling unavailable sources, or does it wait for a timeout each time a search is performed?
4. Connector Monitoring and Support: Does your federated search solution generate emails to you and your provider when Connectors fail or sources become unavailable? Do you have a bird's-eye view of the performance of your sources? How quickly does your provider respond to issues?
5. Custom Connector Development: Do you wish your federated search solution covered sources that it does not cover today? Is your provider willing to affordably produce high-quality Connectors for your in-house sources?
6. Relevance Ranking: Does your federated search solution provide relevance ranking, by topic, author, date or other fields? Does it provide a five-star ranking system?
7. Smart Clustering: Does your federated search solution provide clustering, built-in without cost, that displays results ranked by relevance within each cluster?
8. Spelling Suggestions: Does your federated search solution provide a “did you mean?” capability, built-in without cost?
9. Administration: Do you have the ability to manage your sources and credentials, and to obtain reports, from a centralized administrative interface?
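Incremental Results and Smart Queuing (items 1 and 3 above) can be illustrated together. The following is a minimal sketch, not a description of any vendor's implementation: the SourcePool class, the max_failures threshold and the idea of modeling each source as a plain function are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterator, List, Tuple


class SourcePool:
    """Searches sources in parallel and skips ones that keep failing."""

    def __init__(self, sources: Dict[str, Callable[[str], List[str]]], max_failures: int = 3):
        self.sources = sources
        self.max_failures = max_failures
        self.failures = {name: 0 for name in sources}

    def enabled(self) -> List[str]:
        # Smart queuing: sources at the failure threshold are not queried.
        return [n for n in self.sources if self.failures[n] < self.max_failures]

    def search(self, query: str) -> Iterator[Tuple[str, List[str]]]:
        names = self.enabled()
        if not names:
            return
        with ThreadPoolExecutor(max_workers=len(names)) as pool:
            futures = {pool.submit(self.sources[n], query): n for n in names}
            # Incremental results: yield each source as soon as it finishes.
            for future in as_completed(futures):
                name = futures[future]
                try:
                    hits = future.result()
                except Exception:
                    self.failures[name] += 1
                    continue
                self.failures[name] = 0  # a success clears the failure count
                yield name, hits
```

A caller can loop over pool.search("query") and display each batch as it arrives, so fast sources are never held up by slow ones. This sketch never re-enables a disabled source; a fuller version would retry it after some delay.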