Developing a Comprehensive Patent-Related Information Retrieval Tool

Abstract

The volume of regulatory and related information available online is growing explosively. This paper reviews the current state of practice in accessing patent-related documents such as patents, government regulations, court cases, and scientific publications. It proposes an ontology-based framework to retrieve documents from multiple heterogeneous databases. A use case, erythropoietin, is developed to test the framework. A corpus of 135 patents and 30 court cases closely related to the use case is built. Methods that use bio-ontologies to improve document retrieval results in the two databases are discussed. The challenges faced in integrating these documents, and future plans, are briefly discussed.

  1. Introduction

Science and technology play a central role in the regulatory state in our system of government. Administrative agencies such as the Food and Drug Administration (FDA), the Environmental Protection Agency (EPA), the U.S. Patent and Trademark Office (USPTO), the Nuclear Regulatory Commission (NRC), and the Federal Communications Commission (FCC) deal with various science and technology issues. The agencies promulgate regulations that appear in the relevant chapters of the Code of Federal Regulations (CFR), and they interpret these regulations and the applicable United States Code. In addition, the courts (often federal courts) interpret the relevant U.S. statutes and federal regulations (CFR). Moreover, there is often a need to consult additional literature in the form of technical and scientific publications.

Let us consider a few examples. If a company wants to study the market for acid reflux drugs, it may consult the FDA web site, look for court cases involving these drugs, and study relevant technical publications. Similarly, a start-up company looking to work on breast cancer therapeutics may study patents in this field, determine whether any of those patents were litigated, and review the applicable scientific and technological literature.

In each situation, we have a common problem. The relevant information must be accessed, and it is spread across different information domains. In addition, even within one domain, the information may not be easily accessible and searchable. Broadly speaking, information on a particular topic resides in (a) an administrative agency; (b) the court system; (c) the relevant laws and regulations; and (d) other literature such as scientific publications. To develop a basic understanding of a topic, a user needs the ability to search, collate and analyze information across all these information domains.

Recent years have seen tremendous advancement and expansion in research and development in various fields of science and technology. New fields continue to emerge and open up more opportunities. In 2009 alone, 485,312 patent applications were filed at the USPTO. This implies a tremendous increase in the information available, and hence a need for knowledge bases that can efficiently manage such explosive growth.

From a technology firm’s perspective, there are many concerns that call for a thorough analysis of existing work in the specific domain and in related fields. These concerns include protecting the firm’s inventions, staying up to date on competitors, taking care not to infringe anyone else’s patents, and checking whether the validity of those patents has been challenged in the USPTO (“Patent Office”) or the courts.

The documents to be studied range from patent documents, scientific publications and court cases to other government administrative agency and court documents. Furthermore, these documents are distributed across many different organizations and databases. For example, there are around 40 different patent issuing authorities across the world and tens of thousands of scientific journals. These different domains and databases are very often not compatible with one another, and hence a manual scan through each possible database is not feasible in terms of time and labor. Existing frameworks assist in searching for documents in each individual domain, especially for patent documents and scientific publications. However, frameworks for accessing other legal documents such as court cases and government regulations are rather lacking. Very little effort has been made to integrate such diverse, heterogeneous documents.

The framework proposed in this paper attempts to provide an integrated approach, with a single interface or multiple compatible interfaces, for retrieving related documents from across many of these incompatible domains and databases. The approach intends to exploit the information available in each set of documents that helps improve automated relevancy estimates across incompatible yet related domains.

In this work, we choose to develop a use case in the biotechnology arena involving the hormone erythropoietin, which regulates red blood cell production. We develop a set of results for this particular application as an exemplar to illustrate our work and to test the efficacy of our proposed approach. Specifically, we test one part of the framework on two sets of documents: court cases and patents.

This paper is organized as follows: Section 2 presents some common challenges associated with the various types of documents and existing work to address them. Section 3 introduces the use case, and Section 4 discusses the framework. Section 5 reports selected preliminary results and observations from the use case study. Section 6 briefly discusses continuing efforts and future work.

  2. Background and Relevant Work

This section introduces the presently available tools and resources for working with the different document types related to patent regulations. The challenges faced in accessing each set of documents are briefly discussed.

2.1 Patents

Currently there are over 7 million issued U.S. patents. In 2009, 485,312 patent applications were filed with the USPTO [12]. In addition, there are over 40 different patent issuing authorities across the world, including the European Patent Office and the Japanese and German patent offices. Espacenet is a service offered by the European Patent Office (EPO) that supports search by keywords, inventor names, European classification, etc. [14]. Thomson Innovation offers patent analysis and translation services (Delphion) as well as publication search such as Web of Science [13]. Other sources include Dialog LLC, which is an online information retrieval system, Google Scholar, the USPTO’s website, WIPO [18], etc. Research is also very active in using semantic web technologies to represent patent structures via ontologies and to facilitate content-based document retrieval. Some natural language processing (NLP) techniques have also been employed to extract important information such as claim text, chemical compounds, etc. from patent documents [6].

2.2 Court Cases and Litigation Documents

Intellectual property (IP) litigation is an important part of the patent filing and regulation compliance framework. Information regarding whether a patent has been previously litigated is valuable. Patent cases are heard in 94 District Courts and one Court of Appeals for the Federal Circuit (CAFC). PACER (Public Access to Court Electronic Records) is one electronic system for accessing the databases of U.S. court cases [15]. Manually scanning each of these 95 databases is not a feasible option. Currently, PACER requires one to know the party/assignee name or the case number; in other words, it does not allow keyword-based search, and it hosts image documents whose text cannot be searched automatically. Furthermore, keyword-based search may not be the most effective because it lacks context.

2.3 Patent File Wrappers and Scientific Publications

File wrappers contain information about the scope of protection, application/patent data, prosecution history and other examination information. All of this can be key information for relevancy estimates in document retrieval. Patent file wrappers can be downloaded from Patent Application Information Retrieval (PAIR), which has public and private versions. File wrappers available for download from PAIR are in image form and require extra processing before text can be extracted from them.

Scientific publications cover a very broad set of topics and are spread across many different databases. Available tools that facilitate such a search include PubMed, MedLine, and Google Scholar [16]. For an estimate, PubMed contains published articles from over 300 research journals. In addition, there exist conference proceedings and workshop presentations that can hold further valuable information for an organization, a patent examiner, and others.

While there is plenty of ongoing work on document retrieval in each of these independent domains, there is little work on bringing many of these databases together into a common platform to help solve the problem of retrieving highly related sets of documents from various domains.

  3. Use Case: Erythropoietin/EPO

The proposed preliminary framework will be demonstrated through the use case “erythropoietin”. The use case will not only help in evaluating the effectiveness of the framework, but also demonstrate how each document-specific or domain-specific challenge is handled.

Erythropoietin is a hormone that controls erythropoiesis, the production of red blood cells in bone marrow. Interest in this hormone has grown rapidly since its discovery because of its functional importance. External preparation of erythropoietin has made possible the treatment of diseases such as anemia (the inability to produce sufficient red blood cells in the body). Synthetic production of this hormone has led to several commercially available drugs. Amgen Inc., an international biotechnology company, produced the first commercially available drug, Epogen. Amgen holds 5 patents for the production of erythropoietin: US5547933, US5618698, US5621080, US5756349 and US5955422, which have since been cited as well as challenged by others. Other pharmaceutical companies that have shown interest include Hoechst Marion Roussel, Transkaryotic Therapies and Reliance Life Sciences.

This use case is well documented and forms a rich corpus with documents from a variety of domains. Following forward and backward citations of the five core patents, we have identified 135 closely related patents to complete the use case. The five core patents have been challenged several times in U.S. courts. Approximately 20 court cases have taken place from the late 1980s to the present. These sets of patents also reference over 3000 publications. Taken together, these documents make a good corpus with which to test the framework.
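The citation-following step described above can be sketched in Python as follows. This is only an illustrative sketch: the text does not state how many citation hops were followed, so a single hop is assumed, and every identifier other than the five core patent numbers (PRIOR-A, LATER-B, LATER-C) is a hypothetical placeholder rather than a real patent.

```python
from collections import defaultdict

# Core patents named in the text.
CORE_PATENTS = {"US5547933", "US5618698", "US5621080", "US5756349", "US5955422"}

# Hypothetical (citing, cited) pairs; real pairs would come from patent records.
CITATIONS = [
    ("US5547933", "PRIOR-A"),   # a core patent cites an earlier patent
    ("LATER-B", "US5547933"),   # a later patent cites a core patent
    ("LATER-B", "US5621080"),
    ("LATER-C", "LATER-B"),     # two hops away from the core set
]


def one_hop_neighbors(core, citations):
    """Return patents that cite, or are cited by, any core patent."""
    backward = defaultdict(set)  # patent -> patents it cites
    forward = defaultdict(set)   # patent -> patents citing it
    for citing, cited in citations:
        backward[citing].add(cited)
        forward[cited].add(citing)
    related = set()
    for patent in core:
        related |= backward[patent] | forward[patent]
    return related - set(core)


related = one_hop_neighbors(CORE_PATENTS, CITATIONS)
print(sorted(related))
```

Under the single-hop assumption, LATER-C is excluded because it only cites another related patent, not a core patent.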

3.1 Ontologies

The fields of biomedicine and biotechnology are growing rapidly, with a huge amount of research being carried out daily. The terminology in use is extensive and hard to keep up with. There is hence a need for a controlled vocabulary to help correlate existing and new terms. One solution is provided by ontologies. An ontology is a formal representation of key concepts or properties in a domain and the relations connecting these individual entities. This kind of formal representation is widely used in areas such as biomedical informatics and artificial intelligence. Several efforts are under way to standardize the use and representation of biomedical terminology.

A domain-specific ontology is more than a simple thesaurus of words; it formalizes the idea of classes and domains. Domain ontologies provide more information about how one concept is related to another. Ontologies can be used as a backbone for content and information retrieval systems. GoPubMed uses the Gene Ontology and Medical Subject Headings (MeSH) to access the PubMed database, while others make use of NLP techniques to search the PubMed database [17]. Ontologies have also been used previously for patent retrieval [1-2, 7-8].

Ontologies address or build upon a certain aspect of a domain. Each ontology considers and relates concepts from a different viewpoint. For example, the Gene Ontology (GO) organizes its ontologies around three aspects of a concept: as a cellular component, as part of a biological process, and as a molecular function. The ontology shown in Fig. 1 represents ‘erythropoietin receptor binding’ as a molecular function.

Fig 1: Gene Ontology [19]

The ontology describes erythropoietin receptor binding as a type of cytokine receptor binding, which is in turn a type of receptor binding, and so on. The tree directly gives us a set of key phrases, such as “cytokine receptor binding” and “binding”, which can be matched with the text contained in documents. In addition, the class properties contain definitions and relations for concepts.
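As a minimal sketch of how such key phrases could be read off the tree, the Python fragment below walks an is-a chain upward from a concept. The dictionary is a toy stand-in for the real ontology: it contains only the concepts named above, assumes each concept has a single parent (real ontologies may allow several), and the sample sentence is made up for illustration.

```python
# Toy is-a links: child concept -> parent concept.
# Only the concepts named in the text are included; the link from
# "receptor binding" to "binding" stands in for the "and so on" steps.
IS_A = {
    "erythropoietin receptor binding": "cytokine receptor binding",
    "cytokine receptor binding": "receptor binding",
    "receptor binding": "binding",
}


def ancestor_phrases(concept, is_a):
    """Walk up the is-a chain and return the labels met on the way."""
    phrases = []
    seen = {concept}
    while concept in is_a:
        concept = is_a[concept]
        if concept in seen:  # guard against accidental cycles
            break
        seen.add(concept)
        phrases.append(concept)
    return phrases


phrases = ancestor_phrases("erythropoietin receptor binding", IS_A)

# Made-up sentence standing in for document text.
text = "The protein shows cytokine receptor binding activity in vitro."
matched = [p for p in phrases if p in text.lower()]
print(phrases)
print(matched)
```

Because the broader phrases are substrings of the narrower ones, a document that mentions a specific concept also matches its more general ancestors, which is one reason very general terms such as “binding” may need to be weighted lower in practice.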

Each set of documents is written in a very different writing and formatting style and is intended for a different audience. Publications make heavy use of technical jargon, whereas court cases are long and very descriptive, emphasizing clarity in the statements being made. Patent documents also make good use of technical jargon but employ the writing styles required of legal documents.

A single ontology may result in a very narrow term expansion. BioPortal is a web-based application for accessing and sharing bio-ontologies, provided by the National Center for Biomedical Ontology [19]. It offers a common searchable and interactive interface to over 130 bio-ontologies. Other sources for ontologies and taxonomies include GeneCards, MedTerms, etc. In this paper, we search for all ontologies made available through BioPortal that contain the keyword erythropoietin/EPO either as a concept or as a relation to another concept. We make use of all these ontologies to obtain not only a larger term base but also a broader scope and hence a broader coverage of documents. The results of using these ontologies are shown in the results section.

3.2 Document Organization and Indexing

Each type of document is written and made available in a different format. As a first step, these documents are converted into a common format, which makes them easier to manage and operate on. The format chosen is XML, and an automated script converts the downloaded HTML files into XML files based on the fields/features of each document. For example, a patent document is re-arranged into XML as shown in Fig 2.
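The sketch below illustrates the target of such a conversion using Python's standard xml.etree.ElementTree module. It is not the script used in this work: the HTML parsing step is omitted, and the element names (number, title, abstract, claims) and the placeholder values are assumptions chosen for illustration, not the actual layout of Fig 2.

```python
import xml.etree.ElementTree as ET


def patent_to_xml(fields):
    """Build a <patent> element with one child element per field."""
    root = ET.Element("patent")
    for name, value in fields.items():
        child = ET.SubElement(root, name)
        child.text = value
    return root


# Hypothetical field names and placeholder values (not real patent text).
fields = {
    "number": "US5547933",
    "title": "placeholder title",
    "abstract": "placeholder abstract",
    "claims": "placeholder claims",
}
root = patent_to_xml(fields)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Keeping one element per field means the same names can later be reused as field names when the documents are indexed.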


Fig 2: Sample patent XML file

Using XML not only allows these documents to be maintained in a uniform format, but would also facilitate easy transformation into an ontology-based language such as OWL, similar to the work done by Giereth et al. [4-5] and Wannar et al. [3]. Building an ontology for this application is a way to formalize the framework. It will encode knowledge about the various aspects discussed in this paper.

We choose to index and search the documents using Apache Lucene, an open source information retrieval library that offers full-text indexing and search APIs and is widely used in web search utilities. It makes use of the vector space model to represent documents. In the vector space model, every document is represented as an n-dimensional vector where each dimension is a unique word occurring in the text, and the magnitude along each dimension is the frequency of occurrence of that word in the document. Very frequent words such as a, the, of, etc., known as stop words, are filtered out because they generally carry no information about the document. Words occurring in the text are trimmed down to their roots; this process is called stemming and makes the model more efficient. Apache Lucene allows indexing documents in the form of ‘fields’, which makes it possible to constrain the search to one or more subsections of the document. It uses tf-idf to rank the documents against the query. Tf-idf is a scoring factor that takes into account the frequency of occurrence of a word in a document (term frequency) and how rare the word is across all the documents (inverse document frequency).
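To make the vector space model and tf-idf concrete, the following Python sketch computes a textbook tf-idf weight, tf × log(N / df), for a few made-up sentences. It is an illustration only and does not reproduce Apache Lucene's actual scoring formula; stemming is omitted, and the stop word list and tokenizer are simplified assumptions.

```python
import math
from collections import Counter

# A tiny illustrative stop word list.
STOP_WORDS = {"a", "the", "of", "in", "is", "and"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:()") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return vectors


def score(query, vector):
    """Sum the weights of the query terms present in the document."""
    return sum(vector.get(t, 0.0) for t in tokenize(query))


# Made-up sentences standing in for indexed documents.
docs = [
    "Production of erythropoietin in the host cell.",
    "The court construed the claims of the patent.",
    "Erythropoietin is a hormone.",
]
vectors = tf_idf_vectors(docs)
scores = [score("erythropoietin hormone", v) for v in vectors]
print(scores)
```

The third sentence ranks highest because it contains both query terms and because "hormone" appears in only one document, so its inverse document frequency is larger than that of "erythropoietin".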

  4. Proposed Framework

There are many aspects of multi-domain document search that can be exploited for effective document retrieval. The framework proposed in this paper attempts to achieve this in several stages, which are explained below. Fig. 3 gives a top-level picture of the framework. The corpus is a combined document set of various types such as patents, court cases, publications, etc. The framework interacts with the corpus based on the user query and returns the results to the user.

4.1 Keyword Expansion

The field of biotechnology is rapidly expanding. While the number of documents to be managed is increasing immensely, there is a marked lack of standard terminology. This poses a problem for applications that deal with biomedical text, such as search engines. A bag-of-words model may not recognize a relevant document unless the document contains the concept exactly as it appears in the model. One solution is the use of bio-ontologies, which address this issue by defining concepts and relations. Given a keyword, it is possible to find related concepts by parsing the tree and the class properties. This can significantly boost the number of hits.
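The effect of keyword expansion on the number of hits can be illustrated with a small Python sketch. The related-term table is hypothetical (in the framework these terms would be drawn from the selected bio-ontologies), the document snippets are made up, and matching is done on whole lowercase words rather than through an index.

```python
# Hypothetical related terms; in the framework these would come from
# the tree and class properties of the selected bio-ontologies.
RELATED_TERMS = {
    "erythropoietin": ["epo", "erythropoiesis"],
}

# Made-up snippets standing in for indexed documents.
DOCS = {
    "doc1": "recombinant erythropoietin produced in mammalian cells",
    "doc2": "the accused product is sold under the name epo",
    "doc3": "stimulation of erythropoiesis in bone marrow",
    "doc4": "a method for refining crude oil",
}


def expand(keyword, related_terms):
    """Return the keyword followed by its related terms."""
    return [keyword] + related_terms.get(keyword, [])


def hits(terms, docs):
    """Return ids of documents containing at least one term as a word."""
    return sorted(
        doc_id for doc_id, text in docs.items()
        if any(term in text.split() for term in terms)
    )


plain = hits(["erythropoietin"], DOCS)
expanded = hits(expand("erythropoietin", RELATED_TERMS), DOCS)
print(plain)
print(expanded)
```

In this toy corpus the expanded query retrieves three documents where the plain keyword retrieves one. Expansion terms still need vetting: the abbreviation EPO, for instance, is also used in this paper for the European Patent Office.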

Fig 3: Proposed Framework

The goal of this stage is to use the ontologies, based on the user query, to maximize the number of document hits. There are several challenges associated with this stage. Picking a relevant ontology is very important, as an irrelevant ontology could produce a pool of irrelevant keywords and compromise document retrieval from the very beginning. Close attention also needs to be paid to the differing usage of terms in each type of document. For example, a court case may use less technical terms than a scientific publication. A patent normally restricts itself to the topic at hand, but the references it cites may vary widely and require the ontology to cover a sufficient breadth of terms. We must hence clearly define which ontology applies to each domain, since selecting irrelevant ontologies could lead to inaccurate results.