Rapidly Deployable, Highly Scalable Natural Language Processing Using Cloud Computing and an Open Source NLP Pipeline

David Baldwin, BS and David Carrell, PhD

Group Health Research Institute
Seattle, WA

January 2010


1. Introduction

This whitepaper describes a method of integrating the open source UIMA/cTAKES natural language processing (NLP) system with an institution’s local clinical document repositories in a way that supports two modes of scalability: 1) internal scalability within the local network through the use of multiple locally deployed instances of the UIMA/cTAKES system, and 2) external scalability leveraging internet hosted (“cloud”) computing platforms where resource scaling is managed through built-in features of the cloud computing environment. This dual-mode scalability is achieved through an integrative local web service that provides the connective tissue between local document repositories, cTAKES, and local databases for storing cTAKES-generated annotations. Because this integration is achieved through a web service, it can be readily adapted to either the local or cloud-based scheme.

In the sections that follow we present our motivation for using UIMA/cTAKES to apply NLP to clinical text in a research setting (section 2), followed by a description of the deployment and integration model we implemented to facilitate a scalable NLP system (section 3). We conclude with a discussion of novel opportunities and challenges introduced by cloud-based clinical NLP systems (section 4).

We assume the reader is familiar with the open source UIMA/cTAKES system. For those interested in learning more about UIMA/cTAKES we refer you to the Open Health Natural Language Processing (OHNLP) Consortium Web site and the Apache UIMA Web site, listed here:

OHNLP Web site: https://cabig-kc.nci.nih.gov/Vocab/KC/index.php/OHNLP

Apache UIMA Web site: http://incubator.apache.org/uima/

2. Motivation for using UIMA/cTAKES

Despite the vast amounts of structured data on patients' health and care captured in clinical data systems, large amounts of research-relevant information remain inaccessible because they are represented as unstructured or semi-structured text. The high cost associated with manual chart abstraction prevents large-scale capture of this information. To address this challenge researchers have turned to natural language processing (NLP) as a means of obtaining various types of clinical information in a more efficient and scalable manner. While extracting research-grade information from clinical text using NLP is not without its limitations, it is often the only feasible approach to marshaling data represented in textual form. NLP-extracted data can support a wide variety of health-related investigations, including comparative effectiveness research.

Until recently, deployment and use of industrial-strength NLP systems has required the support of a fairly sophisticated infrastructure, both in terms of technical capacity and human capital. For these and other reasons, successful large-scale NLP operations are rare outside major academic institutions with mature biomedical informatics departments. The availability of UIMA/cTAKES since February 2009 through open source licensing, supported by the OHNLP community of users, has significantly reduced barriers to adopting NLP in applied research settings. In our experience, deployment of UIMA/cTAKES was relatively straightforward. The one major prerequisite for deployment not represented within our existing IT infrastructure was programmer competency in Java and commonly used open source software tools. Once we acquired these skills we encountered no insurmountable challenges.

3. Integrating cTAKES with Data Repositories

Production-scale processing of local clinical text requires integrating cTAKES with local data repositories. Integrative software must provide mechanisms for selecting documents to be processed from local document repositories, handing off the document text to the cTAKES pipeline, and receiving the cTAKES-generated annotations and storing them in a database for subsequent analysis. In this section we focus on these integrative aspects of deployment, as well as an alternative approach to configuring the Unified Medical Language System (UMLS) vocabularies cTAKES uses.

The basic process of installing cTAKES is described fully in documentation provided in the Documentation and Downloads section of the OHNLP Web site referenced in section 1 above. Various approaches could be used to integrate cTAKES with local data systems to achieve a fully functional, end-to-end deployment. Different approaches will vary in the amount of custom programming required. The integrative approach we used involved attention to the following four areas:

Local computing environment (section 3.1)

Modifications to the standard installation (section 3.2)

Deploying cTAKES as a Web service (section 3.3)

Document and annotation management (section 3.4)

Each of these areas is addressed in the following sections.

3.1 The local computing environment

The local computing environment includes hardware, operating systems, and network infrastructure. The hardware capacity needed to run cTAKES depends on the volume of clinical text being processed. The “starter” system we are using in our initial deployment, a freshly deployed machine supporting no other processing and located inside our institutional firewall, has the following characteristics:

· JRE 1.5 compliant operating system (for complete specifications see the Java Web site at http://java.sun.com/javase/)

· 2 GB of physical memory

· 500 GB of hard disk space

· A single-core 2 GHz processor or better

Such a starter system proved adequate during our initial deployment and testing phase. A more powerful system, suitable for a moderate level of production processing and offering quicker turn-around times when experimentally processing corpora for annotation development and testing, has the following characteristics:

· JDK 1.5 compliant operating system (see http://java.sun.com/javase/ for complete specifications) with the Ant and Maven build tools

· 4 GB of physical memory

· 1,000 GB of hard disk space

· A dual-core 2 GHz processor or better

Because our institutional computing infrastructure is primarily Windows-based, and because we preferred developing within a UNIX environment, we chose to deploy on a Windows 2000 Server running Cygwin. Cygwin is a Linux-like environment for Windows, which means our experience should be essentially the same as it would be on a native UNIX platform.

The installation procedure for the UIMA/cTAKES stack of software, discussed in section 3.3 below, is relatively straightforward and, we believe, unlikely to vary significantly across different computing environments (i.e., Windows vs. UNIX). The most significant challenges encountered had to do with configuring our network and the stack to interact. Some salient issues included:

· Access outside of our network. The GHRI network topology includes a single-gateway proxy through which all HTTP and HTTPS traffic must pass. This was relevant for using build tools during development as well as for communicating with cloud-based servers outside the network. Appendix A provides instructions for programmatically working through this proxy in both Java and Ruby; a minimal Java sketch appears after this list.

· Compliance with local network management policies. With a 100% locally deployed system the relatively high storage and CPU demands of an NLP system required coordination with the managers of our institutional network to ensure proper stewardship of network resources. When extending the system to incorporate cloud-based processing, additional coordination was needed to address such issues as the correct routing of network traffic and proper integration with local security systems (e.g., VPN and firewall).

· Physical versus virtual machines. Deciding whether to deploy on local physical machines or local virtual machines (VMs) has implications for system performance and maintenance. Dedicated physical machines—blade servers managed by our own research institute staff—offer the greatest degree of control but also imply greater responsibility and potentially greater costs in terms of system management. VMs, which in our case are deployed on larger servers managed by the central IT staff of our parent institution, offer flexibility with respect to resource scaling and replication, but are less under our direct control. We have chosen VMs as our preferred option for local computing capacity. In addition to the scalability benefits, the ability to ‘snapshot’ and restore a VM’s image at any point in time gives flexibility and freedom to experiment.
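To illustrate the proxy issue noted in the first bullet above, the following minimal Java sketch routes outbound HTTP and HTTPS traffic through an institutional proxy by setting the standard JVM networking properties before any connections are opened. The host name, port, and bypass list shown here are placeholders, not our actual settings; the full Java and Ruby instructions are given in Appendix A.

    public class ProxyConfig {
        // Placeholder proxy host and port; substitute local values.
        public static void configure() {
            System.setProperty("http.proxyHost", "proxy.example.org");
            System.setProperty("http.proxyPort", "8080");
            System.setProperty("https.proxyHost", "proxy.example.org");
            System.setProperty("https.proxyPort", "8080");
            // Hosts inside the firewall that should bypass the proxy.
            System.setProperty("http.nonProxyHosts", "localhost|*.example.org");
        }
    }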

The level of performance cTAKES achieved was quite acceptable, even as implemented within our starter system; it processed approximately 90,000 notes with an average size of 80 lines in a 24-hour period.

3.2 Local modifications

Downloading Apache UIMA and cTAKES from their respective download sites and installing them according to the documentation available on the OHNLP Web site was uneventful. Without modification we were able to process sample text through the cTAKES pipeline and review cTAKES-generated annotations. Out of the box the system is a full-featured NLP system, capable of tokenization, shallow parsing, negation detection, dictionary look-ups, concept tagging, and more. Local requirements and use cases may, however, make it necessary to introduce modifications. For example, one of our use cases required us to label words that appeared in a subset of the UMLS dictionary. We implemented this using a custom-built Lucene index. Details of our usage of the custom Lucene index will be posted to the OHNLP user forum.
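As a rough illustration of what such a customization involves, the sketch below uses the Lucene 3.x API to build a small index over dictionary terms and their UMLS concept codes. The field names ("first_word", "text", "code") and the sample entries are assumptions made for this example; the actual field layout must match the configuration of the cTAKES dictionary look-up component, and the forum posting will contain the authoritative details.

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class DictionarySubsetIndexer {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("umls-subset-index")),
                new StandardAnalyzer(Version.LUCENE_30),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
            // One Lucene document per dictionary entry in the UMLS subset.
            addEntry(writer, "hypertension", "C0020538");
            addEntry(writer, "congestive heart failure", "C0018802");
            writer.optimize();
            writer.close();
        }

        private static void addEntry(IndexWriter writer, String term, String cui)
                throws Exception {
            Document doc = new Document();
            // The first word of the term supports look-ups keyed on the initial token.
            doc.add(new Field("first_word", term.split(" ")[0],
                Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("text", term, Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("code", cui, Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
        }
    }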

3.3 Deploying cTAKES as a service

After an initial deployment and configuration of UIMA/cTAKES in a test environment we deployed the stack to our production servers. As downloaded, UIMA/cTAKES has no external dependencies; it does not require database or network access during or after deployment. Thus, deployment was a simple matter of releasing new code on the server.

There are many ways to access UIMA/cTAKES. Relevant technical considerations in designing this system include:

· Concurrency – should the server handle multiple requests at a time?

· Security – who should be able to access the clinical text and annotations?

· Note volume – how many notes need to be processed and how quickly are they required?

We were best able to address the NLP needs of researchers at GHRI by deploying UIMA/cTAKES as a web service backed by the industrial-strength, multi-threaded Java web server Tomcat. This allows the system to scale horizontally, with multiple Web servers providing the same annotation service. Tomcat provides robust concurrency and integrated system-level access to the service, and can be configured for most major platforms. We customized the UIMA Simple Server Web service to output its data in XCAS, a standard UIMA format for data serialization, giving us the option of performing additional processing on the annotated documents later.
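For illustration, a caller might hand document text to such a service along the lines of the Java sketch below. The endpoint URL and form-parameter name are hypothetical and depend on how the UIMA Simple Server is configured locally; the response body is the XCAS XML for the submitted document.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class CtakesServiceClient {
        // Hypothetical endpoint; adjust to the local Tomcat/UIMA Simple Server deployment.
        private static final String SERVICE_URL = "http://nlp-server.internal:8080/ctakes/annotate";

        public static String annotate(String noteText) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(SERVICE_URL).openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            OutputStream out = conn.getOutputStream();
            out.write(("text=" + URLEncoder.encode(noteText, "UTF-8")).getBytes("UTF-8"));
            out.close();

            // Read back the XCAS serialization of the annotated document.
            BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
            StringBuilder xcas = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                xcas.append(line).append('\n');
            }
            in.close();
            return xcas.toString();
        }
    }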

Though we use this web service for production purposes in an entirely local deployment (all systems are inside our institutional firewall), its design allows it to be accessed remotely. This includes, for example, a configuration where the source text and annotation data repositories are located inside the institutional firewall while the UIMA/cTAKES service is deployed in the cloud using an internet computing service provider (see section 4).

3.4 Managing the transfer of documents and annotations

The most development-intensive aspect of our local deployment was integrating the cTAKES system with our existing local repositories of clinical notes (e.g., chart notes, pathology reports, radiology reports) and with database management systems for storing annotations. We wanted a system that could interact with the cTAKES web service (described in section 3.3 above) and that allowed configurable control over the types of documents processed and the priority assigned to them. We designed and implemented such a system using Ruby. The client application is a Ruby background process that checks whether new notes have arrived in the document repositories, identifies the notes of highest priority (according to a user-modifiable look-up table), and then sends them to the cTAKES web service for processing. To take full advantage of the processing power of the server, the client application manages multiple concurrent processing requests.
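Our production client is written in Ruby; to keep the examples in this paper in one language, the sketch below shows the same polling-and-dispatch logic in Java. The connection string, table names, and priority look-up are illustrative placeholders only, and the call that actually submits a note to the cTAKES web service is indicated with a comment.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class NotePoller {
        // Hypothetical connection string and schema; substitute local values.
        private static final String DB_URL =
            "jdbc:sqlserver://dbhost;databaseName=nlp;integratedSecurity=true";

        public static void main(String[] args) throws Exception {
            // A small pool of workers keeps the multi-threaded cTAKES service busy.
            ExecutorService workers = Executors.newFixedThreadPool(4);
            Connection conn = DriverManager.getConnection(DB_URL);
            while (true) {
                // Select unprocessed notes, highest priority first; priority comes
                // from a user-modifiable look-up table keyed on note type.
                Statement stmt = conn.createStatement();
                ResultSet rs = stmt.executeQuery(
                    "SELECT TOP 50 n.note_id, n.note_text "
                    + "FROM chart_notes n JOIN note_priority p ON n.note_type = p.note_type "
                    + "WHERE n.processed = 0 ORDER BY p.priority");
                while (rs.next()) {
                    final String noteId = rs.getString("note_id");
                    final String noteText = rs.getString("note_text");
                    workers.submit(new Runnable() {
                        public void run() {
                            // Hand the text to the cTAKES web service and store the
                            // returned XCAS, e.g. CtakesServiceClient.annotate(noteText).
                            System.out.println("Submitting note " + noteId
                                + " (" + noteText.length() + " chars)");
                        }
                    });
                }
                rs.close();
                stmt.close();
                Thread.sleep(60 * 1000);  // Poll for newly arrived notes once a minute.
            }
        }
    }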

We currently use a Microsoft SQL Server 2005 database to house the document repository and to store cTAKES-generated annotations. Our database schema accommodates multiple document repositories (i.e., separate data tables for each type of clinical document), with each document represented by an identifier that is unique across tables. The document text in each table is stored as a single text field; we plan to use custom-defined annotations to implement document sectioning.

The Ruby client also manages receipt and storage of the annotated text returned by cTAKES, which is represented in XCAS format (see section 3.3). Microsoft SQL Server's built-in XML query functionality allows the annotations to be queried directly. We leverage this ability to post-process the XCAS files, generating for each document a list of all UMLS concept codes and storing them in the database according to a normalized schema.
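The post-processing step can be sketched roughly as follows, assuming the XCAS XML is stored in an XML-typed column and that concept-bearing annotation elements carry a code attribute. The table, column, and attribute names here are illustrative assumptions rather than our actual schema.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class ConceptCodeExtractor {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://dbhost;databaseName=nlp;integratedSecurity=true");
            // SQL Server's XQuery support: nodes() shreds the XCAS into one row per
            // annotation element with a code attribute; value() extracts the code.
            PreparedStatement ps = conn.prepareStatement(
                "SELECT a.note_id, c.value('@code', 'varchar(32)') AS cui "
                + "FROM note_annotations a "
                + "CROSS APPLY a.xcas_xml.nodes('//*[@code]') AS t(c)");
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                // Each row pairs a document identifier with one UMLS concept code,
                // ready for insertion into the normalized annotation tables.
                System.out.println(rs.getString("note_id") + "\t" + rs.getString("cui"));
            }
            rs.close();
            ps.close();
            conn.close();
        }
    }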

4. Opportunities and challenges of cloud deployment

In this section we describe a model of deployment in which traditional hardware systems are replaced with cloud-based virtualized computing systems to simplify some aspects of deployment and achieve rapid scalability. We discuss some of the advantages of cloud-deployed systems as well as the more salient security issues they introduce and general strategies for addressing them.

Recent advances in cloud computing have expanded opportunities for secure, low-cost horizontal scaling of internet-hosted (“cloud”) web services. In the cloud-based model a user rents computing capacity in the form of virtual hardware from a cloud computing service provider. Cloud computing services may be procured from large providers such as Amazon Web Services or Slicehost, or from any of a number of smaller “boutique” providers. The physical machinery supporting cloud services typically consists of large server farms with multiple layers of redundancy. Interactions with the cloud-deployed system, whether for administrative or processing purposes, are handled through remote connections mediated by the internet.

General advantages of cloud computing include the ability to easily procure pre-built or customized machine images, rapid and potentially massive scalability of resources and processing capacity, and the ease with which an entire computing system may be cloned. Cloning is the process of creating one or more replicas of a virtual machine. Cloning can be used to achieve horizontal scaling (i.e., multiple identical computing instances managed by a load balancer) or to simplify adoption within another institution. In the latter case, the adopting institution acquires a fully functional cloud-based cTAKES system for its private use by cloning the cloud-based system deployed and tuned by another institution.
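As one hedged illustration of how cloning and horizontal scaling look in practice, the sketch below uses the AWS SDK for Java to launch additional instances from an existing machine image; a load balancer in front of the instances would then distribute processing requests. The image identifier, instance type, and credentials are placeholders, and other providers expose analogous facilities.

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.ec2.AmazonEC2Client;
    import com.amazonaws.services.ec2.model.RunInstancesRequest;

    public class CloneLauncher {
        public static void main(String[] args) {
            // Placeholder credentials and image id for the adopting institution.
            AmazonEC2Client ec2 = new AmazonEC2Client(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
            // Launch two additional copies of the pre-configured cTAKES machine image.
            RunInstancesRequest request = new RunInstancesRequest()
                .withImageId("ami-00000000")
                .withInstanceType("m1.large")
                .withMinCount(2)
                .withMaxCount(2);
            ec2.runInstances(request);
        }
    }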

Sharing expertly designed and configured cloud-based NLP systems through cloning, we believe, can radically reduce the technical cost of adoption while allowing complete control and ownership of the system by the adopting institution. Additionally, the trivially small charges incurred when cloning a system (i.e., one-time disk read/write fees) mean that cloning can also be used as a means of performing system maintenance or implementing upgrades; tweaks or upgrades to one instance of an expertly managed, cloud-deployed NLP system can be made available to others simply by re-cloning the entire system.

While cloud technologies hold great potential for reducing the barriers to NLP system adoption, significant security concerns must be addressed before meaningful and widespread adoption can occur. Addressing these security concerns requires action on multiple fronts, including risk management, security system design, testing and monitoring, governance, and third-party auditing. At least as important as the actual security apparatus itself is the perception of security among key local stakeholders. These sociological aspects of security underscore for us the importance of 1) establishing realistic expectations with respect to the timeframe of adoption, and 2) leveraging the prestige and expertise of industry-leading third-party auditing services (e.g., Deloitte Consulting, LLP). Building confidence among all significant local stakeholders based on an accumulated history of positive experience is fundamental to successful adoption.