Incorporating Scientific Judgment into Workflow Systems for Ocean Science

Summer 2007 REU Project

By

Nicholas Hagerty1,2 (REU Intern)

With

Bill Howe1 (Frontline Mentor)

David Maier1,3 (Senior Mentor)

António Baptista1 (Senior Mentor)

1 Center for Coastal Margin Observation and Prediction,

Oregon Health & Science University, Beaverton, OR

2Brown University, Providence, RI

3Department of Computer Science, Portland State University, Portland, OR

Contact:

Bill Howe, Ph.D.

Center for Coastal Margin Observation and Prediction

OGI School of Science and Engineering

Oregon Health and Science University

Mail Code OGI-100

20000 NW Walker Road

Beaverton, OR 97006

Project Abstract

Recent improvements in technology have increased rates of scientific data acquisition, so that scientists are now more often limited by data processing and data management. One existing technology beginning to address these problems is the workflow management system (WMS), a software application that facilitates the modeling and execution of workflows (e.g., scientific research procedures). However, existing scientific WMS support only computational tasks, lacking the ability to pause for and incorporate user input. Yet real-time scientific judgment is crucial to workflows in labs and on research cruises at the Center for Coastal Margin Observation and Prediction (CMOP). Our goal was to create a workflow system that can display task-specific information and incorporate user input. I created and deployed a configurable, dynamic user interface for cruise workflow execution, modeled a cruise workflow in a WMS, and planned methods of generalizing the interface and WMS and linking them together. With further development, we hope this system can improve the efficiency of research across and between diverse fields of ocean science at CMOP.

1. Introduction

In the past few decades, improvements in technology have increased rates of data acquisition across many fields of science. For example, instead of reserving time on a telescope to examine a narrow region of space, astronomers now systematically scan the entire sky, making terabytes of data available for distributed analysis (Szalay et al., 2000). With such improvements in data acquisition capabilities, scientists are now more often limited by data processing and data management. Technologies for more efficient and automated data processing and management are necessary to maintain the current rate of scientific discovery (Deelman & Gil, 2006).

At the Center for Coastal Margin Observation and Prediction (CMOP), two situations illustrate the need for better technologies for data processing and management: (1) fieldwork during research cruises, and (2) processing of environmental samples in the microbiological laboratory. In situation (1), conductivity-temperature-depth (CTD) instruments take measurements as they are lowered and raised in the water, while other onboard instruments continually record variables such as sea surface temperature and velocity. Meanwhile, computational models of oceans and estuaries create new hydrological forecasts daily, generating vast quantities of numerical data. CMOP has a wealth of tools and data products for viewing and comparing these heterogeneous data. However, the sheer volume and variety of data can slow onboard researchers who must review the data to make navigational and procedural decisions. Since these decisions often need to be made very quickly, researchers need straightforward access to a task-specific set of relevant data products to ensure optimal use of their limited time.

In situation (2), environmental sample processing, researchers repeatedly collect tens to hundreds of samples for analysis. Each sample must be processed through a series of detailed protocols while preserving the metadata about the origin and classification of the sample. The complexity of managing these metadata multiplies when several researchers collaborate on a single project. Coordination and management of data about samples and protocols overwhelms traditional lab notebooks, so microbiologists need effective software to automate large-scale data management.

An existing technology that has been developed and studied to address problems of data processing and management is the workflow management system (WMS) (Deelman & Gil, 2006). WMS are software applications that support the modeling and execution of workflows. Workflows are networks of tasks required to complete a particular process (Shields, 2007), and workflow modeling translates these workflows into a formalized, computer-processable format. Potential benefits of workflow modeling include automation of tasks, facilitation of collaboration and repeatability (Gannon et al., 2007; Deelman & Gil, 2006), scalability, accessibility to researchers without programming experience, and flexibility from modularization of tasks (Gil, 2007). All of these benefits can directly address challenges in research at CMOP.

Despite their strengths, existing scientific WMS such as VisTrails (Bavoil et al., 2005), Kepler (Ludäscher et al., 2005), and Taverna (Oinn et al., 2004) exhibit two key limitations. First, scientific WMS do not present separate interfaces for workflow modeling and workflow execution. The full array of workflow modeling options is unnecessarily complex for workflow execution in the hectic and time-constrained environments of labs or research cruises. Second, scientific WMS support only computational tasks (Ludäscher et al., 2005), lacking the native ability to pause a workflow, prompt the user for input, and then incorporate this input into further processing. Cruise and lab workflows necessarily incorporate human input in real time. For example, researchers on cruises often make decisions about where to sail based on qualitative analysis of data, and a computer cannot perform laboratory tasks such as DNA extractions.

Our research objectives were to (1) create a configurable, dynamic user interface for workflow execution that supports user input, (2) create a functional system by first modeling CMOP workflows in an existing WMS and then linking the user interface to the WMS, and (3) validate the system by deploying and evaluating it in the field and the lab. Such a system would adapt to the current workflow task, display task-specific data products, prompt the user for input, and display output. The system would also retain the automation, repeatability, flexibility, and other benefits of existing WMS technology, while hiding the complexity of the WMS during execution. For example, on a research cruise, a workflow task named “Evaluate Cast” might display a map and a depth profile of salinity and ask the user whether the water is stratified. Depending on the answer, it might recommend sailing either upstream or downstream. We predicted that this system could improve the efficiency of research across and between diverse disciplines at an ocean observatory such as CMOP.

2. Related Work

Proponents of workflow modeling claim numerous benefits. To illustrate these benefits, consider two typical workflows at CMOP: (1) on cruises, locating the estuarine turbidity maximum (ETM), and (2) in the lab, analyzing the microbial population of water samples. Potential benefits include the following:

  1. Automation: Allowing a computer to complete tedious tasks previously performed manually (Gannon et al., 2007). Computer automation would more quickly accomplish tasks at CMOP such as analyzing data from a conductivity-temperature-depth (CTD) cast or searching the online genome database BLAST for a DNA sequence.
  2. Collaboration and repeatability: Formal, precise specification of workflows so that research methods can be shared with collaborators within or between labs (Gannon et al., 2007; Deelman & Gil, 2006). CMOP frequently hosts temporary interns and includes researchers collaborating across disciplines and across institutions in multiple states, so ease of sharing information is crucial.
  3. Scalability: Ensuring that similar workflows can be repeated indefinitely with minimal marginal work (Gil, 2007). The sheer number of casts taken and environmental samples collected at CMOP necessitates frugal time investment.
  4. Flexibility: Modularizing tasks and displaying them as discrete units within the visual modeling environment so that the user can easily substitute and rearrange the tasks within a workflow as needed (Gil, 2007). CMOP researchers modify cruise plans and lab protocols frequently in response to new developments.
  5. Accessibility: Providing a method of programming that is easier for users to learn than a general-purpose programming language (Gil, 2007). Many researchers at CMOP have no programming experience and are reluctant to adopt technology with a time-intensive learning curve.
  6. Coordination of heterogeneous tasks: Standardizing the flow of data between tasks so that tasks written in two different languages and running on two different platforms, for example, can interact (Altintas et al., 2003). Many of the existing data tools and web services at CMOP have no obvious mechanisms for interoperability.
  7. Provenance: Ensuring that the source and processing path of information are tracked and comprehensively documented (Deelman & Gil, 2006; Ludäscher et al., 2005). When managing large numbers of samples at CMOP, these metadata are necessary to avoid confusion.

3. Results

3.1 Interface: The Dashboard

The Dashboard (Fig. 1) is a single-screen interactive interface that brings together a small set of task-specific products and tools. Currently it is customized for taking, reviewing, and planning CTD casts, but it can easily be extended or modified with other products to fit other tasks. The variables currently set at the top of the Dashboard PHP script (see Appendix 1) could eventually be submitted instead as XML via POST, giving complete control over the display.
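As a rough illustration of that direction, the Dashboard script might parse such a POSTed configuration as sketched below. This is a sketch only: the XML element and attribute names are invented, and only the variables documented in Appendix 1 are assumed.

    <?php
    // Hypothetical sketch: read an XML configuration from the POST body
    // instead of setting the variables at the top of the script by hand.
    $raw    = file_get_contents('php://input');   // body of the POST request
    $config = simplexml_load_string($raw);        // parse into a SimpleXML tree

    if ($config !== false) {
        $defaultvessel = (string) $config->vessel;
        $defaultcruise = (string) $config->cruise;
        // One <option> element per drop-down entry, e.g.
        // <botleft><option value="..." text="All Casts" default="true"/></botleft>
        $botleft = array();
        foreach ($config->botleft->option as $opt) {
            $botleft[] = array(
                'value'   => (string) $opt['value'],
                'text'    => (string) $opt['text'],
                'default' => ((string) $opt['default'] === 'true'),
            );
        }
    }
    ?>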

The Dashboard is divided into four quadrants: a “command center” in the upper left, contextual information in the upper right, raw data in the lower left, and data products in the lower right. These divisions proved convenient, but they are somewhat artificial; apart from the fixed quadrant sizes, any of the products could be interchanged with the others. All of these products are organized as separate, independent web services, most based on information or similar products that already existed at various stages of development.
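Because each quadrant is just a service URL, the skeleton can fill one generically. The sketch below assumes an iframe-based embedding and an invented function name; it forwards the standard GET variables that every service receives (see Appendix 1).

    <?php
    // Hypothetical sketch: embed one quadrant's web service in the skeleton,
    // forwarding the GET variables the Dashboard passes to every service.
    function render_quadrant($serviceurl, $vessel, $cruise, $time, $skeleton) {
        $src = $serviceurl
             . '?vessel='   . urlencode($vessel)
             . '&cruise='   . urlencode($cruise)
             . '&time='     . urlencode($time)
             . '&skeleton=' . urlencode($skeleton);
        echo '<iframe src="' . htmlspecialchars($src) . '"></iframe>';
    }
    ?>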

Upper left/Command Center: Uses the “commandcenter” web service. Displays basic context information such as the current cruise, vessel, time, location (from the most recent flow-through CT data), and the location, surface salinity, and bottom salinity of the last cast. A drop-down box selects the current vessel and cruise. Here we also expect to place workflow controls, such as “Next Task” buttons and prompts for user decisions.

Upper right/Context: Uses the “context” web service. Using the “View above” drop-down box, one can select from the following:

  • Map: An image dynamically pulled from the Cruise Mapper, displaying bottom salinity from the best available model forecast, and path, casts, and current location of the current cruise.
  • Tides: A tide chart with a form to change the bounding dates, dynamically pulled from the NOAA web site.
  • Animation: Today’s best available forecast of bottom salinity in the Columbia estuary, animated by hour over several days.

Lower left/Data: One can select from the following:

  • Enter/Edit Cast Data: Uses the “dataform” web service. Allows certain information to be entered directly into the cruise.ctdcast and cruise.sample tables of the database. Only available if the user is logged in.
  • All Casts: Uses the “datafactory” web service. Displays time, location, and surface and bottom salinity of all casts on the current cruise.
  • Individual Cast, averaged by meter: Also uses the “datafactory” web service. Displays several variables obtained by the CTD (salinity, turbidity, temperature, oxygen), averaged at each meter in depth, for a selected cast.
  • Individual Cast, all data: Also uses the “datafactory” web service. Displays those several variables at all data points measured by the CTD, for a selected cast.

Lower right/Products: Uses the Product Factory web service written by Bill Howe; most product specifications, however, were written in coordination with this project. The “View below” drop-down box allows selection from the following:

  • Individual Cast Profile: Displays a depth profile of salinity and temperature (on one x-axis) and one other chosen variable (on the other x-axis) for the selected cast.
  • Single-Variable Cast Profile: Displays a depth profile of one chosen variable for the selected cast.

Contrast the task-specific Dashboard with the alternative model of a comprehensive directory of all available products; under the constrained conditions of a research cruise, users cannot waste time browsing for a particular product. The Dashboard was first used on a cruise for collecting samples for REU projects in July 2007, and microbiologists have used it since for reviewing data. The cast-specific Dashboard is expected to remain a principal interface on CMOP cruises in the near future.

3.2 Workflow Model: Cruise Plan in YAWL

Because the initial Dashboard was designed specifically for feature-tracking research cruises, we selected the July 2007 cruise plan as a pilot workflow to model. We selected YAWL (“Yet Another Workflow Language”) as our WMS for its (1) formal specification language, (2) separation of editor and engine interfaces, and (3) purported support for web services (van der Aalst & ter Hofstede, 2005). The resulting workflow model is shown in Fig. 2. Each window is a net, with control starting in the net at the bottom. Each task in that net is a composite task, unfolding to another net of individual tasks. In the case of this cruise plan, first we would take a “fresh” water sample, then attempt to locate the ETM, then take a “salty” water sample. For further details, see the workflow specification file and the YAWL documentation.

4. Next Steps

Now that the interface is complete and a workflow is modeled, they need to be linked together. We envision that as the user steps through a workflow using the interface, the YAWL engine will be updated, and in turn the state of the workflow will update the particular set of products displayed in the interface to fit the task.

To accomplish this, we envision an architecture (Fig. 3) resembling the Model-View-Controller (MVC) software engineering pattern (Sun Microsystems, 2002). The View consists of the Dashboard skeleton as well as the products that fill it. For typical data display, it sends queries and data input to the database (part of the Model), which sends back data to be displayed or rendered. This is the extent of the currently implemented architecture.

We need to design a third major component, the Product Broker, which will act as the Controller. The Product Broker will stand between the View and the Model and will fulfill two major functions. First, it will take user decisions from the Dashboard and translate them into workflow control input compatible with the YAWL engine (which will also be part of the Model). Second, it will query the YAWL engine for the state of the workflow and translate that into the state of the view; that is, the specific products to be displayed in the Dashboard.
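A minimal sketch of the Product Broker's second function follows, assuming the broker can obtain the current task name from the engine. The task names follow the cruise workflow described above, but the mapping table, product URLs, and function name are all invented for illustration.

    <?php
    // Hypothetical sketch: map the workflow's current task (as reported by
    // the YAWL engine) to the products the Dashboard should display.
    $task_products = array(
        'Take Cast'     => array('url3'  => 'dataform.php',
                                 'text3' => 'Enter/Edit Cast Data'),
        'Evaluate Cast' => array('url4'  => 'castprofile.php',
                                 'text4' => 'Individual Cast Profile'),
    );

    function products_for_task($taskname) {
        global $task_products;
        if (isset($task_products[$taskname])) {
            return $task_products[$taskname];   // GET variables for the Dashboard
        }
        return array();                         // unknown task: default display
    }
    ?>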

At this stage of its development, YAWL supports only Java-based web services (van der Aalst & ter Hofstede, 2005). Since the Dashboard is written in PHP and most applications at CMOP are written in PHP or Python, a Java-to-PHP link will also need to be written. We envision placing this link between the Controller and the Model, at the interface to the YAWL engine.
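One plausible shape for that link is a thin Java servlet that wraps the engine and exposes it over HTTP, which PHP can then reach with an ordinary request. In the sketch below the bridge URL, action names, and parameters are all assumptions; nothing at this address exists yet.

    <?php
    // Hypothetical sketch of the PHP side of the Java-to-PHP link, assuming
    // a small Java servlet bridge forwarding requests to the YAWL engine.
    function yawl_bridge($action, $params) {
        $query = 'action=' . urlencode($action);
        foreach ($params as $k => $v) {
            $query .= '&' . urlencode($k) . '=' . urlencode($v);
        }
        $url = 'http://localhost:8080/yawlbridge?' . $query;
        return file_get_contents($url);   // e.g. the current task, as XML
    }

    // Example: report a user decision back to the workflow.
    // yawl_bridge('completeTask', array('task' => 'Evaluate Cast',
    //                                   'stratified' => 'yes'));
    ?>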

5. Conclusion

After constructing this system, we will look to expand our workflow repertoire. We will model other workflows on research cruises and workflows in the microbiological laboratory. Our eventual goal is to make this workflow system so ubiquitous and accessible that (1) researchers will be able to model new workflows with minimal effort beyond the current technique of writing plans or protocols on paper, and (2) researchers will follow workflows in the field or the lab on a handheld computer, entering data and decisions in real time, partially or completely replacing the lab notebook.

Workflow systems incorporating scientific judgment and other human input have great potential for improving the efficiency of research at CMOP and similar interdisciplinary centers. We have created the interface necessary for such a system and demonstrated its utility as a free-standing application. We have selected a WMS, modeled a pilot workflow, and described the next steps required to make the entire system functional. We believe workflow systems will be an integral component of cyberinfrastructure at CMOP in the near future.

Appendix 1. Technical Details for Development of the Dashboard

At the top of the Dashboard PHP script there is a variable-setting area (sketched in full after the list below). The variables are:

$thetime: the current time; use mktime() to emulate a different time.

$thispath: URL path to this node page, relative to the server.

$defaultvessel and $defaultcruise: the vessel and cruise displayed by default.

$defaulturl[1]: URL for web service to be displayed in the top-left (currently the CommandCenter)

$botleft[n]['x'], $topright[n]['x'], $botright[n]['x']: variables describing the options for each drop-down box.

  • [n] is any number; a unique identifier for each option.
  • ['value'] (required) is the full URL for the web service.
  • ['text'] (required) is the name for that option, displayed in the drop-down box.
  • ['default'] (optional; only one per quadrant) when set to true sets the default option.
  • ['login'] (optional) when set to true requires the user to be logged in to use the option.
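Putting these together, the variable-setting area looks roughly like the sketch below. The structure follows the variables documented above, but the specific URLs, paths, and option text are illustrative, not the deployed values.

    <?php
    // Illustrative sketch of the variable-setting area; actual URLs and
    // option text in the deployed script may differ.
    $thetime       = time();                  // or mktime(...) to emulate another time
    $thispath      = '/dashboard/index.php';  // hypothetical path to this node page
    $defaultvessel = 'examplevessel';
    $defaultcruise = 'july2007';

    $defaulturl[1] = 'commandcenter.php';     // top-left web service

    $botleft[1]['value']   = 'dataform.php';
    $botleft[1]['text']    = 'Enter/Edit Cast Data';
    $botleft[1]['login']   = true;            // requires login
    $botleft[2]['value']   = 'datafactory.php?view=allcasts';
    $botleft[2]['text']    = 'All Casts';
    $botleft[2]['default'] = true;            // shown by default
    ?>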

The web services automatically receive the following GET variables (a minimal consumer is sketched after the list):

  • $_GET['vessel']
  • $_GET['cruise']
  • $_GET['time']: equal to $thetime
  • $_GET['skeleton']: equal to $thispath; the URL path to the node page.
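A web service therefore needs no configuration of its own; its opening lines might look like the sketch below (hypothetical, but using only the GET variables listed above).

    <?php
    // Hypothetical opening of a product web service: pick up the GET
    // variables that the Dashboard skeleton forwards automatically.
    $vessel   = isset($_GET['vessel'])   ? $_GET['vessel']     : '';
    $cruise   = isset($_GET['cruise'])   ? $_GET['cruise']     : '';
    $time     = isset($_GET['time'])     ? (int) $_GET['time'] : time();
    $skeleton = isset($_GET['skeleton']) ? $_GET['skeleton']   : '';
    // ... query the database for this vessel and cruise, then render the product ...
    ?>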

To dynamically generate a customized Dashboard, use the following GET variables:

  • vessel
  • cruise
  • url1 (of web service to be displayed in top left)
  • url2 (top right) and text2 (name of web service in drop-down box)
  • url3 and text3 (bottom left)
  • url4 and text4 (bottom right)

The url1–url4 parameters are either full URLs or paths relative to this directory:
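As a usage example, a customized Dashboard might be requested with a URL of the following shape (the script name, vessel, cruise, and service URLs are all hypothetical, and full service URLs are used to sidestep the relative-path base):

    dashboard.php?vessel=examplevessel&cruise=july2007&url2=http://example.cmop.org/ws/context.php&text2=Map

Parameters left out presumably fall back to the defaults set at the top of the script.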