LSST Data Management Middleware Design LDM-152 7/25/2011

Large Synoptic Survey Telescope (LSST)

Data Management Middleware Design

Kian-Tat Lim, Ray Plante, and Gregory Dubois-Felsmann

LDM-152

July 25, 2011

The contents of this document are subject to configuration control and may not be changed, altered, or their provisions waived without prior approval of the LSST Change Control Board.


Change Record

Version / Date / Description / Owner name
1 / 7/25/2011 / Initial version based on pre-existing UML models and presentations / Kian-Tat Lim

Table of Contents

Change Record

1 Executive Summary

2 Introduction

3 02C.06.02.01 Database and File Access Services

3.1 Data Access Framework I/O Layer

4 02C.07.01.01 Control and Management Services

4.1 Event Subsystem

4.2 Orchestration

5 02C.07.01.02 Pipeline Construction Toolkit

5.1 Policy Framework

5.2 Pipeline Harness

6 02C.07.01.03 Pipeline Execution Services

6.1 Logging Subsystem

7 02C.07.01.06 System Administration and Operations Services

7.1 Data Management Control System

8 02C.07.01.07 File System Services

8.1 Data Access Framework Replication Layer



The LSST Data Management Middleware Design

1  Executive Summary

The LSST middleware is designed to isolate scientific applications, including the Alert Production, Data Release Production, Calibration Products Production, and Level 3 processing, from details of the underlying hardware and system software. It enables flexible reuse of the same code in multiple environments ranging from offline laptops to shared-memory multiprocessors to grid-accessed clusters, with a common communication and logging model. It ensures that key scientific and deployment parameters controlling execution can be easily modified without changing code but also with full provenance to understand what environment and parameters were used to produce any dataset. It provides flexible, high-performance, low-overhead persistence and retrieval of datasets with data repositories and formats selected by external parameters rather than hard-coding. Middleware services enable efficient, managed replication of data over both wide area networks and local area networks.

2  Introduction

This document describes the baseline design of the LSST middleware, including the following elements of the Data Management (DM) Construction Work Breakdown Structure (WBS):

·  02C.06.02.01 Database and File Access Services

·  02C.07.01.01 Control and Management Services

·  02C.07.01.02 Pipeline Construction Toolkit

·  02C.07.01.03 Pipeline Execution Services

·  02C.07.01.06 System Administration and Operations Services

·  02C.07.01.07 File System Services

The LSST database design, which contributes to WBS element 02C.06.02.01, may be found in the document entitled “Data Management Database Design” (LDM-135).

Common to all aspects of the middleware design is an emphasis on flexibility through the use of abstract, pluggable interfaces controlled by managed, user-modifiable parameters. In addition, the substantial computational and bandwidth requirements of the LSST Data Management System (DMS) force the designs to be conscious of performance, scalability, and fault tolerance. In most cases, the middleware does not require advances over the state of the art; instead, it requires abstraction to allow for future technological change and aggregation of tools to provide the necessary features.

3  02C.06.02.01 Database and File Access Services

This WBS element contains the I/O layer of the Data Access Framework (DAF).

3.1  Data Access Framework I/O Layer

This layer provides access to local resources (within a data access center, for example) and low-performance access to remote resources. These resources may include images, non-image files, and databases. Bulk data transfers over the wide-area network (WAN) and high-performance access to remote resources are provided by the replication layer within 02C.07.01.07 File System Services.

3.1.1  Key Requirements

The DAF I/O layer must provide persistence and retrieval capabilities to application code. Persistence is the mechanism by which application objects are written to files in some format, to a database, or to a combination of both; retrieval is the mechanism by which data in files, a database, or a combination of both is made available to application code in the form of an application object. Persistence and retrieval must be low-overhead, allowing efficient use of available bandwidth. The interface to the I/O layer must be convenient for application developers to use. It must also be flexible, allowing the file format, or even the choice of whether a given object is stored in a file or in the database, to be selected at runtime in a controlled manner. Image data must be able to be stored in standard FITS format, although the metadata for the image may be in either FITS headers or database table entries.

3.1.2  Baseline Design

The I/O layer is designed to provide access to datasets. A dataset is a logical grouping of data that is persisted or retrieved as a unit, typically corresponding to a single programming object or a collection of objects. Dataset types are predefined; each dataset is identified by a unique identifier and may be persisted into multiple formats.

A Formatter subclass is responsible for converting the in-memory version of an object to its persisted form (or forms), represented by a Storage subclass, and vice versa. The Storage interface may be thin (e.g. providing a file's pathname) or thick (e.g. providing an abstract database interface) depending on the complexity of the persisted format; all Formatters using a Storage are required to understand its interface, but no application code need do so. One Storage will represent the publish/subscribe interface used by the camera data acquisition system to deliver image data. Formatters and Storages are looked up by name at runtime, so they are fully pluggable. Formatters may make use of existing I/O libraries such as cfitsio, in which case they are generally thin wrappers. Formatters are configured by Policies.

All persistence and retrieval is performed under the control of a Persistence object. This object is responsible for interpreting the overall persistence Policy, managing the lookups and invocations of Formatters and Storages, and ensuring that any transaction/rollback handling is done correctly.
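The Formatter/Storage/Persistence relationships above can be sketched in Python. The class and method names here are illustrative assumptions for exposition, not the actual LSST API; in particular, the registry and the trivial `FitsStorage` stand in for the runtime name lookup and the real Storage subclasses.

```python
# Sketch of the Formatter/Storage/Persistence pattern; names are hypothetical.
from abc import ABC, abstractmethod


class Storage(ABC):
    """Represents one persisted form of a dataset."""


class FitsStorage(Storage):
    """A 'thin' Storage: little more than a file pathname."""

    def __init__(self, path):
        self.path = path


class Formatter(ABC):
    """Converts between an in-memory object and its persisted form."""

    @abstractmethod
    def write(self, obj, storage): ...

    @abstractmethod
    def read(self, storage): ...


# Formatters are looked up by name at runtime, making them fully pluggable.
_FORMATTERS = {}


def register_formatter(name, formatter):
    _FORMATTERS[name] = formatter


class Persistence:
    """Controls all persistence/retrieval: looks up and invokes Formatters."""

    def persist(self, obj, formatter_name, storage):
        _FORMATTERS[formatter_name].write(obj, storage)

    def retrieve(self, formatter_name, storage):
        return _FORMATTERS[formatter_name].read(storage)
```

Because application code talks only to `Persistence`, a Formatter or Storage can be swapped (for example, from a FITS file to a database table) without any change to the application.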

3.1.3  Alternatives Considered

Use of a full-fledged object-relational mapping system for output to a database was considered but determined to be too heavyweight and intrusive.

3.1.4  Prototype Implementation

A C++ implementation of the design was created for Data Challenge 2 (DC2) and has evolved since then. Formatters for images and exposures, sources and objects, and PSFs have been created. Datasets are identified by URLs. Storage classes include FITS[1] files, Boost::serialization[2] streams (native and XML), and the MySQL[3] database (via direct API calls or via an intermediate, higher-performance, bulk-loaded tab-separated value file). The camera interface has not yet been prototyped.

This implementation has been extended in DC3 to include a Python-based version of the same design that uses the C++ implementation internally. In the Python version, a Data Butler plays the role of the Persistence object. It takes dataset identifiers that are composed of key/value pairs, with the ability to infer missing values as long as those provided are unique. An internal Mapper class uses a Policy to control the format and location for each dataset. A Python-only Storage class has been added to allow persistence via the Python "pickle" mechanism.
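The Data Butler interaction described above can be illustrated with a minimal sketch. The class names echo the prototype's roles, but the methods, the path template syntax, and the in-memory store are assumptions made for this example, not the prototype's actual interface.

```python
# Minimal sketch of the Data Butler / Mapper roles; API is hypothetical.
class Mapper:
    """Uses a Policy-like template to map dataset identifiers to locations."""

    def __init__(self, templates):
        # e.g. {"calexp": "calexp/v{visit}/s{sensor}.fits"}
        self.templates = templates

    def map(self, dataset_type, data_id):
        # data_id is a set of key/value pairs identifying the dataset
        return self.templates[dataset_type].format(**data_id)


class Butler:
    """Plays the role of the Persistence object in the Python design."""

    def __init__(self, mapper, store=None):
        self.mapper = mapper
        self.store = store if store is not None else {}

    def put(self, obj, dataset_type, **data_id):
        self.store[self.mapper.map(dataset_type, data_id)] = obj

    def get(self, dataset_type, **data_id):
        return self.store[self.mapper.map(dataset_type, data_id)]
```

A caller would then write, for example, `butler.get("calexp", visit=123, sensor=4)`; the Mapper, not the caller, decides where and in what format that dataset lives.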

4  02C.07.01.01 Control and Management Services

4.1  Event Subsystem

The event subsystem is used to communicate among components of the DM System, including between pipelines in a production. A monitoring component of the subsystem can execute rules based on patterns of events, enabling fault detection and recovery.

4.1.1  Key Requirements

The event subsystem must reliably transfer events from a source to multiple destinations. There must be no single point of failure. The subsystem must be scalable to handle high volumes of messages, up to tens of thousands per second. It must interface to languages including Python and C++.

A monitoring component must be able to detect the absence of messages within a given time window and the presence of messages (such as logged exceptions) defined by a pattern.

4.1.2  Baseline Design

The subsystem will be built as a wrapper over a reliable messaging system such as Apache ActiveMQ[4]. Event subclasses and standardized metadata will be defined in C++ and wrapped using SWIG[5] to make them accessible from Python. Events will be published to a topic; multiple receivers may subscribe to that topic to receive copies of the events.

The event monitor subscribes to topics that indicate faults or other system status. It can match templates to events, including boolean expressions and time expressions applied to event data and metadata.
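The topic-based publish/subscribe model and the monitor's template matching can be sketched in-process as follows. This is a simplification for exposition: the real design wraps a reliable broker such as ActiveMQ, and events are C++ objects rather than dictionaries.

```python
# In-process sketch of topic publish/subscribe and monitor template matching.
from collections import defaultdict


class EventBroker:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        # Every subscriber to the topic receives a copy of the event.
        for callback in self.subscribers[topic]:
            callback(dict(event))


def matches(template, event):
    """Monitor-style template: a boolean predicate over event metadata."""
    return all(event.get(key) == value for key, value in template.items())
```

An event monitor would subscribe to a fault-related topic and apply templates such as `{"level": "FATAL"}` to each incoming event, raising an alarm on a match or on the absence of expected events within a time window.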

Observatory Control System (OCS) messages destined for the DM System will be translated into DM Event Subsystem events by dedicated software and published to appropriate topics.

4.1.3  Prototype Implementation

An implementation of the event subsystem on Apache ActiveMQ was created for DC2 and has evolved since then. Command, Log, Monitor, PipelineLog, and Status event types have been defined. Event receivers include pipeline components, orchestration components, the event monitor, and a logger that inserts entries into a database.

The event monitor has been prototyped in Java. The OCS message translator has not yet been prototyped.

4.2  Orchestration

The orchestration layer is responsible for deploying pipelines and Policies onto nodes, ensuring that their input data is staged appropriately, distributing dataset identifiers to be processed, recording provenance, and actually starting pipeline execution.

4.2.1  Key Requirements

The orchestration layer must be able to deploy pipelines and their associated configuration Policies onto one or more nodes in a cluster. Different pipelines may be deployed to different, although possibly overlapping, subsets of nodes. All four pipeline execution models (see section 5.2.1) must be supported. Sufficient provenance information must be captured to ensure that datasets can be reproduced from their inputs.

The orchestration layer at the Base Center works with the DM Control System (DMCS, see section 7.1) at that Center to accept commands from the OCS to enter various system modes such as Nightly Observing or Daytime Calibration. The DMCS invokes the orchestration layer to configure and deploy the Alert Production pipelines accordingly. At the Archive Center, the orchestration layer manages execution of the Data Release Production, including sequencing scans through the raw images in spatial and temporal order.

Orchestration must detect failures, categorize them into permanent and possibly-transient, and restart transiently-failed processing according to the appropriate fault tolerance strategy.

4.2.2  Baseline Design

The design for the orchestration layer is a pluggable, Policy-controlled framework. Plug-in modules are used to configure and deploy pipelines on a variety of underlying process management technologies (such as simple ssh[6] or more complex Condor-G[7] glide-ins); this flexibility is necessary during design and development, when hardware is typically borrowed rather than owned. Additional modules capture hardware, software, and configuration provenance, including information about the execution nodes, the versions of all software packages, and the values of all configuration parameters for both middleware and applications.

This layer monitors the availability of datasets and can trigger the execution of pipelines when their inputs become available. It can hand out datasets to pipelines based on the history of execution and the availability of locally-cached datasets to minimize data movement.

Faults are detected by the pipeline harness and event monitor timeouts. Orchestration then reprocesses transiently-failed datasets on already-deployed pipelines or else deploys a new pipeline instance for the reprocessing.
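The categorize-and-retry behavior described above can be sketched as a simple control loop. The categorization predicate and retry limit are illustrative assumptions; the actual strategy is Policy-controlled and may instead deploy a fresh pipeline instance for the reprocessing.

```python
# Sketch of fault categorization and retry for transiently failed datasets.
def run_with_retries(process, dataset_id, is_transient, max_retries=3):
    """Re-run a transiently failed dataset; give up on permanent failures."""
    attempts = 0
    while True:
        try:
            return process(dataset_id)
        except Exception as exc:
            attempts += 1
            if not is_transient(exc) or attempts > max_retries:
                raise  # permanent failure, or retries exhausted
```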

4.2.3  Prototype Implementation

A prototype implementation of the deployment framework was developed for DC3a. It was extended to handle Condor-G, and data dependency features were added for DC3b. Full fault tolerance has not yet been prototyped, although a limited application of a fault tolerance strategy has been demonstrated. Provenance is recorded in files and, to a limited extent, in a database. The file-based provenance has been demonstrated to be sufficient to regenerate datasets.

5  02C.07.01.02 Pipeline Construction Toolkit

5.1  Policy Framework

The Policy component of the Pipeline Framework is of key importance throughout the LSST middleware.

Policies are a mechanism to specify parameters for applications and middleware in a consistent, managed way. The use of Policies facilitates runtime reconfiguration of the entire system while still ensuring consistency and the maintenance of traceable provenance.

5.1.1  Key Requirements

Policies must be able to contain parameters of various types, including at least strings, booleans, integers, and floating-point numbers. Ordered lists of each of these must also be supported. Each parameter must have a name. A hierarchical organization of names is required so that all parameters associated with a given component may be named and accessed as a group.

There must be a facility to specify legal and required parameters and their types and to use this information to ensure that invalid parameters are detected before code attempts to use them. Default values for parameters must be able to be specified; it must also be possible to override those default values, potentially multiple times (with the last override controlling).

Policies and their parameters must be stored in a user-modifiable form. It is preferable for this form to be textual so that it is human-readable and modifiable using an ordinary text editor.

It must be possible to save sufficient information about a Policy to obtain the value of any of its parameters as seen by the application code.

5.1.2  Baseline Design

The design follows straightforwardly from the requirements.

Policies are specified by a text file containing hierarchically-organized name/value pairs. A value may be another Policy (referred to as a sub-Policy). A value may also be a list of values (all of the same type). Policies may reference other Policies to set values for sub-Policies.

A Dictionary, which is also a Policy, specifies the legal parameter names, their types, minimum and maximum lengths for list values, and whether a parameter is required. Since Dictionaries are Policies, they may use Policy references to incorporate other dictionaries to validate sub-Policies.

Each piece of application code (routine or object) using a Policy will typically have an associated Dictionary to validate the Policy parameters and provide default values. If a higher-level routine invokes a lower-level Policy-using routine, it passes in the appropriate Policy, which may be a sub-Policy of the higher-level routine's Policy. The higher-level routine may also provide defaults, adding to or overriding the Dictionary defaults.
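The hierarchical naming, default-override, and Dictionary-validation behavior described above can be sketched as follows. The class API and the Dictionary representation here are assumptions for illustration; the prototype's textual Policy syntax and validation interface differ in detail.

```python
# Sketch of hierarchical Policies with defaults and Dictionary validation.
class Policy:
    def __init__(self, data=None):
        self.data = dict(data or {})

    def get(self, name):
        # Hierarchical names: "harness.numSlices" walks into sub-Policies.
        node = self.data
        for part in name.split("."):
            node = node[part]
        return node

    def merge_defaults(self, defaults):
        # Defaults apply only where no value was set; overrides win.
        for key, value in defaults.items():
            if key not in self.data:
                self.data[key] = value
            elif isinstance(value, dict) and isinstance(self.data[key], dict):
                sub = Policy(self.data[key])
                sub.merge_defaults(value)
                self.data[key] = sub.data


def validate(policy, dictionary):
    """Dictionary maps parameter name -> (type, required); returns errors."""
    errors = []
    for name, (expected_type, required) in dictionary.items():
        try:
            value = policy.get(name)
        except KeyError:
            if required:
                errors.append(f"missing required parameter: {name}")
            continue
        if not isinstance(value, expected_type):
            errors.append(f"wrong type for {name}")
    return errors
```

Running `validate` before any application code reads the Policy ensures that invalid or missing parameters are detected early, as the requirements specify.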