ESG ROADMAP
Questions we hope to address:
- How can we work together?
- How does the work we are focused on fit into existing systems (e.g., security, metadata, data delivery)?
- How do we interoperate with independent systems?
Serving our Community:
- CMIP3 IPCC AR4 and CMIP4 IPCC AR5
- CCSM
- CCES
- PCM
- POP
- NARCCAP
- C-LAMP
- CFMIP
Preliminary requirements and architecture development with European partners
Characterization of Users
- WG1 -- Scientists
- WG2 -- Impacts
- WG3 -- Mitigation
Characterization of CMIP3 IPCC AR4 Data
- Gridded files, aggregations
- Datasets
- Accessible through netCDF API
- CF 1.0 compliant
Characteristics of the Data:
- AR4/AR5 format and convention: netCDF and CF
- We cannot assume all data will follow the CF convention: legacy CCSM and PCM data do not. So leverage the CF convention when possible, but build the metadata model and services to be independent of CF (see the metadata sketch after this list)
- Very large data (high resolution regional data)
- Gridded data
- Observation data
- Station data
- Satellite data
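To make the CF-independence point concrete, here is a minimal sketch (not project code) of metadata extraction that prefers CF attributes when they exist but degrades gracefully for legacy non-CF files. It assumes the netCDF4 Python library, and the file name is hypothetical:

    # Sketch: extract catalog metadata from a netCDF file, using CF
    # attributes when present but not requiring them.
    from netCDF4 import Dataset

    def extract_metadata(path):
        records = []
        with Dataset(path) as nc:
            for name, var in nc.variables.items():
                records.append({
                    "variable": name,
                    # Prefer the CF standard_name; legacy CCSM/PCM files
                    # may lack it, so fall back to long_name or the bare
                    # variable name.
                    "standard_name": getattr(var, "standard_name",
                                             getattr(var, "long_name", name)),
                    "units": getattr(var, "units", "unknown"),
                    "shape": var.shape,
                })
        return records

    print(extract_metadata("tas_A1.nc"))  # hypothetical file name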
Functional specification and Architecture Design
- In order to meet the needs of the climate community, the ESG-CET architecture must allow a large number of distributed sites, with varying capabilities, to federate, cooperate, or work as standalone entities. We must therefore support multiple portals and services, API-based access, and a range of data delivery mechanisms. To accomplish this, the ESG-CET architecture is based on three tiers of data services, as outlined below.
Tiered ESG-CET Architecture
- Tier 1 comprises a set of Global ESG Services (partially exposed via a Global ESG Portal) that provide shared functionality across the overall ESG-CET federation. These services include 1) user registration and management, 2) common security services for access control to the federated resources, 3) metadata services for describing and searching the massive data holdings, 4) a common set of notification services and a registry, and 5) global monitoring services. All ESG-CET sites share a common user database, so a user need register only once and maintain a single set of credentials to access resources across the whole system. (Access to specific data collections and related resources, such as IPCC data, may still have to be approved by the data's "owners".) The definition of global metadata services for search and discovery guarantees a user's ability to find data of interest throughout the whole federation, independently of the site at which a search begins (a federated-search sketch follows this list).
- Tier 2 comprises a limited number of ESG Data Gateways that act as data-request brokers, possibly providing enhanced functionality and/or documentation for specific user communities, while supplying much-needed fault tolerance to the overall system. Services deployed on a Gateway include the user interface for searching and browsing metadata, for requesting data products (including analysis and visualization), and for orchestrating complex workflows. We expect the software stack deployed at the Gateway level to require considerable expertise to install and maintain, so we envision these Gateways being operated directly by ESG-CET engineering staff. Initially, three ESG Gateways are planned: one at LLNL focused on IPCC AR5 needs, one at ORNL to support the Computational Climate End Station project, and one at NCAR to serve the CCSM and PCM model communities (and possibly others).
- Tier 3 comprises the actual data holdings and the back-end data services used to access them, which reside on a (potentially large) number of federated ESG Nodes. Tier 3 resources typically host the data and metadata services needed to publish data onto ESG and to execute data-product requests formulated through an ESG Gateway. Because ESG Nodes are operated by researchers and engineers at local institutions with varying levels of expertise, the software stack deployed at an ESG Node is kept to a minimum and supplied with detailed documentation. A single ESG Gateway serves data requests against many associated ESG Nodes: for example, as part of the next IPCC project, more than 20 institutions are expected to set up ESG Data Nodes (a minimal publishing sketch follows below).
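As an illustration of the Tier 1 search contract (the service interfaces are still being defined, so the endpoint and parameter names below are hypothetical placeholders, not a published ESG API), a federation-wide query might look like:

    # Sketch: query the global ESG metadata search service.
    # Endpoint and parameter names are hypothetical.
    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    GLOBAL_SEARCH = "https://esg.example.org/search"  # hypothetical

    def search(experiment, variable):
        query = urlencode({"experiment": experiment, "variable": variable})
        with urlopen(f"{GLOBAL_SEARCH}?{query}") as resp:
            return json.load(resp)

    # The same query returns matching datasets regardless of which
    # Gateway or Node actually holds them.
    for hit in search("20c3m", "tas"):
        print(hit["id"], hit["gateway"])

On the Tier 3 side, the publishing step amounts to exposing a THREDDS catalog that a Gateway can harvest. A rough sketch, with a hypothetical dataset name and path (a real catalog carries far more metadata):

    # Sketch: write a minimal THREDDS InvCatalog 1.0 entry for one
    # dataset so a Gateway can harvest it.
    import xml.etree.ElementTree as ET

    TDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"

    def write_catalog(name, url_path, out="catalog.xml"):
        ET.register_namespace("", TDS_NS)
        catalog = ET.Element(f"{{{TDS_NS}}}catalog")
        ET.SubElement(catalog, f"{{{TDS_NS}}}dataset",
                      name=name, urlPath=url_path)
        ET.ElementTree(catalog).write(out, xml_declaration=True,
                                      encoding="utf-8")

    write_catalog("CCSM3 20c3m monthly tas", "ccsm3/20c3m/tas_A1.nc")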
ESG will host approximately 30 to 35 models from over 30 modeling centers
AR5 Requirements Timeline
- 9/07: Primary set of benchmark experiments defined (WGCM meeting). Preliminary set of standard output defined (CMIP panel)
- 3/08: Standard output finalized (CMIP panel); new version of CMOR released, accommodating non-rectangular grids
- 9/08: All required experiments defined (WGCM meeting)
- Late-2008: modeling groups begin running benchmark experiments (e.g., control, 1%/year CO2 increase)
- 2009: modeling groups run models and produce output
- 1/10: model output starts to become available to the community
- 9/10: first meeting of WG1 AR5 lead authors
- 3/11: research paper drafts completed and made available to WG1 AR5 authors
- 9/11: First order draft of WG1 AR5 completed
- 8/12: deadline for acceptance of papers cited by IPCC
- 6/13: release of WG1 AR5
Location of significant instruments, facilities, or other assets
- Platform(s) required (*nix?)
- Online storage for the most accessed data sets
- Tertiary storage for very large, less commonly accessed datasets
- Online cache for staging from tertiary storage
ESG’s testbed deployment (2008-2009 time frame) is expected to federate the following sites:
- PCMDI, LLNL (gateway, node)
- NCAR (gateway, node)
- ORNL (gateway, node)
- LANL (node)
- GFDL, Princeton, NJ (node)
- BADC, UK (partner)
- DKRZ, Germany (partner)
- University of Tokyo Center for Climate System Research, Japan (node)
The full next-generation ESG federation is likely to add one or more sites in the following countries to the testbed sites above:
- China
- Norway
- Canada
- France
- Australia
- Korea
- Italy
- Russia
Component Design for Federation
- Phase 1 -- the minimum set of requirements for the testbed to operate
- Security: Authorization and Authentication (see the access-control sketch after this list)
- User enrollment and management
- Metadata management
- Publication
- Data delivery (FTP, GridFTP, etc.)
- Portal: search, browse
- Logging and metrics
- Phase 2 -- second set of requirements, not needed for the initial testbed
- Monitoring
- Data product, remote computing services
- Client-side access
- Notification service
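To make the Phase 1 security items concrete, here is a deliberately minimal sketch of the intended model (all names and data structures are hypothetical illustrations, not the ESG implementation): one federation-wide registry for single registration, plus per-collection owner approval as described for Tier 1.

    # Sketch: access control against a shared user registry.
    RESTRICTED = {"ipcc_ar5": {"alice"}}   # collection -> approved users

    def authenticate(registry, username, token):
        # One federation-wide registry: register once, sign on anywhere.
        return registry.get(username) == token

    def authorize(username, collection):
        # Open collections are world-readable; restricted ones (e.g.,
        # IPCC data) need approval from the data's owners.
        approved = RESTRICTED.get(collection)
        return approved is None or username in approved

    registry = {"alice": "secret-token"}   # hypothetical shared user DB
    assert authenticate(registry, "alice", "secret-token")
    assert authorize("alice", "ipcc_ar5")
    assert authorize("bob", "cmip3_open")
    assert not authorize("bob", "ipcc_ar5")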
Defining and implementing the components and APIs (perhaps brokers) for interoperability with European partners
- Interaction sequences
ESG-CET ROADMAP
Deadline: October 2008 for IPCC AR5
The roadmap is based on quarterly releases of a deployable Gateway source code base. Each release includes several modules of functionality, fully tested and with a near-final user interface
- December 31st, 2007: Standalone Gateway with basic functionality
- User management
- Authentication (standalone)
- Gateway Branding
- Catalog Browsing (datasets and files)
- Files Download: files local to Gateway disk
- Data Publishing: harvesting from existing THREDDS catalogs (see the harvesting sketch after this list)
- Semantic Search: prototype based on metadata harvested from THREDDS catalogs
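A minimal sketch of the harvesting step, assuming the standard THREDDS InvCatalog 1.0 XML schema (the catalog URL is a hypothetical placeholder); this is the read-side counterpart of the publishing sketch above:

    # Sketch: harvest dataset entries from an existing THREDDS catalog.
    import xml.etree.ElementTree as ET
    from urllib.request import urlopen

    TDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"

    def harvest(catalog_url):
        with urlopen(catalog_url) as resp:
            root = ET.parse(resp).getroot()
        # Collect every <dataset> element that points at actual data.
        return [(ds.get("name"), ds.get("urlPath"))
                for ds in root.iter(f"{{{TDS_NS}}}dataset")
                if ds.get("urlPath")]

    for name, path in harvest("https://node.example.org/thredds/catalog.xml"):
        print(name, path)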
- March 31st, 2008: Standalone Gateway with extended functionality
- Files Download: files on deep storage (SRM with web services)
- Data Cart and Data Request Monitoring interface
- Data Publishing: files local to Gateway (including on deep storage)
- Detailed metadata display
- Semantic Search: use full metadata from domain model
- Authorization Services
- June 30th, 2008: Data Node integration
- Integration with LAS, TDS
- User Interface
- Submission of data requests (visualization and subsetting) from UI to LAS, TDS (see the subsetting sketch after this list)
- Database-driven configuration
- Authorization
- Metrics: collection from Gateway and Data Node
- Data Publishing: data on remote Data Node
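One concrete possibility for the subsetting path (the dataset URL is hypothetical, and this assumes a netCDF4 build with OPeNDAP support): the Gateway resolves a UI request into an OPeNDAP URL served by the TDS on the Data Node, so only the requested slice crosses the network.

    # Sketch: server-side subsetting through a TDS OPeNDAP endpoint.
    from netCDF4 import Dataset

    URL = "https://node.example.org/thredds/dodsC/ccsm3/20c3m/tas_A1.nc"
    with Dataset(URL) as nc:            # opened remotely via OPeNDAP
        tas = nc.variables["tas"]
        # Slice indices are illustrative: first year, regional window.
        subset = tas[0:12, 40:60, 80:120]
        print(subset.shape)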
- September 30th, 2008: Gateways Federation
- Single Sign On among Gateways
- Exchange/Share Metadata
- Metrics: aggregation among Gateways
- Monitoring
- Data Publishing: migration of legacy data from current operational systems
- Client access (scripted-transfer sketch after this list):
- FTP
- GridFTP
- Wget
- DML
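The Wget and DML options both amount to scripted bulk transfer of a data cart; a minimal sketch of the same pattern (the URLs are hypothetical placeholders, and a real transfer would also present the user's ESG credentials):

    # Sketch: scripted bulk download of a data cart.
    import shutil
    from urllib.request import urlopen

    cart = [
        "https://gateway.example.org/download/tas_A1.nc",
        "https://gateway.example.org/download/pr_A1.nc",
    ]

    for url in cart:
        filename = url.rsplit("/", 1)[-1]
        with urlopen(url) as resp, open(filename, "wb") as out:
            shutil.copyfileobj(resp, out)   # stream to disk
        print("fetched", filename)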
Left for 2009:
- OPeNDAP-G integration
- Server-Side processing