CZO national data management meeting 1 May 19-20 2010, Boulder CO

CZO Data Management Meeting Minutes

May 19th and 20th, 2010

Boulder, CO

Mark Williams and Hope Humphries

Highlights of the meeting

• We seem to be on the same page with our overall approach: (a) sites manage their own data; (b) sites display data in ASCII format with agreed-upon metadata; (c) all sites use the same controlled (preferred) vocabulary and metadata format; (d) data is harvested by the San Diego Supercomputer Center, which wraps WaterML around the data (site metadata is consistent with WaterML) and then ingests it into CUAHSI’s WaterOneFlow Web Services; (e) data at this point should be compatible with EarthChem, LTER EML, etc.; (f) CZO Desktop is in development to provide data analysis tools.

• Given that, we have thorny issues to work out: (a) exactly what the metadata will look like; (b) developing a controlled vocabulary; (c) streaming data versus QA/QC data; (d) a unique identifier, etc. New working groups were started to address some of these issues. Note also that Kerstin, from lessons learned in developing EarthChem, strongly urged us to adopt a controlled vocabulary.

• New working groups formed

◦ CZO data interoperability group (CZODIG): get different databases talking to each other. Members: Kerstin, David T., Anthony, Charlie, Brian, Ilya, and Mark. Also need someone from Luquillo.

◦ Web development group: Roger, Mark, Gary, David L. (other volunteers?)

· Semi-distributed system. Most sites prefer a central server that hosts the individual sites’ web pages, along with the national web site. Southern Sierra CZO prefers to host its own web site and have products harvested by the central server. That configuration appears doable. Where the central server resides is under discussion.

◦ Unique identifier: Kerstin, David T., Mark W. (others?)

· There seemed to be a consensus to use the International Geo Sample Number (IGSN), which is a 9-character alphanumeric string, the first 3 characters of which constitute a user code. The SESAR database is used to identify a sample from its IGSN. The system is still developing; the ability to track multiple parents could be built in. Procedure to get a number: the SESAR website is set up for only one sample at a time. CZO could become a trusted agent and run its own registration, then submit to the central catalog.

◦ Metadata standards: David T., Ilya, Mark W. (others interested?). We need to formalize metadata standards and a controlled (preferred) vocabulary. The idea is to build on the ODM controlled vocabulary and use its web system for defining variables. This group needs to meet in early July in Utah.

• Data products/synthesis products needed by September Science Council Meeting

• Are there flagship data products that CZO wants to highlight?
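The identifier layout noted under the unique identifier working group above (a 9-character alphanumeric string whose first 3 characters are a user code) can be illustrated with a minimal sketch. It assumes only the layout as recorded in these minutes and has not been checked against the SESAR registry rules; the function name `split_igsn` and the sample value `CZO000001` are hypothetical.

```python
import re

# Layout as recorded in these minutes: 9 alphanumeric characters,
# the first 3 of which are the user code. This is an assumption taken
# from the notes, not a validated SESAR rule.
IGSN_PATTERN = re.compile(r"[A-Za-z0-9]{9}")


def split_igsn(igsn: str) -> tuple[str, str]:
    """Split an identifier into (user_code, sample_code)."""
    candidate = igsn.strip().upper()
    if not IGSN_PATTERN.fullmatch(candidate):
        raise ValueError(f"not a 9-character alphanumeric string: {igsn!r}")
    return candidate[:3], candidate[3:]


# "CZO000001" is a made-up identifier used only for illustration.
user_code, sample_code = split_igsn("CZO000001")
```

A check like this would only confirm the shape of an identifier; whether a sample is actually registered would still have to be looked up in the SESAR catalog.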

Attendees

Mark Williams, David Lubinski, Suzanne Anderson, Chi Yang, Eric Parrish, Boulder Creek CZO; Gary Phelps, Roger Bales and ?, Southern Sierra CZO; Kevin Dressler and Brian Bills, Shale Hills CZO; Matej Durcik and Jon Chorover, Jemez River CZO; Charles Dow and Anthony Aufdenkampe, Christina River Basin CZO; Kerstin Lehnert, EarthChemDB developer; Sherri Johnson, Andrews LTER; James Brunt, LTER Network Office; David Tarboton, ODM developer; Deana Pennington, LTER Network Office; Ilya Zaslavsky, CUAHSI WaterOneFlow Web Services developer; Dan Ames, CZO Desktop developer.

Initial goals of the workshop:

• Metadata and data display - agree on and describe to the extent that data managers know what’s expected.

• Unique identifier - adopt?

• Future funding - the original RFP required single-portal access and integrated databases, but these requirements were removed. Money was provided to develop a roadmap for an integrated database, with the understanding that the funding is not sufficient.

Discussion about data models: There are lots of data models; an important subject for discussion is having data models that are easily linked to each other. Interoperability comes from having a central data management system with enough WaterML or EML wrapped around it to talk to other systems. The definition of a data model includes how different types of data connect to each other, i.e., connecting together information about a site and water chemistry constitutes an implicit data model. Compatible models are required to be able to parse data from different sources for efficient use and merging.

Mark Williams, INSTAAR, University of Colorado. Overview talk.

The LTER leads in data management but nevertheless has some problems, e.g., presenting the data for data discovery. NSF mandates synthesis for LTER, but the Ecotrends program took 2 years longer than expected because of the amount of data massaging needed - the data were not standardized/in a useful form. Also, Ecotrends is static. We need to move to a dynamic system. Our approach is well-vetted, pragmatic, and conservative, building on 20 years of experience, including mistakes. We don’t want to re-invent the wheel; we want to build on previous work done by CUAHSI, EarthChem, etc. The goal is to provide data in archival form that can be ingested by WaterML and EML, with the least amount of effort by individual sites.

A carrot is that if we can design and implement an integrated data management system, there is an increased chance that sites will continue - renewal is not guaranteed. An integrated system sells the program. Do we still agree on the CZO approach? That approach is a consistent web presence and data presentation, with sites owning their data and being responsible for them. Sites handle data however they want but publish on the website in an agreed format. The data are harvested by a central data system, and interoperability comes from incorporating data into that central system. This requires sufficient metadata. It is a pull system - sites don’t have to do anything to get data into CZO central. The data are presented on the site web page in a way that makes it easy to bring them into the central database. Each CZO is master of its own data. The only requirement is that we collect and incorporate the metadata that we agree on in this meeting and present the data in a certain format.

We don’t want to just adopt CUAHSI’s Observations Data Model (ODM). The data display is ASCII text, which is human readable and machine parsable (i.e., for WaterML and EML). Our primary focus in this meeting is on time series data and data that can be treated as time series (e.g., soil profiles). Metadata are always attached to the data, and the format for presenting data and metadata should be identical across CZOs.

We need a controlled vocabulary. The idea is to adapt the ODM approach to defining variables. Coming up with the controlled/preferred vocabulary is difficult - use champions for this. Do we want to continue to hire out to CUAHSI, or have a network office if CZO continues to grow? Data in a centralized database can be harvested and used by other tools.
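To make the idea of an ASCII display file with attached metadata concrete, here is a minimal sketch of what a harvester might do with one. The file layout is purely hypothetical (the actual metadata format was still to be agreed at this meeting): `#`-prefixed `key: value` header lines, then a column header, then comma-separated values. The function name, the metadata keys, and the sample site are all assumptions made for illustration.

```python
import csv
import io


def parse_display_file(text):
    """Return (metadata, rows) from a header-plus-table ASCII file.

    Hypothetical layout, not the agreed CZO format: lines starting
    with '#' hold 'key: value' metadata; the first other line is the
    column header; the remaining lines are comma-separated values.
    """
    metadata = {}
    data_lines = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("#"):
            key, _, value = line[1:].partition(":")
            metadata[key.strip()] = value.strip()
        else:
            data_lines.append(line)
    rows = list(csv.DictReader(io.StringIO("\n".join(data_lines))))
    return metadata, rows


# Made-up sample content; site, variable, and values are illustrative.
SAMPLE = """\
# site: Example Creek
# variable: water_temperature
# units: degC
datetime,value
2010-05-19 00:00,4.2
2010-05-19 01:00,4.0
"""

metadata, rows = parse_display_file(SAMPLE)
```

Keeping the metadata in the same text file as the values is what lets the central system pull a single file and still know the site, variable, and units without contacting the site's own database.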

Discussion about other kinds of data: How can we attach photos and links to algorithms? Photos can be data, e.g., snow survey photos - we need to match the GPS location of a site with its photos. Spectroscopic streaming data can be 5-dimensional, with 3 bands and x, y coordinates. Have a metadata approach that is as expandable as possible. Provide tags - metadata need the capacity to do this. The metadata we’re developing will allow us to attach photos.

Gary Phelps, UC Merced. Spatial data and digital library.

His system has metadata functionality - it assigns metadata and spatial extent to each data set uploaded. The spatial metadata protocol prototype is based on the CZO recommendations from February 2010. It can include both point data and spatial extent data, e.g., LIDAR. It has multi-file control - multiple files can be downloaded with one set of metadata. It uses the Google Maps API, has KML functionality, can store up to 3 GB, and can create an FTP site.

Discussion about map browsing capabilities: A map-based data exploration and retrieval system would impress NSF. Other prototypes mentioned are based on Google Maps, e.g., the NANOOS visualization system - real-time plot of data. However, large data sets such as LIDAR data may load too slowly in Google Maps. It may be worth taking a uniform approach to map interfaces on the web pages.

David Lubinski, INSTAAR, University of Colorado. CZO web information system.

The context is helping to integrate the CZO web information system. He has 6 months, so a pragmatic approach is necessary. Questions: who is the audience, what is in the system, why (mission), and how to do it. The audience is CZO scientists and students, peers, NSF, the public. Need to remove hurdles to updating websites. Websites are important - NSF uses them to assess what you’re doing. Need easy access to web content by multiple groups of people. Make it as easy as possible to get information in. Need to track the information needed for NSF reporting. Outreach is also an important role. His role is focused on information, not scientific data. The websites are places for data display, but data archiving is done through the data managers.

Anthony Aufdenkampe, Christina River Basin CZO. Wiki-watershed project (proposal pending) - a map-based data viewer intended as a web-based outreach tool for exploring backyard watersheds. It will contain photos and stories in addition to data. Clickable maps produce information and stories about what’s going on in a watershed, e.g., mountain-top mining in West Virginia. It can be used to build curricula for school groups, support citizen monitoring, and provide the opportunity to load your own data onto the web server. The two purposes for a map-based data explorer are scientific and gee-whiz for the public.

David Lubinski continued.

Three questions requiring feedback:

1. What does the interface for data access look like? We need a tagline. His suggestion: “Multi-disciplinary teams studying the zone where rock meets life”. The tagline from the brochure is long - a 10-12 word purpose. We also need a welcome blurb that explains who we are and themes in a succinct way. Put this on the welcome pages.

Discussion: Wordsmith the brochure tagline and use that. Start with what we already have.

2. How much aggregation and integration should take place? Is a simple aggregation of news from individual sites on the national websites sufficient or should we interconnect activities via links - people, news, etc.?

3. What are the implementation details? Build multi-page websites using a content management system; individual sites manage the content.

Discussion: Strategies include: 1) each site goes its own way; 2) web pages are on a single server - sites are not in charge of the hardware; or 3) a third model - each site runs its own web page with a consistent template. We’re using 3 now, so has the decision already been made? A new increment of funding would be needed to go in a different direction. We can work on an individual level, add functionality, and share resources. CZOs have an opportunity to avoid LTER’s problems. A fourth model consists of having a central server and individual websites - integration with some autonomy to post data that is not shared. Data changes are handled the same way whether locally hosted or centrally located.

Mark: Assignment: PIs should caucus and decide the best way to go.

Sherri Johnson (Andrews LTER): Opportunities for synthesis of long-term cross-site data.

• ClimDB/HydroDB web pages implemented in 2002 with 41 sites, 327 measuring stations, 21 parameters. The challenge is updating, including metadata. Each site is presented the same way, but some don’t have all the variables. The climate committee made the decision to present only daily data.

• Ecotrends has an annual time step; metadata have gotten lost in the process, not well integrated.

• Cross-site synthesis of long-term stream biogeochemistry across 10 experimental forests is beginning. Challenges are standardizing the data and issues of comparability - e.g., standard terminology and units, detection limits.

Discussion about hybrid model: some sites don’t want their own content management system (Anthony). Others want their own CMS (e.g., AZ, Christina, Boulder).

Ilya Zaslavsky, Spatial Information Systems Laboratory, San Diego Supercomputer Center.

Developing a CZO central server.

With a CMS, one can assign skins to change the look and feel of a website, add modules to web pages by clicking on a list, and grab data by doing SQL requests.

• CZO integrated data management: Why have web services for water data? Content on web pages is human readable and uses HTML, but we may also want to make it machine readable, using WaterML for water data, so that data come back in a common model. In the old way, data queries are entered on different web pages and the responses come back in different formats such as ASCII and XML. WaterML, an XML-based language developed in CUAHSI, gives a common way to present data; it includes location, variables, and time series. WaterOneFlow is a set of web services. Other languages have been developed for standardization within other organizations. International standardization of WaterML is now taking place. Individual CZO data systems expose content in a standard way. The rest is value-added: standard CZO services, web-based data discovery, desktop apps.

• CZO data publication model: individual CZO data management systems generate display files, modeled on LTER files. This allows adding time series-level and data value-level attributes. The data are placed on websites and ingested into the CZO repository at SDSC. The time series are then exposed as water data services. NSF now requires data management plans - this can be a way to comply. Each file needs an indicator of the last time it was updated. Many steps are involved in the CUAHSI system that CZOs don’t want to do. CZOs manage their own data systems and generate display files. Other activities go on behind the scenes, ending in data downloading. Main components of the prototype system: the folder watch service passes a new file to the data interpreter, which reads the ASCII data line by line, and a web-accessible directory is populated with the data. The CZO central web services repository is on a test machine - the demo populated it with a data set from Boulder CZO. Using HydroDesktop, one can search data catalogs, integrate data, map locations, create charts, and search based on concepts. It can also be used to select data outside the CZO, e.g., for the same variable and time period.
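The prototype's folder watch and data interpreter steps can be pictured with a rough sketch. This is not the SDSC implementation, whose details are not recorded in these minutes; it assumes a polling approach, `.txt` display files, comma-separated lines, and an in-memory dictionary standing in for the repository. All function names are hypothetical.

```python
from pathlib import Path


def find_new_files(directory, seen):
    """Return files in `directory` not yet in `seen`, and record them."""
    new_files = []
    for path in sorted(Path(directory).glob("*.txt")):
        if path.name not in seen:
            seen.add(path.name)
            new_files.append(path)
    return new_files


def interpret(path):
    """Read an ASCII file line by line, skipping blank lines."""
    records = []
    with open(path, encoding="ascii") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(line.split(","))
    return records


def poll_once(directory, seen, repository):
    """One polling pass: interpret each new file and store its records.

    `repository` is a plain dict standing in for the central store.
    """
    for path in find_new_files(directory, seen):
        repository[path.name] = interpret(path)
    return repository
```

A service built this way would call `poll_once` on a schedule. Tracking only file names means a changed file is not picked up again, which is one reason the last-updated indicator mentioned above would be needed in practice.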