Grid Data Distribution and DAIS Data Delivery Requirements

The intent of this paper is to address the issues that were raised at the DAIS F2F meeting at ANL in Oct 03 and to position the Grid Data Distribution model [GDD] vis-a-vis the DAIS Data Delivery proposal [DAISDD].

The intent of this paper is not to resolve all of the issues, but rather to provide a good background and a detailed description of the various options for data delivery, which may then be discussed at the next F2F meeting in Manchester.

A pre-requisite would be to read the “Data Distribution in the Grid Environment” informational paper submitted/presented at GGF9.

1. Summary of the GDD model

The Grid Data Distribution model allows users to share data between publishers and subscribers in a timely fashion. Publishers are considered to be the sources of data (information), while subscribers are considered to be the consumers of data (information). Subscribers and consumers could also be separate entities. Data can be pushed or pulled from the publisher to the subscriber/consumer. The publisher can also alert subscribers/consumers to the existence of data.

In this model, publishers and subscribers can control how information is published, distributed, and consumed. The model supports efficient asynchronous distribution of data through the use of staging areas, so publishers do not necessarily need to know the target recipients, and vice versa.

There are three ways in which data movement could happen using the GDS framework: Data Access, Notification and Grid Data Distribution. Data Access provides synchronous data movement (RPC-like request-response characteristics). It could also be extended to support asynchronous data access (for example, by returning a handle to the result data as the response, so that the result data could later be retrieved through Data Access using the handle).

We propose that the Grid Data Distribution portType be defined on a Grid Data Service as a separate portType (at the same level as the DataAccess and DataManagement portTypes). The GDD portType could have two sub portTypes GDDProducer and GDDConsumer.

GDDProducer would have the following operations:

createPublication, alterPublication, dropPublication, startPublication, stopPublication, createSubscription, alterSubscription, dropSubscription, startSubscription, stopSubscription, createPropagation, alterPropagation, dropPropagation, startPropagation, stopPropagation, publishData, getData.

GDDConsumer would have the following operations:

createConsumption, alterConsumption, dropConsumption, startConsumption, stopConsumption, deliverData, deliverEvent.
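The two operation lists follow a regular pattern: five lifecycle verbs applied to each managed object, plus two data operations per portType. The following Python sketch is illustrative only; it rebuilds the operation names listed above from that pattern. The constant and function names are assumptions made for this example and are not part of the GDD proposal.

```python
# Lifecycle verbs and managed objects, taken from the operation lists above.
LIFECYCLE_VERBS = ("create", "alter", "drop", "start", "stop")
PRODUCER_OBJECTS = ("Publication", "Subscription", "Propagation")
CONSUMER_OBJECTS = ("Consumption",)


def lifecycle_operations(objects):
    """Combine every lifecycle verb with every managed object name."""
    return [verb + obj for obj in objects for verb in LIFECYCLE_VERBS]


# Each portType adds two data-movement operations to its lifecycle operations.
GDD_PRODUCER_OPERATIONS = lifecycle_operations(PRODUCER_OBJECTS) + ["publishData", "getData"]
GDD_CONSUMER_OPERATIONS = lifecycle_operations(CONSUMER_OBJECTS) + ["deliverData", "deliverEvent"]

print(len(GDD_PRODUCER_OPERATIONS), len(GDD_CONSUMER_OPERATIONS))
```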

2. Issues brought up during the DAIS October F2F at ANL, Chicago

1. Issue: How should support for data distribution be added to the Data Service portTypes as they exist today? Questions arise as to whether GDD should be part of “Data Access” or “Data Management”, or whether we need one or more separate portTypes.

Response: We propose that the Grid Data Distribution portType be defined on a Grid Data Service as a separate portType (at the same level as the DataAccess, DataManagement portTypes). The GDD portType could have two sub portTypes GDDProducer and GDDConsumer.

2. Issue: Should GDD be further decomposed into portTypes (PTs) for each ‘phase’, i.e., publication, subscription, propagation and consumption?

Response: As mentioned above, GDD would have 2 sub portTypes GDDProducer and GDDConsumer.

3. Issue: How is data access done by GDD? Will it call the DataAccess PT of the DS? If not, explain how this will be done.

Response: DataAccess and GDD are not related in any special way even when they are supported by the same Data Service. GDD supported by a Data Service should NOT use the Data Access interface (as a service) to access the data underneath. GDD when defined over Data Service would use the relational/XML query language (similar to Data Access) to define the implicit publication (data of interest) against the underlying data source.

** Data Access acts as a conduit for existing query languages to be conveyed to the appropriate data sources and similarly GDD may also use the appropriate query languages to define the data publication/subscription when it is supported by a Data Service.

4. Issue: How do (client) agreements/policies fit within GDD?

Response: GDD offers negotiation/inspection capabilities through the DataDescription portType of GDS (via Grid Agreements). GDD needs to negotiate capabilities that are understood by a specific Grid Data Service (GDS) to do data distribution. The Grid agreements protocol is a good fit for negotiating these. The following is an example list of capabilities that we minimally might need to support interfaces defined by Grid Data Distribution (GDD).

Generic Capabilities:

Examples of such capabilities are:

The GDS_VERSION capability specifies the version for a particular Grid data service.

The GDS_XID capability specifies whether GDS provides and understands transactions.

The GDS_LOGON_TYPES capability specifies whether GDS understands newer types of authentication during logon.

Character set and form type capabilities:

Examples of such capabilities are:

The GDS_CODEPOINT capability specifies whether GDS understands codepoint length semantics.

The GDS_NCHAR capability specifies whether GDS understands the nchar conversions.

The GDS_CHARSET capability specifies whether GDS can do character set conversion.

Type and Type evolution capability

Examples of such capabilities are:

The GDS_DATE_TIME capability specifies the date time formats supported by the GDS.

The GDS_FLOAT_DOUBLE capability specifies whether binary types such as float or double are supported by GDS.

The GDS_TYPE_EVOLUTION capability specifies whether or not GDS understands an evolved type.

Presentation related capability

Examples of such capabilities are:

The GDS_PRESENTATION capability specifies the presentations understood by the GDS.

The GDS_ENDIAN capability specifies the endian of the system hosting GDS.

The GDS_EOCS capability specifies whether the GDS knows the end-of-call status.

The GDS_PIGGYBACK capability specifies whether piggybacks are handled by GDS.

The GDS_DATA_BLOCK_VERSION capability specifies the data block formats understood by the GDS.

RPC related capability

Examples of such capabilities are:

The GDS_RPC_VERSION capability specifies the rpc version supported by the GDS.

The GDS_RPC_SIGNATURE capability specifies the rpc signatures supported by the GDS.

The GDS_RPC_FLAG capability specifies the rpc flags supported by the GDS.
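To make the idea of negotiating capabilities concrete, the following Python sketch intersects the values a client asks for with the values a GDS advertises. It is a simplified illustration under stated assumptions and not the Grid agreements protocol itself: the function name negotiate, the set-valued representation, and all capability values shown are hypothetical; only the capability names come from the list above.

```python
def negotiate(requested, offered):
    """Return, per capability, the values that both sides understand.

    `requested` and `offered` map a capability name to a set of values.
    Raises ValueError when a requested capability has no common value.
    """
    agreed = {}
    for name, wanted in requested.items():
        common = wanted & offered.get(name, set())
        if not common:
            raise ValueError("no common value for capability " + name)
        agreed[name] = common
    return agreed


# Hypothetical values, chosen only to exercise the function.
client_request = {
    "GDS_PRESENTATION": {"WebRowSet", "CSV"},
    "GDS_CHARSET": {"UTF-8"},
}
service_offer = {
    "GDS_PRESENTATION": {"WebRowSet"},
    "GDS_CHARSET": {"UTF-8", "ISO-8859-1"},
    "GDS_ENDIAN": {"big"},
}

agreed = negotiate(client_request, service_offer)
print(agreed)
```

Capabilities that the client does not request (GDS_ENDIAN here) are simply left out of the agreement in this sketch.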

5. Issue: What monitoring capabilities are provided by GDD?

Response: GDD offers monitoring capability through views accessed through the DataManagement portType of GDS. At a high level, GDD advocates three types of views, namely Administrative views, Statistical views and Security views. The following is an example list of parameters that we would like to monitor through these views:

Administrative views:

GDD Publishers and Publication rules

Publisher Name – Name of the publisher.

Publisher Identifier – Identifier returned as a result of create publisher call.

Publisher Address – Location of the publisher if any.

Publication Rules – Publication rules if any.

GDD Subscribers and subscription rules

Subscriber Name – Name of the subscriber.

Subscriber Identifier – Identifier returned as a result of create subscriber call.

Subscriber Address – Location of the subscriber if any.

Subscription Rules – Subscription rules if any.

GDD Propagators

Propagator Name – Name of the propagator.

Propagator Identifier – Identifier returned as a result of create propagator call.

Propagator Address – Location of the propagator if any.

Propagation Rules – Propagation rules if any.

GDD Consumers

Consumer Name – Name of the consumer.

Consumer Identifier – Identifier returned as a result of create consumption call.

Consumer Address – Location of the consumer if any.

Consumption Rules – Consumption rules if any.

Security views:

GDD users with administrative privileges

User name – Name of the user with administrative privilege.

Privilege Matrix – Matrix of privileges associated with the user.

User signature – Digital Signature of the user.

GDD users with operational privileges

User name – Name of the user with operational privilege.

Privilege Matrix – Matrix of privileges associated with the user.

User signature – Digital Signature of the user.

Statistical Views

GDD publication statistics

Start Date and Time – Date and time at which publication was started.

Last Run Date and Time – Date and time of the last successful execution.

Total Number of Messages – Total number of messages published since publication was started.

Total Bytes – Total number of bytes published since publication was started.

Failures – Number of times the execution failed.

Last Error Date and Time – Date and time of the last unsuccessful execution.

Last Error Message – Error number and error message text of the last unsuccessful execution.

GDD delivery statistics

Destination GDS – Destination Grid Data Service Identifier.

Start Date and Time – Date and time at which data delivery was started.

Last Run Date and Time – Date and time of the last successful execution.

Total Number of Messages – Total number of messages delivered since delivery was started.

Total Bytes – Total number of bytes delivered since delivery was started.

Failures – Number of times the execution failed.

Last Error Date and Time – Date and time of the last unsuccessful execution.

Last Error Message – Error number and error message text of the last unsuccessful execution.

GDD consumption statistics

Start Date and Time – Date and time at which consumption was started.

Last Run Date and Time – Date and time of the last successful execution.

Total Number of Messages – Total number of messages consumed since consumption was started.

Total Bytes – Total number of bytes consumed since consumption was started.

Failures – Number of times the execution failed.

Last Error Date and Time – Date and time of the last unsuccessful execution.

Last Error Message – Error number and error message text of the last unsuccessful execution.
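The statistical views above share the same set of parameters, so one record type could serve publication, delivery and consumption alike. The following Python sketch shows one possible shape for such a record. The class name, field names and update methods are assumptions made for illustration; the paper lists only the parameters, not how they are stored or updated.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class GDDStatistics:
    """One row of a statistical view (hypothetical layout)."""
    start_time: datetime
    last_run: Optional[datetime] = None
    total_messages: int = 0
    total_bytes: int = 0
    failures: int = 0
    last_error_time: Optional[datetime] = None
    last_error_message: Optional[str] = None

    def record_success(self, when, messages, size_in_bytes):
        """Update the counters after a successful execution."""
        self.last_run = when
        self.total_messages += messages
        self.total_bytes += size_in_bytes

    def record_failure(self, when, error_number, text):
        """Remember the most recent failure and count it."""
        self.failures += 1
        self.last_error_time = when
        self.last_error_message = "{}: {}".format(error_number, text)


# Example values are invented for the demonstration.
stats = GDDStatistics(start_time=datetime(2003, 10, 1, 15, 0))
stats.record_success(datetime(2003, 10, 1, 15, 5), messages=3, size_in_bytes=2048)
stats.record_failure(datetime(2003, 10, 1, 15, 10), error_number=42, text="destination unreachable")
print(stats)
```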

6. Issue: How does GDD handle transactional issues?

Transaction support:

GDD is a message-oriented environment that requires transactional support through GDS, to ensure consistency and at the same time allow high performance and scalability.

GDD requires improved control over the transaction capabilities of GDS. For integrity of propagation, we need support for “recoverable read”. For improved performance, support for “fast commit” would be required.

In a client-server environment it is important to ensure that a transaction is recoverable before commit completion is signaled to a program. This allows the program to notify the client that everything has been properly completed and will be remembered even through system crashes. However, if the notification is done through a staging area, commit completion can be signaled once visibility is reached; waiting for recoverability is not required and leads to unnecessary resource consumption (CPU cycles and process structures). This technique is sometimes referred to as 'fast commit'.

In non-message environments it is assumed that external notification by the program will not happen until its commit has become recoverable. Under this assumption it is acceptable to read visible data even if it is not yet recoverable. Propagation, however, uses the following sequence: read data, send data, wait for acknowledgment, and commit. This sequence opens the door to sending non-recoverable data. What is required is a 'recoverable read', i.e., a read that returns only recoverable data.

The Grid transaction support has to provide specifications for fast commit and recoverable read. In messaging environments, the best combination is fast commit and recoverable read. This model has been successfully used since 1976 in a commercial product that supports many highly visible mission critical applications.
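The difference between visible and recoverable data can be shown with a toy staging area. The following Python sketch is only a model of the two concepts under simplifying assumptions: the class StagingArea, its method names, and the harden step (standing in for whatever makes committed data durable) are all hypothetical and do not describe any particular product or the eventual Grid transaction specification.

```python
class StagingArea:
    """Toy staging area that separates visibility from recoverability."""

    def __init__(self):
        self._entries = []       # committed messages, in commit order
        self._durable_count = 0  # how many leading entries are recoverable

    def fast_commit(self, message):
        """Signal completion as soon as the message is visible."""
        self._entries.append(message)

    def harden(self):
        """Stand-in for the step that makes all visible entries recoverable."""
        self._durable_count = len(self._entries)

    def visible_read(self):
        """Return every committed message, recoverable or not."""
        return list(self._entries)

    def recoverable_read(self):
        """Return only messages that would survive a crash."""
        return self._entries[:self._durable_count]


area = StagingArea()
area.fast_commit("m1")
area.harden()
area.fast_commit("m2")  # visible, but not yet recoverable

# A propagator should send only what recoverable_read returns.
to_propagate = area.recoverable_read()
print(area.visible_read(), to_propagate)
```

In this model a propagator that used visible_read could send "m2" and then lose it in a crash, which is the situation the recoverable read is meant to prevent.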

7. Issue: Resolve the interaction between GSHs and GDD IDs

7a. Can the GDD publications, subscriptions and consumptions be ‘services’ of their own?

Response: It is up to the implementer of the GDD portTypes to decide if they could be spawned as services or not. Since the Grid is highly scalable, the answer is yes.

7b. If the GDD publications, subscriptions and consumptions are not spawned as individual ‘services’, then how would a client know of existing publications, for example? That is, how would a client discover publications if they are not services themselves?

Response: These could be supported by an external discovery service or through the DataDescription portType of the Grid Data Service.

3. GDD addressing the DAIS sample delivery scenarios

We map the GDS example scenarios (see Appendix A) using Grid Data Distribution (GDD).

In order for GDS to support scheduling of execution/delivery of the “operations (query/update)”, it needs to have a “timer” facility implemented or have access to use such a service.

GDD is not interested in supporting scenarios that can be fully covered by DataAccess of GDS. The focus of GDD is mainly the distribution of data or events with a wide range of operational characteristics, including those that are required for mission critical applications. Optionally, GDD will support the selection of recipients and, also optionally, generate the “right data at the right time”.

The following table shows what scenarios are supported by the GDS Spec (as of GGF9), those by Greg Riccardi’s suggested extensions (as of GGF9) and those by Grid Data Distribution.

Scenario / GDS spec / Greg Riccardi’s Extensions / GDD / Comments
1 / Yes / Synchronous query
2 / Yes / Yes
3 / ? / Yes
4 / Yes / Synchronous update
5 / ? / Pull from a non-service entity
6 / Yes
7 / Yes
8 / Yes

Scenario 1: Synchronous access

GDD has no role to play here. SQLAccess::SQLQuery() could be used to achieve this functionality.

Scenario 2: Asynchronous Query execution and delivery to 3rd party consumer. Push model

The analyst locates the Data service representing the data of interest, by looking up a global registry that lists such Data services. The analyst receives in return the GSH of this Data Service (DSGSH).

The analyst defines a subscription at DSGSH, expressing interest in the required data by specifying a SQL Query, the time at which this query has to be executed (3PM), and that it would result in an implicit publication called QueryPublication:

GDDProducer::createSubscription([implicitname=QueryPublication, SQL Query, scheduleat = 3PM], Analyst)

returns SubsID.

The subscription generates the publication implicitly.

At 3PM, this data service would execute the query and maintain the result data in a queue/temporary table. This result data would be required for delivery purposes, as specified through the propagation rules.

The analyst specifies that the results of the subscription SubsID (the result of the query) be delivered to a 3rd party consumer. This is done by specifying the propagation rules. Through propagation, the analyst specifies the consumer location (URI), the time when the results have to be delivered (9PM), protocol to be used (SMTP), format in which the data needs to be delivered (WebRowSet), referring to the subscription by the subscription id (SubsID).

Note: 1. No need for consumption rules, in this case.

GDDProducer::createPropagation(ConsumerURI, [subscription=SubsID, scheduleat = 9PM, protocol=SMTP, deliveryFormat=WebRowSet])

returns propagationId2.

At 9PM, the data service at DSGSH would use the protocol (SMTP) specified for propagationId2 to send the result data to the consumer at ConsumerURI. The data service has information about where the result data for subscription SubsID is maintained.
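The push sequence of Scenario 2 can be walked through with an in-memory stand-in for the data service. The following Python sketch is a simulation under stated assumptions, not an implementation of the portType: the class MockDataService, the tick method that plays the role of the timer facility, the dictionary keys, the identifier format, the example query and the consumer address are all hypothetical. Only the operation names and the rule values (3PM, 9PM, SMTP, WebRowSet) come from the scenario.

```python
import itertools


class MockDataService:
    """In-memory stand-in for the data service at DSGSH."""

    def __init__(self, run_query):
        self._run_query = run_query  # callable that executes a query string
        self._ids = itertools.count(1)
        self.subscriptions = {}
        self.propagations = {}
        self.results = {}            # subscription id -> staged result data
        self.sent = []               # deliveries that were pushed out

    def createSubscription(self, rules, subscriber):
        subs_id = "Subs{}".format(next(self._ids))
        self.subscriptions[subs_id] = dict(rules, subscriber=subscriber)
        return subs_id

    def createPropagation(self, consumer_uri, rules):
        prop_id = "Prop{}".format(next(self._ids))
        self.propagations[prop_id] = dict(rules, consumer=consumer_uri)
        return prop_id

    def tick(self, now):
        """Called by a timer: run queries, then deliveries, scheduled at `now`."""
        for subs_id, rules in self.subscriptions.items():
            if rules["scheduleat"] == now:
                self.results[subs_id] = self._run_query(rules["query"])
        for rules in self.propagations.values():
            if rules["scheduleat"] == now:
                # Assumes the query ran earlier, as it does in the scenario.
                data = self.results[rules["subscription"]]
                self.sent.append((rules["consumer"], rules["protocol"],
                                  rules["deliveryFormat"], data))


ds = MockDataService(run_query=lambda sql: [("row", 1)])
subs_id = ds.createSubscription(
    {"implicitname": "QueryPublication", "query": "SELECT 1", "scheduleat": "3PM"},
    "Analyst")
prop_id = ds.createPropagation(
    "mailto:consumer@example.org",
    {"subscription": subs_id, "scheduleat": "9PM",
     "protocol": "SMTP", "deliveryFormat": "WebRowSet"})
ds.tick("3PM")  # the query runs and its result is staged
ds.tick("9PM")  # the staged result is pushed to the consumer
print(ds.sent)
```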

Scenario 3: Asynchronous query execution and delivery to 3rd party consumer. Pull model

Using GDD, scenario 3a (pull model) could be achieved as follows:

a) The analyst locates the Data service representing the data of interest, by looking up a global registry that lists such Data services. The analyst receives in return the GSH of this Data Service (DSGSH).

b) The analyst defines a subscription at DSGSH, expressing interest in the required data by specifying a SQL Query, the time at which this query has to be executed (3PM), and that it would result in an implicit publication called QueryPublication:

GDDProducer::createSubscription([implicitname=QueryPublication, SQL Query, scheduleat = 3PM], Analyst)

returns SubsID.

The subscription generates the publication implicitly.

At 3PM, this data service would execute the query and maintain the result data in a queue/temporary table. This result data would be required for delivery purposes.

c) The analyst would ask a 3rd party consumer to get the result data from the data service, by sending to the consumer the GSH of the data service (DSGSH) and the subscription information (SubsID).

d) The 3rd party consumer will retrieve the results of the subscription SubsID (the result of the query) whenever it wants to. The consumer specifies the consumption rules (in what format the data would be consumed) and uses the getData() method to retrieve the result data.

Note: 1. No need for propagation rules, in this case.

GDDConsumer::createConsumption([subscription=SubsID, dataConsumptionFormat=WebRowSet], Consumer)

returns consumptionId.

GDDProducer::getData(consumptionId)
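Step d of the pull model can be sketched in the same simulated style. In the following Python example the class PullDataService is a hypothetical stand-in that holds both operations in one object purely to keep the example short, whereas the paper places createConsumption on GDDConsumer and getData on GDDProducer. The dictionary keys, identifier format and returned structure are likewise assumptions.

```python
class PullDataService:
    """Hypothetical stand-in holding staged results for pull-style retrieval."""

    def __init__(self, staged_results):
        self._staged = staged_results  # subscription id -> result rows
        self._consumptions = {}        # consumption id -> consumption rules

    def createConsumption(self, rules, consumer):
        consumption_id = "Cons{}".format(len(self._consumptions) + 1)
        self._consumptions[consumption_id] = dict(rules, consumer=consumer)
        return consumption_id

    def getData(self, consumption_id):
        """Return the staged rows in the format named by the consumption rules."""
        rules = self._consumptions[consumption_id]
        rows = self._staged[rules["subscription"]]
        return {"format": rules["dataConsumptionFormat"], "rows": rows}


# The staged result stands for what the query produced at 3PM.
service = PullDataService({"SubsID": [("row", 1)]})
consumption_id = service.createConsumption(
    {"subscription": "SubsID", "dataConsumptionFormat": "WebRowSet"}, "Consumer")
result = service.getData(consumption_id)  # the consumer pulls when it chooses
print(result)
```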

Using GDD, scenario 3b (push model) could be achieved as follows:

Steps a, b and c are the same as 3a, as mentioned above.

The 3rd party consumer would specify a schedule to the data service (DSGSH) as to when the delivery of the results of the subscription SubsID (the result of the query) has to happen. Through propagation, the consumer specifies the consumer location (URI), the time when the results have to be delivered (11PM), the protocol to be used (FTP), and the format in which the data needs to be delivered (WebRowSet), referring to the subscription by the subscription id (SubsID).

Note: 1. No need for consumption rules, in this case.

GDDProducer::createPropagation(ConsumerURI, [subscription=SubsID, scheduleat = 11PM, protocol=FTP, deliveryFormat=WebRowSet])

returns propagationId.

At 11PM, the data service at DSGSH would use the protocol specified for propagationId to send the result data to the consumer at ConsumerURI. The data service has information about where the result data for subscription SubsID is maintained.

Using GDD, scenario 3c (pull model) could be achieved as follows:

Steps a, b and d of 3a also apply to this scenario.

In step b, at 3PM, the data service (G1) would execute the query and create a new data service (G2) and populate this new data service using the query result.