DRAFT – Strawman for discussion – DRAFT
Program Execution
September 29, 2003
Andrew Grimshaw
Presumed objective
The program execution sub-group is to examine the issues and propose service definitions (interfaces) involved with starting up services – including legacy applications. An alternative view is that this sub-group is only concerned with legacy application support.
Background
There are two existing documents, one by Ming Xu (Platform) and one by Ravi Subramaniam (Intel). Xu’s contribution is focused on legacy applications and the use of load management system back-ends such as Platform’s LSF and Sun’s SGE. Several important issues such as executable provisioning, data provisioning, etc. are not yet addressed. Subramaniam’s contribution is a more general model of the entire process.
I believe both documents are a good start. I think, though, that we should look at what needs to go on under the covers. I have extensive experience with starting services (objects) in Legion. The Legion model presumed that the user would be involved in few if any direct activities; instead, the underlying system automatically “provisioned” cycles, binaries, and data. Further, the underlying system would automatically detect failures and recover. We did, however, discover some problems with our model, particularly with respect to what I’ll call “binary matching”. A “binary” is a piece of code that will run in a particular hosting environment. This may be, for example, a Solaris executable, a shell script, a Perl script, or Java byte code. Binary matching is the process of determining whether a particular “binary” will run on a particular hosting environment.
Below I list a number of services that we found necessary, with a paragraph about each one. The list is not intended to be a complete description at this stage.
Program Execution "Services"
Job proxy
The job proxy service manages the execution of legacy codes. One cannot presume, for example, that a legacy Fortran code is going to know anything about OGSI. The job proxy (we called them JPOs, for Job Proxy Objects) wraps the legacy code and provides a manageability interface to the running application: for example, setting up stdio redirection, managing the saving of checkpoint files, and providing the ability to start and stop the application. Further, it may be responsible for provisioning the data to the application using tools such as GridFTP. My understanding is that SGE uses something like this as well, and I suspect Platform LSF does too.
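To make the wrapping idea concrete, here is a minimal sketch in Python of a proxy that starts a legacy program, redirects its stdio to files, and exposes start, stop, and status operations. The class name JobProxy, its method names, and the log file names are assumptions made for illustration; they are not taken from Legion, SGE, or LSF, and checkpointing and data provisioning are left out.

```python
import subprocess
from pathlib import Path


class JobProxy:
    """Hypothetical wrapper that manages one legacy process."""

    def __init__(self, argv, workdir):
        self.argv = list(argv)
        self.workdir = Path(workdir)
        self.process = None
        self._stdout = None
        self._stderr = None

    def start(self):
        # Redirect stdout/stderr to files so the legacy code needs no changes.
        self.workdir.mkdir(parents=True, exist_ok=True)
        self._stdout = open(self.workdir / "stdout.log", "wb")
        self._stderr = open(self.workdir / "stderr.log", "wb")
        self.process = subprocess.Popen(
            self.argv,
            cwd=str(self.workdir),
            stdin=subprocess.DEVNULL,
            stdout=self._stdout,
            stderr=self._stderr,
        )

    def status(self):
        if self.process is None:
            return "not started"
        code = self.process.poll()
        return "running" if code is None else f"exited({code})"

    def wait(self, timeout=None):
        # Block until the application finishes and return its exit code.
        code = self.process.wait(timeout=timeout)
        self._close_logs()
        return code

    def stop(self, grace_seconds=5.0):
        # Ask politely first, then kill if the application ignores the request.
        if self.process is not None and self.process.poll() is None:
            self.process.terminate()
            try:
                self.process.wait(timeout=grace_seconds)
            except subprocess.TimeoutExpired:
                self.process.kill()
                self.process.wait()
        self._close_logs()

    def _close_logs(self):
        for handle in (self._stdout, self._stderr):
            if handle is not None and not handle.closed:
                handle.close()
```

A real job proxy would expose these operations through a service interface rather than as local method calls.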
Application provisioning
It is unreasonable to expect the user to ensure that the current, correct version of an application has been installed in a hosting environment before execution, particularly if they do not know a priori where the application will run. Therefore, application provisioning services will be required to ensure that whatever executable resources are needed to run the application on a particular CMM resource are available. This may include copying binaries (executables, scripts, “.o” files, “.lib” files, “.jar” files, etc.) into a local “implementation cache” and setting appropriate permissions. Further, these services should ensure that whatever application versioning policy has been selected is enforced.
The CMM resources could include hosting environments such as J2EE servers, queuing systems such as LSF and SGE, and vanilla operating systems such as Unix and Windows. Additional CMM resources would be storage services and network services.
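The caching step described in this section might look roughly like the following Python sketch. It assumes that the binary is reachable as a local file and that the cache is laid out as one directory per version; the function name provision, the cache layout, and the use of a SHA-256 digest to detect stale copies are illustrative assumptions, not part of any existing system.

```python
import hashlib
import shutil
from pathlib import Path


def _digest(path):
    # Content hash used to decide whether a cached copy is still current.
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def provision(source, cache_dir, version):
    """Copy `source` into cache_dir/<version>/ unless an identical copy exists.

    Returns the path of the cached binary.
    """
    source = Path(source)
    target = Path(cache_dir) / version / source.name
    if target.exists() and _digest(target) == _digest(source):
        return target  # cache hit: nothing to copy
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)
    target.chmod(0o755)  # make the cached binary executable
    return target
```

Keeping each version in its own directory is one simple way to let a versioning policy pin, upgrade, or roll back an application without disturbing other cached versions.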
Resource Discovery
Resource discovery services will clearly be needed. They are likely to be associated with VOs and to define a set of resources that can be searched. They have been discussed elsewhere.
Schedulers
By schedulers I really mean service/job placement services that will choose CMM resources on which to execute services, not low-level OS schedulers. The scheduler may generate multiple schedules in case the first-choice schedule becomes unavailable. There will be different schedulers for different types of applications, and for different workload environments.
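As an illustration of generating several alternative schedules, the Python sketch below ranks candidate resources and returns the best few in preference order. The Resource fields, the function name rank_schedules, and the least-loaded-first policy are all assumptions chosen for the example; a real scheduler would use whatever policy suits its application type and workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    free_cpus: int
    load: float  # lower is better


def rank_schedules(resources, cpus_needed, max_alternatives=3):
    """Return resource names in preference order: first choice, then fallbacks."""
    candidates = [r for r in resources if r.free_cpus >= cpus_needed]
    # Least-loaded first; the name breaks ties so the ordering is deterministic.
    candidates.sort(key=lambda r: (r.load, r.name))
    return [r.name for r in candidates[:max_alternatives]]
```

The returned list plays the role of the multiple schedules: if the first entry becomes unavailable, the next entry is tried.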
Enactors
Enactors take schedules and do the work needed to realize them. This includes acquiring the reservations, working with application provisioning services, dealing with non-responsive hosts, dealing with accounting issues, etc.
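A minimal Python sketch of this loop follows. It assumes a schedule is simply an ordered list of host names and that reservation, provisioning, start-up, and release are supplied as callables; the function name enact, the callable signatures, and the EnactmentError exception are hypothetical. Accounting is omitted.

```python
class EnactmentError(Exception):
    """Raised when none of the candidate schedules could be realized."""


def enact(schedules, reserve, provision, start, release):
    """Try each scheduled host in order; return (host, failures) on success.

    `failures` lists (host, reason) pairs for the hosts that were skipped.
    """
    failures = []
    for host in schedules:
        try:
            token = reserve(host)
        except Exception as exc:  # e.g. a non-responsive host
            failures.append((host, f"reserve failed: {exc}"))
            continue
        try:
            provision(host)
            start(host)
        except Exception as exc:
            # Give the reservation back before moving to the next schedule.
            release(host, token)
            failures.append((host, f"start-up failed: {exc}"))
            continue
        return host, failures
    raise EnactmentError(failures)
```

Passing the collaborating services in as callables keeps the sketch independent of any particular reservation or provisioning interface.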
Reservation
Reservation services manage reservations for resources, interact with accounting services (there may be a charge for making a reservation), revoke reservations, etc.
Monitoring
Simply starting something up is often insufficient. Applications (which may include many different services/components) often need to be continuously monitored, both for fault-tolerance reasons and for QOS reasons. For example, the conditions on a host that caused the scheduler to select it may have changed, possibly indicating that the application needs to be rescheduled.
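One simple way to detect that the conditions behind a placement decision have changed is to compare the metrics recorded when the host was selected against a fresh reading. The Python sketch below does this under several assumptions: metrics are plain name-to-number mappings, lower values are better (for example load or queue length), and a 25% degradation is the threshold. The function name degraded_metrics and the threshold are illustrative only.

```python
def degraded_metrics(at_selection, current, tolerance=0.25):
    """Return the names of metrics that are now worse than when the host was chosen.

    A metric counts as degraded if it is missing from the current reading or
    exceeds its value at selection time by more than `tolerance` (a fraction).
    """
    degraded = []
    for name, baseline in at_selection.items():
        if name not in current:
            degraded.append(name)  # no reading at all: treat as a fault
        elif current[name] > baseline * (1.0 + tolerance):
            degraded.append(name)
    return sorted(degraded)
```

A non-empty result would be a hint to the monitoring service that rescheduling should be considered; it does not by itself decide that the application must move.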
Accounting/billing services
Accounting, auditing, and billing services are critical for the success of OGSA outside of academia and the government. This will include the ability for schedulers to interact with resources to establish prices, as well as for resources to interact with accounting and billing services.
"Compatibility" checking services
One of the problems faced in the grid is being able to determine which hosts (CMM hosting services) are candidates for the execution of a service: not all services can run on all hosts. In reality there may be many different implementations of a service with different QOS features as well as different hosting requirements, for example Java, Sun native, and AIX native implementations. Unfortunately, just saying that something is a Sun native implementation is insufficient to determine whether a binary can run on a particular Sun. There are also OS versions, installed libraries, license restrictions, etc. to consider. A compatibility checking service will determine whether a particular implementation can execute on a particular host.
License management services
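The check described above can be sketched in Python as a comparison between what an implementation requires and what a host offers. The attribute sets used here (architecture, OS name and version, libraries, licenses) follow the examples in the paragraph, but the class names, field names, and matching rules are assumptions; a real service would need a much richer description of both sides.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class Implementation:
    architecture: str
    os_name: str
    min_os_version: Tuple[int, ...]
    libraries: FrozenSet[str] = frozenset()
    license: Optional[str] = None


@dataclass(frozen=True)
class Host:
    architecture: str
    os_name: str
    os_version: Tuple[int, ...]
    libraries: FrozenSet[str] = frozenset()
    licenses: FrozenSet[str] = frozenset()


def incompatibilities(impl, host):
    """Return human-readable reasons why `impl` cannot run on `host`."""
    problems = []
    if impl.architecture != host.architecture:
        problems.append(f"architecture is {host.architecture}, need {impl.architecture}")
    if impl.os_name != host.os_name:
        problems.append(f"OS is {host.os_name}, need {impl.os_name}")
    elif host.os_version < impl.min_os_version:
        problems.append(f"OS version {host.os_version} is older than {impl.min_os_version}")
    missing = impl.libraries - host.libraries
    if missing:
        problems.append("missing libraries: " + ", ".join(sorted(missing)))
    if impl.license is not None and impl.license not in host.licenses:
        problems.append(f"no license for {impl.license}")
    return problems


def is_compatible(impl, host):
    return not incompatibilities(impl, host)
```

Returning the list of reasons, rather than only yes or no, lets a scheduler or a user see why a host was rejected.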
License management services will be needed to manage access to
Queuing
Queuing services are higher-level services that offer operations such as enq, deq, re-prioritize, and get status. Queuing services will be implemented using other services such as schedulers, VOs, data provisioning, and so on. These will be the user-facing part for legacy codes.
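The four operations named above can be illustrated with a small in-memory priority queue in Python. This is only a sketch of the interface shape: the class name JobQueue, the convention that a lower number means higher priority, and the status strings are assumptions, and a real queuing service would delegate placement and data movement to the other services.

```python
import heapq
import itertools


class JobQueue:
    """Hypothetical queue with enq, deq, reprioritize, and status operations."""

    def __init__(self):
        self._heap = []      # entries are [priority, sequence, job_id]
        self._entries = {}   # job_id -> its live heap entry
        self._status = {}    # job_id -> "queued" or "dispatched"
        self._sequence = itertools.count()

    def _push(self, job_id, priority):
        # The sequence number keeps FIFO order among jobs of equal priority.
        entry = [priority, next(self._sequence), job_id]
        self._entries[job_id] = entry
        heapq.heappush(self._heap, entry)

    def enq(self, job_id, priority=0):
        if job_id in self._entries:
            raise ValueError(f"{job_id} is already queued")
        self._push(job_id, priority)
        self._status[job_id] = "queued"

    def reprioritize(self, job_id, priority):
        # Mark the old heap entry as dead and push a fresh one.
        old = self._entries.pop(job_id)  # KeyError if the job is not queued
        old[2] = None
        self._push(job_id, priority)

    def deq(self):
        while self._heap:
            _, _, job_id = heapq.heappop(self._heap)
            if job_id is not None:  # skip entries removed by reprioritize
                del self._entries[job_id]
                self._status[job_id] = "dispatched"
                return job_id
        raise IndexError("queue is empty")

    def status(self, job_id):
        return self._status.get(job_id, "unknown")
```

For example, after enqueuing jobs "a" and "b" with priorities 5 and 1, calling reprioritize("a", 0) makes "a" the next job returned by deq.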
Data provisioning
Interaction scenario