Seminar Report ’03 Grid Computing

Introduction

The term “the Grid” was coined in the mid-1990s to denote a proposed distributed computing infrastructure for advanced science and engineering. Considerable progress has since been made on the construction of such an infrastructure, but the term “Grid” has also been conflated, at least in popular perception, to embrace everything from advanced networking to artificial intelligence. One might wonder whether the term has any real substance and meaning. Is there really a distinct Grid problem and hence a need for new Grid technologies? If so, what is the nature of these technologies, and what is their domain of applicability? While numerous groups have an interest in Grid concepts and share, to a significant extent, a common vision of Grid architecture, we do not see consensus on the answers to these questions.

This report argues that the Grid concept is indeed motivated by a real and specific problem and that there is an emerging, well-defined Grid technology base that addresses significant aspects of this problem. In the process, we develop a detailed architecture and roadmap for current and future Grid technologies. Furthermore, although Grid technologies are currently distinct from other major technology trends, such as Internet, enterprise, distributed, and peer-to-peer computing, these other trends can benefit significantly from growing into the problem space addressed by Grid technologies.

The real and specific problem that underlies the Grid concept is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. The sharing that we are concerned with is not primarily file exchange but rather direct access to computers, software, data, and other resources, as is required by a range of collaborative problem-solving and resource-brokering strategies emerging in industry, science, and engineering. This sharing is, necessarily, highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs. A set of individuals and/or institutions defined by such sharing rules forms what we call a virtual organization (VO).

The following are examples of VOs: the application service providers, storage service providers, cycle providers, and consultants engaged by a car manufacturer to perform scenario evaluation during planning for a new factory; members of an industrial consortium bidding on a new aircraft; a crisis management team and the databases and simulation systems that they use to plan a response to an emergency situation; and members of a large, international, multiyear high-energy physics collaboration. Each of these examples represents an approach to computing and problem solving based on collaboration in computation- and data-rich environments.

As these examples show, VOs vary tremendously in their purpose, scope, size, duration, structure, community, and sociology. Nevertheless, careful study of underlying technology requirements leads us to identify a broad set of common concerns and requirements. In particular, we see a need for highly flexible sharing relationships, ranging from client-server to peer-to-peer; for sophisticated and precise levels of control over how shared resources are used, including fine-grained and multi-stakeholder access control, delegation, and application of local and global policies; for sharing of varied resources, ranging from programs, files, and data to computers, sensors, and networks; and for diverse usage modes, ranging from single-user to multi-user and from performance-sensitive to cost-sensitive, and hence embracing issues of quality of service, scheduling, co-allocation, and accounting.

Current distributed computing technologies do not address the concerns and requirements just listed. For example, current Internet technologies address communication and information exchange among computers but do not provide integrated approaches to the coordinated use of resources at multiple sites for computation. The Open Group's Distributed Computing Environment (DCE) supports secure resource sharing across sites, but most VOs would find it too burdensome and inflexible. Storage service providers (SSPs) and application service providers (ASPs) allow organizations to outsource storage and computing requirements to other parties, but only in constrained ways: for example, SSP resources are typically linked to a customer via a virtual private network (VPN). Emerging distributed computing companies seek to harness idle computers on an international scale but, to date, support only highly centralized access to those resources. In summary, current technology either does not accommodate the range of resource types or does not provide the flexibility and control on sharing relationships needed to establish VOs.

It is here that Grid technologies enter the picture. Over the past five years, research and development efforts within the Grid community have produced protocols, services, and tools that address precisely the challenges that arise when we seek to build scalable VOs. These technologies include security solutions that support management of credentials and policies when computations span multiple institutions; resource management protocols and services that support secure remote access to computing and data resources and the co-allocation of multiple resources; information query protocols and services that provide configuration and status information about resources, organizations, and services; and data management services that locate and transport datasets between storage systems and applications.

Because of their focus on dynamic, cross-organizational sharing, Grid technologies complement rather than compete with existing distributed computing technologies. For example, enterprise distributed computing systems can use Grid technologies to achieve resource sharing across institutional boundaries; in the ASP/SSP space, Grid technologies can be used to establish dynamic markets for computing and storage resources, hence overcoming the limitations of current static configurations.

It is our belief that VOs have the potential to change dramatically the way we use computers to solve problems, much as the web has changed how we exchange information. As the examples presented here illustrate, the need to engage in collaborative processes is fundamental to many diverse disciplines and activities: it is not limited to science, engineering, and business activities. It is because of this broad applicability of VO concepts that Grid technology is important.

The Emergence of Virtual Organizations

Consider the following four scenarios:

  • A company needing to reach a decision on the placement of a new factory invokes a sophisticated financial forecasting model from an ASP, providing it with access to appropriate proprietary historical data from a corporate database on storage systems operated by an SSP. During the decision-making meeting, what-if scenarios are run collaboratively and interactively, even though the division heads participating in the decision are located in different cities. The ASP itself contracts with a cycle provider for additional oomph during particularly demanding scenarios, requiring of course that cycles meet desired security and performance requirements.
  • An industrial consortium formed to develop a feasibility study for a next-generation supersonic aircraft undertakes a highly accurate multidisciplinary simulation of the entire aircraft. This simulation integrates proprietary software components developed by different participants, with each component operating on that participant's computers and having access to appropriate design databases and other data made available to the consortium by its members.
  • A crisis management team responds to a chemical spill by using local weather and soil models to estimate the spread of the spill, determining the impact based on population location as well as geographic features such as rivers and water supplies, creating a short-term mitigation plan (perhaps based on chemical reaction models), and tasking emergency response personnel by planning and coordinating evacuation, notifying hospitals, and so forth.
  • Thousands of physicists at hundreds of laboratories and universities worldwide come together to design, create, operate, and analyze the products of a major detector at CERN, the European high energy physics laboratory. During the analysis phase, they pool their computing, storage, and networking resources to create a Data Grid capable of analyzing petabytes of data.

These four examples differ in many respects: the number and type of participants, the types of activities, the duration and scale of the interaction, and the resources being shared.

But they also have much in common, as discussed in the following (see also Figure 1). In each case, a number of mutually distrustful participants with varying degrees of prior relationship (perhaps none at all) want to share resources in order to perform some task. Furthermore, sharing is about more than simply document exchange (as in virtual enterprises): it can involve direct access to remote software, computers, data, sensors, and other resources. For example, members of a consortium may provide access to specialized software and data and/or pool their computational resources.

Resource sharing is conditional: each resource owner makes resources available, subject to constraints on when, where, and what can be done. For example, a participant in VO P of Figure 1 might allow VO partners to invoke their simulation service only for simple problems. Resource consumers may also place constraints on properties of the resources they are prepared to work with. For example, a participant in VO Q might accept only pooled computational resources certified as secure. The implementation of such constraints requires mechanisms for expressing policies, for establishing the identity of a consumer or resource (authentication), and for determining whether an operation is consistent with applicable sharing relationships (authorization).
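The three mechanisms named above (policy expression, authentication, and authorization) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names Request, SharingPolicy, and authenticate, the token registry, and the problem_size measure are hypothetical and do not come from any Grid toolkit. Real systems would use cryptographic credentials rather than a lookup table.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class Request:
    """A sharing request (hypothetical model, for illustration only)."""
    consumer: str       # identity established by authentication
    vo: str             # virtual organization the consumer acts within
    operation: str      # e.g. "run_simulation"
    problem_size: int   # illustrative measure of how demanding the request is


@dataclass
class SharingPolicy:
    """Constraints a resource owner attaches to one shared resource."""
    allowed_vos: Set[str] = field(default_factory=set)
    allowed_operations: Set[str] = field(default_factory=set)
    max_problem_size: int = 0

    def authorizes(self, request: Request) -> bool:
        """Authorization: True only if the request meets every constraint."""
        return (
            request.vo in self.allowed_vos
            and request.operation in self.allowed_operations
            and request.problem_size <= self.max_problem_size
        )


def authenticate(token: str, registry: Dict[str, str]) -> Optional[str]:
    """Toy authentication: map a presented token to an identity, or None."""
    return registry.get(token)


# The owner in VO P allows partners to run only simple simulations.
policy = SharingPolicy(
    allowed_vos={"P"},
    allowed_operations={"run_simulation"},
    max_problem_size=100,
)
registry = {"token-alice": "alice"}  # hypothetical credential store

who = authenticate("token-alice", registry)
small = Request(consumer=who, vo="P", operation="run_simulation", problem_size=10)
large = Request(consumer=who, vo="P", operation="run_simulation", problem_size=10_000)

print(policy.authorizes(small))  # True: within the owner's constraints
print(policy.authorizes(large))  # False: the problem is too large
```

The policy is data held by the resource owner, so the owner can tighten or relax it without any change on the consumer's side.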

Sharing relationships are often not simply client-server, but peer-to-peer: providers can be consumers, and sharing relationships can exist among any subset of participants. Sharing relationships may be combined to coordinate use across many resources, each owned by different organizations. For example, in VO Q, a computation started on one pooled computational resource may subsequently access data or initiate subcomputations elsewhere. The ability to delegate authority in controlled ways becomes important in such situations, as do mechanisms for coordinating operations across multiple resources (e.g., co-scheduling).

The same resource may be used in different ways, depending on the restrictions placed on the sharing and the goal of the sharing. For example, a computer may be used only to run a specific piece of software in one sharing arrangement, while it may provide generic compute cycles in another. Because of the lack of a priori knowledge about how a resource may be used, performance metrics, expectations, and limitations (i.e., quality of service) may be part of the conditions placed on resource sharing or usage.

Figure 1: An actual organization can participate in one or more VOs by sharing some or all of its resources. We show three actual organizations (the ovals), and two VOs: P, which links participants in an aerospace design consortium, and Q, which links colleagues who have agreed to share spare computing cycles, for example to run ray tracing computations. The organization on the left participates in P, the one to the right participates in Q, and the third is a member of both P and Q. The policies governing access to resources (summarized in “quotes”) vary according to the actual organizations, resources, and VOs involved.

The Nature of Grid Architecture

The establishment, management, and exploitation of dynamic, cross-organizational VO sharing relationships require new technology. This technology is described in terms of a Grid architecture that identifies fundamental system components, specifies the purpose and function of these components, and indicates how these components interact with one another.

In defining a Grid architecture, we start from the perspective that effective VO operation requires that we be able to establish sharing relationships among any potential participants. Interoperability is thus the central issue to be addressed. In a networked environment, interoperability means common protocols. Hence, our Grid architecture is first and foremost a protocol architecture, with protocols defining the basic mechanisms by which VO users and resources negotiate, establish, manage, and exploit sharing relationships. A standards-based open architecture facilitates extensibility, interoperability, portability, and code sharing; standard protocols make it easy to define standard services that provide enhanced capabilities. We can also construct Application Programming Interfaces (APIs) and Software Development Kits (SDKs) to provide the programming abstractions required to create a usable Grid. Together, this technology and architecture constitute what is often termed middleware, although we avoid that term here due to its vagueness. We discuss each of these points in the following.

Why is interoperability such a fundamental concern? At issue is our need to ensure that sharing relationships can be initiated among arbitrary parties, accommodating new participants dynamically, across different platforms, languages, and programming environments. In this context, mechanisms serve little purpose if they are not defined and implemented so as to be interoperable across organizational boundaries, operational policies, and resource types. Without interoperability, VO applications and participants are forced to enter into bilateral sharing arrangements, as there is no assurance that the mechanisms used between any two parties will extend to any other parties. Without such assurance, dynamic VO formation is all but impossible, and the types of VOs that can be formed are severely limited.

Why are protocols critical to interoperability? A protocol definition specifies how distributed system elements interact with one another in order to achieve a specified behavior, and the structure of the information exchanged during this interaction. This focus on externals (interactions) rather than internals (software, resource characteristics) has important pragmatic benefits. VOs tend to be fluid; hence, the mechanisms used to discover resources, establish identity, determine authorization, and initiate sharing must be flexible and lightweight, so that resource-sharing arrangements can be established and changed quickly.

Because VOs complement rather than replace existing institutions, sharing mechanisms cannot require substantial changes to local policies and must allow individual institutions to maintain ultimate control over their own resources. Since protocols govern the interaction between components, and not the implementation of the components, local control is preserved.

Why are services important? A service is defined solely by the protocol that it speaks and the behaviors that it implements: that is, by the protocol one uses to interact with it and the behavior expected in response to various protocol message exchanges. The definition of standard services (for access to computation, access to data, resource discovery, co-scheduling, data replication, and so forth) allows us to enhance the services offered to VO participants and also to abstract away resource-specific details that would otherwise hinder the development of VO applications.
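The idea that a service is defined by the messages it exchanges, and not by how it is built, can be sketched in Python using structural typing. This is an illustrative sketch under stated assumptions: the message format (dictionaries with a "type" key), the class names, and the query_load helper are all hypothetical and are not part of any Grid specification.

```python
from typing import Protocol


class EnquiryService(Protocol):
    """Anything that answers a 'status' message this way is an enquiry service."""

    def handle(self, message: dict) -> dict:
        ...


class ClusterBackend:
    """Hypothetical provider: a cluster that tracks a queue of jobs."""

    def __init__(self, queued_jobs):
        self._queue = list(queued_jobs)

    def handle(self, message: dict) -> dict:
        if message.get("type") == "status":
            return {"type": "status_reply", "load": len(self._queue)}
        return {"type": "error", "reason": "unsupported message"}


class DesktopPoolBackend:
    """Hypothetical provider: a pool of desktops, implemented quite differently."""

    def __init__(self, busy: int, total: int):
        self._busy = busy
        self._total = total

    def handle(self, message: dict) -> dict:
        if message.get("type") == "status":
            return {"type": "status_reply", "load": self._busy}
        return {"type": "error", "reason": "unsupported message"}


def query_load(service: EnquiryService) -> int:
    """Client code depends only on the message exchange, not on the provider."""
    reply = service.handle({"type": "status"})
    return reply["load"]


print(query_load(ClusterBackend(["job-1", "job-2"])))  # 2
print(query_load(DesktopPoolBackend(busy=3, total=10)))  # 3
```

Neither provider inherits from a shared base class. Each keeps full control over its own implementation, which mirrors the point that protocols preserve local control.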

Grid Architecture Description

The goal of this description of the Grid architecture is not to provide a complete enumeration of all required protocols (and services, APIs, and SDKs) but rather to identify requirements for general classes of components. The result is an extensible, open architectural structure within which solutions to key VO requirements can be placed. Our architecture and the subsequent discussion organize components into layers, as shown in Figure 2.

Figure 2: The layered Grid architecture and its relationship to the Internet protocol architecture. Because the Internet protocol architecture extends from network to application, there is a mapping from Grid layers into Internet layers.

Components within each layer share common characteristics but can build on capabilities and behaviors provided by any lower layer.

The Grid architecture is specified using the principles of the hourglass model. The narrow neck of the hourglass defines a small set of core abstractions and protocols (e.g., TCP and HTTP in the Internet), onto which many different high-level behaviors can be mapped (the top of the hourglass), and which themselves can be mapped onto many different underlying technologies (the base of the hourglass). By definition, the number of protocols defined at the neck must be small. In our architecture, the neck of the hourglass consists of Resource and Connectivity protocols, which facilitate the sharing of individual resources. Protocols at these layers are designed so that they can be implemented on top of a diverse range of resource types, defined at the Fabric layer, and can in turn be used to construct a wide range of global services and application-specific behaviors at the Collective layer, so called because they involve the coordinated (collective) use of multiple resources.

Fabric: Interfaces to Local Control

The Grid Fabric layer provides the resources to which shared access is mediated by Grid protocols: for example, computational resources, storage systems, catalogs, network resources, and sensors. A resource may be a logical entity, such as a distributed file system, computer cluster, or distributed computer pool; in such cases, a resource implementation may involve internal protocols (e.g., the NFS storage access protocol or a cluster resource management system's process management protocol), but these are not the concern of Grid architecture.

Fabric components implement the local, resource-specific operations that occur on specific resources (whether physical or logical) as a result of sharing operations at higher levels. There is thus a tight and subtle interdependence between the functions implemented at the Fabric level, on the one hand, and the sharing operations supported, on the other. Richer Fabric functionality enables more sophisticated sharing operations; at the same time, if we place few demands on Fabric elements, then deployment of Grid infrastructure is simplified. For example, resource-level support for advance reservations makes it possible for higher-level services to aggregate (co-schedule) resources in interesting ways that would otherwise be impossible to achieve. Experience suggests that at a minimum, resources should implement enquiry mechanisms that permit discovery of their structure, state, and capabilities (e.g., whether they support advance reservation) on the one hand, and resource management mechanisms that provide some control of delivered quality of service, on the other.
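To make the link between advance reservation and co-scheduling concrete, the following Python sketch shows a higher-level routine that uses two Fabric-style mechanisms: an enquiry call and a reservation call. It is a toy model under stated assumptions: the Resource class, its method names, integer time slots, and the brute-force search are all hypothetical, and a real co-scheduler would also need to handle concurrent requests and partial failures.

```python
class Resource:
    """Hypothetical Fabric-level resource with enquiry and reservation."""

    def __init__(self, name, supports_reservation=True):
        self.name = name
        self.supports_reservation = supports_reservation
        self._booked = []  # list of (start, end) half-open time intervals

    def describe(self):
        """Enquiry mechanism: report capabilities and current state."""
        return {
            "name": self.name,
            "advance_reservation": self.supports_reservation,
            "reservations": len(self._booked),
        }

    def is_free(self, start, end):
        """True if [start, end) overlaps no existing reservation."""
        return all(end <= s or start >= e for s, e in self._booked)

    def reserve(self, start, end):
        """Management mechanism: book [start, end) if it is still free."""
        if not self.supports_reservation:
            raise RuntimeError(f"{self.name} does not support reservations")
        if not self.is_free(start, end):
            return False
        self._booked.append((start, end))
        return True


def co_schedule(resources, duration, horizon):
    """Reserve the earliest slot in [0, horizon) that is free on every resource.

    Returns the start time, or None if no common slot exists or if any
    resource lacks advance reservation support.
    """
    if not all(r.describe()["advance_reservation"] for r in resources):
        return None
    for start in range(horizon - duration + 1):
        end = start + duration
        if all(r.is_free(start, end) for r in resources):
            for r in resources:
                r.reserve(start, end)
            return start
    return None


a = Resource("cluster-a")
b = Resource("cluster-b")
a.reserve(0, 3)  # cluster-a is busy during [0, 3)
b.reserve(2, 5)  # cluster-b is busy during [2, 5)

print(co_schedule([a, b], duration=2, horizon=10))  # 5

legacy = Resource("legacy-host", supports_reservation=False)
print(co_schedule([a, legacy], duration=2, horizon=10))  # None
```

The second call shows the trade-off described above: without reservation support at the resource level, the higher-level service cannot guarantee a common slot at all.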

The following brief and partial list provides a resource-specific characterization of capabilities.

  • Computational resources: Mechanisms are required for starting programs and for monitoring and controlling the execution of the resulting processes. Management mechanisms that allow control over the resources allocated to processes are useful, as are advance reservation mechanisms. Enquiry functions are needed for determining hardware and software characteristics as well as relevant state information such as current load and queue state in the case of scheduler-managed resources.
  • Storage resources: Mechanisms are required for putting and getting files. Third-party and high-performance (e.g., striped) transfers are useful. So are mechanisms for reading and writing subsets of a file and/or executing remote data selection or reduction functions. Management mechanisms that allow control over the resources allocated to data transfers (space, disk bandwidth, network bandwidth, CPU) are useful, as are advance reservation mechanisms. Enquiry functions are needed for determining hardware and software characteristics as well as relevant load information such as available space and bandwidth utilization.
  • Network resources: Management mechanisms that provide control over the resources allocated to network transfers (e.g., prioritization, reservation) can be useful. Enquiry functions should be provided to determine network characteristics and load.
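As an illustration of the storage capabilities listed above, the following Python sketch models a storage resource with put and get operations, reads of a subset of a file, and an enquiry function for available space. It is a minimal in-memory stand-in: the class name InMemoryStorage and its method names are hypothetical, and features such as third-party or striped transfers are deliberately left out.

```python
class InMemoryStorage:
    """Hypothetical storage resource backed by a dictionary of byte strings."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self._files = {}

    def _used(self):
        return sum(len(data) for data in self._files.values())

    def put(self, name, data):
        """Store a file, refusing the write if space would be exceeded."""
        # Overwriting a file releases the space held by the old version.
        extra = len(data) - len(self._files.get(name, b""))
        if extra > self.capacity - self._used():
            raise OSError("insufficient space")
        self._files[name] = bytes(data)

    def get(self, name, offset=0, length=None):
        """Return a whole file, or only the subset [offset, offset + length)."""
        data = self._files[name]
        end = len(data) if length is None else offset + length
        return data[offset:end]

    def enquire(self):
        """Enquiry function: report capacity, available space, and file count."""
        return {
            "capacity": self.capacity,
            "available": self.capacity - self._used(),
            "files": len(self._files),
        }


store = InMemoryStorage(capacity_bytes=16)
store.put("run1.dat", b"0123456789")

print(store.get("run1.dat"))                      # b'0123456789'
print(store.get("run1.dat", offset=2, length=3))  # b'234'
print(store.enquire()["available"])               # 6
```

A computational or network resource could expose the same pattern: a small set of operations plus an enquiry function that higher layers use to decide where work should go.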

Connectivity: Communicating Easily and Securely