Grid-Ireland OpsCentre

Document prepared by

Dr. Brian Coghlan, Director, Grid-Ireland OpsCentre

Executive Summary

Definitions, Acronyms and Abbreviations

References

1. Grid-Ireland Operations Centre

1.1 Homogeneous Core Infrastructure, Heterogeneous Resources

1.2 Grid Operations Centre

1.3 Deployment Architecture

1.4 Fabric Management

1.5 Transactional Deployment

1.6 Heterogeneity

1.7 Security

1.8 Dissemination

1.9 Grid Management

Executive Summary

Grid-Ireland provides the grid layer for the Irish research community, managed through its OpsCentre at Trinity College Dublin. This document describes the OpsCentre, its evolution, its current status, and its future directions. The OpsCentre is intended to support academic research in Ireland, including international projects. For a roadmap of the Irish e-Infrastructure, including Grid-Ireland, see the “WHITE PAPER ON IRISH E-INFRASTRUCTURE – A Roadmap for Irish Research and Development”.

Definitions, Acronyms and Abbreviations

LCG2: LHC Computing Grid version 2

LHC: Large Hadron Collider

OGSA: Open Grid Services Architecture

VO: Virtual Organization

VOMS: Virtual Organization Membership Service

References

[1] Coghlan, B.A., Walsh, J., Quigley, G., O'Callaghan, D., Childs, S., and Kenny, E. (2004) Transactional Grid Deployment, Cracow Grid Workshop 2004, Cracow, December 2004.

[2] Kenny, E., Coghlan, B.A., Walsh, J., Childs, S., O'Callaghan, D., and Quigley, G. (2004) Testing Heterogeneous Computing Nodes for Grid Computing, Cracow Grid Workshop 2004, Cracow, December 2004.

[3] Kenny, S., and Coghlan, B.A. (2004) Grid-wide Intrusion Detection System, Cracow Grid Workshop 2004, Cracow, December 2004.

[4] Childs, S., Coghlan, B.A., O'Callaghan, D., Quigley, G., and Walsh, J. (2004) Virtual Machines for Testbeds, Cracow Grid Workshop 2004, Cracow, December 2004.

[5] Childs, S., Coghlan, B., O'Callaghan, D., Quigley, G., and Walsh, J. (2005) A single-computer Grid gateway using virtual machines, Proc. AINA'05, Taiwan, March 2005.

[6] Astalos, J., et al. (2005) International Grid CA Interworking, Peer Review and Policy Management through the European DataGrid Certification Authority Coordination Group, Proc. EGC'05, Amsterdam, February 2005.

[7] Coghlan, B.A., Walsh, J., Quigley, G., O'Callaghan, D., Childs, S., and Kenny, E. (2005) Principles of Transactional Grid Deployment, Proc. EGC'05, Amsterdam, February 2005.

[8] Kenny, E., Coghlan, B.A., Walsh, J., Childs, S., O'Callaghan, D., and Quigley, G. (2005) Heterogeneity of Computing Nodes for Grid Computing, Proc. EGC'05, Amsterdam, February 2005.

[9] Childs, S., Coghlan, B.A., O'Callaghan, D., Quigley, G., and Walsh, J. (2005) Deployment of Grid gateways using virtual machines, Proc. EGC'05, Amsterdam, February 2005.

[10] Byrom, R., et al. (2005) Fault tolerance in the R-GMA Information and Monitoring System, Proc. EGC'05, Amsterdam, February 2005.

[11] Kenny, S., and Coghlan, B. (2005) Towards a Grid-wide Intrusion Detection System, Proc. EGC'05, Amsterdam, February 2005.

[12] Coghlan, B.A., Walsh, J., and O'Callaghan, D. (2005) Grid-Ireland Deployment Architecture, Proc. EGC'05, Amsterdam, February 2005.

[13] Byrom, R., et al. (2005) R-GMA: a scalable information and monitoring system, Grid Journal, 2005.

[14] eInfrastructures Open Workshop (Internet and Grids): The New Foundation for Knowledge-Based Societies, Royal Irish Academy, Dublin, April 2004.

[15] Childs, S., et al. (2006) A virtual TestGrid, or how to replicate a national Grid, Proc. ExpGrid Workshop, HPDC 15, 2006.

[16] Childs, S., and Coghlan, B. (2006) How to join the virtual revolution, CERN Courier, Vol. 46, No. 5, p. 58, June 2006.

1. Grid-Ireland Operations Centre

Grid-Ireland provides grid services above the Irish research network to allow researchers to share Irish computing and storage resources using a common interface. It also provides for international collaborations by linking Irish sites into the European grid infrastructures developed under EU projects such as EDG, EGEE, LCG, CrossGrid and the Interactive European Grid (int.eu.grid). Grid-Ireland currently encompasses eighteen sites, with an Operations Centre in TCD. The grid infrastructure is currently based on a derivative of the LCG middleware, the common foundation that ensures interoperability between participating scientific computing centres around the world. Internationally, the Grid-Ireland OpsCentre is heavily involved in the EU EGEE-II and int.eu.grid projects. The Grid Manager is a member of the UK GridPP Deployment Board, and there are strong links to UK e-Science through the Belfast e-Science Centre (BeSC), the UK National e-Science Centre in Edinburgh (NeSC), and Rutherford Appleton Laboratory. The OpsCentre is the EGEE Regional Operations Centre for Ireland. HEAnet, the Irish national research and education network (NREN), substantially assists Grid-Ireland.

1.1 Homogeneous Core Infrastructure, Heterogeneous Resources

Grid-Ireland is unusual in its integrated infrastructure, and the stress that is laid on homogeneity of its core. For a detailed description of the deployment architecture, see [12]. Below is a synopsis. There are three primary motivations:

  • To minimize the demand on human resources by minimizing the proportion of the software that needs to be ported to new systems that institutions (sites) might purchase. Only do what’s necessary. The simplest component of most grid software frameworks is that relating to the worker nodes, i.e. the machines at sites that actually execute users’ jobs. Only port worker-node software.
  • To minimize the demand on human resources by maximizing the proportion of the software that does not need to be ported. Don’t do what’s not necessary. Thus all non-worker-node components should use the reference port. This implies: (a) that there should be a core infrastructure, separate from but connecting the worker nodes at sites; (b) that the core infrastructure should be homogeneous, using the reference port; and (c) that no non-worker-node software should be ported.
  • To maximize the availability of the infrastructure. Maximize the efficiency of what you do. Grid-Ireland has designed a transactional deployment system to achieve this [1][7]; this unique facility is explained further below. It requires that the core infrastructure be homogeneous and centrally managed.

Thus the core infrastructure should be homogeneous, and also centrally managed from an OpsCentre. The OpsCentre should deploy the reference software to the core infrastructure, and should port worker node software to new systems that sites might purchase. The deployment should be automated to maximize efficiency.

The motto Homogeneous Core Infrastructure, Heterogeneous Resources is used to encapsulate the explicit and implicit principles:

  • Explicit homogeneous core infrastructure;
  • Implicit centralized control via remote management;
  • Implicit decoupling of grid infrastructure and site management.

1.2 Grid Operations Centre

Establishment of Grid-Ireland began in 1999 with funding from Enterprise Ireland, and its architecture has been designed with the above approach since mid-2001. Funding was sought and eventually granted, a senior Grid Manager and Deputy were appointed, a Grid Operations Centre was established and staffed, infrastructure was specified, purchased and installed, and finally middleware and management tools were deployed. The OpsCentre considers that the use of these principles has been highly beneficial.

Figure 1: csTCDie Site

The Operations Centre resources are illustrated in Figure 1, although this diagram is a little out-of-date. The csTCDie grid site, which now includes the gateway, 96 CPUs and a 4TB disk farm for the various grid testbeds, is incorporated into the OpsCentre. The OpsCentre is not in the business of HPC, so the csTCDie resources are representative only: contributions that enable TCD and the OpsCentre to participate in national and international grid projects. Nevertheless, the contribution of compute power, particularly to international particle physics and biomedical research, has been significant. The OpsCentre itself includes approximately 16 national servers, a certification TestGrid and an e-Learning infrastructure.

The national servers provide the central services that coordinate activity across Grid-Ireland. Two central information systems, one hierarchical, the other relational, aggregate grid status across the island. Several security systems provide grid-wide system security, certification, authentication, proxying, authorization, accounting and logging. A central replica catalog assists data management. Several servers combine to enable a broker to automatically allocate resources to jobs. A central repository maintains tagged versions of all the software. A grid-enabled webserver supports numerous web publishing activities, including the Grid-Ireland website and Wiki.

Figure 2: TestGrid

The TestGrid (see Figure 2), which includes approximately 40 machines and a 4TB disk farm, originally served multiple purposes: it implemented a fully working replica of the national servers and sites; it permitted experimentation without affecting the national services; it acted as the testing and certification platform, allowing software problems to be uncovered before deployment; and it acted as a non-reference porting platform. Virtual machine technology was harnessed to allow replicas of sites using the significant but nonetheless limited number of machines [4]. This TestGrid [15] has proven to be an extremely valuable resource, enabling activities that others, even in large organisations like CERN, find difficult to support. Usage is now such that it too has recently been replicated to form separate infrastructures for experimentation, certification, porting and e-Learning.

Figure 3 shows a photograph of the OpsCentre systems. The e-Learning and certification infrastructures share a cluster of 32 CPUs with a large aggregate memory and disk capacity (128GB and 8TB) at the bottom of rack 2, and both exploit virtualization. The e-Learning infrastructure is part of a concerted push to exploit e-Learning for grid-related teaching, training and dissemination.

Figure 3: OpsCentre Racks

1.3 Deployment Architecture

As stated above, Grid-Ireland has installed a grid computing infrastructure that is fully homogeneous at its core and centrally managed from Trinity College Dublin. Each site connects via a grid gateway. Each gateway is composed of a set of seven machines: a firewall, an install server, a compute element (CE), a storage element (SE), a user interface machine (UI), a worker node that is used for gateway tests only, and optionally a network monitor. All the sites are identically configured.
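For illustration, this standard gateway layout can be captured in a few lines of code. The Python sketch below is purely illustrative, paraphrasing the roles listed above into a hypothetical structure; it is not Grid-Ireland's actual configuration schema.

    # Illustrative sketch of the standard Grid-Ireland gateway layout.
    # The Gateway class and all node names are hypothetical.
    from dataclasses import dataclass, field

    @dataclass
    class Gateway:
        site: str
        nodes: dict = field(default_factory=lambda: {
            "firewall": "perimeter filtering for the gateway",
            "install":  "network install server for the other nodes",
            "ce":       "compute element: entry point for grid jobs",
            "se":       "storage element: scratch storage for grid jobs",
            "ui":       "user interface: job submission point for site users",
            "testwn":   "worker node reserved for gateway self-tests",
            "netmon":   "optional network monitor",
        })

    # All sites are identically configured, so describing a new gateway
    # is just a matter of naming the site.
    if __name__ == "__main__":
        gw = Gateway(site="csTCDie")
        for name, role in gw.nodes.items():
            print(f"{gw.site}-{name}: {role}")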

As can be seen from Figure 4, the site resources (shown as a cluster of worker nodes) are outside the domain of the gateway; these resources belong to the site, and each site always remains in charge of its own resources. One of the key departures from the deployment structures commonly used in Europe is to allow those resources to be heterogeneous with respect to the gateways. As explained in Section 1.6, Grid-Ireland is attempting to provide ported code for any potential platform that will be used for worker nodes (WNs).

Figure 4: Grid-Ireland Gateway with attached cluster

The only requirement on a site is that its worker nodes be set up to cater for data- and computation-intensive tasks by installing the worker-node software outlined in Section 1.6. This includes both the replica management software from EGEE and the various versions of MPI from CrossGrid and int.eu.grid. A site submits jobs via the gateway UI, whilst the gateway CE exports core grid services to the site and queues job submissions to the site resource management system, and its SE provides scratch storage for grid jobs. The gateway is thus both a client and a server of the site.
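To make the submission path concrete, the hedged sketch below writes a minimal job description and hands it to the LCG2-era edg-job-submit command from a UI machine. The JDL contents are illustrative only; a real job would declare its own executable, sandboxes and requirements.

    # Minimal sketch of job submission from a gateway UI. edg-job-submit
    # forwards the job to the resource broker, which matches it to a CE;
    # the CE then queues it to the site resource management system.
    import subprocess
    import tempfile

    # A trivial illustrative job description (JDL).
    JDL = (
        'Executable    = "/bin/hostname";\n'
        'StdOutput     = "std.out";\n'
        'StdError      = "std.err";\n'
        'OutputSandbox = {"std.out", "std.err"};\n'
    )

    def submit(jdl_text: str) -> None:
        # Write the JDL to a temporary file and submit it.
        with tempfile.NamedTemporaryFile("w", suffix=".jdl", delete=False) as f:
            f.write(jdl_text)
            path = f.name
        subprocess.run(["edg-job-submit", path], check=True)

    if __name__ == "__main__":
        submit(JDL)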

Grid-Ireland has specified its gateways to ensure minimal divergence from standard site configuration, minimal hardware and space costs per site, and minimal personnel costs per site. The basic technology for this is the use of virtual machines. Currently there are three physical realisations of this architecture. At minimum, a generic Grid-Ireland gateway comprises a single physical machine, a switch, and an uninterruptible power supply. The machine runs its own operating system (OS) plus a number of virtual machines, each of which appears to be a normal machine both to the host OS and to external users. The Linux OS and grid services are remotely installed, upgraded and managed by the Operations Centre, without needing any attention at the site. The firewall and install server run concurrently on the host operating system; all other servers are hosted as virtual machines. Eleven gateways of this type have been deployed. A second variant, with more performance, memory and disk, is deployed at ICHEC. For more demanding sites the functionality is spread over four physical machines; six such gateways are already deployed.
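Grid-Ireland's published gateway work [4][5][16] is based on Xen-style virtualization. Purely as an illustration, and assuming Xen (whose domain configuration files are themselves Python syntax), the configuration of one hosted service node might resemble the following; every path and name here is hypothetical.

    # Illustrative Xen-style domain configuration for one gateway service
    # node (here the CE). All values are hypothetical.
    kernel = "/boot/vmlinuz-2.6-xenU"          # guest kernel image
    memory = 256                               # MB allotted to this VM
    name   = "gridgate-ce"                     # domain name
    disk   = ["file:/var/xen/ce.img,sda1,w"]   # file-backed root disk
    vif    = ["bridge=xenbr0"]                 # bridged onto the gateway switch
    root   = "/dev/sda1 ro"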

1.4 Fabric Management

Apart from the firewall, all servers on the gateways are remotely installed over the network from the install server, currently using sophisticated fabric management tools. The install server itself is manually installed, but thereafter it is updated with new tagged releases from the central Grid-Ireland CVS repository in accordance with a hierarchy of configuration profiles. Installation or upgrade of the other gateway nodes takes place over the network from the install server as an automated network-boot and build process not unlike installation from a CD. Elsewhere it is usual for grid middleware installation to be a manual process, and whilst it takes place the site is essentially in an inconsistent state. Grid-Ireland, however, has designed an automatic process that is specifically intended to reduce inconsistency to a minimum.
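A minimal sketch of the install-server update step follows, assuming a hypothetical repository location, module name and deploy command: the install server checks a tagged release out of CVS and then regenerates the profiles it serves to the gateway nodes. The real system performs the second step with its fabric management tooling.

    # Hedged sketch of updating an install server to a tagged release.
    # The CVSROOT, module name and deploy command are hypothetical.
    import subprocess

    CVSROOT = ":pserver:deploy@cvs.example.grid-ireland.org:/cvs/grid"

    def update_install_server(tag: str) -> None:
        # Check out the configuration profiles and software at the tag.
        subprocess.run(
            ["cvs", "-d", CVSROOT, "checkout", "-r", tag, "site-profiles"],
            check=True,
        )
        # Placeholder for the fabric management step that compiles the
        # profiles and serves them to the gateway nodes.
        subprocess.run(["make", "-C", "site-profiles", "deploy"], check=True)

    if __name__ == "__main__":
        update_install_server("RELEASE_2006_07")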

1.5 Transactional Deployment

Sometimes a new release of some infrastructural grid software is incompatible with the previous release. Once a certain proportion of the sites in a grid infrastructure are no longer consistent with the new release, the infrastructure as a whole can be considered inconsistent and proper operation can no longer be guaranteed. Maximizing availability therefore requires either that the time between releases be maximized or that the time to consistency be minimized. The former is beyond the control of those who manage the infrastructure. If, on the other hand, the latter can be minimized to a single, short action across the entire infrastructure, then availability will indeed be maximized.

However, an upgrade to a new release may not be a short operation. To compensate, the upgrade may be split into a variable-duration prepare phase and a short-duration upgrade phase, that is, a two-phase commit. If the entire infrastructure is to be consistent after the upgrade, then the two-phase commit must succeed at all sites. If it fails at even one site, the upgrade must be aborted and the infrastructure returned to the same state as before the upgrade was attempted, so that the upgrade process appears as an atomic action that either succeeds or fails. Very few upgrades comprise a single action; most are composed of multiple subactions. For such an upgrade to appear atomic it must exhibit transactional behaviour, with all subactions succeeding or all failing, so that the infrastructure is never left in an undefined state. Thus maximizing availability requires that an upgrade be implemented as a two-phase transaction.
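The control flow can be summarised in a few lines of code. The Site interface below is hypothetical, but the structure is the point: any failure during the long-running prepare phase rolls every site back, and only the short commit phase separates the old release from the new one, so the infrastructure is never left in an undefined state.

    # Sketch of a two-phase (prepare/commit) deployment transaction.
    # The Site interface is hypothetical.
    from typing import Iterable, Protocol

    class Site(Protocol):
        def prepare(self, release: str) -> bool: ...  # long: stage and verify
        def commit(self) -> None: ...                 # short: switch release
        def abort(self) -> None: ...                  # discard staged release

    def transactional_upgrade(sites: Iterable[Site], release: str) -> bool:
        sites = list(sites)
        prepared = []
        # Phase 1: every site stages the release; one failure vetoes all.
        for site in sites:
            if site.prepare(release):
                prepared.append(site)
            else:
                for p in prepared:
                    p.abort()   # roll back, leaving the old release live
                return False
        # Phase 2: the short commit runs everywhere, minimising the window
        # of infrastructure-wide inconsistency.
        for site in sites:
            site.commit()
        return True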

Figure 5: Transactional Deployment GUI

Grid-Ireland has implemented transactional deployment [1][7]. A central repository holds both the software to be deployed and the transaction logic. Install servers at the sites hold configuration data plus a local copy of the software, which each install server uses to maintain the configuration of the client nodes at its site. Deployment is then an automated, totally repeatable, push-button upgrade process (see Figure 5), effectively eliminating operator error. Perhaps the most important benefit of this system is the ease it brings to deployment, making it a major tool for achieving deployment efficiency. Together with automated fabric management, it is a major bonus when very few staff must maintain a complete national grid infrastructure.

1.6 Heterogeneity

As explained above, the architecture of the Grid-Ireland infrastructure is based on the principle that a homogeneous core infrastructure allows the attachment of heterogeneous resources [12]. Substantial effort has been committed to porting the middleware to heterogeneous platforms [2][8], and the expertise of the Grid-Ireland Operations Centre in this area is now widely recognised. It is the primary non-reference repository site for EGEE, and it co-ordinates the porting activities across Europe. It has ported the middleware to Red Hat Linux 9, Fedora Core 2, 3 and 4, CentOS, SuSE 9.x and IRIX. Further ports to EM64T, Solaris, AIX and Mac OS X are underway; see Figure 6. Both IBM and Apple have contributed hardware. The entire software suite is autobuilt nightly, or on demand, from source code in the central CVS repository. Without this effort neither the UCD nor the ICHEC clusters could be connected. There has been close interaction with middleware developers at CERN and INFN.
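The nightly autobuild might be sketched as below, assuming hypothetical build hosts, module name and build command: the same source is checked out and built on every target platform, with failures recorded per platform rather than aborting the run.

    # Illustrative sketch of a nightly multi-platform autobuild loop.
    # Host names, module name and build command are hypothetical.
    import subprocess

    PLATFORMS = {
        "fedora-core-4": "build-fc4.example.org",
        "centos-3":      "build-centos3.example.org",
        "suse-9":        "build-suse9.example.org",
        "irix-6.5":      "build-irix.example.org",
    }

    def autobuild(module: str = "grid-middleware") -> dict:
        results = {}
        for platform, host in PLATFORMS.items():
            # Each build host checks out the same source and builds it;
            # a broken port is recorded without hiding the other results.
            proc = subprocess.run(
                ["ssh", host, f"cvs checkout {module} && make -C {module}"]
            )
            results[platform] = (proc.returncode == 0)
        return results

    if __name__ == "__main__":
        for platform, ok in autobuild().items():
            print(f"{platform}: {'ok' if ok else 'FAILED'}")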

Figure 6: Porting Status on 24th July, 2006

1.7 Security

A major emphasis has been placed on sharing of the infrastructure and on system and grid security. The Grid-Ireland Certification Authority [6] has issued over 500 grid certificates, and VO Managers have been appointed for the CosmoGrid, WebCom-G, MarineGrid, GeneGrid and EireVO virtual organizations. The Certification Authority is a founder member of the EU Grid Policy Management Authority (EUGridPMA), which is itself a member of the International Grid Trust Federation.

Software has been developed to detect ad-hoc or coordinated security intrusions, analyse them and generate alerts (500,000 alerts since mid-2005) [3][11]. Further software has been developed to check grid and system services (15,000 checks per day), analyse the results and generate automated alerts where necessary. Yet further monitoring software tests the infrastructure to considerable depth every 12 hours (see Figure 7), and similar software tests the whole European infrastructure from CERN on the same schedule.
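A much-simplified sketch of this style of automated service check follows; the endpoints and the alert action are hypothetical, with a plain TCP connect standing in for the real, far deeper, service probes.

    # Hedged sketch of a periodic grid service check: probe each endpoint
    # and raise an alert on failure. Endpoints and ports are hypothetical
    # (2119 is the classic Globus gatekeeper port).
    import socket

    SERVICES = [
        ("gridgate-ce.example.org", 2119),
        ("gridgate-se.example.org", 8443),
    ]

    def reachable(host: str, port: int, timeout: float = 5.0) -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def run_checks() -> None:
        for host, port in SERVICES:
            if not reachable(host, port):
                # The real system would log and dispatch an alert here.
                print(f"ALERT: {host}:{port} unreachable")

    if __name__ == "__main__":
        run_checks()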

Figure 7: Site Functional Tests

1.8 Dissemination

During the last 12 months the OpsCentre has been developing infrastructure for grid training courseware. Current courses are very labour intensive. For example, a Grid-Ireland Site Manager's Course was held in Dublin in September 2004, followed by a Grid-Ireland User's Course in Dublin in December 2004 (actually for instructors of future User's Courses), Grid-Ireland User's Courses (for science users of the grid) in Dublin in January and December 2005, a VO Managers' course in December 2005, and a Grid/MPI course in March 2006. These divert scarce manpower away from management and development of the infrastructure, so the OpsCentre has been developing an online adaptive courseware infrastructure, together with the virtual machine technology to allow online use of a Tutorial Grid (the e-Learning infrastructure) in conjunction with the courses. The actual courses will use the EGEE manuals and presentation-ware as input to the surprisingly complex adaptivity process. The e-Learning courses are eagerly awaited. An optional 4th Year undergraduate course that covers grid computing has been taught in TCD for five years, and a 3rd Year undergraduate course incorporating this e-Learning was added in October 2006.