Task 3.3 Grid Monitoring M15 Deliverable

WP3.3 Grid Monitoring

M15 Deliverable

Software Evaluation and Testing

WP3 New Grid Services and Tools

Document Filename: / CG3.3-D3.4-v1.4-TCD016-M15Deliverable.doc
Work package: / WP3 New Grid Services and Tools
Partner(s): / TCD, CYFRONET, ICM
Lead Partner: / TCD
Config ID: / CG3.3-D3.4-v1.4-TCD016-M15Deliverable
Document classification: / Confidential
Abstract: This document is an internal progress report on WP3, Task 3.3 grid monitoring, software evaluation and testing.
Delivery Slip
Name / Partner / Date / Signature
From / WP3, Subtask 3.3 / TCD / May 30th, 2003 / Brian Coghlan
Verified by
Approved by
Document Log
Version / Date / Summary of changes / Author
1-0 / 10/04/2003 / Draft version / Bartosz Baliś, Brian Coghlan, Stuart Kenny, Krzysztof Nawrocki, Adam Padee, Marcin Radecki, Tomasz Szepieniec, Slawomir Zielinski
1-1 / 02/05/2003 / Draft version. Updated after TAT review. / Bartosz Baliś, Brian Coghlan, Stuart Kenny, Krzysztof Nawrocki, Adam Padee, Marcin Radecki, Tomasz Szepieniec, Slawomir Zielinski
1-2 / 08/05/2003 / Draft version. Updated after further comments. / Bartosz Baliś, Brian Coghlan, Stuart Kenny, Krzysztof Nawrocki, Adam Padee, Marcin Radecki, Tomasz Szepieniec, Slawomir Zielinski
1-3 / 26/05/2003 / Draft version. Updated risk assessment. / Bartosz Baliś, Brian Coghlan, Stuart Kenny, Krzysztof Nawrocki, Adam Padee, Marcin Radecki, Tomasz Szepieniec, Slawomir Zielinski
1-4 / 30/05/2003 / Final version. Updated after internal review. / Bartosz Baliś, Brian Coghlan, Stuart Kenny, Krzysztof Nawrocki, Adam Padee, Marcin Radecki, Tomasz Szepieniec, Slawomir Zielinski


Contents

1 REFERENCES

2 EXECUTIVE SUMMARY

3 INTRODUCTION

3.1 Definitions, Abbreviations, Acronyms

4 STATE OF THE ART

4.1 Applications Monitoring

4.1.1 OCM-G

4.2 Instruments, Infrastructure, Derived Results

4.2.1 Instruments: SANTA-G

4.2.2 Infrastructure: JIMS

4.2.3 Derived Results: Postprocessing

5 CONTRIBUTIONS TO GRID TECHNOLOGY

5.1 Applications Monitoring

5.1.1 OCM-G

5.2 Instruments, Infrastructure, Derived Results

5.2.1 Instruments: SANTA-G

5.2.2 Infrastructure: JIMS

5.2.3 Derived Results: Postprocessing

6 BRIEF DESCRIPTION OF THE SOFTWARE

6.1 Application Monitoring

6.1.1 OCM-G

6.2 Instruments, Infrastructure, Derived Results

6.2.1 Instruments: SANTA-G

6.2.2 Infrastructure: JIMS

6.2.3 Derived Results: Postprocessing

7 AIMS, TESTS AND EVALUATION, NEW REQUIREMENTS

7.1 Applications Monitoring

7.1.1 OCM-G

7.2 Instruments, Infrastructure, Derived Results

7.2.1 Instruments: SANTA-G

7.2.2 Infrastructure: JIMS

7.2.3 Derived Results: Postprocessing

8 RESULTS OF THE TESTS AND EVALUATION

8.1 Applications Monitoring

8.1.1 OCM-G

8.2 Instruments, Infrastructure, Derived Results

8.2.1 Instruments: SANTA-G

8.2.2 Infrastructure: JIMS

8.2.3 Derived Results: Postprocessing

9 PROBLEMS AND ISSUES

9.1 Applications Monitoring

9.1.1 OCM-G

9.2 Instruments, Infrastructure, Derived Results

9.2.1 Instruments: SANTA-G

9.2.2 Infrastructure: JIMS

9.2.3 Derived Results: Postprocessing

10 FUTURE PLANS

10.1 Applications Monitoring

10.1.1 OCM-G

10.2 Instruments, Infrastructure, Derived Results

10.2.1 Instruments: SANTA-G

10.2.2 Infrastructure: JIMS

10.2.3 Derived Results: Postprocessing

11 TOWARDS OGSA

11.1 Applications Monitoring

11.1.1 OCM-G

11.2 Instruments, Infrastructure, Derived Results

11.2.1 Instruments: SANTA-G

11.2.2 Infrastructure: JIMS

11.2.3 Derived Results: Postprocessing

12 RISK ASSESSMENT

12.1 Applications Monitoring

12.1.1 OCM-G

12.2 Instruments, Infrastructure, Derived Results

12.2.1 Instruments: SANTA-G

12.2.2 Infrastructure: JIMS

12.2.3 Derived Results: Postprocessing

13 CONCLUDING REMARKS

14 APPENDIX A

1  REFERENCES

Bal2000 Z. Balaton, P. Kacsuk, N. Podhorszki, F. Vajda, Comparison of Representative Grid Monitoring Tools, http://www.lpds.sztaki.hu/publications/reports/lpds-2-2000.pdf

CrossGrid CrossGrid Project Technical Annex,

http://www.eu-crossgrid.org/CrossGridAnnex1_v31.pdf

DataGrid DataGrid Project Technical Annex DataGridPart_B_V2_51.doc

Ganglia http://ganglia.sourceforge.net/

GRIDLAB GridLab Project Home Page

http://www.gridlab.org

GRADS Grid Application Development Software Project Home Page http://nhse2.cs.rice.edu/grads/

Jiro http://www.jiro.org/

Jiro D3.3 Jiro Based Grid Infrastructure Monitoring System, First Prototype Description, part of D3.3

http://www.eu-crossgrid.org/Deliverables/M12pdf/CG3.3.3-CYF-D3.3-v1.1-Jiro.pdf

Jiro D3.2 Jiro Software Design Document

http://tulip.man.poznan.pl/doc/dd-04-09-2002/3.3/CG3.3.3-D3.2-v1.1-CYF022-JiroDesign.pdf

Jiro Tech Jiro Technology Installation and Configuration Guide, © Sun Microsystems

JMX Spec Java Management Extension Specification, © Sun Microsystems,

http://java.sun.com/products/JavaManagement/

KaTools P. Augerat, C. Martin, B. Stein, Scalable monitoring and configuration tools for grids and clusters, http://ka-tools.sourceforge.net/publications/ka-admin-pdp02.pdf

NWS http://nws.cs.ucsb.edu/

OCM A Monitoring System for Interoperable Tools

http://wwwbode.cs.tum.edu/~omis/Docs/spdt98.ps.gz

OCMGD3.3 Description of the OCM-G first prototype, part of CrossGrid deliverable 3.3.

http://www.eu-crossgrid.org/Deliverables/M12pdf/CG-3.3.1-CYF-D3.3-v1.0-OCM-G.pdf

OGSA The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration. I. Foster, C. Kesselman, J. Nick, S. Tuecke, January 2002.

http://www.globus.org/research/papers/ogsa.pdf

OGSA-DAI http://www.ogsadai.org.uk/

OMIS OMIS – On-line Monitoring Interface Specification. Version 2.0. Lehrstuhl für Rechnertechnik und Rechnerorganisation Institut für Informatik (LRR-TUM), Technische Universität München.

http://wwwbode.informatik.tu-muenchen.de/~omis/

RFC 1213 K. McCloghrie, M.Rose (ed.), Management Information Base for Network Management of TCP/IP-based Internets: MIB-II

RGIS-RG Relational Grid Information Services Research Group

http://hepunx.rl.ac.uk/ggf/rgis-rg/

R-GMA R-GMA: A Relational Grid Information and Monitoring System

http://hepunx.rl.ac.uk/edg/wp3/documentation

R-GMA ARCH DataGrid Project Deliverable 3.2 DataGrid-03-D3.2-0101-1-0

http://hepunx.rl.ac.uk/edg/wp3/documentation/

RGMA-OGSA A presentation describing the R-GMA transformation to web-services

https://edms.cern.ch/file/353606/1/GGF5-2.pdf

SANTA Prototype Trace Probe and Probe Adaptor and Prototype Trace Software, http://www.cs.tcd.ie/Brian.Coghlan/scieuro/scieuro.html, Esprit Project 25257, SCI Europe, Deliverables, Trinity College Dublin, 1999

SCITRAC B. Skalli, I. Birkeli, B. Nossum, and D. Wormald, Scitrac – an lsa preprocessor for sci link tracing, in Scalable Coherent Interface: Technology and Application, 1998.

SMiLE Jie Tao, A low-level Software Infrastructure for the SMiLE Monitoring Approach

http://wwwbode.cs.tum.edu/archive/smile/scieur2001/scieuro2001-s2-p2.pdf

SNORT Open-source network intrusion detection system, http://www.snort.org/

Spitfire Grid enabled middleware service for access to relational databases, http://spitfire.web.cern.ch/Spitfire

Task3.3 SRS Task3.3 Grid Monitoring Software Requirements Specification

Task3.3-SRS.pdf

http://www.eu-crossgrid.org/Deliverables/M3pdf/Task3.3-SRS.pdf

Task2.4 SRS Task2.4 Interactive and semiautomatic performance evaluation tools

CG-2.4-DOC-CYFRONET001-SRS.pdf

http://www.eu-crossgrid.org/Deliverables/M3pdf/CG-2.4-DOC-CYFRONET001-SRS.pdf

TCPDump http://www.tcpdump.org/

TEK http://www.tek.com

TopoMon M. den Burger, T. Kielmann, H.E. Bal, TopoMon: A Monitoring Tool for Grid Network Topology, http://www.gridlab.org/Resources/Papers/iccs02.pdf

WP3 Inst WP3 Installation Guide

http://alpha.ific.uv.es/~sgonzale/integration/WP3-Installation-guide.doc

2  EXECUTIVE SUMMARY

This document is the month 15 deliverable for Task 3.3, Grid Monitoring. It forms part of an internal progress report on WP3 software evaluation and testing. Section 3 is the introduction, and also provides definitions, abbreviations and acronyms. Section 4 describes the current state of the art, while Section 5 describes the contribution of Task 3.3 to current grid technology. Section 6 provides a brief description of the software. The aims of the prototypes and the tests are described in Section 7, and the results of these tests are given in Section 8. Problems and issues discovered with the prototypes are described in Section 9. The plans for Task 3.3 for the next year are detailed in Section 10. Section 11 describes how Task 3.3 plans to prepare for OGSA. A detailed risk analysis for the task is given in Section 12. Section 13 contains concluding remarks.

3  INTRODUCTION

According to the technical annex [CrossGrid], Task 3.3 was to develop a prototype infrastructure for monitoring-related activities, supporting both the automatic extraction of high-level performance properties and tools for performance analysis. The aim of the Grid monitoring facilities provided by this task is to underpin the functionality of various tools by delivering the low-level data they require. During the design phase the guidelines were system scalability, flexibility and ease of configuration. These remain the guidelines during the implementation phase.

For the Grid Monitoring task we deliberately restricted our scope. We chose both to add new services (for applications monitoring) and to extend existing services (for monitoring instruments and infrastructure, and for generating derived results); see Figure 3.1. Analysis showed that applications monitoring differs substantially from monitoring infrastructure and instruments, so two separate approaches were needed:

·  Infrastructure monitoring collects static and dynamic information about Grid components, such as hosts or network connections; this information is indispensable for basic Grid activities such as resource allocation or load balancing; this type of information often has not only immediate but also historic value, so it is often stored in a database for later analysis (e.g., statistical analysis or forecasting),

·  Application monitoring aims at observing a particular execution of an application; the collected data is useful for tools for application development support, which are used to detect bugs, bottlenecks or just visualize the application's behaviour; this kind of information in principle does not have historic value – it is meaningful only in the context of a particular execution.

Figure 3.1: The Grid Monitoring System

For applications monitoring we chose to:

·  proceed in an OCM-compliant route, for compatibility with WP2

·  implement direct low-latency communications

For extending existing grid monitoring services we chose to:

·  deliberately avoid a hierarchical (e.g. LDAP) approach

·  proceed in an OGSA-compliant route, using R-GMA as the interim web-based service

·  implement instrument monitoring as part of R-GMA

·  implement infrastructure monitoring using Jiro tools that already support infrastructure protocols and dynamic system configuration, with import/export from/to other services

·  explore the potential of Jiro techniques for higher levels of grid monitoring

·  consider results derived from monitoring data as just another stream of monitoring data

·  produce data for external consumption via a single web-based API (initially R-GMA)

·  closely monitor OGSA developments
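The choice to treat derived results as just another stream of monitoring data, exposed through a single relational interface, can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for a relational view; the table layout, the stream names and the `publish` and `query` helpers are assumptions made for illustration only and are not the R-GMA API.

```python
import sqlite3

# In-memory relational store standing in for a relational monitoring view.
# The schema is hypothetical, not taken from R-GMA.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE monitoring (stream TEXT, host TEXT, ts INTEGER, value REAL)"
)


def publish(stream, host, ts, value):
    """Producer side: every stream, raw or derived, is published the same way."""
    conn.execute(
        "INSERT INTO monitoring VALUES (?, ?, ?, ?)", (stream, host, ts, value)
    )


def query(stream, host):
    """Consumer side: one query path serves every stream."""
    cur = conn.execute(
        "SELECT ts, value FROM monitoring WHERE stream = ? AND host = ? ORDER BY ts",
        (stream, host),
    )
    return cur.fetchall()


# Raw infrastructure data (made-up sample values).
for ts, load in enumerate([0.2, 0.4, 0.9]):
    publish("cpu_load", "node01", ts, load)

# A derived result is computed from the raw stream and then published back
# as an ordinary stream under its own name.
raw = query("cpu_load", "node01")
mean_load = sum(value for _, value in raw) / len(raw)
publish("cpu_load_mean", "node01", raw[-1][0], mean_load)

derived = query("cpu_load_mean", "node01")
```

In this sketch a consumer cannot tell from the interface whether a stream is raw or derived, which is the property the design choice aims for.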

3.1  Definitions, Abbreviations, Acronyms

CrossGrid The EU CrossGrid Project IST-2001-32243

DataGrid The EU DataGrid Project IST-2000-25182

FMA Federated Management Architecture, defined by Sun Microsystems

GUI Graphical User Interface

JIMS Jiro/JMX-based Grid Infrastructure Monitoring System

Jiro Sun Jiro, implementation of the FMA specification

LDAP Lightweight Directory Access Protocol

LDAP DIT LDAP Directory Information tree

MDS Monitoring and Discovery Service

NWS Network Weather Service

OCM OMIS-Compliant Monitor

OCM-G Grid-enabled OMIS-Compliant Monitor

OGSA Open Grid Services Architecture

OGSA-DAI OGSA Data Access and Integration

OMIS On-line Monitoring Interface Specification

RDBMS Relational Database Management System

RGIS-RG Relational Grid Information Services – Research Group

R-GMA DataGrid relational Grid monitoring architecture

SANTA System Area Network Trace Analysis

SANTA-G Grid-enabled System Area Network Trace Analysis

SMiLE Shared Memory in a LAN-like Environment

SNORT Open source network intrusion detection system

SOAP Simple Object Access Protocol

SRS Software Requirements Specification

WSDL Web Services Description Language

4  STATE OF THE ART

4.1  Applications Monitoring

4.1.1  OCM-G

This section gives an overview of three grid application monitoring approaches currently under development that are similar to the OCM-G approach, and compares each of them with our own OMIS/OCM-G-based approach. The projects and systems considered are GrADS (Autopilot), GridLab and DataGrid (GRM).

·  GrADS

The Grid Application Development Software (GrADS) project develops a software architecture designed to support application adaptation and performance monitoring. The GrADSoft architecture (a program preparation and execution system) replaces the discrete steps of application creation, compilation, execution, and post-mortem analysis with a continuous process of adapting applications to both a changing Grid and a specific problem instance.

The GrADSoft monitoring infrastructure is based on Autopilot, a toolkit for application and resource monitoring and control. The monitoring is based on sensors, which may be placed directly in the source code or embedded in the application library. The sensors register with the Autopilot Manager and can then be accessed by sensor clients to collect information. The clients can be located anywhere on the Grid. The behavior of the application is modified by executing actuators, which the user implements in the source code.
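The sensor, manager, client and actuator roles described above can be sketched as follows. This is a minimal illustration of the pattern only: the class and method names (`Manager`, `SensorClient`, `register_sensor` and so on) are hypothetical and do not reproduce the Autopilot API, and everything runs in one process rather than across the Grid.

```python
from typing import Callable, Dict


class Manager:
    """Hypothetical registry standing in for the manager that sensors register with."""

    def __init__(self) -> None:
        self.sensors: Dict[str, Callable[[], float]] = {}
        self.actuators: Dict[str, Callable[[float], None]] = {}

    def register_sensor(self, name: str, read: Callable[[], float]) -> None:
        self.sensors[name] = read

    def register_actuator(self, name: str, act: Callable[[float], None]) -> None:
        self.actuators[name] = act


class Application:
    """Instrumented application: the sensor and actuator live in its source code."""

    def __init__(self, manager: Manager) -> None:
        self.buffer_size = 4096
        self.bytes_written = 0
        # Sensor: exposes an internal counter to remote clients.
        manager.register_sensor(
            "io.bytes_written", lambda: float(self.bytes_written)
        )
        # Actuator: lets a client change the application's behaviour.
        manager.register_actuator("io.buffer_size", self._set_buffer_size)

    def _set_buffer_size(self, value: float) -> None:
        self.buffer_size = int(value)

    def write(self, nbytes: int) -> None:
        self.bytes_written += nbytes


class SensorClient:
    """Client that looks sensors and actuators up through the manager."""

    def __init__(self, manager: Manager) -> None:
        self.manager = manager

    def read(self, name: str) -> float:
        return self.manager.sensors[name]()

    def actuate(self, name: str, value: float) -> None:
        self.manager.actuators[name](value)
```

A client would first read `io.bytes_written` and could then call `actuate("io.buffer_size", 8192)` to steer the application. Note that the set of sensors and actuators is fixed when the application is written, which is the compile-time limitation discussed in the text.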

There are predefined sets of sensors and sensor clients, but the user can also implement them manually with an API provided by Autopilot. This is, however, not very convenient. Sensors can include user-defined code for data pre-processing. Although this is flexible, the processing is fixed at compile time, whereas in the OCM-G it is defined at run time.

More importantly, the Autopilot toolkit works within an enclosed framework in which the definition of measurements, compilation and performance tuning are done automatically, so the users do not have detailed insight into their applications. In our approach, we focus on providing the user with exact knowledge of application performance in a particular execution; for example, the user can flexibly specify what information is needed.

Although the Autopilot toolkit is a mature and interesting application monitoring approach, it is oriented more towards automatic steering than towards providing feedback to the programmer. It gives a rather general view of the application and environment, e.g., to explore patterns in behavior rather than a particular performance loss. Based on those patterns and external knowledge (e.g. the user’s experience), a performance optimization action is taken. It is well suited to a run-time system in which a special compiler creates a program that is then automatically reconfigured at run time depending on the discovered patterns, e.g., the I/O buffer is resized for certain operations to improve performance.

The GrADS project differs from ours in its goal: it aims to provide tools that will free the user from many low-level concerns, permitting greater focus on the high-level design and tuning of programs for a heterogeneous distributed computing environment, while our goals are to provide the user with exact, low-level knowledge of the application internals and behavior.

·  GridLab

The application monitoring system developed within the GridLab project implements on-line steering guided by performance prediction routines that derive their results from low-level, infrastructure-related sensors (CPU, network load). GridLab proposed a protocol for high-level producer-consumer communication. The protocol has three types of messages: commands, command responses and metric values. A consumer can authenticate, initiate a measurement, and collect data. Additionally, the GridLab team proposed a standard set of metrics along with their semantics.
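The three message types and the authenticate, measure, collect sequence can be sketched as below. This is an illustrative model under stated assumptions, not the GridLab wire protocol: the command names (`authenticate`, `start`, `collect`), the message fields, the token-based authentication and the `cpu.load` metric are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Union


@dataclass
class Command:
    """Consumer -> producer request (command names are hypothetical)."""
    name: str
    args: Dict[str, str] = field(default_factory=dict)


@dataclass
class CommandResponse:
    """Producer -> consumer acknowledgement of a command."""
    command: str
    ok: bool
    detail: str = ""


@dataclass
class MetricValue:
    """Producer -> consumer measurement sample."""
    metric: str
    value: float


Message = Union[CommandResponse, MetricValue]


class Producer:
    """Toy producer that answers commands; readings are made-up sample data."""

    def __init__(self, token: str) -> None:
        self._token = token
        self._authenticated = False
        self._active: Set[str] = set()
        self._readings = {"cpu.load": 0.42}

    def handle(self, cmd: Command) -> List[Message]:
        if cmd.name == "authenticate":
            self._authenticated = cmd.args.get("token") == self._token
            return [CommandResponse(cmd.name, self._authenticated)]
        if not self._authenticated:
            return [CommandResponse(cmd.name, False, "not authenticated")]
        if cmd.name == "start":
            metric = cmd.args.get("metric", "")
            if metric not in self._readings:
                return [CommandResponse(cmd.name, False, "unknown metric")]
            self._active.add(metric)
            return [CommandResponse(cmd.name, True)]
        if cmd.name == "collect":
            values = [
                MetricValue(m, self._readings[m]) for m in sorted(self._active)
            ]
            return [CommandResponse(cmd.name, True), *values]
        return [CommandResponse(cmd.name, False, "unknown command")]
```

A consumer would send `Command("authenticate", {"token": ...})`, then `Command("start", {"metric": "cpu.load"})`, and finally `Command("collect")`, receiving a command response followed by the metric values for every active measurement.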