WP3.3 Grid Monitoring
M15 Deliverable
Software Evaluation
and Testing
WP3 New Grid Services and Tools
Document Filename: / CG3.3-D3.4-v1.4-TCD016-M15Deliverable.doc
Work package: / WP3 New Grid Services and Tools
Partner(s): / TCD, CYFRONET, ICM
Lead Partner: / TCD
Config ID: / CG3.3-D3.4-v1.4-TCD016-M15Deliverable
Document classification: / Confidential
Abstract: This document is an internal progress report on WP3, Task 3.3 grid monitoring, software evaluation and testing.
CG3.3-D3.4-v1.4-TCD016-M15Deliverable / Confidential / 1 / 60
/ Task 3.3 Grid Monitoring M15 Deliverable
Delivery Slip
Name / Partner / Date / Signature
From / WP3, Subtask 3.3 / TCD / May 30th, 2003 / Brian Coghlan
Verified by
Approved by
Document Log
Version / Date / Summary of changes / Author
1-0 / 10/04/2003 / Draft version / Bartosz Baliś, Brian Coghlan, Stuart Kenny, Krzysztof Nawrocki, Adam Padee, Marcin Radecki, Tomasz Szepieniec, Slawomir Zielinski
1-1 / 02/05/2003 / Draft version. Updated after TAT review. / Bartosz Baliś, Brian Coghlan, Stuart Kenny, Krzysztof Nawrocki, Adam Padee, Marcin Radecki, Tomasz Szepieniec, Slawomir Zielinski
1-2 / 08/05/2003 / Draft version. Updated after further comments. / Bartosz Baliś, Brian Coghlan, Stuart Kenny, Krzysztof Nawrocki, Adam Padee, Marcin Radecki, Tomasz Szepieniec, Slawomir Zielinski
1-3 / 26/05/2003 / Draft version. Updated risk assessment. / Bartosz Baliś, Brian Coghlan, Stuart Kenny, Krzysztof Nawrocki, Adam Padee, Marcin Radecki, Tomasz Szepieniec, Slawomir Zielinski
1-4 / 30/05/2003 / Final version. Updated after internal review. / Bartosz Baliś, Brian Coghlan, Stuart Kenny, Krzysztof Nawrocki, Adam Padee, Marcin Radecki, Tomasz Szepieniec, Slawomir Zielinski
Contents
1 REFERENCES
2 EXECUTIVE SUMMARY
3 INTRODUCTION
3.1 Definitions, Abbreviations, Acronyms
4 STATE OF THE ART
4.1 Applications MONITORING
4.1.1 OCM-G
4.2 Instruments, Infrastructure, Derived Results
4.2.1 Instruments: SANTA-G
4.2.2 Infrastructure: JIMS
4.2.3 Derived Results: Postprocessing
5 Contributions to Grid Technology
5.1 Applications MONITORING
5.1.1 OCM-G
5.2 Instruments, Infrastructure, Derived Results
5.2.1 Instruments: SANTA-G
5.2.2 Infrastructure: JIMS
5.2.3 Derived Results: Postprocessing
6 Brief description of the software
6.1 Application monitoring
6.1.1 OCM-G
6.2 Instruments, Infrastructure, Derived Results
6.2.1 Instruments: SANTA-G
6.2.2 Infrastructure: JIMS
6.2.3 Derived Results: Postprocessing
7 aims, tests and evaluation, new requirements
7.1 Applications MONITORING
7.1.1 OCM-G
7.2 Instruments, Infrastructure, Derived Results
7.2.1 Instruments: SANTA-G
7.2.2 Infrastructure: JIMS
7.2.3 Derived Results: Postprocessing
8 Results of the tests and evaluation
8.1 Applications MONITORING
8.1.1 OCM-G
8.2 Instruments, Infrastructure, Derived Results
8.2.1 Instruments: SANTA-G
8.2.2 Infrastructure: JIMS
8.2.3 Derived Results: Postprocessing
9 Problems and issues
9.1 Applications MONITORING
9.1.1 OCM-G
9.2 Instruments, Infrastructure, Derived Results
9.2.1 Instruments: SANTA-G
9.2.2 Infrastructure: JIMS
9.2.3 Derived Results: Postprocessing
10 Future plans
10.1 Applications MONITORING
10.1.1 OCM-G
10.2 Instruments, Infrastructure, Derived Results
10.2.1 Instruments: SANTA-G
10.2.2 Infrastructure: JIMS
10.2.3 Derived Results: Postprocessing
11 TOWARDS OGSA
11.1 Applications MONITORING
11.1.1 OCM-G
11.2 Instruments, Infrastructure, Derived Results
11.2.1 Instruments: SANTA-G
11.2.2 Infrastructure: JIMS
11.2.3 Derived Results: Postprocessing
12 Risk assessment
12.1 Applications MONITORING
12.1.1 OCM-G
12.2 Instruments, Infrastructure, Derived Results
12.2.1 Instruments: SANTA-G
12.2.2 Infrastructure: JIMS
12.2.3 Derived Results: Postprocessing
13 Concluding remarks
14 Appendix A
1 REFERENCES
Bal2000 Z. Balaton, P. Kacsuk, N. Podhorszki, F. Vajda, Comparison of Representative Grid
Monitoring Tools, http://www.lpds.sztaki.hu/publications/reports/lpds-2-2000.pdf
CrossGrid CrossGrid Project Technical Annex,
http://www.eu-crossgrid.org/CrossGridAnnex1_v31.pdf
DataGrid DataGrid Project Technical Annex DataGridPart_B_V2_51.doc
Ganglia http://ganglia.sourceforge.net/
GRIDLAB GridLab Project Home Page
http://www.gridlab.org
GRADS Grid Application Development Software Project Home Page http://nhse2.cs.rice.edu/grads/
Jiro http://www.jiro.org/
Jiro D3.3 Jiro Based Grid Infrastructure Monitoring System, First Prototype Description, part of D3.3
http://www.eu-crossgrid.org/Deliverables/M12pdf/CG3.3.3-CYF-D3.3-v1.1-Jiro.pdf
Jiro D3.2 Jiro Software Design Document
http://tulip.man.poznan.pl/doc/dd-04-09-2002/3.3/CG3.3.3-D3.2-v1.1-CYF022-JiroDesign.pdf
Jiro Tech Jiro Technology Installation and Configuration Guide, © Sun Microsystems
JMX Spec Java Management Extension Specification, © Sun Microsystems,
http://java.sun.com/products/JavaManagement/
KaTools P. Augerat, C. Martin, B. Stein, Scalable monitoring and configuration tools for grids
and clusters, http://ka-tools.sourceforge.net/publications/ka-admin-pdp02.pdf
NWS http://nws.cs.ucsb.edu/
OCM A Monitoring System for Interoperable Tools
http://wwwbode.cs.tum.edu/~omis/Docs/spdt98.ps.gz
OCMGD3.3 Description of the OCM-G first prototype, part of CrossGrid deliverable 3.3.
http://www.eu-crossgrid.org/Deliverables/M12pdf/CG-3.3.1-CYF-D3.3-v1.0-OCM-G.pdf
OGSA The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration. I. Foster, C. Kesselman, J. Nick, S. Tuecke, January 2002.
http://www.globus.org/research/papers/ogsa.pdf
OGSA-DAI http://www.ogsadai.org.uk/
OMIS OMIS – On-line Monitoring Interface Specification. Version 2.0. Lehrstuhl für Rechnertechnik und Rechnerorganisation Institut für Informatik (LRR-TUM), Technische Universität München.
http://wwwbode.informatik.tu-muenchen.de/~omis/
RFC 1213 K. McCloghrie, M. Rose (ed.), Management Information Base for Network Management of TCP/IP-based Internets: MIB-II
RGIS-RG Relational Grid Information Services Research Group
http://hepunx.rl.ac.uk/ggf/rgis-rg/
R-GMA R-GMA: A Relational Grid Information and Monitoring System
http://hepunx.rl.ac.uk/edg/wp3/documentation
R-GMA ARCH DataGrid Project Deliverable 3.2 DataGrid-03-D3.2-0101-1-0
http://hepunx.rl.ac/edg/wp3/documentation/
RGMA-OGSA A presentation describing the R-GMA transformation to web-services
https://edms.cern.ch/file/353606/1/GGF5-2.pdf
SANTA Prototype Trace Probe and Probe Adaptor and Prototype Trace Software, http://www.cs.tcd.ie/Brian.Coghlan/scieuro/scieuro.html, Esprit Project 25257, SCI Europe, Deliverables, Trinity College Dublin, 1999
SCITRAC B. Skalli, I. Birkeli, B. Nossum, and D. Wormald, SCITRAC – an LSA preprocessor for SCI link tracing, in Scalable Coherent Interface: Technology and Application, 1998.
SMiLE Jie Tao, A low-level Software Infrastructure for the SMiLE Monitoring Approach
http://wwwbode.cs.tum.edu/archive/smile/scieur2001/scieuro2001-s2-p2.pdf
SNORT Open-source network intrusion detection system, http://www.snort.org/
Spitfire Grid enabled middleware service for access to relational databases, http://spitfire.web.cern.ch/Spitfire
Task3.3 SRS Task3.3 Grid Monitoring Software Requirements Specification
Task3.3-SRS.pdf
http://www.eu-crossgrid.org/Deliverables/M3pdf/Task3.3-SRS.pdf
Task2.4 SRS Task2.4 Interactive and semiautomatic performance evaluation tools
CG-2.4-DOC-CYFRONET001-SRS.pdf
http://www.eu-crossgrid.org/Deliverables/M3pdf/CG-2.4-DOC-CYFRONET001-SRS.pdf
TCPDump http://www.tcpdump.org/
TEK http://www.tek.com
TopoMon M. den Burger, T. Kielmann, H. E. Bal, TopoMon: A Monitoring Tool for Grid
Network Topology, http://www.gridlab.org/Resources/Papers/iccs02.pdf
WP3 Inst WP3 Installation Guide
http://alpha.ific.uv.es/~sgonzale/integration/WP3-Installation-guide.doc
2 EXECUTIVE SUMMARY
This document is the month 15 deliverable for Task 3.3, Grid Monitoring. It forms part of an internal progress report on WP3 software evaluation and testing. Section 3 is the introduction, and also provides definitions, abbreviations and acronyms. Section 4 describes the current state of the art, while Section 5 provides a description of the contribution of Task 3.3 to current grid technology. Section 6 provides a brief description of the software. The aims of the prototypes and the tests are described in Section 7, and the results of these tests are given in Section 8. Any problems and issues discovered with the prototypes are described in Section 9. The plans for Task 3.3 for the next year are detailed in Section 10. Section 11 describes the plans of Task 3.3 regarding preparing for OGSA. A detailed risk analysis for the task is given in Section 12. Section 13 contains concluding remarks.
3 INTRODUCTION
The technical annex [CrossGrid] proposed that Task 3.3 develop a prototype infrastructure to serve two monitoring-related activities: the automatic extraction of high-level performance properties, and tool support for performance analysis. The aim of the Grid monitoring facilities provided by the Grid Monitoring task is to underpin the functionality of various tools by delivering the low-level data that these activities require. During the design phase, system scalability, flexibility and ease of configuration were the guiding principles, and they remain so during the implementation phase.
For the Grid Monitoring task we deliberately restricted our scope. We chose both to add new services (for applications monitoring) and to extend existing services (for monitoring instruments and infrastructure, and for generating derived results); see Figure 3.1. Analysis has shown that the new applications monitoring is substantially different from monitoring infrastructure and instruments, so the two need separate approaches:
· Infrastructure monitoring collects static and dynamic information about Grid components, such as hosts or network connections; this information is indispensable for basic Grid activities such as resource allocation or load balancing; it often has historic as well as immediate value, so it is often stored in a database for later analysis (e.g., statistical analysis or forecasting),
· Application monitoring aims at observing a particular execution of an application; the collected data is useful for tools for application development support, which are used to detect bugs, bottlenecks or just visualize the application's behaviour; this kind of information in principle does not have historic value – it is meaningful only in the context of a particular execution.
Figure 3.1: The Grid Monitoring System
For applications monitoring we chose to:
· proceed in an OCM-compliant route, for compatibility with WP2
· implement direct low-latency communications
For extending existing grid monitoring services we chose to:
· deliberately avoid a hierarchical (e.g. LDAP) approach
· proceed in an OGSA-compliant route, using R-GMA as the interim web-based service
· implement instrument monitoring as part of R-GMA
· implement infrastructure monitoring using Jiro tools that already support infrastructure protocols and dynamic system configuration, with import/export from/to other services
· explore the potential of Jiro techniques for higher levels of grid monitoring
· consider results derived from monitoring data as just another stream of monitoring data
· produce data for external consumption via a single web-based API (initially R-GMA)
· closely monitor OGSA developments
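The decision to treat derived results as just another stream of monitoring data can be illustrated with a minimal sketch. It is not taken from the Task 3.3 software: the `Sample` record, the `moving_average` function and the stream names are hypothetical, chosen only to show that a derived stream can have the same shape as the raw stream it is computed from, so that a consumer needs only one interface.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, NamedTuple


class Sample(NamedTuple):
    """One monitoring record (hypothetical layout, for illustration only)."""
    source: str   # name of the stream, e.g. "host1.cpu"
    time: float   # timestamp of the measurement
    value: float  # measured or derived value


def moving_average(stream: Iterable[Sample], window: int, name: str) -> Iterator[Sample]:
    """Consume a stream of samples and publish a derived stream of the same type."""
    recent: Deque[float] = deque(maxlen=window)
    for sample in stream:
        recent.append(sample.value)
        # The derived record is an ordinary Sample, so it can be consumed,
        # stored or post-processed again exactly like raw monitoring data.
        yield Sample(name, sample.time, sum(recent) / len(recent))


raw = [
    Sample("host1.cpu", 0.0, 1.0),
    Sample("host1.cpu", 1.0, 3.0),
    Sample("host1.cpu", 2.0, 5.0),
]
derived = list(moving_average(raw, window=2, name="host1.cpu.avg"))
```

Because `derived` is itself a list of `Sample` records, it can be passed back into `moving_average` or any other consumer without a special case.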
3.1 Definitions, Abbreviations, Acronyms
CrossGrid The EU CrossGrid Project IST-2001-32243
DataGrid The EU DataGrid Project IST-2000-25182
FMA Federated Management Architecture, defined by Sun Microsystems
GUI Graphical User Interface
JIMS Jiro/JMX-based Grid Infrastructure Monitoring System
Jiro SUN Jiro, implementation of the FMA specification
LDAP Lightweight Directory Access Protocol
LDAP DIT LDAP Directory Information Tree
MDS Monitoring and Discovery Service
NWS Network Weather Service
OCM OMIS-Compliant Monitor
OCM-G Grid-enabled OMIS-Compliant Monitor
OGSA Open Grid Services Architecture
OGSA-DAI OGSA Data Access and Integration
OMIS On-line Monitoring Interface Specification
RDBMS Relational Database Management System
RGIS-RG Relational Grid Information Services – Research Group
R-GMA DataGrid relational Grid monitoring architecture
SANTA System Area Network Trace Analysis
SANTA-G Grid-enabled System Area Network Trace Analysis
SMiLE Shared Memory in a LAN-like Environment
SNORT Open source network intrusion detection system
SOAP Simple Object Access Protocol
SRS Software Requirements Specification
WSDL Web Services Description Language
4 STATE OF THE ART
4.1 Applications MONITORING
4.1.1 OCM-G
In this section, we provide an overview of three grid application monitoring approaches currently being developed that are similar to the OCM-G approach, and we compare each of them to our own OMIS/OCM-G-based approach. The projects/systems are: GrADS (Autopilot), GridLab, and DataGrid (GRM).
· GrADS
The Grid Application Development Software (GrADS) project develops a software architecture designed to support application adaptation and performance monitoring. The GrADSoft architecture (a program preparation and execution system) replaces the discrete steps of application creation, compilation, execution, and post-mortem analysis with a continuous process of adapting applications to both a changing Grid and a specific problem instance.
The GrADSoft monitoring infrastructure is based on Autopilot, a toolkit for application and resource monitoring and control. The monitoring is based on sensors, which may be put directly into source code or embedded in the application library. The sensors register in Autopilot Manager and can then be accessed by sensor clients to collect information. The clients can be located anywhere on the Grid. Modification of the application behavior is achieved by executing actuators, which are implemented by the user in the source code.
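The sensor, manager, client and actuator roles described above can be sketched as follows. This is a minimal illustration of the pattern only, not the Autopilot API: the `Manager`, `Application` and `steering_client` names, the sensor and actuator names, and the threshold values are all hypothetical.

```python
from typing import Callable, Dict


class Manager:
    """Hypothetical registry playing the role of the manager: sensors and
    actuators register here, and clients look them up by name."""

    def __init__(self) -> None:
        self._sensors: Dict[str, Callable[[], float]] = {}
        self._actuators: Dict[str, Callable[[float], None]] = {}

    def register_sensor(self, name: str, read: Callable[[], float]) -> None:
        self._sensors[name] = read

    def register_actuator(self, name: str, act: Callable[[float], None]) -> None:
        self._actuators[name] = act

    def read(self, name: str) -> float:
        return self._sensors[name]()

    def actuate(self, name: str, value: float) -> None:
        self._actuators[name](value)


class Application:
    """Instrumented application: the sensor and actuator live in its source code."""

    def __init__(self, manager: Manager) -> None:
        self.buffer_size = 4096
        self.bytes_written = 0
        manager.register_sensor("bytes_written", lambda: float(self.bytes_written))
        manager.register_actuator("set_buffer_size", self._set_buffer_size)

    def _set_buffer_size(self, value: float) -> None:
        self.buffer_size = int(value)

    def write(self, nbytes: int) -> None:
        self.bytes_written += nbytes


def steering_client(manager: Manager, threshold: float = 10_000) -> None:
    """Sensor client: reads a sensor and, if warranted, triggers an actuator."""
    if manager.read("bytes_written") > threshold:
        manager.actuate("set_buffer_size", 65_536)
```

In this sketch the client never touches the application object directly; it sees only the names registered with the manager, which is what allows clients to be located elsewhere.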
There are predefined sets of sensors and sensor clients, but the user can also implement them manually with an API provided by Autopilot. This is, however, not very convenient. Sensors can include user-defined code for data pre-processing. Although it is flexible, the processing is fixed at compile time, while in the OCM-G it is defined at run-time.
More importantly, the Autopilot toolkit works within an enclosed framework, where the definition of measurements, compilation and performance tuning are done automatically, so the users do not have detailed insight into their applications. In our approach, we focus on providing the user with exact knowledge of application performance in a particular execution, e.g., the user can flexibly specify what information they need.
Although the Autopilot toolkit is a mature and interesting application monitoring approach, it is oriented more towards automatic steering than towards providing feedback to the programmer. It gives a rather general view of the application and environment, e.g., to explore patterns in behavior rather than a particular performance loss. Based on those patterns and external knowledge (e.g. the user's experience), a performance optimization action is taken. This suits a run-time system in which a special compiler creates a program that is then automatically reconfigured at run-time depending on the discovered patterns, e.g., the I/O buffer is resized for certain operations to improve performance.
The GrADS project differs from ours in its goal: it aims to provide tools that will free the user from many low-level concerns, permitting greater focus on the high-level design and tuning of programs for a heterogeneous distributed computing environment, while our goals are to provide the user with exact, low-level knowledge of the application internals and behavior.
· GridLab
The application monitoring system developed within the GridLab project implements on-line steering guided by performance prediction routines that derive results from low-level, infrastructure-related sensors (CPU, network load). GridLab proposed a protocol for high-level producer-consumer communication. The protocol has three types of messages: commands, command responses and metric values. A consumer can authenticate, initiate a measurement, and collect data. Additionally, the GridLab team proposed a standard set of metrics along with their semantics.
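The three message types and the authenticate, measure, collect sequence can be sketched as below. This is an illustration of the producer-consumer pattern under stated assumptions, not the GridLab protocol itself: the command names ("AUTH", "START", "COLLECT"), the field names, the token check and the `cpu_load` metric are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Union


@dataclass
class Command:
    """Consumer-to-producer message (command names here are hypothetical)."""
    name: str
    args: Dict[str, str] = field(default_factory=dict)


@dataclass
class CommandResponse:
    """Producer's answer to a command."""
    ok: bool
    detail: str = ""


@dataclass
class MetricValue:
    """One measured value of a named metric."""
    metric: str
    value: float


Message = Union[CommandResponse, MetricValue]


class Producer:
    """In-process stand-in for a monitoring producer with canned readings."""

    def __init__(self, token: str, readings: Dict[str, List[float]]) -> None:
        self._token = token
        self._readings = readings
        self._authenticated = False
        self._active: Set[str] = set()

    def handle(self, cmd: Command) -> List[Message]:
        if cmd.name == "AUTH":
            self._authenticated = cmd.args.get("token") == self._token
            return [CommandResponse(self._authenticated)]
        if not self._authenticated:
            return [CommandResponse(False, "not authenticated")]
        if cmd.name == "START":
            metric = cmd.args.get("metric", "")
            if metric not in self._readings:
                return [CommandResponse(False, "unknown metric")]
            self._active.add(metric)
            return [CommandResponse(True)]
        if cmd.name == "COLLECT":
            # Reply with a response followed by the values of all active metrics.
            out: List[Message] = [CommandResponse(True)]
            for metric in sorted(self._active):
                out.extend(MetricValue(metric, v) for v in self._readings[metric])
            return out
        return [CommandResponse(False, "unknown command")]
```

A consumer in this sketch would send `Command("AUTH", {"token": ...})`, then `Command("START", {"metric": "cpu_load"})`, and finally `Command("COLLECT")`, receiving a `CommandResponse` each time and `MetricValue` messages after the last one.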