1 XSEDE Quarterly Report: FutureGrid Service Provider (April 1, 2012 – June 30, 2012)
1.1 Executive Summary
- Finalized the quote and software licensing for a new cluster dedicated to ScaleMP use. The new system will be named "Echo" and will be built with Intel's new Sandy Bridge technology.
- Ongoing planning with NCSA and the Virtual School of Computational Science and Engineering (VSCSE) for the upcoming Science Cloud Summer School (July 30 – August 3)
- A major update on XSEDE testing on FutureGrid (see Science Highlights)
- OpenStack “Essex” version deployed
- Attended XSEDE XRAC meeting with emphasis on integration of account management and FutureGrid
1.1.1 Resource Description
FG Hardware Systems
| Name | System type | # Nodes | # CPUs | # Cores | TFLOPS | Total RAM (GB) | Secondary Storage (TB) | Site |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| india | IBM iDataPlex | 128 | 256 | 1024 | 11 | 3072 | 335 | IU |
| hotel | IBM iDataPlex | 84 | 168 | 672 | 7 | 2016 | 120 | UC |
| sierra | IBM iDataPlex | 84 | 168 | 672 | 7 | 2688 | 96 | SDSC |
| foxtrot | IBM iDataPlex | 32 | 64 | 256 | 3 | 768 | 0 | UF |
| alamo | Dell PowerEdge | 96 | 192 | 768 | 8 | 1152 | 30 | TACC |
| xray | Cray XT5m | 1 | 168 | 672 | 6 | 1344 | 335 | IU |
| bravo | HP Proliant | 16 | 32 | 128 | 1.7 | 3072 | 192 | IU |
| delta | SuperMicro | 16 | 32 | 192 | TBD | 3072 | 144 | IU |
| Total | | 457 | 1080 | 4384 | 43.7 | 17184 | 1252 | |
FG Storage Systems
Also, substantial backup storage at IU: Data Capacitor and HPSS.
1.2 Science Highlights
FutureGrid as a Test Bed for XSEDE (eXtreme Science and Engineering Discovery Environment)
June 23, 2012
Andrew Grimshaw
Department of Computer Science
School of Engineering and Applied Science
University of Virginia
Charlottesville, Virginia
In 2008, the NSF announced a competition for the follow-on to the TeraGrid project, known as eXtreme Digital (XD). In 2009, two proposal teams, one led by NCSA and the other by SDSC, were selected to prepare full proposals. In July 2010, the proposals were delivered, and in late 2010, the NCSA-led team was awarded the project, albeit with the instruction to incorporate the best ideas and personnel from the SDSC proposal.
The NCSA-led team proposed XSEDE – the eXtreme Science and Engineering Discovery Environment. The XSEDE architecture is a three-layer, federated, system-of-systems architecture. It includes the use of standard Web Services interfaces and protocols that define interactions between different components, tools, and organizations. Many different science communities will use XSEDE, whether at a national supercomputing center or through its delivery of campus bridging among research groups around the US. XSEDE will be, upon completion, one of the critical components of the national cyberinfrastructure.
The XSEDE Web Services architecture is based on a number of interchangeable components that implement standard interfaces. These include: 1) RNS 1.1 [1] for a Unix-directory-like namespace (e.g., /home/sally/work); 2) OGSA-ByteIO [2] for POSIX-file-like operations (create, read, update, delete); 3) the OGSA Basic Execution Service (OGSA-BES) [3] for executing and managing jobs (create_activity, get_status, delete); and 4) WS-Trust Secure Token Services (STS) [4] for identity federation. The architectural goals are to have clearly specified, standard interfaces that anybody can implement and to use best-of-breed components.
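To make the division of labor among these four interfaces concrete, the sketch below models each port type as a plain Java interface. This is an illustrative simplification only: the real services are SOAP/WSDL port types, and the method names and helper types used here (EndpointReference, SecurityToken, ActivityStatus) are shorthand assumptions, not the normative schemas.

```java
// Illustrative, non-normative Java analogues of the standard port types named above
// (RNS 1.1, OGSA-ByteIO, OGSA-BES, WS-Trust STS). Signatures are simplified assumptions.

import java.util.List;
import java.util.Map;

/** RNS 1.1: a Unix-directory-like namespace of name -> endpoint entries. */
interface RnsDirectory {
    Map<String, EndpointReference> lookup(List<String> names); // resolve entries
    void add(String name, EndpointReference target);           // link an entry
    void remove(String name);                                   // unlink an entry
}

/** OGSA-ByteIO: POSIX-file-like random access to a remote resource. */
interface ByteIoResource {
    byte[] read(long offset, int length);
    void write(long offset, byte[] data);
}

/** OGSA-BES: create and manage activities (jobs) described in JSDL. */
interface BesFactory {
    EndpointReference createActivity(String jsdlDocument);       // submit a job
    ActivityStatus getActivityStatus(EndpointReference activity); // poll its state
    void terminateActivity(EndpointReference activity);           // cancel/clean up
}

/** WS-Trust STS: exchange one security token for another (identity federation). */
interface SecureTokenService {
    SecurityToken issue(SecurityToken presentedToken, String appliesToEndpoint);
}

// Placeholder types so the sketch compiles; real systems use WS-Addressing EPRs,
// SAML assertions, and richer activity state models.
final class EndpointReference { String address; }
final class SecurityToken { String xml; }
enum ActivityStatus { PENDING, RUNNING, FINISHED, FAILED, CANCELLED }
```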
The initial realization of the XSEDE web services architecture uses interoperable implementations from two different software stacks: UNICORE 6 [5] and Genesis II [6]. Together, UNICORE 6 and Genesis II provide a rich set of capabilities in the areas of data, computation, and identity management. These capabilities are grouped into two configuration items (CIs): Execution Management Services (EMS) and the Global Federated File System (GFFS).
The Execution Management Services CI is concerned with specifying, executing, and more generally managing jobs in the XSEDE grid. EMS capabilities include but are not limited to the following:
- The ability to specify both single jobs and parameter-space jobs in JSDL (a minimal JSDL sketch follows this list). Specified jobs may be sequential jobs or parallel (MPI) jobs.
- The ability to manage jobs through their lifetime, i.e., from specification and submission to a compute resource to status checking and management during execution, as well as final cleanup.
- A grid queue (meta-scheduler) that matches jobs to a defined, configurable set of execution services and load-balances among them.
- The ability to specify either a single compute resource as a target, e.g., a particular queue on Ranger, or to specify a global metascheduler/queue as the target and have the metascheduler select the execution endpoint.
- The ability to add compute resources (e.g., queues on specific machines such as Ranger, Alamo, Kraken, or local campus queues such as Centurion at UVA) into the XSEDE namespace and subsequently target jobs at them.
- The ability to create meta-schedulers/queues and configure them to use (schedule on) different compute resources.
- A command line interface (CLI) to interact with grid compute resources.
- A graphical user interface (GUI) to interact with and manage the backend grid compute resources. This includes, but is not limited to, tools to create and execute job descriptions, manage grid queues, and manage access to resources.
- A set of Java classes (and associated APIs) to interact with and manage the backend grid resources.
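As an illustration of the first capability above, the following sketch assembles a minimal JSDL description for a single sequential job and writes it to disk. The element names follow the OGF JSDL 1.0 and POSIXApplication schemas; the job name, executable, argument, and file names are hypothetical examples, not jobs taken from the XSEDE test plans.

```java
// A minimal, illustrative JSDL document for a single sequential job, assembled and
// written out with plain Java. The job content itself is a hypothetical example.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class MinimalJsdlExample {
    public static void main(String[] args) throws IOException {
        String jsdl =
            "<jsdl:JobDefinition xmlns:jsdl=\"http://schemas.ggf.org/jsdl/2005/11/jsdl\"\n" +
            "                    xmlns:jsdl-posix=\"http://schemas.ggf.org/jsdl/2005/11/jsdl-posix\">\n" +
            "  <jsdl:JobDescription>\n" +
            "    <jsdl:JobIdentification>\n" +
            "      <jsdl:JobName>hello-world</jsdl:JobName>\n" +
            "    </jsdl:JobIdentification>\n" +
            "    <jsdl:Application>\n" +
            "      <jsdl-posix:POSIXApplication>\n" +
            "        <jsdl-posix:Executable>/bin/echo</jsdl-posix:Executable>\n" +
            "        <jsdl-posix:Argument>hello</jsdl-posix:Argument>\n" +
            "        <jsdl-posix:Output>stdout.txt</jsdl-posix:Output>\n" +
            "      </jsdl-posix:POSIXApplication>\n" +
            "    </jsdl:Application>\n" +
            "  </jsdl:JobDescription>\n" +
            "</jsdl:JobDefinition>\n";

        // Persist the description; a BES endpoint or grid queue would receive a document
        // like this inside a CreateActivity request.
        Files.write(Paths.get("hello-job.jsdl"), jsdl.getBytes(StandardCharsets.UTF_8));
    }
}
```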
The GFFS provides a number of capabilities. These capabilities include but are not limited to the following:
- A single, secure, shared global namespace for a diversity of resource types. For example, a single namespace can include files, directories, execution services, execution queues, secure token services, and executing jobs.
- A three-level naming scheme consisting of location-independent, human-readable names (paths) that map to globally unique resource identities, which in turn can be mapped (bound) to one or more resource instances. Collectively, the three layers provide an easy-to-use namespace that transparently handles heterogeneous configurations for location, failure, replication, migration, and implementation.
- The ability to securely map (share) Service Provider (SP), local, lab, and campus data into the shared global namespace.
- The ability to securely map (share) SP, local, lab, and campus compute resources into the global namespace.
- The ability to securely map (share) SP, local, lab, and campus identity resources into the global namespace.
- The ability to transparently access the global shared namespace from both campuses and national supercomputing centers, via either the file system (e.g., FUSE; see the sketch after this list) or command line tools and libraries. Such access includes the ability to perform create, read, update, and delete operations on files, directories, and other resource types.
- A command line interface (CLI) to interact with backend grid resources. In particular, this CLI allows interaction with Open Grid Forum RNS, ByteIO, WS-Naming, and BES services, as well as OASIS WS-Trust Secure Token Services.
- A graphical user interface (GUI) to interact with and manage the backend grid resources.
- A set of Java classes (and associated APIs) to interact with and manage the backend grid resources.
- The ability to integrate with existing (legacy) XSEDE Kerberos and MyProxy [7] authentication mechanisms.
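The sketch below illustrates the FUSE-based access mode mentioned in the list: once the global namespace is mounted locally, ordinary file APIs can create, read, update, and delete entries in it with no grid-specific client code on the access path. The mount point and paths (/mnt/gffs/home/sally/work) are hypothetical placeholders, not documented GFFS locations.

```java
// Illustrative sketch: accessing a FUSE-mounted GFFS namespace with standard Java file
// APIs. The mount point below is an assumption; only the access pattern is the point.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class GffsFuseExample {
    public static void main(String[] args) throws IOException {
        Path home = Paths.get("/mnt/gffs/home/sally/work"); // hypothetical mounted path

        // Create: write a small input file into the shared namespace.
        Path input = home.resolve("params.txt");
        Files.write(input, "alpha=0.5\n".getBytes(StandardCharsets.UTF_8));

        // Read: list and read entries exactly as with a local directory.
        try (Stream<Path> entries = Files.list(home)) {
            entries.forEach(p -> System.out.println("entry: " + p.getFileName()));
        }
        String contents = new String(Files.readAllBytes(input), StandardCharsets.UTF_8);
        System.out.println(contents);

        // Update and delete behave the same way.
        Files.write(input, "alpha=0.7\n".getBytes(StandardCharsets.UTF_8));
        Files.delete(input);
    }
}
```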
In December 2011, the XSEDE test grid (XTG) was brought up on FutureGrid and University of Virginia resources as part of the first integrated test plan for Execution Management Services (EMS) and the Global Federated File System (GFFS). The root of the RNS namespace was set up at Virginia, and UNICORE 6 servers were brought up on X-Ray at Indiana (the Cray that is a small version of Kraken) and on Sierra at SDSC. Genesis II servers were brought up at Virginia and TACC. Genesis II clients were brought up at Indiana, Virginia, TACC, and on Blacklight at PSC.
Over the next two months, the tests described in the EMS and GFFS test plans[1] were executed by the XSEDE Software Development and Integration (SD&I) team. This led to the discovery of several minor problems as well as a number of desired feature enhancements. Minor bugs and integration issues were resolved, and increment 1.0 of EMS and GFFS was completed and, in March 2012, turned over to the XSEDE operations team for learning and testing purposes.
XSEDE testing activities on FutureGrid bifurcated in April 2012. The XSEDE operations team began doing its own experimentation and testing on FutureGrid resources, and SD&I continued to use the XTG as a test infrastructure – in particular working with the XSEDE Campus Bridging team on the Campus Bridging Pilot project. By the end of May 2012, operations had completed its EMS testing and was ready to start GFFS testing. In late June of 2012, SD&I began testing the second release increment of the EMS and GFFS CIs.
As part of the campus bridging pilot project, data resources at three universities were incorporated into the XSEDE test grid in June 2012: Indiana University, Louisiana Tech, and LSU. These resources will be used to perform typical campus bridging test cases and to capture issues that arise with real users in the wild. To support job execution on a wider set of resources than are configured in the XTG, the compute resources in the Cross Campus Grid (XCG) were linked into the XTG.
The XCG is a Genesis II-based grid run by the University of Virginia with resources on FutureGrid and at the University of Virginia. The XCG has been in production use for over three years, during which time it has executed over 1.3 million jobs for a number of applications in areas such as economics, materials science, systems engineering, biology, and physics. Many of the XCG jobs have run on FutureGrid resources via the XCG.
Summary
FutureGrid has been an invaluable resource for XSEDE in testing the new generation of standards-based software described in the XSEDE proposal. Without FutureGrid, the XSEDE SD&I team would have been forced to use significantly smaller test machines at the centers and to execute its tests alongside production applications on the network and local parallel file systems. Instead, thanks to FutureGrid, the XSEDE team has been able to take advantage of a risk-free testing environment, collaborative systems administrators, and the similarity between the FutureGrid resources and its own. Given its ideal qualifications, we expect XSEDE to continue to use FutureGrid as a test environment.
1. Morgan, M., A. S. Grimshaw, and O. Tatebe, RNS Specification 1.1. 2010, Open Grid Forum. p. 23.
2. Morgan, M., ByteIO Specification 1.0. 2005.
3. Grimshaw, A., S. Newhouse, D. Pulsipher, and M. Morgan, GFD108: OGSA Basic Execution Service. 2007, Open Grid Forum.
4. OASIS, WS-Trust 1.3, in OASIS Standard Specification. 2007.
5. Snelling, D., Unicore and the Open Grid Services Architecture, in Grid Computing: Making the Global Infrastructure a Reality, F. Berman, A. J. G. Hey, and G. Fox, Editors. 2003, John Wiley. p. 701-712.
6. Morgan, M. and A. Grimshaw, Genesis II - Standards Based Grid Computing, in Seventh IEEE International Symposium on Cluster Computing and the Grid. 2007, IEEE Computer Society: Rio de Janeiro, Brazil.
7. Basney, J., W. Yurcik, R. Bonilla, and A. Slagell, The Credential Wallet: A Classification of Credential Repositories Highlighting MyProxy, in 31st Research Conference on Communication, Information and Internet Policy (TPRC 2003). 2003: Arlington, Virginia.
Optimizing Shared Resource Contention in HPC Clusters
Sergey Blagodurov
School of Computing Science
Simon Fraser University
Burnaby BC, CA
Abstract
Contention for shared resources in HPC clusters occurs when jobs executing concurrently on the same multicore node compete for allocated CPU time, shared caches, the memory bus, memory controllers, and so on, and when jobs concurrently access the cluster interconnect as their processes exchange data. The cluster network is also used by the cluster scheduler in a virtualized environment to migrate job virtual machines across nodes. We argue that contention for shared cluster resources severely degrades workload performance and stability and hence must be addressed. We also found that state-of-the-art HPC cluster schedulers are not contention-aware. The goal of this work is the design, implementation, and evaluation of a scheduling framework that addresses shared resource contention in a virtualized HPC cluster environment.
Intellectual Merit
The proposed research demonstrates how shared resource contention in HPC clusters can be addressed via contention-aware scheduling of HPC jobs. The proposed framework comprises a novel scheduling algorithm and a set of open-source software that includes original code and patches to widely used tools in the field. The solution (a) allows online monitoring of the cluster workload and (b) provides a way to make and enforce contention-aware scheduling decisions in practice.
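For readers unfamiliar with the idea, the sketch below shows one minimal form a contention-aware placement decision can take: each job carries a measured contention score (for example, memory-bandwidth intensity), and a greedy pass places each job on the node whose co-runners currently contend the least. The scores, node capacities, and heuristic are illustrative assumptions and are not the algorithm developed in this project.

```java
// A minimal contention-aware placement sketch (greedy heuristic), not the project's
// actual framework. Scores and node capacities are hypothetical.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ContentionAwarePlacement {

    static class Job {
        final String name;
        final double contentionScore; // e.g., measured memory-bandwidth intensity
        Job(String name, double score) { this.name = name; this.contentionScore = score; }
        @Override public String toString() { return name; }
    }

    static class Node {
        final String name;
        final int slots; // how many jobs the node can co-run
        final List<Job> placed = new ArrayList<>();
        Node(String name, int slots) { this.name = name; this.slots = slots; }
        double load() { return placed.stream().mapToDouble(j -> j.contentionScore).sum(); }
    }

    /** Heaviest jobs first, each onto the least-contended node with a free slot. */
    static void place(List<Job> jobs, List<Node> nodes) {
        List<Job> ordered = new ArrayList<>(jobs);
        ordered.sort(Comparator.comparingDouble((Job j) -> j.contentionScore).reversed());
        for (Job job : ordered) {
            nodes.stream()
                 .filter(n -> n.placed.size() < n.slots)
                 .min(Comparator.comparingDouble(Node::load))
                 .ifPresent(n -> n.placed.add(job));
        }
    }

    public static void main(String[] args) {
        List<Job> jobs = List.of(
            new Job("lbm", 0.9), new Job("mcf", 0.8), new Job("namd", 0.2), new Job("povray", 0.1));
        List<Node> nodes = List.of(new Node("node1", 2), new Node("node2", 2));
        place(jobs, nodes);
        for (Node n : nodes) {
            System.out.printf("%s -> %s (aggregate contention %.1f)%n", n.name, n.placed, n.load());
        }
    }
}
```

With these example scores, the memory-intensive jobs (lbm, mcf) end up on different nodes, each paired with a lightweight co-runner, which is the balancing behavior a contention-aware scheduler aims for.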
Broader Impacts
This research suggests a way to upgrade the HPC infrastructure used by U.S. academic institutions, industry, and government. The goal of the upgrade is better performance for general cluster workloads.
Results
Below is the link to our project report for the FutureGrid Project Challenge. A shorter version of it will appear in the HPCS 2012 proceedings as a Work-in-Progress paper:
Experiments in Distributed Computing
Shantenu Jha
Center for Computation & Technology
Louisiana State University
Abstract
This work is aimed at (i) developing and extending the SAGA programming system, (ii) the application and development of distributed programming models, and (iii) the analysis of data-intensive applications and methods.
Intellectual Merit
The CI community does not currently have the ability to address "distributedness" explicitly. Part of the reason is the fragmented, siloed approaches that don't scale or are not extensible. SAGA provides a standards-based approach to distributed application development that is interoperable and extensible by definition. Research on FutureGrid is primarily about establishing the advantages of a SAGA-based approach to distributed applications -- primarily data-intensive ones.
Broader Impacts
Access to FG is being used to support educational activities -- graduate and undergraduate -- in an EPSCoR state. It also forms the basis of multiple student and training projects.
Results
Summary: The design and development of distributed scientific applications presents a challenging research agenda at the intersection of cyberinfrastructure and computational science. It is no exaggeration that the US academic community has lagged in its ability to design and implement novel distributed scientific applications, tools, and run-time systems that are broadly used, extensible, interoperable, and simple to use/adapt/deploy. The reasons are many and resistant to oversimplification. But one critical reason has been the absence of infrastructure where abstractions, run-time systems, and applications can be developed, tested, and hardened at the scales and with a degree of distribution (and the concomitant heterogeneity, dynamism, and faults) required to facilitate the transition from "toy solutions" to "production grade", i.e., the intermediate infrastructure.
For the SAGA project, which is concerned with all of the above elements, FutureGrid has proven to be that *panacea*, the hitherto missing element whose absence had prevented progress toward scalable distributed applications. In a nutshell, FG has provided a persistent, production-grade experimental infrastructure with the ability to perform controlled experiments, without violating production policies or disrupting production infrastructure priorities. These attributes, coupled with excellent technical support -- the bedrock upon which all these capabilities depend -- have resulted in the following specific advances in the short period of under a year:
1. Use of FG for Standards-Based Development and Interoperability Tests:
Interoperability, whether at the service level or the application level, is an important requirement of distributed infrastructure. The lack of interoperability (and its corollary -- applications being tied to specific infrastructure) is arguably one of the single most important barriers to the progress and development of novel distributed applications and programming models. However, as important as interoperability is, it is difficult to implement and provide. The reasons are varied, but two critical elements have been the ability to provide (i) a persistent testing infrastructure that can support a spectrum of middleware -- standards-based or otherwise -- and (ii) a single, consistent security context for such tests.
We have used FutureGrid to alleviate both of these shortcomings. Specifically, we have used FG as the test bed for standards-compliant middleware in extensive OGF standards-based testing as part of the Grid Interoperability Now (GIN) and Production Grid Infrastructure (PGI) research group efforts. As part of these extended efforts, we have developed persistent and pervasive experiments spanning ~10 different middleware and infrastructure types -- most of which are supported on FG, including Genesis II, UNICORE, BES, and AWS (i.e., Eucalyptus), and soon OCCI. The fact that the FG endpoints are permanent has allowed us to keep those experiments "alive" and has enabled us to extend static interoperability requirements to dynamic interoperability requirements. Being relieved of the need to maintain those endpoints has been a critical asset.
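The value of a standards-based access layer in such interoperability tests comes from the adaptor pattern: application code is written once against a neutral interface, while pluggable adaptors target each middleware endpoint. The sketch below illustrates that pattern with hypothetical interfaces, classes, and endpoint URLs; it is not the SAGA API itself, nor the actual experiment harness.

```java
// Adaptor-pattern sketch: one application-facing job-submission contract, multiple
// middleware-specific backends. All names and endpoints are hypothetical illustrations.

import java.util.HashMap;
import java.util.Map;

public class InteropSketch {
    /** One application-facing contract, regardless of the backend middleware. */
    interface JobSubmitter {
        String submit(String executable, String... arguments);
    }

    /** Adaptor for a BES-style endpoint (e.g., a UNICORE 6 or Genesis II service). */
    static class BesSubmitter implements JobSubmitter {
        private final String endpointUrl;
        BesSubmitter(String endpointUrl) { this.endpointUrl = endpointUrl; }
        public String submit(String exe, String... args) {
            // Real code would wrap exe/args in JSDL and invoke CreateActivity on endpointUrl.
            return "bes-activity@" + endpointUrl;
        }
    }

    /** Adaptor for a cloud endpoint (e.g., an EC2-compatible Eucalyptus installation). */
    static class CloudSubmitter implements JobSubmitter {
        private final String region;
        CloudSubmitter(String region) { this.region = region; }
        public String submit(String exe, String... args) {
            // Real code would launch an instance and stage the executable onto it.
            return "cloud-task@" + region;
        }
    }

    public static void main(String[] args) {
        Map<String, JobSubmitter> backends = new HashMap<>();
        backends.put("unicore-xray", new BesSubmitter("https://xray.example.org:8080/bes"));
        backends.put("eucalyptus-india", new CloudSubmitter("india"));

        // The same "experiment" runs unchanged against every configured backend.
        for (Map.Entry<String, JobSubmitter> e : backends.entrySet()) {
            String handle = e.getValue().submit("/bin/hostname");
            System.out.println(e.getKey() + " -> " + handle);
        }
    }
}
```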
See the following URL for a visual map of the status of the experiments:
2. Use of FG for Analyzing & Comparing Programming Models and Run-time tools for Computation and Data-Intensive Science
What existing distributed programming models will be applicable on Clouds? What new programming models and run-time abstractions will be required to enable the next-generation of data-intensive applications? We have used FG in our preliminary attempts to answer some of these questions.