The proof of concept presented in this document is neither a product nor a service offered by Microsoft or BULL S.A.S.

The information contained in this document represents the current view of Microsoft Corporation and BULL S.A.S. on the issues discussed as of the date of publication. Because Microsoft and BULL S.A.S. must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft or BULL S.A.S., and Microsoft and BULL S.A.S. cannot guarantee the accuracy of any information presented after the date of publication.

This White Paper is for informational purposes only. MICROSOFT and BULL S.A.S. MAKE NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation and BULL S.A.S.

Microsoft and BULL S.A.S. may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft or BULL S.A.S., as applicable, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

© 2008, 2009 Microsoft Corporation and BULL S.A.S. All rights reserved.

NovaScale is a registered trademark of Bull S.A.S.

Microsoft, Hyper-V, Windows, Windows Server, and the Windows logo are trademarks of the Microsoft group of companies.

PBS GridWorks®, GridWorks®, PBS Professional®, PBS® and Portable Batch System® are trademarks of Altair Engineering, Inc.

The names of actual companies and products mentioned herein may be the trademarks of their respective owners.

Initial publication: release 1.2, 52 pages, published in June 2008

Minor updates: release 1.5, 56 pages, published in November 2008

This paper, with the meta-scheduler implementation: release 2.0, 76 pages, published in June 2009

Abstract

The choice of an operating system (OS) for a high performance computing (HPC) cluster is a critical decision for IT departments. The goal of this paper is to show that simple techniques are available today to optimize the return on investment by making that choice unnecessary, keeping the HPC infrastructure versatile and flexible. This paper introduces the Hybrid Operating System Cluster (HOSC): an HPC cluster that can run several OS’s simultaneously. This paper addresses the situation where two OS’s run simultaneously: Linux Bull Advanced Server for Xeon and Microsoft® Windows® HPC Server 2008; however, most of the information presented here applies, with slight adaptations, to 3 or more simultaneous OS’s, possibly from other OS distributions. This document gives general concepts as well as detailed setup information. Firstly, the technologies necessary to design an HOSC are defined (dual-boot, virtualization, PXE, resource managers and job schedulers). Secondly, different approaches to HOSC architectures are analyzed and technical recommendations are given, with a focus on computing performance and management flexibility. The recommendations are then implemented to determine the best technical choices for designing an HOSC prototype. The installation setup of the prototype and the configuration steps are explained. A meta-scheduler based on Altair PBS Professional is implemented. Finally, basic HOSC administrator operations are listed and ideas for future work are proposed.

This paper can be downloaded from the following web sites:

http://www.bull.com/techtrends

http://www.microsoft.com/downloads

http://technet.microsoft.com/en-us/library/cc700329(WS.10).aspx

Abstract

1 Introduction

2 Concepts and products

2.1 Master Boot Record (MBR)

2.2 Dual-boot

2.3 Virtualization

2.4 PXE

2.5 Job schedulers and resource managers in an HPC cluster

2.6 Meta-scheduler

2.7 Bull Advanced Server for Xeon

2.7.1 Description

2.7.2 Cluster installation mechanisms

2.8 Windows HPC Server 2008

2.8.1 Description

2.8.2 Cluster installation mechanisms

2.9 PBS Professional

3 Approaches and recommendations

3.1 A single operating system at a time

3.2 Two simultaneous operating systems

3.3 Specialized nodes

3.3.1 Management node

3.3.2 Compute nodes

3.3.3 I/O nodes

3.3.4 Login nodes

3.4 Management services

3.5 Performance impact of virtualization

3.6 Meta-scheduler for HOSC

3.6.1 Goals

3.6.2 OS switch techniques

3.6.3 Provisioning and distribution policies

4 Technical choices for designing an HOSC prototype

4.1 Cluster approach

4.2 Management node

4.3 Compute nodes

4.4 Management services

4.5 HOSC prototype architecture

4.6 Meta-scheduler architecture

5 Setup of the HOSC prototype

5.1 Installation of the management nodes

5.1.1 Installation of the RHEL5.1 host OS with Xen

5.1.2 Creation of 2 virtual machines

5.1.3 Installation of XBAS management node on a VM

5.1.4 Installation of InfiniBand driver on domain 0

5.1.5 Installation of HPCS head node on a VM

5.1.6 Preparation for XBAS deployment on compute nodes

5.1.7 Preparation for HPCS deployment on compute nodes

5.1.8 Configuration of services on HPCS head node

5.2 Deployment of the operating systems on the compute nodes

5.2.1 Deployment of XBAS on compute nodes

5.2.2 Deployment of HPCS on compute nodes

5.3 Linux-Windows interoperability environment

5.3.1 Installation of the Subsystem for Unix-based Applications (SUA)

5.3.2 Installation of the Utilities and SDK for Unix-based Applications

5.3.3 Installation of add-on tools

5.4 User accounts

5.5 Configuration of ssh

5.5.1 RSA key generation

5.5.2 RSA key

5.5.3 Installation of freeSSHd on HPCS compute nodes

5.5.4 Configuration of freeSSHd on HPCS compute nodes

5.6 Installation of PBS Professional

5.6.1 PBS Professional Server setup

5.6.2 PBS Professional setup on XBAS compute nodes

5.6.3 PBS Professional setup on HPCS nodes

5.7 Meta-scheduler queues setup

5.7.1 Just-in-time provisioning setup

5.7.2 Calendar provisioning setup

6 Administration of the HOSC prototype

6.1 HOSC setup checking

6.2 Remote reboot command

6.3 Switch a compute node OS type from XBAS to HPCS

6.4 Switch a compute node OS type from HPCS to XBAS

6.4.1 Without sshd on the HPCS compute nodes

6.4.2 With sshd on the HPCS compute nodes

6.5 Re-deploy an OS

6.6 Submit a job with the meta-scheduler

6.7 Check node status with the meta-scheduler

7 Conclusion and perspectives

Appendix A: Acronyms

Appendix B: Bibliography and related links

Appendix C: Master boot record details

C.1 MBR structure

C.2 Save and restore MBR

Appendix D: Files used in examples

D.1 Windows HPC Server 2008 files

D.1.1 Files used for compute node deployment

D.1.2 Script for IPoIB setup

D.1.3 Scripts used for OS switch

D.2 XBAS files

D.2.1 Kickstart and PXE files

D.2.2 DHCP configuration

D.2.3 Scripts used for OS switch

D.2.4 Network interface bridge configuration

D.2.5 Network hosts

D.2.6 IB network interface configuration

D.2.7 ssh host configuration

D.3 Meta-scheduler setup files

D.3.1 PBS Professional configuration files on XBAS

D.3.2 PBS Professional configuration files on HPCS

D.3.3 OS load balancing files

Appendix E: Hardware and software used for the examples

E.1 Hardware

E.2 Software

Appendix F: About Altair and PBS GridWorks

F.1 About Altair

F.2 About PBS GridWorks

Appendix G: About Microsoft and Windows HPC Server 2008

G.1 About Microsoft

G.2 About Windows HPC Server 2008

Appendix H: About BULL S.A.S.

1 Introduction

Choosing the right operating system (OS) for a high performance computing (HPC) cluster can be a very difficult decision for IT departments, and one that usually has a significant impact on the Total Cost of Ownership (TCO) of the cluster. Parameters such as diverse user needs, application environment requirements and security policies add to the human factors involved in training, maintenance and support planning, all of which put the final return on investment (ROI) of the whole HPC infrastructure at risk. The goal of this paper is to show that simple techniques are available today to make that choice unnecessary and to keep your HPC infrastructure versatile and flexible.

In this white paper we study how to provide the best flexibility for running several OS’s on an HPC cluster. There are two main types of approaches to providing this service, depending on whether a single operating system is selected each time the whole cluster is booted, or whether several operating systems run simultaneously on the cluster. The most common approach of the first type is the dual-boot cluster (described in [1] and [2]). For the second type, we introduce the concept of a Hybrid Operating System Cluster (HOSC): a cluster in which some compute nodes run one OS type while the remaining nodes run another. Several approaches of both types are studied in this document in order to determine their properties (requirements, limits, feasibility, and usefulness), with a clear focus on computing performance and management flexibility.

The study is limited to 2 operating systems: Linux Bull Advanced Server for Xeon 5 v1.1 and Microsoft Windows HPC Server 2008 (referred to as XBAS and HPCS, respectively, in this paper). To optimize interoperability between the two OS worlds, we use the Subsystem for Unix-based Applications (SUA) for Windows. The description of the methodologies is kept as general as possible so that it can apply to other OS distributions, but examples are given exclusively in the XBAS/HPCS context. The concepts developed in this document could apply to 3 or more simultaneous OS’s with slight adaptations; however, this is outside the scope of this paper.

We introduce a meta-scheduler that provides a single submission point for both Linux and Windows jobs. It selects the cluster nodes with the OS type required by each submitted job, and it can switch the OS type of compute nodes automatically and safely, without administrator intervention. This optimizes computational workloads by adapting the distribution of OS types among the compute nodes.
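
As a simple illustration of the single submission point, jobs for either OS type can be submitted with the standard PBS Professional qsub command. A minimal sketch, in which the queue names linux and windows are hypothetical (the prototype's actual queue and provisioning setup is described in Sections 5.7 and 6.6):

    # Submit a Linux job and a Windows job from the same submission point;
    # the meta-scheduler places each job on nodes running the required OS
    qsub -q linux   run_simulation.sh
    qsub -q windows run_simulation.bat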

A technical proof of concept is given by designing, installing and running an HOSC prototype. This prototype provides computing power under both XBAS and HPCS simultaneously. It has two virtual management nodes (also known as head nodes) hosted on a single server, and the distribution of OS types among the compute nodes can be changed dynamically. We chose Altair PBS Professional software to demonstrate a meta-scheduler implementation. This project is the result of the collaborative work of Microsoft and Bull.

Chapter 2 defines the main technologies used in an HOSC: the Master Boot Record (MBR), the dual-boot method, virtualization, the Pre-boot eXecution Environment (PXE), and resource manager and job scheduler tools. If you are already familiar with these concepts, you may want to skip this chapter and go directly to Chapter 3, which analyzes different approaches to HOSC architectures and gives technical recommendations for their design. These recommendations are implemented in Chapter 4 in order to determine the best technical choices for building an HOSC prototype. The installation setup of the prototype and the configuration steps are explained in Chapter 5; Appendix D shows the files that were used during this step. Finally, basic HOSC administrator operations are listed in Chapter 6, and ideas for future work are proposed in Chapter 7, which concludes this paper.

This document is intended for computer scientists who are familiar with HPC cluster administration.

All acronyms used in this paper are listed in Appendix A. Complementary information can be found in the documents and web pages listed in Appendix B.

2 Concepts and products

Readers may not be familiar with every concept discussed in the remaining chapters in both the Linux and Windows environments. Therefore, this chapter introduces the technologies (Master Boot Record, dual-boot, virtualization and Pre-boot eXecution Environment) and products (Linux Bull Advanced Server, Windows HPC Server 2008 and PBS Professional) mentioned in this document.

If you are already familiar with these concepts or are more interested in general Hybrid OS Cluster (HOSC) considerations, you may want to skip this chapter and go directly to Chapter 3.

2.1 Master Boot Record (MBR)

The Master Boot Record (MBR) is the 512-byte boot sector of a partitioned data storage device such as a hard disk: its first sector. Operating system (OS) installation procedures usually overwrite the MBR, so the MBR previously written on the device is lost.

The MBR contains the partition table of the 4 primary partitions and bootstrap code that can either start an OS directly or load and run boot loader code (see the complete MBR structure in Table 3 of Appendix C.1). Each partition is encoded as a 16-byte structure with size, location and characteristic fields; its first 1-byte field is called the boot flag.
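
As an illustration, the partition table entries and their boot flags can be inspected directly from a Linux shell. A minimal sketch, assuming the disk is /dev/sda and the commands are run as root:

    # The partition table starts at byte offset 446 (0x1BE) of the MBR and
    # holds four 16-byte entries; with -c 16, xxd prints one entry per line.
    # The first byte of each entry is the boot flag: 0x80 = active, 0x00 = inactive.
    dd if=/dev/sda bs=1 skip=446 count=64 2>/dev/null | xxd -c 16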

The Windows MBR starts the OS installed on the active partition, that is, the first primary partition whose boot flag is enabled. You can therefore select an OS by activating the partition where it is installed. The diskpart.exe (Windows) and fdisk (Linux) tools can be used to change which partition is active. Appendix D.1.3 and Appendix D.2.3 give examples of commands that enable/disable the boot flag.
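
For example, the following command sequences enable the boot flag of the first primary partition on the first disk (a sketch only; the disk and partition numbers depend on the actual layout):

    rem From Windows, in an interactive diskpart.exe session:
    DISKPART> select disk 0
    DISKPART> select partition 1
    DISKPART> active

    # From Linux, in an interactive fdisk session ('a' toggles the boot
    # flag, 'w' writes the modified partition table back to the MBR):
    fdisk /dev/sda
    Command (m for help): a
    Partition number (1-4): 1
    Command (m for help): w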

The Linux MBR can run boot loader code (e.g., GRUB or LILO). You can then select an OS interactively from the boot loader user interface at the console. If no choice is made at the console, the OS selection is taken from the boot loader configuration file, which you can edit in advance of a reboot (e.g., grub.conf for the GRUB boot loader). If necessary, the Linux boot loader configuration file (which is stored in a Linux partition) can even be replaced from a Windows command line with the dd.exe tool.
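
As an illustration, with GRUB the OS started at the next reboot is determined by the default entry index in grub.conf. A hedged sketch of a dual-boot configuration, in which the partition layout, kernel paths and entry titles are hypothetical:

    # /boot/grub/grub.conf -- "default" selects the entry booted when no
    # interactive choice is made at the console (0 = first "title" entry)
    default=0
    timeout=5
    title XBAS (Linux)
            root (hd0,1)
            kernel /vmlinuz ro root=LABEL=/
            initrd /initrd.img
    title Windows HPC Server 2008
            rootnoverify (hd0,0)
            chainloader +1

Changing default=0 to default=1 before a reboot would start Windows instead of Linux.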

Appendix C.2 explains how to save and restore the MBR of a device. Understanding how the MBR works is essential for configuring dual-boot systems properly.
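
In practice, saving and restoring the MBR comes down to copying the first 512 bytes of the device; Appendix C.2 gives the exact procedure used in this paper. A minimal sketch, assuming the disk is /dev/sda:

    # Save the 512-byte MBR to a file, then restore it later
    dd if=/dev/sda of=/root/mbr-sda.bak bs=512 count=1
    dd if=/root/mbr-sda.bak of=/dev/sda bs=512 count=1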