Windows HPC Server Survey
Thursday, February 21, 2008
Thank you for agreeing to complete this Windows HPC Server 2008 survey. This survey will ask you about the three major areas of Windows HPC Server 2008: systems administration, job scheduling, and MPI/networking. In each section you will be asked to rate whether we’ve provided a good solution to a set of particular problems. You will also be asked if there are other problems that we’ve missed.
Please note that some of the solutions described below are not available in the Beta 1 release (November 2007) of Windows HPC Server. Most, if not all, of these features will be available in the Community Technology Preview (CTP) release available in mid-March 2008.
In exchange for completing this survey we’ll mail you a copy of the brand new Microsoft Press book, “Windows Server® 2008 Inside Out” by William R. Stanek. This 1500 page book provides detailed information about administering Windows Server 2008. For more information on this book please go to:
Thank you again for your assistance,
Ryan Waite
Group Program Manager – Windows HPC
Systems Administration Topics
Please rank the following systems administration issues on a scale of 1 to 5 where a “1” indicates an excellent solution in Windows HPC Server 2008 and a “5” indicates a poor solution.
Systems Administration / 1=Excellent solution, 5=Poor solution
Sample / 2
Problem: Deploying a large-scale cluster is difficult. There are many HW and SW components that may not be configured properly or may not be working at all. This can cause errors that are hard to diagnose. How do I bring up large systems in a methodical, consistent manner as smoothly as possible?
Solution: HPC Server 2008 includes a step-by-step procedure (To Do List) embedded in the admin console that walks the administrator through the steps of deployment, from network configuration to node deployment. The procedure includes setting up node templates for compute nodes that define the consistent set of tasks required to bring a node from bare metal to a functioning compute node ready to accept jobs.
Deployment progress can easily be monitored with failures identified in the console. HPC Server also provides a set of diagnostic tests that can be run after deployment to help isolate deployment failures or anomalies, if any. Additionally, we provide OEMs with the ability to deliver to their customers pre-installed clusters so as to allow them to bypass node provisioning altogether.
Problem: I need an automatic way to deploy and manage systems that does not require manual intervention.
Solution: HPC Server 2008 provides PowerShell commands for all administrator functions that can be used to deploy and manage cluster resources.
Problem: Over time, my cluster environment changes in inconsistent ways as I reconfigure nodes to add new applications or change environmental settings to accommodate application requirements. It is hard to keep track of configuration changes to ensure that the system is in a consistent state. I need to track these changes so that I can address problems that may arise from them. I may need to return the nodes to a known consistent state.
Solution: HPC Server 2008 provides a Node Template mechanism that the administrator can use to define the consistent steps needed to configure compute nodes. Node Templates can be used to define disk partitioning, operating system images, environmental variables, patching levels, and any script or command that could be run from the node’s command line. Administrators can return compute nodes to a known consistent state by simply reimaging the nodes using the node template.
Problem: My cluster needs maintenance over time to address such things as node local disks holding old scratch files from previous jobs, log files growing too long, etc. How do I know when to do this? How often and how?
Solution: HPC Server 2008 automatically defragments the databases that are used for job scheduling. The administrator also has the ability to launch a command or script in parallel across a set of nodes (clusrun), making some cleanup tasks simpler. With node templates, the administrator can easily reimage a node, removing temporary files.
Problem: I need to understand how well my cluster is utilized, by which users, and by what types of jobs. This may be used to determine priority and resource requirements for my business.
Solution: HPC Server 2008 provides a set of reports that provide trend analysis on job resource usage, job throughput, job turnaround, and node availability. Reports can filter information based on user, project, service, and job template.
Problem: I need to partition my cluster among different teams or different applications based on application requirements or based on business policies within my company.
Solution: HPC Server 2008 provides the ability for the administrator to define custom groups of nodes based on their own semantics. Any administrative action that can be performed on a single node can be performed on a group of nodes. Job templates can be defined to submit jobs to a specific node group. This allows administrators to provide cluster partitioning. It also allows administrators to carve off a set of nodes for testing purposes or to define a set of nodes for which specific applications are deployed. HPC Server can report on job resource metrics or cluster resource usage by node groups as well, thereby assisting in business decision making.
Problem: I need to be able to effectively troubleshoot problems with the cluster. Cluster failures are sometimes nebulous. An error in one component may manifest as a symptom in a different part of the cluster (e.g., application performance degradation due to a misconfigured network switch). How do I track down these problems?
Solution: HPC Server 2008 provides many tools to support troubleshooting. A set of diagnostic tests can be run to target HW and SW system components and report on failures or anomalies. Alerts are posted in the management console specifically identifying nodes that are not detectable by the job scheduler. With node templates, the administrator can easily reimage a problematic node, a common first step in correcting a node failure or anomaly. By providing an integrated solution that both manages nodes and schedules jobs, HPC Server lets administrators easily correlate information between different aspects of a cluster. For example, when a user calls about a job failure, the administrator can select the job and pivot quickly to see the nodes that ran the job and view the metrics and operations history on those nodes. Additionally, the admin console provides a heat map that shows cluster-specific node metrics at a glance across multiple nodes, allowing administrators to see outlier behavior (e.g., network hot spots) quickly. System logs and informational error messages are exposed for all job and node operations. The administrator also has the ability to launch a command in parallel across a set of nodes (clusrun) and view the output in interleaved or sequential format, allowing the administrator to compare node environments. We will also provide a Microsoft System Center Operations Manager (SCOM) Management Pack for HPC Server that allows for more extensive monitoring, including setting additional alerts and thresholds. SCOM also provides for 3rd party application monitoring and extensive customized reporting.
Problem: I want to recover quickly from failures so that current applications will continue to run and new applications can be started within a short time after a failure. Uptime for my cluster is important to my business.
Solution: HPC Server 2008 is resilient against a number of system failures. With node templates, the administrator can quickly reimage compute nodes on which failures have been detected, attempting to bring these nodes back to a known consistent state. If a node cannot be successfully reimaged, it can easily be removed from the cluster and replaced with a new node, using the same template to provision the replacement. HPC Server 2008 also provides the ability to configure multiple head nodes, using Windows Server 2008 Failover Clustering (WSFC), so that job scheduler services can fail over in the event of a head node failure. With this high availability configuration, if the head node fails, the second head node will detect the failure and automatically assume control of the cluster services. Clients will be automatically reconnected to the new head node after a short timeout interval. Applications running on the compute nodes will continue to run so long as they do not require application resources on the failed node.
Problem: I need to manage upgrading my cluster in a consistent and unobtrusive manner. I often schedule a maintenance window during which updates occur. I may not want the overhead of applying updates to compute nodes to impact the performance of applications running on the nodes.
Solution: HPC Server 2008 provides support for patching compute nodes without interrupting running jobs. The administrator can specify the level of patching (critical or all) to be applied to compute nodes through a patching task in the node template.
The administrator can define the set of critical or recommended patches using Windows Server Update Services (WSUS). Alternatively, the administrator can run a diagnostic test which determines the set of patches available for each node, and the administrator can add the appropriate individual patches to the node template. When the administrator wishes to apply the patches, he selects the nodes, takes them offline, executes a “patch” action and monitors the progress of the update.
The administrator can choose to wait for jobs on selected nodes to complete before patching. With pivoting, the admin can easily view the jobs currently running on the set of nodes that the admin wants to update. Or the administrator can “force” the node to go offline, thus allowing the maintenance window to start. Forcing a node offline will automatically requeue the current task running on the node, if possible, thereby allowing the running job’s task to be restarted on the remaining nodes allocated to it rather than killing the job.
Problem: I need to balance and rebalance the administrative load of my cluster among the nodes, based on my needs.
Solution: HPC Server 2008 provides support for offloading services from the head node. The head node can be easily configured as a compute node, as a WCF router node, or as neither. Compute nodes can be reconfigured quickly as WCF routers and vice versa. The head node can be configured to use SQL Server Standard or Enterprise rather than the default SQL Express database.
Did we miss any other important problems for an HPC solution to solve?
Other Important System Administration Issues / 1=Important, 5=Unimportant
Job Scheduling Topics
Please rank the following job scheduling issues on a scale of 1 to 5 where a “1” indicates an excellent solution in Windows HPC Server 2008 and a “5” indicates a poor solution.
Job Scheduling / 1=Excellent solution, 5=Poor solution
Sample / 3
Problem: In a cluster with many types and capabilities of compute nodes and networking systems, and where applications may not be accessible from all nodes, it’s really hard for users to figure out where they should run their jobs. If they are allowed to pick any nodes, the result will be either poor time-to-result (e.g., a big simulation running on “small-memory” nodes) or poor resource efficiency (e.g., a 4-way parallel job running exclusively on 8-core machines).
Solution: The resource matching feature of the Job Scheduler allows the user to specify compute, networking and application resource requirements so that the scheduler can perform right-sizing placement for the jobs. For example:
job submit /memorysize:1000000-3000000 /nodegroup:appX myapp.exe
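The matching step behind a request like the one above can be illustrated with a short Python sketch. This is not the Job Scheduler's implementation: the Node class, the match_nodes function, and the sample node data are all hypothetical, and node memory is assumed to be expressed in the same units as the requested range.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical description of a compute node."""
    name: str
    memory: int                      # assumed: same units as the requested range
    groups: set = field(default_factory=set)


def match_nodes(nodes, min_memory, max_memory, node_group):
    """Return the nodes in node_group whose memory falls within the requested range."""
    return [
        n for n in nodes
        if node_group in n.groups and min_memory <= n.memory <= max_memory
    ]


# Made-up inventory for illustration only.
nodes = [
    Node("node01", 512_000, {"appX"}),
    Node("node02", 2_000_000, {"appX"}),
    Node("node03", 2_000_000, {"appY"}),
    Node("node04", 8_000_000, {"appX"}),
]

matched = match_nodes(nodes, 1_000_000, 3_000_000, "appX")
print([n.name for n in matched])
```

In this sketch, nodes that are too small, too large, or outside the requested node group are filtered out before placement, which is the "right-sizing" idea in its simplest form.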
Problem: Some big-model simulations require dedicated access to memory controllers. As such, MPI jobs suffer performance loss if the MPI processes are tightly packed onto multiple cores of a single socket.
Solution: The multi-level compute resource allocation mechanism allows the job scheduler to optimally place memory-intensive jobs in ways that avoid memory contention and deliver maximum, predictable application performance. For example, to run an MPI job on 4-8 sockets, you type:
job submit /numsockets:4-8 mpiexec myapp.exe
Problem: When running jobs on clusters which are shared between users or between business units, jobs will often vary in their priority due to business reasons (“We need this data by EOD”) or financial reasons (“My team paid for this cluster”). In scenarios where results are needed on short notice or long-running jobs can hog the cluster, Priority-FIFO is not sufficient to ensure fast turnaround on these high-priority jobs.
Solution: Using the pre-emption feature, high-priority (and usually short-running) jobs can jump the queue and push lower-priority running jobs aside to meet the organization’s processing deadlines.
Pre-emption seeks to expedite the execution of high-priority jobs by allowing them to take resources currently assigned to lower-priority jobs. This mechanism sacrifices some calculation time (and thus, utilization) on lower-priority workloads, and in exchange provides improved latency for high-priority workloads.
To turn on the pre-emption feature, use either the cluscfg command or the Admin Console.
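The trade-off described above can be made concrete with a minimal Python sketch of a pre-emption decision. It is an illustration under stated assumptions, not the scheduler's actual algorithm: the Job class and the admit function are hypothetical, a larger priority value is assumed to mean a more important job, and resources are modeled as a single pool of cores.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int   # assumed: larger value = more important
    cores: int


def admit(job, running, total_cores):
    """Try to start `job`, preempting lower-priority running jobs if needed.

    Returns (started, preempted): whether the job started, and the jobs
    pushed back to the queue to make room for it.
    """
    free = total_cores - sum(j.cores for j in running)
    # Candidate victims: strictly lower priority, least important first.
    victims = sorted((j for j in running if j.priority < job.priority),
                     key=lambda j: j.priority)
    preempted = []
    for victim in victims:
        if free >= job.cores:
            break
        preempted.append(victim)
        free += victim.cores
    if free < job.cores:
        return False, []          # not enough room even after pre-emption
    for victim in preempted:
        running.remove(victim)
    running.append(job)
    return True, preempted


running = [Job("nightly-sweep", 1, 6), Job("regression", 2, 2)]
started, preempted = admit(Job("eod-report", 5, 4), running, total_cores=8)
print(started, [j.name for j in preempted], [j.name for j in running])
```

Here the high-priority job starts immediately, at the cost of the work already done by the lowest-priority running job, which has to be requeued.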
Problem: When jobs are composed of multiple tasks of varying runtime, per-job resource allocation can result in sub-optimal node reservations due to the “long tail” left behind as a few tasks take significantly longer than others. On the other hand, jobs sometimes get allocated less than their maximum number of nodes at start-up, and when more nodes become available those nodes end up sitting idle or getting allocated to another job from the queue. This can result in a job finishing later than it would have if it had waited in the queue until more processors became available. This is especially undesirable when it affects high-priority jobs.
Solution: Using the allocation grow and shrink policy, the job scheduler can dynamically change the resource allocations of running jobs, allowing better overall cluster utilization and improved resource allocation for jobs of varying priorities.
By default, if a job is submitted without any resource specification, the job scheduler will automatically calculate the resource usage of the job based on the number of concurrent tasks in the job.
To turn automatic grow and shrink on or off, use the cluscfg command or the Admin Console.
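The grow and shrink policy described above can be sketched in a few lines of Python. This is a simplified model, not the scheduler's real policy: the RunningJob class and the rebalance function are hypothetical, each task is assumed to need exactly one core, and a larger priority value is assumed to mean a more important job.

```python
from dataclasses import dataclass


@dataclass
class RunningJob:
    name: str
    priority: int        # assumed: larger value = more important
    min_cores: int
    max_cores: int
    active_tasks: int    # tasks still running or waiting to run
    allocated: int


def useful_cores(job):
    """Cores the job can actually use: bounded by its min, max, and task count."""
    return max(job.min_cores, min(job.max_cores, job.active_tasks))


def rebalance(jobs, total_cores):
    """Shrink jobs holding idle cores, then grow jobs that can use more.

    Returns the number of cores left unallocated.
    """
    # Shrink: release cores a job can no longer use (the "long tail").
    for job in jobs:
        job.allocated = min(job.allocated, useful_cores(job))
    free = total_cores - sum(j.allocated for j in jobs)
    # Grow: hand free cores to the most important jobs first.
    for job in sorted(jobs, key=lambda j: -j.priority):
        extra = min(free, useful_cores(job) - job.allocated)
        if extra > 0:
            job.allocated += extra
            free -= extra
    return free


tail = RunningJob("sweep-tail", 1, 1, 12, active_tasks=2, allocated=12)
urgent = RunningJob("urgent-sim", 3, 2, 16, active_tasks=40, allocated=4)
free = rebalance([tail, urgent], total_cores=16)
print(tail.allocated, urgent.allocated, free)
```

In this example the job with only two tasks left gives back ten cores, and the higher-priority job that started below its maximum grows into them instead of leaving them idle.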
Problem: In an HPC data center, there are multiple user groups with competing priorities. It’s really complicated to set up access constraints to ensure that resources are delivered to user groups in ways that satisfy the overall organization’s productivity goals.
Solution: With the job template feature, an admin can carve up the resources in the way that best meets the processing needs and priorities of the multiple user groups. The admin can provide defaults for, or constrain, job terms such as the number of requested processors, sockets or nodes, requested nodes or node groups, exclusive usage of nodes, runtime limit and project names.
The job template feature is accessible through the command line or the Admin Console.
Problem: Building HPC applications is hard. To accelerate parametric sweep processing through the traditional job scheduler job/task model, the developer is sometimes forced to serialize input and output and wrap the core logic into an executable, leading to low productivity. Debugging multiple components of an application is also difficult.
Solution: With the WCF Broker feature, the developer is presented with a service-oriented programming model based on WCF that effectively hides the details of data serialization and distributed computing. The end result is “instant gridification”. The WCF Broker automatically load balances the calculation requests to the backend services in ways that best utilize the capacity of the compute nodes.
Microsoft Visual Studio provides the tools to debug services and clients on the developer’s workstation.
Problem: Quite a number of applications have sub-second runtimes. The job/task model usually incurs a long start-up overhead, yielding diminishing returns for these types of applications.
Solution: The WCF Broker employs efficient request forwarding protocols such that the round-trip latency from the client to the services is in the millisecond range.
Problem: As the number of users of the application services increases, it is imperative that usage of the services be authorized and audited, and that message privacy be protected.
Solution: The WCF Broker feature supports end-to-end Kerberos authentication.
The services can be run under the submission user so that the auditing of the service usage can be easily performed using the Windows Server tools.
End-to-end transport-level security is used to encrypt the messages and protect the privacy of the information.
Problem: Application services and infrastructure are hard to manage. It is not easy to track where the services are installed, to diagnose problems when failures occur, or to monitor performance to gauge the health of the system.
Solution: HPC Server 2008 provides tools to view the inventory of the application services; diagnostic tests that check the configuration of the Broker and services and the performance of the services on selected nodes; and performance counters that monitor the number of total, outstanding, and faulted service calls as well as the average time and throughput of service calls.
To monitor Broker health, the admin console provides a heat map view of the list of Brokers, which makes it easy to see the memory, CPU and network usage of each Broker.
Problem: For an organization that has already invested in other job schedulers, replacing them with a new job scheduler is a hard sell.
Solution: HPC Server 2008 ships with an HPC Profile Web Services component which enables interoperability with third party job schedulers.
With the HPC Profile Web Services, a job submitted to a third party scheduler can be handed off to the Windows cluster transparently. We are working closely with Platform Computing and Altair Engineering to deliver this interoperability to the customers of LSF and PBSPro.
Did we miss any other important problems for an HPC solution to solve?