
Job scheduling

A blog focused on explaining the new features in v2 and how to develop for and integrate with the HPC 2008 scheduler. This blog is owned by a group of Windows HPC Server 2008 development professionals.
Creating Activation and Submission Filters
There were a lot of questions about how to use Activation and Submission Filters to help customize queue management and do things like license-aware scheduling.  That, on top of some changes made in a QFE, led us to do an updated doc on using filters.  You can check it out here:
 
It contains updated sample code and explanations.  Let us know if you think any information is missing!
Integrating Windows HPC Server 2008 with Linux
We find many of our Windows HPC Server 2008 deployments are going into environments where there are existing Linux (and Linux HPC) solutions. It is possible to configure these two environments to achieve integrated authentication, file sharing and job submission.
 
The following technical documents provide step by step instructions on how to do a typical installation of Windows HPC and a Linux HPC distribution so as to achieve a single sign on environment, file sharing and to submit jobs from a Linux environment running Sun Grid Engine into the Windows HPC 2008 job scheduler using the HPC Basic Profile web service specification.
 
 
 

Installation of Fedora Samba for Windows AD Compatibility
Details Page: http://www.microsoft.com/downloads/details.aspx?FamilyId=1C2C91A8-6D81-4BC2-94E9-448D68A7D06D&displaylang=en
Direct Download: http://download.microsoft.com/download/1/6/9/16963418-6d06-4cb6-8b65-9fe3da11c583/Installation_of_Fedora-Samba_for_Windows_AD_Compatibility_Final.doc

Installation Instructions for Cluster Corp Rocks+ on the HP Proliant DL145 G2 Based Cluster
Details Page: http://www.microsoft.com/downloads/details.aspx?FamilyId=7AE6D41D-4C86-4B34-9C62-466646915926&displaylang=en
Direct Download: http://download.microsoft.com/download/5/b/5/5b55533c-12a6-4979-8849-f7b7a57eff61/Linux_Installation_Final.doc

Installation of Fedora 8 Linux to Access a Windows HPC Server 2008 Cluster
Details Page: http://www.microsoft.com/downloads/details.aspx?FamilyId=25647D7B-CE6A-45D2-8472-12F4DB537951&displaylang=en
Direct Download: http://download.microsoft.com/download/f/1/9/f190260a-a892-491b-ab2e-c884c72a9e7d/Linux_Installation_for_Windows_Cluster_Access_Final.doc

Installation Instructions for a Windows HPC Server 2008 Based Cluster on HP Proliant DL145 G2 Hardware
Details Page: http://www.microsoft.com/downloads/details.aspx?FamilyId=1349438A-A05B-4E2B-91F8-8BF3058EB307&displaylang=en
Direct Download: http://download.microsoft.com/download/0/1/a/01af1aba-0015-4236-a4ab-7498d2e51829/Microsoft_Installation_with_AD_Final.doc

The Windows HPC Server 2008 Cluster in a Linux Environment
Details Page: http://www.microsoft.com/downloads/details.aspx?FamilyId=9E65676E-D34E-4671-B841-0D1DCA996A8B&displaylang=en
Direct Download: http://download.microsoft.com/download/1/1/6/116be099-9c6b-424c-81e6-c9ce2455ae80/Windows HPC in Linux Environment_Final.doc

 

Clusrunning with Windows HPC Server 2008

One of our most popular features in the Compute Cluster Pack was clusrun (known to you GUI users as “Remote Command Execution”), which allowed you to run a command line command across a set of cluster nodes in parallel, with their output piped back to you on the client.  Not content to rest on our laurels, we’ve made some additions to clusrun’s capabilities in Windows HPC Server 2008.  I’ll dig into some of them below.

 

But First, Clusrun Basics

At a basic level, clusrun runs a job with a task in it for each node that you specify.  This job completely bypasses the queue to start right away, and the tasks pipe their information back to the client machine.  This has a couple of requirements to work, namely:

·         All of the target machines must be nodes in the cluster (with the HPC pack installed and able to communicate with the head node), but they don’t have to be in the “Online” state

·         Your compute nodes must be able to write to a file share on the client computer; you can test this by logging into a node and attempting to connect to \\client\c$

·         Your job scheduler needs to be working

Assuming these requirements are met, you can run a clusrun command either from the command line (using the clusrun command) or from the HPC Cluster Manager (by right-clicking some nodes and selecting “Run Command . . .”).  As a simple example, try running clusrun /all hostname.exe; each of the nodes in your cluster will print its name on your client:

PS> clusrun /all hostname.exe

Enter the password for 'REDMOND\jbarnard' to connect to 'JBarnardHN':

Remember this password? (Y/N)Y

-------------------------- JBARNARDCN01 returns 0 --------------------------

JBARNARDCN01

-------------------------- JBARNARDCN03 returns 0 --------------------------

JBARNARDCN03

-------------------------- JBARNARDHN returns 0 --------------------------

JBarnardHN

-------------------------- JBARNARDCN02 returns 0 --------------------------

JBARNARDCN02

-------------------------- Summary --------------------------

4 Nodes succeeded

0 Nodes failed

 

So What’s New?

There are a lot of new options for clusrun in HPCS 2008.  These include:

 

New Formatting Options: Sorted or Interleaved Output

By default, clusrun returns output as each node completes the command.  But you can override this by using either the /sorted or /interleaved flags.

/Sorted prints node output in alphabetical order, making it easier to find a specific node.  /Interleaved prints out lines of output as they come back, which is great for processing with a script or for determining just where things are going wrong.
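To make the difference between the two modes concrete, here is a small Python sketch. It is only a toy model: the node names, the sample lines, and the "node: line" prefix are assumptions made for illustration, not clusrun's actual output format.

```python
from collections import defaultdict

def sorted_output(events):
    """Group lines per node, then list nodes alphabetically (the /sorted idea)."""
    per_node = defaultdict(list)
    for node, line in events:
        per_node[node].append(line)
    return [(node, per_node[node]) for node in sorted(per_node)]

def interleaved_output(events):
    """Emit each line in arrival order, tagged with its node (the /interleaved idea)."""
    return [f"{node}: {line}" for node, line in events]

# Hypothetical stream of (node, line) pairs in the order they reach the client.
events = [
    ("CN03", "step 1 ok"),
    ("CN01", "step 1 ok"),
    ("CN03", "step 2 failed"),
    ("CN01", "step 2 ok"),
]

print(sorted_output(events))
print(interleaved_output(events))
```

The sorted view makes it easy to find CN03's full output, while the interleaved view shows that CN03's failure arrived before CN01 had finished.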

 

Picking Your Nodes: Exclude, Job, Task

We’ve got some great new options for picking your nodes, including the ability to exclude a set of nodes with the /exclude flag.  So the command “clusrun /all /exclude:Node14 ipconfig” will return the IP configuration of every node other than Node14.

Next up are the /job and /task options, which are my personal favorites!  They allow you to run a clusrun command against all of the nodes which are (or were) assigned to a particular job or task.  For example, “clusrun /task:10.4 del /q SomeFile.txt” will delete SomeFile.txt from every node that ran task #10.4.

 

History Tracking

Clusrun jobs now live in the database just like regular jobs, making it easier to track what you’ve done and to uncover failures.  You can easily find them from the command line by running job list /jobname:"Remote command", or in the HPC Cluster Manager by selecting the “Clusrun Commands” node in the navigation pane.  Each node in the run will have a separate task (including exit code, error message, etc.), allowing you to more easily dig into the causes of failures.

 

Happy Clusrunning!

-Josh

MPI Process Placement with Windows HPC Server 2008

We get an awful lot of questions about how to go about getting the desired process placement across nodes in an MPI job (just check our forums if you don’t believe me), so I thought I’d post here to shed some light on the things that are possible.

 

Requesting Resources

The first thing you need to do is express the number of resources that you’d like allocated for your job.  This can be done at the node, socket, or core level.  Requesting 4-8 cores will assign your job as few as 4 and as many as 8 processors; requesting 4-8 nodes will get you as few as 4 and as many as 8 hosts, and so forth.  For more details on how this sort of request works, check out my previous post, How that Node/Socket/Core thing Works.

Why a Min and a Max?

You may be wondering why it’s necessary to specify a Min and a Max for your job.  The reason is that this enables you to help the Job Scheduler decide when and where to run your job.  The job scheduler will start your job as soon as it has at least the minimum number of processors.  If at that time there are more than the minimum number of resources available, the scheduler will give you up to your maximum number of resources. Thus, setting a small min will allow your job to start sooner, while setting a large min will cause your job to wait in the queue until more resources are available.  There is no best way for everyone: you should decide on the min and max that works best for you!  But the general guidelines are:

·         The smaller your Minimum, the sooner your job will run, so pick as small a Minimum as you’d reasonably accept

·         A larger Maximum lets your job use more resources and can reduce its overall run time, so pick a Maximum which matches the largest size at which your application still scales with reasonable performance gains

·         You can always set the Minimum and Maximum to the same value to request a fixed number of nodes

This capability, and all of the MPI process placement features described below, are designed to allow you to specify how you want your job to run without needing to know ahead of time how many or which nodes your MPI application will end up running on.
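As a rough illustration of the Min/Max rule described above, here is a minimal Python sketch. It models only the decision "start now or keep waiting, and with how many resources"; the function name and the single-number view of availability are simplifying assumptions, and the real scheduler takes many other factors into account.

```python
def allocate(minimum, maximum, available):
    """Return how many resources a job gets right now, or None if it must keep waiting."""
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    if available < minimum:
        return None  # not enough free resources yet, so the job stays queued
    # Enough to start: hand out everything that is free, capped at the maximum.
    return min(available, maximum)

print(allocate(4, 8, 3))   # too few resources free, job waits
print(allocate(4, 8, 6))   # starts with 6
print(allocate(4, 8, 20))  # starts with the maximum of 8
```

Setting the Minimum and Maximum to the same value, as in allocate(4, 4, 10), always yields exactly that number once the job can start.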

 

Setting How Many Processes and Where They Will Run

Once you’ve figured out how many resources you want for your job, the next step is to figure out how you want your MPI ranks started across these nodes.  By default, ranks will be started on a “1-per resource requested” basis, so requesting 4 sockets will result in 4 MPI ranks, 1 per socket.  For example:

Command Line

Result

C:\>job submit /numsockets:9 mpiexec MyApp.exe

9 ranks of MyApp.exe will be started across an unknown number of nodes, with no more than 1 rank per socket started on any node.

C:\>job submit /numnodes:2-4 mpiexec  MyApp.exe

 

2-4 ranks of MyApp.exe will be started across 2-4 physical nodes (depending on how many are available), with 1 rank per node.

Table 1: Submitting an MPI Job with Default Process Placement

 

Setting the Number of Cores per Node with -c

We provide a new mpiexec option, -cores (or -c), which allows you to specify the number of ranks to start on each node assigned to your job.  This is especially useful with node-level scheduling, allowing you to control the size and placement of your job with laser-like precision!  Adding some of the other node selection options (like corespernode) will make this even more powerful.  For example:

Command Line

Result

C:\>job submit /numnodes:4-4 mpiexec -cores 2 MyApp.exe

MyApp.exe will be started across 4 nodes, with 2 ranks per node (for a total of 8 ranks).

C:\>job submit /numnodes:1-8 mpiexec -cores 3 MyApp.exe

Between 3 and 24 ranks of MyApp.exe will be started, with 3 ranks per node spanning up to 8 nodes.

C:\>job submit /numnodes:8 /corespernode:8 mpiexec -cores 7 MyApp.exe

MyApp.exe will start on 8 nodes.  All 8 nodes must have at least 8 cores on them, and 7 ranks of MyApp.exe will be started on each of the nodes (for a total of 56 ranks).

Table 2: Submitting an MPI Job and Specifying the Number of Cores per Node

Note: The /corespernode option refers to the minimum number of cores which must be present on any node assigned to the job, not the number of cores to allocate on a node.
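The rank counts for node-level requests combined with -cores are simple multiplication, which this short Python sketch spells out. The function name is made up for illustration; it just multiplies the node range by the ranks per node.

```python
def total_ranks(min_nodes, max_nodes, ranks_per_node):
    """Return the (fewest, most) ranks started for a node range and a -cores value."""
    return (min_nodes * ranks_per_node, max_nodes * ranks_per_node)

print(total_ranks(4, 4, 2))  # /numnodes:4-4 with -cores 2
print(total_ranks(1, 8, 3))  # /numnodes:1-8 with -cores 3
print(total_ranks(8, 8, 7))  # /numnodes:8 with -cores 7
```

So 4 nodes with 2 ranks each gives 8 ranks, 1 to 8 nodes with 3 ranks each gives 3 to 24 ranks, and 8 nodes with 7 ranks each gives 56 ranks.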

 

Setting the Number of Total Ranks with -n

You can use the -n argument to mpiexec to set the total number of ranks to start across the entire run, allowing even more fine-grained control.  For example:

Command Line

Result

C:\>job submit /numcores:8 mpiexec -n 16 MyApp.exe

16 ranks of MyApp.exe will be started, 2 to a core over 8 cores.

C:\>job submit /numnodes:4 mpiexec -n 8 MyApp.exe

8 ranks of MyApp.exe will be started across 4 nodes.

Table 3: Using the -n option to mpiexec
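To picture how a fixed number of ranks spreads over the allocated resources, here is a small Python model. It assumes plain round-robin assignment over hypothetical resource names; the exact order mpiexec uses may differ, so treat this as a sketch of the counts rather than of the real placement.

```python
def place_ranks(n_ranks, resources):
    """Assign ranks 0..n_ranks-1 to resources in round-robin order."""
    placement = {name: [] for name in resources}
    for rank in range(n_ranks):
        placement[resources[rank % len(resources)]].append(rank)
    return placement

# 8 ranks over 4 nodes (hypothetical node names): 2 ranks land on each node.
print(place_ranks(8, ["N1", "N2", "N3", "N4"]))

# 16 ranks over 8 cores: 2 ranks per core.
cores = [f"core{i}" for i in range(8)]
print({name: len(ranks) for name, ranks in place_ranks(16, cores).items()})
```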

 

Now Set Affinity

Setting affinity can result in huge performance improvements for MPI applications, and we’ve made it way easier for you to take advantage of affinity!  How easy?  Just use the -affinity flag to mpiexec, and each rank of your MPI application will be locked to a single core (which can dramatically improve performance for certain applications).  For example:

Command Line

Result

C:\>job submit /numnodes:2-4 mpiexec -cores 2 -affinity MyApp.exe

MyApp.exe will be started with 2 ranks on each of between 2 and 4 nodes, for a total of 4 to 8 ranks.  Each rank will get affinity to one of the cores on its assigned node, so that the two ranks sharing a node cannot step on each other’s toes.  If the nodes in question have a NUMA architecture, the ranks on each node will automatically be placed on separate NUMA nodes.

C:\>job submit /numsockets:8 mpiexec -affinity MyApp.exe

8 ranks of MyApp.exe will be started across an unknown number of nodes, where each rank will have a dedicated path to memory that cannot be used by any other job.

Table 4: Submitting an MPI Job with Affinity

Note: Mpiexec will attempt to automatically ensure that ranks are spaced as “far apart” as possible; i.e. on different sockets in a NUMA system.

 

Checking Your Work

You can run a very simple test that will tell you how your placement worked.  If you use mpiexec -l hostname.exe (plus any other arguments you need), your output will be a list of MPI ranks and the nodes that they appeared on.  This will allow you to see the number of ranks started on each node, as well as the round-robin order that mpiexec uses.
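If the output gets long, a few lines of Python can tally it for you. This sketch assumes each output line looks like "[rank]NODENAME" (a rank in square brackets followed by the host name); that format and the sample text are assumptions, so adjust the pattern to whatever your run actually prints.

```python
import re
from collections import Counter

# Assumed line shape: "[<rank>]<hostname>", e.g. "[3]CN02".
LINE = re.compile(r"^\[(\d+)\]\s*(\S+)")

def ranks_per_node(output):
    """Count how many ranks reported each host name."""
    counts = Counter()
    for line in output.splitlines():
        match = LINE.match(line.strip())
        if match:
            counts[match.group(2).upper()] += 1
    return dict(counts)

# Hypothetical captured output from a 4-rank run on two nodes.
sample = """[0]CN01
[1]CN02
[2]CN01
[3]CN02"""

print(ranks_per_node(sample))
```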

 

Go Forth and Conquer!

That’s the scoop on process placement with Windows HPC Server 2008.  Go try it out!  And if you encounter any problems, please post up on our forums and we’ll be happy to help you out.

How to Submit a Job

I was asked today where there was a quick explanation of how to submit a job from the command line, and I realized that I didn't have a really good answer. So I figured I'd do a blog post (first in a while, I know!) on how to do this.

 

Starting Simple

Let's say you have an application called "Divide.exe" which takes two arguments, "-Numerator X" and "-Denominator Y". So for example, to divide 3 by 6 you would run the command:

divide.exe -Numerator 3 -Denominator 6

 

So now you want to run this thing on a single processor on some node in the cluster. It's really easy! You can just do:

job submit divide.exe -Numerator 3 -Denominator 6

 

No problem, right?

 

Getting Parallel

Of course if you're using a cluster, you don't just want to run one command line on one processor of one machine. You want to run in parallel! There are a few ways that you can create parallel jobs depending on the way that your application works.

The two most common ways to do this in HPC are using MPI or using Parameter Sweeps.

 

Submitting MPI Jobs

Submitting MPI jobs is just as easy as submitting any other job! You simply do a job submit followed by your MPI command line. The number of options offered by mpiexec is astounding (try mpiexec -help3 for more on that) and will probably be covered in a future post, but the basic command line to submit a 16-processor MPI job would be:

job submit /numcores:16 mpiexec MpiDivide.exe -Numerator 3 -Denominator 6

 

Submitting a Parameter Sweep

We use the term "parameter sweep" to refer to a job which runs a single, serial application many times, achieving parallelism by running many instances of it at the same time. In HPC Server (and in the Compute Cluster Pack), this is accomplished by creating a job with many tasks. So for example, you could create a job that divided 3 by 6, 9, and 12 by doing the following (where X is the job ID returned by the first step):

job new /jobname:"My Parameter Sweep Job"

job add X divide.exe -Numerator 3 -denominator 6

job add X divide.exe -Numerator 3 -denominator 9

job add X divide.exe -Numerator 3 -denominator 12

job submit /id:X

 

New in HPC Server 2008, we have a much simpler (and faster!) way of running parameter sweeps using wildcards! This approach will save you a lot of time, since these sweeps are much easier to create and because the Job Scheduler needs to store much less information for a Parametric Task than it does for a job with many distinct tasks. An example of how to do the same thing as above using this new technique follows:

job submit /parametric:6-12:3 divide.exe -Numerator 3 -denominator *
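To see what the start-end:increment range and the * wildcard stand for, here is a small Python sketch that expands a sweep into the individual command lines. The helper name is made up, and the expansion is modeled on the description in this post (each step's value replaces the *), not taken from the scheduler's implementation.

```python
def expand_sweep(start, end, increment, command):
    """Return one command line per sweep step, with '*' replaced by the step's value."""
    return [command.replace("*", str(value)) for value in range(start, end + 1, increment)]

for line in expand_sweep(6, 12, 3, "divide.exe -Numerator 3 -denominator *"):
    print(line)
```

With a range of 6-12 and an increment of 3, this yields the same three divide.exe commands (denominators 6, 9, and 12) as the three separate tasks.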

 

For more info on how to use the new HPC Server parametric sweeps, see my earlier blog post, Making a Clean Sweep with Windows HPC Server 2008.

 

For more info on the command line tools in general, check out our Command Line Reference. Unfortunately it's not yet updated for v2 . . . but there should be a new one out on the web in just a few weeks!

Job Template White Paper is Now Available
We've posted a white paper on how Job Templates can help you manage your cluster.  You can check out the white paper here:
Making a Clean Sweep with Windows HPC Server 2008

Parameter Sweeps are one of the most common types of jobs that get run on HPC clusters, and we've done some work in Windows HPC Server 2008 to make them easier (and faster) than ever. Today I'll dig in a little on what's different and how to take advantage of these new features.

For those familiar with other scheduling products, you may recognize these as very similar to "Job Arrays."

 

What is a Parameter Sweep?

Simply put, a parameter sweep is when you run a single command many times over a set of different input parameters. Problems which can be solved in this way are very common. They're also a great use of HPC clusters, since parameter sweeps are inherently embarrassingly parallel; namely they can be run in parallel with little or no effort and scale almost limitlessly.

In general, Parameter Sweeps take the form of a single command line which is run N times, with different input, output, and/or command line arguments for each of the N steps. For example, think about an application called FileZipper.exe that takes an input file and generates a compressed output file. To run it on 100 data files, you'd basically want to do something like:

FileZipper.exe <FileToZip1.dat >ZippedFile1.zip

FileZipper.exe <FileToZip2.dat >ZippedFile2.zip

FileZipper.exe <FileToZip3.dat >ZippedFile3.zip

FileZipper.exe <FileToZip100.dat >ZippedFile100.zip

These instances can all run independently of one another, and you can run as many in parallel as you have processors to do the compression with.

Make sense? If it does, then you already understand pretty much everything there is to know about Parameter Sweeps.

 

What's different in Windows HPC Server 2008?

In the Compute Cluster Pack (our product from 2005), parameter sweeps could be generated pretty easily from the UI by inputting the Start Index (the number to start counting from), End Index (the last number to use), and the Increment (the number to add for each step in the sweep). Using the UI to create such a sweep would then create N individual tasks in the job. This was useful, but had a number of downsides, namely:

  • You're storing a lot of repeated information in the scheduler database
  • If you wanted to change a part of your sweep, you had to change every task
  • Large sweeps became very unwieldy to view in the UI or at the command line

In Windows HPC Server 2008, we add a new type of task called a Parametric Task:

Adding a Parametric Task

Figure 1: Adding a Parametric Task to a Job

Now, these tasks are stored as a unit, which makes storing, editing, managing, and monitoring them a snap! Let's go ahead and give it a try . . .

 

Creating a Parameter Sweep

To create your first parameter sweep, open up the HPC Job Manager. The simplest way to create a sweep is to click on the Parametric Sweep Job link in the right-hand Actions Pane.

Let's create a sweep in the form of the example up above; namely, a sweep that zips up 100 files:

  1. Go ahead and provide a Name like "File Zipper".
  2. Set a Start Value of 1 and an End Value of 100.
  3. Leave the Increment Value at 1 (since we'll be counting up by 1's).
  4. Enter your Command Line, in this case "FileZipper.exe".
  5. We'll set the Standard Input and Standard Output as above:
    1. Stdin: FileToZip*.dat
    2. Stdout: ZippedFile*.zip
  6. Check the preview box at the bottom of the dialog to see what your sweep will look like, then go ahead and submit.

You should end up with something that looks like this:

Creating a Sweep

Figure 2: Creating a Parameter Sweep Task

 

Tracking Your Sweep's Progress

If you check the job list, you should now see that you've submitted a job with a single task in it. But actually, you can easily track each task individually by checking the box labeled Expand parametric tasks. This allows you to track your sweep as a unit, but dig in on failures or results for individual steps.

 

Doing that From the Command Line

Of course you can do the same thing from the Command Line:

C:\>job submit /parametric:100 /StdIn:"FileToZip*.dat" /StdOut:"ZippedFile*.zip" FileZipper.exe

Or from PowerShell:

PS> New-HpcJob | Add-HpcTask -Parametric -Start 1 -End 100 -Stdin "FileToZip*.dat" -Stdout "ZippedFile*.zip" -CommandLine "FileZipper.exe" | Submit-HpcJob
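If you want to preview what a sweep like this stands for before submitting it, the following Python sketch lists the per-step standard input and output files. The function and the dictionary keys are hypothetical names chosen for illustration; the sketch only mirrors the wildcard substitution described in this post.

```python
def sweep_tasks(start, end, increment, command, stdin, stdout):
    """Build one task description per sweep step, replacing '*' with the step's value."""
    tasks = []
    for value in range(start, end + 1, increment):
        index = str(value)
        tasks.append({
            "command": command.replace("*", index),
            "stdin": stdin.replace("*", index),
            "stdout": stdout.replace("*", index),
        })
    return tasks

tasks = sweep_tasks(1, 100, 1, "FileZipper.exe", "FileToZip*.dat", "ZippedFile*.zip")
print(len(tasks))
print(tasks[0])
print(tasks[-1])
```

The first task reads FileToZip1.dat and writes ZippedFile1.zip, and the last one reads FileToZip100.dat and writes ZippedFile100.zip.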

 

That's all for this time. Happy sweeping!

What the heck is a “Job Template” and why should I care?

Job Templates are one of the most whiz-bang new features in HPC Server 2008, but we've gotten a lot of feedback from the community that you don't really know what they're for or how to use them. So I figured a quick post here might help solve that problem by introducing you to one of the most powerful features of the v2 Job Scheduler.

 

What is a Job Template?

Simply put, a job template is a custom submission policy configured by the admin. Admins can create a number of different job templates and then let users pick the one that is right for them and their job (assuming they have the necessary permissions).

 

Job Templates vs. Job Queues

In many ways, HPC Server job templates are the same as the queues found in other scheduling products (like Platform's LSF), in that they allow you to:

  • Partition the cluster
  • Give different permissions to different jobs and users
  • Provide handling for different types of jobs

That being said, job templates aren't queues because of the simple fact that in the end, all jobs submitted to the system end up in the same queue. We think that's a great design, because there is a single place to view all of the jobs, and a single, ordered queue of all the jobs waiting to execute.

 

How Job Templates Work

Job templates work by allowing administrators to provide Defaults and Constraints to every job that comes into the system. They are also ACL'd, meaning that Administrators can control which sets of users can submit which types of jobs.

This diagram quickly explains how Job Templates are applied to a job:

Validating a Job 

Figure 1: The Scheduler Validates a Job Using a Job Template

 

How to Use Job Templates

The easiest way to explain how to use job templates is to make up a scenario and then show you how to enforce the policies required by that scenario. So let's do just that . . .

 

Say you have two groups of users:

  • Paying Customers- These are the guys who paid for the cluster to be installed, and they get nearly unlimited rights to use the cluster as they see fit.
  • Freeloaders- These are other employees at your company. They are allowed to use the cluster, but only in limited amounts and only if they don't get in the Paying Customers' way.

 

Let's go ahead and create two templates, one for each group, with the necessary settings and permissions to make everything work out nicely.

 

Step 1: Create a template for the Paying Customers group

First, let's head into the product and create a template for the "Paying Customers":

  • Click on "Configuration" in the lower left of the HPC Cluster Manager, and then select "Job Templates" in the Navigation Pane.
  • Now select "New" in the Action Pane to create a new Job Template.
  • On the Welcome page, set the template name to "Paying Customers Template."
  • Accept the defaults on the Job Run Times page, since these guys should be able to do whatever they want.
  • On the Job Priorities page, set the Maximum priority to Maximum; this will allow paying customers to submit jobs with any priority that they'd like.
  • Accept the defaults on the Project Names and Node Groups pages.
  • Click finish to complete the wizard and create this job template.

 

Step 2: Create a template for the Freeloaders group

  1. Click on "New" again to create another new Job Template.
  2. Enter the name "Freeloaders" on the Welcome page.
  3. On the Job Run Times page, enter a maximum run time of 1 hour to prevent Freeloaders from submitting long-running jobs.
  4. On the Job Priorities page, set a default priority of Lowest and a Maximum Priority of Below Normal. This will ensure that Freeloaders' jobs always get a nice low priority . . . so Paying Customers will pass them in the queue and even pre-empt them if the pre-emption scheduling policy is enabled.
  5. Accept the defaults for Project Names and Node Groups.
  6. Click finish to complete the wizard and create this job template.

 

Step 3: Set permissions

The final step is to set the appropriate permissions so that no one can use job templates that they shouldn't.

  • In the job templates view, highlight the "Default" template and say "Set Permissions."
    • Remove the "Users" group from this ACL so that no users can use the Default template any more.
  • Now highlight the "Paying Customers" template and say "Set Permissions."
    • Remove the "Users" group from this ACL.
    • Add the Paying Customers group (you can create a new local group and manage the users in it by going to Computer -> Right-click -> Manage -> Users and Groups from the start menu), and give them the Submit Job permission.
  • Now highlight the "Freeloaders" template and say "Set Permissions."
    • Remove the "Users" group from this ACL.
    • Add the Freeloaders group (you can create a new local group and manage the users in it by going to Computer -> Right-click -> Manage -> Users and Groups from the start menu), and give them the Submit Job permission.

 

Step 4: Profit

You're actually done. At this point, members of your local "Freeloaders" group can only submit using the Freeloaders template. This means they can't submit jobs with a priority above Below Normal, and they can't submit jobs that run for more than 1 hour. Sucks to be them, huh? Meanwhile members of the "Paying Customers" group can pretty much do whatever they want. Users who belong to both groups can use either job template.
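The combined effect of the permissions and limits set up in this walkthrough can be modeled in a few lines of Python. Everything here is a toy: the priority names and their ordering, the dictionary layout, and the function are assumptions for illustration, not how the scheduler stores or checks templates.

```python
# Hypothetical priority ordering, lowest to highest (names assumed for illustration).
PRIORITIES = ["Lowest", "Below Normal", "Normal", "Above Normal", "Highest"]

# Toy templates mirroring the walkthrough; max_minutes=None means no run time limit.
TEMPLATES = {
    "Paying Customers": {"groups": {"Paying Customers"}, "max_priority": "Highest", "max_minutes": None},
    "Freeloaders": {"groups": {"Freeloaders"}, "max_priority": "Below Normal", "max_minutes": 60},
}

def can_submit(user_groups, template_name, priority, run_minutes):
    """Return (accepted, reason) for a submission against a toy template."""
    template = TEMPLATES[template_name]
    if not template["groups"] & set(user_groups):
        return False, "no Submit Job permission on this template"
    if PRIORITIES.index(priority) > PRIORITIES.index(template["max_priority"]):
        return False, "priority above the template maximum"
    limit = template["max_minutes"]
    if limit is not None and run_minutes > limit:
        return False, "run time above the template maximum"
    return True, "ok"

print(can_submit(["Freeloaders"], "Freeloaders", "Lowest", 30))
print(can_submit(["Freeloaders"], "Freeloaders", "Normal", 30))
print(can_submit(["Freeloaders"], "Paying Customers", "Lowest", 30))
```

A Freeloader's low-priority, 30-minute job is accepted, while the same user is turned away for asking for Normal priority or for trying to use the Paying Customers template.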

 

Step 5: Getting more advanced

There are actually many more advanced things that you can do with Job Templates, since they can be used to default and constrain any job property. For an example, let's say you wanted to change things up by saying that Freeloaders couldn't submit jobs which required Exclusive use of a node. This can easily be done!

  1. Highlight Freeloaders in the job template view and click "Edit."
  2. Click the Add button to add a constraint, and select the Exclusive constraint.
  3. Highlight the Exclusive constraint in the Job Template Details window to show the settings for this Job Property.
    1. In the Details for the Exclusive constraint, set the Default Value to "False," and set the Valid Values to include only "False."
  4. Hit Save.

Editing a Job Template

Figure 2: Editing a Job Template

Now when Freeloaders submit a job, it will always be marked as non-exclusive (because Exclusive is False by default). If they try to mark it as exclusive, job submission will fail (since True isn't in the list of valid values for the Exclusive property).
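The "default plus valid values" behavior just described can be sketched in Python as follows. The data layout and function name are hypothetical; the sketch only shows the two steps (fill in the default if the property is missing, then reject values outside the valid list).

```python
def apply_template(job, template):
    """Fill in defaults, then reject any property whose value is not allowed."""
    result = dict(job)
    for prop, rule in template.items():
        if prop not in result:
            result[prop] = rule["default"]
        if result[prop] not in rule["valid"]:
            raise ValueError(f"{prop}={result[prop]!r} is not allowed by this template")
    return result

# Toy version of the Freeloaders template after adding the Exclusive constraint.
freeloaders = {"Exclusive": {"default": False, "valid": [False]}}

print(apply_template({"Name": "my job"}, freeloaders))
try:
    apply_template({"Name": "greedy job", "Exclusive": True}, freeloaders)
except ValueError as err:
    print("rejected:", err)
```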

HPC Basic Profile Web Service

Do you want to access the HPCS 2008 Job Scheduler from other environments such as Java, Linux, etc? Or build Job Submission tools around a standard web services based interface?

Well, in the latest Community Technology Preview (CTP) released this week we have included a new feature – the HPC Basic Profile Web Service or the HPCBP for short – that can help you do just that! This is a web service, built using the Windows Communication Foundation (WCF) that provides access to some of HPCS 2008’s core job submission functionality. Through the HPCBP you are able to submit a job, discover a job’s status, discover a job’s properties, terminate a job, and find out information about the cluster that you are running on.

The primary motivation for this feature came from other groups in the HPC community who wanted a standard interface that allowed jobs to be passed between HPC resources. Over the last few years, using an open process within the Open Grid Forum (OGF), developers from industry and research, from both the open source and commercial software communities, have come to agreement on the web service interface and protocols that can provide the greatest interoperability. This set of specifications is encapsulated within the HPC Basic Profile 1.0.

More information relating to the HPCBP specification and its implementation and deployment within HPCS 2008 can be found here.

How that Node/Socket/Core thing works

This week, I’d like to take some time to explain how a new feature, Multi Level Resource Allocation, can help you get the most out of your applications.

 

The basic explanation for this feature is that when creating a job, you can choose at what granularity your job gets scheduled.  This is as simple as picking from a drop down in the UI, but as with most choices, it deserves a bit of thought!

Setting the resource unit type on a job

Figure 1: Setting the resource unit type on a job

 

The first question that pops to mind is: what exactly do Core, Node, and Socket mean?

·         Node (a.k.a. host, machine, computer) refers to an entire compute node.  Each node contains 1 or more sockets.

·         Socket (a.k.a. numa node) refers to a collection of cores with a direct pipe to memory.  Each socket contains 1 or more cores.  Note that this does not necessarily refer to a physical socket, but rather to the memory architecture of the machine, which will depend on your chip vendor.

·         Core (a.k.a. processor, cpu, cpu core, logical processor) refers to a single processing unit capable of performing computations.  A core is the smallest unit of allocation available in HPC Server 2008.

 

Next, let me explain how resources actually get allocated to your job.  To do so, I’ll refer to this handy diagram (labeled as Figure 2 if I’ve got my post to publish correctly).

Multi Level Resource Allocation at work

Figure 2: Multi Level Resource Allocation at work

In the above example, job J1 requested allocation at the Socket level.  This may mean it has a single task that requires 3 sockets, or many tasks which each require 1 socket.  The scheduler has reserved 3 sockets for it (and since it’s running on quad-core sockets, it’s implicitly been allocated 12 cores).  Assuming it is a job with many single-socket tasks, the scheduler will start a single task per socket in the job’s allocation.

Job J2, on the other hand, requested allocation at the Node level, and has been allocated a single node (and implicitly, 16 cores).  The scheduler will thus start 1 task on each node in the job’s allocation.  No other jobs or tasks can be started on that node, so it’s quite similar to using the task Exclusive property.

Job J3 has requested Core allocation, and as shown above, it has been allocated 4 cores.  The scheduler starts 1 task per core.
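The implicit core counts in this example follow from the hardware layout, as this Python sketch shows. It assumes nodes with 4 quad-core sockets (inferred from the 16 cores per node and quad-core sockets mentioned above); the function name and parameters are made up for illustration.

```python
def implicit_cores(unit, count, sockets_per_node=4, cores_per_socket=4):
    """Return how many cores a request for `count` units of `unit` implicitly covers."""
    if unit == "core":
        return count
    if unit == "socket":
        return count * cores_per_socket
    if unit == "node":
        return count * sockets_per_node * cores_per_socket
    raise ValueError(f"unknown unit type: {unit}")

print(implicit_cores("socket", 3))  # J1: 3 sockets
print(implicit_cores("node", 1))    # J2: 1 node
print(implicit_cores("core", 4))    # J3: 4 cores
```

On this assumed hardware, J1's 3 sockets cover 12 cores, J2's single node covers 16 cores, and J3's request covers exactly 4 cores.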

 

When should I use each level?

When to use each of these settings will depend on your application, and some experimentation is necessary.  In general, the rule is:

·         Use core allocation if your application is CPU bound; the more processors you can throw at it the better!

·         Use socket allocation if memory access is what bottlenecks your application’s performance.  Since how much data can come in from memory is what limits the speed of the job, running more tasks on the same memory bus won’t result in speed-up since all of those tasks are fighting over the path to memory.

·         Use node allocation if some node-wide resource is what bottlenecks your application.  This is the case with applications that rely heavily on access to disk or to network resources.  Running multiple tasks per node won’t result in a speed-up since all of those tasks are waiting for access to the same disk or network pipe.

 

Some key facts:

·         The unit type set on your job also applies to all tasks in that job (i.e. you can’t have a job requesting 4 nodes with a bunch of tasks requesting 2 cores each).

·         You can still use batch scripts or your application’s mechanisms to launch multiple threads or processes on the resources that your job is allocated.

·         By using these correctly, you can improve your cluster utilization since jobs are more likely to get only the resources they need.  See Figure 2, where job J1 and job J3 can peacefully coexist on a node.

·         This feature is explicitly designed to work with heterogeneous systems, namely those where your compute nodes have varying hardware.  So a socket allocation job will still get a dedicated pipe to memory for each task whether you are running single-core, dual-core, or quad-core processors.  A node allocation job will get a node per task, whether those nodes have 1 core or 16.
