I’ve been working on deploying LINPACK on our Windows HPC Server 2008 cluster, which involved compiling the source code, setting up the environment on the machines, and adjusting the LINPACK input parameters, so I would like to share some of that experience with you.
To run LINPACK on the Windows platform, follow these steps:
1. Find the right version of the source code and compile it.
There are several versions of LINPACK. The High Performance Computing LINPACK Benchmark is called HPL, and its current version is HPL 1.0a; you can find the source archive “hpl.tgz” at http://www.netlib.org/benchmark/hpl/.
If your machines are Intel-based, you can also get a prebuilt binary from Intel MKL, but take care to pick the one that suits your machines.
2. Set up the running environment.
To run LINPACK, we need MPI and BLAS (Basic Linear Algebra Subprograms) libraries on our machines. First install HPC Pack, which gives us MS MPI. Then choose a BLAS library: GOTO, ATLAS, ACML, ESSL, or MKL. Some libraries are machine-specific, so find a suitable one with the help of http://www.netlib.org/blas/faq.html. Here I take Intel MKL as my first choice; you can find it at http://www.intel.com/cd/software/products/asmo-na/eng/266857.htm. Install MKL on every node where you want to run LINPACK.
3. Install CCP SDK
You can download the CCP SDK from http://www.microsoft.com/downloads/details.aspx?FamilyID=D8462378-2F68-409D-9CB3-02312BC23BFD&displaylang=en. If you have Windows HPC Server 2008 installed, the CCP SDK is already included.
4. Configure paths
To run LINPACK on multiple nodes, we should set up shared folders for the input, the output, and the program executable. In my setup, for example, we create a new shared folder named “Scratch” on the head node and make three directories in it: input, output, and bin. LINPACK reads its input parameters from a file named “HPL.DAT”, which goes in the “input” directory. The output file containing the results is written to “output”, and the LINPACK executable goes in “bin”.
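The folder layout described above can be created by hand, but as a small illustration, here is a Python sketch that builds the three directories under a given root. The function name `create_linpack_layout` is my own invention, and the demo uses a temporary directory as a stand-in for the real share.

```python
import tempfile
from pathlib import Path


def create_linpack_layout(share_root):
    """Create the input, output and bin directories under the shared folder."""
    root = Path(share_root)
    created = []
    for name in ("input", "output", "bin"):
        directory = root / name
        # exist_ok makes the call safe to repeat on an existing share.
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    return created


if __name__ == "__main__":
    # A temporary directory stands in for the real share on the head node.
    with tempfile.TemporaryDirectory() as demo_root:
        for directory in create_linpack_layout(demo_root):
            print(directory.name)
```

On a real cluster you would pass the path of the “Scratch” share instead of the temporary directory.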
5. Estimate Results
To tune the input parameters well, we want to know how efficiently the current configuration performs. The theoretical maximum is calculated as Clock Speed (GHz) * Flops per Cycle for each core, multiplied by the number of cores used. “Flops per Cycle” is the number of floating-point operations per clock: for Opteron and Xeon the value is 2, and for dual-core and quad-core Xeon the value is 4. Your efficiency is then current result / max value.
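To make the arithmetic concrete, here is a minimal Python sketch of the efficiency calculation. The function names are mine, the `cores` multiplier assumes the per-core peak is simply summed over every core in the run, and the numbers in the demo are hypothetical, not measurements from our cluster.

```python
def theoretical_peak_gflops(clock_ghz, flops_per_cycle, cores=1):
    """Peak in GFLOPS: clock speed (GHz) * flops per cycle * number of cores."""
    return clock_ghz * flops_per_cycle * cores


def efficiency(measured_gflops, peak_gflops):
    """Fraction of the theoretical peak that the benchmark actually reached."""
    return measured_gflops / peak_gflops


if __name__ == "__main__":
    # Hypothetical example: one node with four 2.0 GHz cores at 4 flops per cycle.
    peak = theoretical_peak_gflops(clock_ghz=2.0, flops_per_cycle=4, cores=4)
    measured = 24.0  # hypothetical HPL result in GFLOPS
    print(f"peak = {peak:.1f} GFLOPS")
    print(f"efficiency = {efficiency(measured, peak):.0%}")
```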
6. Submit jobs
· Input parameters: Modify the HPL.DAT file to suit the target configuration. To start with, decide the four major parameters N, NB, P, and Q, and leave the others at their default values. A standard input file looks like the following:

· Submit job:
Use “job submit /numberprocessors:P*Q /workdir:\\%CCP_CLUSTER_NAME%\Scratch\Linpack /stdout:hpl.log /stderr:hpl.err mpiexec -wdir \\%CCP_CLUSTER_NAME%\Scratch\Linpack\bin xhp.exe” to submit the job, replacing P*Q with the product of your P and Q values. You can then find the job under “job management” in the “admin console”:

· View the benchmark results: After the job finishes, you will find results like the following:

7. Issues on input parameters
You may have heard that LINPACK has 29 input parameters, so choosing them is hard work, and it is always the most important part of running LINPACK. But we can start with four parameters: N, NB, P, and Q. N is the problem size. It should be large enough to reach maximum performance, but not so large that it causes paging, which reduces performance. The usual recommendation is that the matrix use 80% of total memory. In my experience, we can run some tests on the machine and monitor the available physical memory in the heat map:

If too much physical memory remains available, we can increase N, and vice versa. However, the best value can only be found after several actual runs.
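As a starting point before those test runs, the 80% guideline can be turned into a first guess for N. The sketch below assumes the matrix holds N * N double-precision values of 8 bytes each and that the memory figure is the total across all nodes in the run; the function name and the 32 GiB example are hypothetical.

```python
import math

BYTES_PER_DOUBLE = 8  # HPL works on double-precision matrix entries


def estimate_problem_size(total_memory_bytes, memory_fraction=0.8):
    """Largest N whose N x N matrix of doubles fits in the given memory fraction."""
    usable_bytes = total_memory_bytes * memory_fraction
    max_elements = int(usable_bytes // BYTES_PER_DOUBLE)
    return math.isqrt(max_elements)


if __name__ == "__main__":
    # Hypothetical cluster: 4 nodes with 8 GiB each.
    total = 4 * 8 * 2**30
    print("first guess for N:", estimate_problem_size(total))
```

This only gives a first guess; the operating system and other processes also need memory, so the heat map and real runs still decide the final value.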
The value of NB should also be determined from real tests; one guideline is N mod NB = 0. Some published results suggest that NB should be 192 for Intel Xeon processors, but in my tests on our TYAN cluster with dual-core Xeons, 224 was a better choice. So I suggest fixing N and increasing NB by 16 each time until the Gflops figure peaks.
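The NB sweep is easy to plan ahead. This Python sketch lists NB candidates in steps of 16 and, for each one, rounds a target N down to the nearest multiple so that N mod NB = 0 holds. The helper names, the 128 to 256 range, and the target N of 40000 are placeholders I chose for illustration.

```python
def round_down_to_multiple(n, nb):
    """Largest value <= n that is an exact multiple of nb."""
    return (n // nb) * nb


def nb_candidates(start, stop, step=16):
    """Block sizes from start to stop (inclusive), increasing by step."""
    return list(range(start, stop + 1, step))


if __name__ == "__main__":
    target_n = 40000  # hypothetical problem size
    for nb in nb_candidates(128, 256):
        n = round_down_to_multiple(target_n, nb)
        print(f"NB = {nb:3d}  N = {n}")
```

Each printed pair is one candidate run; the NB with the highest measured Gflops wins.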
As for P and Q, I really don’t know how to decide. The only thing I am sure of is that P * Q must equal the number of cores. I’ve found a lot of material written by different people: some say the values of P and Q must be close to each other, and others say P should be as small as possible. I talked with Xavier, and he suggested starting with a small P, because that gives him the best performance. Oddly enough, though, when I tested on a single four-core node, P = 4 with Q = 1 gave the best result and P = 1 with Q = 4 performed much worse. The results are below:

But the situation changes a lot with 3 nodes and 12 cores: P = 12 with Q = 1 performs much worse than P = 1 with Q = 12. The results are below:

Maybe the only way to find the best combination is through your own exploration.
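Since exploration is the only reliable method, it helps to list every grid worth trying. This short Python sketch enumerates all (P, Q) pairs with P * Q equal to the number of cores; the function name is my own.

```python
def process_grids(cores):
    """Return every (P, Q) pair with P * Q == cores, ordered by increasing P."""
    return [(p, cores // p) for p in range(1, cores + 1) if cores % p == 0]


if __name__ == "__main__":
    for cores in (4, 12):
        print(cores, "cores:", process_grids(cores))
```

For 12 cores this gives six grids, from 1 x 12 to 12 x 1, so a full sweep at a fixed N and NB is still cheap.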
So that is my experience from these past weeks. Though I have not yet achieved a satisfying efficiency, I am sure the performance can be improved in many ways. I also very much appreciate George’s guidance and Xavier’s valuable suggestions.
Lewis Liu 刘贤斐
PM Intern, Microsoft STB China HPC