A blog for the University of Tennessee, describing projects and involvement with Microsoft Windows CCS.
1/18/2008
The following posts are exact copies of entries to the University of Tennessee Blog on the previous Windowshpc portal site. I added the author's name at the beginning of each entry and edited the posting dates to reflect the original date of the entry.
Michael Cole, Project Manager/Site Administrator
3/29/2007
posted Thursday, March 29, 2007 7:54 PM by dongarra | 0 Comments
The Epilog tracing library (part of the KOJAK toolset [1,2]) has been ported to the Microsoft Compute Cluster platform. The library can capture traces from MPI applications written in either FORTRAN or C/C++. After conversion to VTF format, the traces can be used to visualize and analyze message passing behavior in Vampir (Intel Trace Analyzer). They can also be searched automatically for patterns of inefficient execution with KOJAK/Scalasca [3].
INSTALLATION:
1. Download the package from here:
http://icl.cs.utk.edu/projectsfiles/kojak/software/kojak/win_epilog.zip
2. The package contains the Epilog tracing library (epilog.dll) and the static import library (epilog.lib) in both 32-bit and 64-bit versions. The package also includes a utility (elg_merge.exe) to merge the process-local tracefiles into a single coherent tracefile.
3. Link your target application with the appropriate version of the Epilog library instead of the Microsoft MPI library (msmpi.lib), and adjust the include paths in your project as appropriate.
4. When executing the target application, make sure the Epilog tracing library (epilog.dll) is in the executable path, i.e., place it in a system directory or in the same folder as the target application executable. Also make sure that the program elg_merge.exe is in a standard executable directory (such as %SYSTEMROOT%) or resides in the same folder as the target application executable.
5. Execute the target application as a normal MPI job, e.g., mpiexec -n 2 myapp.exe.
6. Each MPI process writes process-local tracefiles. Upon program termination, elg_merge.exe is automatically invoked to merge the tracefiles into a single tracefile called a.elg, located in the same directory as the target application.
7. The a.elg file is a binary tracefile. Use the tools of the KOJAK [1,2] or SCALASCA [3] toolset on a *nix machine to analyze it:
- Use elg2vtf to convert the trace file to a VTF-compatible tracefile that can be visualized using Intel Trace Analyzer (Vampir) [4].
- Use Expert to automatically search for patterns of inefficient execution (such as late-sender, wait-at-barrier, etc.).
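The placement requirements in step 4 can be checked before a job is submitted. The following is a minimal sketch in Python and is not part of the Epilog package: the helper name `missing_runtime_files` is hypothetical, and it only looks in the application folder and in the directories listed on PATH, which is one reading of "executable path" in step 4.

```python
import os
from pathlib import Path

# File names taken from the installation steps above.
REQUIRED_FILES = ("epilog.dll", "elg_merge.exe")


def missing_runtime_files(app_dir, search_path=None):
    """Return the required files found neither in app_dir nor on search_path.

    search_path is an os.pathsep-separated string and defaults to the PATH
    environment variable. (Hypothetical helper, for illustration only.)
    """
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    directories = [Path(app_dir)]
    directories += [Path(d) for d in search_path.split(os.pathsep) if d]
    missing = []
    for name in REQUIRED_FILES:
        if not any((d / name).is_file() for d in directories):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # "." stands for the folder holding the target application executable.
    problems = missing_runtime_files(".")
    if problems:
        print("Missing before launch:", ", ".join(problems))
    else:
        print("Ready: run e.g. mpiexec -n 2 myapp.exe")
```

An empty result only means the two files were found; it does not confirm that the application was linked against epilog.lib in step 3.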
[1] http://icl.cs.utk.edu/kojak/
[2] http://www.fz-juelich.de/zam/kojak/
[3] http://www.fz-juelich.de/zam/scalasca/
[4] http://www.intel.com/cd/software/products/asmo-na/eng/306321.htm
2/19/2007
posted Monday, February 19, 2007 7:08 PM by dongarra | 0 Comments
The Top500 is a list of computers ranked by their performance on the HPL Benchmark. HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Linpack Benchmark.
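To make the measurement concrete, here is a minimal single-process sketch in Python of the kind of problem HPL times. It is not HPL: it runs on one processor, uses plain Gaussian elimination with partial pivoting, and the function names are illustrative. The operation count 2/3 n^3 + 3/2 n^2 is assumed here to be the convention used when reporting a Linpack rate.

```python
import random
import time


def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    a is a list of n rows of n floats, b a list of n floats. Both are
    copied, so the caller's data is left untouched.
    """
    n = len(b)
    a = [row[:] for row in a]
    b = b[:]
    for k in range(n):
        # Partial pivoting: bring the largest remaining entry of column k up.
        pivot = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[pivot] = a[pivot], a[k]
        b[k], b[pivot] = b[pivot], b[k]
        for i in range(k + 1, n):
            factor = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= factor * a[k][j]
            b[i] -= factor * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def linpack_flops(n):
    """Nominal operation count for an n-by-n solve (assumed convention)."""
    return (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2


def max_residual(a, x, b):
    """Largest absolute entry of a x - b."""
    return max(abs(sum(aij * xj for aij, xj in zip(row, x)) - bi)
               for row, bi in zip(a, b))


if __name__ == "__main__":
    n = 100
    rng = random.Random(0)
    a = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    start = time.perf_counter()
    x = solve_dense(a, b)
    elapsed = time.perf_counter() - start
    print("residual:", max_residual(a, x, b))
    print("MFlops  :", linpack_flops(n) / elapsed / 1e6)
```

Python floats are 64-bit doubles, which matches the precision HPL uses; the rate printed by this interpreted loop is of course far below what a tuned BLAS delivers.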
Current Status
Currently we have compiled HPL with both Visual Studio 2005 and Subsystem for UNIX Applications (SUA). We have run HPL successfully on all of our interconnects: GigE, Mellanox's InfiniBand, SilverStorm's InfiniBand, and Myricom's Myrinet. We have also compiled HPL against both Intel's MKL and AMD's ACML BLAS libraries to compare performance, since our cluster runs AMD Opteron processors. The ACML BLAS performs much faster. The only configuration we have been unable to run is HPL with Winsock Direct (WSD). We are currently achieving 242.4 GFlops on 96 processors.
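For context, the reported figure can be compared with a theoretical peak. The sketch below is back-of-envelope arithmetic in Python, and it rests on assumptions the post does not state: that "96 processors" means 96 cores, that each runs at the 1.8 GHz listed in the hardware configuration at the end of this page, and that each core completes 2 double-precision floating-point operations per cycle. Under those assumptions the result is roughly 70% of peak; treat it as illustrative only.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in GFlops: cores x clock (GHz) x flops per cycle."""
    return cores * clock_ghz * flops_per_cycle


def efficiency(measured_gflops, peak):
    """Fraction of the theoretical peak that was actually achieved."""
    return measured_gflops / peak


if __name__ == "__main__":
    # Assumed values: 96 cores, 1.8 GHz, 2 flops per cycle per core.
    peak = peak_gflops(cores=96, clock_ghz=1.8, flops_per_cycle=2)
    print(f"assumed peak  : {peak:.1f} GFlops")
    print(f"HPL efficiency: {efficiency(242.4, peak):.1%}")
```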
Future Work
Upcoming work on HPL will focus primarily on understanding performance differences between Windows CCS and Linux. Based on the findings we will tune the code. We're interested in creating tuning guidelines for various interconnects and the BLAS libraries in use. Also, we will attempt to get HPL running with WSD so that we have a better comparison of results against Linux.
For more information about the Top500 and HPL, visit the respective websites at http://www.top500.org/ and http://www.netlib.org/benchmark/hpl/
posted Monday, February 19, 2007 2:18 PM by dongarra | 1 Comments
Our work in this area includes LAPACK and ScaLAPACK. Development of both packages continues; below we provide the current status and future plans for our linear algebra work in the Windows CCS environment.
• LAPACK
LAPACK provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. LAPACK is used by Matlab, Mathematica, and Numeric Python (NumPy), and a tuned version is provided by the following vendors: AMD, Apple, Compaq, Cray, Fujitsu, Hewlett-Packard, Hitachi, IBM, Intel, MathWorks, NAG, NEC, PGI, SUN, Visual Numerics. Microsoft and most of the Linux distributions (SUSE, Red Hat, Fedora, Debian, Cygwin, etc.) also provide a tuned version.
Current Status
Work on the current version, LAPACK 3.1.0, was recently completed; it has been released and installed on the CCS. This includes a Windows Visual Studio implementation with the Intel Fortran Compiler that generates the Windows library and runs all of our tests.
Future Work
Ongoing efforts continue to increase performance and accuracy, add extended precision, and improve ease of use.
More information about LAPACK can be found on the website – http://icl.cs.utk.edu/lapack/
• ScaLAPACK
The ScaLAPACK library is a parallel implementation of LAPACK, scaling on parallel hardware from 10’s to 100’s to 1000’s of processors. It includes a subset of LAPACK routines redesigned for distributed memory MIMD parallel computers. It is currently written in a Single-Program-Multiple-Data style using explicit message passing for interprocessor communication. It assumes matrices are laid out in a two-dimensional block cyclic decomposition and is designed for heterogeneous computing. It is also portable on any computer that supports MPI or PVM.
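The two-dimensional block cyclic decomposition mentioned above can be sketched as an index mapping. The Python below is an illustration, not ScaLAPACK code: the function names are made up for this example, indices are 0-based, and the first block is assumed to live on process row 0 and process column 0.

```python
def owner_and_local(g, block, nprocs):
    """Map a 0-based global index to (owning process, 0-based local index)
    for a one-dimensional block-cyclic distribution.

    Blocks of `block` consecutive indices are dealt out to the processes
    in round-robin order, starting with process 0 (an assumption).
    """
    owner = (g // block) % nprocs
    local = (g // (block * nprocs)) * block + g % block
    return owner, local


def block_cyclic_2d(i, j, mb, nb, prow, pcol):
    """Locate global entry (i, j) of a matrix split into mb x nb blocks
    over a prow x pcol process grid.

    Returns ((process row, process column), (local row, local column)).
    """
    pr, li = owner_and_local(i, mb, prow)
    pc, lj = owner_and_local(j, nb, pcol)
    return (pr, pc), (li, lj)


if __name__ == "__main__":
    # 8 x 8 matrix, 2 x 2 blocks, 2 x 2 process grid: print the grid
    # coordinates of the process that owns each entry.
    for i in range(8):
        print(" ".join("%d%d" % block_cyclic_2d(i, j, 2, 2, 2, 2)[0]
                       for j in range(8)))
```

Because rows and columns are dealt out independently, every process ends up with pieces from all parts of the matrix, which helps keep the work balanced as a factorization shrinks the active submatrix.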
Current Status
Currently, we only have a Cygwin implementation of ScaLAPACK running on the cluster. For BLACS and ScaLAPACK, the Windows native Visual Studio efforts are roughly 70% complete. However, problems with the Intel C compiler are hindering their completion.
Future Work
Our future efforts will include targeting new architectures and a new parallel environment. We plan to port ScaLAPACK to the CCS and to match the functionality of the current LAPACK installation.
More information about ScaLAPACK can be found on the website – http://icl.cs.utk.edu/scalapack/
2/18/2007
posted Sunday, February 18, 2007 1:04 PM by dongarra | 1 Comments
• PAPI
The Performance API (PAPI) project specifies a standard application programming interface (API) for accessing hardware performance counters available on most modern microprocessors. PAPI provides portability across different platforms and uses the same routines with similar argument lists to control and access the counters. To be successful, however, the PAPI library needs a little help from the operating system to gain access to the information in the counters.
Current Status
Presently, we have the latest version of PAPI (v3.5) running on the cluster. Recompiling the test harness and the DLL proved to be relatively straightforward; most of the difficulty came in sorting through the assembly-level portions of the kernel driver that provides access to the counters. The AMD64 environment provides no inline assembler, yet the WinPMC kernel driver relied on inline assembly to access the hardware counters. There was also some inconsistency in the availability of compiler intrinsics for the assembly instructions needed to access the PMC registers, specifically in implementations of the cpuid and readpmc (rdpmc) instructions.
The C test programs provided with a normal PAPI distribution were built and tested as appropriate for the Windows environment. Most converted and ran cleanly in the Windows 2003 Server environment; some had features that were no longer applicable. The Fortran test and example programs were not converted, since at the time of this work, a suitable Fortran compiler replacement for the older Compaq Fortran compiler had not been identified.
Future Work
Remaining work revolves around two areas. The first involves completing the test and example programs to bring them up to par with what’s available in other PAPI distributions. The second is significantly more involved and requires some explanation.
PAPI is primarily intended as a ‘first-person’ mechanism for attributing hardware counter events to portions of program code. In order to do that, the programmer (or a higher level tool) inserts calls into the user code to start, stop and read the hardware counters at specific points. This fundamentally assumes that the counts occurring between the start call and the stop (or read) call can all be attributed to the user’s code. Such a situation can only be approximated in a multitasking system and can be wildly inaccurate in a busy system. The only way to guarantee that counts can be properly attributed is for the operating system’s context switch routine to save and restore the state of the performance monitoring registers. This is how PAPI behaves in Linux systems. On Windows, the WinPMC driver currently simply controls the state of the counters and hopes for the best. This works acceptably well on laptop or single user systems; not so well on clusters.
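The start/stop bracketing pattern, and the attribution problem it runs into, can be illustrated without any hardware counters. The Python sketch below is an analogy only and makes no PAPI calls: `time.perf_counter_ns` stands in for a free-running counter that keeps counting while other work runs, and `time.process_time_ns` stands in for a counter that the operating system accounts per process. The names `measure` and `work` are invented for this example.

```python
import time
from contextlib import contextmanager


@contextmanager
def measure(results):
    """Bracket a code region with a start read and a stop read.

    Stand-ins, not PAPI calls: perf_counter_ns plays a free-running
    counter; process_time_ns plays a counter accounted per process.
    """
    wall_start = time.perf_counter_ns()
    cpu_start = time.process_time_ns()
    try:
        yield results
    finally:
        results["cpu_ns"] = time.process_time_ns() - cpu_start
        results["wall_ns"] = time.perf_counter_ns() - wall_start


def work(n):
    """A small compute-bound region to measure."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    counts = {}
    with measure(counts):
        work(200_000)
        time.sleep(0.05)  # time spent off the CPU, as if descheduled
    print("per-process count :", counts["cpu_ns"], "ns")
    print("free-running count:", counts["wall_ns"], "ns")
```

The gap between the two numbers is the kind of misattribution described above: anything that happens between the start and stop reads is charged to the region unless the operating system keeps the accounting per process.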
We would like to work with Microsoft engineers to determine the feasibility of modifying the Compute Cluster kernel software to support functionality similar to the open source perfmon2 performance interface (http://sourceforge.net/projects/perfmon2) that is being incorporated into the Linux kernel and rapidly adopted as the standard mechanism for accessing hardware performance counters.
2/17/2007
posted Saturday, February 17, 2007 2:53 PM by dongarra | 1 Comments
• FT-MPI
FT-MPI is a full implementation of the MPI 1.2 specification that provides process-level fault tolerance at the MPI API level. FT-MPI has been developed within the framework of the HARNESS (Heterogeneous Adaptive Reconfigurable Networked SyStem) project with the goal of providing the end-user a communication library containing an MPI API, which benefits from the fault tolerance already found in the HARNESS system.
Current Status
Currently, FT-MPI has been compiled under Cygwin, Windows Subsystem for UNIX Applications (SUA), and native Windows. There is presently no way to start the daemons automatically, as the only supported method (SSH) is not natively available in the Windows environment. However, once the daemons are manually started, we have been able to spawn as many applications as necessary. Because the daemons are started manually, security is provided by the Windows user log-on. At present we support only BSD-like TCP, i.e. using read and write; as yet there is no support for any direct WinSock2 functions.
Future Work
Most of our future work will be focused on Open MPI. We plan to tighten the security for starting applications and to provide full support for the XML format used by the Windows batch scheduler, memory and processor affinity, the Windows registry, and completely dynamic MPI libraries and internal modules. Moreover, we know that performance can be improved by at least another 20%.
Additional information about FT-MPI can be found on the website – http://icl.cs.utk.edu/ftmpi/.
• Open MPI
Open MPI is an open source implementation of both the MPI-1 and MPI-2 documents and combines technologies and resources from several other projects (FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI) in order to build the best MPI library available.
Current Status
As with FT-MPI, we have compiled Open MPI under Cygwin, Windows Subsystem for UNIX Applications (SUA), and native Windows. The most used and tested way to compile has been under native Windows. We have provided solutions and project files for Visual C Express, allowing us to compile Open MPI as either a static or a dynamic library. Support for C++, Fortran 77, and Fortran 90 is automatically built. We are also able to start daemons locally, using Windows functionality (spawn and/or CreateProcess), and we can start jobs on the cluster with CCS (using submit). So far the only available communication framework is built on top of WinSock2, but work on Direct Socket is in progress. The Visual C compiler (VC) is used as a backend for mpicc, which allows us to compile user applications in a normal environment.
Integration with the parallel debugger is in progress; however, the lack of comprehensive documentation makes this task difficult. We have the same problem accessing the high performance socket interface. The sparse documentation available on MSDN or the Web does not provide enough insight for a smooth transition.
Performance comparisons with the Microsoft MPI have shown Open MPI to be about 10% faster over both shared memory and TCP. No application benchmark has yet been run to compare these two MPI implementations further.
Future Work
Once support for Direct Socket is completed, we will benchmark again; we expect a larger performance gap between these two MPI libraries. We still need to define the behavior of MPI in the event a failure occurs at the process level.
For more information about Open MPI, visit the website at http://icl.cs.utk.edu/open-mpi/.
10/20/2006
posted Friday, October 20, 2006 4:56 PM by DennisCr | 0 Comments
High Performance Compute Clustering with Windows
University of Tennessee
Innovative Computing Laboratory
Computer Science Department
Jack Dongarra
Windows Cluster Project
People
Jack Dongarra
George Bosilca
Dave Cronk
Julien Langou
Piotr Luszczek
Projects:
1. Numerical Linear Algebra Algorithms and Software
a. LAPACK, ScaLAPACK, ATLAS
b. Self Adapting Numerical Algorithms (SANS) Effort
c. Generic Code Optimization
d. LAPACK For Clusters – easy access to clusters
2. Heterogeneous Distributed Computing
a. NetSolve, FT-MPI, Open-MPI
3. Performance Evaluation
a. PAPI, HPC Challenge, Top500
4. Software Repositories
a. Netlib
LAPACK
1. Used by Matlab, Mathematica, Numeric Python,…
2. Tuned version provided by vendors: AMD, Apple, Compaq, Cray, Fujitsu, Hewlett-Packard, Hitachi, IBM, Intel, MathWorks, NAG, NEC, PGI, SUN, Visual Numerics, and by Microsoft and most Linux distributions (Fedora, Debian, Cygwin,...).
3. Ongoing work: performance, accuracy, extended precision, ease of use
ScaLAPACK
1. Parallel implementation of LAPACK scaling on parallel hardware from 10’s to 100’s to 1000’s of processors
2. Ongoing work: match the functionality of the current LAPACK
3. Ongoing work: target new architectures and new parallel environments, for example a port to the Microsoft HPC cluster solution
LAPACK for Clusters (LFC)
1. Most of ScaLAPACK functionality from serial clients (Matlab, Python, Mathematica)
FT-MPI and Open-MPI
1. Define the behavior of MPI in the event a failure occurs at the process level.
2. FT-MPI based on MPI 1.3 (plus some MPI 2 features) with a fault tolerant model similar to what was done in PVM.
3. Complete reimplementation, not based on other implementations.
4. Gives the application the possibility to recover from a process failure.
5. A regular, non-fault-tolerant MPI program will run using FT-MPI.
6. What FT-MPI does not do:
a. Recover user data (e.g. automatic check-pointing)
b. Provide transparent fault tolerance
Performance Application Programming Interface (PAPI)
1. A portable library to access hardware counters found on processors
2. Provides a standardized list of performance metrics
KOJAK (Joint with Felix Wolf)
1. Software package for the automatic performance analysis of parallel apps
2. Message passing and multi-threading (MPI and/or OpenMP)
3. Parallel performance
4. CPU and memory performance
Posters for Related Projects
· FT-MPI
· HPCC
· Kojak
· LAPACK / ScaLAPACK
· NetSolve / ActiveSheets
· NetSolve / .NET
· Open MPI
· PAPI
· top500
Hardware Configuration
Team HPC
Dual Core 4GB AMD Opterons
Team HPC Turnkey Beowulf-Class Supercomputer
26 4GB AMD Opteron DC Compute Nodes, 1 Head Node

CPU Manufacturer | AMD
CPU Model | Opteron 265
CPU Speed | 1.8 GHz
Number of nodes | 26
Number of cores | 2
Interconnect(s) | InfiniBand, Myrinet, GigE