University of Tennessee

A blog for the University of Tennessee, describing projects and involvement with Microsoft Windows CCS.
Note to the following posts
The following posts are exact copies of entries from the University of Tennessee Blog on the previous Windowshpc portal site. I added the author's name at the beginning of each entry and edited the posting dates to reflect the original date of the entry.
 
Michael Cole
Project Manager/Site Administrator
Epilog Tracing library ported to CCS

posted Thursday, March 29, 2007 7:54 PM by dongarra | 0 Comments

The Epilog tracing library (part of the KOJAK toolset [1,2]) has been ported to the Microsoft Compute Cluster platform. The library can capture traces from MPI applications written in either FORTRAN or C/C++. The traces can be used to visualize and analyze the message passing behavior in Vampir (Intel Trace Analyzer) after converting traces to VTF format and to automatically search for patterns of inefficient execution with KOJAK/Scalasca [3].

INSTALLATION:

1. Download the package from here:

http://icl.cs.utk.edu/projectsfiles/kojak/software/kojak/win_epilog.zip

2. The package contains the Epilog tracing library (epilog.dll) and the static import library (epilog.lib) in both 32-bit and 64-bit versions. The package also includes a utility (elg_merge.exe) that merges the process-local tracefiles into a single coherent tracefile.

3. Link your target application with the appropriate version of the Epilog library instead of the Microsoft MPI library (msmpi.lib), and adjust the include paths in your project as appropriate.

4. When executing the target application, make sure the Epilog tracing library (epilog.dll) is in the executable path, i.e., place it in a system directory or in the same folder as the target application executable. Also make sure that the program elg_merge.exe is in a standard executable directory (such as %SYSTEMROOT%) or resides in the same folder as the target application executable.

5. Execute the target application as a normal MPI job, e.g., mpiexec -n 2 myapp.exe.

6. Each MPI process writes a process-local tracefile. Upon program termination, elg_merge.exe is automatically invoked to merge the tracefiles into a single tracefile called a.elg, located in the same directory as the target application.

7. The a.elg file is a binary tracefile. Use the tools of the KOJAK [1,2] or SCALASCA [3] toolset on a *nix machine to analyze it:

   - Use elg2vtf to convert the trace file to a VTF-compatible tracefile that can be visualized using Intel Trace Analyzer (Vampir) [4].
   - Use Expert to automatically search for patterns of inefficient execution (such as late-sender, wait-at-barrier, etc.).
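
Steps 4 and 5 can be checked before a job is submitted. The following is a minimal Python sketch, not part of the Epilog package: the helper names (find_missing, mpiexec_command, demo) are hypothetical, and the search is simplified to the application folder plus the directories on the search path, so Windows system directories are not treated specially.

```python
import os
import tempfile

# File names taken from the installation steps above.
REQUIRED_FILES = ("epilog.dll", "elg_merge.exe")


def find_missing(app_dir, path_value, required=REQUIRED_FILES):
    """Return the required files found neither in app_dir nor on the search path.

    Simplification (an assumption of this sketch): only the application
    folder and the listed path directories are searched.
    """
    search_dirs = [app_dir] + [d for d in path_value.split(os.pathsep) if d]
    missing = []
    for name in required:
        if not any(os.path.isfile(os.path.join(d, name)) for d in search_dirs):
            missing.append(name)
    return missing


def mpiexec_command(executable, processes=2):
    """Build the launch command from step 5 as an argument list."""
    return ["mpiexec", "-n", str(processes), executable]


def demo():
    """Exercise the check against a throwaway directory layout."""
    with tempfile.TemporaryDirectory() as app_dir, \
            tempfile.TemporaryDirectory() as bin_dir:
        # Library next to the application, merge tool not yet installed.
        open(os.path.join(app_dir, "epilog.dll"), "w").close()
        before = find_missing(app_dir, bin_dir)
        # Now place the merge tool in a directory on the search path.
        open(os.path.join(bin_dir, "elg_merge.exe"), "w").close()
        after = find_missing(app_dir, bin_dir)
    return before, after


if __name__ == "__main__":
    print(demo())
    print(" ".join(mpiexec_command("myapp.exe")))
```

On a real node one would call find_missing with the application folder and os.environ["PATH"], and launch the job only if the returned list is empty.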

[1] http://icl.cs.utk.edu/kojak/

[2] http://www.fz-juelich.de/zam/kojak/

[3] http://www.fz-juelich.de/zam/scalasca/

[4] http://www.intel.com/cd/software/products/asmo-na/eng/306321.htm

 

HPL on CCS at UT

posted Monday, February 19, 2007 7:08 PM by dongarra | 0 Comments

  • High Performance Linpack (HPL)

  The Top500 is a list of computers ranked by their performance on the HPL Benchmark.
  HPL is a software package that solves a (random) dense linear system in double
  precision (64 bits) arithmetic on distributed-memory computers. It can thus be
  regarded as a portable as well as freely available implementation of the
  High Performance Linpack Benchmark.
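
To make the description concrete, here is a small serial Python sketch of the kind of problem HPL solves: a random dense system in double precision, solved by Gaussian elimination with partial pivoting and checked through its residual. It is only an illustration; HPL itself is a distributed, blocked implementation. The function names are hypothetical, and the operation count 2/3 n^3 + 3/2 n^2 is the figure we understand HPL to credit when it reports GFlops.

```python
import random


def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Work on an augmented copy so the caller's data stays untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for k in range(n):
        # Partial pivoting: bring the largest entry of column k to the diagonal.
        p = max(range(k, n), key=lambda r: abs(m[r][k]))
        if m[p][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[p] = m[p], m[k]
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= factor * m[k][j]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def max_residual(a, x, b):
    """Largest absolute entry of a x - b."""
    return max(abs(sum(aij * xj for aij, xj in zip(row, x)) - bi)
               for row, bi in zip(a, b))


def hpl_flop_count(n):
    """Operation count credited for an n x n solve (see the assumption above)."""
    return (2.0 / 3.0) * n ** 3 + (3.0 / 2.0) * n ** 2


def demo(n=50, seed=0):
    """Solve one random n x n system and return its residual."""
    rng = random.Random(seed)
    a = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    x = solve_dense(a, b)
    return max_residual(a, x, b)


if __name__ == "__main__":
    print("residual:", demo())
    print("flops credited for n = 50:", hpl_flop_count(50))
```

Dividing the credited operation count by the wall-clock time of the solve gives the flop rate that a benchmark of this kind reports.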

Current Status

 Currently we have compiled HPL with both Visual Studio 2005 and Subsystem for
 UNIX Applications (SUA).  We have run HPL successfully on all of our interfaces
 which include: GigE, Mellanox's Infiniband, Silverstorm's Infiniband, and Myricom's Myrinet. 
 We have also compiled HPL against both Intel's MKL and AMD's ACML BLAS libraries to
 compare performance results since our cluster runs AMD Opteron processors.
 On these processors the ACML BLAS performs much faster. The only configuration
 we have not been able to run is HPL with Winsock Direct (WSD).
 HPL currently achieves 242.4 GFlops on 96 processors.
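
The 242.4 GFlops figure is easier to judge against the theoretical peak of the machine. The sketch below is a back-of-the-envelope calculation, not a measurement: the processor count and the 1.8 GHz clock come from the posts on this blog, while the rate of 2 double-precision floating-point operations per cycle per core is our assumption for this processor generation and is not stated in the posts.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in GFlops: cores x clock (GHz) x flops per cycle."""
    return cores * clock_ghz * flops_per_cycle


def hpl_efficiency(measured_gflops, peak):
    """Fraction of the theoretical peak that the benchmark reached."""
    return measured_gflops / peak


# From the posts: 96 processors, Opteron 265 at 1.8 GHz, 242.4 GFlops measured.
CORES = 96
CLOCK_GHZ = 1.8
MEASURED_GFLOPS = 242.4
# ASSUMPTION, not stated in the posts: 2 double-precision flops per cycle per core.
FLOPS_PER_CYCLE = 2

PEAK = peak_gflops(CORES, CLOCK_GHZ, FLOPS_PER_CYCLE)
EFFICIENCY = hpl_efficiency(MEASURED_GFLOPS, PEAK)

if __name__ == "__main__":
    print(f"peak = {PEAK:.1f} GFlops, efficiency = {EFFICIENCY:.1%}")
```

Under that assumption the peak is 345.6 GFlops and the run reaches roughly 70% of it; a different flops-per-cycle figure changes the ratio proportionally.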

Future Work

 Upcoming work on HPL will focus primarily on understanding performance differences
 between Windows CCS and Linux. Based on the findings we will tune the code.
 We're interested in creating tuning guidelines for various interconnects and the
 BLAS libraries in use. Also, we will attempt to get HPL running with WSD so that
 we have a better comparison of results against Linux.

For more information about the Top500 and HPL,
visit the respective websites at
http://www.top500.org/
and
http://www.netlib.org/benchmark/hpl/

Numerical Algebra development with LAPACK and ScaLAPACK

posted Monday, February 19, 2007 2:18 PM by dongarra | 1 Comments

Our work in this area includes LAPACK and ScaLAPACK. Development of both
packages continues; below we give the current status and future plans for
our linear algebra work in the Windows CCS environment.

• LAPACK

LAPACK provides routines for solving systems of simultaneous linear equations,
least-squares solutions of linear systems of equations, eigenvalue problems,
and singular value problems. LAPACK is used by Matlab, Mathematica, Numeric
Python (NumPy), and a tuned version is provided by the following vendors: AMD,
Apple, Compaq, Cray, Fujitsu, Hewlett-Packard, Hitachi, IBM, Intel, MathWorks,
NAG, NEC, PGI, SUN, Visual Numerics. Microsoft and most of the Linux distributions
(SUSE, Red Hat, Fedora, Debian, Cygwin, etc.) also provide a tuned version.

Current Status

Work on the current version of LAPACK, 3.1.0, was recently completed; it
has been released and installed on the CCS cluster. This includes a Windows
Visual Studio implementation with the Intel Fortran Compiler that generates
the Windows library and runs all of our tests.

Future Work

Ongoing efforts continue to increase performance and accuracy while working
toward extended precision and improved ease of use.

More information about LAPACK can be found on the website –
http://icl.cs.utk.edu/lapack/

• ScaLAPACK

The ScaLAPACK library is a parallel implementation of LAPACK, scaling on
parallel hardware from 10’s to 100’s to 1000’s of processors. It includes
a subset of LAPACK routines redesigned for distributed memory MIMD parallel
computers. It is currently written in a Single-Program-Multiple-Data style
using explicit message passing for interprocessor communication. It assumes
matrices are laid out in a two-dimensional block cyclic decomposition and
is designed for heterogeneous computing. It is also portable to any
computer that supports MPI or PVM.
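
The two-dimensional block cyclic layout can be illustrated with a few lines of Python. This is a sketch of the index arithmetic only, not ScaLAPACK code: the function names are hypothetical, indices are zero-based, and the first block is assumed to live on process row 0 and process column 0.

```python
def owner_and_local_index(global_index, block_size, num_procs):
    """Map a global index to (owning process, index within that process).

    One-dimensional block cyclic rule: blocks of block_size consecutive
    indices are dealt out to the processes in round-robin order.
    """
    block = global_index // block_size      # which block the index falls in
    owner = block % num_procs               # blocks are dealt out cyclically
    local_block = block // num_procs        # blocks the owner already holds
    local_index = local_block * block_size + global_index % block_size
    return owner, local_index


def owner_grid(rows, cols, mb, nb, prows, pcols):
    """For every matrix entry, the (process row, process column) that owns it.

    The 2D layout applies the 1D rule independently to rows (block size mb,
    prows process rows) and to columns (block size nb, pcols process columns).
    """
    return [[(owner_and_local_index(i, mb, prows)[0],
              owner_and_local_index(j, nb, pcols)[0])
             for j in range(cols)]
            for i in range(rows)]


if __name__ == "__main__":
    # 8 x 8 matrix, 2 x 2 blocks, 2 x 2 process grid.
    for row in owner_grid(8, 8, 2, 2, 2, 2):
        print(" ".join(f"{p}{q}" for p, q in row))
```

Because blocks are dealt out cyclically in both dimensions, every process holds pieces from all regions of the matrix, which keeps the load balanced as a factorization works its way down the diagonal.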

Current Status
 Currently, we only have a Cygwin implementation of ScaLAPACK running on the cluster.
 For BLACS and ScaLAPACK, the Windows native Visual Studio efforts are roughly 70% complete.
 However, problems with the Intel C compiler are hindering its completion.

Future Work

 Our future efforts will include targeting new architectures and a new parallel environment.
 We plan to port ScaLAPACK to the CCS and to match the functionality of the current LAPACK installation.

More information about ScaLAPACK can be found on the website –
http://icl.cs.utk.edu/scalapack/

Performance Evaluation and Analysis with PAPI

posted Sunday, February 18, 2007 1:04 PM by dongarra | 1 Comments

• PAPI

 The Performance API (PAPI) project specifies a standard
 application programming interface (API) for accessing hardware
 performance counters available on most modern microprocessors.
 PAPI provides portability across different platforms and uses
 the same routines with similar argument lists to control and
 access the counters. To be successful, however, the PAPI library
 needs a little help from the operating system to gain access
 to the information in the counters.

Current Status

 Presently, we have the latest version of PAPI (v3.5) running on
 the Cluster. Recompiling the test harness and the dll proved to
 be relatively straightforward; the majority of the difficulty
 came in sorting through the assembly level portions of the kernel
 driver that provides access to the counters. The AMD64 environment
 provides no inline assembler, but the WinPMC kernel driver relied on
 inline assembly to access the hardware counters. In addition, the
 compiler intrinsics that stand in for the assembly instructions needed
 to access the PMC registers were not consistently available. The
 problem centered on the implementations of the cpuid and rdpmc
 (read performance-monitoring counter) instructions.

 The C test programs provided with a normal PAPI distribution were
 built and tested as appropriate for the Windows environment.
 Most converted and ran cleanly in the Windows Server 2003 environment;
 some had features that were no longer applicable. The Fortran test and
 example programs were not converted, since at the time of this work,
 a suitable Fortran compiler replacement for the older Compaq Fortran
 compiler had not been identified.


Future Work

 Remaining work revolves around two areas. The first involves completing the
 test and example programs to bring them up to par with what’s available in
 other PAPI distributions. The second is significantly more involved and
 requires some explanation.

 PAPI is primarily intended as a ‘first-person’ mechanism for attributing
 hardware counter events to portions of program code. In order to do that,
 the programmer (or a higher level tool) inserts calls into the user code to start,
 stop and read the hardware counters at specific points. This fundamentally assumes
 that the counts occurring between the start call and the stop (or read) call can
 all be attributed to the user’s code. Such a situation can only be approximated
 in a multitasking system and can be wildly inaccurate in a busy system. The only
 way to guarantee that counts can be properly attributed is for the operating system’s
 context switch routine to save and restore the state of the performance monitoring
 registers. This is how PAPI behaves in Linux systems. On Windows, the WinPMC driver
 currently simply controls the state of the counters and hopes for the best. This
 works acceptably well on laptop or single user systems; not so well on clusters.
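
 The start, stop and read pattern, and the attribution problem it runs into, can be mimicked in Python. The sketch below is an analogy only and does not use the PAPI API: RegionCounter and measure are hypothetical names, a system-wide clock stands in for a hardware counter that keeps counting while our process is switched out, and the per-process CPU clock stands in for a counter whose state is saved and restored on a context switch.

```python
import time


class RegionCounter:
    """Start/stop/read wrapper around any monotonically increasing counter."""

    def __init__(self, read_raw):
        self._read_raw = read_raw      # callable returning the raw counter value
        self._started_at = None
        self._total = 0

    def start(self):
        if self._started_at is not None:
            raise RuntimeError("counter already running")
        self._started_at = self._read_raw()

    def read(self):
        """Counts accumulated so far, including the region in progress."""
        running = 0
        if self._started_at is not None:
            running = self._read_raw() - self._started_at
        return self._total + running

    def stop(self):
        if self._started_at is None:
            raise RuntimeError("counter not running")
        self._total += self._read_raw() - self._started_at
        self._started_at = None
        return self._total


def measure(work, read_raw):
    """Attribute everything counted between start and stop to work()."""
    counter = RegionCounter(read_raw)
    counter.start()
    work()
    return counter.stop()


def demo():
    def work():
        sum(i * i for i in range(20000))   # the code we want to measure
        time.sleep(0.05)                   # stand-in for time spent switched out

    system_wide = measure(work, time.perf_counter_ns)   # keeps counting while idle
    per_process = measure(work, time.process_time_ns)   # counts only our CPU time
    return system_wide, per_process


if __name__ == "__main__":
    print(demo())
```

 The system-wide figure includes the interval in which the process was not running, which is the kind of misattribution described above for counters that are not saved and restored per process.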

 We would like to work with Microsoft engineers to determine the feasibility of
 modifying the Compute Cluster kernel software to support functionality similar
 to the open source perfmon2 performance interface
 (http://sourceforge.net/projects/perfmon2) that is being incorporated into the
 Linux kernel and rapidly adopted as the standard mechanism for accessing
 hardware performance counters.

UT MPI development with CCS

posted Saturday, February 17, 2007 2:53 PM by dongarra | 1 Comments

• FT-MPI

FT-MPI is a full implementation of the MPI 1.2 specification that provides
process-level fault tolerance at the MPI API level. FT-MPI has been developed
within the HARNESS (Heterogeneous Adaptive Reconfigurable Networked SyStem)
project with the goal of providing the end user with a communication library
containing an MPI API that benefits from the fault tolerance already
found in the HARNESS system.

Current Status

 Currently, FT-MPI has been compiled under Cygwin, Windows Subsystem
 for UNIX Applications (SUA) and native Windows. There is presently no
 possibility to start the daemons automatically, as the only supported
 method (SSH) is not natively available in the Windows environment.
 However, once the daemons are manually started, we have been able to
 spawn as many applications as necessary. Also, as the daemons are
 started manually, security is provided by the Windows user log-on.
 At present we support only BSD-like TCP, i.e. using read
 and write; there is no support yet for any direct WinSock2 functions.

Future Work

 Most of our future work will be focused on Open MPI. We plan to
 tighten the security for starting the applications and to provide
 full support for the XML format used by the Windows batch
 scheduler, memory and processor affinity, the Windows registry,
 and completely dynamic MPI libraries and internal modules.
 Moreover, we know that performance can be improved by at
 least another 20%.

Additional information about FT-MPI can be found on the website –
http://icl.cs.utk.edu/ftmpi/.

• Open MPI

 Open MPI is an open source implementation of both the MPI-1 and MPI-2
 documents and combines technologies and resources from several other
 projects (FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI) in order to build
 the best MPI library available.

Current Status

 As with FT-MPI, we have compiled Open MPI under Cygwin,
 Windows Subsystem for UNIX Applications (SUA) and native Windows.
 The native Windows build is the most used and tested.
 We have provided solutions and project files for Visual C Express,
 allowing us to compile Open MPI as either a static or a dynamic library.
 Support for C++, Fortran 77 and Fortran 90 is built automatically.
 We are also able to start daemons locally, using Windows functionality
 (spawn and/or CreateProcess), and we can start jobs on the cluster with CCS
 (using submit). So far the only available communication framework
 is built on top of WinSock2, but work on Direct Socket is in progress. The
 Visual C compiler (VC) is used as a backend for mpicc, which allows us to
 compile user applications in a normal environment.

 Integration with the parallel debugger is in progress; however, the lack of
 comprehensive documentation makes this task difficult. We have the same problem
 in accessing the high performance socket interface. The sparse documentation
 available on MSDN or the Web does not provide enough insight for a smooth transition.

 Performance comparisons with Microsoft MPI have shown Open MPI to be
 about 10% faster over both shared memory and TCP. No application
 benchmark has yet been run to compare these two MPI implementations further.

Future Work

 Once the support for Direct Socket is completed, we will benchmark again;
 we expect a larger performance gap between the two MPI libraries.
 We still need to define the behavior of MPI in the event a failure occurs
 at the process level.

For more information about Open MPI, visit the website at -
http://icl.cs.utk.edu/open-mpi/.

UT HPC Windows Cluster Update

posted Friday, February 16, 2007 7:55 PM by dongarra | 1 Comments

 Since April of 2006, the cluster has been fully operational on-site at UT in the Innovative Computing Laboratory. Running genuine Windows software, this cluster continues to be used for significant testing and development of multiple HPC tools described in the following sections. Specifications for the cluster in its current state are as follows:

 

Operating System:

• Windows Server 2003 R2 x64 Edition
• Microsoft Compute Cluster Edition 2003

 

Hardware:

• 24 Custom Built Nodes from TeamHPC
• Dual socket AMD Opteron 265 (Dual Core) 1.8GHz Processors (total of 96 processors)
• 4GB RAM / node
• 80GB SATA Hard Drive / node
• Nforce Gigabit NIC
• Silverstorm 10Gb/s Infiniband & NICs
• Mellanox 20Gb/s DDR Infiniband & NICs installed (drivers currently don’t support dual cards) as well as a Myricom 10G 16 port switch & NICs

 

Software (vendor):

• Visual Studio 2005
• Intel C++ Compiler 9.1
• Intel Fortran Compiler 9.1
• Intel MKL 8.1
• ACML 3.6.0

Software (development, native Windows):

• LAPACK (http://icl.cs.utk.edu/lapack)
• FT-MPI (http://icl.cs.utk.edu/ftmpi)
• Open MPI (http://icl.cs.utk.edu/openmpi)
• PAPI (http://icl.cs.utk.edu/papi)

 

Software (development under Cygwin):

• ScaLAPACK (http://icl.cs.utk.edu/scalapack)
• NetSolve/GridSolve (http://icl.cs.utk.edu/gridsolve)
• HPCC (http://icl.cs.utk.edu/hpcc)

 

Software (future development):

• KOJAK (http://icl.cs.utk.edu/kojak)
• LFC (http://icl.cs.utk.edu/lfc)
• HPL (http://icl.cs.utk.edu/hpl)
• SANS (http://icl.cs.utk.edu/sans)
• ATLAS (http://icl.cs.utk.edu/atlas)

HPC Institute, University of Tennessee - Project and Hardware

posted Friday, October 20, 2006 4:56 PM by DennisCr | 0 Comments

High Performance Compute Clustering with Windows

University of Tennessee

Innovative Computing Laboratory

Computer Science Department

Jack Dongarra

Windows Cluster Project

 

People

Jack Dongarra

George Bosilca

Dave Cronk

Julien Langou

Piotr Luszczek

 

Projects:

1. Numerical Linear Algebra Algorithms and Software
   a. LAPACK, ScaLAPACK, ATLAS
   b. Self Adapting Numerical Algorithms (SANS) Effort
   c. Generic Code Optimization
   d. LAPACK For Clusters: easy access to clusters
2. Heterogeneous Distributed Computing
   a. NetSolve, FT-MPI, Open-MPI
3. Performance Evaluation
   a. PAPI, HPC Challenge, Top500
4. Software Repositories
   a. Netlib

 

LAPACK

1. Used by Matlab, Mathematica, Numeric Python, ...
2. Tuned versions provided by vendors: AMD, Apple, Compaq, Cray, Fujitsu, Hewlett-Packard, Hitachi, IBM, Intel, MathWorks, NAG, NEC, PGI, SUN, Visual Numerics, as well as by Microsoft and most Linux distributions (Fedora, Debian, Cygwin, ...).
3. Ongoing work: performance, accuracy, extended precision, ease of use

 

ScaLAPACK

1. Parallel implementation of LAPACK scaling on parallel hardware from 10’s to 100’s to 1000’s of processors
2. Ongoing work: match the functionality of the current LAPACK
3. Ongoing work: target new architectures and new parallel environments, for example a port to the Microsoft HPC cluster solution

 

LAPACK for Clusters (LFC)

1. Makes most ScaLAPACK functionality available from serial clients (Matlab, Python, Mathematica)

 

FT-MPI and Open-MPI

1. Define the behavior of MPI in the event a failure occurs at the process level.
2. FT-MPI is based on MPI 1.3 (plus some MPI 2 features) with a fault-tolerant model similar to what was done in PVM.
3. Complete reimplementation, not based on other implementations.
   a. Gives the application the possibility to recover from a process failure.
   b. A regular, non-fault-tolerant MPI program will run using FT-MPI.
   c. What FT-MPI does not do:
      - Recover user data (e.g. automatic check-pointing)
      - Provide transparent fault tolerance

 

Performance Application Programming Interface (PAPI)

1. A portable library to access hardware counters found on processors
2. Provides a standardized list of performance metrics

 

KOJAK (Joint with Felix Wolf)

1. Software package for the automatic performance analysis of parallel apps
2. Message passing and multi-threading (MPI and/or OpenMP)
3. Parallel performance
4. CPU and memory performance


Posters for Related Projects

• FT-MPI
• HPCC
• KOJAK
• LAPACK / ScaLAPACK
• NetSolve / ActiveSheets
• NetSolve / .NET
• Open MPI
• PAPI
• Top500

 

Hardware Configuration

Team HPC
Dual Core 4GB AMD Opterons
Team HPC Turnkey Beowulf-Class Supercomputer
26 4GB AMD Opteron DC Compute Nodes, 1 Head Node

  CPU Manufacturer    AMD
  CPU Model           Opteron 265
  CPU Speed           1.8 GHz
  Number of nodes     26
  Number of cores     2 (per processor)
  Interconnect(s)     Infiniband, Myrinet, GigE

  Item Description                                          QTY
  --------------------------------------------------------  ---
  26 Compute Nodes
  Supermicro H8DCE Motherboard                               26
  3U Chassis w/ 350W PS with PCI-E riser & Slide Rails       26
  AMD Opteron 265 1.8GHz with Heatsink                       52
  4GB PC3200 Registered/ECC DDR (1GB x 4 total memory)      104
  80GB 7200rpm SATA 8 MB cache HDD                           26
  ATI Rage on board                                          26
  Dual Gigabit Ethernet Integrated on board                  26
  One Year Standard Warranty                                 26
  Opteron Linux Installed and Tested                         26
  Built, Tested & Configured                                 26
  Torque, Kick-Start Utility & Web-Based Mon. Software

  Head Node (4GB per node)
  Supermicro H8DCE Motherboard                                1
  3U Chassis w/ PS and Slide Rails                            1
  AMD Opteron 265 1.8GHz with Heatsink and Fan                2
  4GB PC3200 Registered/ECC DDR (1GB x 4 total memory)        4
  DVD Combo Drive