Tuesday, December 1, 2009

Windows HPC2008 Cluster Operational

Today the Windows HPC2008 cluster of RRZE successfully went into operation.

If you are interested in getting access to the system, contact hpc@rrze.uni-erlangen.de

Initial information on login and usage can be found here:

Windows HPC2008 Cluster Launch Slides

Thursday, November 19, 2009

Java: A quest with unattended installation

Some guidelines for unattended Java installation in a Win2008 HPC cluster environment:


  • Deactivate UAC on all nodes; otherwise the nodes will simply hang and wait for a UAC confirmation that will never come. You can avoid this by doing the first Java installation by hand via an RDesktop login; afterwards all subsequent unattended installations will succeed. We currently have no clue why. Perhaps some kind of adaptive UAC?
    Best practice is to deactivate UAC via a registry key (note the value 0, which disables UAC):
    %windir%\system32\reg.exe ADD HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System /v EnableLUA /t REG_DWORD /d 0 /f


  • Reboot the nodes, so that the registry change becomes effective.

  • All will run smoothly now if the user installing Java has been logged in to the nodes at least once. This poses a problem with 20+ cluster nodes, however. The basic point is that no user profile directory has been created yet, and neither have the temp and AppData paths.
    Java kindly ignores any variable defined by the OS, e.g. TEMP or TMP, and derives its own temp dirs, which leads to C:\Users\Username\AppData\LocalLow\Temp and many more.
    So the installation fails once more unless these directories exist.
    You have to create them yourself:
    mkdir C:\Users\%USERNAME%\AppData\LocalLow\Temp\


  • After that, the usual unattended JRE deployment should proceed.



Note that any login to the nodes to be installed, and any Java installation prior to that, can change all of the above observations.
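
Putting it together, the per-node sequence could look roughly like this (jre-setup.exe and the share path are placeholders for the actual JRE installer; /s is the installer's silent switch):

rem 1. disable UAC (see above) and reboot for the change to take effect
%windir%\system32\reg.exe ADD HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System /v EnableLUA /t REG_DWORD /d 0 /f
shutdown /r /t 0
rem 2. after the reboot: create the temp directory Java insists on
mkdir C:\Users\%USERNAME%\AppData\LocalLow\Temp\
rem 3. run the JRE installer silently
\\headnode\install\jre-setup.exe /s

On an HPC 2008 cluster the same commands can be pushed to all nodes with clusrun instead of logging in to each node individually.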

Wednesday, November 18, 2009

Windows HPC 2008 Cluster Launch

RRZE recently extended its Windows high-performance computing resources.
Along with upgrading to the latest Windows HPC Server 2008 release, the hardware has been upgraded significantly:

16 dual-socket nodes with hexa-core AMD Istanbul Opteron processors (Dell blade center enclosure), equipped with 32 GB of RAM each, deliver a peak performance of 2 TFLOP/s.
AMD Istanbul Die
Interested users are invited to join the official launch on December 1st, 2009, at RRZE, room 1.026.
After a quick tour of the new job scheduler, the main part is organized as a hands-on session where everyone can get comfortable with the new environment.

Registration via email to hpc@rrze.uni-erlangen.de is required to attend.

WindowsCluster



Designated trademarks and brands are the property of their respective owners

Monday, October 5, 2009

Ganglia 3.1.2 Running as a Service After All

With the help of srvany.exe from the Windows Resource Kit Tools 2003, you can run any executable as a service on Win2008 and Win2008 R2.

Create yourself a service that runs solely srvany.exe:

sc create GMOND binpath= c:\programme\ganglia\srvany.exe

Edit the service specification in the registry:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\GMOND

Add a subkey named Parameters.
Inside Parameters create a string value named Application.
Edit Application and put the call to ganglia into the value data field,
e.g. c:\programme\ganglia\gmond.exe -c "c:\programme\ganglia\gmond-node.conf"
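
The same registry edit can also be scripted with reg.exe instead of clicking through regedit; a sketch using the paths from above:

%windir%\system32\reg.exe ADD HKLM\SYSTEM\CurrentControlSet\Services\GMOND\Parameters /v Application /t REG_SZ /d "c:\programme\ganglia\gmond.exe -c c:\programme\ganglia\gmond-node.conf" /f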



Start the service via mmc or with sc start GMOND, and it should be running.



(There should also be a way to do this with the Cygwin service creation tool cygrunsrv. Thanks to Nigel for pointing that out.)
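
An untested sketch of that route, with the same paths translated to Cygwin notation (-I installs the service, -p names the executable, -a passes its arguments):

cygrunsrv -I GMOND -p /cygdrive/c/programme/ganglia/gmond.exe -a "-c /cygdrive/c/programme/ganglia/gmond-node.conf"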

Thursday, September 24, 2009

PCI express pinned Host Memory

Retesting my benchmarks with the current CUDA 2.3 release, I finally incorporated new features like pinned host memory allocation. The specs say that this improves host-to-device transfers and vice versa.
Due to the special allocation, the arrays stay at the same location in memory, will not be swapped out, and are available for DMA transfers more quickly. Otherwise, data is staged through a pinned buffer on its way between the ordinarily allocated (pageable) memory and the device; pinned allocation omits this detour.

The performance plot shows that pinned memory now offers up to 5.9 GB/s on the fastest currently available PCIe x16 Gen 2 interface, which has a peak transfer rate of 8 GB/s. This corresponds to 73% of peak with almost no optimization applied. In contrast, optimizations such as blocked data transfers, which proved to increase performance some time ago [PCIe revisited], no longer have a positive effect on performance.

Using only the blocked optimizations without pinned memory is still better than doing an unblocked transfer from unpinned memory, but it only reaches about 4.5 GB/s to the device, which corresponds to 56% of peak.
Reading from the device is far worse, at only 2.3 GB/s.
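
As a quick cross-check, the pageable-vs-pinned gap can also be reproduced with the bandwidthTest example from the CUDA SDK (assuming the SDK examples are built); this is not the benchmark behind the plot below, just a sanity check:

./bandwidthTest --memory=pageable --htod
./bandwidthTest --memory=pinned --htod
./bandwidthTest --memory=pinned --dtoh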

PCIe Bandwidth Measurements GTX280 using pinned Memory

Thursday, August 20, 2009

Drag and Drop not Working in Vista and Windows7

It may happen that your favorite media player (any other program might behave the same way) does not accept files added via drag and drop.
The most likely time for this behavior is right after installation, when the installer starts the program for the first time.
The installer of course runs with elevated rights for setup purposes, and so does the program it launches.
In this case Windows Vista and 7 forbid drag and drop for security reasons.

In most cases it is enough to simply close the program and start it again, now in non-elevated mode.

You can reproduce this behavior by simply starting a program with elevated rights.

Thursday, July 23, 2009

Cuda 2.3 released

NVIDIA just released CUDA version 2.3 together with the corresponding driver.
F22 @RRZE has already been updated to support this version.

Thursday, July 16, 2009

Tracing of MPI Programs

Overview


STILL UNDER CONSTRUCTION

To trace MPI programs with the Intel MPI tracing capabilities, at least the following steps are necessary.
(Note that this guide claims neither to be the only way nor to be complete and error-proof!)


Tutorial



  1. module load itac

  2. env LD_PRELOAD=/apps/intel/itac/7.2.0.011/itac/slib_impi3/libVT.so mpirun -pernode ./bin/solver ./examples/2X_2Y_2Z_200X200X200c_file.prm

  3. Beware that additional LD_PRELOAD settings might override this one!
  4. e.g: env LD_PRELOAD=/apps/rrze/lib/ptoverride-ubuntu64.so:/apps/intel/itac/7.2.0.011/itac/slib_impi3/libVT.so mpirun -npernode 2 $MPIPINNING ./bin/solver ./examples/8X_8Y_4Z_800X800X400c_file.prm

  5. Another way of doing this is to run mpiexec -trace ..... (note that this applies to Intel MPI)








Official Intel documentation on the matter


Intel® Trace Analyzer and Collector for Linux* OS
Getting Started Guide
Overview
To simplify the use of the Intel® Trace Analyzer and Collector, a set of environmental scripts is provided to you. Source/execute the appropriate script (/bin/itacvars.[c]sh) in your shell before using the software. For example, if using the Bash shell:

$ source /bin/itacvars.sh # better added to $HOME/.profile or similar

The typical use of the Trace Analyzer and Collector is as follows:

* Let your application run together with the Trace Collector to generate one (or more) trace file(s).
* Start the Trace Analyzer and load the generated trace for analysis.

Generating a Trace File
Generating a trace file from an MPI application can be as simple as setting just one environment variable or adding an argument to mpiexec. Assume you start your application with the following command:

$ mpiexec -n 4 myApp

Then generating a trace can be accomplished by adding:

$ LD_PRELOAD=/slib/libVT.so mpiexec -n 4 myApp

or even simpler (for the Intel® MPI Library)

$ mpiexec -trace -n 4 myApp

This will create a set of trace files named myApp.stf* containing trace information for all MPI calls issued by the application.

If your application is statically linked against the Intel® MPI Library you have to re-link your binary like this:

$ mpiicc -trace -o myApp # when using the Intel® C++ Compiler

or

$ mpiifort -trace -o myApp # when using the Intel® Fortran Compiler

Normal execution of your application:

$ mpiexec -n 4 myApp

will then create the trace files named myApp.stf*.
Analyzing a Trace File
To analyze the generated trace, invoke the graphical user interface:

$ traceanalyzer myApp.stf

Read section For the Impatient in the Trace Analyzer Reference Guide to get guidance on the first steps with this tool.

Wednesday, July 8, 2009

Cuda Machines @ RRZE

This information will not be updated any more. Please visit our official page, as we now provide GPU computing as a cluster resource:

RRZE HPC Services


Currently the available CUDA test systems @ RRZE are:



lightning: (available with upgraded hardware)
Ubuntu 8.04 x86_64
2x Quadcore Intel Clovertown (2.33 GHz), 4 MB L2 cache per 2 cores,
GeForce 8800 Ultra (768 MB) (G80 core)
Cuda Driver Version: 180.22
Cuda Toolkit: 2.0

f22: (Last Update 29.09.09)
Ubuntu 8.04 x86_64
2x Quadcore Intel Xeon L5420 (2.5 GHz)
GeForce GTX 280 SC (1 GB) (GT200 Core)
Current: Cuda Driver Version: 190.29 (Cuda 2.3) --> with OpenCL support!
Before: Cuda Driver Version: 190.16 (Cuda2.3)
Cuda Toolkit: 2.3

Tuesday, July 7, 2009

Cuda Tutorial @ RRZE

Currently we have two test systems running different GPUs from NVIDIA inside the testcluster environment.


  • Please apply for an HPC account at RRZE (ask your local administrator).

  • You get access to one of the machines either by submitting a job script or by requesting an interactive shell, e.g.:

  • qsub -I -lnodes=f22:ppn=8,walltime=01:00:00
  • Note that interactive sessions are limited to one hour, but they are the recommended way to try things out in the beginning

  • The module system supplies you with various compiler versions and CUDA versions, e.g.

  • module load cuda/2.2 will give you CUDA version 2.2 (64 bit)
  • The next thing you will want to try is compiling the SDK examples.


    • To do so, download the SDK matching the CUDA version you want to use (please check whether it is available, too!) and extract it to some directory by running it.

    • The CUDA path you have to specify (not the install path!) is /usr/local/cudaXX, where XX is the version and the architecture (e.g. -32).

    • Then enter the directory you extracted to and type make. It should compile; if it doesn't, please look into /usr/local/cudaXX/bin/linux/release/. If you find executables in there and you can actually run them, then somewhere in your settings is a mistake. If you are trying to compile in 32-bit mode, please contact us at hpc@rrze.uni-erlangen.de, because then you will need further assistance.



  • Assuming compilation went well (went well = no errors; we neglect the warnings here), you should have runnable SDK examples in bin/linux/release/ within the SDK directory.

  • Now your basic CUDA environment is set up and ready to go for your own codes. A condensed example session is sketched below.
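
For illustration, a complete first session might look like this (the SDK directory name below is just the default extraction target; deviceQuery is one of the stock SDK examples):

qsub -I -lnodes=f22:ppn=8,walltime=01:00:00   # interactive job on the GTX 280 node
module load cuda/2.2                          # select the CUDA toolkit version
cd ~/NVIDIA_CUDA_SDK                          # or wherever you extracted the SDK
make                                          # build the SDK examples
./bin/linux/release/deviceQuery               # quick sanity check of the GPU setup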


Monday, July 6, 2009

Taming Literature, Zotero Firefox Plugin

Again and again I felt the urge to have a real tool to organize my literature, instead of crawling through BibTeX files for every new paper or report.

Recently I heard about SCOPUS from the publisher Elsevier, and Citavi is free for members of our university, but all of these lack some features.

The biggest problem, however, is how to import bibliographies from webpages and similar "free-form" sources.

Finally I stumbled across the Firefox extension Zotero.
Once installed, it is active in the background and only visible by clicking its logo in the lower window bar. If you visit a page that contains bibliography-like content, a small icon appears in the address bar listing all literature found on this page. With a selection and a click on OK you import one or more books, articles and so on into the Zotero database. Furthermore, you can search the database by tags and other fields, organize categories and subcategories, and export to RTF and so on, and of course to BibTeX files.
You can import BibTeX files as well, to appreciate the hard work of generating those files by hand in the past :-).

Zotero is of course not free of flaws, and some webpages, even those for which Zotero has optimized parsers, provide information with errors until the next parser release comes out.


I have not yet figured out how to use this efficiently on multiple computers at the same time, but I'm sure there is a solution for that as well.

Wednesday, June 24, 2009

EIHECS roundup

Geballte HPC-Expertise in Erlangen
I myself got the nice little Lenovo S10e IdeaPad for taking second place in the Call for Innovative Multi- and Many-Core Programming.

Currently I'm running Windows 7 RC1 with no problems at all but getting away from it.

Lectures

G. Wellein, G. Hager, J. Habich: Lecture on Programming Techniques for Supercomputers PTfS, Summer Term 2011.

G. Hager, J. Habich: Tutorials for the lecture on Programming Techniques for Supercomputers PTfS, Summer Term 2011.

G. Wellein, G. Hager, J. Habich: Lecture on Programming Techniques for Supercomputers PTfS, Summer Term 2010.

G. Hager, J. Habich: Tutorials for the lecture on Programming Techniques for Supercomputers PTfS, Summer Term 2010.

G. Hager, J. Habich: Tutorials for the lecture on Programming Techniques for Supercomputers PTfS, Summer Term 2009.

Friday, June 19, 2009

MPI/OpenMP hybrid: pinning or no pinning, that is the question

Intro


Recently performed benchmarks of a hybrid-parallelized flow solver showed what one has to consider in order to get the best performance.
In theory, hybrid implementations are considered the most flexible approach while still maintaining high performance, the assumption being that OpenMP is perfect for intranode communication and faster than MPI there.
Between nodes, MPI is in any case the choice for portable distributed-memory parallelization.

In reality, however, quite a few MPI implementations already use shared-memory buffers when communicating with other ranks in the same shared-memory system. So there is basically no advantage in parallelizing with OpenMP instead of MPI inside a node when MPI is used for internode communication anyway.

Quite the contrary: apart from the additional implementation effort, a hybrid code requires a much deeper understanding of the processor and memory hierarchy and of thread and process affinity to cores than a pure MPI implementation.

Nevertheless, there are scenarios where hybrid really can pay off, as MPI lacks, for example, the OpenMP feature of accessing shared data in shared caches.

Finally, if you want to run your hybrid code on RRZE systems, the following features are available.

Pinning of MPI/OpenMP hybrids


I assume you use the provided mpirun wrapper:

  • mpirun -pernode issues just one MPI process per node regardless of the nodefile content

  • mpirun -npernode n issues just n MPI processes per node regardless of the nodefile content

  • mpirun -npernode 2 -pin 0,1_2,3 env LD_PRELOAD=/apps/rrze/lib/ptoverride-ubuntu64.so issues 2 MPI processes per node and gives the threads of MPI process 0 access only to cores 0 and 1, and the threads of MPI process 1 access to cores 2 and 3 (of course the MPI processes themselves are also limited to those cores). Furthermore, the (e.g. OpenMP) threads are each pinned to a single core, so that migration is no longer an issue

  • mpirun -npernode 2 -pin 0_1_2_3 is your choice if you would like to test 1 OpenMP thread per MPI process and 4 MPI processes in total per node. Adding the LD_PRELOAD from above, however, decreases performance a lot. This is currently under investigation.

  • export PINOMP_MASK=2 changes the skip mask of the pinning tool


OpenMP spawns not only worker threads but also threads for administrative tasks such as synchronization. Usually you would only pin the threads that contribute to the computation. The default skip mask, which skips the non-computational threads, might not be correct in the case of hybrid programming, as MPI spawns non-worker threads as well. The PINOMP_MASK variable is interpreted as a bitmask, e.g. 2 --> 10 and 6 --> 110. A zero means pin the thread and a one means skip pinning the thread. The least significant bit corresponds to thread zero (bit 0 is 0 in both examples above).
    A mask of 6 was used in the algorithm under investigation as soon as one MPI process and 4 OpenMP worker threads were used per node, in order to get the correct thread pinning; see the example below.
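
For illustration, a run with one MPI process and four OpenMP worker threads per node could then look like this (pinning syntax and paths taken from the examples above; the mask value of 6 is the Intel MPI case described here):

export OMP_NUM_THREADS=4
export PINOMP_MASK=6   # binary 110: skip threads 1 and 2, pin thread 0 and the remaining workers
mpirun -pernode -pin 0,1,2,3 env LD_PRELOAD=/apps/rrze/lib/ptoverride-ubuntu64.so ./bin/solver ./examples/2X_2Y_2Z_200X200X200c_file.prm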

The usage of the rankfile for pinning hybrid jobs is described in Thomas Zeiser's blog.


Thanks to Thomas Zeiser and Michael Meier for their help in resolving this issue.

Keywords: thread pinning, hybrid, affinity
Incorporated comments of Thomas Zeiser:

Thomas Zeiser, Donnerstag, 23. Juli 2009, 18:15

PINOMP_MASK for hybrid codes using Open-MPI

If recent Intel compilers and Open-MPI are used for hybrid OpenMP-MPI programming, the correct PINOMP_MASK seems to be 7 (instead of 6 for hybrid codes using Intel-MPI).

Thomas Zeiser, Montag, 22. Februar 2010, 20:41
PIN OMP and recent mvapich2

Also recent mvapich2 requires special handling for pinning hybrid codes: PINOMP_SKIP=1,2 seems to be appropriate.

Wednesday, May 20, 2009

OpenMP Fortran

I'm currently investigating why a scalar (4-byte integer) variable cannot be declared FIRSTPRIVATE inside an !$OMP PARALLEL section.
This causes abnormal program termination, segfaults and undefined behavior.
However, declaring the variable PRIVATE works, and SHARED of course does, too.
Hopefully a small code snippet will provide more insight.

Personal

My personal Homepage

HPC Links

RRZE HPC Group Page

RRZE HPC lattice Boltzmann activities

KONWIHR 1 and 2

Wednesday, April 29, 2009

Ganglia 3.1.2 for Windows HPC2008


Recent tests of the Windows port of Ganglia on Microsoft Windows HPC 2008, obtained from the APR Consulting web page, revealed a problem.
After a few minutes of runtime, the Ganglia executable eats up more and more memory until the system starts to swap and finally becomes unstable and crashes or is no longer reachable.
Unable to deploy Ganglia to the cluster, I tested different releases from APR: none of them had the problem on Win2003 x64, but all showed the same memory leak on HPC2008 x64 or just didn't work at all.
So finally we compiled our own Cygwin-based gmond.exe binary and came up with a pretty stable version, with just one flaw:
so far the installation as a service doesn't work, neither with gmondservice.exe from APR Consulting nor with the native Windows tool sc.exe.
However, installation with schtasks.exe as a scheduled task that runs once on startup and then daemonizes (what Linux would call a service) works fine; a sketch is given below.
In addition, simply swapping the executables or the config file now results in an updated Ganglia once the node reboots or a task restart is triggered, instead of removing and reinstalling a service.
All deployment steps can easily be done with the clusrun extension, which is essential for cluster administration.
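
A sketch of the scheduled-task installation with placeholder paths (prefix the commands with clusrun to roll them out to all nodes):

schtasks /create /tn GMOND /tr "c:\ganglia\gmond.exe -c c:\ganglia\gmond.conf" /sc onstart /ru SYSTEM
schtasks /run /tn GMOND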




Small tutorial

(all links are below, drop a comment if something is missing/wrong)

  • Download a ganglia version (3.1.2 Langley worked indeed very well)

  • Download and install cygwin with a gcc and g++ compiler and the additional packages mentioned in the README.WIN file of the ganglia package

  • currently:
    libapr1, expat, diffutils, gcc, make, python, sharutils, sunrpc
    and for libconfuse:
    libiconv
  • Do ./configure && make && make install in the root directory of the confuse lib

  • Perhaps you have to exclude the examples from the build:
    replace the line SUBDIRS = m4 po src examples tests doc with
    SUBDIRS = m4 po src tests doc
    The examples threw an error on my system.

  • Do ./configure --with-libconfuse=/usr/local --enable-static-build followed by make in the root of the ganglia tree

  • With some additional DLL files from Cygwin, your build is now runnable. Just start gmond.exe, look into the Event Viewer to see which DLL is missing, and place the missing DLLs in the same folder or in a folder that is in the PATH.



Please note that this is an x86_32 binary and not x64, due to the fact that Cygwin is not x64.
It should, however, be possible to build Ganglia as a native x64 binary with the Windows Services for UNIX.



Links:

Corresponding discussion in HPC2008 MS Forum
Cygwin
Ganglia
confuse library
APR Consulting web page

Thursday, January 22, 2009

Where is my "My Desktop" button?

In order to get the "My Desktop" button back, e.g. on Windows terminal servers, just execute the following command:



regsvr32 /n /i:U shell32



With the next reboot, or upon a restart of the Quick Launch bar, the icon should appear.