Johannes Habich

Wednesday, June 24, 2009

EIHECS roundup

Concentrated HPC expertise in Erlangen
I myself got the nice little Lenovo S10e Ideapad for placing second in the Call for Innovative Multi- and Many-Core Programming.

I'm currently running Windows 7 RC1 on it with no problems at all, but I am moving away from it.

Lectures

G. Wellein, G. Hager, J. Habich: Lecture on Programming Techniques for Supercomputers PTfS, Summer Term 2011.

G. Hager, J. Habich: Tutorials for the lecture on Programming Techniques for Supercomputers PTfS, Summer Term 2011.

G. Wellein, G. Hager, J. Habich: Lecture on Programming Techniques for Supercomputers PTfS, Summer Term 2010.

G. Hager, J. Habich: Tutorials for the lecture on Programming Techniques for Supercomputers PTfS, Summer Term 2010.

G. Hager, J. Habich: Tutorials for the lecture on Programming Techniques for Supercomputers PTfS, Summer Term 2009.

Friday, June 19, 2009

MPI/OpenMP hybrid: pinning or no pinning, that is the question

Intro

Recent benchmarks of a hybrid-parallelized flow solver showed what one has to consider in order to get the best performance.
In theory, hybrid implementations are thought to be the most flexible while still maintaining high performance. The reasoning is that OpenMP is perfect for intranode communication and faster than MPI there.
Between nodes, MPI is the choice for portable distributed-memory parallelization anyway.

In reality, however, quite a few MPI implementations already use shared-memory buffers when communicating with other ranks on the same shared-memory system. So OpenMP has basically no advantage over MPI at the same (intranode) level when MPI is used for internode communication anyway.

Quite the contrary: apart from the additional implementation effort, a hybrid code requires a much deeper understanding of the processor and memory hierarchy layout and of thread and process affinity to cores than a pure MPI implementation.

Nevertheless, there are scenarios where hybrid really can pay off, because MPI lacks, for example, the OpenMP feature of accessing shared data in shared caches.

Finally, if you want to run your hybrid code on RRZE systems, the following features are available.

Pinning of MPI/OpenMP hybrids

I assume you use the provided mpirun wrapper.

mpirun -pernode starts just one MPI process per node, regardless of the nodefile content.

mpirun -npernode n starts just n MPI processes per node, regardless of the nodefile content.

mpirun -npernode 2 -pin 0,1_2,3 env LD_PRELOAD=/apps/rrze/lib/ptoverride-ubuntu64.so starts 2 MPI processes per node and gives the threads of MPI process 0 access to cores 0 and 1 only, and the threads of MPI process 1 access to cores 2 and 3 only (of course, the MPI processes themselves are also limited to those cores). Furthermore, the threads (e.g. OpenMP threads) are each pinned to a single core, so that migration is no longer an issue.

mpirun -npernode 4 -pin 0_1_2_3 is your choice if you would like to test 1 OpenMP thread per MPI process and 4 MPI processes in total per node. Adding the LD_PRELOAD from above, however, decreases performance considerably. This is currently under investigation.

export PINOMP_MASK=2 changes the skip mask of the pinning tool.

OpenMP spawns not only worker threads but also threads for administrative tasks such as synchronization. Usually you only want to pin the threads that contribute to the computation. The default skip mask, which skips the threads that are not computationally intensive, might not be correct for hybrid programs, because MPI spawns non-worker threads as well. The PINOMP_MASK variable is interpreted as a bitmask, e.g. 2 --> 10 and 6 --> 110. A 0 means that the thread is pinned and a 1 means that pinning is skipped for that thread. The least significant bit corresponds to thread zero (bit 0 is 0 in both examples above).

A mask of 6 was needed for the algorithm under investigation to get the correct thread pinning as soon as one MPI process and 4 OpenMP worker threads were used per node.
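To make the -pin notation and the PINOMP_MASK bitmask more concrete, here is a minimal Python sketch. It is not the code of the RRZE mpirun wrapper or of the pinning library; all function names are hypothetical, and the placement rule in assign_cores (non-skipped threads receive the allowed cores in creation order) is an assumption made purely for illustration.

```python
def parse_pin_spec(spec):
    """Split a pin string such as "0,1_2,3" into one core list per MPI process.

    "_" separates MPI processes, "," separates the cores of one process.
    """
    return [[int(core) for core in group.split(",")] for group in spec.split("_")]


def skipped_threads(mask):
    """Return the thread indices whose bit is 1 in the mask (1 = do not pin).

    Bit 0 (least significant) corresponds to thread 0.
    """
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def assign_cores(cores, mask, n_threads):
    """Map each non-skipped thread to a core, in thread creation order.

    ASSUMPTION: this round of placement is illustrative only; the real
    pinning tool may distribute cores differently.
    Threads that find no core left are mapped to None.
    """
    skip = set(skipped_threads(mask))
    free = iter(cores)
    placement = {}
    for tid in range(n_threads):
        if tid in skip:
            continue  # administrative thread, left unpinned
        placement[tid] = next(free, None)
    return placement


if __name__ == "__main__":
    print(parse_pin_spec("0,1_2,3"))
    print(skipped_threads(6))
    print(assign_cores([0, 1, 2, 3], 6, 6))
```

Under these assumptions, a mask of 6 with six threads in total leaves threads 1 and 2 unpinned and gives threads 0, 3, 4 and 5 one core each, which would be consistent with one MPI process running 4 OpenMP worker threads per node.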

The usage of the rankfile for pinning hybrid jobs is described in Thomas Zeiser's blog.

Thanks to Thomas Zeiser and Michael Meier for their help in resolving this issue.

Keywords: Thread Pinning Hybrid Affinity

Incorporated Comments of Thomas Zeiser

Thomas Zeiser, Thursday, July 23, 2009, 18:15

PINOMP_MASK for hybrid codes using Open-MPI

If recent Intel compilers and Open-MPI are used for hybrid OpenMP-MPI programming, the correct PINOMP_MASK seems to be 7 (instead of 6 for hybrid codes using Intel-MPI).

Thomas Zeiser, Monday, February 22, 2010, 20:41
PIN OMP and recent mvapich2

Recent mvapich2 also requires special handling for pinning hybrid codes: PINOMP_SKIP=1,2 seems to be appropriate.

Wednesday, May 20, 2009

OpenMP Fortran

I'm currently investigating a problem where a scalar (4-byte integer) variable cannot be declared FIRSTPRIVATE in an !$OMP PARALLEL region.
Doing so causes abnormal program termination, segmentation faults and undefined behavior.
Declaring the variable PRIVATE works, however, and so does SHARED, of course.
Hopefully a small code snippet will provide more insight.


Wednesday, April 29, 2009

Ganglia 3.1.2 for Windows HPC2008

Recent tests of the Windows port of ganglia, obtained from the APR Consulting web page, on Microsoft Windows HPC 2008 revealed a problem.
After a few minutes of runtime, the ganglia executable eats up more and more memory until the system starts to swap and finally becomes unstable, crashes or is no longer reachable.
Unable to deploy ganglia to the cluster, I tested different releases from APR. None of them had the problem on Win2003 x64; on HPC2008 x64, however, all of them either showed the same memory leak or did not work at all.
So finally we compiled our own Cygwin-based gmond.exe binary and came up with a pretty stable version that has just one flaw:
So far, installation as a service does not work, neither with gmondservice.exe from APR Consulting nor with the native Windows tool sc.exe.
However, installing it with schtasks.exe as a scheduled task that runs once on startup and then daemonizes (a daemon is what Linux calls a service) works fine.
In addition, simply swapping the executable or the config file now results in an updated ganglia once the node reboots or a task restart is triggered; there is no need to remove and reinstall a service.
All deployment steps can easily be done with the clusrun extension, which is essential for cluster administration.

Small tutorial

(all links are below, drop a comment if something is missing/wrong)

Download a ganglia version (3.1.2 Langley worked very well indeed).

Download and install Cygwin with the gcc and g++ compilers and the additional packages mentioned in the README.WIN file of the ganglia package.

Run ./configure, make and make install in the root directory of the confuse library.

You may have to exclude the examples from the build, as they threw an error on my system:
replace the line SUBDIRS = m4 po src examples tests doc with
SUBDIRS = m4 po src tests doc

Run ./configure --with-libconfuse=/usr/local --enable-static-build and then make in the root directory of ganglia.

With some additional DLL files from Cygwin, your build is now runnable. Just start gmond.exe, check the Event Viewer to see which DLLs are missing, and place them in the same folder or in a folder that is in the PATH.

Please note that this is an x86_32 binary and not x64, because Cygwin is not x64.
It should, however, be possible to build a native x64 ganglia with the Windows Services for Unix.

Links:

Corresponding discussion in HPC2008 MS Forum
Cygwin
Ganglia
confuse library
APR Consulting web page

Thursday, January 22, 2009

Where is my "My Desktop" button?

To get the "My Desktop" button back, e.g. on Windows Terminal Servers, just execute the following command:

regsvr32 /n /i:U shell32

After the next reboot or a restart of the Quick Launch bar, the icon should appear.

Tuesday, December 16, 2008

Windows CCS Cluster Upgrade

Recently the Windows CCS Cluster of the RRZE got a small upgrade.

One of the initial nodes rejoined the cluster, so 28 Opteron cores are available again.
Because the cluster is used for CFD production runs, the user home was recently upgraded and the quota was extended to 10 GB per user.
Furthermore, an extra project home with up to 120 GB of space is available for special purposes and for a limited amount of time.

Monday, December 15, 2008

PCI express revisited

Test results with the new generation, i.e. GT200-based cards and PCIe Generation 2.0 with doubled performance, show that naively implemented copies do not get any speedup.
Blocked copies, however, climb up to 4.5 GB/s when writing data to GPU memory.
Copying data back to the host is still relatively slow at 2 GB/s.

PCIe bandwidth measurements: 8800 GTX vs. GTX 280

Link to first article