Tuesday, December 1, 2009
If you are interested in getting access to the system, contact email@example.com.
Initial information on login and usage can be found here:
Windows HPC2008 Cluster Launch Slides
Thursday, November 19, 2009
- Deactivate UAC on all nodes; otherwise the nodes will simply hang and wait for a UAC confirmation that will never happen. You can avoid this by doing the first Java installation by hand via an RDesktop login. Afterwards, all subsequent unattended installations will succeed. We currently have no clue why. Perhaps some kind of adaptive UAC?
Best practice is to deactivate UAC via a registry key:
%windir%\system32\reg.exe ADD HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System /v EnableLUA /t REG_DWORD /d 0 /f
- Reboot the nodes, so that the registry change becomes effective.
- Everything will run smoothly now, provided the user installing Java has logged in to the nodes at least once. This poses a problem with 20+ cluster nodes, however. The basic point is that no user directory has been created yet, and neither have the temp and AppData paths.
Java kindly ignores any variable defined by the OS, e.g. TEMP or TMP, and determines its own temp directories, which leads to C:\Users\Username\AppData\LocalLow\Temp and many more.
So the installation fails once more, unless these directories are there.
So you have to create them yourself:
- After that the usual unattended JRE deployment should proceed.
Note that any earlier login to the nodes, or any earlier Java installation on them, can change the behavior described above.
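The directory creation mentioned above could be scripted roughly as follows. This is only a sketch: the user name hpcuser is hypothetical, and the list contains just the one sub-path named in this post, although the installer may need more directories. The script only prints mkdir commands, which you could then run on every node.

```python
from pathlib import PureWindowsPath

# Only the sub-path named in the post; extend this list as needed (assumption).
JAVA_TEMP_SUBDIRS = [r"AppData\LocalLow\Temp"]


def java_temp_dirs(username, users_root=r"C:\Users"):
    """Return the Windows directories that should exist for the given user."""
    root = PureWindowsPath(users_root) / username
    return [str(root / sub) for sub in JAVA_TEMP_SUBDIRS]


def mkdir_commands(username):
    """Build one cmd.exe mkdir command per directory.

    With command extensions enabled (the default), mkdir also creates
    missing intermediate directories.
    """
    return ['mkdir "{}"'.format(d) for d in java_temp_dirs(username)]


if __name__ == "__main__":
    # "hpcuser" is a hypothetical account name used for illustration.
    for command in mkdir_commands("hpcuser"):
        print(command)
```

The printed commands can be pasted into whatever remote execution tool you use for the nodes.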
Wednesday, November 18, 2009
Along with upgrading to the latest Windows HPC Server Release 2008, the hardware has been upgraded significantly:
16 dual-socket nodes with hexa-core AMD Istanbul Opteron processors (Dell Blade Center enclosure), equipped with 32 GB of RAM, deliver a peak performance of 2 TFLOP/s.
Interested users are invited to join the official launch on December 1st, 2009 at RRZE, room 1.026.
After a quick tour of the new Job Scheduler, the main part is organized as a hands-on session where everyone can get comfortable with the new environment.
Registration via email to firstname.lastname@example.org is required for attending.
Designated trademarks and brands are the property of their respective owners
Monday, October 5, 2009
Create a service that runs nothing but srvany.exe:
sc create GMOND binpath= c:\programme\ganglia\srvany.exe
Edit the service specs in the registry:
Add a subkey named Parameters
Inside "Parameters" create a String value named Application.
Edit Application and put the call to ganglia into the value data field.
E.g. c:\programme\ganglia\gmond.exe -c "c:\programme\ganglia\gmond-node.conf"
Start the service via the MMC or with sc start GMOND and it should be running.
(There should also be a way to do this with the cygwin service creation tool cygrunsrv. Thanks to Nigel for pointing that out.)
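The manual steps above can be collected into a small command generator. This is a sketch, not a tested deployment script: it assumes the service key lives in the standard location HKLM\SYSTEM\CurrentControlSet\Services, and it assumes that reg.exe accepts backslash-escaped quotes inside the /d value. It only builds the command lines; nothing is executed.

```python
# Standard registry location of Windows service definitions.
SERVICE_KEY_ROOT = r"HKLM\SYSTEM\CurrentControlSet\Services"


def srvany_commands(service, srvany_path, app_cmdline):
    """Return the cmd.exe command lines for a srvany-wrapped service."""
    key = "\\".join((SERVICE_KEY_ROOT, service, "Parameters"))
    # Escape embedded double quotes for the reg.exe command line (assumption).
    escaped = app_cmdline.replace('"', '\\"')
    return [
        # 1. Create a service that only runs srvany.exe.
        "sc create {} binpath= {}".format(service, srvany_path),
        # 2. Create the Parameters subkey with the Application string value.
        'reg add {} /v Application /t REG_SZ /d "{}" /f'.format(key, escaped),
        # 3. Start the service.
        "sc start {}".format(service),
    ]


if __name__ == "__main__":
    commands = srvany_commands(
        "GMOND",
        r"c:\programme\ganglia\srvany.exe",
        r'c:\programme\ganglia\gmond.exe -c "c:\programme\ganglia\gmond-node.conf"',
    )
    for command in commands:
        print(command)
```

Printing the commands first makes it easy to check the quoting before running them on all nodes.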
Thursday, September 24, 2009
Due to the special allocation, the arrays stay at the same location in memory, are not swapped out, and are more quickly available for DMA transfers. Otherwise, most data is first copied to a pinned memory buffer and only then to the ordinarily allocated memory space. This detour is avoided here.
The performance plot shows that pinned memory now offers a performance of up to 5.9 GB/s on the fastest currently available PCIe x16 Gen 2 interface, which has a peak transfer rate of 8 GB/s. This corresponds to 73% of peak performance with almost no optimization applied. In contrast, optimizations such as a blocked data transfer, which proved to increase performance some time ago [PCIe revisited], no longer have a positive effect on performance.
Using only the blocked optimizations without pinned memory is still better than doing an unblocked transfer from unpinned memory, but it only transfers about 4.5 GB/s to the device, which corresponds to 56% of peak.
Reading from the device is far worse, with only 2.3 GB/s.
Thursday, August 20, 2009
This behavior is most likely to occur right after installation, when the installer starts the program for the first time.
The installer naturally runs with elevated rights for setup purposes, and so does the program it starts.
In this case Windows Vista and Windows 7 block drag and drop for security reasons.
In most cases it is enough to close the program and start it again, now in non-elevated mode.
You can reproduce this behavior by simply starting a program with elevated rights.
Thursday, July 23, 2009
Thursday, July 16, 2009
STILL UNDER CONSTRUCTION
To trace MPI programs with the Intel MPI tracing capabilities, at least the following steps are necessary.
(Note that this guide does not claim to be the only way, nor to be complete or error-free!)
- module load itac
- env LD_PRELOAD=/apps/intel/itac/7.2.0.011/itac/slib_impi3/libVT.so mpirun -pernode ./bin/solver ./examples/2X_2Y_2Z_200X200X200c_file.prm
- e.g: env LD_PRELOAD=/apps/rrze/lib/ptoverride-ubuntu64.so:/apps/intel/itac/7.2.0.011/itac/slib_impi3/libVT.so mpirun -npernode 2 $MPIPINNING ./bin/solver ./examples/8X_8Y_4Z_800X800X400c_file.prm
- Another way of doing this is to run mpiexec -trace ..... (remember this is true for intel MPI)
Watch out: additional LD_PRELOAD settings might override this one!
env LD_PRELOAD=/apps/rrze/lib/ptoverride-ubuntu64.so:/apps/intel/itac/7.2.0.011/itac/slib_impi3/libVT.so mpirun -npernode 2 $MPIPINNING ./bin/solver ./examples/8X_8Y_4Z_800X800X400c_file.prm
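One way to avoid accidentally overriding an existing LD_PRELOAD is to append to it instead of replacing it. The helper below is a minimal sketch of that idea. The function name with_preload is made up for illustration; the library paths are the ones used in the commands above.

```python
def with_preload(env, *libraries):
    """Return a copy of env with the given libraries appended to LD_PRELOAD.

    Existing entries are kept and duplicates are not added twice.
    """
    new_env = dict(env)
    entries = [p for p in new_env.get("LD_PRELOAD", "").split(":") if p]
    for lib in libraries:
        if lib not in entries:
            entries.append(lib)
    new_env["LD_PRELOAD"] = ":".join(entries)
    return new_env


if __name__ == "__main__":
    # Start from an environment where the pinning library is already preloaded.
    env = {"LD_PRELOAD": "/apps/rrze/lib/ptoverride-ubuntu64.so"}
    env = with_preload(env, "/apps/intel/itac/7.2.0.011/itac/slib_impi3/libVT.so")
    print("LD_PRELOAD=" + env["LD_PRELOAD"])
```

The resulting dictionary can be passed as the env argument of subprocess.run when launching mpirun from a script.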
Official Intel documentation on that matter
Intel® Trace Analyzer and Collector for Linux* OS
Getting Started Guide
To simplify the use of the Intel® Trace Analyzer and Collector, a set of environmental scripts is provided to you. Source/execute the appropriate script (/bin/itacvars.[c]sh) in your shell before using the software. For example, if using the Bash shell:
$ source /bin/itacvars.sh # better added to $HOME/.profile or similar
The typical use of the Trace Analyzer and Collector is as follows:
* Let your application run together with the Trace Collector to generate one (or more) trace file(s).
* Start the Trace Analyzer and load the generated trace for analysis.
Generating a Trace File
Generating a trace file from an MPI application can be as simple as setting just one environment variable or adding an argument to mpiexec. Assume you start your application with the following command:
$ mpiexec -n 4 myApp
Then generating a trace can be accomplished by adding:
$ LD_PRELOAD=/slib/libVT.so mpiexec -n 4 myApp
or even simpler (for the Intel® MPI Library)
$ mpiexec -trace -n 4 myApp
This will create a set of trace files named myApp.stf* containing trace information for all MPI calls issued by the application.
If your application is statically linked against the Intel® MPI Library you have to re-link your binary like this:
$ mpiicc -trace -o myApp # when using the Intel® C++ Compiler
$ mpiifort -trace -o myApp # when using the Intel® Fortran Compiler
Normal execution of your application:
$ mpiexec -n 4 myApp
will then create the trace files named myApp.stf*.
Analyzing a Trace File
To analyze the generated trace, invoke the graphical user interface:
$ traceanalyzer myApp.stf
Read section For the Impatient in the Trace Analyzer Reference Guide to get guidance on the first steps with this tool.
Wednesday, July 8, 2009
RRZE HPC Services
Currently the available CUDA test systems @ RRZE are:
lightning: (available with upgraded hardware)
Ubuntu 8.04 x86_64
2x Quadcore Intel Clovertown (2.33 GHz), 4 MB L2 per 2 cores,
GeForce 8800 Ultra (768 MB) (G80 core)
Cuda Driver Version: 180.22
Cuda Toolkit: 2.0
f22: (Last Update 29.09.09)
Ubuntu 8.04 x86_64
2x Quadcore Intel Xeon L5420 (2.5 GHz)
GeForce GTX 280 SC (1 GB) (GT200 Core)
Current: Cuda Driver Version: 190.29(Cuda2.3) --> with OpenCL Support!
Before: Cuda Driver Version: 190.16 (Cuda2.3)
Cuda Toolkit: 2.3
Tuesday, July 7, 2009
- Please apply for an HPC account at RRZE (ask your local administrator).
- You get access to one of the machines either by submitting a job script or by requesting an interactive shell, e.g.:
qsub -I -lnodes=f22:ppn=8,walltime=01:00:00
- Note that interactive sessions are limited to one hour, but this is the recommended way to try things out in the beginning.
- The module system now supplies you with various compiler and CUDA versions, e.g.
module load cuda/2.2 will give you CUDA version 2.2 (64-bit).
- The next thing you will want to try is compiling the SDK examples.
- To do so, download the SDK matching the CUDA version you want to use (please check whether that version is available, too!) and extract it to some directory by running it.
- The CUDA path you have to specify (not the install path!) is /usr/local/cudaXX, where XX is the version and the architecture (e.g. -32).
- Then enter the directory you extracted to and type make. It should compile. If it doesn't, please look into /usr/local/cudaXX/bin/linux/release/. If you find executables in there and you can actually run them, then there is a mistake somewhere in your settings. If you are trying to compile in 32-bit mode, please contact us at email@example.com, because then you will need further assistance.
- Assuming compilation went well (went well = no errors; we neglect the warnings here), you should have runnable SDK examples in /bin/release/linux/
- Now your basic CUDA environment is set up and ready to go for your own codes.
Monday, July 6, 2009
Recently I heard about SCOPUS from the publisher ELSEVIER, and Citavi is free for members of our university, but all of them lack some features.
The largest problem, however, is how to import bibliographies from web pages and similar "free form" sources.
Finally I stumbled across the Firefox extension Zotero.
Once installed, it is active in the background and only becomes visible when you click its logo in the lower window bar. If you visit a site with bibliography-like content, a small icon appears in the address bar that lists all literature found on the page. By selecting entries and hitting OK, you import one or more books, articles, etc. into the Zotero DB. Furthermore, you can search the DB by tags and other fields, organize categories and subcategories, and export to RTF, other formats and, of course, BibTeX files.
You can also import BibTeX files, so the hard work of generating these files by hand in the past is not lost :-).
Zotero is of course not free from flaws, and some web pages, even those for which Zotero has optimized parsers, yield erroneous information until the next parser release comes out.
I have not yet figured out how to use this efficiently on multiple computers at the same time, but I'm sure there is a solution for that as well.
Wednesday, June 24, 2009
I myself got the nice little Lenovo S10e Ideapad for being second in the Call for Innovative Multi- and Many-Core Programming.
Currently I'm running Windows 7 RC1 on it with no problems at all, but I am moving away from it.
G. Hager, J. Habich: Tutorials for the lecture on Programming Techniques for Supercomputers PTfS, Summer Term 2011.
G. Wellein, G. Hager, J. Habich: Lecture on Programming Techniques for Supercomputers PTfS, Summer Term 2010.
G. Hager, J. Habich: Tutorials for the lecture on Programming Techniques for Supercomputers PTfS, Summer Term 2010.
G. Hager, J. Habich: Tutorials for the lecture on Programming Techniques for Supercomputers PTfS, Summer Term 2009.
Friday, June 19, 2009
Recent benchmarks of a hybrid-parallelized flow solver showed what one has to consider in order to get the best performance.
On the theoretical side, hybrid implementations are thought to be the most flexible while still maintaining high performance. The reasoning is that OpenMP is perfect for intranode communication and faster than MPI there.
Between nodes, MPI is the choice for portable distributed-memory parallelization anyway.
In reality, however, quite a few MPI implementations already use shared memory buffers when communicating with other ranks in the same shared memory system. So there is basically no advantage of OpenMP over MPI within a node when MPI is used for internode communication anyway.
On the contrary: apart from the additional implementation effort, a hybrid code requires a much deeper understanding of processor and memory hierarchy layout and of thread and process affinity to cores than a pure MPI implementation.
Nevertheless, there are scenarios where hybrid can really pay off, as MPI lacks, for example, the OpenMP feature of accessing shared data in shared caches.
Finally if you want to run your hybrid code on RRZE systems, there are the following features available.
Pinning of MPI/OpenMP hybrids
I assume you use the provided mpirun wrapper.
- mpirun -pernode issues just one MPI process per node regardless of the nodefile content
- mpirun -npernode n issues just n MPI processes per node regardless of the nodefile content
- mpirun -npernode 2 -pin 0,1_2,3 env LD_PRELOAD=/apps/rrze/lib/ptoverride-ubuntu64.so issues 2 MPI processes per node and gives the threads of MPI process 0 access to cores 0 and 1 only, and the threads of MPI process 1 access to cores 2 and 3 only (of course the MPI processes themselves are also limited to those cores). Furthermore, the threads (e.g. OpenMP threads) are each pinned to a single core, so that migration is no longer an issue.
- mpirun -npernode 2 -pin 0_1_2_3 is your choice if you would like to test 1 OpenMP thread per MPI process and 4 MPI processes in total per node. Adding the LD_PRELOAD from above, however, decreases performance a lot. This is currently under investigation.
- export PINOMP_MASK=2 changes the skip mask of the pinning tool
OpenMP spawns not only worker threads but also threads for administrative tasks such as synchronization. Usually you only want to pin the threads that contribute to the computation. The default skip mask, which skips the threads that are not computationally intensive, might not be correct for hybrid programs, as MPI spawns non-worker threads as well. The PINOMP_MASK variable is interpreted as a bitmask, e.g. 2 --> 10 and 6 --> 110. A 0 means that the thread is pinned and a 1 means that pinning is skipped for that thread. The least significant bit corresponds to thread zero (bit 0 is 0 in the examples above).
- A mask of 6 gave the correct thread pinning for the code under investigation as soon as one MPI process with 4 OpenMP worker threads was used per node.
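To see what a given PINOMP_MASK value means, the bitmask can be decoded with a few lines of Python. This sketch only follows the interpretation described above (bit t belongs to thread t, 1 = skip, 0 = pin). The helper names are made up for illustration and are not part of the pinning tool.

```python
def skipped_threads(mask, n_threads):
    """Thread numbers whose bit in the mask is 1, i.e. which are NOT pinned."""
    return [t for t in range(n_threads) if ((mask >> t) & 1) == 1]


def pinned_threads(mask, n_threads):
    """Thread numbers whose bit in the mask is 0, i.e. which are pinned."""
    return [t for t in range(n_threads) if ((mask >> t) & 1) == 0]


if __name__ == "__main__":
    # Masks mentioned in this post: 2 (binary 10), 6 (binary 110), 7 (binary 111).
    for mask in (2, 6, 7):
        print("PINOMP_MASK={} ({}): skip {}, pin {}".format(
            mask, bin(mask)[2:], skipped_threads(mask, 6), pinned_threads(mask, 6)))
```

With a mask of 6, for example, threads 1 and 2 are skipped while thread 0 and all threads from 3 onwards are pinned.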
The usage of the rankfile for pinning hybrid jobs is described in Thomas Zeiser's blog.
Thanks to Thomas Zeiser and Michael Meier for their help in resolving this issue.
Keywords: Thread Pinning Hybrid Affinity
Incorporated Comments of Thomas Zeiser
Thomas Zeiser, Thursday, July 23, 2009, 18:15
PINOMP_MASK for hybrid codes using Open-MPI
If recent Intel compilers and Open-MPI are used for hybrid OpenMP-MPI programming, the correct PINOMP_MASK seems to be 7 (instead of 6 for hybrid codes using Intel-MPI).
Thomas Zeiser, Monday, February 22, 2010, 20:41
PIN OMP and recent mvapich2
Also recent mvapich2 requires special handling for pinning hybrid codes: PINOMP_SKIP=1,2 seems to be appropriate.
Wednesday, May 20, 2009
This will cause abnormal program termination, segmentation faults and undefined behavior.
However, declaring the variable as PRIVATE works, and so does SHARED, of course.
Hopefully a small code snippet will provide more insight.
Wednesday, April 29, 2009
Recent tests of the Windows port of ganglia, obtained from the APR Consulting web page, on Microsoft Windows HPC 2008 showed a problem.
After a few minutes of runtime, the ganglia executable eats up more and more memory until the system starts to swap and finally becomes unstable and crashes or is no longer reachable.
Unable to deploy ganglia to the cluster, I tested different releases from APR. None of them had the problem when running on Win2003 x64; however, on HPC2008 x64 all of them either showed the same memory leak or did not work at all.
So finally we compiled our own Cygwin-based gmond.exe binary and came up with a pretty stable version, with just one flaw:
So far the installation as a service doesn't work, neither with gmondservice.exe from APR Consulting nor with the Windows native tool sc.exe.
However, installing it with schtasks.exe as a scheduled task that runs once on startup and then daemonizes (a daemon is what Linux calls a service) works fine.
In addition, simply swapping the executables or the config file now results in an updated ganglia once the node reboots or a task restart is triggered; there is no need to remove and reinstall a service.
All steps of deployment can be easily done with the clusrun extension, which is essential for cluster administration.
(all links are below, drop a comment if something is missing/wrong)
- Download a ganglia version (3.1.2 Langley worked indeed very well)
- Download and install cygwin with a gcc and g++ compiler and the additional packages mentioned in the README.WIN file of the ganglia package
- Run ./configure, make and make install in the root directory of the confuse lib
- Perhaps you have to exclude the examples from the build:
replace line: SUBDIRS = m4 po src examples tests doc with
SUBDIRS = m4 po src tests doc
They threw an error on my system.
- Do: ./configure --with-libconfuse=/usr/local --enable-static-build and make in the root of ganglia
- With some additional DLL files from cygwin, your release is now runnable. Just start gmond.exe, check in the Event Viewer which DLLs are missing, and place them in the same folder or in a folder that is in the PATH.
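The optional SUBDIRS change from the libconfuse step above can also be scripted instead of edited by hand. The following is a sketch that works on the text of the makefile. Which file contains the line on your system is not stated in this post, so the file name is left to you, and the function name drop_subdir is made up for illustration.

```python
import re


def drop_subdir(makefile_text, name="examples"):
    """Remove one directory from every 'SUBDIRS = ...' line of a makefile."""

    def fix(match):
        # Keep all listed directories except the one to drop.
        kept = [d for d in match.group(2).split() if d != name]
        return match.group(1) + " ".join(kept)

    # Only spaces and tabs are allowed around '=', so the match never
    # runs over into the next line.
    pattern = r"^(SUBDIRS[ \t]*=[ \t]*)(.*)$"
    return re.sub(pattern, fix, makefile_text, flags=re.MULTILINE)


if __name__ == "__main__":
    before = "SUBDIRS = m4 po src examples tests doc\n"
    print(drop_subdir(before), end="")
```

To apply it, read the makefile into a string, pass it through drop_subdir and write the result back before running make.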
libapr1, expat, diffutils, gcc, make, python, sharutils, sunrpc
and for libconfuse:
Please note that this is a 32-bit x86 binary and not x64, because cygwin is not x64.
It should however be possible to build ganglia with the Windows Services for Unix to native x64.
Corresponding discussion in HPC2008 MS Forum
APR Consulting web page
Thursday, January 22, 2009
In order to get the "Show Desktop" button back, e.g. on Windows terminal servers, just execute the following command:
regsvr32 /n /i:U shell32
With the next reboot or upon restart of the Quick launch bar, the icon should appear.