Thursday, September 16, 2010
SKALB Conferences
Conferences at which I have presented SKALB-related results:
- 2010
- Facing the Multicore-Challenge, Heidelberger Akademie der Wissenschaften, Heidelberg, Germany, March 2010
- ECCOMAS CFD 2010, Lisbon, Portugal, June 2010
- 2009
- First International Workshop on Computational Engineering – Special Topic Fluid-Structure Interaction, Herrsching am Ammersee, Germany, October 2009
- 5th Erlangen High-End-Computing Symposium, Erlangen, Germany, June 2009
- First International Conference on Parallel, Distributed and Grid Computing for Engineering (PARENG09-S01), Pecs, Hungary, April 2009
Friday, September 3, 2010
TinyGPU offers new hardware
TinyGPU has new hardware: tg010. The hardware configuration and the currently deployed software differ from the non-Fermi nodes:
- Ubuntu 10.04 LTS (instead of 8.04 LTS) as OS. Note: to use the Intel Compiler <= 11.1 locally on tg010, you currently have to load the gcc/3.3.6 module; otherwise libstdc++.so.5 is missing, as Ubuntu 10.04 no longer ships this version. This is only necessary for compilation; compiled Intel binaries will run as expected.
- /home/hpc and /home/vault are mounted [only] through NFS (and not natively via the GPFS cross-cluster mount)
- Dual-socket system with Intel Westmere X5650 (2.66 GHz) processors with 6 native cores per socket (instead of a dual-socket system with Intel Nehalem X5550 (2.66 GHz) with 4 native cores per socket)
- 48 GB DDR3 RAM (instead of 24 GB DDR3 RAM)
- 1x NVIDIA Tesla C2050 ("Fermi" with 3 GB GDDR5 featuring ECC)
- 1x NVIDIA GTX 280 (consumer card with 1 GB RAM – formerly known as F22)
- 2 further PCIe 2.0 x16 slots will be equipped with NVIDIA C2070 cards ("Fermi" with 6 GB GDDR5 featuring ECC) in Q4, instead of the 2x NVIDIA Tesla M1060 ("Tesla" with 4 GB RAM) found in the remaining cluster nodes
- SuperServer 7046GT-TRF / X8DTG-QF with dual Intel 5520 (Tylersburg) chipset (instead of SuperServer 6016GT-TF-TM2 / X8DTG-DF with Intel 5520 (Tylersburg) chipset)
To allocate the Fermi node, specify :ppn=24 with your job (instead of :ppn=16) and explicitly submit to the TinyGPU queue fermi. The wallclock limit is set to the default of 24h. The ECC memory status is shown on job startup. This article is a translation of the original posted here: Zuwachs im TinyGPU-Cluster
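For illustration, a Torque-style submission for the Fermi node might look like this (only a sketch; the job script name is a placeholder, not taken from the original post):
# Sketch: request the whole Fermi node via the fermi queue
qsub -q fermi -l nodes=1:ppn=24 ./my_gpu_job.sh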
Wednesday, August 4, 2010
Intel 5300 AGN Wireless Network adapter drops connection in Win7
When connecting to a wireless N network, my XT2 quite often drops the connection when certain amounts of data are transferred, i.e. mostly when the data rate is fairly high (> 5 MB/s).
Following different suggestions about disabling power management, N-network compatibility, and even encryption, I tried various settings in the device driver in Win7.
Basically I changed just one entry:
Fat Channel Intolerant
and the disconnects went away.
Intel states about Fat Channel Intolerant:
Fat Channel Intolerant
This setting communicates to surrounding networks that this Wi-Fi adapter is not tolerant of 40MHz channels in the 2.4GHz band. The default setting is for this to be disabled (turned off) so that the adapter does not send this notification.
Note These settings are available only if the adapter is an Intel® WiMAX/WiFi Link 5350, Intel® WiMAX/WiFi Link 5150, Intel® WiFi Link 5300, Intel® WiFi Link 5100 or Intel® Wireless WiFi Link 4965AGN.
Source:
My opinion is that the driver did not send this notification properly, was given a wide 40 MHz channel, and then dropped out.
Switching to the wide channel typically happens when the data rate is high and increasing, in order to provide maximum bandwidth.
Any comments that would clear up this behaviour are appreciated. Even more interesting would be some insight into how this behaves under Linux.
/edit
The disconnects still occur, but less often.
/edit
With the newest driver from the Dell homepage the problems are gone so far. (27.08.2010)
/edit
So far it works with the current driver 3.15.0 obtained directly from the Intel homepage.
/edit 19.09.2011
Finally I reinstalled Win7 due to some other concerns (partitioning etc.), and with the default Win7 / Windows Update driver it has been working completely flawlessly for two weeks now.
Maybe the repeated installs and uninstalls broke something in the networking setup; basically, I don't care.
Wednesday, July 21, 2010
Windows HPC and Ganglia Monitoring
We had the common problem that we needed to restart Ganglia a lot lately, as several nodes no longer reported their data; furthermore, the service sometimes never came up in the first place.
We ran several tests, and it seems that the Ganglia clients send a specific packet at startup; only at startup and only once. If this packet is not received, the server does not display any data from that client, although the data is actually collected and sent.
Randomly, some nodes cannot get this initial packet through and are consequently not displayed.
We therefore start the clients with a time offset and ensure that all clients can report to the server in a fair fashion.
The restart issue still remains, but we could extend the restart period to 6h.
Thanks to my colleague Kosta G. for finding this issue.
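As an illustration of the staggered start, here is a minimal sketch (node names, service name and delay are assumptions, not our actual setup; on the Windows HPC nodes the same idea can be applied via clusrun or scheduled tasks):
# Sketch: restart the ganglia client node by node with a delay, so the
# initial announcement packets do not collide
for node in node01 node02 node03; do
  ssh "$node" "service gmond restart"
  sleep 10
done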
Friday, June 11, 2010
Thread Pinning/Affinity
Thread pinning is very important in order to get feasible and reliable results on today's multi- and manycore architectures. Otherwise threads may migrate from one core to another, wasting clock cycles. Even more importantly, if you have placed your memory correctly by first touch on ccNUMA systems, e.g. an SGI Altix or any dual-socket Intel Xeon Core i7 system, a thread that migrates to the other socket has to access its memory across the QPI interface connecting the two sockets.
Jan Treibig developed a tool for that called likwid-pin.
A sample usage would be as follows:
likwid-pin -s 0 -c"0,1,2,3,4,5,6,7" ./Exec
This pins the 8 threads of the executable to cores 0 to 7.
For information about the topology, just use the companion tool, likwid-topology, which shows you the cache and core hierarchy.
The skip mask (-s) is important and specific to the threading implementation. Also consider that in hybrid programs, e.g. OpenMP plus MPI, multiple shepherd threads are present.
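For example, in such a hybrid run one could skip the shepherd thread via the skip mask; a minimal sketch with assumed values (the correct mask depends on the MPI and OpenMP implementations in use):
# Sketch: skip the first (shepherd) thread and pin the four worker threads of a rank to cores 0-3
likwid-pin -s 0x1 -c 0,1,2,3 ./Exec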
Friday, May 28, 2010
Single Precision: Friend or Foe
The recent development of so-called disruptive technologies always leads to some kind of everlasting discussion.
Today I want to say something about the debate over whether GPUs are feasible at all for scientific computing, since their double precision performance is nowadays not too far away from that of standard CPUs. And single precision is supposedly not worth discussing, as nobody wants to board a plane or a ship that was simulated only in single precision.
Detour
So, for non-simulators, first some explanation: single precision means a floating-point representation of a number using 4 bytes. Double precision uses 8 bytes and can therefore provide much more accuracy.
GPUs were originally designed for graphics applications, which do not even need full single precision. There are a bunch of very fast FLOP instructions working on just 24 bits instead of 32 bits (again: 32 bits = 4 bytes = single precision).
E.g. current NVIDIA cards have just 1 dp FLOP unit per 8 sp FLOP units.
Up to this point it is obvious why everyone complains about the poor dp performance compared to sp performance. However, nobody (well, I do) complains about the low dp performance you actually get out of a current x86 processor. In some system configurations you will get only about 10% of peak performance, or even less.
This is because data is brought to the compute units much more slowly than it can be processed there.
This is true for most scientific codes, e.g. stencil codes. Therefore you will see the usual breakdown to 50% of performance when switching from sp to dp on GPUs, just as you do on CPUs, because you simply transfer twice the data over the same system bus.
So the dp units are most often not the limit of compute performance.
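A back-of-the-envelope sketch of this argument (all numbers are assumptions for illustration, not measurements): a kernel that loads and stores one array element per update moves 16 bytes in dp but only 8 bytes in sp, so an assumed memory bandwidth of 20 GB/s caps the dp update rate at half the sp rate, no matter how many FLOP units are available.
# Sketch: bandwidth-limited update rates for an assumed 20 GB/s memory bus
BW=20000000000            # assumed memory bandwidth in bytes/s
echo $((BW / 16))         # dp: 16 bytes per update -> 1.25 billion updates/s
echo $((BW / 8))          # sp:  8 bytes per update -> 2.5 billion updates/s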
Thursday, May 20, 2010
LaTex: Floatflt.sty missing on ubuntu lucid 10.04
After the recent upgrade to the new Ubuntu stable version, I first thought that not all TeX Live resources had been installed.
However, the license of floatflt.sty has changed, so it is no longer included in Ubuntu or TeX Live.
Here's a quick guide to re-enable it.
Problem:
LaTeX Error: File `floatflt.sty' not found
Solution (to be run as root):
sudo mkdir -p /usr/share/texmf-texlive/tex/latex/floatflt
cd /usr/share/texmf-texlive/tex/latex/floatflt
sudo rm -f floatflt.* float*.tex
sudo wget http://mirror.ctan.org/macros/latex/contrib/floatflt/floatflt.ins
sudo wget http://mirror.ctan.org/macros/latex/contrib/floatflt/floatflt.dtx
sudo latex floatflt.ins
sudo texhash /usr/share/texmf-texlive
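A variant of the same steps that avoids touching the system tree should also work (untested sketch, assuming a standard per-user texmf tree):
mkdir -p ~/texmf/tex/latex/floatflt
cd ~/texmf/tex/latex/floatflt
wget http://mirror.ctan.org/macros/latex/contrib/floatflt/floatflt.ins
wget http://mirror.ctan.org/macros/latex/contrib/floatflt/floatflt.dtx
latex floatflt.ins
# TeX Live usually searches ~/texmf automatically, so no texhash should be needed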
Source of the suggestion, with a discussion whether to use a backport ........
I would appreciate any hint towards a solution that does this more automatically; please drop me a comment with your solution.
Monday, May 10, 2010
JUROPA MPI Buffer on demand
To enable huge runs with lots of MPI ranks, you have to disable the all-to-all send buffers that are allocated by default on the NEC Nehalem cluster Juropa at FZ Jülich.
Juropa Introduction @ FZJ
Here is an excerpt from the official documentation:
Most MPI programs do not need every connection:
- nearest-neighbor communication
- Scatter/Gather and Allreduce based on binary trees
- typically just a few dozen connections when having hundreds of processes
ParaStation MPI supports this with "on demand connections":
- export PSP_ONDEMAND=1
- was used for the Linpack runs (np > 24000)
- mpiexec --ondemand
Drawback:
- late all-to-all communication might fail due to lack of memory
The default on JuRoPA is not to use "on demand connections".
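For illustration, enabling this in a job script might look like the following sketch (the rank count and application name are placeholders, not taken from the documentation):
# Sketch: enable on-demand connections for a large ParaStation MPI run
export PSP_ONDEMAND=1
mpiexec -np 24576 ./my_app
# alternatively: mpiexec --ondemand -np 24576 ./my_app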
Links
Juropa Introduction @ FZJ
Tuesday, December 1, 2009
Windows HPC2008 Cluster Operational
Today the Windows HPC2008 cluster at RRZE successfully went into operation.
If you are interested in getting access to the system, contact hpc@rrze.uni-erlangen.de
Initial information on login and usage can be found here:
Windows HPC2008 Cluster Launch Slides