Friday, June 11, 2010
Thread Pinning/Affinity
Thread pinning is essential for getting meaningful and reliable results on today's multi- and manycore architectures. Without it, threads migrate from one core to another and waste clock cycles. Even more important: if you have placed your memory correctly via first touch on a ccNUMA system, e.g. an SGI Altix or any dual-socket Intel Xeon (Core i7 based) machine, a thread that gets migrated to the other socket has to access its memory over the QPI interface connecting the two sockets.
Jan Treibig developed a tool for that called likwid-pin.
A sample usage would be as follows:
likwid-pin -s 0 -c"0,1,2,3,4,5,6,7" ./Exec
This pins the 8 threads of the executable to cores 0 to 7.
For information about the topology, use the companion tool likwid-topology, which shows you the cache and core hierarchy.
The skip mask (-s) is important and specific to the threading implementation. Also consider that in hybrid programs, e.g. OpenMP plus MPI, multiple shepherd threads are present; a sketch of a non-trivial mask follows below.
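As a rough illustration only (the correct mask value depends on the compiler's OpenMP runtime and, for hybrid codes, on the MPI library, so check the likwid-pin documentation for your setup):
# Assumption: the runtime spawns one extra shepherd thread that must not be
# pinned; the hex skip mask 0x1 skips it. Verify the value for your runtime.
export OMP_NUM_THREADS=8
likwid-pin -s 0x1 -c"0,1,2,3,4,5,6,7" ./Exec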
Friday, May 28, 2010
Single Precision: Friend or Foe
The recent developments of so-called disruptive technologies always lead to some kind of everlasting discussion.
Today I want to say something about the debate over whether GPUs are feasible at all for scientific computing, given that their double precision performance is nowadays not too far ahead of that of standard CPUs. Single precision, so the argument goes, is not worth discussing, as nobody wants to board a plane or a ship that was simulated only in single precision.
Detour
First, some explanation for non-simulators: single precision means a floating point representation of a number using 4 bytes. Double precision uses 8 bytes and can therefore provide much more accuracy.
GPUs were originally designed for graphics applications, which do not even need full single precision; there is a bunch of very fast floating point instructions that work on just 24 bits instead of 32 bits (again, 32 bits = 4 bytes = single precision).
Current NVIDIA cards, for example, have just 1 dp floating point unit per 8 sp units.
Up to this point it is obvious why everyone complains about the poor dp performance compared to sp performance. However, nobody (well, I do) complains about the low dp performance you actually get out of a current x86 processor. On some system configurations you will see only about 10% of peak performance, or even less.
The reason is that data is brought to the computing units much more slowly than it can be processed there.
This is true for most scientific codes, e.g. stencil codes. Therefore, when switching from sp to dp, you will see the usual drop to 50% of the sp performance on GPUs just as on CPUs, because you simply transfer twice the data over the same system bus.
So the dp units are most often not what limits compute performance.
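To put a (deliberately round, assumed) number on this: an update like A(i) = B(i) + s * C(i) performs 2 flops per element but moves two loads, one store and the write-allocate for the store, i.e. 32 bytes of dp data, through the memory interface. That is 16 bytes per flop; with, say, 10 GB/s of sustained memory bandwidth the loop cannot exceed roughly 0.6 GFLOP/s, regardless of whether the chip has a dp peak of 40 GFLOP/s or ten times that.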
Thursday, May 20, 2010
LaTeX: floatflt.sty missing on Ubuntu Lucid 10.04
The recent upgrade to the new Ubuntu stable release had missed installing all TeX Live resources, or so I thought at first.
However, the license of floatflt.sty has changed, so it is no longer shipped with Ubuntu or TeX Live.
Here's a quick guide to re-enable it.
Problem:
LaTeX Error: File `floatflt.sty' not found
Solution:
sudo mkdir -p /usr/share/texmf-texlive/tex/latex/floatflt
cd /usr/share/texmf-texlive/tex/latex/floatflt
sudo rm -f floatflt.* float*.tex
sudo wget http://mirror.ctan.org/macros/latex/contrib/floatflt/floatflt.ins
sudo wget http://mirror.ctan.org/macros/latex/contrib/floatflt/floatflt.dtx
sudo latex floatflt.ins
sudo texhash /usr/share/texmf-texlive
Source of the suggestion, with a discussion whether to use a backport: ........
I would appreciate any hint towards a solution that does this more automatically; please drop me a comment with your approach.
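One less invasive variant (a sketch, assuming a default TeX Live user tree at ~/texmf; it is not more automatic either, but it avoids touching the system directories and survives distribution upgrades):
# Build floatflt in the per-user texmf tree instead of the system one.
mkdir -p ~/texmf/tex/latex/floatflt
cd ~/texmf/tex/latex/floatflt
wget http://mirror.ctan.org/macros/latex/contrib/floatflt/floatflt.ins
wget http://mirror.ctan.org/macros/latex/contrib/floatflt/floatflt.dtx
latex floatflt.ins
# With a default TeX Live setup no texhash run should be needed for the user tree.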
Monday, May 10, 2010
JUROPA MPI Buffer on demand
To enable huge runs with lots of MPI ranks, you have to disable the all-to-all send buffers that are allocated by default on the NEC Nehalem cluster Juropa at FZ Jülich.
Here is an excerpt from the official documentation:
Most MPI programs do not need every connection
- Nearest neighbor communication
- Scatter/Gather and Allreduce based on binary trees
- Typically just a few dozen connections when having hundreds of processes
- ParaStation MPI supports this with "on demand connections"
- export PSP_ONDEMAND=1
- was used for the Linpack runs (np > 24000)
- mpiexec --ondemand
- Drawback
- Late all-to-all communication might fail due to short memory
- Default on JuRoPA is not to use "on demand connections"
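A minimal sketch of how this might look in a job script (the rank count and the executable name are placeholders; only PSP_ONDEMAND and --ondemand are taken from the excerpt above):
# Either set the environment variable before launching ...
export PSP_ONDEMAND=1
mpiexec -np 8192 ./mpi_app
# ... or, equivalently, pass the flag directly:
# mpiexec --ondemand -np 8192 ./mpi_app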
Links
Juropa Introduction @ FZJ
Tuesday, December 1, 2009
Windows HPC2008 Cluster Operational
Today the Windows HPC 2008 cluster of RRZE successfully went into operation.
If you are interested in getting access to the system, contact hpc@rrze.uni-erlangen.de
Initial information on login and usage can be found here:
Windows HPC2008 Cluster Launch Slides
Thursday, November 19, 2009
Java: A quest with unattended installation
Some guidelines for unattended Java installation in a Windows 2008 HPC cluster environment:
- Deactivate UAC on all nodes; otherwise the nodes will simply hang and wait for a UAC confirmation that will never come. You can avoid this by doing the first Java installation by hand via a Remote Desktop login; afterwards all subsequent unattended installations will succeed. We currently have no clue why. Perhaps some kind of adaptive UAC?
Best practice is to deactivate UAC via a registry key:
%windir%\system32\reg.exe ADD HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System /v EnableLUA /t REG_DWORD /d 0 /f
- Reboot the nodes so that the registry change becomes effective.
- Everything will now run smoothly, provided the user installing Java has logged in to each node at least once. With 20+ cluster nodes this poses a problem, however. The basic point is that without that login there is no user profile directory yet, and neither are the temp and AppData paths.
Java kindly ignores the variables defined by the OS, e.g. TEMP or TMP, and derives its own temp directories, which leads to C:\Users\Username\AppData\LocalLow\Temp and several more.
So the installation fails once more unless these directories exist.
So you have to create them yourself:
mkdir C:\Users\%USERNAME%\AppData\LocalLow\Temp\
- After that, the usual unattended JRE deployment should proceed.
Note that any login to the nodes to be installed, or any Java installation prior to that, can change all of the above observations.
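For completeness, a hedged per-node sketch of the workaround plus the silent install (the share path and installer file name are hypothetical placeholders; the /s silent switch assumes a JRE 6 offline installer, so check your installer's documentation):
:: Create the temp directory Java insists on, then run the installer silently.
:: Both paths below are placeholders for your own deployment share and JRE version.
mkdir C:\Users\%USERNAME%\AppData\LocalLow\Temp
\\headnode\install\jre-6-windows-x64.exe /s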
Wednesday, November 18, 2009
Windows HPC 2008 Cluster Launch
RRZE recently extended its Windows high performance computing resources.
Along with the upgrade to the latest Windows HPC Server 2008 release, the hardware has been upgraded significantly:
16 dual-socket nodes with hexa-core AMD Istanbul Opteron processors (Dell blade center enclosure), equipped with 32 GB of RAM, deliver a peak performance of 2 TFLOP/s.

Interested users are invited to join the official launch on December 1st, 2009, in RRZE room 1.026.
After a quick tour of the new job scheduler, the main part is organized as a hands-on session where everyone can get comfortable with the new environment.
Registration via email to hpc@rrze.uni-erlangen.de is necessary to attend.

Designated trademarks and brands are the property of their respective owners
Monday, October 5, 2009
Ganglia 3.1.2 Running as a Service After All
With the help of srvany.exe from the Windows Server 2003 Resource Kit Tools you can run almost any executable as a service on Windows Server 2008 and 2008 R2.
First, create a service that runs solely srvany.exe:
sc create GMOND binpath= c:\programme\ganglia\srvany.exe
Edit the service specs in the registry:
HKEY_LOCAL_MACHINE --> System\CurrentControlSet\Services\GMOND
Add a subkey named Parameters
Inside "Parameters" create a String value named Application.
Edit Application and put the call to ganglia into the value data field.
E.g. c:\programme\ganglia\gmond.exe -c "c:\programme\ganglia\gmond-node.conf"
Start the service via mmc or with sc start GMOND, and it should be running.
(There should also be a way to do this with the cygwin service creation tool cygrunsrv. Thanks to Nigel for pointing that out.)
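For reference, the registry part can also be scripted; a sketch that simply mirrors the steps above (service name and paths as in the example):
:: Create the Parameters subkey and the Application value that srvany.exe reads.
:: The gmond paths are the same as in the example above.
reg add HKLM\SYSTEM\CurrentControlSet\Services\GMOND\Parameters /v Application /t REG_SZ /d "c:\programme\ganglia\gmond.exe -c c:\programme\ganglia\gmond-node.conf" /f
sc start GMOND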
Thursday, September 24, 2009
PCI express pinned Host Memory
Retesting my benchmarks with the current CUDA 2.3 release, I finally incorporated newer features like pinned host memory allocation. The specs say this improves host-to-device transfers and vice versa.
Due to the special allocation, the arrays stay at the same location in physical memory, will not be swapped out, and are directly available for DMA transfers. With ordinary (pageable) allocations, the data takes a detour through an internal pinned staging buffer on its way between host memory and the device; pinned allocation omits this extra copy.
The performance plot shows that pinned memory now delivers up to 5.9 GB/s on the fastest currently available PCIe x16 Gen 2 interface, which has a peak transfer rate of 8 GB/s. This corresponds to 73% of peak with almost no optimization applied. In contrast, optimizations such as blocked data transfers, which proved to increase performance some time ago [PCIe revisited], no longer have any positive effect.
Using only the blocked optimization without pinned memory is still better than an unblocked transfer from unpinned memory, but it reaches only about 4.5 GB/s to the device, which corresponds to 56% of peak.
Reading from the device is far worse, at only 2.3 GB/s.
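For illustration, a minimal sketch of such a transfer measurement (not the original benchmark; the buffer size, the single unblocked copy and the timing via CUDA events are assumptions):
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 64UL << 20;              /* 64 MiB test buffer */
    float *h_pinned, *d_buf;

    /* Page-locked (pinned) host allocation: stays resident and is eligible
       for direct DMA, so cudaMemcpy can skip the internal staging buffer. */
    cudaMallocHost((void **)&h_pinned, bytes);
    cudaMalloc((void **)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("host->device: %.2f GB/s\n", bytes / (ms * 1.0e6));

    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    return 0;
}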