Tuesday, December 16, 2008
Windows CCS Cluster Upgrade
One of the initial nodes has rejoined the cluster, so 28 Opteron cores are available again.
Because the cluster is used for CFD production runs, the user home was recently upgraded and the quota extended to 10 GB per user.
Furthermore, for special purposes and for a limited time, an extra project home with up to 120 GB of space is available for extensive usage.
Monday, December 15, 2008
PCI express revisited
Blocked copies, however, climb to 4.5 GB/s when writing data to GPU memory.
Copying data back to the host is still comparatively slow at 2 GB/s.
Link to first article
Monday, December 8, 2008
Fast Network, Fast disconnects (Linksys WRT610N )
Looking forward to streaming HD media over my new wireless router (WRT610N), I instead ran into serious trouble getting a stable connection at all.
With my network set up for WPA2 with TKIP for compatibility reasons, I got random disconnects of the whole 5 GHz band, while the 2.4 GHz band performed flawlessly. Searching the internet, I stumbled across serious accusations that the WRT610N is a flawed design and overheats a lot.
Whether that is true I cannot say for sure; however, I expected much more from Linksys and a premium home product.
Searching a little more, I came across another user's report that switching from TKIP to AES encryption solved the disconnect problem.
And voilà, the problem seems to be solved.
So for everyone who can live with AES-only encryption on the 5 GHz 802.11n band and TKIP or AES on the 2.4 GHz 802.11g band, the router is a great catch in both performance and appearance.
Monday, November 24, 2008
Yeehhaa: NVIDIA GT200 rocks
Some preliminary figures show the great improvement of this new generation, just as I expected from the data sheets. Soon I will post verified results here, along with some notes on the changes from the G80 generation to the current GT200 chip.
Friday, November 7, 2008
Running MPI Jobs on Windows CCS
As a consequence, four MPI processes are started.
In order to remove the redundant hostnames, call your program the following way from inside the scheduler:
mpiexec.exe -hosts %CCP_NODES: 4= 1%
%CCP_NODES: 4= 1% removes three out of four entries, reducing each hostname to a single occurrence, since identical hostnames are always listed consecutively.
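This relies on cmd.exe's built-in substring replacement %VAR:old=new%, which substitutes every occurrence of "old" with "new" in the variable's value. A minimal illustration with an assumed variable content (the real CCP_NODES is set by the scheduler):
set DEMO=node01 4 node02 4
echo %DEMO: 4= 1%
The echo prints "node01 1 node02 1", i.e. every " 4" is replaced by " 1"; applied to CCP_NODES, the same mechanism produces the reduced host list that is handed to mpiexec.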
Tuesday, October 21, 2008
Distributed Revision System Mercurial
Converting CVS to HG
To get hands-on knowledge of distributed revision control systems like Mercurial,
just export one of your CVS repositories to a test HG repository. Important for any repository: the history should stay intact (and hopefully will)!
A more complete guide can be found here:
Create the repository folder and enter it:
mkdir -p /path/to/hg/repo
cd /path/to/hg/repo
Generate the config file:
tailor -v --source-kind cvs --target-kind hg --repository /path/to/CVS/REP --module YourModuleName -r INITIAL >Config.tailor
For SSH access to the repository, change /path/to/CVS/REP to:
:ext:USERNAME@YOURSERVER:/path/to/cvsrep
Adjust the config file to your needs:
vi Config.tailor
Now you will at least need to change subdir from . to MODULENAME, and remove /MODULENAME from root-directory in Config.tailor (if it is really there).
Add the line:
patch-name-format =
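After editing, the relevant part of the config file might look roughly like this (only the keys discussed above are shown; the exact layout and section headers generated by tailor may differ):
root-directory = /path/to/hg/repo
subdir = MODULENAME
patch-name-format =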
Generate the Mercurial project:
tailor --configfile Config.tailor
Cloning repositories with ssh
To clone the repository, ssh can be used easily.
Just type the following: hg clone ssh://yourlogin@yourhost/
or insert ssh://yourlogin@yourhost// in your client program as the source path.
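As a concrete (hypothetical) example, cloning the converted repository and checking that the CVS history really survived:
hg clone ssh://yourlogin@yourhost//path/to/hg/repo/MODULENAME MyClone
cd MyClone
hg log -l 5
hg verify
hg log should list the old CVS commits, and hg verify performs an integrity check of the converted changesets.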
Thursday, October 16, 2008
Co-array Fortran and UPC
CAF and UPC are Fortran and C extensions for the Partitioned Global Address Space (PGAS) model.
Independent of hardware restrictions, each processor can access (read and write) data of other processors without the need for additional communication libraries such as MPI.
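A minimal UPC sketch (assumed for illustration, not taken from the course material): every thread owns one element of a shared array, and any thread can read another thread's element directly, without a single MPI call.
#include <upc.h>
#include <stdio.h>

shared int x[THREADS];               /* one element per UPC thread */

int main(void) {
    x[MYTHREAD] = 10 * MYTHREAD;     /* write to the locally owned element */
    upc_barrier;                     /* make all writes globally visible */
    if (MYTHREAD == 0 && THREADS > 1)
        printf("element owned by thread 1: %d\n", x[1]);   /* remote read */
    return 0;
}
Co-array Fortran expresses the same idea with co-dimensions, e.g. x[2] refers to the copy of x on image 2.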
HLRS provided an introductory course about this.
At the current development stage I do not clearly see the benefit for production codes. However, some ideas might be implemented more quickly with these paradigms than with ordinary MPI for testing purposes.
Monday, October 13, 2008
Theses
- Johannes Habich: Performance Evaluation of Numeric Compute Kernels on NVIDIA GPUs, Master's Thesis, RRZE-Erlangen, LSS-Erlangen, 2008.
- Johannes Habich: Improving computational efficiency of Lattice Boltzmann methods on complex geometries, Bachelor's Thesis, RRZE-Erlangen, LSS-Erlangen, 2006.
Other publications (not fully reviewed)
- G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Submitted. Preprint: arXiv:1208.2908
- J. Habich, C. Feichtinger, G. Wellein: GPGPU implementation of the LBM: Architectural Requirements and Performance Result, Parallel CFD Conference 2011, BSC, Barcelona, Spain, May 2011.
- G. Wellein, J. Habich, G. Hager, T. Zeiser: Node-level performance of the lattice Boltzmann method on recent multicore CPUs, Parallel CFD Conference 2011, BSC, Barcelona, Spain, May 2011.
- C. Feichtinger, J. Habich, H. Köstler, U. Rüde, G. Wellein: WaLBerla: Heterogeneous Simulation of Particulate Flows on GPU Clusters, Parallel CFD Conference 2011, BSC, Barcelona, Spain, May 2011.
- J. Habich, C. Feichtinger, G. Hager, G. Wellein: Poster: Parallelizing Lattice Boltzmann Simulations on Heterogeneous GPU&CPU Clusters. 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing '10, New Orleans, 13.11. -- 19.11.2010), 2010.
- J. Habich, T. Zeiser, G. Hager, G. Wellein: Enabling temporal blocking for a lattice Boltzmann flow solver through multicore aware wavefront parallelization. Parallel CFD Conference 2009, NASA AMES, Moffett Field (CA, USA), May 2009.
- S. Donath, T. Zeiser, G. Hager, J. Habich, G. Wellein: Optimizing performance of the lattice Boltzmann method for complex geometries on cache-based architectures, (In: F. Hülsemann, M. Kowarschik, U. Rüde (editors), Frontiers in Simulation -- Simulationstechnique, 18th Symposium in Erlangen, September 2005 (ASIM)), SCS Publishing, Fortschritte in der Simulationstechnik, ISBN 3-936150-41-9, (2005) 728-735.
Given or co-authored talks and presentations (see also section on lectures below)
- J. Habich, C. Feichtinger, G. Wellein, waLBerla: MPI parallele Implementierung eines LBM Lösers auf dem Tsubame 2.0 GPU Cluster, Seminar Talk, Leibniz Rechenzentrum, München, Germany, Feb. 29th 2012.
- J. Habich, C. Feichtinger, G. Wellein: Hochskalierbarer Lattice Boltzmann Löser für GPGPU Cluster, High Performance Computing Workshop, Leogang, Austria, Feb. 27th 2012.
- G. Wellein, J. Habich, G. Hager, T. Zeiser: Node-level performance of the lattice Boltzmann method on recent multicore CPUs I, Parallel CFD Conference 2011, Barcelona, Spain, May 2011.
- G. Wellein, J. Habich, G. Hager, T. Zeiser: Node-level performance of the lattice Boltzmann method on recent multicore CPUs II, Parallel CFD Conference 2011, Barcelona, Spain, May 2011.
- J. Habich, C. Feichtinger, G. Wellein: GPGPU implementation of the LBM: Architectural Requirements and Performance Result, Parallel CFD Conference 2011, Barcelona, Spain, May 2011.
- C. Feichtinger, J. Habich, H. Köstler, U. Rüde, G. Wellein: WaLBerla: Heterogeneous Simulation of Particulate Flows on GPU Clusters, Parallel CFD Conference 2011, Barcelona, Spain, May 2011.
- J. Habich, Ch. Feichtinger, G. Wellein: GPU optimizations at RRZE, invited talk, ZISC GPU Workshop, Erlangen, Germany, April 2011.
- G. Wellein, G. Hager, J. Habich: The Lattice Boltzmann Method: Basic Performance Characteristics and Performance Modeling, invited minisymposium talk, SIAM CSE 2011, Reno, Nevada, USA, March 2011.
- J. Habich, Ch. Feichtinger: Performance Optimizations for Heterogeneous and Hybrid 3D Lattice Boltzmann Simulations on Highly Parallel On-Chip Architectures, invited minisymposium talk, SIAM CSE 2011, Reno, Nevada, USA, March 2011.
- J. Habich, Ch. Feichtinger, T. Zeiser, G. Wellein: Optimizations on Highly Parallel On-Chip Architectures: GPUs vs. Multi-Core CPUs (for stencil codes), invited seminar talk, iRMB TU-Braunschweig, Braunschweig, Germany, July 2010.
- J. Habich, Ch. Feichtinger, T. Zeiser, G. Hager, G. Wellein: Performance Modeling and Optimization for 3D Lattice Boltzmann Simulations on Highly Parallel On-Chip Architectures: GPUs vs. Multi-Core CPUs, ECCOMAS CFD Lisboa, Lisbon, Portugal, June 2010.
- J. Habich, T. Zeiser, G. Hager, G. Wellein: Performance Modeling and Multicore-aware Optimization for 3D Parallel Lattice Boltzmann Simulations, Facing the Multicore-Challenge, Heidelberger Akademie der Wissenschaften, Heidelberg, Germany, March 2010.
- J. Habich, T. Zeiser, G. Hager, G. Wellein: Performance Evaluation of Numerical Compute Kernels on GPUs, First International Workshop on Computational Engineering - Special Topic Fluid-Structure Interaction, Herrsching am Ammersee, Germany, October 2009.
- J. Habich, T. Zeiser, G. Hager, G. Wellein: Towards multicore-aware wavefront parallelization of a lattice Boltzmann flow solver, 5th Erlangen High-End-Computing Symposium, Erlangen, Germany, June 2009.
- J. Habich, T. Zeiser, G. Hager, G. Wellein: Enabling temporal blocking for a lattice Boltzmann flow solver through multicore-aware wavefront parallelization, submitted to Parallel CFD Conference, Moffett Field, California, USA, May 18-22, 2009.
- J. Habich, T. Zeiser, G. Hager, G. Wellein: Speeding up a Lattice Boltzmann Kernel on nVIDIA GPUs, First International Conference on Parallel, Distributed and Grid Computing for Engineering (PARENG09-S01), Pecs, Hungary, April 2009.
- J. Habich, G. Hager: Erfahrungsbericht Windows HPC in Erlangen, WindowsHPC User Group 2nd Meeting, Dresden, Germany, March 2009.
- J. Habich, G. Hager: Windows CCS im Produktionsbetrieb und erste Erfahrungen mit HPC Server 2008, WindowsHPC User Group 1st Meeting, Aachen, Germany, April 2008.
- T. Zeiser, J. Habich, G. Hager, G. Wellein: Vector computers in a world of commodity clusters, massively parallel systems and many-core many-threaded CPUs: recent experience based on advanced lattice Boltzmann flow solvers, HLRS Results and Review Workshop, Stuttgart, Germany, September 2008.
- S. Donath, T. Zeiser, G. Hager, J. Habich, G. Wellein: On cache-optimized implementations of the lattice Boltzmann method on complex geometries, ASIM, Erlangen, Germany, September 2005.
Conference, workshop and tutorial participation without own presentation
- WindowsHPC User Group 3rd Meeting, St. Augustin, March 2010.
- WindowsHPC User Group 2nd Meeting, Dresden, March 2009.
- Introduction to Unified Parallel C (UPC) and Co-array Fortran (CAF), HLRS, October 2008.
- Course on Microfluidics, University of Erlangen-Nuremberg, Computer Science 10 (System Simulation), October 2008.
- IBM Power6 Programming Workshop at RZG, September 2008.
- PRACE Petascale Summer School (P2S2), Stockholm, Sweden, August, 2008.
Wednesday, September 10, 2008
PCI express bandwidth measurements
However, transferring more than 4 MB of data, but in 4 MB packets (let's call it blocked copy), does leave a gap in performance.
Although the performance is regained towards the end, when almost the whole GPU memory is filled, the question is what causes the performance to drop to 2 GB/s in the first place.
Another interesting question is the jump in performance at 1e6 bytes, possibly caused by a switch in transfer protocols.
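For reference, here is a rough sketch of how such host-to-device figures can be taken (an assumed setup with pinned memory and CUDA events, not necessarily the exact benchmark used here; error checking omitted for brevity):
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 4u << 20;          // 4 MB packet size
    const int    reps  = 100;               // number of copies to average over
    void *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);          // pinned host memory
    cudaMalloc(&d_buf, bytes);              // device buffer

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds
    printf("Host-to-device bandwidth: %.2f GB/s\n",
           (double)bytes * reps / (ms * 1.0e-3) / 1.0e9);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
Swapping the cudaMemcpy direction to cudaMemcpyDeviceToHost gives the corresponding read-back figure.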
HPC Server 2008 launch
Tuesday, September 2, 2008
Towards Teraflops for Games
Until now I have not been able to test my own algorithms, the STREAM benchmarks and the lattice Boltzmann method (see my thesis for more details), on the new NVIDIA GPUs.
Double precision has also made its way into the GPU circuits, unfortunately with a huge performance loss, down to around a tenth of the single precision performance.
In contrast, current CPUs lose only about 50% of their performance, which follows naturally from the doubled computational work and data volume.
Here is a little demonstration of the key difference between CPU and GPU from NVISION.
Windows HPC Deployment
The Windows High Performance Cluster Competence Center located at the RWTH Aachen is giving tutorials for administrators on Windows HPC 2008 deployment. Please find more detailed information on their webpage.
Monday, September 1, 2008
Windows HPC Event at RWTH Aachen
The Windows High Performance Cluster Competence Center located at the RWTH Aachen is giving tutorials on using Windows HPC 2008, the upcoming version of Windows Compute Cluster Server. Please find more detailed information on their webpage.
PRACE Petascale Summer School
Taking place this week (25th to 29th of August) in Stockholm, Sweden, the PRACE Summer School tries to evaluate the needs of the current academic HPC user community. The general aim is to derive benchmarks and metrics for future petascale systems.
Current surveys show that only a small portion of the leading HPC systems is used for large, massively parallel jobs. A great deal of jobs stays below 10% of a single supercomputer's resources, not utilizing the parallel capabilities of such a machine.
On the other hand, profound knowledge is needed to make common algorithms scale beyond 64 to 128 nodes towards 1K nodes or more. Therefore, a lot of emphasis was put on techniques and hands-on sessions covering exactly that topic.
Summed up, it was a great event to gain additional skills and training, as well as to get to know the different kinds of algorithms and user expectations in HPC.
First Shot
So any remarks or hints? Drop them here!
[Image: cuda hpc]
Thanks
/edit 01.07.08
Thesis finished :-)