Tuesday, December 16, 2008

Windows CCS Cluster Upgrade

Recently the Windows CCS Cluster of the RRZE got a small upgrade.

One of the initial nodes rejoined the cluster and there are now 28 Opteron Cores available again.
Due to the use of CFD for production runs, the user home was recently upgraded and the quota was extended to 10 GB per user.
Furthermore, for special purposes and for a limited time, an extra project home with up to 120 GB of space is available for more extensive usage.

Monday, December 15, 2008

PCI express revisited

Test results with the new generation, i.e. the GT200 based card with PCIe generation 2.0 and thus doubled interface performance, show that plain, naively implemented copies do not get any speedup.
Blocked copies, however, climb up to 4.5 GB/s when writing data to GPU memory.
Copying data back to the host is still relatively slow at 2 GB/s.
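
For reference, here is a minimal CUDA sketch of what I mean by a blocked copy: the payload is transferred in fixed-size chunks instead of a single large cudaMemcpy. Buffer size, chunk size and the use of pinned host memory are illustrative assumptions, not the exact parameters of my benchmark.

  #include <cuda_runtime.h>
  #include <stdio.h>

  #define CHUNK (4u << 20)            /* 4 MB blocks            */
  #define TOTAL (256u << 20)          /* 256 MB overall payload */

  int main(void)
  {
      char *host, *dev;
      cudaMallocHost((void **)&host, TOTAL);   /* pinned host buffer */
      cudaMalloc((void **)&dev, TOTAL);

      cudaEvent_t start, stop;
      cudaEventCreate(&start);
      cudaEventCreate(&stop);

      cudaEventRecord(start, 0);
      /* blocked copy: push the payload to the GPU in 4 MB pieces */
      for (size_t off = 0; off < TOTAL; off += CHUNK)
          cudaMemcpy(dev + off, host + off, CHUNK, cudaMemcpyHostToDevice);
      cudaEventRecord(stop, 0);
      cudaEventSynchronize(stop);

      float ms;
      cudaEventElapsedTime(&ms, start, stop);
      printf("host->device: %.2f GB/s\n", (TOTAL / 1.0e9) / (ms / 1.0e3));

      cudaEventDestroy(start);
      cudaEventDestroy(stop);
      cudaFree(dev);
      cudaFreeHost(host);
      return 0;
  }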

Figure: PCIe bandwidth measurements, 8800 GTX vs. GTX 280



Link to first article

Monday, December 8, 2008

Fast Network, Fast disconnects (Linksys WRT610N )

Looking forward to streaming HD media fast over my new wireless router (WRT610N), I ran into serious trouble getting a stable connection at all.

Having my network set up with WPA2 and TKIP for compatibility reasons, I got random disconnects of the whole 5 GHz band, while the 2.4 GHz band performed flawlessly. Searching the internet, I stumbled across some serious accusations that the WRT610N is a flawed design and overheats a lot.
Whether this is true or not I cannot say for sure; however, I expected much more from Linksys and a premium home product.

Searching a little more, I came across another user's report that switching from TKIP encryption to AES solved the problem of the recurring disconnects.

And voilà, the problem seems to be solved.

So for everyone who can live with AES-only encryption on the 5 GHz 802.11n band and TKIP or AES on the 2.4 GHz 802.11g band, the router is a perfect catch in both performance and appearance.

Monday, November 24, 2008

Yeehhaa: NVIDIA GT200 rocks

A sample of the new GT200 based NVIDIA GTX 280 graphics card arrived at our computing center last Friday. The card was installed and set up right away; the first benchmark started on Saturday, November 22nd, and finished today.

Preliminary figures show the great improvement of this new generation, as I expected from the data sheets. Soon I will post verified results here, along with some notes on the changes from the G80 generation to the current GT200 chip.

Friday, November 7, 2008

Running MPI Jobs on Windows CCS

In order to run only one MPI process per allocated node on the Windows CCS Cluster, you have to tweak the system variable set by the scheduler. For each allocated processor, the system variable CCP_NODES references the associated hostname once.
As a consequence, four MPI processes would be started on each of our four-core nodes.

In order to remove the redundant hostname entries, call your program the following way from inside the scheduler:
mpiexec.exe -hosts %CCP_NODES: 4= 1%

%CCP_NODES: 4= 1% removes three out of four entries, which reduces each hostname to a single occurrence, as identical hostnames are always consecutive.

Tuesday, October 21, 2008

Distributed Revision System Mercurial

Converting CVS to HG



To get hands-on knowledge of distributed revision control systems like Mercurial,
just convert one of your CVS repositories to a test Mercurial repository. Important for any repository: the history should stay intact (and hopefully will)!

A more complete guide can be found here:

Generate the repository folder and enter it:
mkdir -p /path/to/hg/repo
cd /path/to/hg/repo

Generate the config file:

tailor -v --source-kind cvs --target-kind hg --repository /path/to/CVS/REP --module YourModuleName -r INITIAL >Config.tailor


For SSH access to the repository, change /path/to/CVS/REP to:
:ext:USERNAME@YOURSERVER:/path/to/cvsrep


Adapt the config file to your needs:
vi Config.tailor

Now you will at least need to change subdir from . to MODULENAME, and remove /MODULENAME from root-directory in the Config.tailor file (if it is really there).

Add the line:

patch-name-format =

Generate the Mercurial project:

tailor --configfile Config.tailor




Cloning repositories with ssh



To clone the repository, SSH can be used easily.
Just type hg clone ssh://yourlogin@yourhost/ followed by the repository path, or insert ssh://yourlogin@yourhost// plus the path in your client program as the source path.
Note that a single slash after the hostname makes the path relative to your remote home directory, while a double slash makes it absolute.


Thursday, October 16, 2008

Co-array Fortran and UPC


CAF and UPC are Fortran and C extensions for the Partitioned Global Address Space (PGAS) model.
Independent of hardware restrictions, each processor can access (read and write) data of other processors without the need for additional communication libraries such as MPI.

HLRS provided an introductory course about this.
At the current development stage I do not clearly see the benefit for production codes. However, some ideas might be implemented more quickly with these paradigms than with ordinary MPI for testing purposes.

Monday, October 13, 2008

Theses


  • Johannes Habich: Performance Evaluation of Numeric Compute Kernels on NVIDIA GPUs, Master's Thesis, RRZE-Erlangen, LSS-Erlangen, 2008.

  • Johannes Habich: Improving computational efficiency of Lattice Boltzmann methods on complex geometries, Bachelor's Thesis, RRZE-Erlangen, LSS-Erlangen, 2006.

Other publications (not fully reviewed)

  • G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Submitted. Preprint: arXiv:1208.2908

  • J. Habich, C. Feichtinger, G. Wellein: GPGPU implementation of the LBM: Architectural Requirements and Performance Result,
    Parallel CFD Conference 2011, BSC, Barcelona, Spain, May 2011.

  • G. Wellein, J. Habich, G. Hager, T. Zeiser: Node-level performance of the lattice Boltzmann method on recent multicore CPUs,
    Parallel CFD Conference 2011, BSC, Barcelona, Spain, May 2011.

  • C. Feichtinger, J. Habich, H. Köstler, U. Rüde,  G. Wellein: WaLBerla: Heterogeneous Simulation of Particulate Flows on GPU Clusters,
    Parallel CFD Conference 2011, BSC, Barcelona, Spain, May 2011.

  • J. Habich, C. Feichtinger, G. Hager, G. Wellein: Poster: Parallelizing Lattice Boltzmann Simulations on Heterogeneous GPU&CPU Clusters. 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing '10, New Orleans, 13.11. -- 19.11.2010), 2010.

  • J. Habich, T. Zeiser, G. Hager, G. Wellein: Enabling temporal blocking for a lattice Boltzmann flow solver through multicore-aware wavefront parallelization. Parallel CFD Conference 2009, NASA AMES, Moffett Field (CA, USA), May 2009.

  • S. Donath, T. Zeiser, G. Hager, J. Habich, G. Wellein: Optimizing performance of the lattice Boltzmann method for complex geometries on cache-based architectures, (In: F. Hülsemann, M. Kowarschik, U. Rüde (editors), Frontiers in Simulation -- Simulationstechnique, 18th Symposium in Erlangen, September 2005 (ASIM)), SCS Publishing, Fortschritte in der Simulationstechnik, ISBN 3-936150-41-9, (2005) 728-735.

Given or co-authored talks and presentations (see also section on lectures below)


  • J. Habich, C. Feichtinger, G. Wellein, waLBerla: MPI parallele Implementierung eines LBM Lösers auf dem Tsubame 2.0 GPU Cluster, Seminar Talk, Leibniz Rechenzentrum, München, Germany, Feb. 29th 2012.

  • J. Habich, C. Feichtinger, G. Wellein, Hochskalierbarer Lattice Boltzmann Löser für GPGPU Cluster, High Performance Computing Workshop, Leogang, Austria, Feb. 27th 2012.

  • G. Wellein, J. Habich, G. Hager, T. Zeiser, Node-level performance of the lattice Boltzmann method on recent multicore CPUs I,
    Parallel CFD Conference 2011, Barcelona, Spain, May 2011.

  • G. Wellein, J. Habich, G. Hager, T. Zeiser, Node-level performance of the lattice Boltzmann method on recent multicore CPUs II,
    Parallel CFD Conference 2011, Barcelona, Spain, May 2011.

  • J. Habich, C. Feichtinger, G. Wellein, GPGPU implementation of the LBM: Architectural Requirements and Performance Result,
    Parallel CFD Conference 2011, Barcelona, Spain, May 2011.

  • C. Feichtinger, J. Habich, H. Köstler, U. Rüde, G. Wellein, WaLBerla: Heterogeneous Simulation of Particulate Flows on GPU Clusters,
    Parallel CFD Conference 2011, Barcelona, Spain, May 2011.

  • J. Habich, Ch. Feichtinger and G. Wellein, GPU optimizations at RRZE,
    invited talk, ZISC GPU Workshop, Erlangen, Germany, April 2011.

  • G. Wellein, G. Hager and J. Habich, The Lattice Boltzmann Method: Basic Performance Characteristics and Performance Modeling,
    invited Minisymposia talk, SIAM CSE 2011, Reno, Nevada, USA, March 2011.

  • J. Habich and Ch. Feichtinger, Performance Optimizations for Heterogeneous and Hybrid 3D Lattice Boltzmann Simulations on Highly Parallel On-Chip Architectures,
    invited Minisymposia talk, SIAM CSE 2011, Reno, Nevada, USA, March 2011.

  • J. Habich, Ch. Feichtinger, T. Zeiser, G. Wellein, Optimizations on Highly Parallel On-Chip Architectures: GPUs vs. Multi-Core CPUs (for stencil codes),
    iRMB TU-Braunschweig, invited Seminar talk, Braunschweig, Germany, July 2010.

  • J. Habich, Ch. Feichtinger, T. Zeiser, G. Hager, G. Wellein, Performance Modeling and Optimization for 3D Lattice Boltzmann Simulations on Highly Parallel On-Chip Architectures: GPUs vs. Multi-Core CPUs,
    ECCOMAS CFD Lisboa, Lisbon, Portugal, June 2010.


  • J. Habich, T. Zeiser, G. Hager, G. Wellein, Performance Modeling and Multicore-aware Optimization for 3D Parallel Lattice Boltzmann Simulations,
    Facing the Multicore-Challenge, Heidelberger Akademie der Wissenschaften, Heidelberg, Germany, March 2010.


  • J. Habich, T. Zeiser, G. Hager, G. Wellein: Performance Evaluation of Numerical Compute Kernels on GPUs,
    First International Workshop on Computational Engineering - Special Topic Fluid-Structure Interaction, Herrsching am Ammersee, Germany, October 2009.


  • J. Habich, T. Zeiser, G. Hager, G. Wellein: Towards multicore-aware wavefront parallelization of a lattice Boltzmann flow solver,
    5th Erlangen High-End-Computing Symposium, Erlangen, Germany, June 2009.


  • J. Habich, T. Zeiser, G. Hager, G. Wellein: Enabling temporal blocking for a lattice Boltzmann flow solver through multicore-aware wavefront parallelization, submitted to Parallel CFD Conference,
    Moffett Field, California, USA, May 18-22, 2009.


  • J. Habich, T. Zeiser, G. Hager, G. Wellein: Speeding up a Lattice Boltzmann Kernel on nVIDIA GPUs,
    First International Conference on Parallel, Distributed and Grid Computing for Engineering (PARENG09-S01), Pecs, Hungary, April 2009.


  • J. Habich, G. Hager: Erfahrungsbericht Windows HPC in Erlangen,
    WindowsHPC User Group 2nd Meeting, Dresden, Germany, March 2009.


  • J. Habich, G. Hager: Windows CCS im Produktionsbetrieb und erste Erfahrungen mit HPC Server 2008,
    WindowsHPC User Group 1st Meeting, Aachen, Germany, April 2008.

  • T. Zeiser, J. Habich, G. Hager, G. Wellein: Vector computers in a world of commodity clusters, massively parallel systems and many-core many-threaded CPUs: recent experience based on advanced lattice Boltzmann flow solvers,
    HLRS Results and Review Workshop, Stuttgart, Germany, September 2008.

  • S. Donath, T. Zeiser, G. Hager, J. Habich, G. Wellein: On cache-optimized implementations of the lattice Boltzmann method on complex geometries,
    ASIM, Erlangen, Germany, September 2005.

Conference, workshop and tutorial participation without own presentation


  • WindowsHPC User Group 3rd Meeting, St. Augustin, March 2010.

  • WindowsHPC User Group 2nd Meeting, Dresden, March 2009.

  • Introduction to Unified Parallel C (UPC) and Co-array Fortran (CAF), HLRS, October 2008.

  • Course on Microfluidics, University of Erlangen-Nuremberg, Computer Science 10 (System Simulation), October 2008.

  • IBM Power6 Programming Workshop at RZG, September 2008.

  • PRACE Petascale Summer School (P2S2), Stockholm, Sweden, August 2008.


Wednesday, September 10, 2008

PCI express bandwidth measurements

Benchmarking the PCI express capabilities with CUDA, I stumbled across the weird behaviour that a 4 MB block seems to achieve the best sustainable bandwidth, at least when writing to the host.
However, transmitting more than 4 MB in 4 MB data packets (let's call it a blocked copy) leaves a gap in performance.
Although the performance is regained at the end, when almost the whole GPU memory is filled, the question is what causes the performance to drop to 2 GB/s in the first place.

Another interesting question is the jump in performance at 1e6 bytes, possibly a switch in protocols.
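
For completeness, here is a minimal CUDA sketch of the kind of size sweep behind these measurements. Pinned host memory, the message sizes and the repetition count are illustrative assumptions, not the exact benchmark settings.

  #include <cuda_runtime.h>
  #include <stdio.h>

  /* Sweep device-to-host transfer sizes and report the bandwidth for each. */
  int main(void)
  {
      const size_t max = 64u << 20;            /* largest message: 64 MB */
      char *host, *dev;
      cudaMallocHost((void **)&host, max);     /* pinned host memory */
      cudaMalloc((void **)&dev, max);

      cudaEvent_t start, stop;
      cudaEventCreate(&start);
      cudaEventCreate(&stop);

      for (size_t bytes = 1u << 10; bytes <= max; bytes <<= 1) {
          const int reps = 20;                 /* average over a few transfers */
          cudaEventRecord(start, 0);
          for (int i = 0; i < reps; ++i)
              cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
          cudaEventRecord(stop, 0);
          cudaEventSynchronize(stop);

          float ms;
          cudaEventElapsedTime(&ms, start, stop);
          printf("%10lu bytes  %8.2f MB/s\n", (unsigned long)bytes,
                 (reps * (double)bytes / 1.0e6) / (ms / 1.0e3));
      }

      cudaEventDestroy(start);
      cudaEventDestroy(stop);
      cudaFree(dev);
      cudaFreeHost(host);
      return 0;
  }
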
Figure: Performance of PCI express transfers to the NVIDIA G80 based 8800 GTX card

HPC Server 2008 launch

The official launch of HPC Server 2008 takes place on October 16th, 2008, at the Frankfurt Rhein-Main Airport. More information is available on the official HPC Server 2008 launch website.

Tuesday, September 2, 2008

Towards Teraflops for Games

With the release of the next generation of GPUs, NVIDIA and AMD (formerly ATI) graphics boards now deliver performance on the order of one teraflop in single precision. NVIDIA nearly doubled both the processor count and the memory bus width. Interesting for research is how the sustained performance of programs and algorithms scales on the new platform.
Until now I have not been able to test my own algorithms, the stream benchmarks and the lattice Boltzmann method (see my thesis for more details), on the new NVIDIA GPUs.

Double precision also made its way into the GPU circuits, unfortunately with a huge performance loss down to around a tenth of the single precision performance.
In contrast, current CPUs lose only about 50% of their performance, which follows naturally from the doubled data volume and computational work per element.
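
To quantify this on our own hardware, I plan to use something along the lines of the following CUDA sketch, which times the same dependent multiply-add chain once in single and once in double precision. Kernel, problem size and iteration count are illustrative assumptions and not the benchmark code from my thesis.

  #include <cuda_runtime.h>
  #include <stdio.h>

  /* Repeated multiply-add, instantiated for float and for double. */
  template <typename T>
  __global__ void madd(T *x, int iters)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      T a = x[i], b = (T)1.000001;
      for (int k = 0; k < iters; ++k)
          a = a * b + b;                    /* 2 flops per iteration */
      x[i] = a;                             /* keep the result alive */
  }

  template <typename T>
  static float run(int n, int iters)
  {
      T *d;
      cudaMalloc((void **)&d, n * sizeof(T));
      cudaMemset(d, 0, n * sizeof(T));

      cudaEvent_t start, stop;
      cudaEventCreate(&start);
      cudaEventCreate(&stop);

      cudaEventRecord(start, 0);
      madd<T><<<n / 256, 256>>>(d, iters);
      cudaEventRecord(stop, 0);
      cudaEventSynchronize(stop);

      float ms;
      cudaEventElapsedTime(&ms, start, stop);
      cudaEventDestroy(start);
      cudaEventDestroy(stop);
      cudaFree(d);
      return (2.0f * n * iters / 1.0e9f) / (ms / 1.0e3f);   /* GFlop/s */
  }

  int main(void)
  {
      const int n = 1 << 20, iters = 10000;
      printf("single: %8.1f GFlop/s\n", run<float>(n, iters));
      printf("double: %8.1f GFlop/s\n", run<double>(n, iters));
      return 0;
  }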

Here is a little demonstration of the key difference between CPU and GPU: NVISION

Windows HPC Deployment

The Windows High Performance Cluster Competence Center located at the RWTH Aachen is giving tutorials for administrators on Windows HPC 2008 deployment. Please find more detailed information on their webpage.

Monday, September 1, 2008

Windows HPC Event at RWTH Aachen

The Windows High Performance Cluster Competence Center located at the RWTH Aachen is giving tutorials on using Windows HPC 2008, the upcoming version of Windows Compute Cluster Server. Please find more detailed information on their webpage.

PRACE Petascale Summer School

PRACE Summer School website

Taking place this week (25th to 29th of August) in Stockholm, Sweden, the PRACE Summer School tries to evaluate the needs of the current academic HPC user community. The general aim is to define benchmarks and metrics for future petascale systems.

Current surveys show that only a small portion of the leading HPC systems is used for large, massively parallel jobs. A great deal of jobs stays under 10% of a single supercomputer's resources, not utilizing the parallel capabilities of such a machine.

On the other hand, profound knowledge is needed to make common algorithms scale beyond 64 to 128 nodes towards 1K nodes or more. Therefore, a lot of emphasis is put on techniques and hands-on sessions covering this topic.

Summing up, it was a great event to gain additional skills and training, as well as to get to know the different kinds of algorithms and user expectations in HPC.

First Shot

I'm currently with the HPC group @ RRZE, working on my master's thesis about HPC on graphics cards, covering benchmark kernels and flow solvers.



So any remarks or hints? Drop them here!



Thanks



/edit 01.07.08

Thesis finished :-)