Thursday, December 23, 2010

GIT: Working on Branches: Sharing files and commits

Motivation:
You have several development branches but basically use the same files to test or benchmark all versions. Then there is no sense in maintaining different code for the same purpose in all these branches.
You can easily carry single files from one branch over to another with cherry-pick: commit them in a separate commit and apply that commit on the other branch.

Example:
1. First commit a file and only the file you want to share, nothing else:
git commit bench.sh -m"Newest fancy script which does all tests for all versions"
2. Note down the commit hash (e.g. from git log) somewhere.
3. Go to the folder of the other branch, or check out the other branch in the same folder, as you like.
4. git cherry-pick 123784hash

You should now have the same file in the other branch as well.
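Put together, the whole sequence looks roughly like this (the branch name otherbranch and the hash are placeholders; take the real hash from git log):

git commit bench.sh -m"Newest fancy script which does all tests for all versions"
git log -1                   # shows the hash of the commit you just made
git checkout otherbranch
git cherry-pick 123784hash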


Thanks to Markus for pointing me towards this.

Wednesday, December 22, 2010

GIT: Tagging

Tags are nothing else than names for certain revisions.
So-called annotated tags store the state of the project at a specific point in time for future reference.

Create a tag:
git tag -a v1.0 -m"Your comment for version 1.0"

Push tag to repository:
git push --tags
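For completeness, a few related commands (v1.0 is just the example name from above):

git tag                     # list all existing tags
git push origin v1.0        # push only a single tag instead of all of them
git checkout v1.0           # inspect the tagged state later (detached HEAD)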


Monday, December 20, 2010

Windows7 / Vista / Server2008 unidentified network

Although a network consisting of a switch and 2 computers would be considered very private, Win7 (and the other versions mentioned above) will classify it as unidentified and will not allow you to set the network location to private.
Here's a howto on solving this issue:

1. Start –> Run –> type MMC –> press Enter

2. In the MMC console, select Add/Remove Snap-in from the File menu

3. Select Group Policy Object Editor –> press Add –> select Local computer –> press OK –> press OK

4. Open Computer Configuration –> Windows Settings –> Security Settings –> select Network List Manager Policies.
On the right side you will see the options:
double click –> Unidentified networks

5. Then you can select the option to treat unidentified networks as private and whether users may change the location
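As an untested alternative to the Group Policy snap-in: as far as I know the network category is also stored in the registry under the network profile, so something along these lines should have the same effect (the profile GUID is a placeholder you have to look up under the Profiles key; a value of 1 stands for private):

reg add "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\NetworkList\Profiles\{profile-GUID}" /v Category /t REG_DWORD /d 1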


Source

Friday, December 17, 2010

CUDA Windows Development Environment

Supported by the new NVIDIA Tesla Compute Cluster (TCC) driver, we now offer an integrated development environment for CUDA on the basis of Windows HPC Server 2008 and Visual Studio.


For nearly 2 years now RRZE has provided development resources for CUDA GPGPU computing under the Linux OS. In December 2009 a GPU cluster joined the two development machines in order to support production runs on up to 16 GPUs. Please contact us if you are interested in GPGPU computing, whether for Linux/Windows-based development or for production runs.

GIT: bookmark collection

Just some links:
Tutorial

Wednesday, December 15, 2010

GIT: Branching

Creating a branch for some development work in git is really easy.
First clone an existing repository as described here.

In the new directory simply execute: git branch newbranch.
However, if you then execute git status you will see that you are still on the master branch (or whichever branch you were working on before).
Execute git checkout newbranch and git status will show that you are now working on branch newbranch.
In order to push this information to another repository or a git server, a plain git push will not do the trick. First you need to execute git push origin newbranch; after that your branch exists in the other repository as well.
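Put together, a minimal session looks like this (newbranch and the remote name origin as above):

git branch newbranch            # create the branch
git checkout newbranch          # switch to it (or do both at once: git checkout -b newbranch)
git status                      # should now report: On branch newbranch
git push origin newbranch       # publish the branch to the remote repository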

Friday, December 10, 2010

Windows and CUDA; enabling TCC with nvidia-smi

Like in Linux you can use nvidia-smi to set different modes on the TESLA GPUs.
nvidia-smi is located usually at: C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe

Go there with a command prompt and administrative privileges and type nvidia-smi -s. This gives you the current status and the status of TCC mode after reboot.
Set exclusive compute mode enabled for the first GPU by nvidia-smi -g 0 -c 1
Set exclusive compute mode disabled for the first GPU by nvidia-smi -g 0 -c 0
For other GPUs increment the number after -g
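For a box with more than one Tesla card, the whole sequence could look like this (numbering as reported by nvidia-smi -s):

nvidia-smi -s --> shows the current status and the status after reboot
nvidia-smi -g 0 -c 1 --> enables exclusive compute mode on the first GPU
nvidia-smi -g 1 -c 1 --> the same for the second GPU
nvidia-smi -g 0 -c 0 --> disables it again on the first GPU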

/edit 24.12.2010:
Also look at the first comment on how to change between WDDM and TCC driver model.
Thanks to Francoise for reporting my mistake. I corrected it above.

Thursday, December 9, 2010

Win2008 HPC Server and CUDA TCC revisited

The release of the stable NVIDIA Driver 260.83 broke my Windows CUDA programming environment.
With the currently newest driver, 263.06, I gave it another shot. Initially the CUDA SDK sample programs did not recognize the GPU as CUDA capable and there was just some babbling about a driver and toolkit mismatch.
However, this time searching the web got me to an IBM webpage which had a solution for their servers running Windows 2008 R2.
I tried this in Win2008 and it works like a charm:

  • Enter the registry editor by typing regedit in the Run dialog and navigate to:


[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4D36E968-E325-11CE-BFC1-08002BE10318}\

  • You will find subkeys named 0001, 0002 and so on, depending on the number of GPUs in your system.

  • For each card on which you want to enable CUDA, go to the corresponding 000X key and add the following registry value (a 32-bit DWORD worked for me):


"AdapterType"=dword:00000002

If you access the system via RDP read my blog entry on Using nvidia-smi for TCC on how to set this up!

The source of this information is IBM; for further reference and even more details see the IBM Support Site.

Saturday, November 6, 2010

Embedding fonts in Latex / XMgrace / PDF toolchain

How to embed fonts in PDF files generated from LaTeX and EPS sources under Linux is nicely described here:
http://www.wkiri.com/today/?p=60

Be sure to disable the option "use device fonts" in the xmgrace printer settings when printing to EPS.
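Independent of the linked guide, two commands I find handy for checking and forcing embedding (file names are placeholders; pdffonts comes with poppler-utils/xpdf, ps2pdf with Ghostscript):

pdffonts paper.pdf        # the "emb" column shows which fonts are embedded
ps2pdf -dPDFSETTINGS=/prepress -dEmbedAllFonts=true paper.ps paper.pdf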

Friday, November 5, 2010

GIT: Distributed Revision System

Distributed revision systems have the advantage that, in contrast to CVS or SVN, a central server is not necessary. Furthermore, commits and, even more importantly, diffs against other versions can be made with the local repository only. This is an advantage when working offline, e.g. on journeys, on a plane etc. Furthermore, a new branch for testing can quickly be made by simply cloning the repository once more and working in the new directory.

To set up the git client on your Ubuntu Linux just type aptitude install git-core and you're done.

For windows you need to download 2 packages:

MSysGIT
Tortoise GIT

If you install them in this order everything should be fine. If Tortoise later complains about not finding git, go to Tortoise's settings and point the git path to the directory where you installed msysgit plus \bin, e.g. c:\Program Files\git\bin

Working together on a project can be done as with CVS and SVN. But in contrast to these centralized systems, you do not commit your files to the central repository but first to your local one.

The central repository is then updated by a so-called push.

To get the updates from the central server you do a pull.

Basic tasks are:

  1. Clone the repository to your local workstation: git clone git@someserver.com:projectname

  2. Add new files: git add xyz

  3. Commit the new files or changes to already added files: git commit xyz

  4. Send changes to server: git push

  5. Get changes from server to rep only: git fetch

  6. Get changes from server to rep and update checked out files: git pull
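Strung together, a minimal session with the placeholders from above looks like this:

git clone git@someserver.com:projectname
cd projectname
git add xyz
git commit xyz -m "describe the change"
git push
git fetch    # update the local repository only
git pull     # update the repository and the checked out files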


This is just for getting started with git; please consult the manuals and documentation for more advanced topics.
Thanks to Thomas for this quick-start information.

Another distributed revision system is Mercurial, for which you can find a small tutorial here: MercurialHG


Friday, October 22, 2010

[Winhpcug] Invitation: Windows HPC in Economics and Finance on Nov 5 in Cologne

On 05.11.2010 a one-day meeting (10:00 - 17:00) of the user group will take place in Cologne, with a focus on economics and finance. In addition, new technologies such as the Windows HPC Server 2008 R2 will be discussed. More information can be found in the attached flyer as well as on the website, which also offers a registration form.

Registration for the WINHPCUG mailing list:
Winhpcug@lists.rwth-aachen.de
https://mailman.rwth-aachen.de/mailman/listinfo/winhpcug

Friday, October 15, 2010

Lectures and seminars of the HPC group

Interested in topics in and around HPC for your studies?


Then have a look at the official homepage.


Find all lectures and seminars here.

Wednesday, October 13, 2010

NVIDIA CUDA TCC Driver Released 260.83

Just today NVIDIA released the WHQL-certified Tesla Compute Cluster driver TCC 260.83 for use in e.g. Windows 2008 Server/HPC.
Until now only a beta version was available.
With this special driver you have the ability to use GPGPU compute resources via RDP or via the WindowsHPC batch processing mode.

Download the driver here


/edit:
Actually installing this driver broke my working environment. So be sure to keep a backup of the system. Even reinstalling the beta version did not solve the problem.

Tuesday, October 12, 2010

Win2008 HPC Server and CUDA TCC

NVIDIA now provides a beta driver called Tesla Compute Cluster (TCC) in order to use CUDA GPUs within a Windows cluster environment, not only remotely via RDP but also in batch processing. Until now, the HPC Server lacked this ability, as Windows did not fire up the graphics driver inside the limited batch logon mode.

My first steps with TCC took a little bit longer than estimated.

First of all, it is not possible to have an NVIDIA and an AMD or Intel GPU side by side, as Windows needs to use one unified display driver model (WDDM) and that is either one vendor or the other. This was completely new to me.

After this first minor setback, and re-equipped with only the Tesla C2050, the BIOS did not finish booting, so be sure to be up to date with your BIOS revision.
Another NVIDIA card was the quick fix on my side.

Next comes the setup. Install the 260 (currently beta) drivers and the exclamation mark in the device manager should vanish.
After that install the toolkit and SDK if you like.
With the nvidia-smi tool, which you find in one of the countless NVIDIA folders that are there now, you can check whether the card is correctly recognized initially.
Also set the TCC mode of the Tesla card to enabled if you want to have remote CUDA capabilities:

nvidia-smi -s --> shows the current status
nvidia-smi -g 0 -c 1 --> enables TCC on GPU 0


Next you want to test the device query coming with the SDK.
If it runs and everything looks fine, consider yourself lucky!

Nothing ran on my setup. So first of all I tried to build the SDK example myself. To do that, first build the CUDA utilities, located somewhere in the SDK within the folder "common".
Depending on the Nsight or toolkit version you have installed, you may get an error when opening the VS project files. Then you need to edit the Visual Studio project file with a text editor of your choice and replace the outdated build rule with the one that is actually installed.

  • From the error message, note the folder where VS does not find the file.

  • Copy the path and go there with your file browser.

  • Find the file whose name is closest to the one in the VS error message.

  • Once found, open the VS project file and replace the wrong filename with the correct one.

  • VS should now open.



In order to compile, add the correct include and library directories to the VS project.
Finally you can build deviceQuery or any other program.

Still this setup gave me the same error as the precompiled deviceQuery:
cudaGetDeviceCount FAILED CUDA Driver and Runtime version may be mismatched.

With the help of Dependency Walker I found out that a missing DLL was the problem, namely:
linkinfo.dll.

You can get this by adding the feature named "Desktop Experience" through the server manager.
Once installed and rebooted the device query worked.
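If you prefer the command line, on Windows Server 2008 something along these lines should install the same feature (on R2 you would use the ServerManager PowerShell module instead; take the exact feature name with a grain of salt):

servermanagercmd -install Desktop-Experience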

Friday, September 3, 2010

TinyGPU offers new hardware



TinyGPU has new hardware: tg010. The hardware configuration and the currently deployed software differ from the non-Fermi nodes:

  • Ubuntu 10.04 LTS (instead of 8.04 LTS) as OS.
    Note: For using the Intel Compiler <= 11.1 locally on tg010, you [currently] have to load the gcc/3.3.6 module. Otherwise libstdc++.so.5 is missing, as Ubuntu 10.04 no longer contains this version. This is necessary only for compilation; compiled Intel binaries will run as expected.

  • /home/hpc and /home/vault are mounted [only] through NFS  (and natively via GPFS-Cross-Cluster-Mount)

  • Dual-Socket-System with  Intel Westmere X5650 (2.66 GHz) processor, having 6 native cores per socket (instead of Dual-Socket-System with  Intel Nehalem X5550 (2.66 GHz), having  4 native cores per socket)

  • 48 GB DDR3 RAM (instead of  24 GB DDR3 RAM)

  • 1x NVidia Tesla C2050 (“Fermi” with 3 GB GDDR5 featuring ECC)

  • 1x NVidia GTX 280 (consumer card with 1 GB RAM – formerly known as F22)

  • 2 further PCIe2.0 16x slots will be equipped with  NVidia C2070 Cards (“Fermi” with  6 GB GDDR5 featuring ECC) in Q4, instead of  2x NVidia Tesla M1060 (“Tesla” with  4 GB RAM) as in the remaining cluster nodes

  • SuperServer 7046GT-TRF / X8DTG-QF with  dual Intel 5520 (Tylersburg) chipset instead of  SuperServer 6016GT-TF-TM2 / X8DTG-DF with  Intel 5520 (Tylersburg) chipset


To allocate the Fermi node, specify :ppn=24 with your job (instead of :ppn=16) and explicitly submit to the TinyGPU queue fermi. The wallclock limit is set to the default of 24 h. The ECC memory status is shown on job startup.
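A submission could then look roughly like this (the job script name is made up; check the cluster documentation for the exact syntax of the batch system):

qsub -q fermi -l nodes=1:ppn=24 myjob.sh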
This article tries to be a translation from the original posted here: Zuwachs im TinyGPU-Cluster

Wednesday, August 4, 2010

Intel 5300 AGN Wireless Network adapter drops connection in Win7

When connected to a wireless N network, my XT2 quite often drops the connection when certain amounts of data are transferred, i.e. mostly when the data rate is quite high (> 5 MB/s).
Following various suggestions about disabling power management, N-network compatibility and even encryption, I tried different settings in the device driver in Win7.
Basically I changed just one entry:
Fat Channel Intolerant
and the disconnects went away.
Intel states about Fat Channel Intolerant:
Fat Channel Intolerant
This setting communicates to surrounding networks that this Wi-Fi adapter is not tolerant of 40MHz channels in the 2.4GHz band. The default setting is for this to be disabled (turned off) so that the adapter does not send this notification.
Note These settings are available only if the adapter is an Intel® WiMAX/WiFi Link 5350, Intel® WiMAX/WiFi Link 5150, Intel® WiFi Link 5300, Intel® WiFi Link 5100 or Intel® Wireless WiFi Link 4965AGN.

Source:

My opinion is that the driver did not send this notification properly, got a wide 40 MHz channel and then dropped out.
This often occurs when the data rate is increasing and quite large, in order to ensure maximum bandwidth.
Any comments that would clear up this behaviour are appreciated. Even more interesting would be some insight into how this behaves in Linux.

/edit

Still the disconnects occur, but not so often.

/edit

With the newest driver from the Dell homepage the problems have been gone so far. (27.08.2010)

/edit

It has worked so far with the current driver 3.15.0 directly from the Intel homepage.

/edit 19.09.2011

Finally I did a reinstall of Win7 due to some other concerns (partitioning etc.), and with the default Win7 or Windows Update driver it has worked completely flawlessly for 2 weeks now.

Maybe the installs and uninstalls broke something in the networking setup; basically, I don't care.


Wednesday, July 21, 2010

Windows HPC and Ganglia Monitoring

We had the common problem that we needed to restart ganglia a lot lately, as several nodes did not report their data anymore; furthermore, the service sometimes never came up in the first place.

We did several tests and it seems that the ganglia clients send a specific packet at startup; only at startup and only once. If this packet is not received, the server does not display any data of this client, although the data is actually collected and sent.

Randomly, some nodes cannot get the initial packet through and are not displayed.

Therefore we now start the clients with a time offset and ensure that all clients can report to the server in a fair fashion.


Still the restart issue remains, but we could extend the restart period to 6 h.

Thanks to my colleague Kosta G. for finding this issue.

Friday, June 11, 2010

Thread Pinning/Affinity

Thread pinning is very important in order to get sound and reliable results on today's multi- and manycore architectures. Otherwise threads will migrate from one core to another, wasting clock cycles. Even more important: if you placed your memory correctly by first touch on ccNUMA systems, e.g. an SGI Altix or any dual-socket Intel Xeon (Core i7 based) system, a thread that gets migrated to the other socket has to access its memory over the QPI interface connecting the two sockets.

Jan Treibig developed a tool for that called likwid-pin.

A sample usage would be as follows:

likwid-pin -s 0 -c"0,1,2,3,4,5,6,7" ./Exec

This would pin the 8 threads of the executable to the cores 0 to 7.
For information about the topology, just use the other tool, called likwid-topology, which gives you the cache and core hierarchy.
The skip mask is important and specific to the threading implementation. Also consider that in hybrid programs, e.g. OpenMP plus MPI, multiple shepherd threads are present.
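A typical sequence for a plain OpenMP binary might look like this (likwid-pin also accepts core ranges; the exact option syntax may differ slightly between likwid versions):

likwid-topology                 # inspect the cache and core hierarchy first
export OMP_NUM_THREADS=8
likwid-pin -c 0-7 ./Exec        # pin the 8 threads to cores 0 to 7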

Friday, May 28, 2010

Single Precision: Friend or Foe

The recent development of so-called disruptive technologies always leads to some kind of everlasting discussion.
Today I want to say something about the debate over whether GPUs are feasible at all for scientific computing, as their double precision performance is nowadays not too far away from that of standard CPUs. And single precision is supposedly not worth the discussion, as nobody wants to board a plane or a ship which was simulated just in single precision.

Detour
So for non-simulators first some explanation: single precision means a floating point representation of a given number using up to 4 bytes. Double precision uses up to 8 bytes and can therefore provide much more accuracy.

GPUs were originally designed for graphics applications, which do not actually even need single precision. There is a bunch of very fast floating point instructions working on just 24 bits instead of 32 bits (again, 32 bits = 4 bytes = single precision).
E.g. current NVIDIA cards have just 1 dp floating point unit per 8 sp units.

So far it is obvious why everyone complains about the worse dp performance in contrast to sp performance. However, hardly anybody (well, I do) complains about the low dp performance you actually get out of a current x86 processor. With some kinds of system configuration you will get just about 10% of the nominal performance, or even less.

This is because data is brought to the computing units much more slowly than it can be processed there.
This is true for most scientific codes, e.g. stencil codes. Therefore you will see the usual breakdown to 50% of the performance when switching from sp to dp on GPUs, just as you see it on CPUs, because you simply transfer twice the data over the same memory bus.
So, the dp units are most often not the limit of compute performance.
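A back-of-the-envelope example (the bandwidth number is assumed, not measured): a streaming kernel like a[i] = b[i] + s*c[i] moves three array elements per update, i.e. 24 bytes in dp and 12 bytes in sp. At an assumed 150 GB/s of memory bandwidth the kernel cannot exceed roughly 150/24 ≈ 6 billion updates per second in dp versus 150/12 ≈ 12 billion in sp, no matter how many floating point units the chip has. Hence the factor of two between sp and dp on bandwidth-bound codes, on GPUs and CPUs alike.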


Thursday, May 20, 2010

LaTex: Floatflt.sty missing on ubuntu lucid 10.04

At first I thought the recent upgrade to the new Ubuntu stable version had missed installing all TeX Live resources.
However, the license of floatflt.sty has changed, so it is no longer included in Ubuntu or TeX Live.

Here's a quick guide to reenable it.



Problem:



LaTeX Error: File `floatflt.sty' not found

Solution (to be run as root):



sudo mkdir -p /usr/share/texmf-texlive/tex/latex/floatflt
cd /usr/share/texmf-texlive/tex/latex/floatflt
sudo rm -f floatflt.* float*.tex
sudo wget http://mirror.ctan.org/macros/latex/contrib/floatflt/floatflt.ins
sudo wget http://mirror.ctan.org/macros/latex/contrib/floatflt/floatflt.dtx
sudo latex floatflt.ins
sudo texhash /usr/share/texmf-texlive
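To verify that TeX now finds the package, kpsewhich should print the path of the newly generated style file:

kpsewhich floatflt.sty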


Source of the suggestion, with a discussion whether to use a backport.

I would appreciate any hint towards a solution which does this more automatically; please drop me a comment with your solution.

Monday, May 10, 2010

JUROPA MPI Buffer on demand

To enable huge runs with lots of MPI ranks you have to disable the all-to-all send buffers that are allocated by default on the NEC Nehalem cluster Juropa at FZ Jülich.


Here is an excerpt from the official docu:

Most MPI programs do not need every connection



  • Nearest neighbor communication
  • Scatter/Gather and Allreduce based on binary trees
  • Typically just a few dozen connections when having hundreds of
    processes
  • ParaStation MPI supports this with "on demand connections"
  • export PSP_ONDEMAND=1
  • was used for the Linpack runs (np > 24000)
  • mpiexec --ondemand
  • Drawback
  • Late all-to-all communication might fail due to short memory
  • Default on JuRoPA is not to use "on demand connections"
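Put into a job script, the relevant part might look like this (executable name and rank count are placeholders):

export PSP_ONDEMAND=1
mpiexec -np 32768 ./a.out

or, equivalently, using the mpiexec flag from the excerpt:

mpiexec --ondemand -np 32768 ./a.out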

Links


Juropa Introduction @ FZJ