Tuesday, October 12, 2010

Win2008 HPC Server and CUDA TCC

Nvidia now provides a beta driver called Tesla Compute Cluster (TCC) in order to use CUDA GPUs within a windows cluster environment. Not only remotely via RDP but also in batch processing. Till now, the HPCServer lacked this ability, as Windows did not fire up the graphics driver inside the limited batch logon mode.

My first steps with TCC took a little bit longer than estimated.

First of all It is not possible to have a NVIDIA and AMD or INTEL GPU side by side as Windows needs to use one unified WDM and thats either one or the other vendor. This was completely new to me.

After this first minor setback and reequipped with only the tesla C2050 the BIOS did not finish, so be sure to be up to date with your BIOS revision.
Another NVIDIA card was the quick fix on my side.

Next thing is the setup. Install the 260 (currently beta) drivers and the exclamation mark in the device manager should vanish.
After that install the toolkit and SDK if you like.
With the nvidia-smi tool, which you find in one of the uncountable NVIDIA folders which are there now, you can have a look if the card is initally correctly recognized.
As well set the TCC mode of the Tesla card to enabled if you want to have remote cuda capabilities:

nvidia-smi -s --> shows the current status
nvidia-smi -g 0 -c 1 --> enables TCC on GPU 0

Next thing you want to test the device query coming with the SDK.
If it runs and everything looks fine, feel gifted!

Nothing did run on my setup. So first of all I tried to build the SDK example myself. Therefore first of all build the Cuda utilities, lying somewhere in the SDK within the folder "common".
Depending on the Nsight or TK version you have installed you get an error when opening the VS project fles . The you need to edit the visual studio with a text editor of your choice and replace the outdated build rule with the one actually installed.

  • In the error message get to the folder where VS does not find the file.

  • Copy the path and go there with your file browser

  • Find the file most equal to the one in the VS error message.

  • Once found open the VS file and replace the wrong filename there with the correct one

  • VS should open

In order to compile, add the correct include and library directories to the VS project.
Finally you can build deviceQuery or any other program.

Still this setup gave me the same error as the precompiled deviceQuery:
cudaGetDeviceCount FAILED CUDA Driver and Runtime version may be mismatched.

With the help of the DependencyWalker i found out that a missing DLL was the problem, namely:

You can get this by adding the feature named "Desktop Experience" through the server manager.
Once installed and rebooted the device query worked.

No comments:

Post a Comment