Wednesday, July 21, 2010

Windows HPC and Ganglia Monitoring

We ran into a common problem: lately we had to restart Ganglia very often because several nodes stopped reporting their data, and on top of that the service never came up in the first place.

We ran several tests, and it seems that the Ganglia clients send a specific packet at startup, only at startup and only once. If this packet is not received, the server does not display any data for that client, although the data is actually collected and sent.

At random, some nodes fail to get this initial packet through and are therefore not displayed.

Therefore we now start the clients with a time offset between them, so that they do not all send their startup packet at the same moment and every client gets a fair chance to report to the server.
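To illustrate the idea of starting the clients with a time offset, here is a minimal Python sketch. It is not the script we actually use; the function names, the spacing and the window length are assumptions chosen for the example. The first function hands out evenly spaced start slots from a central list of nodes, the second lets each node derive its own stable delay from its hostname when no central list is available.

```python
import hashlib


def evenly_spaced_delays(hostnames, spacing_seconds=2):
    """Assign each node its own start slot, spacing_seconds apart.

    Nodes are sorted by name so the schedule is the same on every run.
    """
    return {
        name: index * spacing_seconds
        for index, name in enumerate(sorted(hostnames))
    }


def startup_delay(hostname, window_seconds=120):
    """Derive a stable delay in [0, window_seconds) from the hostname.

    Useful when every node computes its delay locally; two nodes can
    still land on the same second, so this only spreads the load.
    """
    digest = hashlib.sha256(hostname.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % window_seconds


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    nodes = ["node03", "node01", "node02"]
    for name, delay in evenly_spaced_delays(nodes, spacing_seconds=5).items():
        print(f"{name}: start client after {delay} s")
```

Each node would then wait for its delay before starting the Ganglia client service. How the service is started depends on the installation and is left out here.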

The restart issue still remains, but we were able to extend the restart interval to 6 hours.

Thanks to my colleague Kosta G. for finding this issue.