Wednesday, July 21, 2010

Windows HPC and Ganglia Monitoring

We ran into a common problem: lately we had to restart Ganglia frequently because several nodes stopped reporting their data, and on some nodes the service never came up in the first place.

We ran several tests, and it seems that the Ganglia clients send a specific packet at startup, only at startup and only once. If this packet is not received, the server does not display any data for that client, even though the data is actually collected and sent.

Seemingly at random, some nodes fail to get this initial packet through and are therefore not displayed.

We therefore start the clients with staggered delays, so that every client gets a fair chance to report to the server.
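One way to stagger the startup could look like the sketch below. It is not the script we actually use, just a minimal illustration: each node derives a stable delay from its own hostname and waits that long before starting the client. The function names, the 120-second window, and the `net start gmond` command (including the service name) are assumptions made for the example.

```python
import hashlib
import socket
import subprocess
import time


def start_delay(hostname: str, window_seconds: int = 120) -> float:
    """Map a hostname to a stable delay in [0, window_seconds)."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    # Use the first 4 bytes as an unsigned integer and scale it into the window.
    fraction = int.from_bytes(digest[:4], "big") / 2**32
    return fraction * window_seconds


def start_staggered(command: list[str], window_seconds: int = 120) -> None:
    """Sleep for this node's delay, then run the start command."""
    time.sleep(start_delay(socket.gethostname(), window_seconds))
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical service name; replace with however gmond is started on your nodes.
    start_staggered(["net", "start", "gmond"])
```

Hashing the hostname spreads the nodes across the window without any coordination, but it does not guarantee that two nodes never pick nearly the same delay. A fixed offset per node number would do that, if the nodes are numbered.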


The restart issue remains, but we were able to extend the interval between restarts to 6 hours.

Thanks to my colleague Kosta G. for finding this issue.

2 comments:

  1. The issue you are describing sounds like you didn't set send_metadata_interval to a "sane" number (e.g. 300s) in a unicast environment. Have a look at the release notes and try adding that setting to gmond.conf; hopefully you won't need to randomly restart the daemons:

    https://sourceforge.net/apps/trac/ganglia/wiki/ganglia_release_notes#ImportantNotes:

  2. Thanks for that suggestion. However, we have set send_metadata_interval to 1200. Whether that is sane or not I do not know. But we will try 300.

    Cheers,
    Johannes

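For reference, the setting discussed in the comments belongs in the globals section of gmond.conf. The fragment below is a minimal sketch that assumes a unicast setup and uses the 300-second value suggested above; all other options in the section are left out and keep whatever values they already have.

```
globals {
  /* Resend metric metadata every 300 seconds (value suggested in the comments). */
  send_metadata_interval = 300
}
```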