Discussion:
[Beowulf] Monitoring and Metrics
Josh Catana
2017-10-07 12:21:08 UTC
This may have been brought up in the past, but I couldn't find much in my
message archive.
What are people using for HPC cluster monitoring and metrics lately? I've
been short on time to add features to my home-grown solution, so I'm looking
at some off-the-shelf (OTS) products.
I'm looking for something that can do monitoring and alert on conditions such
as broken hardware, etc.
Also something that collects system resource utilization metrics. If it has a
plug-in for a scheduling system like PBS, so that I can correlate a job ID with
the metrics of the systems it is currently running on (or previously ran on)
over that time window, that would be an amazing plus.
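To make the correlation concrete, here is roughly the kind of glue I have in
mind (just a sketch, not something I actually run: the Graphite endpoint, the
metric path, and the exact qstat fields are assumptions):

#!/usr/bin/env python3
# Sketch: map a PBS job ID to per-node CPU metrics for the job's run window.
# Assumes Torque/PBS-style `qstat -f` output and a Graphite render API; the
# Graphite host and metric path below are placeholders.
import re
import json
import time
import subprocess
import urllib.request

GRAPHITE = "http://graphite.example.com"   # hypothetical metrics server

def job_info(jobid):
    """Return (hosts, start_epoch, end_epoch) parsed from `qstat -f`."""
    out = subprocess.check_output(["qstat", "-f", jobid], universal_newlines=True)
    text = out.replace("\n\t", "")           # unwrap continuation lines
    hosts = []
    m = re.search(r"exec_host = (\S+)", text)
    if m:
        hosts = sorted({part.split("/")[0] for part in m.group(1).split("+")})

    def epoch(field):
        f = re.search(field + r" = (.+)", text)
        if not f:
            return None
        return int(time.mktime(time.strptime(f.group(1).strip(),
                                              "%a %b %d %H:%M:%S %Y")))

    return hosts, epoch("stime"), epoch("mtime")

def cpu_series(host, start, end):
    """Fetch one host's CPU metric from Graphite (metric path is a guess)."""
    url = (f"{GRAPHITE}/render?target=servers.{host}.cpu.total.user"
           f"&from={start}&until={end}&format=json")   # epoch from/until
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

if __name__ == "__main__":
    hosts, start, end = job_info("12345.pbsserver")
    if not hosts or start is None:
        raise SystemExit("job has no execution info yet")
    for h in hosts:
        print(h, cpu_series(h, start, end))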
Any of you beowulfers have any suggestions?
Paul Edmon
2017-10-07 13:13:18 UTC
So for general monitoring of cluster usage we use:

https://github.com/fasrc/slurm-diamond-collector

and pipe it to Grafana. We also use XDMoD:

http://open.xdmod.org/7.0/index.html
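For anyone who hasn't looked at that pipeline: the collector essentially turns
Slurm state into Graphite's plaintext protocol, which Grafana then graphs. A
stripped-down sketch of the idea (not the actual collector; the Graphite host
and metric name are placeholders):

#!/usr/bin/env python3
# Minimal illustration of a Slurm -> Graphite -> Grafana pipeline:
# push one queue-depth metric over Graphite's plaintext (carbon) protocol.
# This is NOT the fasrc collector, just a sketch with placeholder names.
import socket
import subprocess
import time

GRAPHITE_HOST = "graphite.example.com"   # hypothetical carbon host
GRAPHITE_PORT = 2003                     # carbon plaintext port

def pending_jobs():
    """Count pending jobs via squeue."""
    out = subprocess.check_output(
        ["squeue", "-h", "-t", "PENDING", "-o", "%i"], universal_newlines=True)
    return len(out.splitlines())

def send_metric(path, value):
    """Send 'path value timestamp\\n' to carbon-cache."""
    line = "%s %s %d\n" % (path, value, int(time.time()))
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT)) as sock:
        sock.sendall(line.encode())

if __name__ == "__main__":
    send_metric("cluster.slurm.pending_jobs", pending_jobs())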

As for specific node alerting, we use the old standby of Nagios.
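The Nagios checks themselves are mostly small scripts that print one status
line and exit 0/1/2/3 (OK/WARNING/CRITICAL/UNKNOWN); a minimal sketch of such
a check, with made-up thresholds:

#!/usr/bin/env python3
# Sketch of a Nagios-style check plugin: one status line plus an exit code of
# 0/1/2/3 is all Nagios (or NRPE) needs. The load thresholds are arbitrary.
import os
import sys

WARN, CRIT = 16.0, 32.0          # example thresholds for 1-minute load

def main():
    load1 = os.getloadavg()[0]
    if load1 >= CRIT:
        print("CRITICAL - load %.2f | load1=%.2f" % (load1, load1))
        return 2
    if load1 >= WARN:
        print("WARNING - load %.2f | load1=%.2f" % (load1, load1))
        return 1
    print("OK - load %.2f | load1=%.2f" % (load1, load1))
    return 0

if __name__ == "__main__":
    sys.exit(main())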

-Paul Edmon-
Lachlan Musicman
2017-10-08 05:19:43 UTC
We use XDMoD and Zabbix for per-machine monitoring, and Logwatch as well,
though not as comprehensively.
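For ad-hoc per-machine items, the usual Zabbix pattern is either an agent item
or pushing values with zabbix_sender; a tiny sketch of the latter (the server
name and item key are placeholders, and the key has to exist as a trapper item
on the Zabbix server):

#!/usr/bin/env python3
# Sketch: push an ad-hoc value into Zabbix with zabbix_sender.
# Server, host, and item key are placeholders, not a real configuration.
import socket
import subprocess

def send_to_zabbix(key, value, server="zabbix.example.com"):
    subprocess.check_call([
        "zabbix_sender",
        "-z", server,                    # Zabbix server/proxy to send to
        "-s", socket.gethostname(),      # host name as configured in Zabbix
        "-k", key,                       # item key (trapper item)
        "-o", str(value),                # value to store
    ])

if __name__ == "__main__":
    send_to_zabbix("custom.scratch.usage_pct", 73.5)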

We tried Grafana, InfluxDB, and this plugin
(http://slurm.schedmd.com/SLUG16/monitoring_influxdb_slug.pdf), but we didn't
find it as useful as we would have liked. It's a great plugin; we just didn't
need it.

cheers
L.


------
"The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic civics
is the insistence that we cannot ignore the truth, nor should we panic
about it. It is a shared consciousness that our institutions have failed
and our ecosystem is collapsing, yet we are still here — and we are
creative agents who can shape our destinies. Apocalyptic civics is the
conviction that the only way out is through, and the only way through is
together."

*Greg Bloom* @greggish https://twitter.com/greggish/status/873177525903609857
Benson Muite
2017-10-08 09:24:16 UTC
May also be of interest:

JobDigest – Detailed System Monitoring-Based Supercomputer Application
Behavior Analysis

Dmitry Nikitenko, Alexander Antonov, Pavel Shvets, Sergey Sobolev,
Konstantin Stefanov, Vadim Voevodin, Vladimir Voevodin and Sergey Zhumatiy

http://russianscdays.org/files/pdf17/185.pdf
----
Research Fellow of Distributed Systems
Institute of Computer Science
University of Tartu
J. Liivi 2 50409
Tartu, Estonia
http://kodu.ut.ee/~benson
l***@debian.org
2017-10-07 12:29:16 UTC
I'm using Ganglia for monitoring. No alerts, just node metrics like CPU and
network load, but it's nice to be able to look back at what happened in the past.
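If you ever want extra numbers on those graphs, gmetric is the usual hook; a
small sketch (the metric name and units are just an example, not something
from my setup):

#!/usr/bin/env python3
# Sketch: publish a custom metric into Ganglia with gmetric so it shows up
# alongside the built-in CPU/network graphs. Metric name and units are examples.
import subprocess

def gmetric(name, value, units="", metric_type="float"):
    subprocess.check_call([
        "gmetric",
        "--name=" + name,
        "--value=" + str(value),
        "--type=" + metric_type,
        "--units=" + units,
    ])

if __name__ == "__main__":
    gmetric("scratch_free_gb", 1234.5, units="GB")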
--
regards Thomas