Discussion:
[Beowulf] Monitoring and Metrics
Josh Catana
2017-10-07 12:21:08 UTC
This may have been brought up in the past, but I couldn't find much in my
message archive.
What are people using for HPC cluster monitoring and metrics lately? I've
been short on time to add features to my home-grown solution, so I'm looking
at some off-the-shelf (OTS) products.
I'm looking for something that can do monitoring and alert on conditions such
as broken hardware, etc.
Also something that collects system resource utilization metrics. If it has a
plug-in for a scheduling system like PBS, so that I can correlate a job ID with
the metrics of the systems it is currently running on (or previously ran on)
over that time window, that would be an amazing plus.
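To make the correlation concrete, here is roughly the kind of glue I have in
mind (just a sketch, not something I actually run: the Graphite endpoint, the
metric path, and the exact qstat fields are assumptions):

#!/usr/bin/env python3
# Sketch: map a PBS job ID to per-node CPU metrics for the job's run window.
# Assumes Torque/PBS-style `qstat -f` output and a Graphite render API; the
# Graphite host and metric path below are placeholders.
import re
import json
import time
import subprocess
import urllib.request

GRAPHITE = "http://graphite.example.com"   # hypothetical metrics server

def job_info(jobid):
    """Return (hosts, start_epoch, end_epoch) parsed from `qstat -f`."""
    out = subprocess.check_output(["qstat", "-f", jobid], universal_newlines=True)
    text = out.replace("\n\t", "")           # unwrap continuation lines
    hosts = []
    m = re.search(r"exec_host = (\S+)", text)
    if m:
        hosts = sorted({part.split("/")[0] for part in m.group(1).split("+")})

    def epoch(field):
        f = re.search(field + r" = (.+)", text)
        if not f:
            return None
        return int(time.mktime(time.strptime(f.group(1).strip(),
                                              "%a %b %d %H:%M:%S %Y")))

    return hosts, epoch("stime"), epoch("mtime")

def cpu_series(host, start, end):
    """Fetch one host's CPU metric from Graphite (metric path is a guess)."""
    url = (f"{GRAPHITE}/render?target=servers.{host}.cpu.total.user"
           f"&from={start}&until={end}&format=json")   # epoch from/until
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

if __name__ == "__main__":
    hosts, start, end = job_info("12345.pbsserver")
    if not hosts or start is None:
        raise SystemExit("job has no execution info yet")
    for h in hosts:
        print(h, cpu_series(h, start, end))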
Any of you beowulfers have any suggestions?
Paul Edmon
2017-10-07 13:13:18 UTC
So for general monitoring of cluster usage we use:

https://github.com/fasrc/slurm-diamond-collector

and pipe it to Grafana. We also use XDMoD:

http://open.xdmod.org/7.0/index.html
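For anyone who hasn't looked at that pipeline: the collector essentially turns
Slurm state into Graphite's plaintext protocol, which Grafana then graphs. A
stripped-down sketch of the idea (not the actual collector; the Graphite host
and metric name are placeholders):

#!/usr/bin/env python3
# Minimal illustration of a Slurm -> Graphite -> Grafana pipeline:
# push one queue-depth metric over Graphite's plaintext (carbon) protocol.
# This is NOT the fasrc collector, just a sketch with placeholder names.
import socket
import subprocess
import time

GRAPHITE_HOST = "graphite.example.com"   # hypothetical carbon host
GRAPHITE_PORT = 2003                     # carbon plaintext port

def pending_jobs():
    """Count pending jobs via squeue."""
    out = subprocess.check_output(
        ["squeue", "-h", "-t", "PENDING", "-o", "%i"], universal_newlines=True)
    return len(out.splitlines())

def send_metric(path, value):
    """Send 'path value timestamp\\n' to carbon-cache."""
    line = "%s %s %d\n" % (path, value, int(time.time()))
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT)) as sock:
        sock.sendall(line.encode())

if __name__ == "__main__":
    send_metric("cluster.slurm.pending_jobs", pending_jobs())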

As for specific node alerting, we use the old standby of Nagios.
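The Nagios checks themselves are mostly small scripts that print one status
line and exit 0/1/2/3 (OK/WARNING/CRITICAL/UNKNOWN); a minimal sketch of such
a check, with made-up thresholds:

#!/usr/bin/env python3
# Sketch of a Nagios-style check plugin: one status line plus an exit code of
# 0/1/2/3 is all Nagios (or NRPE) needs. The load thresholds are arbitrary.
import os
import sys

WARN, CRIT = 16.0, 32.0          # example thresholds for 1-minute load

def main():
    load1 = os.getloadavg()[0]
    if load1 >= CRIT:
        print("CRITICAL - load %.2f | load1=%.2f" % (load1, load1))
        return 2
    if load1 >= WARN:
        print("WARNING - load %.2f | load1=%.2f" % (load1, load1))
        return 1
    print("OK - load %.2f | load1=%.2f" % (load1, load1))
    return 0

if __name__ == "__main__":
    sys.exit(main())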

-Paul Edmon-
Lachlan Musicman
2017-10-08 05:19:43 UTC
We use XDMoD and Zabbix for per-machine monitoring, and Logwatch as well,
though not as comprehensively.
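For ad-hoc per-machine items, the usual Zabbix pattern is either an agent item
or pushing values with zabbix_sender; a tiny sketch of the latter (the server
name and item key are placeholders, and the key has to exist as a trapper item
on the Zabbix server):

#!/usr/bin/env python3
# Sketch: push an ad-hoc value into Zabbix with zabbix_sender.
# Server, host, and item key are placeholders, not a real configuration.
import socket
import subprocess

def send_to_zabbix(key, value, server="zabbix.example.com"):
    subprocess.check_call([
        "zabbix_sender",
        "-z", server,                    # Zabbix server/proxy to send to
        "-s", socket.gethostname(),      # host name as configured in Zabbix
        "-k", key,                       # item key (trapper item)
        "-o", str(value),                # value to store
    ])

if __name__ == "__main__":
    send_to_zabbix("custom.scratch.usage_pct", 73.5)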

We tried Grafana, InfluxDB, and this plugin
(http://slurm.schedmd.com/SLUG16/monitoring_influxdb_slug.pdf), but we didn't
find it as useful as we would have liked. It's a great plugin; we just didn't
need it.

cheers
L.


------
"The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic civics
is the insistence that we cannot ignore the truth, nor should we panic
about it. It is a shared consciousness that our institutions have failed
and our ecosystem is collapsing, yet we are still here — and we are
creative agents who can shape our destinies. Apocalyptic civics is the
conviction that the only way out is through, and the only way through is
together."

*Greg Bloom* @greggish https://twitter.com/greggish/status/873177525903609857
Benson Muite
2017-10-08 09:24:16 UTC
May also be of interest:

JobDigest – Detailed System Monitoring-Based Supercomputer Application
Behavior Analysis

Dmitry Nikitenko, Alexander Antonov, Pavel Shvets, Sergey Sobolev,
Konstantin Stefanov, Vadim Voevodin, Vladimir Voevodin and Sergey Zhumatiy

http://russianscdays.org/files/pdf17/185.pdf
----
Research Fellow of Distributed Systems
Institute of Computer Science
University of Tartu
J. Liivi 2 50409
Tartu, Estonia
http://kodu.ut.ee/~benson
l***@debian.org
2017-10-07 12:29:16 UTC
I'm using Ganglia for monitoring. No alerts, just node metrics like CPU and
network load, but it's nice to be able to look back at what happened in the past.
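If you ever want extra numbers on those graphs, gmetric is the usual hook; a
small sketch (the metric name and units are just an example, not something
from my setup):

#!/usr/bin/env python3
# Sketch: publish a custom metric into Ganglia with gmetric so it shows up
# alongside the built-in CPU/network graphs. Metric name and units are examples.
import subprocess

def gmetric(name, value, units="", metric_type="float"):
    subprocess.check_call([
        "gmetric",
        "--name=" + name,
        "--value=" + str(value),
        "--type=" + metric_type,
        "--units=" + units,
    ])

if __name__ == "__main__":
    gmetric("scratch_free_gb", 1234.5, units="GB")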
--
regards Thomas