Discussion:
[Beowulf] Cluster Hat
John Hearns via Beowulf
2017-08-04 13:16:56 UTC
Permalink
Reading the Register article on the IBM IData system which was moved from
Daresbury Labs to Durham Uni, one of the comments flagged this up:

http://climbers.net/sbc/clusterhat-review-raspberry-pi-zero/

That's rather neat I think!
So how many of these can we get in a 42U rack? ;-)
Faraz Hussain
2017-08-04 13:42:57 UTC
Permalink
Last year I built a two-node cluster using Pi 3s. I set up an NFS
filesystem using a thumb drive. I even installed the Slurm scheduler. I
use it as a web server through my home internet connection.

My to-dos are to make it publicly available for people to run small
jobs on. I also want to install Open MPI, but I am not sure that is
possible.
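
For anyone wanting to replicate that, a minimal sketch of such an NFS
export (the /mnt/usb mount point and the 192.168.1.x addresses are
illustrative assumptions, not the exact setup described above):

    # On the head node: share the thumb-drive filesystem over NFS
    sudo apt-get install nfs-kernel-server
    echo '/mnt/usb 192.168.1.0/24(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
    sudo exportfs -ra

    # On each compute node: mount the share (head node assumed at 192.168.1.10)
    sudo mount -t nfs 192.168.1.10:/mnt/usb /mnt/usb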
Post by John Hearns via Beowulf
Reading the Register article on the IBM IData system which was moved from
Daresbury Labs to Durham Uni, one of the comments flagged this up:
http://climbers.net/sbc/clusterhat-review-raspberry-pi-zero/
That's rather neat I think!
So how many of these can we get in a 42U rack? ;-)
Gus Correa
2017-08-04 20:03:30 UTC
Permalink
Post by Faraz Hussain
Last year I built a two-node cluster using Pi 3s. I set up an NFS
filesystem using a thumb drive. I even installed the Slurm scheduler. I
use it as a web server through my home internet connection.
My to-dos are to make it publicly available for people to run small
jobs on. I also want to install Open MPI, but I am not sure that is
possible.
There is a long 2015 thread on building Open MPI on the Raspberry Pi 2
in the Open MPI mailing list.
It was not conclusive, and apparently not successful.
Christopher Samuel
2017-08-07 02:02:20 UTC
Permalink
Post by Gus Correa
There is a long 2015 thread on building Open MPI on the Raspberry Pi 2
in the Open MPI mailing list.
It was not conclusive, and apparently not successful.
I contacted Paul Hargrove from the OMPI devel list, who does a lot of
testing on many architectures (including the RPi), about his recent
experiences with RPi testing there, and he wrote back saying:

# I test Open MPI release candidates on my Raspberry Pi.
# To the best of my knowledge the 1.10, 2.0, 2.1 and
# (pending) 3.0 branches all work.
# I am not using any special configure arguments.
#
# This is with Raspbian (Debian Jessie), in case that
# makes a difference.

Hope that helps..
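
For reference, a stock build along the lines Paul describes needs no
special treatment; a minimal sketch, assuming the 2.1.1 tarball has
already been downloaded (the version and install prefix are
illustrative):

    tar xjf openmpi-2.1.1.tar.bz2
    cd openmpi-2.1.1
    ./configure --prefix=$HOME/openmpi   # no special configure arguments
    make -j2 && make install             # -j2: a Pi has little RAM, don't over-parallelise
    $HOME/openmpi/bin/mpicc --version    # sanity check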

Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545

Faraz Hussain
2017-08-10 14:39:07 UTC
Permalink
One of our compute nodes runs ~30% slower than the others. It has the
exact same image, so I am baffled why it is running slow. I have
tested OpenMP and MPI benchmarks; everything runs slower. The CPU usage
goes to 2000%, so all looks normal there.

I thought it might have to do with CPU frequency scaling, i.e. when the
kernel changes the CPU speed depending on the workload. But we do not
have that enabled on these machines.
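
A quick way to double-check that scaling really is out of the picture,
assuming the cpufreq sysfs interface is present:

    # Governor and current speed per core; expect "performance" and ~nominal MHz
    cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
    grep 'cpu MHz' /proc/cpuinfo | sort | uniq -c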

Here is a snippet from "cat /proc/cpuinfo". Everything is identical to
our other nodes. Any suggestions on what else to check? I have tried
rebooting it.

processor : 19
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping : 4
cpu MHz : 2500.098
cache size : 25600 KB
physical id : 1
siblings : 10
core id : 12
cpu cores : 10
apicid : 56
initial apicid : 56
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor
ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2
x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida
arat xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
fsgsbase smep erms
bogomips : 5004.97
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:



John Hearns via Beowulf
2017-08-10 14:59:27 UTC
Permalink
Faraz,
I think you might have to buy me a virtual coffee. Or a beer!
Please look at the hardware health of that machine, specifically the
DIMMs. I have seen this before!
If you have some DIMMs which are faulty and are generating ECC errors,
and the mcelog service is enabled, then an interrupt is generated for
every ECC event. So the system is spending time servicing these
interrupts.

So: look in your /var/log/mcelog for hardware errors
Look in your /var/log/messages for hardware errors also
Look in the IPMI event logs for ECC errors: ipmitool sel elist

I would also bring that node down and boot it with memtester.
If there is a DIMM which is that badly faulty then memtester will discover
it within minutes.
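
As concrete commands, the checks above amount to something like this
(log paths vary by distro; memtester should be run on memory that is
otherwise free, ideally from single-user or a rescue environment):

    grep -i 'hardware error' /var/log/mcelog /var/log/messages
    ipmitool sel elist | grep -i -E 'ecc|memory'
    # exercise e.g. 2 GB of RAM for one pass
    memtester 2G 1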

Or it could be something else - in which case I get no coffee.

Also, Intel Cluster Checker is intended to deal with exactly these
situations.
What is your cluster manager, and is Intel Cluster Checker available to you?
I would seriously look at getting it installed.
Post by Faraz Hussain
One of our compute nodes runs ~30% slower than the others. It has the
exact same image, so I am baffled why it is running slow.
[rest of quoted message and /proc/cpuinfo listing snipped]
John Hearns via Beowulf
2017-08-10 15:00:11 UTC
Permalink
PS: Also look at "watch cat /proc/interrupts".
You might get a qualitative idea of a huge rate of interrupts.
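
A rough sketch for turning that into a number: sum all the counters in
/proc/interrupts a second apart and take the difference:

    sum_irqs() { awk '{for(i=2;i<=NF;i++) if($i ~ /^[0-9]+$/) s+=$i} END{print s}' /proc/interrupts; }
    t0=$(sum_irqs); sleep 1; t1=$(sum_irqs)
    echo "$((t1 - t0)) interrupts/sec"   # compare against a healthy node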
Post by John Hearns via Beowulf
Faraz,
I think you might have to buy me a virtual coffee. Or a beer!
Please look at the hardware health of that machine, specifically the
DIMMs. I have seen this before! [rest of quoted thread snipped]
John Hearns via Beowulf
2017-08-10 15:17:20 UTC
Permalink
Another thing to perhaps look at: are you seeing messages about thermal
throttling events in the system logs?
Could that node have a piece of debris caught in its air intake?

I don't think that will produce a 30% drop in performance. But I have
caught compute nodes with pieces of packaging sucked onto the front,
following careless people unpacking kit in machine rooms.
(Firm rule - no packaging in the machine room. This means you.)
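
The logs make that easy to rule in or out; something along these lines
should turn up any throttling the kernel noticed:

    dmesg | grep -i -E 'throttl|thermal'
    # per-core throttle counters, where these sysfs nodes exist
    grep . /sys/devices/system/cpu/cpu*/thermal_throttle/*_count 2>/dev/null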
Post by John Hearns via Beowulf
PS: Also look at "watch cat /proc/interrupts".
You might get a qualitative idea of a huge rate of interrupts.
[rest of quoted thread snipped]
Gus Correa
2017-08-10 17:45:11 UTC
Permalink
+ Leftover processes from previous jobs hogging resources.
That's relatively common.
That can trigger swapping, the ultimate performance killer.
"top" or "htop" on the node should show something (see the quick check
below).
(Will go away with a reboot, of course.)

Less likely, but possible:

+ Different BIOS configuration w.r.t. the other nodes.

+ Poorly seated memory, IB card, etc., or loose cable connections.

+ IPMI may need a hard reset.
Power down, remove the power cable, wait several minutes,
put the cable back, power on.
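
As a quick check for the first item, something like this on a
supposedly idle node (the thresholds are arbitrary):

    # Non-root processes still consuming noticeable CPU or memory
    ps -eo user,pid,pcpu,pmem,comm --sort=-pcpu | awk 'NR>1 && $1!="root" && ($3>1 || $4>1)'
    # And make sure nothing is dipping into swap
    free -m && vmstat 1 5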

Gus Correa
Post by John Hearns via Beowulf
Another thing to perhaps look at: are you seeing messages about thermal
throttling events in the system logs?
Could that node have a piece of debris caught in its air intake?
[rest of quoted thread snipped]
Andrew Holway
2017-08-10 18:04:15 UTC
Permalink
I put €10 on the nose for a faulty power supply.
Post by Gus Correa
+ Leftover processes from previous jobs hogging resources.
That's relatively common.
That can trigger swapping, the ultimate performance killer.
[rest of quoted thread snipped]
John Hearns via Beowulf
2017-08-10 18:33:37 UTC
Permalink
Ten euros for me on a faulty DIMM



From: Andrew Holway
Sent: Thursday, 10 August 2017 20:05
To: Gus Correa
Cc: Beowulf Mailing List
Subject: Re: [Beowulf] How to debug slow compute node?

I put €10 on the nose for a faulty power supply.

On 10 August 2017 at 19:45, Gus Correa <***@ldeo.columbia.edu> wrote:
+ Leftover processes from previous jobs hogging resources.
That's relatively common. [rest of quoted thread snipped]
Lance Wilson
2017-08-10 22:59:22 UTC
Permalink
Hi Faraz,
Another one that we have seen was a difference in the power profile of
the node. In certain situations it caused the node to keep the CPU
speed low, so top looked fine and everything looked fine, just slow. It
was a Dell box as well. It was interesting that there were so many
power settings that caused slowdowns with CentOS 7.
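
That failure mode is easy to catch once you know to look for it:
compare the clock the nodes actually run at while a benchmark is
loaded, e.g.:

    # Under load this should sit near nominal (or turbo) speed;
    # a stuck power profile shows up as a uniformly low reading
    watch -n1 "grep 'cpu MHz' /proc/cpuinfo | sort -u"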

Cheers,

Lance
--
Dr Lance Wilson
Senior HPC Consultant
Ph: 03 99055942 (+61 3 9905 5942)
Mobile: 0437414123 (+61 4 3741 4123)
Multi-modal Australian ScienceS Imaging and Visualisation Environment
(www.massive.org.au)
Monash University
Post by John Hearns via Beowulf
Ten euros for me on a faulty DIMM
[rest of quoted thread snipped]
Skylar Thompson
2017-08-11 03:17:00 UTC
Permalink
We ran into something similar, though it turned out to be a microcode
bug in the CPU that caused it to remain stuck in its lowest power
state. Fortunately it was easily testable with "perf stat", so it was
pretty clear which nodes were impacted, which also happened to have
been bought as a batch with a unique CPU version. By the time we did
our legwork, the vendor had independently announced a fix for the
problem, so I guess we could have just saved ourselves some work and
waited...
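
For anyone wanting to reproduce that kind of test, even a trivial busy
loop under "perf stat" reports the effective clock: run it on a fast
and a slow node side by side and read the GHz figure off the cycles
line.

    perf stat -- timeout 10 sh -c 'while :; do :; done'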

Skylar
Post by Lance Wilson
Hi Faraz,
Another one that we have seen was a difference in the power profile of
the node. [rest of quoted thread snipped]
Faraz Hussain
2017-08-10 18:29:46 UTC
Permalink
Thanks for the tips! Unfortunately, I am not seeing anything of
interest in /var/log. The mcelog service is not enabled. I do not see
anything in /proc/interrupts either.

I will look into a full power down, memtester and a firmware update. It
is a blade. We do not have Intel Cluster Checker, but we have the DRAC
(Dell Remote Access Controller). I just logged in there and everything
checks out, i.e. memory, power, etc.
Post by John Hearns via Beowulf
Another thing to perhaps look at: are you seeing messages about thermal
throttling events in the system logs? [rest of quoted thread snipped]
Rushat Rai
2017-08-10 15:25:20 UTC
Permalink
Hi, my first post here.

Anyway, I agree with John: I've seen debris caught up in intakes
causing some performance drop. 30% does seem a little excessive, but
you should check first.

I don't know if this has been mentioned, but ECC could be slowing down
that specific node if it has a faulty stick.

I would also like to know if it is in the exact same environment as the
rest. Is it close to an air-conditioner exhaust, or something similar?
Have you checked the thermals for that specific node compared to the
others?

Let me know

On Thursday 10 August 2017 08:47 PM, John Hearns via Beowulf wrote:
Another thing to perhaps look at: are you seeing messages about thermal
throttling events in the system logs? [rest of quoted thread snipped]
Robert Horton
2017-08-10 15:16:51 UTC
Permalink
As John says, I'd start by checking the health of things like memory,
power supplies etc.

I've seen things like this go away after a firmware update, so I'd
suggest updating the BIOS etc. if you can.

Have you tried completely removing the power for a few minutes then
booting up again?

Any idea when the problem started? I presume from the CPU that it's not
a new system. What physical form is it (1U server / blade, etc.)?

Rob
Post by Faraz Hussain
One of our compute nodes runs ~30% slower than the others. It has the
exact same image, so I am baffled why it is running slow.
[rest of quoted message and /proc/cpuinfo listing snipped]
Andrew Latham
2017-08-10 18:28:30 UTC
Permalink
In general, if you have a snowflake you need to take some steps:
1. Unrack it and remove it from the population
2. Image and document the system
3. Sniff test, visual test, power-on fans-spinning test in a lab
4. Understand that it is OK for one system out of X (where X could be
1000) to fail
5. Return the system to the rack if drive/image replacement resolves
the issue
6. Return the system to the supplier if the above fails
7. Keep moving; don't spend hours that equate to the cost of the node
troubleshooting it, unless the capital budget is super tricky
8. Keep a dialogue with the supplier all the time to say that
everything is awesome, so they are interested when the status changes
9. Don't troubleshoot in production, ever...
Post by Faraz Hussain
One of our compute nodes runs ~30% slower than the others. It has the
exact same image, so I am baffled why it is running slow.
[rest of quoted message and /proc/cpuinfo listing snipped]
--
- Andrew "lathama" Latham ***@gmail.com http://lathama.com -
Chris Samuel
2017-08-12 03:35:57 UTC
Permalink
Post by Faraz Hussain
I thought it might have to do with CPU frequency scaling, i.e. when the
kernel changes the CPU speed depending on the workload. But we do not
have that enabled on these machines.
Just to add to the excellent suggestions from others: have you compared BIOS/
UEFI settings & versions across these nodes to ensure they're identical?

Also remember that the kernel can enable C-states that hurt performance
even if they are disabled in the BIOS/UEFI. This was painfully apparent
on our first Sandy Bridge cluster, which almost failed the performance
part of acceptance testing until the cause was found.

Now we boot all nodes with this in the kernel cmdline:

intel_idle.max_cstate=0 processor.max_cstate=1 intel_pstate=disable
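
For completeness, a sketch of how those options typically get applied
and verified on an EL-style system (GRUB paths vary by distro):

    # /etc/default/grub
    GRUB_CMDLINE_LINUX="... intel_idle.max_cstate=0 processor.max_cstate=1 intel_pstate=disable"

    grub2-mkconfig -o /boot/grub2/grub.cfg   # regenerate, then reboot
    cat /proc/cmdline                        # confirm the options took effect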

Best of luck!
Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545

William Johnson
2017-08-12 07:35:46 UTC
Permalink
This may be a long shot, especially in a server room where everything
else is working as expected.
It may be the case that there is nothing wrong with the machine itself,
but rather with the level of power supplied to the machine by the
building's wiring.
I have seen incorrectly supplied power levels cause unpredictable
behaviour in a machine.
But, as I said, it is a long shot.
Post by Chris Samuel
Just to add to the excellent suggestions from others: have you compared
BIOS/UEFI settings & versions across these nodes to ensure they're
identical? [rest of quoted message snipped]
Christopher Samuel
2017-08-14 01:10:36 UTC
Permalink
Post by William Johnson
This may be a long shot, especially in a server room where everything
else is working as expected.
Oh, agreed! But given people have covered a lot of other bases I
thought I'd throw something in from my own experience. If all nodes
boot the same OS image then you'd not expect the kernel command lines
etc. to differ, but the UEFI settings might (depending on how they are
usually configured).

cheers,
Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545

Lachlan Musicman
2017-08-13 22:17:38 UTC
Permalink
Post by Chris Samuel
Also remember that the kernel can enable C-states that hurt performance
even if they are disabled in the BIOS/UEFI. This was painfully apparent
on our first Sandy Bridge cluster, which almost failed the performance
part of acceptance testing until the cause was found.
intel_idle.max_cstate=0 processor.max_cstate=1 intel_pstate=disable
Chris,

Can you point to some good documentation on this?

cheers
L.



------
"The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic civics
is the insistence that we cannot ignore the truth, nor should we panic
about it. It is a shared consciousness that our institutions have failed
and our ecosystem is collapsing, yet we are still here — and we are
creative agents who can shape our destinies. Apocalyptic civics is the
conviction that the only way out is through, and the only way through is
together. "

*Greg Bloom* @greggish
https://twitter.com/greggish/status/873177525903609857
Christopher Samuel
2017-08-14 01:07:26 UTC
Permalink
Post by Lachlan Musicman
Can you point to some good documentation on this?
There is some on Mellanox's website:

http://www.mellanox.com/related-docs/prod_software/Mellanox_EN_for_Linux_User_Manual_v2_0-3_0_0.pdf

But it took weeks for $VENDOR to figure out what was going on and why
performance was so bad. It wasn't until they got Mellanox into the
calls that Mellanox pointed this out to them.

cheers,
Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545

Bill Broadley via Beowulf
2017-08-17 00:14:59 UTC
Permalink
Post by Rushat Rai
One of our compute nodes runs ~30% slower than the others. It has the
exact same image, so I am baffled why it is running slow.
[rest of quoted message snipped]
We got some Supermicro dual-socket nodes without the little plastic air
guides. They thermally throttled really quickly.

I've also seen nodes fall back to one memory channel because the DIMMs
were in the wrong slots.

I suggest comparing the physical nodes: double-check the fans (which
should be spinning), air conduits, DIMM placement, etc. Then check
dmesg, syslog and temperatures, and compare a fast node to a slow node.
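
Both the DIMM placement and its effect are visible from software, as a
rough cross-check against a known-good node:

    # Slot population (needs root); empty channels show "No Module Installed"
    dmidecode -t memory | grep -E 'Locator:|Size:'
    # Then compare measured memory bandwidth, e.g. with the STREAM benchmark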


mathog
2017-08-11 21:42:05 UTC
Permalink
Post by Rushat Rai
I don't know if this has been mentioned, but ECC could be slowing down
that specific node if it has a faulty stick.
To find the bad stick one often must disable ECC; at least that was the
case many years ago, the last time I ran into it. If ECC is enabled,
even if the stick is somewhat defective, it may still pass memtest86+.
That utility will show whether ECC is enabled or not, and the ECC
disable, if there is one, is set in the motherboard BIOS.
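
On a live system with EDAC support in the kernel, the corrected-error
counters can also finger the failing DIMM without disabling ECC (a
sketch; needs the edac-utils package, and the sysfs layout varies by
kernel version):

    edac-util -v                                                  # per-DIMM error counts
    grep . /sys/devices/system/edac/mc/mc*/csrow*/ce_count 2>/dev/null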

I'm late to this thread: does this node have a local disk? Failing
disks can really slow things down if the device has to read the same
block many times before it succeeds. That usually shows up in smartctl.
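
For example (the device name is illustrative):

    smartctl -H /dev/sda                                  # overall health verdict
    smartctl -A /dev/sda | grep -i -E 'realloc|pending|uncorrect'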

What sort of network connection? Try swapping those cables. Also run
the network throughput test of your choice; if the problem is there,
those tests will reveal it.
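
For instance with iperf3, assuming it is installed on both ends
(<good-node> is a placeholder):

    iperf3 -s                  # on a known-good node
    iperf3 -c <good-node>      # on the suspect node; compare with a good pair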

"sensors" should show roughly the same values as the other nodes, if
not, figure out why. As others have suggested that could be blocked
ventilation, but more often in my experience it is a fan on the way
out.

Regards,

David Mathog
***@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech