Discussion:
[Beowulf] Varying performance across identical cluster nodes.
Prentice Bisbal
2017-09-08 18:41:14 UTC
Beowulfers,

I need your assistance debugging a problem:

I have a dozen servers that are all identical hardware: SuperMicro
servers with AMD Opteron 6320 processors. Ever since we upgraded to
CentOS 6, the users have been complaining of wildly inconsistent
performance across these 12 nodes. I ran LINPACK on these nodes and was
able to duplicate the problem, with performance varying from ~14 GFLOPS
to 64 GFLOPS.

I've identified that performance on the slower nodes starts off fine
and then slowly degrades throughout the LINPACK run. For example, on a
node with this problem, during the first LINPACK test, I can see the
performance drop from 115 GFLOPS down to 11.3 GFLOPS. That constant
downward trend continues throughout the remaining tests. At the start of
subsequent tests, performance will jump up to about 9-10 GFLOPS, but
then drop to 5-6 GFLOPS at the end of the test.

Because of the nature of this problem, I suspect this might be a thermal
issue. My guess is that the processor speed is being throttled to
prevent overheating on the "bad" nodes.

But here's the thing: this wasn't a problem until we upgraded to CentOS
6. Where I work, we use a read-only NFSroot filesystem for our cluster
nodes, so all nodes are mounting and using the same exact read-only
image of the operating system. This only happens with these SuperMicro
nodes, and only with the CentOS 6 on NFSroot. RHEL5 on NFSroot worked
fine, and when I installed CentOS 6 on a local disk, the nodes worked fine.

Any ideas where to look or what to tweak to fix this? Any idea why this
is only occurring with RHEL 6 w/ NFS root OS?
--
Prentice

_______________________________________________
Beowulf mailing list, ***@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
Andrew Latham
2017-09-08 18:56:23 UTC
Shooting from the hip:
1. BIOS identical version and settings
2. Firmware on device (I assume nothing just thinking out loud)
3. Re-seat fans/replace (oxidized contacts - silly but why not)
4. Verify the power supplies are identical (various wattages etc.; maybe
swap out and test)
5. Memory cooling heat-sinks? (have seen identical orders with different
memory, some with heatsinks)
6. Thermal paste
7. Blank panels on empty drive bays
8. Location in rack/room
9. Blanking on rack

Shared to promote thought
--
- Andrew "lathama" Latham ***@gmail.com http://lathama.com <http://lathama.org> -
Lux, Jim (337C)
2017-09-08 23:15:01 UTC
Do you have a temperature probe? One of those IR thermometers?
A FLIR One camera for your phone?

Then you can quickly check things like heat sink temperatures and surroundings. Air temp is hard to measure quickly and accurately.

Jim Lux
(818)354-2075 (office)
(818)395-2714 (cell)

Skylar Thompson
2017-09-08 19:42:08 UTC
I would also suspect a thermal issue. To verify a temperature problem,
you might try setting up lm_sensors or scraping "ipmitool sdr" output
(whichever is easier) at regular intervals and making a
performance-vs-temperature plot for each node. As Andrew mentioned, it
could also be firmware/CPU microcode. We recently tracked down a problem
with some of our nodes that ended up being microcode-related; the CPUs
would start in a high-power state, but end up getting stuck in a low-power
state, regardless of what power management settings we had set in the BIOS.
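A minimal sketch of that scraping, assuming ipmitool is installed and the BMC answers (the log file name and 10-second interval are arbitrary choices; the sample line format matches typical "ipmitool sdr" output):

```shell
# parse_temp: pull the numeric temperature out of one "ipmitool sdr" line,
# e.g. "CPU1 Temp | 35 degrees C | ok" becomes "35"
parse_temp() {
  awk -F'|' '{ gsub(/degrees C/, "", $2); gsub(/[[:space:]]/, "", $2); print $2 }'
}

# Intended use (requires ipmitool and a working BMC); pair the timestamped
# log with HPL's own timestamps to plot performance vs. temperature:
#   while true; do
#     echo "$(date +%s) $(ipmitool sdr | grep 'CPU1 Temp' | parse_temp)" >> cpu1_temp.log
#     sleep 10
#   done

echo "CPU1 Temp        | 35 degrees C      | ok" | parse_temp   # -> 35
```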

Skylar
Bill Broadley
2017-09-09 02:28:30 UTC
The last time I saw this problem, it was because the chassis was missing
the air redirection guides and not enough air was getting to the CPUs.

The OS upgrade might actually be enabling better throttling to keep the CPU cooler.
Joe Landman
2017-09-09 03:56:28 UTC
Post by Prentice Bisbal
But here's the thing: this wasn't a problem until we upgraded to
CentOS 6. Where I work, we use a read-only NFSroot filesystem for our
cluster nodes, so all nodes are mounting and using the same exact
read-only image of the operating system. This only happens with these
SuperMicro nodes, and only with the CentOS 6 on NFSroot. RHEL5 on
NFSroot worked fine, and when I installed CentOS 6 on a local disk,
the nodes worked fine.
Any ideas where to look or what to tweak to fix this? Any idea why
this is only occurring with RHEL 6 w/ NFS root OS?
Sounds suspiciously like a network or other driver running hard in a
tight polling mode, causing a growing number of CSW/ints over time. Since
these are Opterons (really? still in use?), chances are you have a
firmware issue on the set of slower nodes that had been corrected on
the other nodes. With NFS root, if you have a node locking a
particular file that the other nodes want to write to, a node can
appear slow while it waits on the IO.

You might try running dstat from boot onwards, saving its output into a
file. Then run the tests, and see if the int or CSW counts are being
driven very high. Pay attention to the usr/idl and other percentages.
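A sketch of that capture, assuming dstat is installed (the file name and 5-second interval are arbitrary). The summarizing helper is exercised against a fabricated two-line sample, since the real CSV depends on the node:

```shell
# Capture timestamps, CPU percentages, and interrupt/context-switch rates
# to CSV from boot onwards (e.g. from an init script or a screen session):
#   dstat --time --cpu --sys --output node_stats.csv 5

# avg_csw: average the last (csw) column, keeping only numeric rows so
# dstat's header lines are skipped automatically
avg_csw() {
  awk -F, '$NF ~ /^[0-9.]+$/ { sum += $NF; n++ } END { if (n) printf "%.0f\n", sum / n }'
}

printf 'time,int,csw\n1,1000,2000\n2,1000,4000\n' | avg_csw   # -> 3000
```

Comparing this average between a good node and a slow node over a full LINPACK run should show whether interrupts/context switches really climb over time.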

You can also grab temperature stats. Helps if you have ipmi.

ipmitool sdr

ipmitool sdr | grep Temp
CPU1 Temp | 35 degrees C | ok
CPU2 Temp | 35 degrees C | ok
System Temp | 35 degrees C | ok
Peripheral Temp | 38 degrees C | ok
PCH Temp | 43 degrees C | ok

If you don't have IPMI, use lm_sensors:

sensors
Package id 1: +35.0°C (high = +82.0°C, crit = +92.0°C)
Core 0: +35.0°C (high = +82.0°C, crit = +92.0°C)
Core 1: +35.0°C (high = +82.0°C, crit = +92.0°C)
Core 2: +33.0°C (high = +82.0°C, crit = +92.0°C)
Core 3: +34.0°C (high = +82.0°C, crit = +92.0°C)
...
--
Joe Landman
e: ***@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

Christopher Samuel
2017-09-10 23:53:00 UTC
Post by Prentice Bisbal
Any ideas where to look or what to tweak to fix this? Any idea why this
is only occurring with RHEL 6 w/ NFS root OS?
No ideas, but in addition to what others have suggested:

1) diff the output of dmidecode between 4 nodes, 2 OK and 2 slow to see
what differences there are in common (if any) between the OK & slow
nodes. I would think you would only see serial number and UUID
differences (certainly that's what I see here for our gear).

2) reboot an idle OK node and a slow node and immediately capture the
output of dmesg on both, then diff that. Hopefully that will reveal
any differences in kernel boot options, driver messages, power saving
settings, etc., that might be implicated.
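One way to script the first comparison (a sketch; `ok1` and `slow1` are placeholder hostnames, and root/ssh access is assumed). The filter that drops the per-machine fields is exercised on a fabricated sample:

```shell
# strip_unique: drop the fields expected to differ between otherwise
# identical nodes (serial numbers and UUIDs)
strip_unique() {
  grep -Ev 'Serial Number|UUID'
}

# Intended use: collect and diff dmidecode output from an OK and a slow node
#   ssh ok1   dmidecode | strip_unique > dmi.ok1
#   ssh slow1 dmidecode | strip_unique > dmi.slow1
#   diff dmi.ok1 dmi.slow1

printf 'Serial Number: 01234\nBIOS Revision: 3.2\nUUID: deadbeef\n' | strip_unique
```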

Good luck!
Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545

Prentice Bisbal
2017-09-13 17:48:24 UTC
Okay, based on the various responses I've gotten here and on other
lists, I feel I need to clarify things:

This problem only occurs when I'm running our NFSroot-based version of
the OS (CentOS 6). When I run the same OS installed on a local disk, I
do not have this problem, using the same exact server(s). For testing
purposes, I'm using LINPACK, running the same executable with the same
HPL.dat file in both instances.

Because I'm testing the same hardware with different OS installations,
this should rule out the BIOS and faulty hardware. That leads me to
believe it's most likely a software configuration issue, like a kernel
tuning parameter or some other setting.

These are Supermicro servers, and it seems they do not provide CPU
temps. I do see a chassis temp, but not the temps of the individual
CPUs. While I agree that should be the first thing I look at, it's not
an option for me. Other tools like FLIR and Infrared thermometers aren't
really an option for me, either.

What software configuration, whether a kernel parameter, the
configuration of numad or cpuspeed, or some other setting, could affect
this?
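One concrete thing to compare between the NFSroot and local-disk boots is the cpufreq state, since cpuspeed drives it on RHEL/CentOS 6. A sketch (the sysfs paths are standard; the 90% threshold in the helper is an arbitrary choice):

```shell
# throttled: given "cur_freq max_freq" (kHz) on stdin, flag a CPU whose
# current clock sits well below its rated maximum
throttled() {
  awk '{ if ($1 < 0.9 * $2) print "THROTTLED"; else print "OK" }'
}

# Intended use on a node: dump governor and clocks for every core, e.g.
#   for d in /sys/devices/system/cpu/cpu*/cpufreq; do
#     echo "$d: $(cat $d/scaling_governor) \
#       $(cat $d/scaling_cur_freq) $(cat $d/cpuinfo_max_freq)"
#   done

echo "1400000 2800000" | throttled   # -> THROTTLED (stuck at half clock)
```

If the slow nodes show "ondemand" or "powersave" with depressed scaling_cur_freq under the NFSroot image but not the local one, that points straight at cpuspeed/governor configuration.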

Prentice
Andrew Latham
2017-09-13 18:14:32 UTC
Ack. So maybe validate that you can reproduce this with another NFS
root, perhaps a lab setup where a single server serves NFS root to the
node. If you could reproduce it that way, it would give some direction.
Beyond that, it sounds like an interesting problem.
--
- Andrew "lathama" Latham ***@gmail.com http://lathama.com <http://lathama.org> -
Scott Atchley
2017-09-13 18:15:52 UTC
Are you swapping?
Scott Atchley
2017-09-13 18:16:56 UTC
Are you logging something that goes to the local disk in the local case,
but that competes for network bandwidth when NFS mounting?
Prentice Bisbal
2017-09-14 13:24:16 UTC
Another good question. The systems with the NFSroot OS still have a
local disk, and that local disk has a /var partition where logs are
written. Both systems do send some logs to a remote log server. While
the /etc/rsyslog.conf files were almost identical, I copied the one from
the NFSroot system to the local-OS system to make sure they were
identical. This had no impact on the performance of xhpl.

Prentice
Prentice Bisbal
2017-09-14 13:14:39 UTC
Good question. I just checked using vmstat. When running xhpl on both
systems, vmstat shows only zeros for si and so, even long after the
performance degrades on the NFSroot instance. Just to be sure, I
double-checked with top, which shows 0k of swap being used.
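For the record, that check scripts easily (a sketch; in stock vmstat output si and so are columns 7 and 8, after the procs and memory groups):

```shell
# swap_activity: sum the si and so columns across all vmstat samples,
# skipping the two header lines; a nonzero total means real swapping
swap_activity() {
  awk 'NR > 2 { total += $7 + $8 } END { print total + 0 }'
}

# Intended use while xhpl runs: ten 5-second samples
#   vmstat 5 10 | swap_activity

printf 'procs memory swap io\n r b si so\n 1 0 0 0 100 200 0 0 5 9\n' | swap_activity   # -> 0
```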

Prentice
Post by Scott Atchley
Are you swapping?
ack, so maybe validate you can reproduce with another nfs root.
Maybe a lab setup where a single server is serving nfs root to the
node. If you could reproduce in that way then it would give some
direction. Beyond that it sounds like an interesting problem.
On Wed, Sep 13, 2017 at 12:48 PM, Prentice Bisbal
Okay, based on the various responses I've gotten here and on
This problem only occurs when I'm running our NFSroot based
version of the OS (CentOS 6). When I run the same OS installed
on a local disk, I do not have this problem, using the same
exact server(s).  For testing purposes, I'm using LINPACK, and
running the same executable  with the same HPL.dat file in
both instances.
Because I'm testing the same hardware using different OSes,
this (should) eliminate the problem being in the BIOS, and
faulty hardware. This leads me to believe it's most likely a
software configuration issue, like a kernel tuning parameter,
or some other software configuration issue.
These are Supermicro servers, and it seems they do not provide
CPU temps. I do see a chassis temp, but not the temps of the
individual CPUs. While I agree that should be the first thing
I look at, it's not an option for me. Other tools like FLIR
and Infrared thermometers aren't really an option for me, either.
What software configuration, either a kernel parameter,
configuration of numad or cpuspeed, or some other setting,
could affect this?
Prentice
Beowulfers,
SuperMicro servers with AMD Opteron 6320 processors. Ever
since we upgraded to CentOS 6, the users have been
complaining of wildly inconsistent performance across
these 12 nodes. I ran LINPACK on these nodes, and was able
to duplicate the problem, with performance varying from
~14 GFLOPS to 64 GFLOPS.
I've identified that performance on the slower nodes
starts off fine, and then slowly degrades throughout the
LINPACK run. For example, on a node with this problem,
during first LINPACK test, I can see the performance drop
from 115 GFLOPS down to 11.3 GFLOPS. That constant,
downward trend continues throughout the remaining tests.
At the start of subsequent tests, performance will jump up
to about 9-10 GFLOPS, but then drop to 5-6 GFLOPS at the
end of the test.
Because of the nature of this problem, I suspect this
might be a thermal issue. My guess is that the processor
speed is being throttled to prevent overheating on the
"bad" nodes.
But here's the thing: this wasn't a problem until we
upgraded to CentOS 6. Where I work, we use a read-only
NFSroot filesystem for our cluster nodes, so all nodes are
mounting and using the same exact read-only image of the
operating system. This only happens with these SuperMicro
nodes, and only with the CentOS 6 on NFSroot. RHEL5 on
NFSroot worked fine, and when I installed CentOS 6 on a
local disk, the nodes worked fine.
Any ideas where to look or what to tweak to fix this? Any
idea why this is only occurring with RHEL 6 w/ NFS root OS?
John Hearns via Beowulf
2017-09-14 13:25:58 UTC
Permalink
Prentice, as I understand it the problem here is that with the same OS
and IB drivers, there is a big difference in performance between stateful
and NFS root nodes.
Throwing my hat into the ring, try looking to see if there is an
excessive rate of interrupts in the nfsroot case, coming from the network
card:

watch cat /proc/interrupts

You will probably need a large terminal window for this (or probably there
is a way to filter the output)
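One way to filter it, sketched below (assumes the stock /proc/interrupts layout: an IRQ label, per-CPU count columns, then a description in the last field; adjust the parsing if your kernel's layout differs):

```shell
# Print only the interrupt sources that fired between two snapshots of
# /proc/interrupts, with the count delta, sorted busiest-first. Assumes
# the stock layout: per-CPU counts after the IRQ label, description last.
irq_delta() {   # usage: irq_delta <before-file> <after-file>
    awk 'NR == FNR { for (i = 2; i < NF; i++) sum[$1] += $i; next }
         FNR > 1   { n = 0; for (i = 2; i < NF; i++) n += $i
                     if (n > sum[$1]) printf "%-8s %10d  %s\n", $1, n - sum[$1], $NF }' \
        "$1" "$2"
}

before=$(mktemp) && after=$(mktemp)
cat /proc/interrupts > "$before" 2>/dev/null || true
sleep 1
cat /proc/interrupts > "$after" 2>/dev/null || true
irq_delta "$before" "$after" | sort -k2 -rn
rm -f "$before" "$after"
```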
Post by Prentice Bisbal
Good question. I just checked using vmstat. When running xhpl on both
systems, vmstat shows only zeros for si and so, even long after the
performance degrades on the nfsroot instance. Just to be sure, I
double-checked with top, which shows 0k of swap being used.
Prentice
Are you swapping?
Post by Andrew Latham
ack, so maybe validate you can reproduce with another nfs root. Maybe a
lab setup where a single server is serving nfs root to the node. If you
could reproduce in that way then it would give some direction. Beyond that
it sounds like an interesting problem.
Post by Prentice Bisbal
Okay, based on the various responses I've gotten here and on other
This problem only occurs when I'm running our NFSroot based version of
the OS (CentOS 6). When I run the same OS installed on a local disk, I do
not have this problem, using the same exact server(s). For testing
purposes, I'm using LINPACK, and running the same executable with the same
HPL.dat file in both instances.
Because I'm testing the same hardware using different OSes, this
(should) eliminate the problem being in the BIOS, and faulty hardware. This
leads me to believe it's most likely a software configuration issue, like a
kernel tuning parameter, or some other software configuration issue.
These are Supermicro servers, and it seems they do not provide CPU
temps. I do see a chassis temp, but not the temps of the individual CPUs.
While I agree that should be the first thing I look at, it's not an option
for me. Other tools like FLIR and Infrared thermometers aren't really an
option for me, either.
What software configuration, either a kernel parameter, configuration
of numad or cpuspeed, or some other setting, could affect this?
Prentice
Post by Prentice Bisbal
Beowulfers,
I have a dozen servers that are all identical hardware: SuperMicro
servers with AMD Opteron 6320 processors. Ever since we upgraded to CentOS
6, the users have been complaining of wildly inconsistent performance
across these 12 nodes. I ran LINPACK on these nodes, and was able to
duplicate the problem, with performance varying from ~14 GFLOPS to 64
GFLOPS.
I've identified that performance on the slower nodes starts off fine,
and then slowly degrades throughout the LINPACK run. For example, on a node
with this problem, during first LINPACK test, I can see the performance
drop from 115 GFLOPS down to 11.3 GFLOPS. That constant, downward trend
continues throughout the remaining tests. At the start of subsequent tests,
performance will jump up to about 9-10 GFLOPS, but then drop to 5-6 GFLOPS
at the end of the test.
Because of the nature of this problem, I suspect this might be a
thermal issue. My guess is that the processor speed is being throttled to
prevent overheating on the "bad" nodes.
But here's the thing: this wasn't a problem until we upgraded to CentOS
6. Where I work, we use a read-only NFSroot filesystem for our cluster
nodes, so all nodes are mounting and using the same exact read-only image
of the operating system. This only happens with these SuperMicro nodes, and
only with the CentOS 6 on NFSroot. RHEL5 on NFSroot worked fine, and when I
installed CentOS 6 on a local disk, the nodes worked fine.
Any ideas where to look or what to tweak to fix this? Any idea why this
is only occurring with RHEL 6 w/ NFS root OS?
Joe Landman
2017-09-14 13:29:17 UTC
Permalink
Post by John Hearns via Beowulf
Prentice, as I understand it the problem here is that with the same
OS and IB drivers, there is a big difference in performance between
stateful and NFS root nodes.
Throwing my hat into the ring, try looking to see if there is an
excessive rate of interrupts in the nfsroot case, coming from the
watch cat /proc/interrupts
You will probably need a large terminal window for this (or probably
there is a way to filter the output)
dstat is helpful here.
Post by John Hearns via Beowulf
Good question. I just checked using vmstat. When running xhpl on
both systems, vmstat shows only zeros for si and so, even long
after the performance degrades on the nfsroot instance. Just to be
sure, I double-checked with top, which shows 0k of swap being used.
Prentice
Post by Scott Atchley
Are you swapping?
ack, so maybe validate you can reproduce with another nfs
root. Maybe a lab setup where a single server is serving nfs
root to the node. If you could reproduce in that way then it
would give some direction. Beyond that it sounds like an
interesting problem.
On Wed, Sep 13, 2017 at 12:48 PM, Prentice Bisbal
Okay, based on the various responses I've gotten here and
This problem only occurs when I'm running our NFSroot
based version of the OS (CentOS 6). When I run the same
OS installed on a local disk, I do not have this problem,
using the same exact server(s). For testing purposes,
I'm using LINPACK, and running the same executable with
the same HPL.dat file in both instances.
Because I'm testing the same hardware using different
OSes, this (should) eliminate the problem being in the
BIOS, and faulty hardware. This leads me to believe it's
most likely a software configuration issue, like a kernel
tuning parameter, or some other software configuration issue.
These are Supermicro servers, and it seems they do not
provide CPU temps. I do see a chassis temp, but not the
temps of the individual CPUs. While I agree that should
be the first thing I look at, it's not an option for me.
Other tools like FLIR and Infrared thermometers aren't
really an option for me, either.
What software configuration, either a kernel parameter,
configuration of numad or cpuspeed, or some other
setting, could affect this?
Prentice
Beowulfers,
I have a dozen servers that are all identical
hardware: SuperMicro servers with AMD Opteron 6320
processors. Ever since we upgraded to CentOS 6, the
users have been complaining of wildly inconsistent
performance across these 12 nodes. I ran LINPACK on
these nodes, and was able to duplicate the problem,
with performance varying from ~14 GFLOPS to 64 GFLOPS.
I've identified that performance on the slower nodes
starts off fine, and then slowly degrades throughout
the LINPACK run. For example, on a node with this
problem, during first LINPACK test, I can see the
performance drop from 115 GFLOPS down to 11.3 GFLOPS.
That constant, downward trend continues throughout
the remaining tests. At the start of subsequent
tests, performance will jump up to about 9-10 GFLOPS,
but then drop to 5-6 GFLOPS at the end of the test.
Because of the nature of this problem, I suspect this
might be a thermal issue. My guess is that the
processor speed is being throttled to prevent
overheating on the "bad" nodes.
But here's the thing: this wasn't a problem until we
upgraded to CentOS 6. Where I work, we use a
read-only NFSroot filesystem for our cluster nodes,
so all nodes are mounting and using the same exact
read-only image of the operating system. This only
happens with these SuperMicro nodes, and only with
the CentOS 6 on NFSroot. RHEL5 on NFSroot worked
fine, and when I installed CentOS 6 on a local disk,
the nodes worked fine.
Any ideas where to look or what to tweak to fix this?
Any idea why this is only occurring with RHEL 6 w/ NFS
root OS?
--
Joe Landman
e: ***@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

Faraz Hussain
2017-09-14 15:34:59 UTC
Permalink
Earlier I had posted about one of our blades running 30-50% slower
than other ones despite having identical hardware and OS. I followed
the suggestions and compared cpu temperature, memory, dmesg and
sysctl. Everything looks the same.

I then used "perf stat" to compare the speed of pigz (parallel gzip).
The results are quite interesting. Using one CPU, the slow blade is as
fast as the rest! But as I use more CPUs, the speed decreases linearly
from 3.1 GHz to 0.4 GHz. See snippets from the "perf stat" command below.
All tests were on /tmp to eliminate any NFS issue. And the same behavior
is observed with any multi-threaded program.

Healthy blade 1 cpu:

Performance counter stats for './pigz -p 1 some200MBfile':

6441.560969 task-clock # 1.001 CPUs utilized
21,230,248,729 cycles # 3.296 GHz
6.435670580 seconds time elapsed

Slow blade 1 cpu:

Performance counter stats for './pigz -p 1 some200MBfile':

6857.933315 task-clock # 1.001 CPUs utilized
21,412,281,401 cycles # 3.122 GHz
6.851644289 seconds time elapsed

Healthy blade 20 cpus:

Performance counter stats for './pigz -p 20 some200MBfile':

7570.967306 task-clock # 16.367 CPUs utilized
21,913,797,346 cycles # 2.894 GHz
0.462575439 seconds time elapsed

Slow blade 20 cpus:

Performance counter stats for './pigz -p 20 some200MBfile':

63404.802003 task-clock # 19.524 CPUs utilized
24,834,879,081 cycles # 0.392 GHz
3.247597619 seconds time elapsed
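The "GHz" column perf prints is just cycles divided by task-clock, so the comparison is easy to script across many nodes; a sketch using the counters above:

```shell
# perf's "GHz" column is cycles / task-clock; recompute it from the raw
# counters so scripted runs on many nodes can be compared side by side.
eff_ghz() {   # usage: eff_ghz <cycles> <task-clock in ms>
    awk -v c="$1" -v ms="$2" 'BEGIN { printf "%.3f\n", c / (ms * 1e6) }'
}

eff_ghz 21230248729 6441.560969     # healthy blade, 1 cpu   -> 3.296
eff_ghz 24834879081 63404.802003    # slow blade,   20 cpus  -> 0.392
```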



Joe Landman
2017-09-14 15:42:44 UTC
Permalink
Post by Faraz Hussain
Earlier I had posted about one of our blades running 30-50% slower
than other ones despite having identical hardware and OS. I followed
the suggestions and compared cpu temperature, memory, dmesg and
sysctl. Everything looks the same.
I then used "perf stat" to compare the speed of pigz (parallel gzip).
The results are quite interesting. Using one CPU, the slow blade is as
fast as the rest! But as I use more CPUs, the speed decreases linearly
from 3.1 GHz to 0.4 GHz. See snippets from the "perf stat" command below.
All tests were on /tmp to eliminate any NFS issue. And the same behavior
is observed with any multi-threaded program.
What does numastat report? /tmp is a ramdisk or tmpfs? Are the
nodes/cpus otherwise idle? What does lscpu on a good/bad node report?

If it decreases on a 1/Ncpu curve, then you have a fixed-size resource
bandwidth contention issue you are fighting. The question is what.
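To make that shape concrete, a toy model: each of N cores gets min(fmax, B/N), where B is the aggregate bandwidth of the contended resource. B = 8 here is only a value fitted to the numbers in this thread, not a measurement.

```shell
# Toy model of per-core speed under shared-resource contention:
# f(N) = min(fmax, B/N). B = 8 is an illustrative fit, not a measurement.
model() {
    awk -v fmax="$1" -v b="$2" -v n="$3" \
        'BEGIN { f = b / n; if (f > fmax) f = fmax; printf "%.2f\n", f }'
}

model 3.1 8 1     # 1 core: uncontended, runs at fmax
model 3.1 8 20    # 20 cores: the shared limit divided 20 ways
```

With those inputs the model reproduces the observed fall from ~3.1 GHz on one core to ~0.4 GHz on twenty.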

Faraz Hussain
2017-09-14 18:29:22 UTC
Permalink
Post by Joe Landman
What does numastat report? /tmp is a ramdisk or tmpfs? Are the
nodes/cpus otherwise idle? What does lscpu on a good/bad node report?
/tmp is tmpfs. The node is completely idle. lscpu is identical for the
slow and normal ones as shown below. The numastat output is shown
after that.

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Thread(s) per core: 1
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Stepping: 4
CPU MHz: 2499.897
BogoMIPS: 4999.25
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18

Slow blade
==========
node0 node1
numa_hit 4791488397 3134270297
numa_miss 0 0
numa_foreign 0 0
interleave_hit 20751 20698
local_node 4791480891 3134244680
other_node 7506 25617

Normal blade
============
node0 node1
numa_hit 148398986 104992773
numa_miss 0 0
numa_foreign 0 0
interleave_hit 20737 20712
local_node 148396757 104968673
other_node 2229 24100




Faraz Hussain
2017-09-27 01:28:52 UTC
Permalink
The issue now seems to be resolved after I did a full power down (cold
boot)! No idea what caused it in the first place.
Post by Joe Landman
Post by Faraz Hussain
Earlier I had posted about one of our blades running 30-50% slower
than other ones despite having identical hardware and OS. I
followed the suggestions and compared cpu temperature, memory,
dmesg and sysctl. Everything looks the same.
I then used "perf stat" to compare the speed of pigz (parallel gzip).
The results are quite interesting. Using one CPU, the slow blade is
as fast as the rest! But as I use more CPUs, the speed decreases
linearly from 3.1 GHz to 0.4 GHz. See snippets from the "perf stat"
command below. All tests were on /tmp to eliminate any NFS issue.
And the same behavior is observed with any multi-threaded program.
What does numastat report? /tmp is a ramdisk or tmpfs? Are the
nodes/cpus otherwise idle? What does lscpu on a good/bad node report?
If it decreases on a 1/Ncpu curve, then you have a fixed sized
resource bandwidth contention issue you are fighting. The question
is what.
Joe Landman
2017-09-13 18:45:41 UTC
Permalink
FWIW: I gave up on NFS boot a while ago, due in part to performance
problems that were hard to track down. The environment I created does
completely ramboot boots at scale, and allows me to pivot to NFS if
desired (boot-time switch). But I rarely use that. Pure ramboot has
been a joy to work with compared to NFS.
Post by Prentice Bisbal
Okay, based on the various responses I've gotten here and on other
This problem only occurs when I'm running our NFSroot based version of
the OS (CentOS 6). When I run the same OS installed on a local disk, I
do not have this problem, using the same exact server(s). For testing
purposes, I'm using LINPACK, and running the same executable with the
same HPL.dat file in both instances.
Because I'm testing the same hardware using different OSes, this
(should) eliminate the problem being in the BIOS, and faulty hardware.
This leads me to believe it's most likely a software configuration
issue, like a kernel tuning parameter, or some other software
configuration issue.
These are Supermicro servers, and it seems they do not provide CPU
temps. I do see a chassis temp, but not the temps of the individual
CPUs. While I agree that should be the first thing I look at, it's not
an option for me. Other tools like FLIR and Infrared thermometers
aren't really an option for me, either.
What software configuration, either a kernel parameter,
configuration of numad or cpuspeed, or some other setting, could
affect this?
Prentice
Post by Prentice Bisbal
Beowulfers,
I have a dozen servers that are all identical hardware: SuperMicro
servers with AMD Opteron 6320 processors. Ever since we upgraded to
CentOS 6, the users have been complaining of wildly inconsistent
performance across these 12 nodes. I ran LINPACK on these nodes, and
was able to duplicate the problem, with performance varying from ~14
GFLOPS to 64 GFLOPS.
I've identified that performance on the slower nodes starts off fine,
and then slowly degrades throughout the LINPACK run. For example, on
a node with this problem, during first LINPACK test, I can see the
performance drop from 115 GFLOPS down to 11.3 GFLOPS. That constant,
downward trend continues throughout the remaining tests. At the start
of subsequent tests, performance will jump up to about 9-10 GFLOPS,
but then drop to 5-6 GLOPS at the end of the test.
Because of the nature of this problem, I suspect this might be a
thermal issue. My guess is that the processor speed is being
throttled to prevent overheating on the "bad" nodes.
But here's the thing: this wasn't a problem until we upgraded to
CentOS 6. Where I work, we use a read-only NFSroot filesystem for our
cluster nodes, so all nodes are mounting and using the same exact
read-only image of the operating system. This only happens with these
SuperMicro nodes, and only with the CentOS 6 on NFSroot. RHEL5 on
NFSroot worked fine, and when I installed CentOS 6 on a local disk,
the nodes worked fine.
Any ideas where to look or what to tweak to fix this? Any idea why
this is only occurring with RHEL 6 w/ NFS root OS?

Michael Di Domenico
2017-09-13 18:53:54 UTC
Permalink
Post by Joe Landman
FWIW: I gave up on NFS boot a while ago, due in part to problems with
performance that were hard to track down. The environment I created to do
completely ramboot boots at scale, allows me to pivot to NFS if desired
(boot time switch). But I rarely use that. Pure ramboot has been a joy to
work with as compared to NFS.
seconded. i switched to a custom ramdisk based image with
copy-on-write a few years ago and never looked back... nfsroot that's
so 1990's... :)
Prentice Bisbal
2017-09-14 13:26:21 UTC
Permalink
Switching away from NFS root is not something I can do right now.

Prentice
FWIW:  I gave up on NFS boot a while ago, due in part to problems with
performance that were hard to track down.  The environment I created
to do completely ramboot boots at scale, allows me to pivot to NFS if
desired (boot time switch).  But I rarely use that.  Pure ramboot has
been a joy to work with as compared to NFS.
Post by Prentice Bisbal
Okay, based on the various responses I've gotten here and on other
This problem only occurs when I'm running our NFSroot based version
of the OS (CentOS 6). When I run the same OS installed on a local
disk, I do not have this problem, using the same exact server(s). 
For testing purposes, I'm using LINPACK, and running the same
executable  with the same HPL.dat file in both instances.
Because I'm testing the same hardware using different OSes, this
(should) eliminate the problem being in the BIOS, and faulty
hardware. This leads me to believe it's most likely a software
configuration issue, like a kernel tuning parameter, or some other
software configuration issue.
These are Supermicro servers, and it seems they do not provide CPU
temps. I do see a chassis temp, but not the temps of the individual
CPUs. While I agree that should be the first thing I look at, it's
not an option for me. Other tools like FLIR and Infrared thermometers
aren't really an option for me, either.
What software configuration, either a kernel parameter,
configuration of numad or cpuspeed, or some other setting, could
affect this?
Prentice
Post by Prentice Bisbal
Beowulfers,
I have a dozen servers that are all identical hardware: SuperMicro
servers with AMD Opteron 6320 processors. Ever since we upgraded to
CentOS 6, the users have been complaining of wildly inconsistent
performance across these 12 nodes. I ran LINPACK on these nodes, and
was able to duplicate the problem, with performance varying from ~14
GFLOPS to 64 GFLOPS.
I've identified that performance on the slower nodes starts off
fine, and then slowly degrades throughout the LINPACK run. For
example, on a node with this problem, during first LINPACK test, I
can see the performance drop from 115 GFLOPS down to 11.3 GFLOPS.
That constant, downward trend continues throughout the remaining
tests. At the start of subsequent tests, performance will jump up to
about 9-10 GFLOPS, but then drop to 5-6 GFLOPS at the end of the test.
Because of the nature of this problem, I suspect this might be a
thermal issue. My guess is that the processor speed is being
throttled to prevent overheating on the "bad" nodes.
But here's the thing: this wasn't a problem until we upgraded to
CentOS 6. Where I work, we use a read-only NFSroot filesystem for
our cluster nodes, so all nodes are mounting and using the same
exact read-only image of the operating system. This only happens
with these SuperMicro nodes, and only with the CentOS 6 on NFSroot.
RHEL5 on NFSroot worked fine, and when I installed CentOS 6 on a
local disk, the nodes worked fine.
Any ideas where to look or what to tweak to fix this? Any idea why
this is only occurring with RHEL 6 w/ NFS root OS?
Christopher Samuel
2017-09-14 00:21:19 UTC
Permalink
Post by Prentice Bisbal
What software configuration, either a kernel parameter, configuration
of numad or cpuspeed, or some other setting, could affect this?
Hmm, how about diff'ing "sysctl -a" between the systems too?

Does one load new CPU microcode in whereas another doesn't?

Still curious to know if there are any major differences between dmesg
between the boxes.
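A sketch of that diff (hostnames good01/bad01 are placeholders; assumes passwordless ssh to the nodes):

```shell
# Dump sorted kernel tunables from a known-good and a slow node, then show
# only the lines that differ. good01/bad01 are placeholder hostnames.
sysctl_diff() {   # usage: sysctl_diff <dump_a> <dump_b>
    diff -u "$1" "$2" | grep -E '^[+-][^+-]'
}

ssh -o BatchMode=yes -o ConnectTimeout=5 good01 \
    'sysctl -a 2>/dev/null | sort' > /tmp/good.sysctl || true
ssh -o BatchMode=yes -o ConnectTimeout=5 bad01 \
    'sysctl -a 2>/dev/null | sort' > /tmp/bad.sysctl || true
sysctl_diff /tmp/good.sysctl /tmp/bad.sysctl || true
```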

For monitoring CPU settings I tend to use "cpupower monitor", here's an
example from one of our SandyBridge boxes.

# cpupower monitor
|Nehalem || SandyBridge || Mperf
PKG |CORE|CPU | C3 | C6 | PC3 | PC6 || C7 | PC2 | PC7 || C0 | Cx | Freq
0| 0| 0| 0.00| 0.00| 0.00| 0.00|| 0.00| 0.00| 0.00|| 99.98| 0.02| 3100
0| 1| 1| 0.00| 0.00| 0.00| 0.00|| 0.00| 0.00| 0.00|| 99.98| 0.02| 3100
0| 2| 2| 0.00| 0.00| 0.00| 0.00|| 0.00| 0.00| 0.00|| 99.98| 0.02| 3099
0| 3| 3| 0.00| 0.00| 0.00| 0.00|| 0.00| 0.00| 0.00|| 99.98| 0.02| 3100
0| 4| 4| 0.00| 0.00| 0.00| 0.00|| 0.00| 0.00| 0.00|| 99.98| 0.02| 3100
0| 5| 5| 0.00| 0.00| 0.00| 0.00|| 0.00| 0.00| 0.00|| 99.99| 0.01| 3100
0| 6| 6| 0.00| 0.00| 0.00| 0.00|| 0.00| 0.00| 0.00|| 99.98| 0.02| 3100
0| 7| 7| 0.00| 0.00| 0.00| 0.00|| 0.00| 0.00| 0.00|| 99.98| 0.02| 3100
1| 0| 8| 0.00| 0.00| 0.00| 0.00|| 0.00| 0.00| 0.00|| 99.99| 0.01| 3100
1| 1| 9| 0.00| 0.00| 0.00| 0.00|| 0.00| 0.00| 0.00|| 99.99| 0.01| 3100
1| 2| 10| 0.00| 0.00| 0.00| 0.00|| 0.00| 0.00| 0.00|| 99.99| 0.01| 3100
1| 3| 11| 0.00| 0.00| 0.00| 0.00|| 0.00| 0.00| 0.00|| 99.99| 0.01| 3099
1| 4| 12| 0.00| 0.00| 0.00| 0.00|| 0.00| 0.00| 0.00|| 99.99| 0.01| 3100
1| 5| 13| 0.00| 0.00| 0.00| 0.00|| 0.00| 0.00| 0.00|| 99.99| 0.01| 3100
1| 6| 14| 0.00| 0.00| 0.00| 0.00|| 0.00| 0.00| 0.00|| 99.99| 0.01| 3099
1| 7| 15| 0.00| 0.00| 0.00| 0.00|| 0.00| 0.00| 0.00|| 99.99| 0.01| 3100

...and for a Haswell box:

[***@snowy001 ~]# cpupower monitor
|Nehalem || Mperf
PKG |CORE|CPU | C3 | C6 | PC3 | PC6 || C0 | Cx | Freq
0| 0| 0| 0.00| 0.00| 0.00| 0.00|| 99.94| 0.06| 2503
0| 1| 1| 0.00| 0.00| 0.00| 0.00|| 99.94| 0.06| 2503
0| 2| 2| 0.00| 0.00| 0.00| 0.00|| 99.94| 0.06| 2503
0| 3| 3| 0.00| 0.00| 0.00| 0.00|| 99.94| 0.06| 2503
0| 4| 4| 0.00| 0.00| 0.00| 0.00|| 99.94| 0.06| 2503
0| 5| 5| 0.00| 0.00| 0.00| 0.00|| 99.94| 0.06| 2503
0| 6| 6| 0.00| 0.00| 0.00| 0.00|| 99.95| 0.05| 2503
0| 7| 7| 0.00| 0.00| 0.00| 0.00|| 99.94| 0.06| 2503
0| 8| 8| 0.00| 0.00| 0.00| 0.00|| 99.94| 0.06| 2503
0| 9| 9| 0.00| 0.00| 0.00| 0.00|| 99.94| 0.06| 2503
0| 10| 10| 0.00| 0.00| 0.00| 0.00|| 99.94| 0.06| 2503
0| 11| 11| 0.00| 0.00| 0.00| 0.00|| 99.94| 0.06| 2503
0| 12| 12| 0.00| 0.00| 0.00| 0.00|| 99.94| 0.06| 2503
0| 13| 13| 0.00| 0.00| 0.00| 0.00|| 99.94| 0.06| 2503
0| 14| 14| 0.00| 0.00| 0.00| 0.00|| 99.94| 0.06| 2503
0| 15| 15| 0.00| 0.00| 0.00| 0.00|| 99.94| 0.06| 2503
1| 0| 16| 0.00| 0.00| 0.00| 0.00|| 99.94| 0.06| 2503
1| 1| 17| 0.00| 0.00| 0.00| 0.00|| 99.58| 0.42| 2503
1| 2| 18| 0.00| 0.00| 0.00| 0.00|| 99.94| 0.06| 2503
1| 3| 19| 0.00| 0.00| 0.00| 0.00|| 99.58| 0.42| 2503
1| 4| 20| 0.00| 0.00| 0.00| 0.00|| 99.94| 0.06| 2503
1| 5| 21| 0.00| 0.00| 0.00| 0.00|| 99.57| 0.43| 2503
1| 6| 22| 0.00| 0.00| 0.00| 0.00|| 99.94| 0.06| 2503
1| 7| 23| 0.00| 0.00| 0.00| 0.00|| 99.57| 0.43| 2503
1| 8| 24| 0.00| 0.00| 0.00| 0.00|| 99.94| 0.06| 2503
1| 9| 25| 0.00| 0.00| 0.00| 0.00|| 99.58| 0.42| 2503
1| 10| 26| 0.00| 0.00| 0.00| 0.00|| 99.95| 0.05| 2503
1| 11| 27| 0.00| 0.00| 0.00| 0.00|| 99.58| 0.42| 2503
1| 12| 28| 0.00| 0.00| 0.00| 0.00|| 99.95| 0.05| 2503
1| 13| 29| 0.00| 0.00| 0.00| 0.00|| 99.57| 0.43| 2503
1| 14| 30| 0.00| 0.00| 0.00| 0.00|| 99.94| 0.06| 2503
1| 15| 31| 0.00| 0.00| 0.00| 0.00|| 99.58| 0.42| 2503


cheers,
Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545

Prentice Bisbal
2017-09-14 18:45:21 UTC
Permalink
Beowulfers,


I'm happy to announce that I finally found the cause of this problem:
numad. On these particular systems, numad was having a catastrophic
effect on the performance. As the jobs ran, GFLOPS would steadily
decrease in a monotonic fashion. Watching the output of turbostat and
'cpupower monitor', I could see more and more cores becoming idle as the
job ran. As soon as I turned off numad and restarted my LINPACK jobs,
the performance went back up, and it stayed there for the duration
of the job.
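A sketch of the check/fix on a CentOS 6 node (stock service/chkconfig names; run as root; adjust for other init systems):

```shell
# Check whether numad is running on this node; if so, stop it and keep it
# from starting at boot (CentOS 6 style service/chkconfig commands).
numad_running() { ps -e -o comm= | grep -qx numad; }

if numad_running; then
    echo "numad is running: stopping and disabling"
    service numad stop
    chkconfig numad off
else
    echo "numad is not running"
fi
```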

To make sure I wasn't completely crazy for having numad enabled on these
systems, I did a google search and came across the paper below, which
indicates that in some cases having numad is helpful, and in other
cases, it isn't:

http://iopscience.iop.org/article/10.1088/1742-6596/664/9/092010/pdf

To verify this fix, I ran LINPACK again across all the nodes in this
cluster (well, all the nodes that weren't running user jobs at the
time), in addition to the Supermicro nodes. I found that on the
non-Supermicro nodes, which are Proliant servers with different Opteron
processors, turning numad off actually decreased performance by about 5%.

Have any of you had similar problems with numad? Do you leave it on or
off on your cluster nodes? Feedback is greatly appreciated. I did a
Google search of 'Linux numad HPC performance' (or something like that),
and the link above was all I could find on this topic.

For now, I think I'm going to leave numad enabled on the non-Supermicro
nodes until I can do more research/testing.

Prentice
Post by Prentice Bisbal
Okay, based on the various responses I've gotten here and on other
This problem only occurs when I'm running our NFSroot based version of
the OS (CentOS 6). When I run the same OS installed on a local disk, I
do not have this problem, using the same exact server(s).  For testing
purposes, I'm using LINPACK, and running the same executable  with the
same HPL.dat file in both instances.
Because I'm testing the same hardware with two different OS installs,
this should rule out the BIOS and faulty hardware. That leads me to
believe it's most likely a software configuration issue, such as a
kernel tuning parameter or some other software setting.
These are Supermicro servers, and it seems they do not provide CPU
temps. I do see a chassis temp, but not the temps of the individual
CPUs. While I agree that should be the first thing to look at, it's not
an option for me. Other tools like FLIR cameras and infrared
thermometers aren't really an option, either.
What software configuration, whether a kernel parameter, the
configuration of numad or cpuspeed, or some other setting, could
affect this?
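
One concrete way to chase that class of problem is to diff the
services enabled in the two images. A rough sketch for a SysV-init
system like CentOS 6 (the /mnt/nfsroot mount point is a made-up
example; adjust to wherever your image is visible):

```shell
# List the services started at runlevel 3 under a given root tree by
# reading the S?? symlinks in etc/rc3.d (SysV init, as on CentOS 6).
enabled_services() {
    ls "$1/etc/rc3.d" 2>/dev/null | sed -n 's/^S[0-9][0-9]//p' | sort
}

# Example usage (paths are assumptions): compare the NFSroot image
# against the local-disk install and eyeball the differences.
# diff <(enabled_services /mnt/nfsroot) <(enabled_services /)
```

A diff like that would have flagged numad as enabled in one image and
not the other right away.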
Prentice
Post by Prentice Bisbal
Beowulfers,
I have a dozen servers that are all identical hardware: SuperMicro
servers with AMD Opteron 6320 processors. Ever since we upgraded to
CentOS 6, the users have been complaining of wildly inconsistent
performance across these 12 nodes. I ran LINPACK on these nodes, and
was able to duplicate the problem, with performance varying from ~14
GFLOPS to 64 GFLOPS.
I've identified that performance on the slower nodes starts off fine,
and then slowly degrades throughout the LINPACK run. For example, on
a node with this problem, during first LINPACK test, I can see the
performance drop from 115 GFLOPS down to 11.3 GFLOPS. That constant,
downward trend continues throughout the remaining tests. At the start
of subsequent tests, performance will jump up to about 9-10 GFLOPS,
but then drop to 5-6 GFLOPS at the end of the test.
Because of the nature of this problem, I suspect this might be a
thermal issue. My guess is that the processor speed is being
throttled to prevent overheating on the "bad" nodes.
But here's the thing: this wasn't a problem until we upgraded to
CentOS 6. Where I work, we use a read-only NFSroot filesystem for our
cluster nodes, so all nodes are mounting and using the same exact
read-only image of the operating system. This only happens with these
SuperMicro nodes, and only with the CentOS 6 on NFSroot. RHEL5 on
NFSroot worked fine, and when I installed CentOS 6 on a local disk,
the nodes worked fine.
Any ideas where to look or what to tweak to fix this? Any idea why
this is only occurring with RHEL 6 w/ NFS root OS?
Christopher Samuel
2017-09-18 01:09:28 UTC
Permalink
I'm happy to announce that I finally found the cause of this problem: numad.
Very interesting, it sounds like it was migrating processes onto a
single core over time! Anything diagnostic in its log?
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545

Håkon Bugge
2017-09-18 09:20:57 UTC
Permalink
Post by Christopher Samuel
I'm happy to announce that I finally found the cause of this problem: numad.
Very interesting, it sounds like it was migrating processes onto a
single core over time! Anything diagnostic in its log?
Any idea how this correlates with NFSroot vs. local disk?


Thxs, Håkon
Prentice Bisbal
2018-02-19 15:48:06 UTC
Permalink
Finally catching up months and months of beowulf e-mails.
Post by Håkon Bugge
Post by Christopher Samuel
I'm happy to announce that I finally found the cause of this problem: numad.
Very interesting, it sounds like it was migrating processes onto a
single core over time! Anything diagnostic in its log?
Any idea how this correlates with NFSroot vs. local disk?
Yes. The local disk wasn't configured exactly like the NFSroot. The
NFSroot image had numad enabled, and the local disk install did not.

Prentice Bisbal
2018-02-19 15:51:58 UTC
Permalink
I know this is an old topic. I'm catching up on months' worth of mailing
list mail right now.
Post by Christopher Samuel
I'm happy to announce that I finally found the cause of this problem: numad.
Very interesting, it sounds like it was migrating processes onto a
single core over time! Anything diagnostic in its log?
That's exactly what it was doing. No, I did not see any diagnostics in
the log files, but some of the documentation I read on numad at the
time stated that numad should not be enabled for large multi-core jobs
that use a lot of memory, like DB servers and HPC jobs.
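
For sites that want numad for other workloads, the numad(8) man page
documents a -x option that adds a PID to its exclusion list, so a job
prologue could carve the HPC job out instead of disabling the daemon
node-wide. A hedged sketch (I haven't run this in production; verify
the flag against your numad version first):

```shell
# Exclude a PID from numad management, if numad is present at all.
# 'numad -x PID' adds the PID to the daemon's exclusion list per the
# numad(8) man page; the fallback branch just reports what it skipped.
exclude_from_numad() {
    if command -v numad >/dev/null 2>&1; then
        numad -x "$1"
    else
        echo "numad not installed; skipping exclude for pid $1"
    fi
}
```

Usage from a scheduler prologue might look like
`exclude_from_numad "$SLURM_TASK_PID"` (variable name is an example
from Slurm; substitute whatever your scheduler provides).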

--
Prentice

Bogdan Costescu
2017-09-14 05:59:15 UTC
Permalink
Post by Prentice Bisbal
I have a dozen servers that are all identical hardware: SuperMicro servers
with AMD Opteron 6320 processors. Ever since we upgraded to CentOS 6, the
users have been complaining of wildly inconsistent performance across these
12 nodes. I ran LINPACK on these nodes, and was able to duplicate the
problem, with performance varying from ~14 GFLOPS to 64 GFLOPS.
Are all these applications using MPI? And is /tmp also part of the
NFS root? If so, try moving /tmp to a local filesystem, or direct the
MPI library to use a local directory instead (e.g. by setting the
TMPDIR environment variable on all nodes).
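
A sketch of what that could look like in a job script (the
LOCAL_SCRATCH variable and the Open MPI '-x' export flag are
assumptions; adjust for your site's local disks and MPI):

```shell
# Point scratch files at node-local storage instead of the shared
# NFS root. LOCAL_SCRATCH is a hypothetical site variable; fall back
# to a per-user directory under /tmp when it is unset.
export TMPDIR="${LOCAL_SCRATCH:-/tmp/${USER:-hpc}-scratch}"
mkdir -p "$TMPDIR"

# With Open MPI, export the variable to every rank, e.g.:
# mpirun -x TMPDIR -np 64 ./xhpl
```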

Cheers,
Bogdan