Discussion:
[Beowulf] Large Dell, odd IO delays
David Mathog
2018-02-14 22:26:57 UTC
Dell PowerEdge T630, PERC H730P, single 11 TB RAID5 array. Xeon
E5-2650 CPUs with 40 threads total. 512 GB RAM. CentOS 6.9, kernel
2.6.32-696.20.1.el6.x86_64. (This machine is basically a small beowulf
in a box.)

Sometimes for no reason that I can discern an IO operation on this
machine will stall. Things that should take seconds will run for
minutes, or at least until I get tired of waiting and kill them. Here
is today's example:

gunzip -c largeFile.gz > largeFile

producing a 24 GB file. One job has been running "nice" on 40 threads
(which is all of them) for a few hours, using only 30 GB of RAM. When no
other CPU-intensive jobs are running, "top" shows it at around 3800-4700%
CPU. That job is slowly reading largeFile sequentially.

About two hours after largeFile was created this was run:

wc -l largeFile

and it just sat there for 10 minutes. top showed 100% CPU for the "wc"
process. There was nothing else using a significant amount of CPU time,
just the one big job and "wc". Killed the wc process and instead did:

dd if=largeFile bs=8192 | wc -l

and it completed in about 20 seconds. After that

wc -l largeFile

also completed, this time in only 6.5 seconds.

As far as I can tell largeFile should have been in cache the whole time.
Nothing big enough to force it out ran between when it was created and
when the wc started. "iostat 1" shows negligible disk activity, just
the occasional reads and writes from the long-running job, which works
by sucking in a chunk of the file, calculating for a while, then
emitting a chunk of results to an output file (which is only 320 MB).
Using "dd" somehow kicked the system out of this state, forcing
largeFile back into cache if it wasn't already there.
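
(Next time, something like the following should show directly how much of
the file is resident in the page cache. vmtouch is a third-party tool that
would need to be installed, and fincore only ships with newer util-linux,
so their availability here is an assumption:)

# report what fraction of largeFile's pages are resident in the page cache
vmtouch largeFile
# or, with a recent util-linux:
fincore largeFile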

There are no warnings or errors in dmesg or /var/log/messages.

I checked the console yesterday and there were no error messages on the
display there either.

smartctl status from the (SAS) disks the last time it was checked:

trombone Mon Feb 12 10:20:22 PST 2018
SMART status:          P      P      P      P
Defect list:           0      1      0      2
Non-medium errors:     1      7     22      3
Corrected write:       6      1      1      0
Corrected read:        0      0      0      0
Uncorrected write:     0      0      0      0
Uncorrected read:      0      0      0      0
Age:               16630  16630  16630  16630

and those values are unchanged after this event. (Another PowerEdge T630
with SAS disks also has the occasional non-medium error and corrected
write.)
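
(Per-disk SMART data behind a PERC can typically be pulled with smartctl's
megaraid passthrough, roughly as below; the target IDs 0-3 and the /dev/sda
device node are placeholders, not necessarily what this script actually uses:)

# one pass per physical disk behind the controller
for n in 0 1 2 3; do
    echo "=== disk $n ==="
    smartctl -a -d megaraid,$n /dev/sda
done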

A script which dumps pretty much all of the information available from
the RAID controller via "megacli" is run periodically. The only
differences between a run made after the "dd" and one from weeks ago are
the time stamps, disk temperatures, and battery charge levels (which
differ by a few percent).
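
(For reference, these are the sort of queries such a script typically wraps;
the exact binary name, MegaCli64 here, varies by install:)

MegaCli64 -AdpAllInfo -aALL               # controller / adapter info
MegaCli64 -LDInfo -Lall -aALL             # logical drive (RAID5 array) state
MegaCli64 -PDList -aALL                   # per-physical-disk state and error counters
MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL  # battery charge level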

We have three systems that are fairly similar to this one, but only this
one has this odd behavior. These IO stalls have been seen on it before.
There was a similar issue a couple of days ago, so the system was
rebooted then. Apparently that made no difference.

Examined every value in /proc/sys/vm, and this and another system differ
only in max_map_count: the problem system has 262144 and the other has
65530. That doesn't seem likely to be the issue.
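
(One way to do such a comparison; hostA and hostB below are placeholder
names for the two machines:)

# on each machine, dump the vm.* sysctls to a file, then diff the two
sysctl -a 2>/dev/null | grep '^vm\.' | sort > /tmp/vm.$(hostname).txt
diff /tmp/vm.hostA.txt /tmp/vm.hostB.txt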

Checked the hugepage settings and found a difference there. The two
systems that don't do this have, in
/sys/kernel/mm/redhat_transparent_hugepage/defrag:

always madvise [never]

whereas the system with the issue has:

[always] madvise never

I did not see any other jobs using up CPU time when this was going on,
but perhaps the defrag processes sometimes run in a mode where they
don't show up much in "top" yet still bog down the IO. In any case, I
set the problem system to match the other two.
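
(Concretely, that change amounts to something like the following; putting
the echo in /etc/rc.local for persistence is just one option:)

# runtime change (RHEL/CentOS 6 sysfs path for the THP backport)
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
# to survive a reboot, the same echo can go into /etc/rc.local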

Does this sound like a reasonable cause for the slowdown, or might there
be something else going on? (And if so, what?)

Thanks,

David Mathog
***@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
Christopher Samuel
2018-02-14 23:00:28 UTC
Post by David Mathog:
> Sometimes for no reason that I can discern an IO operation on this
> machine will stall. Things that should take seconds will run for
> minutes, or at least until I get tired of waiting and kill them.
>
> gunzip -c largeFile.gz > largeFile
Does "perf top -p ${PID}" show anything useful about where the
processes is spending its time?
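
For example (12345 below is a placeholder for the PID of the stalled wc,
and this assumes the distro's perf package is installed):

# live view of where the stalled process is burning CPU
perf top -p 12345
# or record ~30 seconds and inspect afterwards
perf record -p 12345 -g -- sleep 30
perf report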

Good luck!
Chris
Kilian Cavalotti
2018-02-14 23:44:01 UTC
Post by David Mathog:
> Checked the hugepage settings and found a difference there. The two
> systems that don't do this have
> /sys/kernel/mm/redhat_transparent_hugepage/defrag
>    always madvise [never]
> whereas the system with the issue has:
>    [always] madvise never
THP defragmentation is definitely something that has bitten us in the
past, when under memory pressure, and we now default to [madvise]
pretty much everywhere (we're too timid to disable it entirely).

A good way to see if that's really the issue is to "echo never >
/sys/kernel/mm/redhat_transparent_hugepage/defrag" while the problem
is happening, while monitoring the processes with htop, for instance.
It's usually pretty instant: if the issue is really with THP defrag,
then CPU usage for your stalling process should drop pretty much
immediately and things go back to normal.
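
(Another data point, assuming that kernel exposes the usual THP and
compaction counters in /proc/vmstat: if they keep climbing while the
process is wedged, that points at defrag as well:)

# watch the THP and memory-compaction counters during the stall
watch -n 2 "egrep 'thp_|compact_' /proc/vmstat"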

Cheers,
--
Kilian
John Hearns via Beowulf
2018-02-15 07:04:07 UTC
Hmmm... I will also chip in with my favourite tip.
Look at the sysctl for min_free_kbytes. It is often set very low.
Increase it substantially. It will do no harm to your system (unless you
set it to an absurd value!)

You should also be looking at the vm dirty ratios etc.
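
For example (the values here are purely illustrative, not tuned
recommendations; scale them to the machine's 512 GB of RAM):

# current settings
sysctl vm.min_free_kbytes vm.dirty_background_ratio vm.dirty_ratio
# raise the free-memory reserve (value is in KiB; 1 GB shown as an example)
sysctl -w vm.min_free_kbytes=1048576
# make it persistent across reboots
echo 'vm.min_free_kbytes = 1048576' >> /etc/sysctl.conf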

On 15 February 2018 at 00:44, Kilian Cavalotti wrote:
> Post by David Mathog:
> > Checked the hugepage settings and found a difference there. The two
> > systems that don't do this have
> > /sys/kernel/mm/redhat_transparent_hugepage/defrag
> >    always madvise [never]
> > [always] madvise never
>
> THP defragmentation is definitely something that has bitten us in the
> past, when under memory pressure, and we now default to [madvise]
> pretty much everywhere (we're too timid to disable it entirely).
>
> A good way to see if that's really the issue is to "echo never >
> /sys/kernel/mm/redhat_transparent_hugepage/defrag" while the problem
> is happening, while monitoring the processes with htop, for instance.
> It's usually pretty instant: if the issue is really with THP defrag,
> then CPU usage for your stalling process should drop pretty much
> immediately and things go back to normal.
>
> Cheers,
> --
> Kilian
Gus Correa
2018-02-15 16:29:12 UTC
Post by John Hearns:
> Hmmm... I will also chip in with my favourite tip.
> Look at the sysctl for min_free_kbytes. It is often set very low.
> Increase it substantially. It will do no harm to your system (unless
> you set it to an absurd value!)
> You should also be looking at the vm dirty ratios etc.
+1

vm.dirty_background_bytes
vm.dirty_bytes
(or the corresponding _ratios)
vm.min_free_kbytes

Defaults are low. Increasing them improved our compute nodes' IO a lot
(example after the links below).

https://www.kernel.org/doc/Documentation/sysctl/vm.txt
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/performance_tuning_guide/s-memory-tunables
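
For example (again, the numbers are placeholders, not tuned recommendations):

# start background writeback sooner and cap dirty data harder than the defaults
sysctl -w vm.dirty_background_bytes=268435456   # 256 MB
sysctl -w vm.dirty_bytes=1073741824             # 1 GB
# note: setting the _bytes variants zeroes the corresponding _ratio ones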

Michael Di Domenico
2018-02-15 12:51:35 UTC
On Wed, Feb 14, 2018 at 6:44 PM, Kilian Cavalotti wrote:
> Post by David Mathog:
> > Checked the hugepage settings and found a difference there. The two
> > systems that don't do this have
> > /sys/kernel/mm/redhat_transparent_hugepage/defrag
> >    always madvise [never]
> > [always] madvise never
>
> THP defragmentation is definitely something that has bitten us in the
> past, when under memory pressure, and we now default to [madvise]
> pretty much everywhere (we're too timid to disable it entirely).
I will second this stance as well. I've seen huge issues with disk
performance when hugepages were enabled, and I now disable them on all
the machines we have.

The way I found it was that when doing large IO with hugepages enabled,
the khugepaged process shoots right to the top of a "top" display, and
the performance was the same as you describe.
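
(A quick way to check for that, and to turn THP off outright if it turns out
to be the culprit; the redhat_ sysfs path is the RHEL/CentOS 6 one, adjust
for other kernels:)

# how much CPU time has khugepaged accumulated since boot?
ps -o pid,etime,cputime,comm -C khugepaged
# disable THP entirely at runtime
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled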