Discussion:
[Beowulf] Hyperthreading and 'OS jitter'
John Hearns via Beowulf
2017-07-22 10:11:11 UTC
Permalink
Several times in the past I have jokingly asked if there should be another,
lower-powered CPU core in a system to run OS tasks (hello Intel - are you
listening?)
Also in the past there was advice that, to get the best possible throughput
on AMD Bulldozer CPUs, you should run on only every second core (as they
share FPUs).
When I managed a large NUMA system we used cpusets, and the OS ran in a
small 'boot cpuset' which was physically near the OS disks and IO cards.

I had a thought about hyperthreading though. A few months ago we did a
quick study with Blender rendering, and got 30% more throughput with HT
switched on. Also, someone I am working with now would like to assess
the effect of HT on/HT off on their codes.
I know that HT has normally not had any advantage with HPC-type codes - as
the core should be 100% flat out.

I am thinking though - what would be the effect of enabling HT, and using a
cgroup to constrain user codes to run on all the odd-numbered CPU cores,
with the OS tasks on the even-numbered ones?
I would hope this would be at least performance neutral? Your thoughts,
please! Also thoughts on candidate benchmark programs to test this idea.
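
Something along these lines is what I have in mind - only a rough sketch,
assuming a cgroup-v1 cpuset mounted at /sys/fs/cgroup/cpuset; the group name
and benchmark binary are placeholders, and whether the odd-numbered logical
CPUs really are the HT siblings depends on how the kernel enumerates them:

#!/usr/bin/env python3
# Sketch only: confine a user job to the odd-numbered logical CPUs via a
# cgroup-v1 cpuset, leaving the even-numbered ones for the OS. Assumes the
# cpuset controller is mounted at /sys/fs/cgroup/cpuset; "userjobs" and
# ./my_benchmark are placeholders. Needs root.
import os
import subprocess

CPUSET_ROOT = "/sys/fs/cgroup/cpuset"
GROUP = os.path.join(CPUSET_ROOT, "userjobs")

ncpus = os.cpu_count()  # logical CPUs, HT siblings included
odd_cpus = ",".join(str(c) for c in range(1, ncpus, 2))

os.makedirs(GROUP, exist_ok=True)
# Both cpuset.cpus and cpuset.mems must be populated before tasks can join.
with open(os.path.join(GROUP, "cpuset.cpus"), "w") as f:
    f.write(odd_cpus)
with open(os.path.join(CPUSET_ROOT, "cpuset.mems")) as f:
    mems = f.read().strip()
with open(os.path.join(GROUP, "cpuset.mems"), "w") as f:
    f.write(mems)

# Put this launcher into the cpuset; the benchmark it spawns inherits it.
with open(os.path.join(GROUP, "tasks"), "w") as f:
    f.write(str(os.getpid()))
subprocess.run(["./my_benchmark"])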


John Hearns........
....... John Hearns
Scott Atchley
2017-07-22 11:13:26 UTC
Permalink
I would imagine the answer is "It depends". If the application uses the
per-CPU caches effectively, then performance may drop when HT shares the
cache between the two processes.
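
One caveat on the odd/even split: whether logical CPU 2n+1 really is the HT
sibling of CPU 2n depends on how the kernel enumerates the threads - on many
machines the siblings of cores 0..N-1 are numbered N..2N-1 instead. A quick
sanity check (a sketch, assuming nothing beyond standard Linux sysfs):

#!/usr/bin/env python3
# Print the sibling groups so you can see which logical CPUs share a
# physical core (and hence its caches).
import glob

siblings = set()
for path in glob.glob(
        "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"):
    with open(path) as f:
        siblings.add(f.read().strip())  # e.g. "0,16" or "2-3"

for group in sorted(siblings):
    print(group)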

We are looking at reserving a couple of cores per node on Summit to run
system daemons if the user requests it. If the user can effectively use the
GPUs, the CPUs should be idle much of the time anyway. We will see.

I like your idea of a low-power core to run OS tasks.

On Sat, Jul 22, 2017 at 6:11 AM, John Hearns via Beowulf <
Post by John Hearns via Beowulf
Several times in the past I have jokingly asked if there should be another,
lower-powered CPU core in a system to run OS tasks (hello Intel - are you
listening?)
Also in the past there was advice that, to get the best possible throughput
on AMD Bulldozer CPUs, you should run on only every second core (as they
share FPUs).
When I managed a large NUMA system we used cpusets, and the OS ran in a
small 'boot cpuset' which was physically near the OS disks and IO cards.
I had a thought about hyperthreading though. A few months ago we did a
quick study with Blender rendering, and got 30% more throughput with HT
switched on. Also, someone I am working with now would like to assess
the effect of HT on/HT off on their codes.
I know that HT has normally not had any advantage with HPC-type codes - as
the core should be 100% flat out.
I am thinking though - what would be the effect of enabling HT, and using a
cgroup to constrain user codes to run on all the odd-numbered CPU cores,
with the OS tasks on the even-numbered ones?
I would hope this would be at least performance neutral? Your thoughts,
please! Also thoughts on candidate benchmark programs to test this idea.
John Hearns........
....... John Hearns
_______________________________________________
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
Evan Burness
2017-07-25 14:31:32 UTC
Permalink
If I recall correctly, IBM did just what you're describing with the
BlueGene CPUs. I believe those were 18-core parts, with 2 of the cores
reserved to run the OS and act as a buffer against jitter. That left a
nice, neat power-of-two number of cores for compute tasks.

Re: having a specialized, low-power core, this is clearly something that's
already been successful in the mobile device space. The big.LITTLE
<https://en.wikipedia.org/wiki/ARM_big.LITTLE> ARM architecture is designed
for exactly this kind of thing and has been quite successful. Certainly, now
that Intel and AMD are really designing modular SoC-like products, it
wouldn't be terribly difficult to bake in a couple of low-power x86 cores
(e.g. Atom or Xeon-D + larger Skylake die in Intel's case; Jaguar + Zen in
AMD's case). I'm not an expert in fab economics, but I don't believe it
would significantly add to production costs.

A similar approach to IBM's (with BlueGene) is what the major public Cloud
providers often take these days. AWS's standard approach is to buy CPUs with
1-2 more cores per socket than they actually intend to expose to users, and
to use those extra cores for managing the hypervisor layer. As an example,
the CPUs in the C4.8xlarge instances are, in reality, custom 10-core Xeon
(Haswell) parts. Yet AWS only exposes 8 of the cores per socket to the end
user, in order to ensure consistent performance and reduce the chance of a
compute-intensive workload interfering with AWS's management of the
physical node via the hypervisor. Microsoft Azure and Google Cloud
Platform often (but not always) do the same thing, so it's something of a
"best practice" among the Cloud providers these days. Anecdotally, I can
report that in our (Cycle Computing's) work with customers doing HPC and
"Big Compute" on public Clouds, performance consistency has improved a
lot over time, and we've had the Cloud folks tell us that reserving a few
cores per node was a helpful step in that process.

Hope this helps!


Best,

Evan
Post by Scott Atchley
I would imagine the answer is "It depends". If the application uses the
per-CPU caches effectively, then performance may drop when HT shares the
cache between the two processes.
We are looking at reserving a couple of cores per node on Summit to run
system daemons if the user requests it. If the user can effectively use the
GPUs, the CPUs should be idle much of the time anyway. We will see.
I like your idea of a low-power core to run OS tasks.
On Sat, Jul 22, 2017 at 6:11 AM, John Hearns via Beowulf <
Post by John Hearns via Beowulf
Several times in the past I have jokingly asked if there should be
another, lower-powered CPU core in a system to run OS tasks (hello Intel -
are you listening?)
Also in the past there was advice that, to get the best possible throughput
on AMD Bulldozer CPUs, you should run on only every second core (as they
share FPUs).
When I managed a large NUMA system we used cpusets, and the OS ran in a
small 'boot cpuset' which was physically near the OS disks and IO cards.
I had a thought about hyperthreading though. A few months ago we did a
quick study with Blender rendering, and got 30% more throughput with HT
switched on. Also, someone I am working with now would like to assess
the effect of HT on/HT off on their codes.
I know that HT has normally not had any advantage with HPC-type codes -
as the core should be 100% flat out.
I am thinking though - what would be the effect of enabling HT, and using
a cgroup to constrain user codes to run on all the odd-numbered CPU cores,
with the OS tasks on the even-numbered ones?
I would hope this would be at least performance neutral? Your thoughts,
please! Also thoughts on candidate benchmark programs to test this idea.
John Hearns........
....... John Hearns
_______________________________________________
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
--
Evan Burness
Director, HPC Solutions
Cycle Computing
***@cyclecomputing.com
(919) 724-9338
Nathan Moore
2017-07-25 22:20:52 UTC
Permalink
Post by Evan Burness
Re: having a specialized, low-power core, this is clearly something that's
already been successful in the mobile device space. The big.LITTLE
<https://en.wikipedia.org/wiki/ARM_big.LITTLE> ARM architecture is
designed for exactly this kind of thing and has been quite successful.
Certainly, now that Intel and AMD are really designing modular SoC-like
products, it wouldn't be terribly difficult to bake in a couple of low-power
x86 cores (e.g. Atom or Xeon-D + larger Skylake die in Intel's case; Jaguar
+ Zen in AMD's case). I'm not an expert in fab economics, but I don't
believe it would significantly add to production costs.
The "textbook" answer on integrated circuit manufacturing is that there need
be no dependence of device cost on the number of gates or on device
complexity. Fundamentally, you're just printing/etching a slightly more
complicated mask onto the wafer. The number of gates and the probability of
defects are probably proportional, though - didn't AMD sell 6- and 3-core
processors for a while? I always assumed those were 4- or 8-core parts that
had critical defects in one of the cores. Sorry, no first-hand knowledge
though.

Jim Lux probably knows the real answer.


Nathan
Christopher Samuel
2017-08-02 03:32:23 UTC
Permalink
Post by Evan Burness
If I recall correctly, IBM did just what you're describing with the
BlueGene CPUs. I believe those were 18-core parts, with 2 of the cores
being reserved to run the OS and as a buffer against jitter. That left a
nice, neat power-of-2 amount of cores for compute tasks.
Close, but the 18 cores were there for yield, with 1 core running the
Compute Node Kernel (CNK) and 16 cores for the task that the CNK would
launch. The 18th was inaccessible.

But yes, I think SGI (RIP) pioneered this on Intel with their Altix
systems, and that was the reason they wrote the original cpuset code for
the Linux kernel: so they could constrain the boot services to a set of
cores and leave the rest free to run jobs on.

All the best,
Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545

_______________________________________________
Beowulf mailing list, ***@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Evan Burness
2017-08-02 03:37:17 UTC
Permalink
Thanks for the history lessons, Chris! Very interesting indeed.

It would be interesting to take it a step further and measure what the
impacts (good, bad, or otherwise) would be of picking a specific core on a
given CPU uArch layout for the OS.


Cheers,

Evan
Post by Christopher Samuel
Post by Evan Burness
If I recall correctly, IBM did just what you're describing with the
BlueGene CPUs. I believe those were 18-core parts, with 2 of the cores
being reserved to run the OS and as a buffer against jitter. That left a
nice, neat power-of-2 amount of cores for compute tasks.
Close, but the 18 cores were there for yield, with 1 core running the
Compute Node Kernel (CNK) and 16 cores for the task that the CNK would
launch. The 18th was inaccessible.
But yes, I think SGI (RIP) pioneered this on Intel with their Altix
systems, and that was the reason they wrote the original cpuset code for
the Linux kernel: so they could constrain the boot services to a set of
cores and leave the rest free to run jobs on.
All the best,
Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
_______________________________________________
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
--
Evan Burness
Director, HPC Solutions
Cycle Computing
***@cyclecomputing.com
(919) 724-9338
Christopher Samuel
2017-08-02 05:36:49 UTC
Permalink
Post by Evan Burness
Thanks for the history lessons, Chris! Very interesting indeed.
My pleasure. To add to the history, here's a paper from the APAC'05
conference 12 years ago that details how the then APAC (now NCI) set up
their SGI Altix cluster, including a discussion of cpusets.

http://www.kev.pulo.com.au/publications/apac05/apac05-apacnf-altix.pdf

Also includes an interesting section on dealing with SGI's proprietary
MPI stack and the problems it caused them.
Post by Evan Burness
Would be interesting to take it a step further and measure what the
impacts (good, bad, or otherwise) of picking a specific core on a given
CPU uArch layout for the OS.
Sadly I was hoping that document would give some indication of the
benefits of reducing jitter via cpusets, but it does not.

I'd be very interested to hear what people have found there - I do know
that Slurm allows you to tie cores to generic resources like GPUs, so
that an administrator can enforce that only certain cores can access
that resource (say, the cores closest to a GPU).

https://slurm.schedmd.com/gres.html

It also supports "core specialisation", which is nebulously explained as:

https://slurm.schedmd.com/core_spec.html

# Core specialization is a feature designed to isolate system overhead
# (system interrupts, etc.) to designated cores on a compute node. This
# can reduce applications interrupts ranks to improve completion time.
# The job will be charged for all allocated cores, but will not be able
# to directly use the specialized cores.

Usefully, there is a PDF from the 2014 Slurm User Group which goes into
more detail about it, and includes references to work done by Cray and
others on the issue of jitter and the benefits of reducing it.

https://slurm.schedmd.com/SUG14/process_isolation.pdf

From that description it appears to only put the Slurm daemons for jobs
into the group, but of course there would be nothing to stop you having
a start-up script that first moved any other existing processes onto that
core via their own cgroup - something along the lines of the sketch below.
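
A rough sketch of that start-up step, assuming a cgroup-v1 cpuset mounted at
/sys/fs/cgroup/cpuset - the 'system' group name and the choice of logical
CPU 0 as the reserved core are placeholders, not anything from the Slurm
docs:

#!/usr/bin/env python3
# Sweep every existing PID into a 'system' cpuset pinned to the reserved
# core(s), before the batch system starts. Needs root.
import os

SYSTEM_CPUSET = "/sys/fs/cgroup/cpuset/system"  # hypothetical boot/system cpuset
RESERVED_CPUS = "0"                             # e.g. reserve logical CPU 0 for the OS

os.makedirs(SYSTEM_CPUSET, exist_ok=True)
with open(os.path.join(SYSTEM_CPUSET, "cpuset.cpus"), "w") as f:
    f.write(RESERVED_CPUS)
with open(os.path.join(SYSTEM_CPUSET, "cpuset.mems"), "w") as f:
    f.write("0")                                # assume memory node 0 for simplicity

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(os.path.join(SYSTEM_CPUSET, "tasks"), "w") as f:
            f.write(pid)
    except OSError:
        # Per-CPU kernel threads (and already-exited PIDs) can't be moved; skip.
        pass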

Shame that Bull's test was too small to show any benefit!

All the best,
Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545

_______________________________________________
Beowulf mailing list, ***@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf