Discussion:
[Beowulf] Intel CPU design bug & security flaw - kernel fix imposes performance penalty
Christopher Samuel
2018-01-03 03:46:07 UTC
Hi all,

Just a quick break from my holiday in Philadelphia (swapped forecast 40C
on Saturday in Melbourne for -10C forecast here) to let folks know about
what looks like a longstanding Intel CPU design flaw that has security
implications.

There appears to be no microcode fix possible, and the kernel fix will
incur a significant performance penalty; people are talking about figures
in the range of 5%-30% depending on the generation of the CPU. :-(

https://www.theregister.co.uk/2018/01/02/intel_cpu_design_flaw/

There's a post on the PostgreSQL site that measures the impact; El Reg
summarises it as:

https://twitter.com/TheRegister/status/948342806367518720?ref_src=twsrc%5Etfw

Best case: 17% slowdown
Worst case: 23%

Here's the post about the measured impact:

https://www.postgresql.org/message-id/***@alap3.anarazel.de

This is going to be interesting I think...

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
_______________________________________________
Beowulf mailing list, ***@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org
Christopher Samuel
2018-01-03 03:51:05 UTC
Post by Christopher Samuel
This is going to be interesting I think...
It also looks like ARM64 may have a similar issue; a subscriber-only
article on LWN points to this patch set being worked on to address the
problem there:

https://lwn.net/Articles/740393/

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Greg Lindahl
2018-01-03 05:56:33 UTC
Post by Christopher Samuel
There appears to be no microcode fix possible and the kernel fix will
incur a significant performance penalty, people are talking about in the
range of 5%-30% depending on the generation of the CPU. :-(
The performance hit (at least for the current patches) is related to
system calls, which HPC programs using networking gear like Omni-Path
or InfiniBand don't make many of.

-- greg
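Greg's point can be checked on a given node: since the patch mainly adds cost to each user/kernel transition, one crude test is to time a cheap system call on the same machine booted with and without the fix. A minimal Python sketch, assuming Linux, where `os.getppid()` enters the kernel on every call. Python's interpreter overhead is included in the figure, so treat it as a relative comparison between boots, not an absolute syscall cost. The `estimate_overhead` helper is a hypothetical back-of-the-envelope model (added per-call cost times call rate), and the 200 ns extra cost and call rates in the demo are assumed, illustrative numbers, not measurements.

```python
import os
import time


def syscall_cost_ns(iterations: int = 200_000) -> float:
    """Average wall-clock ns per os.getppid() call, Python loop overhead included."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        os.getppid()  # trivial syscall: cost is dominated by kernel entry/exit
    elapsed = time.perf_counter_ns() - start
    return elapsed / iterations


def estimate_overhead(syscalls_per_sec: float, extra_ns_per_call: float) -> float:
    """Hypothetical model: fraction of each second spent on the added per-call cost."""
    return syscalls_per_sec * extra_ns_per_call * 1e-9


if __name__ == "__main__":
    print(f"~{syscall_cost_ns():.0f} ns per getppid() call (incl. Python overhead)")
    # Assumed figures only: a kernel-bypass MPI code doing 1,000 syscalls/s versus
    # a syscall-heavy service doing 500,000/s, each paying an assumed 200 ns extra.
    for rate in (1_000, 500_000):
        print(f"{rate:>7} syscalls/s -> {estimate_overhead(rate, 200):.4%} extra time")
```

The model makes the shape of the argument visible: the added cost scales with syscall rate, so codes that do their communication through kernel-bypass interconnects should see little of it, while syscall-heavy workloads such as databases see much more.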


John Hearns via Beowulf
2018-01-03 08:46:52 UTC
Thanks Chris. In the past there have been Intel CPU 'bugs' trumpeted, but
generally these are fixed with a microcode update.
This looks different, as it is a fundamental part of the chip's architecture.
However the Register article says: "It allows normal user programs – to
discern to some extent the layout or contents of protected kernel memory
areas"

I guess the phrase "to some extent" is the vital one here. Are there any
security exploits which use this information? I guess it is inevitable that
one will be engineered now that this is known about. The question I am
really asking is: should we worry about this for real-world systems? And I
guess the answer is that if the kernel developers are worried enough, then
yes, we should be too. Comments please.
Lachlan Musicman
2018-01-03 08:59:39 UTC
The story originated here:

http://pythonsweetness.tumblr.com/post/169166980422/the-mysterious-case-of-the-linux-page-table

L.

------
"The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic civics
is the insistence that we cannot ignore the truth, nor should we panic
about it. It is a shared consciousness that our institutions have failed
and our ecosystem is collapsing, yet we are still here — and we are
creative agents who can shape our destinies. Apocalyptic civics is the
conviction that the only way out is through, and the only way through is
together. "

*Greg Bloom* @greggish
https://twitter.com/greggish/status/873177525903609857
Christopher Samuel
2018-01-03 14:59:49 UTC
Post by John Hearns via Beowulf
I guess the phrase "to some extent" is the vital one here. Are there
any security exploits which use this information?
It's more that it reduces or negates the protection that existing
kernel address space layout randomisation (KASLR) gives you, the idea
of which is to make a wide range of exploits, known and unknown, harder.
More info here:

https://lwn.net/Articles/738975/

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Lux, Jim (337K)
2018-01-03 17:47:42 UTC
I should think that in a "dedicated cluster" application, these sorts of security problems are less of an issue - whether a process can figure out what memory space other processes are in is more of an issue for machines "open to the world with heterogeneous applications" (i.e. 99.9% of the machines out there).
The scenario from the article:
"Imagine a piece of JavaScript running in a browser, or malicious software running on a shared public cloud server, able to sniff sensitive kernel-protected data."

I'll bet there's not a whole lot of HPC code written in JavaScript running in a browser...
(not that someone hasn't done it, as a stunt... Is there an MPI library binding for JavaScript?)

And, if you're running HPC "in the cloud" on VMs, this is an issue.

I suppose the downside is that if they do kernel mods to fix this for the 99.9%, it adversely affects the performance for the 0.1% (that is, us).

Jim Lux
(818)354-2075 (office)
(818)395-2714 (cell)


Ellis H. Wilson III
2018-01-03 17:57:48 UTC
Post by Lux, Jim (337K)
I suppose the down side is that if they do kernel mods to fix this
for the 99.9%, it adversely affects the performance for the 0.1%
(that is, us).
We've been discussing this extensively at my workplace, and the
overwhelming expectation is that at least in Linux the fix should be
configurable such that those operating in non-multitenant systems (such
as scale-out storage appliances) can disable it.

If this ends up not being the case, I would expect it in the short term
to lock us out of upgrading to newer kernels, where the fix and resultant
overheads come into play, until we're on newer CPUs where the
architecture deficiency is resolved. This latter part (the expectation
of Intel fixing it in their newer HW) is all the more reason I'm
inclined to believe the fix will be delivered as a tunable.

Best,

ellis
--
Ellis H. Wilson III, Ph.D.
www.ellisv3.com
Joe Landman
2018-01-03 17:57:48 UTC
It looks like it will respond to a 'nopti' boot option (at least in the
patches I've seen from 4 Dec).
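To confirm how a particular node actually booted, one option is to look for the switch on the running kernel's command line. A minimal Python sketch, assuming the `nopti` option Joe mentions and also `pti=off`, which I believe upstream x86 kernels accept as an equivalent spelling (treat that as an assumption for vendor kernels). The helper names are hypothetical.

```python
from pathlib import Path

# Boot options assumed to disable page-table isolation: 'nopti' as mentioned
# in the thread, 'pti=off' as the upstream x86 spelling (assumption for vendor kernels).
PTI_OFF_TOKENS = {"nopti", "pti=off"}


def pti_disabled_on_cmdline(cmdline: str) -> bool:
    """Return True if any whitespace-separated boot token turns PTI off."""
    return any(tok in PTI_OFF_TOKENS for tok in cmdline.split())


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the running kernel's boot command line (Linux only)."""
    return Path(path).read_text()


if __name__ == "__main__":
    try:
        line = read_cmdline()
    except OSError:
        print("no /proc/cmdline here (not Linux?)")
    else:
        state = "disabled" if pti_disabled_on_cmdline(line) else "not disabled"
        print(f"PTI {state} via boot options: {line.strip()}")
```

Matching whole tokens rather than substrings avoids false positives from unrelated options that merely contain the text. Note that "not disabled" only means the boot line doesn't turn it off; it does not prove the mitigation is active on that CPU.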
--
Joe Landman
e: ***@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

Jones de Andrade
2018-01-03 22:32:47 UTC
Meaning:

AMD would also be on the same hook;

We, "non-average computer users", are still [verb of your choice here].
Intel's response: https://www.streetinsider.com/Corporate+News/Intel+%28INTC%29+Responds+to+Security+Research+Findings/13648696.html
--
Kevin Van Workum, PhD
Sabalcore Computing Inc.
Peter Kjellström
2018-01-04 01:01:09 UTC
On Wed, 3 Jan 2018 17:47:42 +0000
Post by Lux, Jim (337K)
I should think that in a "dedicated cluster" application, these sorts
of security problems are less of an issue
Well, I sure don't like the idea of random_flow_app.x reading our
Slurm/MUNGE secrets for later convenient use of resources, or getting
root-equivalent access to remote filesystem resources after reading out
the relevant bits, or the consequences of anyone being able to get at the
SSH host keys, etc. etc.

It sure sounds like "patch and suffer the slowdown" for most of the HPC
centers I interact with.

As for being able to turn the (Meltdown) patch off, I think there's
even a runtime sysctl option (at least in the Red Hat patch). And this
one isn't needed at all for AMD.
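On kernels that expose it, the kernel reports its own view of the Meltdown status, which also reflects the AMD point: unaffected CPUs report `Not affected`. The `/sys/devices/system/cpu/vulnerabilities/` directory arrived with later kernels and vendor backports, so it may be missing on the systems discussed in this thread. A minimal Python sketch; the coarse labels are my own, and the Red Hat runtime switch mentioned above is not covered.

```python
from pathlib import Path

MELTDOWN_SYSFS = "/sys/devices/system/cpu/vulnerabilities/meltdown"


def classify_meltdown(status: str) -> str:
    """Map the kernel's status text onto a coarse label (labels are illustrative)."""
    s = status.strip()
    if s.startswith("Not affected"):
        return "not-affected"  # CPU does not need the Meltdown patch
    if s.startswith("Mitigation"):
        return "mitigated"     # e.g. "Mitigation: PTI"
    if s.startswith("Vulnerable"):
        return "vulnerable"    # affected CPU with the mitigation off (e.g. booted nopti)
    return "unknown"


def meltdown_status(path: str = MELTDOWN_SYSFS) -> str:
    """Return the label for this host, or 'unsupported' if the file is absent."""
    try:
        return classify_meltdown(Path(path).read_text())
    except OSError:
        return "unsupported"   # older kernel without the vulnerabilities directory


if __name__ == "__main__":
    print(meltdown_status())
```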

The slowdown seems not too bad on codes closer to HPC interests, such
as SPEC CPU and HPL (from initial sources). I've seen 0-3% mentioned, but
as far as I can tell that was not for multinode cases, so it remains to be
seen how high-performance networking (and thereby scaling) is affected.

/Peter