Discussion:
[Beowulf] [upgrade strategy] Intel CPU design bug & security flaw - kernel fix imposes performance penalty
Remy Dernat
2018-01-03 12:56:50 UTC
Permalink
Hi,
I renamed that thread because IMHO there is another issue related to that threat.
Should we upgrade our systems and lose a significant amount of XFlops?
What should be considered:
 - the risk
 - your user population (size / type / average "knowledge" of hacking techniques...)
 - the isolation level from the outside (internet)

So here is my question: if this is not confidential, what will you do?
I would not patch our little local cluster, unlike all of our other servers.
Indeed, there is another "little" risk: if our strategy is to always upgrade/patch, in this particular case we could lose many users who will complain about performance...
So another question: what is your global strategy for upgrades on your clusters? Do you upgrade them as often as you can? One upgrade every X months (because of the downtime issue)?

Thanks,
Best regards,
Rémy.

-------- Original message --------
From: John Hearns via Beowulf <***@beowulf.org>
Date: 03/01/2018 09:48 (GMT+01:00)
To: Beowulf Mailing List <***@beowulf.org>
Subject: Re: [Beowulf] Intel CPU design bug & security flaw - kernel fix imposes performance penalty
Thanks Chris. In the past there have been Intel CPU 'bugs' trumpeted, but generally these are fixed with a microcode update. This looks different, as it is a fundamental part of the chip's architecture. However the Register article says: "It allows normal user programs – to discern to some extent the layout or contents of protected kernel memory areas"
I guess the phrase "to some extent" is the vital one here. Are there any security exploits which use this information? I guess it is inevitable that one will be engineered now that this is known about. The question I am really asking is: should we worry about this for real world systems? And I guess the answer is that if the kernel developers are worried enough then yes, we should be too. Comments please.
There appears to be no microcode fix possible and the kernel fix will
incur a significant performance penalty, people are talking about in the
range of 5%-30% depending on the generation of the CPU. :-(
The performance hit (at least for the current patches) is related to
system calls, which HPC programs using networking gear like OmniPath
or Infiniband don't do much of.

-- greg
Jörg Saßmannshausen
2018-01-04 23:48:20 UTC
Permalink
Dear all,

that was the question I was pondering all day today, and I tried to read
and digest any information I could get.

In the end, I contacted my friend at CERT and proposed the following:
- upgrade the headnode/login node (call it what you like) as that one is exposed
to the outside world via ssh
- do not upgrade the compute nodes for now, until we get more information about
the impact of the patch(es).

It would not be the first time a patch opens up another can of worms. What
I am hoping for is to find a middle way between security and performance. IF
the patch(es) are safe to apply, I can still roll them out to the compute
nodes without losing too much uptime. IF there is a problem regarding
performance, it only affects the headnode, which I can ignore on that cluster.

As always, your mileage will vary, especially as different clusters have
different purposes.

What I would like to know is: how about compensation? For me this is the same
as the VW scandal last year. We, the users, have been deceived. Especially if
the 30% performance loss which has been mooted is not a special corner case
but is seen often in HPC. Some of the chemistry codes I am supporting rely
on disc I/O, others on InfiniBand, and again others run entirely in
memory.

These are my 2 cents. If somebody has a better idea, please let me know.
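For sites taking a similar selective-patching route, a minimal sketch like the one below can verify per node whether the Meltdown page-table-isolation fix is actually active. It assumes a kernel new enough (roughly 4.15 or a patched vendor kernel) to expose /sys/devices/system/cpu/vulnerabilities, and the nopti / pti=off parameters it looks for are the standard kernel boot switches for turning KPTI off:

#!/usr/bin/env python3
# Sketch: report Meltdown/Spectre mitigation status on a node.
# Assumes a kernel new enough to expose /sys/devices/system/cpu/vulnerabilities;
# on older kernels the directory is simply missing and only the boot
# parameters are reported.
import glob
import os

VULN_DIR = "/sys/devices/system/cpu/vulnerabilities"

if os.path.isdir(VULN_DIR):
    for path in sorted(glob.glob(os.path.join(VULN_DIR, "*"))):
        with open(path) as f:
            print("%s: %s" % (os.path.basename(path), f.read().strip()))
else:
    print("no %s (kernel does not report mitigation status)" % VULN_DIR)

# KPTI can be switched off at boot with 'nopti' or 'pti=off';
# check whether this node was booted that way.
with open("/proc/cmdline") as f:
    cmdline = f.read().split()
print("KPTI explicitly disabled at boot:",
      any(p in ("nopti", "pti=off") for p in cmdline))

Run on the headnode and a compute node, it shows at a glance which machines ended up in which state.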

All the best from a rainy and windy London

Jörg
Post by Remy Dernat
Hi,
I renamed that thread because IMHO there is a another issue related to that
threat. Should we upgrade our system and lost a significant amount of
XFlops... ? What should be consider : - the risk - your user population
(size / type / average "knowledge" of hacking techs...) - the isolation
level from the outside (internet)
So here is me question : if this is not confidential, what will you do ?
I would not patch our little local cluster, contrary to all of our other
servers. Indeed, there is another "little" risk. If our strategy is to
always upgrade/patch, in this particular case you can loose many users that
will complain about perfs... So another question : what is your global
strategy about upgrades on your clusters ? Do you upgrade it as often as
you can ? One upgrade every X months (due to the downtime issue) ... ?
Thanks,
Best regardsRémy.
-------- Original message -------- From: John Hearns via Beowulf
bug & security flaw - kernel fix imposes performance penalty Thanks Chris.
In the past there have been Intel CPU 'bugs' trumpeted, but generally these
are fixed with a microcode update. This looks different, as it is a
fundamental part of the chips architecture.However the Register article
says: "It allows normal user programs – to discern to some extent the
layout or contents of protected kernel memory areas" I guess the phrase "to
some extent" is the vital one here. Are there any security exploits which
use this information? I guess it is inevitable that one will be engineered
now that this is known about. The question I am really asking is should we
worry about this for real world systems. And I guess tha answer is that if
the kernel developers are worried enough then yes we should be too.
Comments please.
There appears to be no microcode fix possible and the kernel fix will
incur a significant performance penalty, people are talking about in the
range of 5%-30% depending on the generation of the CPU. :-(
The performance hit (at least for the current patches) is related to
system calls, which HPC programs using networking gear like OmniPath
or Infiniband don't do much of.
-- greg
John Hearns via Beowulf
2018-01-05 13:40:38 UTC
Permalink
This seems very relevant

https://security.googleblog.com/2018/01/more-details-about-mitigations-for-cpu_4.html?m=1
Post by Jörg Saßmannshausen
Dear all,
that was the question I was pondering about all day today and I tried to read
and digest any information I could get.
- upgrade the heanode/login node (name it how you like) as that one is exposed
to the outside world via ssh
- do not upgrade the compute nodes for now until we got more information about
the impact of the patch(es).
It would not be the first time a patch is opening up another can of worms. What
I am hoping for is finding a middle way between security and performance. IF
the patch(es) are save to apply, I still can roll them out to the compute
nodes without loosing too much uptime. IF there is a problem regarding
performance it only affects the headnode which I can ignore on that cluster.
As always, your mileage will vary, specially as different clusters have
different purposes.
What I would like to know is: how about compensation? For me that is the same
as the VW scandal last year. We, the users, have been deceived. Specially if
the 30% performance loss which have been mooted are not special corner cases
but are seen often in HPC. Some of the chemistry code I am supporting relies
on disc I/O, others on InfiniBand and again other is running entirely in
memory.
These are my 2 cents. If somebody has a better idea, please let me know.
All the best from a rainy and windy London
Jörg
Christopher Samuel
2018-01-05 16:27:33 UTC
Permalink
Post by Jörg Saßmannshausen
What I would like to know is: how about compensation? For me that is
the same as the VW scandal last year. We, the users, have been
deceived.
I think you would be hard pressed to prove that, especially as it seems
that pretty much every mainstream CPU is affected (Intel, AMD, ARM, Power).
Post by Jörg Saßmannshausen
Specially if the 30% performance loss which have been mooted are not
special corner cases but are seen often in HPC. Some of the chemistry
code I am supporting relies on disc I/O, others on InfiniBand and
again other is running entirely in memory.
For RDMA-based networks like IB I would suspect that the impact will be
far less: the system calls to set things up will be affected, but after
that it should be less of an issue (the whole idea of RDMA was to get
the kernel out of the way as much as possible).

But of course we need real benchmarks to gauge that impact.
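As a toy illustration of why syscall-bound code is the worst case (not a benchmark; the loop count and the use of os.stat() on /dev/null are arbitrary choices), something along these lines run on a patched and an unpatched node shows where the extra cost lands:

#!/usr/bin/env python3
# Toy micro-benchmark: user-space work vs. syscall-heavy work.
# Each os.stat() crosses the user/kernel boundary, so the extra
# page-table work added by KPTI lands on that loop, while the
# arithmetic loop stays almost entirely in user space.
# Not a substitute for benchmarking real applications.
import os
import time

N = 200_000

def time_it(label, fn):
    t0 = time.perf_counter()
    fn()
    dt = time.perf_counter() - t0
    print(f"{label}: {dt:.3f} s ({dt / N * 1e6:.2f} us/iter)")

def user_space_loop():
    x = 0
    for i in range(N):
        x += i * i              # stays in user space

def syscall_loop():
    for _ in range(N):
        os.stat("/dev/null")    # at least one syscall per iteration

time_it("arithmetic only", user_space_loop)
time_it("stat() per iter", syscall_loop)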

Separating out the impact of various updates will also be important;
I've heard that the SLES upgrade to their microcode package includes
disabling branch prediction on AMD k17 family CPUs, for instance.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Jonathan Aquilina
2018-01-05 16:46:27 UTC
Permalink
Chris, in a number of articles I read they are saying AMDs are not
affected by this.
Post by Christopher Samuel
Post by Jörg Saßmannshausen
What I would like to know is: how about compensation? For me that is
the same as the VW scandal last year. We, the users, have been
deceived.
I think you would be hard pressed to prove that, especially as it seems
that pretty much every mainstream CPU is affected (Intel, AMD, ARM, Power).
Post by Jörg Saßmannshausen
Specially if the 30% performance loss which have been mooted are not
special corner cases but are seen often in HPC. Some of the chemistry
code I am supporting relies on disc I/O, others on InfiniBand and
again other is running entirely in memory.
For RDMA based networks like IB I would suspect that the impact will be
far less as the system calls to set things up will be impacted but that
after that it should be less of an issue (as the whole idea of RDMA was
to get the kernel out of the way as much as possible).
But of course we need real benchmarks to gauge that impact.
Separating out the impact of various updates will also be important,
I've heard that the SLES upgrade to their microcode package includes
disabling branch prediction on AMD k17 family CPUs for instance.
All the best,
Chris
Christopher Samuel
2018-01-05 19:14:14 UTC
Permalink
Post by Jonathan Aquilina
Chris on a number of articles I read they are saying AMD's are not
affected by this.
That's only 1 of the 3 attacks to my understanding. The Spectre paper says:

# Hardware. We have empirically verified the vulnerability of several
# Intel processors to Spectre attacks, including Ivy Bridge, Haswell
# and Skylake based processors. We have also verified the
# attack’s applicability to AMD Ryzen CPUs. Finally, we have
# also successfully mounted Spectre attacks on several Samsung and
# Qualcomm processors (which use an ARM architecture) found in popular
# mobile phones.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Prentice Bisbal
2018-01-05 22:32:19 UTC
Permalink
Post by Christopher Samuel
Post by Jonathan Aquilina
Chris on a number of articles I read they are saying AMD's are not
affected by this.
# Hardware. We have empirically verified the vulnerability of several
# Intel processors to Spectre attacks, including Ivy Bridge, Haswell
# and Skylake based processors. We have also verified the
# attack’s applicability to AMD Ryzen CPUs. Finally, we have
# also successfully mounted Spectre attacks on several Samsung and
# Qualcomm processors (which use an ARM architecture) found in popular
# mobile phones.
According to several articles I read today:

Meltdown (1 exploit) is Intel-specific.
Spectre (2 different exploits) affects just about every processor on
the planet.
Gerald Henriksen
2018-01-06 00:35:45 UTC
Permalink
Post by Prentice Bisbal
Meltdown (1 exploit)is Intel-specific
Spectre  (2 different exploits) affects just about every processor on
the planet.
This is correct, and the other key difference is that so far there is
only a solution to Meltdown.

The below linked page has this to say about Spectre:

"As it is not easy to fix, it will haunt us for quite some time"

Fedora has linked to this page for more info on the two:

https://spectreattack.com/

Prentice Bisbal
2018-01-08 15:19:22 UTC
Permalink
Post by Gerald Henriksen
Post by Prentice Bisbal
Meltdown (1 exploit)is Intel-specific
Spectre  (2 different exploits) affects just about every processor on
the planet.
This is correct, and the other key difference is that so far there is
only a solution to Meltdown.
"As it is not easy to fix, it will haunt us for quite some time"
https://spectreattack.com/
I haven't checked out that link yet, but from what I read, Spectre is
also much harder to exploit, which mitigates the risk of Spectre to
some extent.

Prentice
Xingqiu Yuan
2018-01-08 15:31:18 UTC
Permalink
Last week Intel claimed that their bug fix has little effect on the chips'
performance.
Post by Gerald Henriksen
Post by Prentice Bisbal
Meltdown (1 exploit)is Intel-specific
Spectre (2 different exploits) affects just about every processor on
the planet.
This is correct, and the other key difference is that so far there is
only a solution to Meltdown.
"As it is not easy to fix, it will haunt us for quite some time"
https://spectreattack.com/
I haven't checked out that link yet, but from what I read, spectre is also
much harder to exploit, too, which mitigates the risk of spectre to some
extent.
Prentice
Prentice Bisbal
2018-01-08 19:01:31 UTC
Permalink
Any statement from Intel, or any other chip manufacturer, regarding the
performance impact of these fixes should be viewed with a healthy dose
of skepticism. These companies have an interest in downplaying the
impact of these exploits/fixes to protect their stock prices.

I would rather believe numbers from third parties who have measured the
performance themselves, like Red Hat, who support multiple processor
vendors (AMD, Intel, etc.) and have less interest in a specific processor
selling better or worse. I would believe these numbers from Red Hat more
than I would those from Intel:

https://access.redhat.com/articles/3307751
Post by Xingqiu Yuan
Intel recently claimed that their bug fix gives little affect on the
chip's performances last week.
Meltdown (1 exploit)is Intel-specific
Spectre  (2 different exploits) affects just about every
processor on
the planet.
This is correct, and the other key difference is that so far there is
only a solution to Meltdown.
"As it is not easy to fix, it will haunt us for quite some time"
https://spectreattack.com/
I haven't checked out that link yet, but from what I read, spectre
is also much harder to exploit, too, which mitigates the risk of
spectre to some extent.
Prentice
Gerald Henriksen
2018-01-06 01:00:19 UTC
Permalink
Post by Prentice Bisbal
Meltdown (1 exploit)is Intel-specific
Spectre  (2 different exploits) affects just about every processor on
the planet.
For anyone interested this is AMD's response:

https://www.amd.com/en/corporate/speculative-execution


The Google Project Zero number/titles being used correspond to:

1 & 2 - Spectre

3 - Meltdown

Also an explanation in less technical terms from Red Hat:

https://www.redhat.com/en/blog/what-are-meltdown-and-spectre-here%E2%80%99s-what-you-need-know?sc_cid=7016000000127NJAAY
Christopher Samuel
2018-01-06 01:26:10 UTC
Permalink
Post by Gerald Henriksen
https://www.amd.com/en/corporate/speculative-execution
Cool, so variant 1 is likely the one that SuSE has firmware for to
disable branch prediction on Epyc.

cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
John Hearns via Beowulf
2018-01-06 11:05:35 UTC
Permalink
Disabling branch prediction - that in itself will have an effect on
performance.

One thing I read about the hardware is that the table which holds the
branch predictions is shared between processes running on the same CPU core.
That is part of the attack process - the malicious process has knowledge of
what the 'sharing' process will branch to.

I float the following idea - perhaps this reinforces good practice for
running HPC codes, meaning cpusets and process pinning,
which we already do for reasons of performance and for better resource
allocation.
I expose my ignorance here, and wonder if we will see more containerised
workloads which are strictly contained within their own memory space.
I then answer myself by saying I am talking nonsense, because the kernel
routines need to run somewhere, and this exploit is all about probing
areas of memory which you should not be able to reach, by speculatively
running some instructions and capturing what effect they have.
And "their own memory space" is within virtual memory anyway.
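For what it's worth, a minimal sketch of what such pinning looks like at the OS level, using Linux's sched_setaffinity call (the same mechanism that taskset, cpusets and most MPI launchers rely on); the core set {0, 1} is a placeholder, not a recommendation:

#!/usr/bin/env python3
# Sketch: pin the current process (and anything it spawns) to fixed cores.
# The core set {0, 1} is a placeholder; a batch system or MPI launcher
# would derive it from the job's allocation. Linux-only.
import os

print("allowed cores at start:", sorted(os.sched_getaffinity(0)))
os.sched_setaffinity(0, {0, 1})          # 0 means "this process"
print("now restricted to:", sorted(os.sched_getaffinity(0)))

# Children inherit the mask, so an exec'd solver stays on those cores too.
os.system("taskset -cp %d" % os.getpid())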
Post by Christopher Samuel
Post by Gerald Henriksen
https://www.amd.com/en/corporate/speculative-execution
Cool, so variant 1 is likely the one that SuSE has firmware for to
disable branch prediction on Epyc.
cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
John Hearns via Beowulf
2018-01-07 03:23:20 UTC
Permalink
I guess we have all seen this: https://access.redhat.com/articles/3307751

If not, 'HPC Workloads' (*) such as HPL are 2-5% affected.
However, as someone who recently installed a lot of NVMe drives for a fast
filesystem, the 8-19% performance hit on random IO to NVMe drives is not
pleasing.


(*) Quotes are deliberate. We all know that the best benchmarks are your
own applications.
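In that spirit, a rough sketch along these lines can be run against the same file under both the old and the patched kernel as a quick sanity check; the path is a placeholder, page-cache effects are not controlled, and fio or your own applications remain the real benchmark:

#!/usr/bin/env python3
# Very rough 4 KiB random-read latency probe (sketch, not a benchmark).
# TESTFILE is a placeholder: point it at a large existing file on the
# filesystem in question. Only meaningful as a before/after comparison
# on the same setup.
import os
import random
import time

TESTFILE = "/scratch/nvme/testfile"   # placeholder path
BLOCK = 4096
ITERS = 50_000

fd = os.open(TESTFILE, os.O_RDONLY)
size = os.fstat(fd).st_size
offsets = [random.randrange(0, size - BLOCK) // BLOCK * BLOCK
           for _ in range(ITERS)]

t0 = time.perf_counter()
for off in offsets:
    os.pread(fd, BLOCK, off)          # one read syscall per block
dt = time.perf_counter() - t0
os.close(fd)

print(f"{ITERS} random 4 KiB reads in {dt:.2f} s "
      f"({ITERS / dt:,.0f} IOPS, {dt / ITERS * 1e6:.1f} us/read)")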
Post by John Hearns via Beowulf
Disabling branch prediction - that in itself will have an effect on
performance.
One thing I read about the hardware is that the table which holds the
branch predictions is shared between processes running on the same CPU core.
That is part of the attack process - the malicious process has knowledge
of what the 'sharing' process will branch to.
I float the following idea - perhaps this reinforces good practice for
running HPC codes. Meaning cpusets and process pinning,
which we already do for reasons of performance and for better resource
allocation.
I expose my ignorance here, and wonder if we will see more containerised
workloads, which are strictly contained within their own memory space.
I then answer myself by saying I am talking nonsense, because the kernel
routines need to be run somewhere and this exploit is all about being able
to probe
areas of memory which you should not be able to do by speculatively
running some instructions and capturing what effect they have.
And ""their own memory space" is within virtual memory.
Post by Christopher Samuel
Post by Gerald Henriksen
https://www.amd.com/en/corporate/speculative-execution
Cool, so variant 1 is likely the one that SuSE has firmware for to
disable branch prediction on Epyc.
cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Tim Cutts
2018-01-07 20:17:41 UTC
Permalink
It seems fairly clear to me that any processor which performs speculative execution will be vulnerable to timing attacks of this nature.

I was pointed to a very much simplified but very clear explanation of this in a blog post by Eben Upton (of Raspberry Pi fame):

https://www.raspberrypi.org/blog/why-raspberry-pi-isnt-vulnerable-to-spectre-or-meltdown/

Well worth reading, and relatively accessible for people who aren’t super-nerds, so a good one to send to friends and family who are interested.

Tim
Post by Jonathan Aquilina
Chris on a number of articles I read they are saying AMD's are not
affected by this.
That's only 1 of the 3 attacks to my understanding. The Spectre paper says:

# Hardware. We have empirically verified the vulnerability of several
# Intel processors to Spectre attacks, including Ivy Bridge, Haswell
# and Skylake based processors. We have also verified the
# attack’s applicability to AMD Ryzen CPUs. Finally, we have
# also successfully mounted Spectre attacks on several Samsung and
# Qualcomm processors (which use an ARM architecture) found in popular
# mobile phones.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
--
The Wellcome Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
Jonathan Engwall
2018-01-05 16:54:34 UTC
Permalink
Trying to fathom all of this I stumbled on:
https://github.com/fanout/pollymer
Pollymer's existence seems to be only to wander the internet looking for a
header to copy.

On Jan 5, 2018 8:27 AM, "Christopher Samuel" <***@csamuel.org> wrote:

On 05/01/18 10:48, Jörg Saßmannshausen wrote:

Post by Jörg Saßmannshausen
What I would like to know is: how about compensation? For me that is
the same as the VW scandal last year. We, the users, have been
deceived.
I think you would be hard pressed to prove that, especially as it seems
that pretty much every mainstream CPU is affected (Intel, AMD, ARM, Power).


Post by Jörg Saßmannshausen
Specially if the 30% performance loss which have been mooted are not
special corner cases but are seen often in HPC. Some of the chemistry
code I am supporting relies on disc I/O, others on InfiniBand and
again other is running entirely in memory.
For RDMA based networks like IB I would suspect that the impact will be
far less as the system calls to set things up will be impacted but that
after that it should be less of an issue (as the whole idea of RDMA was
to get the kernel out of the way as much as possible).

But of course we need real benchmarks to gauge that impact.

Separating out the impact of various updates will also be important,
I've heard that the SLES upgrade to their microcode package includes
disabling branch prediction on AMD k17 family CPUs for instance.


All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Jörg Saßmannshausen
2018-01-07 12:22:40 UTC
Permalink
Dear Chris,

the first court cases against Intel have been filed:

http://www.spiegel.de/netzwelt/web/spectre-meltdown-erste-us-verbraucher-verklagen-intel-wegen-chip-schwachstelle-a-1186595.html

http://docs.dpaq.de/13109-show_temp.pl-27.pdf

http://docs.dpaq.de/13111-show_temp.pl-28.pdf

http://docs.dpaq.de/13110-07316352607.pdf

So, let's hope others join in here to get the ball rolling.

Don't get me wrong here, this is nothing against Intel per se. However, and
here I am talking wearing my HPC hat, a performance decrease of up to 30% is
simply not tolerable for me. I am working hard to squeeze the last bit of
performance out of the CPU, using highly optimised libraries, and then the
hardware has a flaw which makes all of that useless. I am somewhat surprised
that this has not been discovered earlier (both bugs, I mean).

I am sure it will be interesting to see how it will be patched and what the
performance penalty will be here.

All the best

Jörg
Post by Christopher Samuel
Post by Jörg Saßmannshausen
What I would like to know is: how about compensation? For me that is
the same as the VW scandal last year. We, the users, have been
deceived.
I think you would be hard pressed to prove that, especially as it seems
that pretty much every mainstream CPU is affected (Intel, AMD, ARM, Power).
Post by Jörg Saßmannshausen
Specially if the 30% performance loss which have been mooted are not
special corner cases but are seen often in HPC. Some of the chemistry
code I am supporting relies on disc I/O, others on InfiniBand and
again other is running entirely in memory.
For RDMA based networks like IB I would suspect that the impact will be
far less as the system calls to set things up will be impacted but that
after that it should be less of an issue (as the whole idea of RDMA was
to get the kernel out of the way as much as possible).
But of course we need real benchmarks to gauge that impact.
Separating out the impact of various updates will also be important,
I've heard that the SLES upgrade to their microcode package includes
disabling branch prediction on AMD k17 family CPUs for instance.
All the best,
Chris
Christopher Samuel
2018-01-07 21:29:39 UTC
Permalink
These would have to be Meltdown-related then, given that Spectre is so
widely applicable.

Greg K-H has a useful post up about the state of play with the various
Linux kernel patches for mainline and stable kernels here:

http://kroah.com/log/blog/2018/01/06/meltdown-status/

He also mentioned the Meltdown patches for ARM64:

# Right now the ARM64 set of patches for the Meltdown issue are not
# merged into Linus’s tree. They are staged and ready to be merged into
# 4.16-rc1 once 4.15 is released in a few weeks. Because these patches
# are not in a released kernel from Linus yet, I can not backport them
# into the stable kernel releases (hey, we have rules for a reason...)
#
# Due to them not being in a released kernel, if you rely on ARM64 for
# your systems (i.e. Android), I point you at the Android Common Kernel
# tree All of the ARM64 fixes have been merged into the 3.18, 4.4, and
# 4.9 branches as of this point in time.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Richard Walsh
2018-01-07 22:18:51 UTC
Permalink
All,

Mmm ... maybe I am missing something, but for an HPC cluster-specific solution ... how about skipping the fixes, and simply requiring all compute node jobs to run in exclusive mode and then zeroing out user memory between jobs ... ??

Individual job performance would be preserved, while there would be “only” a system throughput performance degradation. Is it clear that this is measurably worse ... ??
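A minimal sketch of the between-jobs scrub half of that idea, written as something a scheduler epilog could run as root; SLURM_JOB_USER is assumed to be available in the epilog environment (adapt for other schedulers), and since the kernel already zeroes freed pages before re-use, the script only kills leftover processes and drops the page cache so no cached file data survives into the next job:

#!/usr/bin/env python3
# Sketch of a node-scrub epilog for exclusively-allocated nodes.
# Assumes it runs as root with the departing user's name in
# SLURM_JOB_USER (an assumption; adapt to your batch system).
import os
import pwd
import signal
import sys

user = os.environ.get("SLURM_JOB_USER")
if not user:
    sys.exit(0)
uid = pwd.getpwnam(user).pw_uid

# Kill anything the departing user left running on this node.
for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        if os.stat("/proc/" + pid).st_uid == uid:
            os.kill(int(pid), signal.SIGKILL)
    except OSError:
        pass  # raced with process exit

# Drop clean page cache, dentries and inodes (requires root).
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3\n")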

Richard Walsh
Thrashing River Computing

Sent from my iPhone
Post by Christopher Samuel
These would have to be Meldown related then, given that Spectre is so
widely applicable.
Greg K-H has a useful post up about the state of play with the various
http://kroah.com/log/blog/2018/01/06/meltdown-status/
# Right now the ARM64 set of patches for the Meltdown issue are not
# merged into Linus’s tree. They are staged and ready to be merged into
# 4.16-rc1 once 4.15 is released in a few weeks. Because these patches
# are not in a released kernel from Linus yet, I can not backport them
# into the stable kernel releases (hey, we have rules for a reason...)
#
# Due to them not being in a released kernel, if you rely on ARM64 for
# your systems (i.e. Android), I point you at the Android Common Kernel
# tree All of the ARM64 fixes have been merged into the 3.18, 4.4, and
# 4.9 branches as of this point in time.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Christopher Samuel
2018-01-07 22:24:12 UTC
Permalink
Post by Richard Walsh
Mmm ... maybe I am missing something, but for an HPC cluster-specific
solution ... how about skipping the fixes, and simply requiring all
compute node jobs to run in exclusive mode and then zero-ing out user
memory between jobs ... ??
If you are running other daemons with important content (say the munge
service that Slurm uses for authentication) then you risk the user being
able to steal the secret key from the daemon.

But it all depends on your risk analysis of course.

All the best!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Jörg Saßmannshausen
2018-01-07 22:44:33 UTC
Permalink
Dear all,

Chris is right here. It depends on what is running on your HPC cluster. You
might not see a performance degradation at all, or you might see one of 30%
(just to stick with that number).
Also, if you have a cluster which is solely used by one research group, the
chances they are hacking each other are slim, I would argue. That still
leaves the argument about a compromised user account.
If you are running a large, multi-user, multi-institutional cluster, you
might want to put security over performance. This might be especially true
if you are using confidential data like patient data.
So, you will need to set up your own risk matrix and hope you made the right
decision.
For me: we have decided to upgrade the headnode but for now leave the compute
nodes untouched. We can then decide at a later stage whether or not we want to
upgrade the compute nodes, maybe after we have done some testing of typical
programs. It is not an ideal scenario, but we are living in a real and not an
ideal world, I guess.

All the best from London

Jörg
Post by Christopher Samuel
Post by Richard Walsh
Mmm ... maybe I am missing something, but for an HPC cluster-specific
solution ... how about skipping the fixes, and simply requiring all
compute node jobs to run in exclusive mode and then zero-ing out user
memory between jobs ... ??
If you are running other daemons with important content (say the munge
service that Slurm uses for authentication) then you risk the user being
able to steal the secret key from the daemon.
But it all depends on your risk analysis of course.
All the best!
Chris
Christopher Samuel
2018-01-05 16:19:51 UTC
Permalink
Post by Remy Dernat
So here is me question : if this is not confidential, what will you do ?
Any system where you do not have 100% trust in your users, their
passwords and the devices they use will (IMHO) need to be patched.

But as ever this will need to be a site-specific risk assessment.

Sites running Slurm with Munge might want to consider what the impact
would be of a user being able to read the munge secret key out of
memory and potentially reusing it, for instance.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Remy Dernat
2018-01-05 18:18:08 UTC
Permalink
Meltdown - AMD not affected - specific to Intel products
Spectre - all cpus
https://blogs.manageengine.com/desktop-mobile/2018/01/05/meltdown-and-spectre-battling-the-bugs-in-intel-amd-and-arm-processors.html
-------- Original message --------
From: Jonathan Aquilina <***@eagleeyet.net>
Date: 05/01/2018 17:47 (GMT+01:00)
To: ***@beowulf.org
Subject: Re: [Beowulf] [upgrade strategy] Intel CPU design bug & security flaw - kernel fix imposes performance penalty
Chris on a number of articles I read they are saying AMD's are not
affected by this.
Post by Christopher Samuel
Post by Jörg Saßmannshausen
What I would like to know is: how about compensation? For me that is
the same as the VW scandal last year. We, the users, have been
deceived.
I think you would be hard pressed to prove that, especially as it seems
that pretty much every mainstream CPU is affected (Intel, AMD, ARM, Power).
Post by Jörg Saßmannshausen
Specially if the 30% performance loss which have been mooted are not
special corner cases but are seen often in HPC. Some of the chemistry
code I am supporting relies on disc I/O, others on InfiniBand and
again other is running entirely in memory.
For RDMA based networks like IB I would suspect that the impact will be
far less as the system calls to set things up will be impacted but that
after that it should be less of an issue (as the whole idea of RDMA was
to get the kernel out of the way as much as possible).
But of course we need real benchmarks to gauge that impact.
Separating out the impact of various updates will also be important,
I've heard that the SLES upgrade to their microcode package includes
disabling branch prediction on AMD k17 family CPUs for instance.
All the best,
Chris