Discussion:
[Beowulf] RHEL7 kernel update for L1TF vulnerability breaks RDMA
Chris Samuel
2018-08-17 04:47:37 UTC
Hi all,

Just a heads up that the 3.10.0-862.11.6.el7.x86_64 kernel from RHEL/CentOS
that was released to address the most recent Intel CPU problem "L1TF" seems to
break RDMA (found by a colleague here at Swinburne). The discovery came
about when testing the new kernel on a system running Lustre.

https://jira.whamcloud.com/browse/LU-11257

Stanford have reported it to Red Hat, but the BZ entry is locked due to its
relationship with L1TF.

https://bugzilla.redhat.com/show_bug.cgi?id=1618452

Hope this helps folks out there..

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC



_______________________________________________
Beowulf mailing list, ***@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowu
Chris Samuel
2018-08-17 05:05:05 UTC
Post by Chris Samuel
Just a heads up that the 3.10.0-862.11.6.el7.x86_64 kernel from RHEL/CentOS
that was released to address the most recent Intel CPU problem "L1TF" seems
to break RDMA (found by a colleague here at Swinburne).
There are 6 CVEs addressed in that update from the look of it, so it might not
be the L1TF fix itself that has triggered it.

https://access.redhat.com/errata/RHSA-2018:2384
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC



Kilian Cavalotti
2018-08-17 14:54:03 UTC
Hi Chris,
Post by Chris Samuel
There are 6 CVEs addressed in that update from the look of it, so it might not
be the L1TF fix itself that has triggered it.
https://access.redhat.com/errata/RHSA-2018:2384
That's true: RH mentioned an "embargo'd security fix" but didn't refer
to L1TF explicitly (which I think is not under embargo anymore).

As the reporter of the issue on the Whamcloud JIRA, I also have to
apologize for initially pointing fingers at Lustre; it didn't cross my
mind that this kind of whole RDMA stack breakage would have slipped
past Red Hat's QA.

Cheers,
--
Kilian

Jörg Saßmannshausen
2018-08-17 20:22:08 UTC
Hi all,

I came across the 'Foreshadow' problem 2 days ago.

This is what I got back from my colleagues:

https://access.redhat.com/security/vulnerabilities/L1TF-perf

This is more of a performance investigation, but I thought it might add a
bit more information to the whole problem.
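
To see what a given node itself reports, kernels that carry these mitigations generally expose one status line per vulnerability under /sys/devices/system/cpu/vulnerabilities/. The following is a minimal Python sketch, assuming that directory is present (older kernels may not have it); the function name is made up for illustration and is not part of any tool mentioned in this thread.

```python
from pathlib import Path

# Standard sysfs location on kernels that report mitigation status.
# Assumption: the running kernel provides it; older kernels may not.
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_mitigation_status(vuln_dir=VULN_DIR):
    """Return {vulnerability name: status line} for each sysfs entry found."""
    vuln_dir = Path(vuln_dir)
    status = {}
    if not vuln_dir.is_dir():
        return status  # directory missing: old kernel or not Linux
    for entry in sorted(vuln_dir.iterdir()):
        try:
            status[entry.name] = entry.read_text().strip()
        except OSError:
            status[entry.name] = "unreadable"
    return status


if __name__ == "__main__":
    # Prints lines such as "l1tf: <status reported by the kernel>".
    for name, line in read_mitigation_status().items():
        print(f"{name}: {line}")
```

Running this before and after a kernel update gives a quick record of which mitigations changed, which can help when weighing the performance figures in the Red Hat article.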

All the best

Jörg

Chris Samuel
2018-08-18 03:33:55 UTC
Post by Kilian Cavalotti
That's true: RH mentioned an "embargo'd security fix" but didn't refer
to L1TF explicitly (which I think is not under embargo anymore).
Agreed, though I'm not sure any of the listed fixes are embargoed now.
Post by Kilian Cavalotti
As the reporter of the issue on the Whamcloud JIRA, I also have to
apologize for initially pointing fingers at Lustre, it didn't cross my
mind that this kind of whole RDMA stack breakage would have slipped
past Red Hat's QA.
Oh I didn't read that as pointing any fingers at Lustre at all, just that the
kernel update broke Lustre for you (and for us!).

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC



Jörg Saßmannshausen
2018-08-18 07:22:11 UTC
Dear all,

if the problem is RDMA, how about InfiniBand? Will that be broken as well?

All the best

Jörg

Christopher Samuel
2018-08-18 08:31:35 UTC
Post by Jörg Saßmannshausen
if the problem is RDMA, how about InfiniBand? Will that be broken as well?
For RDMA it appears so, though IPoIB still works for us (though ours is
OPA rather than IB; Kilian reported the same).
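
A rough first check of which RDMA-capable devices the kernel has registered can be read from /sys/class/infiniband, where both IB and OPA adapters normally appear. The sketch below is only an illustration with a hypothetical function name. It does not show that RDMA transfers work: a port can report an active state while the kind of breakage described in this thread is still present, so a real transfer test is still needed.

```python
from pathlib import Path

# sysfs class directory where the kernel lists RDMA devices (IB, OPA, RoCE).
IB_CLASS_DIR = Path("/sys/class/infiniband")


def list_rdma_ports(class_dir=IB_CLASS_DIR):
    """Return [(device, port, state)] for every port found under class_dir."""
    class_dir = Path(class_dir)
    ports = []
    if not class_dir.is_dir():
        return ports  # no RDMA stack loaded, or not Linux
    for dev in sorted(class_dir.iterdir()):
        ports_dir = dev / "ports"
        if not ports_dir.is_dir():
            continue
        for port in sorted(ports_dir.iterdir()):
            try:
                # The state file holds a line such as "4: ACTIVE".
                state = (port / "state").read_text().strip()
            except OSError:
                state = "unknown"
            ports.append((dev.name, port.name, state))
    return ports


if __name__ == "__main__":
    for dev, port, state in list_rdma_ports():
        print(f"{dev} port {port}: {state}")
```

An empty result after a kernel update would at least indicate that the drivers did not come up at all, which is a different failure from the one reported here.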

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Jörg Saßmannshausen
2018-08-18 10:47:58 UTC
Hi Chris,

this is bad news if InfiniBand is affected as well, as that is what
we need for parallel calculations. They make use of RDMA, and if that
has a problem... well, you get the idea, I guess.

Has anybody contacted the vendors like Mellanox or Intel regarding this?

All the best

Jörg

Chris Samuel
2018-08-18 11:56:52 UTC
Post by Jörg Saßmannshausen
Hi Chris,
Hiya,
Post by Jörg Saßmannshausen
this is bad news if InfiniBand is affected as well, as
that is what we need for parallel calculations. They make use
of RDMA, and if that has a problem... well, you get the idea, I
guess.
Oh yes, this is why I wanted to bring it to everyone's attention; this
isn't just about Lustre, it's much more widespread.
Post by Jörg Saßmannshausen
Has anybody contacted the vendors like Mellanox or Intel regarding this?
As Kilian wrote in the Lustre bug, quoting his RHEL bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1618452

— Comment #3 from Don Dutile <***@redhat.com> —
Already reported and being actively fixed.

Cannot make this public, as the patch that caused it was due to
embargo'd security fix.

This issue has highest priority for resolution.
Revert to 3.10.0-862.11.5.el7 in the mean time.

This bug has been marked as a duplicate of bug 1616346
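
For sites that want to flag nodes still running the affected kernel, a minimal Python sketch follows. The two version strings are the ones quoted in this thread; the function name is hypothetical, and the script only reports, it does not perform any downgrade.

```python
import platform

# Kernel releases as quoted in this thread.
AFFECTED_RELEASE = "3.10.0-862.11.6.el7"
SUGGESTED_FALLBACK = "3.10.0-862.11.5.el7"


def is_affected(release):
    """True if `release` (as printed by `uname -r`) is the kernel reported broken."""
    # `uname -r` appends the architecture (e.g. ".x86_64"), so also accept
    # the affected release followed by a dot-separated suffix.
    return release == AFFECTED_RELEASE or release.startswith(AFFECTED_RELEASE + ".")


if __name__ == "__main__":
    running = platform.release()
    if is_affected(running):
        print(f"{running}: reported to break RDMA; "
              f"Red Hat suggests reverting to {SUGGESTED_FALLBACK}")
    else:
        print(f"{running}: not the release reported in this thread")
```

The same function can be fed release strings collected from a whole cluster rather than only the local node.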
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Jörg Saßmannshausen
2018-08-18 13:55:22 UTC
Hi Chris,

unless there is something I am missing, I read about that in 'Der Spiegel Online'
on Wednesday:

http://www.spiegel.de/netzwelt/gadgets/foreshadow-neue-angriffsmethode-trifft-intel-chips-und-cloud-dienste-a-1223289.html

and the link was to this page here:

https://foreshadowattack.eu/

So I don't really understand the "Cannot make this public, as the patch that
caused it was due to embargo'd security fix." issue. The problem is known, and I
also noticed that Debian issued some Intel microcode patches, which raised my
awareness of a potential problem again.

Sorry, maybe I am missing something here.

All the best

Jörg

Chris Samuel
2018-08-19 01:26:25 UTC
Post by Jörg Saßmannshausen
So I don't really understand the "Cannot make this public, as the patch
that caused it was due to embargo'd security fix." issue.
I don't think any of us do, unless there's another fix there that is for an
undisclosed CVE (which seems unlikely).
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC



Jeff Johnson
2018-08-18 19:19:07 UTC
With the spate of security flaws over the past year and the impacts their
fixes have on performance and functionality it might be worthwhile to just
run airgapped.
--
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing

***@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001 f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite C - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage
Joe Landman
2018-08-18 19:32:12 UTC
FWIW: it looks like this is the CVE that keeps on giving. Yesterday some
of the mitigation hit, and this morning a new rev of the kernel with a
single CVE patch came out. Don't know when it might show up in distro
kernels, but it's already in mine.

We are not done with Spectre/Meltdown vulns by any stretch (no insider
info, just a hypothesis).
--
Joe Landman
e: ***@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

Chris Samuel
2018-08-19 01:31:04 UTC
Post by Jeff Johnson
With the spate of security flaws over the past year and the impacts their
fixes have on performance and functionality it might be worthwhile to just
run airgapped.
For me none of the HPC systems I've been involved with here in Australia would
have had that option. Virtually all have external users and/or reliance on
external data for some of the work they are used for (and the sysadmins don't
usually have control over the projects & people who get to use them).

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC



John Hearns via Beowulf
2018-08-19 04:59:52 UTC
*To patch, or not to patch, that is the question:*
Whether 'tis nobler in the mind to suffer
The loops and branches of speculative execution,
Or to take arms against a sea of exploits
And by opposing end them. To die—to sleep,
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That HPC is heir to: 'tis a consummation
Devoutly to be wish'd. To die, to sleep

John Hearns via Beowulf
2018-08-19 05:11:16 UTC
Rather more seriously, this is a topic which is well worth discussing.
What are best practices on patching HPC systems?
Perhaps we need a separate thread here.

I will throw in one thought, which I honestly do not want to see happening.
I recently took a trip to Bletchley Park in the UK. On display there was an
IBM punch card machine and sample punch cards. Back in the day one prepared
a 'job deck' which was collected by an operator in a metal hopper then
wheeled off to the mainframe. You did not ever touch the mainframe. So
effectively an air-gapped system. A system like that would these days
kill productivity.
However, should there be 'virus checking' of executables before they are
run on compute nodes?
One of the advantages lauded for Linux systems is of course that anti-virus
programs are not needed.

Also I should ask - in the jargon of anti-virus is there a 'signature' for
any of these exploit codes? One would guess that bad actors copy the
example codes already published and use these almost in a cut and paste
fashion. So the signature would be tight loops repeatedly reading or
writing to the same memory locations. Can that be distinguished from
innocent code?

Jörg Saßmannshausen
2018-08-19 20:59:52 UTC
Dear all,

whereas I accept that no system is 100% secure and bug-free, I am
beginning to wonder whether the current problems we are having are actually
design flaws and whether, and that is the more important bit, Intel and other
vendors knew about them. I am thinking of the famous 'diesel-engine' scandal
and, continuing this line of thought, of dragging the vendors into the limelight
and getting them to pay for this.
I mean, we have to sort out the mess the company made in the first place, and
have to weigh applying a patch which might decrease the performance of
our systems (I am doing HPC, hence my InfiniBand question) against security.
Where will it stop?

Given that the current and previous 'bugs' are clearly design flaws IMHO, what are
the chances of a lawsuit? Any compensation here should go to Open Source
projects, in my opinion, which are making software more secure.

Any comments here?

All the best

Jörg

Lux, Jim (337K)
2018-08-20 17:27:59 UTC
All complex systems have flaws. It's more a matter of deciding which flaws are acceptable and which aren't, which is driven by economic factors for the most part - the cost of fixing the flaw (and potentially introducing a new one) vs the cost of damage from the flaw.

I'd find it hard to believe that Intel's CPU designers sat around implementing deliberate flaws (the Bosch engine controller for VW model).

I'd not find it hard to believe that someone, somewhere raised a speculation about a potential flaw, among many others. That one just didn't happen to get resources applied to it, others did. Picking which ones to attack and spend resources on is a difficult question, and often gets answered based on totally irrelevant factors.

That's not negligence - that's just "it is impossible to discover and fix all possible bugs".

This is not unusual even in MUCH simpler chips. I have some 8-bit-wide level shifters (from 2.5 to 3.3V logic) that have an obscure behavior, related to the rate at which the two power supplies come up, that causes them not to pass data (preventing the system in which they are installed from booting), about 1 out of 500 times. The mfr's response is "yeah, we think we can duplicate that, but we've moved on to a newer version of that chip, why don't you replace the chips with the new ones". This isn't necessarily an issue of the chip not performing to the datasheet specs (essentially, the data sheet is silent on this).

The Errata and Notes lists for complex parts (like CPUs and large FPGAs) run to hundreds of pages, and continuously grow as people find more odd behaviors.


Therefore - one should assume your system has unknown flaws and design your software and operational procedures accordingly.


James Lux
Project Manager, SunRISE - Sun Radio Interferometer Space Experiment
Task Manager, DARPA High Frequency Research (DHFR) Space Testbed
Jet Propulsion Laboratory (Mail Stop 161-213)
4800 Oak Grove Drive
Pasadena CA 91109
(818)354-2075 (office)
(818)395-2714 (cell)

Chris Samuel
2018-08-21 08:35:56 UTC
Post by Lux, Jim (337K)
I'd find it hard to believe that Intel's CPU designers sat around
implementing deliberate flaws ( the Bosch engine controller for VW model).
Not to mention that Spectre variants affected AMD, ARM & IBM (at least).

This publicly NSA funded research ("The Intel 80x86 processor architecture:
pitfalls for secure systems") from 1995 has an interesting section:

https://ieeexplore.ieee.org/document/398934/
https://pdfs.semanticscholar.org/2209/42809262c17b6631c0f6536c91aaf7756857.pdf

Section 3.10 - Cache and TLB timing channels

which warns (in generalities) about the use of MSRs and the use of instruction
timing as side channels.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC



Lux, Jim (337K)
2018-08-21 15:17:58 UTC
Post by Lux, Jim (337K)
I'd find it hard to believe that Intel's CPU designers sat around
implementing deliberate flaws ( the Bosch engine controller for VW model).
Not to mention that Spectre variants affected AMD, ARM & IBM (at least).

This publicly NSA funded research ("The Intel 80x86 processor architecture:
pitfalls for secure systems") from 1995 has an interesting section:

https://ieeexplore.ieee.org/document/398934/
https://pdfs.semanticscholar.org/2209/42809262c17b6631c0f6536c91aaf7756857.pdf

Section 3.10 - Cache and TLB timing channels

which warns (in generalities) about the use of MSRs and the use of instruction
timing as side channels.



Such vulnerabilities have existed since the early days of computers. As processors and use cases have gotten more complex, they have become harder to find.

This is why, back in the "Orange Book" days, there was the whole "system high" mode of operation - basically "air gap: you, or things you trust, are the only ones on the machine".


John Hearns via Beowulf
2018-08-23 19:11:03 UTC
https://www.theregister.co.uk/2018/08/21/intel_cpu_patch_licence/

https://perens.com/2018/08/22/new-intel-microcode-license-restriction-is-not-acceptable/

John Hearns via Beowulf
2018-08-23 19:13:54 UTC
My bad. The license has been updated now
https://www.theregister.co.uk/2018/08/23/intel_microcode_license/
Post by John Hearns via Beowulf
https://www.theregister.co.uk/2018/08/21/intel_cpu_patch_licence/
https://perens.com/2018/08/22/new-intel-microcode-license-restriction-is-not-acceptable/
On 8/21/18, 1:37 AM, "Beowulf on behalf of Chris Samuel" <
Post by Lux, Jim (337K)
I'd find it hard to believe that Intel's CPU designers sat around
implementing deliberate flaws ( the Bosch engine controller for VW
model).
Not to mention that Spectre variants affected AMD, ARM & IBM (at least).
https://ieeexplore.ieee.org/document/398934/
https://pdfs.semanticscholar.org/2209/42809262c17b6631c0f6536c91aaf7756857.pdf
Section 3.10 - Cache and TLB timing channels
which warns (in generalities) about the use of MSRs and the use of instruction
timing as side channels.
Such vulnerabilities have existed since the early days of computers. As
processors and use cases have gotten more complex they're harder to find.
This is why back in "orange book" days there's the whole "system high"
mode of operation - basically "air gap, you, or things you trust, are the
only one on the machine"
Jörg Saßmannshausen
2018-08-21 10:08:21 UTC
Permalink
Dear all,
Post by Lux, Jim (337K)
All complex systems have flaws. It's more a matter of deciding which flaws
are acceptable and which aren't, which is driven by economic factors for
the most part - the cost of fixing the flaw (and potentially introducing a
new one) vs the cost of damage from the flaw.
I agree with this.
Post by Lux, Jim (337K)
I'd find it hard to believe that Intel's CPU designers sat around
implementing deliberate flaws ( the Bosch engine controller for VW model).
There is the famous example, published not so long ago, of an encryption
standard issued by NIST that was reportedly weakened deliberately (Dual_EC_DRBG).
Post by Lux, Jim (337K)
I'd not find it hard to believe that someone, somewhere raised a speculation
about a potential flaw, among many others. That one just didn't happen to
get resources applied to it, others did. Picking which ones to attack and
spend resources on is a difficult question, and often gets answered based
on totally irrelevant factors.
That's not negligence - that's just "it is impossible to discover and fix
all possible bugs"
My understanding about the recent CPU 'problems' is that researchers were
looking into that some time ago (I believe they were from the TU Graz in
Austria but I might be wrong here). My hunch here is there is some 'common
wisdom' how to design a CPU and maybe that sometimes does not get questioned
enough and in the detail we need it. As a scientist, a friend of mine once
told me: never jeopardize your result by running a second experiment. I
totally disagree with this but these days it seems to be common practice, not
only in IT.
Post by Lux, Jim (337K)
This is not unusual even in MUCH simpler chips-I have some 8 bit wide level
shifters (from 2.5 to 3.3V logic) that have an obscure behavior with the
rate at which the two power supplies come up that causes them not to pass
data (preventing the system in which they are installed from booting).
About 1 out of 500 times. The mfr's response is "yeah, we think we can
duplicate that, but we've moved on to a newer version of that chip, why
don't you replace the chips with the new ones". This isn't necessarily
an issue of the chip not performing to the datasheet specs (essentially,
the data sheet is silent on this).
And that is exactly the problem: instead of understanding why it is behaving
like this, there is a patch and we move on. Why bother? It only costs money.
Less profit for the company. Shareholders like to see high profits. And so on.
So we never understand what is causing this in the first place, we don't have
in-depth knowledge, but we somehow fixed it. Lesson learned? None.
Again, wearing my gentleman scientist hat: if we understand this problem, we
might not need to patch it but we can learn from it and *fix* it properly.
Hell, we might even improve our design! Oh, hang on, that would require
putting resources towards it. Sorry folks! :-)
Post by Lux, Jim (337K)
The Errata and Notes lists for complex parts (like CPUs and large FPGAs)
runs to hundreds of pages, and continuously grows as people find more odd
behaviors.
No doubt about that, the same is true in my subject: chemistry.
Post by Lux, Jim (337K)
Therefore - one should assume your system has unknown flaws and design your
software and operational procedures accordingly.
So in a nutshell: we simply have to accept that bridges might collapse so we
issue everybody a security cable when they want to cross the bridge. Can this
be the solution?

Don't get me wrong. I am deliberately playing devil's advocate here with the
aim to illustrate the underlying problem.

Added: see also Chris' email which arrived whilst composing this one here.

All the best from a sunny London!

Jörg

Jonathan Engwall
2018-08-19 21:13:30 UTC
Permalink
Thank you
I am not shocked that my previous message may have been removed.
To clarify: nothing has been removed to my knowledge. Your email is in the
list archives.

http://beowulf.org/pipermail/beowulf/2018-August/035219.html

All the best,
Chris (just woken up)
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC



Jonathan Engwall
2018-08-19 19:02:33 UTC
Permalink
As far as vulnerabilities go, here is a terrible idea:
Write a little login patch that grabs your own email address and uses it
to attempt to login to Facebook without a password 1000 times per
second. Kill the script after two seconds. You want to read the Facebook
head first so you can kick all the noise to /dev/null. It is brute force
based on a query.

On August 18, 2018, at 10:12 PM, John Hearns via Beowulf <***@beowulf.org> wrote:

Rather more seriously, this is a topic which is well worth discussing.

What are best practices on patching HPC systems?

Perhaps we need a separate thread here.


I will throw in one thought, which I honestly do not want to see happening.

I recently took a trip to Bletchley Park in the UK. On display there was an IBM punch card machine and sample punch cards. Back in the day one prepared a 'job deck' which was collected by an operator in a metal hopper, then wheeled off to the mainframe. You never touched the mainframe yourself, so it was effectively an air-gapped system. A system like that would kill productivity these days.

However, should there be 'virus checking' of executables before they are run on compute nodes?

One of the advantages lauded for Linux systems is of course that anti-virus programs are not needed.


Also I should ask - in the jargon of anti-virus is there a 'signature' for any of these exploit codes? One would guess that bad actors copy the example codes already published and use these almost in a cut and paste fashion. So the signature would be tight loops repeatedly reading or writing to the same memory locations. Can that be distinguished from innocent code?

On Sun, 19 Aug 2018 at 05:59, John Hearns <***@googlemail.com> wrote:

To patch, or not to patch, that is the question:
Whether 'tis nobler in the mind to suffer
The loops and branches of speculative execution,
Or to take arms against a sea of exploits
And by opposing end them. To die—to sleep,
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That HPC is heir to: 'tis a consummation
Devoutly to be wish'd. To die, to sleep
Post by Jeff Johnson
With the spate of security flaws over the past year and the impacts their
fixes have on performance and functionality it might be worthwhile to just
run airgapped.
For me none of the HPC systems I've been involved with here in Australia would
have had that option.  Virtually all have external users and/or reliance on
external data for some of the work they are used for (and the sysadmins don't
usually have control over the projects & people who get to use them).

All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC



Chris Samuel
2018-08-19 21:09:40 UTC
Permalink
I am not shocked that my previous message may have been removed.
To clarify: nothing has been removed to my knowledge. Your email is in the
list archives.

http://beowulf.org/pipermail/beowulf/2018-August/035219.html

All the best,
Chris (just woken up)
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC



Chris Samuel
2018-09-10 08:22:13 UTC
Permalink
Post by Chris Samuel
Just a heads up that the 3.10.0-862.11.6.el7.x86_64 kernel from RHEL/CentOS
that was released to address the most recent Intel CPU problem "L1TF" seems
to break RDMA (found by a colleague here at Swinburne).
So this CentOS bug has a one line bug fix for this problem!

https://bugs.centos.org/view.php?id=15193

It's a corker - basically it looks like someone typo'd a ; into an if
statement, the fix is:

-       if (!rdma_is_port_valid_nospec(device, &ah_attr->port_num));
+       if (!rdma_is_port_valid_nospec(device, &ah_attr->port_num))
                return -EINVAL;

So it always returns -EINVAL when checking the port, because the stray
semicolon turns the if into a no-op and the return is no longer guarded..
:-(

Patch attached...
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
John Hearns via Beowulf
2018-09-10 09:32:19 UTC
Permalink
Linux should have coded the kernel in Python then. Easily caught there.

(Yes. I am making a joke)
Post by Chris Samuel
Post by Chris Samuel
Just a heads up that the 3.10.0-862.11.6.el7.x86_64 kernel from RHEL/CentOS
that was released to address the most recent Intel CPU problem "L1TF" seems
to break RDMA (found by a colleague here at Swinburne).
So this CentOS bug has a one line bug fix for this problem!
https://bugs.centos.org/view.php?id=15193
It's a corker - basically it looks like someone typo'd a ; into an if
- if (!rdma_is_port_valid_nospec(device, &ah_attr->port_num));
+ if (!rdma_is_port_valid_nospec(device, &ah_attr->port_num))
return -EINVAL;
So it always returns -EINVAL when checking the port as the if becomes a noop..
:-(
Patch attached...
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Peter St. John
2018-09-10 15:25:55 UTC
Permalink
I had wanted to say that such a bug would be caught by compiling with some
reasonable warning level; but I think I was wrong.

I compiled
if(1==1);

with some wrapper and got nothing with whatever gcc I have on this laptop,
until
gcc -Wextra

which is more persnickety than -Wall, and just got
mynoop.c: In function 'main':
mynoop.c:4:10: warning: suggest braces around empty body in an 'if'
statement [-Wempty-body]
if(1==1);
^

So I guess I have to forgive the software engineer who fat-fingered that
semicolon. Of course I've done worse.

Peter
Post by Chris Samuel
Post by Chris Samuel
Just a heads up that the 3.10.0-862.11.6.el7.x86_64 kernel from
RHEL/CentOS
Post by Chris Samuel
that was released to address the most recent Intel CPU problem "L1TF"
seems
Post by Chris Samuel
to break RDMA (found by a colleague here at Swinburne).
So this CentOS bug has a one line bug fix for this problem!
https://bugs.centos.org/view.php?id=15193
It's a corker - basically it looks like someone typo'd a ; into an if
- if (!rdma_is_port_valid_nospec(device, &ah_attr->port_num));
+ if (!rdma_is_port_valid_nospec(device, &ah_attr->port_num))
return -EINVAL;
So it always returns -EINVAL when checking the port as the if becomes a noop..
:-(
Patch attached...
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Chris Samuel
2018-09-10 22:15:16 UTC
Permalink
Post by Peter St. John
I had wanted to say that such a bug would be caught by compiling with some
reasonable warning level; but I think I was wrong.
Interesting - looks like it depends on your GCC version, 7.3.0 catches it with -Wall here:

***@quad:/tmp$ gcc -Wall test.c -o test
test.c: In function ‘main’:
test.c:6:2: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation]
if ( test );
^~
test.c:7:3: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the ‘if’
printf ( "hello\n" );
^~~~~~
Post by Peter St. John
So I guess I have to forgive the software engineer who fat-fingered that
semicolon. Of course I've done worse.
Oh yes, same here too! There but for... and all that. :-)

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC



Peter St. John
2018-09-10 22:30:05 UTC
Permalink
Yes, the gcc I used is 5.1; I guess that's how long I've had this laptop :-)
And I like that "does not guard" warning, that sounds useful.
Post by Peter St. John
I had wanted to say that such a bug would be caught by compiling with
some
Post by Peter St. John
reasonable warning level; but I think I was wrong.
test.c:6:2: warning: this ‘if’ clause does not guard...
[-Wmisleading-indentation]
if ( test );
^~
test.c:7:3: note: ...this statement, but the latter is misleadingly
indented as if it were guarded by the ‘if’
printf ( "hello\n" );
^~~~~~
Post by Peter St. John
So I guess I have to forgive the software engineer who fat-fingered that
semicolon. Of course I've done worse.
Oh yes, same here too! There but for... and all that. :-)
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Ryan Novosielski
2018-09-10 23:17:21 UTC
Permalink
Post by Chris Samuel
Post by Peter St. John
I had wanted to say that such a bug would be caught by compiling with some
reasonable warning level; but I think I was wrong.
test.c:6:2: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation]
if ( test );
^~
test.c:7:3: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the ‘if’
printf ( "hello\n" );
^~~~~~
Post by Peter St. John
So I guess I have to forgive the software engineer who fat-fingered that
semicolon. Of course I've done worse.
Oh yes, same here too! There but for... and all that. :-)
So we’ve learned what, here, that RedHat doesn’t test the RDMA stack at all?
--
____
|| \\UTGERS, |---------------------------*O*---------------------------
||_// the State | Ryan Novosielski - ***@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
|| \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

Kilian Cavalotti
2018-09-11 00:41:24 UTC
Permalink
Post by Ryan Novosielski
So we’ve learned what, here, that RedHat doesn’t test the RDMA stack at all?
Looks like Spectre-like vulns take all precedence, these days, indeed.

Last I heard, the fix will be in 862.14.1 to be released on the 25th

Cheers,
--
Kilian
Chris Samuel
2018-09-11 05:41:52 UTC
Permalink
Post by Kilian Cavalotti
Last I heard, the fix will be in 862.14.1 to be released on the 25th
Ah interesting, I wonder if that fix is already in the 3.10.0-933 kernel
that's meant to be in the RHEL 7.6 beta?
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC



Ryan Novosielski
2018-10-01 21:09:53 UTC
Permalink
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On Mon, Sep 10, 2018 at 4:18 PM Ryan Novosielski
Post by Ryan Novosielski
So we’ve learned what, here, that RedHat doesn’t test the RDMA stack at all?
Looks like Spectre-like vulns take all precedence, these days,
indeed.
Last I heard, the fix will be in 862.14.1 to be released on the 25th
Confirmed fixed in 862.14.4:

https://access.redhat.com/solutions/3568891

- --
____
|| \\UTGERS, |----------------------*O*------------------------
||_// the State | Ryan Novosielski - ***@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
|| \\ of NJ | Office of Advanced Res. Comp. - MSB C630, Newark
`'
-----BEGIN PGP SIGNATURE-----

iEYEARECAAYFAluyjRkACgkQmb+gadEcsb5GDQCgjS3o5QZdv2xBm3Nr08lk4ifK
ziAAoIjIbNy8yoISNxIxMA5+V+SYoDck
=ln3g
-----END PGP SIGNATURE-----
Chris Samuel
2018-09-11 05:33:03 UTC
Permalink
Post by Ryan Novosielski
So we’ve learned what, here, that RedHat doesn’t test the RDMA stack at all?
It certainly does seem to be the case. Unlike other issues I've hit in the
past with bugs introduced in the IB stack in 6.x -> 6.y transitions, where
they would have needed more hardware than you could reasonably expect them to
have in order to spot the bug, this is a pretty fundamental failure.
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC



Michael Di Domenico
2018-09-11 11:49:53 UTC
Permalink
meant to send this to the list not just ryan

---------- Forwarded message ---------
From: Michael Di Domenico <***@gmail.com>
Date: Tue, Sep 11, 2018 at 7:48 AM
Subject: Re: [Beowulf] RHEL7 kernel update for L1TF vulnerability breaks RDMA
Post by Ryan Novosielski
Post by Chris Samuel
Post by Peter St. John
So I guess I have to forgive the software engineer who fat-fingered that
semicolon. Of course I've done worse.
Oh yes, same here too! There but for... and all that. :-)
So we’ve learned what, here, that RedHat doesn’t test the RDMA stack at all?
i'm not sure the test path was that simple. I updated my machines to
the rhel 862.11.6 kernel. i was able to use IPoIB but was not able to use
Lustre. so while the ultimate fix was a semi-colon, the failure code
path i believe is/was more complicated.

what this tells me is that there's probably more bugs in the IB stack
than anyone thinks
John Hearns via Beowulf
2018-09-11 13:12:06 UTC
Permalink
That raises a good point. Someone was asking recently about RDMA
availability on AWS.
Other storage setups, e.g. Spectrum Scale, operate over RDMA.
What is a sure-fire test for 'is RDMA working'?
I bet there is a simple utility and I will kick myself.. but worth asking.
On Tue, 11 Sep 2018 at 12:50, Michael Di Domenico
Post by Michael Di Domenico
meant to send this to the list not just ryan
---------- Forwarded message ---------
Date: Tue, Sep 11, 2018 at 7:48 AM
Subject: Re: [Beowulf] RHEL7 kernel update for L1TF vulnerability breaks RDMA
Post by Ryan Novosielski
Post by Chris Samuel
Post by Peter St. John
So I guess I have to forgive the software engineer who fat-fingered that
semicolon. Of course I've done worse.
Oh yes, same here too! There but for... and all that. :-)
So we’ve learned what, here, that RedHat doesn’t test the RDMA stack at all?
i'm not sure the test path was that simple. I updated my machines to
the rhel 11.6 kernel. i was able to use IPoIB but was not able to use
Lustre. so while the ultimate fix was a semi-colon, the failure code
path i believe is/was more complicated.
what this tells me is that there's probably more bugs in the IB stack
than anyone thinks
Peter Kjellström
2018-09-11 12:35:20 UTC
Permalink
On Tue, 11 Sep 2018 14:12:06 +0100
Post by John Hearns via Beowulf
That raises a good point. Someone was asking recently about RDMA
availability on AWS.
Other storage setups, e.g. Spectrum Scale, operate over RDMA.
What is a sure-fire test for 'is RDMA working'?
I bet there is a simple utility and I will kick myself.. but worth
One of the simpler ways that includes an actual connection (needed to
hit this bug) is ib_write_bw. It needs to be run between two hosts
though.

To be specific, this tests (part of) verbs. Applications may depend on
other upper-level protocols, so YMMV...

/Peter
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Peter Kjellström
2018-09-11 12:32:07 UTC
Permalink
On Mon, 10 Sep 2018 23:17:21 +0000
Ryan Novosielski <***@rutgers.edu> wrote:
...
So we’ve learned what, here, that RedHat doesn’t test the RDMA stack
at all?
This we knew already, since the last time they completely destroyed it with
an update (and took more than a month to fix).

/Peter
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Peter St. John
2018-09-11 16:37:18 UTC
Permalink
A friend at RH (who works in a different area) tells me RH does not
themselves test the downstream CentOS.
Peter
Post by Peter Kjellström
On Mon, 10 Sep 2018 23:17:21 +0000
...
So we’ve learned what, here, that RedHat doesn’t test the RDMA stack
at all?
This we knew already since last time they completely destroyed it with
an update (and took >month to fix).
/Peter
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Peter Kjellström
2018-09-11 16:50:14 UTC
Permalink
On Tue, 11 Sep 2018 12:37:18 -0400
Post by Peter St. John
A friend at RH (who works in a different area) tells me RH does not
themselves test the downstream CentOS.
Peter
That isn't surprising is it? But in this case we're talking about them
not testing their own product.. :-D

/Peter K
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Peter St. John
2018-09-11 17:01:24 UTC
Permalink
I mean the RH QA that tests RH products isn't the same team as tests (or
not) CentOS, but I only know from the wiki that RH has an expanding
agreement with CentOS so maybe this is all merging. As I said, my buddy
doesn't work in this area, and I sure don't. Probably all you guys are more
up to date on the merging than either of us.

Peter
Post by Peter Kjellström
On Tue, 11 Sep 2018 12:37:18 -0400
Post by Peter St. John
A friend at RH (who works in a different area) tells me RH does not
themselves test the downstream CentOS.
Peter
That isn't surprising is it? But in this case we're talking about them
not testing their own product.. :-D
/Peter K
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Peter St. John
2018-09-11 17:02:27 UTC
Permalink
I mean the RH QA that tests RH products isn't the same team as tests (or
not) CentOS, but I only know from the wiki that RH has an expanding
agreement with CentOS so maybe this is all merging. As I said, my buddy
doesn't work in this area, and I sure don't. Probably all you guys are more
up to date on the merging than either of us.
Post by Peter Kjellström
On Tue, 11 Sep 2018 12:37:18 -0400
Post by Peter St. John
A friend at RH (who works in a different area) tells me RH does not
themselves test the downstream CentOS.
Peter
That isn't surprising is it? But in this case we're talking about them
not testing their own product.. :-D
/Peter K
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
John Hearns via Beowulf
2018-09-12 17:02:06 UTC
Permalink
Regarding CentOS, Karanbir Singh is the leader of the project and has
a job at Red Hat:
https://www.linuxfoundation.org/blog/2014/01/centos-project-leader-karanbir-singh-opens-up-on-red-hat-deal/
I mean the RH QA that tests RH products isn't the same team as tests (or not) CentOS, but I only know from the wiki that RH has an expanding agreement with CentOS so may be this is all merging. As I said, my buddy doesn't work in this area, and I sure don't. Probably all you guys are more up to date on the merging than either of us.
Post by Peter Kjellström
On Tue, 11 Sep 2018 12:37:18 -0400
Post by Peter St. John
A friend at RH (who works in a different area) tells me RH does not
themselves test the downstream CentOS.
Peter
That isn't surprising is it? But in this case we're talking about them
not testing their own product.. :-D
/Peter K
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Jörg Saßmannshausen
2018-10-02 21:19:58 UTC
Permalink
Dear all,

is there some kind of quick test to demonstrate whether or not the patch
causes a problem with RDMA? I have been asked to look into that but I don't
really want to use a large cp2k calculation which, I believe, makes use of
RDMA.

All the best from London

Jörg
Post by John Hearns via Beowulf
Regarding CentOS, Karanbir Singh is the leader of the project and has
a job at Redhat
https://www.linuxfoundation.org/blog/2014/01/centos-project-leader-karanbir-singh-opens-up-on-red-hat-deal/
Post by Peter St. John
I mean the RH QA that tests RH products isn't the same team as tests (or
not) CentOS, but I only know from the wiki that RH has an expanding
agreement with CentOS so may be this is all merging. As I said, my buddy
doesn't work in this area, and I sure don't. Probably all you guys are
more up to date on the merging than either of us.
Post by Peter Kjellström
On Tue, 11 Sep 2018 12:37:18 -0400
Post by Peter St. John
A friend at RH (who works in a different area) tells me RH does not
themselves test the downstream CentOS.
Peter
That isn't surprising is it? But in this case we're talking about them
not testing their own product.. :-D
/Peter K
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Ade Fewings
2018-10-02 21:33:09 UTC
Permalink
Hello from Wales

Red Hat quoted just a simple ib_write_bw test as indicating the broken state of IB RDMA (https://access.redhat.com/solutions/3568891):

Run a RDMA write bandwidth test. ib_write_bw is provided by the package perftest.

On target node run :
# ib_write_bw

On client side run :
# ib_write_bw <target-IP>

The test should fail (on an affected kernel).

Hope that helps
Ade



-----Original Message-----
From: Beowulf <beowulf-***@beowulf.org> On Behalf Of Jörg Saßmannshausen
Sent: 02 October 2018 22:20
To: ***@beowulf.org
Subject: Re: [Beowulf] RHEL7 kernel update for L1TF vulnerability breaks RDMA

Dear all,

is there some kind of quick test to demonstrate the patch does have or does not cause a problem with RDMA? I have been asked to look into that but I don't really want to use a large cp2k calculation which, I believe, makes use of RDMA.

All the best from London

Jörg
Post by John Hearns via Beowulf
Regarding CentOS, Karanbir Singh is the leader of the project and has
a job at Redhat
https://www.linuxfoundation.org/blog/2014/01/centos-project-leader-karanbir-singh-opens-up-on-red-hat-deal/
On Tue, 11 Sep 2018 at 18:03,
Post by Peter St. John
I mean the RH QA that tests RH products isn't the same team as tests (or
not) CentOS, but I only know from the wiki that RH has an expanding
agreement with CentOS so may be this is all merging. As I said, my
buddy doesn't work in this area, and I sure don't. Probably all you
guys are more up to date on the merging than either of us.> On Tue,
Post by Peter Kjellström
On Tue, 11 Sep 2018 12:37:18 -0400
Post by Peter St. John
A friend at RH (who works in a different area) tells me RH does
not themselves test the downstream CentOS.
Peter
That isn't surprising is it? But in this case we're talking about
them not testing their own product.. :-D
/Peter K
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
_______________________________________________
Computing To change your subscription (digest mode or unsubscribe)
visit http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Computing To change your subscription (digest mode or unsubscribe)
visit http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Beowulf mailing list, ***@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
________________________________

[HPC Wales - www.hpcwales.co.uk] <http://www.hpcwales.co.uk>

________________________________

The contents of this email and any files transmitted with it are confidential and intended solely for the named addressee only. Unless you are the named addressee (or authorised to receive this on their behalf) you may not copy it or use it, or disclose it to anyone else. If you have received this email in error, please notify the sender by email or telephone. All emails sent by High Performance Computing Wales have been checked using an Anti-Virus system. We would advise you to run your own virus check before opening any attachments received as we will not in any event accept any liability whatsoever, once an email and/or attachment is received.

High Performance Computing Wales is a private limited company incorporated in Wales on 8 March 2010 as company number 07181701.

Our registered office is at Finance Office, Bangor University, Cae Derwen, College Road, Bangor, Gwynedd. LL57 2DG. UK.

High Performance Computing Wales is part funded by the European Regional Development Fund through the Welsh Government.
Jörg Saßmannshausen
2018-10-02 22:09:17 UTC
Permalink
Hi Ade,

thanks for this. I will give it a spin.

So far I have only done a simple ping-pong test; I have never done an RDMA test.

All the best and thanks!

Jörg
Post by Ade Fewings
Hello from Wales
Red Hat quoted just a simple ib_write_bw test as indicating the broken state:
Run an RDMA write bandwidth test. ib_write_bw is provided by the perftest package.
# ib_write_bw
# ib_write_bw <target-IP>
The test should fail.
Hope that helps
Ade
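For anyone wanting to script the check quoted above, here is a minimal sketch rather than anything Red Hat supplied. It assumes the server side (plain "ib_write_bw" with no arguments) is already running on the target host, and that ib_write_bw exits non-zero when the test fails. The function name rdma_check and the IB_WRITE_BW override variable are made up for illustration.

```shell
#!/bin/sh
# Sketch: run the client side of the ib_write_bw check and report the result.
# Assumption: the server side ("ib_write_bw" with no arguments) is already
# running on the target host.
# Assumption: a non-zero exit status from ib_write_bw means the test failed.

# Hypothetical override so a different binary (or path) can be substituted.
IB_WRITE_BW="${IB_WRITE_BW:-ib_write_bw}"

# rdma_check <target-IP>
# Returns 0 if the bandwidth test ran, 1 if it failed, 2 on a usage error.
rdma_check() {
    target="$1"
    if [ -z "$target" ]; then
        echo "usage: rdma_check <target-IP>" >&2
        return 2
    fi
    # Discard the bandwidth table; only the exit status is used here.
    if "$IB_WRITE_BW" "$target" >/dev/null 2>&1; then
        echo "RDMA write test to $target succeeded on kernel $(uname -r)"
        return 0
    else
        echo "RDMA write test to $target FAILED on kernel $(uname -r)"
        return 1
    fi
}
```

Source the file on the client node and call rdma_check <target-IP> once under the old kernel and once under the updated one. Printing the running kernel version next to the result makes it easy to see which boot the failure belongs to.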
-----Original Message-----
Sent: 02 October 2018 22:20
Subject: Re: [Beowulf] RHEL7 kernel update for L1TF vulnerability breaks RDMA
Dear all,
Is there some kind of quick test to demonstrate whether or not the patch
causes a problem with RDMA? I have been asked to look into that, but I
don't really want to use a large cp2k calculation which, I believe, makes
use of RDMA.
All the best from London
Jörg
Post by John Hearns via Beowulf
Regarding CentOS, Karanbir Singh is the leader of the project and has
a job at Redhat
https://www.linuxfoundation.org/blog/2014/01/centos-project-leader-karanbir-singh-opens-up-on-red-hat-deal/
On Tue, 11 Sep 2018 at 18:03,
Post by Peter St. John
I mean the RH QA that tests RH products isn't the same team as tests (or
not) CentOS, but I only know from the wiki that RH has an expanding
agreement with CentOS so maybe this is all merging. As I said, my
buddy doesn't work in this area, and I sure don't. Probably all you
guys are more up to date on the merging than either of us.
Post by Peter Kjellström
On Tue, 11 Sep 2018 12:37:18 -0400
Post by Peter St. John
A friend at RH (who works in a different area) tells me RH does
not themselves test the downstream CentOS.
Peter
That isn't surprising is it? But in this case we're talking about
them not testing their own product.. :-D
/Peter K
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.