Discussion:
[Beowulf] Supercomputing comes to the Daily Mail
John Hearns via Beowulf
2017-08-14 08:30:54 UTC
Permalink
The Daily Mail is (shall we say) a rather right-wing daily newspaper in the
UK. It may give some flavour if I tell you that its most famous/infamous
headline is "Hurrah for the Blackshirts" (1934).

A surprisingly good article on using HPC and a visualisation wall to model
ocean currents.
http://www.dailymail.co.uk/sciencetech/article-4775872/NASA-supercomputer-simulation-reveals-ocean-current-motion.html

I would not delve into the Comments section though...
I believe Mr T from the A Team is commenting here:
"Learn physics fool. If you are the religious type then god created the
laws of physics, so must be true; if not then the laws of physics describe
what we observe so therefore must be true. Either way they are true. Learn
them!"

Hmmm... perhaps this person has a big future in HPC user support. "It's your
bug, fool!"

Read more:
http://www.dailymail.co.uk/sciencetech/article-4775872/NASA-supercomputer-simulation-reveals-ocean-current-motion.html#ixzz4piTozmPo
Jeffrey Layton
2017-08-14 17:12:15 UTC
Permalink
A friend of mine, Mark Fernandez, is the lead engineer on this project. He
works for SGI (now HPE). They are putting two servers onto the ISS and are
going to be running tests for a while. I don't know too many details except
this. Oh! I do know they won't give you SSH access to the servers (already
asked).

I'm guessing they are gathering data on the radiation impact on the memory
of the system (cache and all), to see what happens. Probably checking the
health of the system too. Maybe when it comes back to Earth they will test
it again and then pull it apart to look for changes.

Jeff


Post by John Hearns via Beowulf
[snip]
John Hearns via Beowulf
2017-08-14 18:37:02 UTC
Permalink
I would be a bit more concerned about radiation doses to the personnel!
This is well studied of course, and I believe DNA has its own ECC codes.

Spiralling dangerously off subject: if I am not wrong, Concorde crews were
issued with radiation film badges.
Post by Jeffrey Layton
[snip]
Christopher Samuel
2017-08-15 01:05:15 UTC
Permalink
Post by Jeffrey Layton
A friend of mine, Mark Fernandez, is the lead engineer on this
project. He works for SGI (now HPE). They are putting two servers
onto the ISS and are going to be running tests for a while. I don't
know too many details except this.
Ars Technica had more on this last weekend, which I tweeted.

https://arstechnica.com/science/2017/08/spacex-is-launching-a-supercomputer-to-the-international-space-station/

Two 1TF systems, one to go to the ISS and one to remain on
the ground as a control system, both running the same code.

# For the year-long experiment, astronauts will install the computer
# inside a rack in the Destiny module of the space station. It is
# about the size of two pizza boxes stuck together. And while the
# device is not exactly a state-of-the-art supercomputer—it has a
# computing speed of about 1 teraflop—it is the most powerful computer
# sent into space. Unlike most computers, it has not been hardened for
# the radiation environment aboard the space station. The goal is to
# better understand how the space environment will degrade the
# performance of an off-the-shelf computer.
#
# During the next year, the spaceborne computer will continuously run
# through a set of computing benchmarks to determine its performance
# over time. Meanwhile, on the ground, an identical copy of the
# computer will run in a lab as a control.

No details on the actual systems there though.

cheers,
Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545

Lux, Jim (337C)
2017-08-15 01:53:43 UTC
Permalink
And when it comes to space, ISS is a pretty benign environment - shirt
sleeves, low radiation, etc.
You want hostile, go to Europa - 1 Rad/second kinds of dose plus LOTS of
high energy particles from the interaction with Jupiter.
That sort of brownish spot on Europa's ice? That's a radiation burn.
https://www.space.com/13624-photos-europa-mysterious-moon-jupiter.html



James Lux, P.E.
Task Manager, DHFR Space Testbed
Jet Propulsion Laboratory
4800 Oak Grove Drive, MS 161-213
Pasadena CA 91109
+1(818)354-2075
+1(818)395-2714 (cell)






Post by Christopher Samuel
[snip]
Faraz Hussain
2017-08-17 16:00:27 UTC
Permalink
I noticed an MPI job was taking 5X longer to run whenever it got the
compute node lusytp104. So I ran qperf and found the bandwidth
between it and any other node was ~100 MB/sec. This is much lower than
the ~1 GB/sec between all the other nodes. Any tips on how to debug
further? I haven't tried rebooting since it is currently running a
single-node job.

[***@lusytp114 ~]$ qperf lusytp104 tcp_lat tcp_bw
tcp_lat:
latency = 17.4 us
tcp_bw:
bw = 118 MB/sec
[***@lusytp114 ~]$ qperf lusytp113 tcp_lat tcp_bw
tcp_lat:
latency = 20.4 us
tcp_bw:
bw = 1.07 GB/sec

This is a separate issue from my previous post about a slow compute
node. I am still investigating that per the helpful replies. Will post
an update about that once I find the root cause!

John Hearns via Beowulf
2017-08-17 16:28:19 UTC
Permalink
Faraz,
I really suggest you examine the Intel Cluster Checker.
I guess that you cannot take down a production cluster for an entire
Cluster Checker run, however these are the types of faults which ICC is
designed to find. You can define a small set of compute nodes to run on,
including this node, and maybe run ICC on them?

As for the diagnosis, run ethtool <interface name>, where that
is the name of your ethernet interface.
Compare that with the output of ethtool on a properly working compute node.
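
For example, something along these lines (eth0 is only a placeholder for
whatever the interface is actually called on your nodes):

ethtool eth0 | grep -E 'Speed|Duplex|Link detected'
ethtool -S eth0 | grep -iE 'err|drop'    # counter names vary by driver

If the suspect node reports a lower Speed, half duplex, or climbing error
counters compared with a good node, that points at the cable, the switch
port, or the NIC itself.
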
Post by Faraz Hussain
[snip]
Joe Landman
2017-08-17 16:35:47 UTC
Permalink
Post by Faraz Hussain
I noticed an mpi job was taking 5X longer to run whenever it got the
compute node lusytp104 . So I ran qperf and found the bandwidth
between it and any other nodes was ~100MB/sec. This is much lower than
~1GB/sec between all the other nodes. Any tips on how to debug
further? I haven't tried rebooting since it is currently running a
single-node job.
latency = 17.4 us
bw = 118 MB/sec
latency = 20.4 us
bw = 1.07 GB/sec
This is separate issue from my previous post about a slow compute
node. I am still investigating that per the helpful replies. Will post
an update about that once I find the root cause!
Sounds very much like it is running over gigabit ethernet vs
Infiniband. Check to make sure it is using the right network ...
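
If that node is supposed to have an Infiniband interface, a quick sanity
check (assuming the usual OFED tools are installed; ib0 is only an example
interface name) would be something like:

ibstat                  # port State should be Active, Rate as expected
ip addr show ib0        # is the IPoIB interface up and addressed?

and then compare against a known-good node.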
--
Joe Landman
e: ***@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

Scott Atchley
2017-08-17 18:02:19 UTC
Permalink
I would agree that the bandwidth points at 1 GigE in this case.

For IB/OPA cards running slower than expected, I would recommend ensuring
that they are using the correct number of PCIe lanes.
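
A quick way to check that (run as root; the PCI address below is only an
example, find your HCA's address with plain lspci first):

lspci | grep -i mellanox      # or whatever your card vendor is
lspci -s 81:00.0 -vv | grep -iE 'LnkCap|LnkSta'

LnkSta should report the same width and speed as LnkCap (e.g. x8 at 8GT/s);
a card that has trained down to x1 or x4 will show up right there.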
Post by Faraz Hussain
[snip]
Joe Landman
2017-08-17 18:10:23 UTC
Permalink
Post by Scott Atchley
I would agree that the bandwidth points at 1 GigE in this case.
For IB/OPA cards running slower than expected, I would recommend
ensuring that they are using the correct number of PCIe lanes.
Turns out, there is a really nice open source tool that does this for
you ...

https://github.com/joelandman/pcilist

:D
Post by Scott Atchley
[snip]
--
Joe Landman
e: ***@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

Gus Correa
2017-08-17 18:40:22 UTC
Permalink
Post by Joe Landman
Post by Faraz Hussain
[snip]
Hi Faraz

As others have said answering your previous posting about Infiniband:

- Check if the node is configured the same way as the other nodes,
in the case of Infiniband, if the MTU is the same,
using connected or datagram mode, etc.

**

Besides, for Open MPI you can force it at runtime not to use tcp:
--mca btl ^tcp
or with the syntax in this FAQ:
https://www.open-mpi.org/faq/?category=openfabrics#ib-btl

If that node has an Infiniband interface with a problem,
this should at least give a clue.
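
For example (np, the hostfile and ./your_app are placeholders; the exact
options depend on your Open MPI version and launcher):

mpirun --mca btl ^tcp -np 16 -hostfile hosts ./your_app

If that run fails or complains only when lusytp104 is in the hostfile, it
is a strong hint the Infiniband side of that node is unhappy.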

**

In addition, check the limits in the node.
That may be set by your resource manager,
or in /etc/security/limits.conf
or perhaps in the actual job script.
The memlock limit is key to Open MPI over Infiniband.
See FAQ 15, 16, 17 here:
https://www.open-mpi.org/faq/?category=openfabrics
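
A quick check, ideally from inside a job on that node so you see the limits
your resource manager actually imposes:

ulimit -l                              # usually "unlimited" for IB
grep memlock /etc/security/limits.conf

The usual entries look something like this (illustrative only):

* soft memlock unlimited
* hard memlock unlimited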

**

Moreover, check if the mlx4_core.conf (assuming it is Mellanox HW)
is configured the same way across the nodes:

/etc/modprobe.d/mlx4_core.conf

See FAQ 18 here:
https://www.open-mpi.org/faq/?category=openfabrics
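
For reference, that file typically carries a single options line along these
lines (the values are only illustrative; see the FAQ for how to size them
for your RAM):

options mlx4_core log_num_mtt=20 log_mtts_per_seg=4

If the suspect node has different values from the rest, that alone is worth
fixing.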

**

To increase the btl diagnostic verbosity (that goes to STDERR, IIRC):

--mca btl_base_verbose 30

That may point out which interfaces are actually being used, etc.

See this FAQ:

https://www.open-mpi.org/faq/?category=all#diagnose-multi-host-problems
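
For example, a small two-node run against the suspect node (./your_app
being whatever little MPI test you have handy):

mpirun --mca btl_base_verbose 30 -np 2 -host lusytp104,lusytp113 ./your_app 2> btl.log

btl.log should then show which BTLs and which interfaces each rank actually
selected.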

**

Finally, as John has suggested before, you may want to
subscribe to the Open MPI mailing list,
and ask the question there as well:

https://www.open-mpi.org/community/help/
https://www.open-mpi.org/community/lists/

There you will get feedback from the Open MPI developers +
user community, and that often includes insights from
Intel and Mellanox IB hardware experts.

**

I hope this helps.

Gus Correa
John Hearns via Beowulf
2017-08-18 08:28:18 UTC
Permalink
Joe - Leela? I did not know you were a Dr Who fan.

Faraz, you really should log into your switch and look at the configuration
of the ports.
Find the port to which that compute node is connected by listing the MAC
address table.
(If you are using Bright there is an easy way to do this).
Look at the port configuration - is it capped to a certain rate?
The next step is to bring the interface down then up, to see if it
renegotiates.
It probably won't, so then it is a trip to the data centre to reseat the
connection. (That is the posh phrase for pulling the cable out and sticking
it back in.)
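
Something like this, run from the console or the management network rather
than over the interface you are bouncing (eth0 again is only an example
name):

ip link set dev eth0 down && ip link set dev eth0 up
ethtool -r eth0                        # restart autonegotiation
ethtool eth0 | grep -E 'Speed|Duplex'

If it still comes back at the wrong speed, that is when the reseat (or a new
cable) comes in.
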
Post by Gus Correa
[snip]
Lux, Jim (337C)
2017-08-15 01:50:03 UTC
Permalink
Radiation effects sort of fall in three buckets:
1) gradual changes due to total dose - things like bias current or leakage current changes - ISS is in a low orbit and actually doesn’t see much dose; after all people live there. Maybe a Rad/year (0.01 gray). Opto isolators have problems with dose, but in general, in LEO, it’s a small effect.

2) non-destructive transient effects due to a high energy particle: solar protons and galactic cosmic rays for the most part - this is what causes “bit flips” for instance. You’ll see the term SEFI (Single Event Functional Interrupt) too - that’s where the processor or something locks up or spontaneously resets. Watchdogs and ECC are your friend.

3) destructive transient effects - This is the killer for most “consumer” equipment - something has a destructive latch up behavior triggered by the particle. Or the gate on a FET ruptures because the combination of the gate voltage plus the extra charge from the particle is just a bit too high. This is fairly common in switching power supplies, especially ones designed for cost - if you’re switching 120V, why buy 500V FETs?

You’ll also get a “not so destructive but permanent” effect like a hot pixel in an imaging array.

They fly a fair amount of consumer gear on ISS and a lot of it gets tested in various accelerator facilities around the country, and they come up with a predicted life. If the life is something like 6 months, and the experiment only runs for 3, then you’re good to go.
They worry more about fire than radiation effects.


James Lux, P.E.
Task Manager, DHFR Space Testbed
Jet Propulsion Laboratory
4800 Oak Grove Drive, MS 161-213
Pasadena CA 91109
+1(818)354-2075
+1(818)395-2714 (cell)


From: Beowulf <beowulf-***@beowulf.org> on behalf of Jeffrey Layton <***@gmail.com>
Date: Monday, August 14, 2017 at 10:12 AM
To: John Hearns <***@googlemail.com>
Cc: "***@beowulf.org" <***@beowulf.org>
Subject: Re: [Beowulf] Supercomputing comes to the Daily Mail

[snip]