Discussion:
[Beowulf] Lustre Upgrades
Paul Edmon
2018-07-23 15:59:31 UTC
We have some old large-scale Lustre installs that are running 2.6.34 and
we want to get these up to the latest version of Lustre.  I was curious
whether people in this group have any experience with doing this and, if
so, whether they could share it.  How do you handle upgrades like this?
How much time do they take?  What are the pitfalls?  How do you manage
them with minimal customer interruption? Or should we just write off
upgrading and stand up new servers on the correct version (in which case
we would need to transfer several PBs' worth of data over to the new system)?

Thanks for your wisdom.

-Paul Edmon-

Michael Di Domenico
2018-07-23 16:51:04 UTC
Post by Paul Edmon
Should we just write off upgrading and stand up new servers on the
correct version (in which case we would need to transfer several PBs'
worth of data over to the new system)?
If you can afford the hardware and the time for the copy, this would
certainly be the best option... :)

I've always done it that way as well. Lustre can be a scary upgrade,
and I've generally found that by the time I'm ready to update the
machines the hardware has been abused for two or three years anyhow, so
swapping out the hardware that's supporting a filesystem has generally
seemed like a good thing to do, though certainly not a necessity.

My understanding (I haven't done it yet) is that with later versions of
Lustre (>2.5) on ZFS, upgrades have become more of a routine thing with
much less concern.
Jeff Johnson
2018-07-23 17:00:36 UTC
Paul,

2.6.34 is a kernel version. What version of Lustre are you at now? Some
updates are easier than others.

--Jeff
--
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing

***@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001 f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite C - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage
Paul Edmon
2018-07-23 17:05:21 UTC
My apologies, I meant 2.5.34, not 2.6.34.  We'd like to get up to 2.10.4,
which is what our clients are running.  Recently we upgraded our cluster
to CentOS 7, which necessitated the client upgrade.  Our storage servers,
though, stayed behind on 2.5.34.

-Paul Edmon-
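
(For anyone following along: before planning a jump like this it helps to
confirm what every node actually reports. A minimal sketch, assuming
passwordless SSH and that lctl is installed on each host; the hostnames
are placeholders, not the real machines in this thread.)

#!/usr/bin/env python3
"""Survey which Lustre version each server and client reports."""
import subprocess

HOSTS = ["mds01", "oss01", "oss02", "client01"]  # placeholder names

def lustre_version(host):
    # `lctl get_param -n version` works on 2.x; fall back to /proc for
    # very old releases.
    remote = "lctl get_param -n version 2>/dev/null || cat /proc/fs/lustre/version"
    result = subprocess.run(["ssh", host, remote],
                            capture_output=True, text=True, timeout=30)
    return result.stdout.strip() or result.stderr.strip()

if __name__ == "__main__":
    for h in HOSTS:
        print(f"{h}: {lustre_version(h)}")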
Jeff Johnson
2018-07-23 17:18:20 UTC
You're running 2.10.4 clients against 2.5.34 servers? I believe there are
notable LNet attributes that don't exist in 2.5.34. Maybe a Whamcloud wiz
will chime in, but I think that version mismatch might be problematic.

To be conservative, you could do a testbed upgrade first, taking an
ldiskfs volume from 2.5.34 to 2.10.4.

--Jeff
Paul Edmon
2018-07-23 17:34:57 UTC
Yeah, we've found out firsthand that it's problematic, as we have been
seeing issues :).  Hence the urge to upgrade.

We've begun exploring this, but we wanted to reach out to other people
who may have gone through the same thing to get their thoughts.  We also
need to figure out how significant an outage this will be: if it takes a
day or two of full outage to do the upgrade, that is more acceptable
than a week.  We also wanted to know whether people had experienced data
loss/corruption in the process, and about any other kinks.

We were planning on playing around with VMs to test the upgrade path
before committing to upgrading our larger systems.  One of the questions
we had, though, was whether we need to run e2fsck before/after the
upgrade, as that could add significant time to the outage.

-Paul Edmon-
Jeff Johnson
2018-07-23 17:58:20 UTC
Paul,

How big are your ldiskfs volumes? What type of underlying hardware are
they on? Running e2fsck (the ldiskfs-aware version) is wise and can be
done in parallel. It could finish within a couple of days; the time all
depends on the size and the underlying hardware.

Going from 2.5.34 to 2.10.4 is a significant jump. I would make sure
there isn't an intermediate step upgrade advised. I know there have been
step upgrades in the past; I'm not sure about going between these two versions.

--Jeff
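
(A minimal sketch of the parallel read-only check described above,
assuming the Lustre-patched e2fsprogs is installed and the targets are
unmounted; the device paths are placeholders.)

#!/usr/bin/env python3
"""Run read-only e2fsck passes over several OST devices in parallel."""
import subprocess
from concurrent.futures import ThreadPoolExecutor

OST_DEVICES = ["/dev/mapper/ost0000", "/dev/mapper/ost0001"]  # placeholders

def check(dev):
    # -f forces a full check; -n opens the device read-only and answers
    # "no" to every repair prompt, so this pass changes nothing on disk.
    result = subprocess.run(["e2fsck", "-fn", dev],
                            capture_output=True, text=True)
    return dev, result.returncode

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=len(OST_DEVICES)) as pool:
        for dev, rc in pool.map(check, OST_DEVICES):
            print(f"{dev}: e2fsck exit code {rc}")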
Paul Edmon
2018-07-23 18:11:40 UTC
Yeah, we've pinged Intel/Whamcloud about upgrade paths, as we wanted to
know what the recommended procedure is.

Sure. We have 3 systems that we want to upgrade: 1 that is 1 PB and 2
that are 5 PB each.  I will just give you a description of one and
assume that everything scales linearly with size. They all have the
same hardware.

The head nodes are Dell R620s, while the shelves are M3420 (MDS) and
M3260 (OSS).  The MDT is 2.2T with 466G and 268M inodes used.  Each
OST is 30T, with each OSS hosting 6.  The filesystem itself is 93% full.

-Paul Edmon-
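
(Rough numbers only, not from the thread: the figures above imply
roughly the following geometry for one of the 5 PB systems; the OSS
count is an inference, since it is not stated.)

# Back-of-envelope geometry for one 5 PB system, derived from the
# figures above; the OSS count is an inference, not a stated number.
OST_SIZE_TB = 30
OSTS_PER_OSS = 6
FS_SIZE_PB = 5
FULL_FRACTION = 0.93

total_tb = FS_SIZE_PB * 1000                  # decimal units, rough math
n_osts = round(total_tb / OST_SIZE_TB)        # ~167 OSTs
n_oss = round(n_osts / OSTS_PER_OSS)          # ~28 OSS nodes
occupied_pb = FS_SIZE_PB * FULL_FRACTION      # ~4.65 PB of live data

print(f"~{n_osts} OSTs across ~{n_oss} OSS nodes, ~{occupied_pb:.2f} PB occupied")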
Jörg Saßmannshausen
2018-07-24 08:52:44 UTC
Hi Paul,

with a file system that is 93% full, in my humble opinion it would make
sense to increase the underlying hardware capacity as well. The reasoning
is that usually, over time, there will be more data on any given file
system, so if there is already a downtime, I would increase its size at
the same time. I would rather have a somewhat longer downtime and end up
with both a new version of Lustre (of which I know little) and more
capacity that will last longer, than only upgrade Lustre and then run out
of disc capacity a bit later.
It also means that in your case you could simply install the new system,
test it, and then migrate the data over. Depending on how it is set up,
you could even do that in stages.
As you mentioned 3 different Lustre systems, you could for example start
with the biggest one and use new hardware there. The freed capacity of
the now obsolete hardware could then be used for the other systems.
Of course, I don't know your hardware etc.

Just some ideas from a hot London 8-)

Jörg
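
(One hedged sketch of the staged migration suggested above: pre-copy the
bulk of the data directory by directory with rsync while the old
filesystem is still in service, then do a short final pass at cut-over.
The mount points are placeholders, and rsync does not recreate Lustre
stripe layouts, so default striping should be set on the destination
directories beforehand.)

#!/usr/bin/env python3
"""Staged copy from an old Lustre mount to a new one, a few
top-level directories at a time."""
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

OLD = Path("/mnt/lustre_old")   # placeholder
NEW = Path("/mnt/lustre_new")   # placeholder

def sync(subdir):
    # -a preserves ownership/permissions/times, -H keeps hard links,
    # -X copies user extended attributes; --delete keeps the copy exact
    # so only a short final pass is needed during the cut-over window.
    cmd = ["rsync", "-aHX", "--delete", f"{subdir}/", str(NEW / subdir.name)]
    return subdir.name, subprocess.run(cmd).returncode

if __name__ == "__main__":
    tops = sorted(p for p in OLD.iterdir() if p.is_dir())
    with ThreadPoolExecutor(max_workers=4) as pool:
        for name, rc in pool.map(sync, tops):
            print(f"{name}: rsync exit {rc}")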
Paul Edmon
2018-07-24 14:19:18 UTC
Yeah, that's my preferred solution, as the hardware we have is nearing
end of life.  In that case, though, we would have to coordinate the
cut-over of the data to the new storage and forklift all those PBs over
to the new system, which brings its own unique challenges.  Plus you
also have to have the budget to buy the new hardware.

Right now we are just exploring our options.

-Paul Edmon-
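
(A back-of-envelope estimate of the copy window for the forklift option;
the sustained throughput figures below are assumptions, not measurements
from these systems.)

# Rough copy-window estimate for the forklift option.
occupied_tb = 5 * 0.93 * 1000          # ~4650 TB of live data on a 5 PB system

for gb_per_s in (5, 10, 20):           # assumed aggregate sustained rates
    seconds = occupied_tb * 1e12 / (gb_per_s * 1e9)
    print(f"{gb_per_s:>3} GB/s sustained -> ~{seconds / 86400:.1f} days")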
John Hearns via Beowulf
2018-07-24 14:31:00 UTC
Forgive me for saying this, but the philosophy of software-defined
storage such as Ceph and Gluster is that forklift-style upgrades should
not be necessary.
When a storage server is to be retired, the data is copied onto the new
server and the old one taken out of service. Well, copied is not the
correct word, as there are erasure-coded copies of the data. Rebalanced
is probably a better word.

Sorry if I am seeming to be a smartarse. I have gone through the pain of
forklift-style upgrades in the past when storage arrays reach end of life.
I just really like the software-defined storage mantra: no component
should be a single point of failure.
Paul Edmon
2018-07-24 14:40:55 UTC
While I agree with you in principle, one also has to deal with the
reality one finds oneself in.  In our case we have more experience
with Lustre than Ceph in an HPC setting, and we got burned pretty badly
by Gluster.  While I like Ceph in principle, I haven't seen it do what
Lustre can do in an HPC setting over IB.  Now, it may be able to do that,
which would be great.  However, you then have to get your system set up
to do that and prove that it can.  After all, users have a funny way of
breaking things that work amazingly well in controlled test environments,
especially when you have no control over how they will actually use the
system (as in a research environment).  Certainly we are working on
exploring this option too, as it would be awesome and save many headaches.

Anyways, no worries about you being a smartarse, it is a valid point.
One just needs to consider the realities on the ground in one's own
environment.

-Paul Edmon-
John Hearns via Beowulf
2018-07-24 15:02:43 UTC
Paul, thanks for the reply.
I would like to ask, if I may. I rather like Gluster, but have not
deployed it in HPC. I have heard a few people comment about Gluster not
working well in HPC. Would you be willing to be more specific?

One research site I talked to did the classic 'converged infrastructure'
idea of attaching storage drives to their compute nodes and distributing
Gluster storage. They were not happy with that, I was told, and I can
very much understand why. But I would be interested to hear about Gluster
on dedicated servers.
Paul Edmon
2018-07-24 15:56:14 UTC
This was several years back, so the current version of Gluster may be in
better shape.  We tried to use it for our primary storage but ran into
scalability problems.  That was especially the case when it came to
healing bricks and doing replication; it just didn't scale well.
Eventually we abandoned it for NFS and Lustre: NFS for deep storage and
Lustre for performance.  We also tried it for hosting VM images, which
worked pretty well, but we've since moved to Ceph for that.

Anyways, I have no idea about current Gluster in terms of scalability, so
the issues we ran into may not be a problem anymore.  However, it has
made us very gun-shy about trying Gluster again.  Instead we've decided
to use Ceph, as we've gained a bunch of experience with Ceph in our
OpenNebula installation.

-Paul Edmon-
John Hearns via Beowulf
2018-07-24 17:15:34 UTC
Thank you for a comprehensive reply.
Jörg Saßmannshausen
2018-07-26 07:14:54 UTC
Dear all,

I once had this idea as well: using the spinning discs I have in the
compute nodes as part of a distributed scratch space. I was using
GlusterFS for that, as I thought it might be a good idea. It was not. The
reason is that as soon as a job creates, say, 700 GB of scratch data (a
real job, not some fictional one!), the performance of the node hosting
part of that data approaches zero due to the high disc I/O. This meant
that the job running there was affected. So in the end this led to an
installation with a separate file server for the scratch space.
I should also add that this was a rather small setup of 8 nodes, and it
was a few years back.
The problem I found in computational chemistry is that some jobs require
either a large amount of memory, i.e. significantly more than the usual
2 GB per core, or a large amount of scratch space (if there is
insufficient memory). You are in trouble if a job requires both. :-)

All the best from a still hot London

Jörg
John Hearns via Beowulf
2018-07-26 07:53:35 UTC
Jörg,
you should look at BeeGFS and BeeOND (BeeGFS On Demand): https://www.beegfs.io/wiki/BeeOND

Jörg Saßmannshausen
2018-07-26 08:24:46 UTC
Hi John,

thanks. I should have said that this was one of the reasons I became
interested in BeeGFS; that experience was some years ago, and I believe
at the time I was not aware of BeeGFS.
In any case, that was at my old workplace, and at the current one we
don't have these demands on the hardware.

All the best

Jörg
Post by John Hearns via Beowulf
Jorg,
you should look at BeeGFS and BeeOnDemand https://www.beegfs.io/wiki/BeeOND
On Thu, 26 Jul 2018 at 09:15, Jörg Saßmannshausen <
Post by Jörg Saßmannshausen
Dear all,
I once had this idea as well: using the spinning discs which I have in the
compute nodes as part of a distributed scratch space. I was using glusterfs
for that as I thought it might be a good idea. It was not. The reason behind
it is that as soon as a job is creating say 700 GB of scratch data (real job
not some fictional one!), the performance of the node which is hosting part of
that data approaches zero due to the high disc IO. This meant that the job
which was running there was affected. So in the end this led to an
installation which got a separate file server for the scratch space.
I also should add that this was a rather small setup of 8 nodes and it was a
few years back.
The problem I found in computational chemistry is that some jobs require
either large amount of memory, i.e. significantly more than the usual 2 GB per
core, or large amount of scratch space (if there is insufficient memory). You
are in trouble if it requires both. :-)
All the best from a still hot London
Jörg
Post by John Hearns via Beowulf
Paul, thanks for the reply.
I would like to ask, if I may. I rather like Gluster, but have not deployed
it in HPC. I have heard a few people comment about Gluster not working well
in HPC. Would you be willing to be more specific?
One research site I talked to did the classic 'converged infrastructure'
idea of attaching storage drives to their compute nodes and distributing
Gluster storage. They were not happy with that, I was told, and I can very
much understand why. But Gluster on dedicated servers I would be interested
to hear about.
Post by Paul Edmon
While I agree with you in principle, one also has to deal with the reality
as you find yourself in. In our case we have more experience with Lustre
than Ceph in an HPC setting and we got burned pretty badly by Gluster. While I
like Ceph in principle I haven't seen it do what Lustre can do in an HPC
setting over IB. Now it may be able to do that, which is great. However
then you have to get your system set up to do that and prove that it can.
After all users have a funny way of breaking things that work amazingly
well in controlled test environs, especially when you have no control over how
they will actually use the system (as in a research environment).
Certainly we are working on exploring this option too as it would be
awesome and save many headaches.
Anyways no worries about you being a smartarse, it is a valid point. One
just needs to consider the realities on the ground in one's own environment.
-Paul Edmon-
Post by John Hearns via Beowulf
Forgive me for saying this, but the philosophy for software defined
storage such as CEPH and Gluster is that forklift style upgrades should
not be necessary.
When a storage server is to be retired the data is copied onto the new
server then the old one taken out of service. Well, copied is not the
correct word, as there are erasure-coded copies of the data. Rebalanced is
probably a better word.
Sorry if I am seeming to be a smartarse. I have gone through the pain of
forklift style upgrades in the past when storage arrays reach End of Life.
I just really like the Software Defined Storage mantra - no component
should be a point of failure.
James Burton
2018-07-26 14:50:33 UTC
Permalink
I have done some research on using local storage on the compute nodes as a
DFS, and I would agree that it is not as good an idea as it sounds. You
gain data locality, but you pay for this with compute node resources and
you still have the problem of having to copy the data in and out. To
summarize: It makes sense if you are network bound or have very high
performance storage (like an NVMe) as your local scratch. It doesn't make a
lot of sense if you have spinning disks on your local scratch. It might
make more sense for a compute intensive workload, but less sense for a
memory intensive one.

Overall GlusterFS provided low latency, but performance that was mediocre
at best and could get ugly depending on configuration and workload. BeeOND
is a really cool product, although the focus seems to be more on making it
easy to get a "quick-and-dirty" BeeGFS system running on the compute nodes
than on maximum performance.


On Thu, Jul 26, 2018 at 3:53 AM, John Hearns via Beowulf <
Post by John Hearns via Beowulf
Jorg,
you should look at BeeGFS and BeeOnDemand https://www.beegfs.io/wiki/BeeOND
[...]
--
James Burton
OS and Storage Architect
Advanced Computing Infrastructure
Clemson University Computing and Information Technology
340 Computer Court
Anderson, SC 29625
(864) 656-9047
John Hearns via Beowulf
2018-07-26 15:00:26 UTC
Permalink
Regarding NVMe storage, we have some pretty spiffing networks these days. I
would say - storage servers for storage. (Yeah, I just flagged up
BeeOnDemand earlier!)
Use NVMe 'drives' and use RDMA to send the data to them.
I have done this with GPFS - six all-NVMe storage servers over a 100Gbps
Infiniband network.

Also look at the fancy BlueField processors which Mellanox is putting into
the network
http://www.mellanox.com/page/products_dyn?product_family=256&mtag=soc_overview
I am a bit out of the loop regarding Mellanox products at the moment. If
anyone from Mellanox would like to give me access to play with one of these
things? Hint. Hint.
Post by James Burton
I have done some research on using local storage on the compute nodes as a
DFS, and I would agree that it is not as good an idea as it sounds. You
gain data locality, but you pay for this with compute node resources and
you still have the problem of having to copy the data in and out. To
summarize: It makes sense if you are network bound or have very high
performance storage (like an NVMe) as your local scratch. It doesn't make a
lot of sense if you have spinning disks on your local scratch. It might
make more sense for a compute intensive workload, but less sense for a
memory intensive one.
Overall GlusterFS provided low latency, but performance that was mediocre
at best and could get ugly depending on configuration and workload. BeeOND
is a really cool product, although the focus seems to be on more making it
easy to get "quick-and-dirty" BeeGFS system running on the compute nodes
than on maximum performance.
[...]
Joe Landman
2018-07-24 14:58:20 UTC
Permalink
Post by John Hearns via Beowulf
Forgive me for saying this, but the philosophy for software defined
storage such as CEPH and Gluster is that forklift style upgrades
should not be necessary.
When a storage server is to be retired the data is copied onto the new
server then the old one taken out of service. Well, copied is not the
correct word, as there are erasure-coded copies of the data.
Rebalanced is probably a better word.
This ^^

I'd seen/helped build/benchmarked some very nice/fast CephFS based
storage systems in $dayjob-1.  While it is a neat system, if you are
focused on availability, scalability, and performance, it's pretty hard
to beat BeeGFS.  We'd ($dayjob-1) deployed several very large/fast file
systems with it on our spinning rust, SSD, and NVMe units.
--
Joe Landman
e: ***@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

John Hearns via Beowulf
2018-07-24 15:06:10 UTC
Permalink
Joe, sorry to split the thread here. I like BeeGFS and have set it up.
I have worked for two companies now that have sites around the world, those
sites being independent research units, while the HPC facilities are at
headquarters.
The sites want to be able to drop files onto local storage yet have them
magically appear on the HPC storage, and the same with the results going back
the other way.

One company did this well with GPFS and AFM volumes.
For the current company, I looked at Gluster, and Gluster geo-replication is
one-way only.
What do you know of the BeeGFS mirroring? Will it work over long distances?
(Note to me - find out yourself you lazy besom)
Post by Joe Landman
Post by John Hearns via Beowulf
Forgive me for saying this, but the philosophy for software defined
storage such as CEPH and Gluster is that forklift style upgrades
should not be necessary.
When a storage server is to be retired the data is copied onto the new
server then the old one taken out of service. Well, copied is not the
correct word, as there are erasure-coded copies of the data.
Rebalanced is probably a better word.
This ^^
I'd seen/helped build/benchmarked some very nice/fast CephFS based
storage systems in $dayjob-1. While it is a neat system, if you are
focused on availability, scalability, and performance, it's pretty hard
to beat BeeGFS. We'd ($dayjob-1) deployed several very large/fast file
systems with it on our spinning rust, SSD, and NVMe units.
--
Joe Landman
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman
_______________________________________________
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
Joe Landman
2018-07-24 15:18:06 UTC
Permalink
Post by John Hearns via Beowulf
Joe, sorry to split the thread here. I like BeeGFS and have set it up.
I have worked for two companies now who have sites around the world,
those sites being independent research units. But HPC facilities are
in headquarters.
The sites want to be able to drop files onto local storage yet have it
magically appear on HPC storage, and same with the results going back
the other way.
One company did this well with GPFS and AFM volumes.
For the current company, I looked at gluster and Gluster
geo-replication is one way only.
What do you know of the BeeGFS mirroring? Will it work over long
distances? (Note to me - find out yourself you lazy besom)
This isn't the use case for most/all cluster file systems.   This is
where distributed object systems and buckets rule.

Take your file, dump it into an S3 like bucket on one end, pull it out
of the S3 like bucket on the other.  If you don't want to use get/put
operations, then use s3fs/s3ql.  You can back this up with replicating
EC minio stores (will take a few minutes to set up ... compare that to
others).

The down side to this is that minio has limits of about 16TiB last I
checked.   If you need more, replace minio with another system (igneous,
ceph, etc.).  Ping me offline if you want to talk more.
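To make that flow concrete, here is a minimal sketch of the put/get round trip,
assuming an S3-compatible endpoint such as a minio server; the endpoint URL,
bucket name, credentials and paths below are placeholders rather than anything
from a real deployment:

    # Sketch: move a results file between sites via an S3-compatible bucket
    # (e.g. a minio server). Endpoint, bucket, credentials and paths are
    # placeholders.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://minio.example.org:9000",   # hypothetical endpoint
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    # Site A: drop the file into the shared bucket.
    s3.upload_file("results.tar.gz", "hpc-transfer", "results.tar.gz")

    # Site B: pull it back out next to the HPC storage.
    s3.download_file("hpc-transfer", "results.tar.gz",
                     "/scratch/incoming/results.tar.gz")

The same bucket can of course be mounted with s3fs/s3ql instead of scripting
the get/put.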

[...]
--
Joe Landman
e: ***@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

James Burton
2018-07-25 02:19:43 UTC
Permalink
Does anyone have any experience with how BeeGFS compares to Lustre? We're
looking at both of those for our next generation HPC storage system.

Is CephFS a valid option for HPC now? Last time I played with CephFS it
wasn't ready for prime time, but that was a few years ago.
Post by Joe Landman
Post by John Hearns via Beowulf
Forgive me for saying this, but the philosophy for software defined
storage such as CEPH and Gluster is that forklift style upgrades should not
be necessary.
When a storage server is to be retired the data is copied onto the new
server then the old one taken out of service. Well, copied is not the
correct word, as there are erasure-coded copies of the data. Rebalanced is
probaby a better word.
This ^^
I'd seen/helped build/benchmarked some very nice/fast CephFS based storage
systems in $dayjob-1. While it is a neat system, if you are focused on
availability, scalability, and performance, it's pretty hard to beat
BeeGFS. We'd ($dayjob-1) deployed several very large/fast file systems
with it on our spinning rust, SSD, and NVMe units.
--
Joe Landman
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman
_______________________________________________
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
--
James Burton
OS and Storage Architect
Advanced Computing Infrastructure
Clemson University Computing and Information Technology
340 Computer Court
Anderson, SC 29625
(864) 656-9047
Chris Samuel
2018-07-25 10:16:11 UTC
Permalink
Post by James Burton
Is CephFS a valid option for HPC now? Last time I played with CephFS it
wasn't ready for prime time, but that was a few years ago.
I'm not sure, but I know people who've recently (last month or two) had a
world of pain running CephFS with multiple MDSs when it managed to get into a
split-brain situation (if my understanding of what happened is right).
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Jeff Johnson
2018-07-25 23:11:44 UTC
Permalink
Post by Chris Samuel
I'm not sure, but I know people who've recently (last month or two) had a
world of pain running CephFS with multiple MDS's when it managed to get into a
split brain situation (if my understanding of what happened is right) .
Split brains are nearly always ugly.
--
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing

***@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001 f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite C - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage
Prentice Bisbal
2018-07-25 20:36:23 UTC
Permalink
Paging Dr. Joe Landman, paging Dr. Landman...

Prentice
Post by James Burton
Does anyone have any experience with how BeeGFS compares to Lustre?
We're looking at both of those for our next generation HPC storage
system.
Is CephFS a valid option for HPC now? Last time I played with CephFS
it wasn't ready for prime time, but that was a few years ago.
[...]
_______________________________________________
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Joe Landman
2018-07-25 21:11:55 UTC
Permalink
Post by Prentice Bisbal
Paging Dr. Joe Landman, paging Dr. Landman...
My response was

"I'd seen/helped build/benchmarked some very nice/fast CephFS based
storage systems in $dayjob-1.  While it is a neat system, if you are
focused on availability, scalability, and performance, it's pretty hard
to beat BeeGFS.  We'd ($dayjob-1) deployed several very large/fast file
systems with it on our spinning rust, SSD, and NVMe units."

at the bottom of the post.

Yes, BeeGFS compares very favorably to Lustre across the performance,
management, and resiliency dimensions.  Distributed replicated metadata and
data is possible, atop zfs, xfs, etc.  We sustained > 40GB/s in a
single rack of spinning disk in 2014 at a customer site using it, no
SSD/cache involved, and using 56Gb IB throughout.  The customer wanted to
see us sustain 46+GB/s writes, and we did.

These are some of our other results with it:

https://scalability.org/2014/05/massive-unapologetic-firepower-2tb-write-in-73-seconds/

https://scalability.org/2014/10/massive-unapologetic-firepower-part-2-the-dashboard/
(that was my first effort with Grafana; look at the writes ... the
vertical scale is in 10k MB/s, i.e. 10GB/s, increments.)

W.r.t. BeeGFS, very easy to install, you can set it up trivially on
extra hardware to see it in action.  Won't be as fast as my old stuff,
but that's the price people pay for not buying the good stuff when it
was available.
--
Joe Landman
e: ***@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

Stu Midgley
2018-07-25 03:26:22 UTC
Permalink
let me be clear... you can do this with Lustre as well (we do it all the
time). We also rebalance the OSTs all the time...
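For reference, a minimal sketch of what that rebalancing usually looks like,
using the stock lfs / lfs_migrate client tools (the mount point, OST index and
size threshold are placeholders, and flag spellings vary a little between
Lustre versions):

    # Sketch: drain large files that have objects on one full or retiring OST,
    # restriping them across the remaining OSTs. Mount point, OST index and
    # size threshold are placeholders.
    import subprocess

    MOUNT = "/lustre"
    OST_INDEX = "2"        # OST to drain

    # "lfs find" lists files striped on that OST; "lfs_migrate -y" rewrites
    # them with a new layout, which moves the data off the chosen OST.
    find = subprocess.Popen(
        ["lfs", "find", MOUNT, "--ost", OST_INDEX, "--size", "+1G"],
        stdout=subprocess.PIPE,
    )
    subprocess.run(["lfs_migrate", "-y"], stdin=find.stdout, check=True)
    find.stdout.close()
    find.wait()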


On Tue, Jul 24, 2018 at 10:31 PM John Hearns via Beowulf <
Post by John Hearns via Beowulf
Forgive me for saying this, but the philosophy for software defined
storage such as CEPH and Gluster is that forklift style upgrades should not
be necessary.
When a storage server is to be retired the data is copied onto the new
server then the old one taken out of service. Well, copied is not the
correct word, as there are erasure-coded copies of the data. Rebalanced is
probably a better word.
--
Dr Stuart Midgley
***@gmail.com
Michael Di Domenico
2018-07-23 18:03:36 UTC
Permalink
Yeah we've found out firsthand that it's problematic as we have been seeing
issues :). Hence the urge to upgrade.
what issues are you seeing? I have 2.10.4 clients pointing at 2.5.1
servers, haven't seen any obvious issues and it's been running for
some time now.
Paul Edmon
2018-07-23 18:19:04 UTC
Permalink
The main issue we see is that OSTs get hung up occasionally, which
causes writes to hang as the OST flaps connecting and disconnecting with
the MDS.  Rebooting the OSSs fixes the issue as it forces the remount.
It seems to only happen when the system is full (i.e. above 95% usage)
and under heavy load.  Prior to our CentOS7 upgrade we didn't see
this issue, so we are convinced it is due to the mismatch in Lustre
versions.  Though it is most certainly the case that the fullness of the
filesystem is contributing, as it seems to go away when the filesystem
usage is lower.  Still, I have seen it a few times when the filesystem
was at 85%.
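As an aside, that kind of flapping is usually visible from the client side as
OSC imports dropping out of the FULL state; here is a small monitoring sketch
along those lines (the parameter name and output layout are assumptions based
on recent Lustre clients, so check them against your own lctl before relying
on it):

    # Sketch: flag OSC imports that are not in the FULL (connected) state by
    # parsing "lctl get_param osc.*.import" on a Lustre client.
    # Parameter name and output layout assumed; adjust for your Lustre version.
    import re
    import subprocess

    out = subprocess.run(
        ["lctl", "get_param", "osc.*.import"],
        capture_output=True, text=True, check=True,
    ).stdout

    device = None
    for line in out.splitlines():
        m = re.match(r"^(osc\..*)\.import=", line.strip())
        if m:
            device = m.group(1)
        elif line.strip().startswith("state:") and "FULL" not in line:
            print(f"{device}: {line.strip()}")   # e.g. CONNECTING, DISCONN, EVICTED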

Anyways the obvious culprit is the version mismatch.  It may also be
that some of the additional features/enhancements in 2.5.34 are
conflicting with the mainline version, as the 2.5.34 is something we got
from Intel for the IEEL appliance we have been running.

Odds are your systems are fine as they aren't taking quite the pounding
ours is.  The problem doesn't happen that frequently.

-Paul Edmon-
Post by Michael Di Domenico
Yeah we've found out firsthand that it's problematic as we have been seeing
issues :). Hence the urge to upgrade.
what issues are you seeing? I have 2.10.4 clients pointing at 2.5.1
servers, haven't seen any obvious issues and it's been running for
some time now.
_______________________________________________
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Jonathan Engwall
2018-07-24 18:04:04 UTC
Permalink
Snowball is the very large scale AWS data service.
Fred Youhanaie
2018-07-24 18:20:42 UTC
Permalink
Nah, that ain't large scale ;-) If you want large scale have a look at snowmobile:

https://aws.amazon.com/snowmobile/

They drive a 45-foot truck to your data centre, fill it up with your data bits, then drive it back to their data centre :-()

Cheers,
Fred
Post by Jonathan Engwall
Snowball is the very large scale AWS data service.
[...]
Lux, Jim (337K)
2018-07-26 20:49:15 UTC
Permalink
So this is the modern equivalent of "nothing beats the bandwidth of a station wagon full of mag tapes".
It *is* a clever idea - I'm sure all the big cloud providers have figured out how to do a "data center in a shipping container", and that's basically what this is.

I wonder what it costs (yeah, I know I can "Contact Sales to order an AWS Snowmobile"... but...)


Jim Lux
(818)354-2075 (office)
(818)395-2714 (cell)

-----Original Message-----
From: Beowulf [mailto:beowulf-***@beowulf.org] On Behalf Of Fred Youhanaie
Sent: Tuesday, July 24, 2018 11:21 AM
To: ***@beowulf.org
Subject: Re: [Beowulf] Lustre Upgrades

Nah, that ain't large scale ;-) If you want large scale have a look at snowmobile:

https://aws.amazon.com/snowmobile/

They drive a 45-foot truck to your data centre, fill it up with your data bits, then drive it back to their data centre :-()

Cheers,
Fred
[...]
Fred Youhanaie
2018-07-27 00:09:38 UTC
Permalink
Yep, this could be considered as a form of COTS high volume data transfer ;-)

from https://aws.amazon.com/snowmobile/faqs/ (the very last item)

"Q: How much does a Snowmobile job cost?

"Snowmobile provides a practical solution to exabyte-scale data migration and is significantly faster and cheaper than any network-based solutions, which can take decades and millions of dollars of
investment in networking and logistics. Snowmobile jobs cost $0.005/GB/month based on the amount of provisioned Snowmobile storage capacity and the end to end duration of the job, which starts when a
Snowmobile departs an AWS data center for delivery to the time when data ingestion into AWS is complete. Please see AWS Snowmobile pricing or contact AWS Sales for an evaluation."

So it seems a fully loaded snowmobile, 100PB at 0.005/GB/month, would cost $524,288.00/month!
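That figure checks out if the 100PB is read as binary petabytes; a quick
back-of-envelope in code, assuming nothing beyond the quoted $0.005/GB/month
rate:

    # Sanity check of the quoted Snowmobile rate: $0.005 per GB per month.
    capacity_gib = 100 * 1024 * 1024     # 100 PB read as binary: PiB -> GiB
    print(capacity_gib * 0.005)          # 524288.0 dollars/month, i.e. $524,288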

Cheers,
Fred.
Post by Lux, Jim (337K)
SO this is the modern equivalent of "nothing beats the bandwidth of a station wagon full of mag tapes"
It *is* a clever idea - I'm sure all the big cloud providers have figured out how to do a "data center in shipping container", and that's basically what this is.
I wonder what it costs (yeah, I know I can "Contact Sales to order a AWS Snowmobile"... but...)
[...]
Lux, Jim (337K)
2018-07-27 00:47:18 UTC
Permalink
A quick calculation shows that the bandwidth is on the order of single digit Tbps, depending on the link length and road conditions. Pasadena to Ann Arbor works out to 7.6 Tbps on I-80

If they charge by fractional months - it's about a 33 hour drive, so call that 1/15th of a month. So about $35k to do the transport.
Significantly cheaper than 4000 km of fiber, coax, or cat 5 cable.
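The same back-of-envelope in code, for anyone who wants to fiddle with the
inputs (a full 100PB load and a 33 hour drive are the assumptions here):

    # Effective bandwidth of a full Snowmobile over a ~33 hour drive.
    payload_bits = 100e15 * 8                    # 100 PB (decimal) in bits
    drive_seconds = 33 * 3600
    print(payload_bits / drive_seconds / 1e12)   # ~6.7 Tbps: single-digit Tbps

    # Transport cost at $0.005/GB/month for roughly 1/15th of a month.
    print(100e6 * 0.005 / 15)                    # ~$33k, in line with ~$35k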


Jim Lux
(818)354-2075 (office)
(818)395-2714 (cell)


-----Original Message-----
From: Beowulf [mailto:beowulf-***@beowulf.org] On Behalf Of Fred Youhanaie
Sent: Thursday, July 26, 2018 5:10 PM
To: ***@beowulf.org
Subject: Re: [Beowulf] Lustre Upgrades

Yep, this could be considered as a form of COTS high volume data transfer ;-)

from https://aws.amazon.com/snowmobile/faqs/ (the very last item)

"Q: How much does a Snowmobile job cost?

"Snowmobile provides a practical solution to exabyte-scale data migration and is significantly faster and cheaper than any network-based solutions, which can take decades and millions of dollars of investment in networking and logistics. Snowmobile jobs cost $0.005/GB/month based on the amount of provisioned Snowmobile storage capacity and the end to end duration of the job, which starts when a Snowmobile departs an AWS data center for delivery to the time when data ingestion into AWS is complete. Please see AWS Snowmobile pricing or contact AWS Sales for an evaluation."

So it seems a fully loaded snowmobile, 100PB at 0.005/GB/month, would cost $524,288.00/month!

Cheers,
Fred.
Jörg Saßmannshausen
2018-07-27 07:10:38 UTC
Permalink
Hi all,

Jim: the flip side of the cable is that once it is installed you can still use it,
whereas with the Snowmobile you have to pay for every use.

So in the long run the cable is cheaper, especially as we do need fast
connections for scientific purposes.

I was at a talk here in London not so long ago where they were talking about
data transfer for the very large telescope. As that is generating a huge amount
of data a week, say, a Snowmobile would simply not be practical here.
Besides, the data is generated on literally the other side of the world.

All the best from a hot, sunny London

Jörg
Post by Lux, Jim (337K)
A quick calculation shows that the bandwidth is on the order of single digit
Tbps, depending on the link length and road conditions. Pasadena to Ann
Arbor works out to 7.6 Tbps on I-80
If they charge by fractional months - it's about a 33 hour drive, so call
that 1/15th of a month. So about $35k to do the transport. Significantly
cheaper than 4000 km of fiber, coax, or cat 5 cable.
John Hearns via Beowulf
2018-07-27 08:01:53 UTC
Permalink
Jörg, then the days of the Tea Clipper Races should be revived. We have
just the ship for it already. Powered by green energy, and built in
Scotland of course.
https://en.wikipedia.org/wiki/Cutty_Sark

Just fill her hold with hard drives and set sail. Aaar me hearties.
I can just see HPC types being made to climb the rigging in a gale...
Jörg Saßmannshausen
2018-07-27 08:31:39 UTC
Permalink
Hi John,

good idea! Especially as the ship has been restored and is in my
neighbourhood. The only flip side here might be that some tourists might not
like the idea, and it might be a wee bit difficult to get it back into the
Thames. :-)

All the best

Jörg
Jörg, then the days of the Tea Clipper Races should be revived. We have
just the ship for it already. Powered by green energy, and built in
Scotland of course.
https://en.wikipedia.org/wiki/Cutty_Sark
Just fill her hold with hard drives and set sail. Aaar me hearties.
I can just see HPC types being made to climb the rigging in a gale...
Fred Youhanaie
2018-07-27 08:49:20 UTC
Permalink
Post by John Hearns via Beowulf
I can just see HPC types being made to climb the rigging in a gale...
... and they would be called DevOpSailors

Happy SysAdmin / DevOpSailor Day :-)

http://sysadminday.com/
Lux, Jim (337K)
2018-07-27 13:06:20 UTC
Permalink
William Henry Dana – “Two Years Before the Mast” – An excellent book describing what it was like to be a sailor in the days just before steam, there’s plenty of climbing the rigging in a sleet storm, while trying to round Cape Horn. – but when they got to California, the weather was a lot nicer. Dana later went on to be a lawyer fighting for sailor’s rights.

Silicon valley wasn’t very developed when Dana was doing his shipboard duties, and there weren’t any disk drives at the time.

They *did*, however, have parallel cluster computing – rooms full of computers grinding out navigation tables. And I suppose they were commodity computers, using commodity interconnects (of the day), so could they fairly be called a Beowulf?

https://www.theatlantic.com/science/archive/2016/06/the-women-behind-the-jet-propulsion-laboratory/482847/ describes a 1953 version of the same.


From: Beowulf <beowulf-***@beowulf.org> on behalf of "***@beowulf.org" <***@beowulf.org>
Reply-To: John Hearns <***@googlemail.com>
Date: Friday, July 27, 2018 at 1:03 AM
To: Jörg Saßmannshausen <sassy-***@sassy.formativ.net>
Cc: "***@beowulf.org" <***@beowulf.org>
Subject: Re: [Beowulf] Lustre Upgrades

Jörg, then the days of the Tea Clipper Races should be revived. We have just the ship for it already. Powered by green energy, and built in Scotland of course.
https://en.wikipedia.org/wiki/Cutty_Sark

Just fill her hold with hard drives and set sail. Aaar me hearties.
I can just see HPC types being made to climb the rigging in a gale...
John Hearns via Beowulf
2018-07-27 13:17:28 UTC
Permalink
Jim, thank you for that link. It is quite helpful! I have a poster accepted
for the Julia Conference in two weeks' time.
My proposal is to discuss computers just like that, on the Manhattan
Project etc., and then to show how Julia can easily be used to solve the
equation for critical mass from the Los Alamos Primer.
I haven't done a damn thing for the poster yet... oops.
I am also arranging a visit to Bletchley Park at the end of the conference.
JuliaCon is sold out, but I am sure you can watch the presentations:
http://juliacon.org/2018/
Post by Lux, Jim (337K)
[...]
Lux, Jim (337K)
2018-07-27 17:52:10 UTC
Permalink
If you need more pictures of the early JPL cluster computers, using acoustic and optical interconnects, let me know. They’ve recently reorganized the photo archives here, and it’s a lot easier to find stuff (like pictures of the foundation of the building my office is in, from the 1950s, when they were digging it).

There’s also a story from Feynman of a pipelined compute chain using EAM equipment (EAM – Electric Accounting Machinery – readers, sorters, punches, programmed with plugboards) and human computers. You might be able to find pictures, but since the whole project was classified, it’s less likely; the pictures were probably declassified 50 years later, but that doesn’t mean someone has spent the time and money to put them online anywhere.

Jim Lux
(818)354-2075 (office)
(818)395-2714 (cell)

From: John Hearns [mailto:***@googlemail.com]
Sent: Friday, July 27, 2018 6:17 AM
To: Lux, Jim (337K) <***@jpl.nasa.gov>
Cc: Beowulf Mailing List <***@beowulf.org>
Subject: Re: [Beowulf] Lustre Upgrades

Jim, thankyou for that link. It is quite helpful! I have a poster accepted for the Julia Conference in two weeks time.
My proposal is to discuss computers just like that - on the Manhattan project etc. Then to show how Julia can easily be used to solve the equation for critical mass from the Los Alamos Primer.
I havent done a damn thing for the poster yet.. ooops.
I am also arranging a visit to Bletchley Park at the end of the conference.
JuliaCon is sold out but I am sure you can watch the presentations http://juliacon.org/2018/




On Fri, 27 Jul 2018 at 15:06, Lux, Jim (337K) <***@jpl.nasa.gov> wrote:
[...]
Prentice Bisbal
2018-07-27 18:37:16 UTC
Permalink
"Top Secret Rosies" is a good documentary on the women computers of
yesteryear:

https://en.wikipedia.org/wiki/Top_Secret_Rosies:_The_Female_%22Computers%22_of_WWII
Post by Lux, Jim (337K)
[...]
Fred Youhanaie
2018-07-27 08:41:17 UTC
Permalink
They do mention up to a 1 Tb/s transfer rate between the truck and the data centre, which, for 100 PB, would take about 10 days to transfer into the truck and another 10 days out of it into AWS storage.
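A back-of-the-envelope check of that 10-day figure, as a rough sketch assuming decimal petabytes and a sustained 1 Tb/s link with no protocol overhead:

# 100 PB moved over a sustained 1 Tb/s link, decimal units throughout
data_bits = 100 * 10**15 * 8        # 100 PB expressed in bits
link_bps = 1 * 10**12               # 1 Tb/s
days = data_bits / link_bps / 86400
print(round(days, 1))               # ~9.3 days each way, i.e. roughly 10 days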

I think the main purpose of the Snowmobile and Snowball services is the initial transfer of a high volume of data to the AWS cloud. You would still need a data link for the data that is
acquired/generated on an ongoing basis.

And you may decide to migrate/copy your data to the cloud so that someone else can take care of the storage and global distribution of the data.


Cheers,
Fred
Post by Jörg Saßmannshausen
Hi all,
Jim: the flip side of the cable is: once it is installed you still can use it,
whereas with the snow mobile you have to pay for every use.
So in the long run the cable is cheaper, specially as we do need fast
connection for scientific purposes.
I was at a talk here in London not so long ago when they were talking about
data transfer of the very large telescope. As that is generating a huge amount
of data a week, say, a snow mobile would simply not be practical here.
Besides, the data is generated on literally the other side of the world.
All the best from a hot, sunny London
Jörg
Post by Lux, Jim (337K)
A quick calculation shows that the bandwidth is on the order of single digit
Tbps, depending on the link length and road conditions. Pasadena to Ann
Arbor works out to 7.6 Tbps on I-80
If they charge by fractional months - it's about a 33 hour drive, so call
that 1/15th of a month. So about $35k to do the transport. Significantly
cheaper than 4000 km of fiber, coax, or cat 5 cable.
Jim Lux
(818)354-2075 (office)
(818)395-2714 (cell)
-----Original Message-----
Youhanaie Sent: Thursday, July 26, 2018 5:10 PM
Subject: Re: [Beowulf] Lustre Upgrades
Yep, this could be considered as a form of COTS high volume data transfer ;-)
from https://aws.amazon.com/snowmobile/faqs/ (the very last item)
"Q: How much does a Snowmobile job cost?
"Snowmobile provides a practical solution to exabyte-scale data migration
and is significantly faster and cheaper than any network-based solutions,
which can take decades and millions of dollars of investment in networking
and logistics. Snowmobile jobs cost $0.005/GB/month based on the amount of
provisioned Snowmobile storage capacity and the end to end duration of the
job, which starts when a Snowmobile departs an AWS data center for delivery
to the time when data ingestion into AWS is complete. Please see AWS
Snowmobile pricing or contact AWS Sales for an evaluation."
So it seems a fully loaded snowmobile, 100PB at 0.005/GB/month, would cost
$524,288.00/month!
Cheers,
Fred.
Post by Lux, Jim (337K)
SO this is the modern equivalent of "nothing beats the bandwidth of a
station wagon full of mag tapes" It *is* a clever idea - I'm sure all the
big cloud providers have figured out how to do a "data center in shipping
container", and that's basically what this is.
I wonder what it costs (yeah, I know I can "Contact Sales to order a
AWS Snowmobile"... but...)
Jim Lux
(818)354-2075 (office)
(818)395-2714 (cell)
-----Original Message-----
Sent: Tuesday, July 24, 2018 11:21 AM
Subject: Re: [Beowulf] Lustre Upgrades
Nah, that ain't large scale ;-) If you want large scale have a look at
https://aws.amazon.com/snowmobile/
They drive a 45-foot truck to your data centre, fill it up with your
data bits, then drive it back to their data centre :-()
Cheers,
Fred
Post by Jonathan Engwall
Snowball is the very large scale AWS data service.
Post by John Hearns via Beowulf
Joe, sorry to split the thread here. I like BeeGFS and have set it up.
I have worked for two companies now who have sites around the world,
those sites being independent research units. But HPC facilities are
in headquarters.
The sites want to be able to drop files onto local storage yet have
it magically appear on HPC storage, and same with the results going
back the other way.
One company did this well with GPFS and AFM volumes.
For the current company, I looked at gluster and Gluster
geo-replication is one way only.
What do you know of the BeeGFS mirroring? Will it work over long
distances? (Note to me - find out yourself you lazy besom)
This isn't the use case for most/all cluster file systems. This is
where distributed object systems and buckets rule.
Take your file, dump it into an S3 like bucket on one end, pull it
out of the S3 like bucket on the other. If you don't want to use
get/put operations, then use s3fs/s3ql. You can back this up with
replicating EC minio stores (will take a few minutes to set up ...
compare that to others).
The down side to this is that minio has limits of about 16TiB last I
checked. If you need more, replace minio with another system
(igneous, ceph, etc.). Ping me offline if you want to talk more.
[...]
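The figures quoted above can be sanity-checked with a rough sketch; the ~30-hour drive time and the binary-unit reading of "100PB" are assumptions made here to reproduce the quoted numbers, not figures published by AWS:

# Effective bandwidth of a truck carrying 100 PB over a ~30-hour drive
payload_bits = 100 * 10**15 * 8
drive_seconds = 30 * 3600
print(payload_bits / drive_seconds / 10**12)   # ~7.4 Tbps, i.e. single-digit Tbps

# Monthly cost at $0.005/GB/month, reading 100 PB as 100 * 1024**2 binary GB
monthly_cost = 0.005 * 100 * 1024**2
print(monthly_cost)                            # 524288.0 dollars per month
print(round(monthly_cost / 15))                # roughly $35k for ~1/15 of a month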
Lux, Jim (337K)
2018-07-27 12:52:46 UTC
Permalink
Indeed.
Interestingly, we were having this discussion (in a similar form) with respect to a radio telescope at work. Modern arrays like SKA, LOFAR, MeerKAT, etc. have *lots of data* being pushed around. I'm working on one that flies in space (so we're not going to have a fiber to the ground <grin>), but there has always been discussion of the difference between a "real time, low latency interconnect" and a "long latency interconnect".

One aspect of this, though, is that buying and installing fiber is a "capital expenditure", while buying lots of Snowmobiles (or mailing lots of disk drives) is an "operating expense", and the two often come out of different buckets of money. Of course, if you can procure your network services "by the month" or "by the bit", then someone else deals with the capital expense.

This comes up for us in NASA missions and projects all the time: do you buy test equipment or rent it? And this is where you can make the correct decision for the short run that may turn out to be more expensive in the (speculative) long run. Consider Spirit and Opportunity: the minimum mission was about 3 months, and the goal was 1 Martian year (I think). And here we are 14 years later with Opportunity still grinding along (well, not right now, because of the dust storm blotting out the sun). Curiosity was also scoped to 1 Martian year (about 2 Earth years), and it landed in 2012.

On 7/27/18, 12:11 AM, "Beowulf on behalf of Jörg Saßmannshausen" <beowulf-***@beowulf.org on behalf of sassy-***@sassy.formativ.net> wrote:

Hi all,

Jim: the flip side of the cable is: once it is installed you still can use it,
whereas with the snow mobile you have to pay for every use.

So in the long run the cable is cheaper, specially as we do need fast
connection for scientific purposes.

I was at a talk here in London not so long ago when they were talking about
data transfer of the very large telescope. As that is generating a huge amount
of data a week, say, a snow mobile would simply not be practical here.
Besides, the data is generated on literally the other side of the world.

All the best from a hot, sunny London

Jörg
Post by Lux, Jim (337K)
[...]