Discussion:
Beowulf Questions
Randall Jouett
2003-01-03 08:46:41 UTC
Permalink
Howdy,


Newbie alert, although I have read the FAQ, visited a couple
of the project sites listed on beowulf.org, and even read a
HOWTO or 3 :^). Being a software engineer, I'll just say that
I'm way too familiar with RTFM and google searches :^).


Basically, I'm looking for the following info:

* Is there a beowulf setup out there that is used to
generate ray-traced graphics? If so, does anyone
have links to sites with in-depth info?


* Is there a beowulf setup out there set up specifically
for random-number generation, with its main focus on
generating "truly random" numbers, if you can say that
synchronous, clock-based computers are capable of generating
truly random numbers :^). If so, has a cluster of this type been
used for encryption and decryption purposes?


* Is there a beowulf chess engine out there? If so, has it
ever played in the computer chess championship, and how
did it perform?


* Is anyone working on beowulf clustering based on trusted-host
computing? That is, instead of having a local cluster of
numerous computers in various racks, a person could say something
like, "I trust this host and that host, and if any of them want
to use my excess CPU time over the Internet, then they are welcome
to use my broadband-connected machine for their purposes." In other
words, this would be a setup much like the SETI screen blanker, yet
the screen blanker would be removed and out of the loop. Anyone doing
this? If not, I'm interested in giving this a shot. (Send e-mail or
post here, and I'll give a verbose explanation of the way I see
something like this working.)


* Is anyone using real-time versions of Unix(TM) in their beowulf
setup? If so, did they/you notice a significant increase in
processing speed? Personally, and off the top of my head, I'd
think the main bottleneck is the MPI and network communications.


[This next one will be a bit "out there," but what the hell -- no
guts no glory! :^) ]

* Is anybody working on a model that encompasses nano-technology?
That is, is there anyone out there working on algorithms that
would allow nano-bots to communicate and process information in
a beowulf-like manner?


Anywho, TIA, and feel free to answer any of the questions, even
if you can't answer them all. The way I see things, a bit of knowledge
gained is better than no knowledge at all....

Best Regards,

Randall

--
Randall Jouett
Amateur Radio: AB5NI

I eat spaghetti code out of a bit bucket while sitting at a hash table!
Donald Becker
2003-01-03 18:42:21 UTC
Permalink
Post by Randall Jouett
Newbie alert, although I have read the FAQ, visited a couple
of the project sites listed on beowulf.org, and even read a
HOWTO or 3 :^). Being a software engineer, I'll just say that
I'm way too familiar with RTFM and google searches :^).
* Is there a beowulf setup out there that is used to
generate ray-traced graphics? If so, does anyone
have links to sites with in-depth info?
Various ports of POV-Ray are available for PVM and MPI, with a very wide
range of quality and performance.

We distribute a version of POV-Ray specifically ported for our cluster
system and BeoMPI. Our POV-Ray port is interesting because:

* It transparently uses all available cluster nodes, and works even if
that number is '0'.

* It does all of the serial setup and run-time I/O on the front end
machine (technically, the MPI rank 0 node). This minimizes
overall work and keeps the POV-Ray call-out semantics unchanged.

* It does the rendering only on compute nodes (except for the N=0 case).

* It completes the rendering even with crashed or slow nodes.
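
A rough sketch of that shape, assuming mpi4py (this is not the actual
BeoMPI port; render_tile() is just a stand-in for the real renderer):

from mpi4py import MPI

def render_tile(tile):
    # stand-in for the real ray-tracing work on one band of scanlines
    return ("pixels", tile)

def main():
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    tiles = list(range(16))              # the image split into bands

    if size == 1:
        # the "zero compute nodes" case: the front end renders everything itself
        parts = [[render_tile(t) for t in tiles]]
    else:
        # compute nodes render; rank 0 (the front end) only does setup and I/O
        mine = [] if rank == 0 else tiles[rank - 1::size - 1]
        parts = comm.gather([render_tile(t) for t in mine], root=0)

    if rank == 0:
        image = [t for part in parts for t in part]
        print("rendered", len(image), "tiles")   # serial I/O stays on the front end

if __name__ == "__main__":
    main()
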
Post by Randall Jouett
* Is there a beowulf setup out there set up specifically
for random-number generation, with its main focus on
generating "truly random" numbers, if you can say that
synchronous, clock-based computers are capable of generating
truly random numbers :^). If so, has a cluster of this type been
used for encryption and decryption purposes?
This is trivial. Once you have a good serial pseudo-random number
generator, you can use a cluster to generate more.
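
A minimal sketch of the idea, assuming mpi4py and NumPy; each rank just
runs its own serial generator with a rank-offset seed, which is the
simplest (and weakest) way to pick the seeds:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

rng = np.random.default_rng(12345 + rank)   # per-rank serial generator
chunk = rng.random(1_000_000)               # every node produces its own chunk

chunks = comm.gather(chunk, root=0)         # collect on the front end if needed
if rank == 0:
    print("generated", sum(len(c) for c in chunks), "numbers")
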
Post by Randall Jouett
* Is anyone working on beowulf clustering based on trusted-host
computing? That is, instead of having a local cluster of
...
Post by Randall Jouett
words, this would be a setup much like the SETI screen blanker, yet
This wouldn't be a Beowulf cluster. (It's a stretch to call it a
cluster at all.)
Cycle scavenging is a different concept, and it only applies to a tiny percentage
of problems.

(People have heated discussions about cluster I/O performance and
communication latency. With cycle-scavenging, these are orders of
magnitude worse.)
--
Donald Becker ***@scyld.com
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Scyld Beowulf cluster system
Annapolis MD 21403 410-990-9993
Randall Jouett
2003-01-04 05:59:57 UTC
Permalink
Hello Donald, and thanks for the response.
Post by Donald Becker
Various ports of POV-Ray are available for PVM and MPI, with a very wide
range of quality and performance.
We distribute a version of POV-Ray specifically ported for our cluster
system and BeoMPI. Our POV-ray port is interesting because
It transparently uses all available cluster nodes, and works even if
that number is '0'.
I like this.
Post by Donald Becker
It does all of the serial setup and run-time I/O on the front end
machine (technically, the MPI rank 0 node). This minimizes
overall work and keeps the POV-Ray call-out semantics unchanged
It does the rendering only on compute nodes (except for the N=0 case).
It completes the rendering even with crashed or slow nodes.
Ah. So it redistributes the work, huh? Kewl. Sounds like
somebody at your place of employment has a great abstract
mind, being able to encompass all variables, yet the end
solution seems concrete and directed toward the individual
user. Too kewl.
Post by Donald Becker
Post by Randall Jouett
* Is there a beowulf setup out there set up specifically
for random-number generation, with its main focus on
generating "truly random" numbers, if you can say that
synchronous, clock-based computers are capable of generating
truly random numbers :^). If so, has a cluster of this type been
used for encryption and decryption purposes?
This is trivial. Once you have a good serial pseudo-random number
generator, you can use a cluster to generate more.
That's what I was thinking. Can you say 1 billion bits of
RSA encryption? I knew you could :^). My only worry here is
that with a few hundred thousand nodes, would the NSA be able
to decrypt stuff relating to national security? That is, could
the government actually afford to purchase on-site equipment
that could keep up with P2P, Internet-based clustering, if you
will? Personally, I think they'd have to jump in on the bandwagon
with everyone else, yet with the current level of "Big Brother
is watching us all" attitude floating around, I seriously doubt
that people would knowingly allow them to use their free cycles
for decryption without some mega-serious watch-dogging geeks
watching their every move.
Post by Donald Becker
Post by Randall Jouett
* Is anyone working on beowulf clustering based on trusted-host
computing? That is, instead of having a local cluster of
This wouldn't be a Beowulf cluster. (It's a stretch to
call it a cluster at all.) Cycle scavenging is a different
concept, and it only applies to a tiny percentage of problems.
Ah. Ok. Thanks for the info. OTOH, one can't help but wonder
if this isn't the future of computing, especially when we
consider the fact that almost all of the appliances we'll
be using in the near future will be network aware. For instance,
my computer determines that a particular task/process is very
CPU intensive, and the best solution to the problem would be
to use more processors in a parallel fashion to obtain the
result. Wouldn't it make perfect sense for the system to set
itself up as a master node and subdivide the task at hand to
various network-aware devices/computers, such as the unused
television or coffee pot? :^). Also, if the "local cluster"
of household machines isn't up to the computation, wouldn't
it also make sense that the master/root node would then go
out and ask its/your next-door neighbor if it can steal a few
cycles, too, branching out further if necessary? Basically,
I'm thinking along the lines of a power-grid like setup here.


I guess what I'm really trying to say here is this: do all
nodes "have" to be on site to be considered a beowulf cluster?
Personally, I don't think so, especially if we consider the
fact that in the not-too-distant future, networking speeds
will be up to snuff with the various tasks at hand. With these
concepts in mind, I'll step out on a limb here and say that it
might actually be very advantageous to live in a highly-congested
area, like in an apartment complex or next to a busy freeway. Living
in an area that encompasses both would flat-out rule :^).


BTW, when I say "living next to a freeway would be advantageous,"
I'm talking about using unused cycles from various WiFi-connected
cars :^). Futuristic? At this point in time, most definitely; however,
I'm sure that most of us here see something as left-field as this
eventually happening, and, IMHO, someplace like this is exactly
the place I want to be before the rest show up.


As a side note, I found something on the web last night called
"GreenTea," which is a P2P-like "operating system" setup around
Java. Interesting concept, portable, and I can see a system like
this taking off fairly soon. OTOH, I think interpreted languages
and environments have a bit further to go before they'll start
pushing compiled languages aside, though. Eventually, they should
take over, although I seriously doubt it will be anytime soon.
Post by Donald Becker
(People have heated discussions about cluster I/O performance and
communication latency.
I bet they do, and, quite frankly, it's very understandable,
especially when we all start to consider the major drawbacks
of using Ethernet-based NICs for message passing. OTOH, it is
cheap, and there are certainly various applications that lend
themselves nicely to using this type of environment. Personally,
I'd be more worried about things such as node stability, algorithms,
code profiling, and maybe even looking into writing
various time-critical subroutines in assembly. IMHO, this would
be time much better spent :^). Not only that, but if we were to
nit-pick the beowulf model to death, I think we'd come to the
conclusion that using multiple processors on a real bus would
be a much better way to go, although I'm sure we'd all eventually
wind up designing something that resembles a 1.5 million dollar
Sun or IBM system :^).
Post by Donald Becker
(With cycle-scavenging, these are orders of magnitude worse.)
Agreed. OTOH, this will eventually change, and linear-progressive,
"next logical step," fringe-level computing is where I like spending
my man hours. Unfortunately, though, we all have bills to pay and
rationalizations to make about the current level of real-world computing,
so let's just say that I'm a bit flexible in this particular area :^).


BTW, my main level of interest in this particular field is system
administration and a bit of code cranking. Having done both for
years, I find network administration much more relaxing -- believe
it or not! :^) Not only that, but having a decent level of hardware,
electronics, telecommunications, and programming knowledge has to be
beneficial when administrating something like a beowulf cluster.
Hopefully, the road to learning this particular model of computing won't
be too time consuming. OTOH, I'm sure the infamous Murphy will rear
his ugly head somewhere along the line :^).


Thanks for your time and remarks, Donald -- it is much appreciated!


Randall

--
Randall Jouett
Amateur Radio: AB5NI

I eat spaghetti code out of a bit bucket while sitting at a hash table!


P.S.


Is it just me, or do others find it annoying that the current
version of Mozilla doesn't have spell-checking capability? BTW,
this is a semi-shrewd way of saying, "Please disregard all spelling
errors, and please send all complaints to /dev/null." :^)


P.P.S.


Anyone out there try Plan 9 in a beowulf environment? As a layman
and at first glance, it does seem to be well suited to the tasks
at hand.
Rupert Davey
2003-01-03 23:49:54 UTC
Permalink
Hi Randall and all

I have used PVMPOV for ray-tracing. It is the POV-Ray application with
a PVM patch applied to it. It works very well and the XPVM interface
allows all the processes and communications to be seen. It takes .pov
text files and converts them into .tga graphics files. My small cluster
of four 750MHz Athlons does skyvase.pov in about 30 seconds. PVMPOV is
easy-ish to set up and configure, and once it's all done you have a nice
fast rendering system.
This uses PVM 3.4.3, POV-Ray v3.1, and PVMPOV3-1g-1.

Other things like BEOLIN are Beowulf specific and don't require PVM to
be on each node. This addresses issues of external node access and
pipes ALL the processes via the frontend.

www.povray.org -> POV-Ray
http://www.epm.ornl.gov/pvm/pvm_home.html -> PVM
http://pvmpov.sourceforge.net/ -> PVMPOV

Please feel free to correct me if any of that is wrong; I'm kind of a newbie
myself. Doing a final-year degree project on this stuff.

:)

Rupert Davey

Randall Jouett
2003-01-04 07:29:49 UTC
Permalink
Howdy Rupert (and all), and thanks for the reply.
Post by Rupert Davey
Hi Randall and all
I have used PVMPOV for ray-tracing. It is the POV-Ray application with
a PVM patch applied to it. It works very well and the XPVM interface
allows all the processes and communications to be seen.
Kewl. Exactly what I'm looking for. Thanks. I have 3
spare 1-GHz machines hanging around here, and I think
they'd be well suited for something like this, Rupert.
OTOH, this is just "fun stuff to do while I'm messing
around," and my main focus will be beowulf administration.
I also delve into computer security too, and I'd imagine
that security on beowulf clusters can be VERY interesting :^).
I only hope that I don't have to do too much reading before
I get something up and running, although after doing this
stuff we all call computin' for 25 years, I know that this isn't
going to be the case -- not by a long shot. :^)
Post by Rupert Davey
It takes .pov text files and converts them into .tga graphics
files.
Ok. Kewl. Personally, I'm planning on converting them over to
jpg to save disk space (and memory when editing). Since I'm just
messing around for enjoyment, I can live with losing a bit of
resolution and color.
Post by Rupert Davey
My small cluster of four 750MHz Athlons does skyvase.pov in
about 30 seconds. PVMPOV is easy-ish to set up and configure,
and once it's all done you have a nice fast rendering system.
This uses PVM3.4.3, POV-Ray v3.1 and the PVMPOV3-1g-1.
Okie doke. I'll make sure I download all programs mentioned
before I go to sleep tonight. Thanks!
Post by Rupert Davey
Other things like BEOLIN are Beowulf specific and don't require PVM to
be on each node. This addresses issues of external node access and
pipes ALL the processes via the frontend.
Hmmm. Not familiar with BEOLIN, although it does sound
like a beowulf-specific version of Linux? If so, too
kewl!
Post by Rupert Davey
www.povray.org -> POV-Ray
http://www.epm.ornl.gov/pvm/pvm_home.html -> PVM
http://pvmpov.sourceforge.net/ -> PVMPOV
Thanks for the links!
Post by Rupert Davey
Please feel free to correct me if any of that is wrong; I'm kind of a newbie
myself. Doing a final-year degree project on this stuff.
:).
Well, from one neophyte to another, I'll just say thanks and
a job well done :^). Who am I to correct anyone when discussing
this subject, although I've become a bit familiar with hardware
and software development over the last 25 years. Hopefully, this
previous knowledge will be of help with learning this beowulf
stuff :^).


One thing that has me rather perplexed is that I didn't notice
beowulf setups from Industrial Light and Magic, Amblin(sic?)
and the like in the projects list on beowulf.org. You'd think
that some of these shops would be proud to announce that they're
using a setup like this, huh? OTOH, I just thought about something,
and I guess that they can afford all of that high-end SGI gear
and what not, so who needs a beowulf cluster? OTOH yet again :^),
who's not into saving millions of bucks on rendering gear? Hmmm.
Maybe LightWave hasn't been ported to a parallel environment yet
or something? (Shrug.)



BTW, I just came up with YADDA (Yet Another Damn Dumb Assessment :^)).
With the Lindows and the Linux XBox projects floating around, I
wonder if a person could build an el-cheapo beowulf setup for a
few grand? I mean, the Lindows and XBox machines can be had at
Wally World for around $200 or so. You'd think they'd make a
semi-decent cluster, and the XBox machines would have the added
advantage of being able to be used as an XBox network for game
parties. Not only that, but the XBox footprint is pretty damn
small, especially when compared to something like a mid-tower
(or larger) case. I can see it now. XBox parties after work at
ORNL, LBL, and JPL, and all the physicists there are trashing
the hell out of the physics model used in some racing game or
something. :^) :^)


Ok. On to answering more e-mails...


Best Regards,


Randall

--
Randall Jouett
Amateur Radio: AB5NI

I eat spaghetti code out of a bit bucket while sitting at a hash table!
Donald Becker
2003-01-04 07:05:13 UTC
Permalink
Post by Randall Jouett
Post by Donald Becker
It transparently uses all available cluster nodes, and works even if
that number is '0'.
I like this.
Our cluster philosophy is that the end user should not be required to
do anything new or special to run a cluster application.

That means:

* Applications should work even if there is only a single machine in
the cluster. Many beginner MPI applications don't handle this case
correctly.

* Cluster applications should not require a helper program such as
'mpirun' or 'mpiexec'.

* The application code should interact with the scheduler to set any
special scheduling requirements or suggestions.

A sophisticated user should still be able to optimize and do clever
things, but the basic operation shouldn't require any new knowledge.
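
For the first point, here is a small sketch (assuming mpi4py; not any
particular vendor's code) of an MPI application that behaves correctly
when the "cluster" is a single machine:

from mpi4py import MPI
import random

def main():
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    total = 1_000_000
    # split the work so nothing is lost, even when size == 1
    mine = total // size + (1 if rank < total % size else 0)

    hits = sum(1 for _ in range(mine)
               if random.random() ** 2 + random.random() ** 2 <= 1.0)

    # collectives are legal on a one-process communicator, so the
    # single-machine case needs no special handling at all
    all_hits = comm.reduce(hits, op=MPI.SUM, root=0)
    if rank == 0:
        print("pi is roughly", 4.0 * all_hits / total)

if __name__ == "__main__":
    main()
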
Post by Randall Jouett
Post by Donald Becker
It does all of the serial setup and run-time I/O on the front end
machine (technically, the MPI rank 0 node). This minimizes
overall work and keeps the POV-Ray call-out semantics unchanged
It does the rendering only on compute nodes (except for the N=0 case).
It completes the rendering even with crashed or slow nodes.
Ah. So it redistributes the work, huh? Kewl.
Here we use knowledge about the application semantics to implement
failure tolerance. When we have idle workers and the rendering isn't
finished, we send some of the remaining work to the idle machine.
If a machine fails we still finish the rendering and do the final
call-outs, but don't cleanly terminate.
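
A sketch of that bookkeeping (the workers are simulated in-process here,
and the names are made up; in the real port this logic sits on top of
MPI sends and receives):

import random

def render_tile(tile):
    return "pixels-%d" % tile              # stand-in for real rendering work

def render_with_reissue(tiles, n_workers=4, failure_rate=0.3):
    done = {}
    outstanding = set(tiles)
    unassigned = list(tiles)

    while outstanding:
        if not unassigned:                     # everything was sent once already;
            unassigned = list(outstanding)     # re-issue whatever is still missing
        batch = [unassigned.pop() for _ in range(min(n_workers, len(unassigned)))]

        for tile in batch:                     # simulated workers: some "crash"
            if random.random() < failure_rate:
                continue                       # that result never comes back
            if tile in outstanding:            # first copy wins; duplicates from
                done[tile] = render_tile(tile) # re-issues are simply ignored
                outstanding.discard(tile)

    return [done[t] for t in tiles]

print(len(render_with_reissue(range(16))), "tiles rendered despite failures")
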
--
Donald Becker ***@scyld.com
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Scyld Beowulf cluster system
Annapolis MD 21403 410-990-9993
Randall Jouett
2003-01-04 12:07:10 UTC
Permalink
Hello again, Donald.
Post by Donald Becker
Our cluster philosophy is that the end user should not be required to
do anything new or special to run a cluster application.
Great. End users and turn-key solutions are always a nice
thing to have in a business-level environment. Rock on, dewd!
:^)
Post by Donald Becker
That means
Applications should work even if there is only a single machine in
the cluster. Many beginner MPI applications don't handle this case
correctly.
Wow. I would have thought that people would have made plans
to deal with this, especially since something along these lines
can happen, although I'm pretty sure it's rather infrequent.
Go figure.
Post by Donald Becker
Cluster applications should not require a helper program such as
'mpirun' or 'mpiexec'.
In a commercial system, where end users shouldn't and wouldn't
know about such things, I totally agree. OTOH, in a production
environment where the vast majority of users are geekoids, I don't
have a problem with this, especially if mpirun or mpiexec is hidden
by a GUI or something. Since you are doing this as a commercial
endeavor, though, I agree with the way you guys are handling
this, Donald. This lets me and others know that your systems are
well thought out and end-user friendly, and that is something
we all expect when shelling out serious cash for a good
number-cruncher setup.
Post by Donald Becker
The application code should interact with the scheduler to set any
special scheduling requirements or suggestions.
True. Also, this shouldn't be any big deal, and I'd imagine
this is easily done via shell scripts or a quick C hack,
especially if you feel that this part of your code should be
proprietary or something. Personally, I'd want to see something
like this done at the script level, though, so that a geek could
come along and change a few things for tweaks. That's just
me, though. (Shrug.)
Post by Donald Becker
A sophisticated user should still be able to optimize and do clever
things, but the basic operation shouldn't require any new knowledge.
Agreed.
Post by Donald Becker
Post by Randall Jouett
Post by Donald Becker
It does all of the serial setup and run-time I/O on the front end
machine (technically, the MPI rank 0 node). This minimizes
overall work and keeps the POV-Ray call-out semantics unchanged
It does the rendering only on compute nodes (except for the N=0 case).
It completes the rendering even with crashed or slow nodes.
Ah. So it redistributes the work, huh? Kewl.
Here we use knowledge about the application semantics to implement
failure tolerance. When we have idle workers and the rendering isn't
finished, we send some of the remaining work to the idle machine.
Well, I hate to sound like a knothead here, Donald, and I don't
mean to be rude, but isn't this a de facto setup and standard in
a beowulf environment?? If not, what the hell are people thinking
about? :^) :^). To me, this just seems like the logical way to
write code, but what the heck do I know? :^)
Post by Donald Becker
If a machine fails we still finish the rendering and do the final
call-outs, but don't cleanly terminate.
Ah. Ok. Kewl. Sounds logical to me.


Type at ya' later,

Randall
--
Randall Jouett
Amateur Radio: AB5NI

I eat spaghetti code out of a bit bucket while sitting at a hash table!
Florent.Calvayrac
2003-01-04 16:04:33 UTC
Permalink
Post by Donald Becker
This is trivial. Once you have a good serial pseudo-random number
generator, you can use a cluster to generate more.
Sorry to disagree here, but you forget that you cannot be sure
that the initial seed on one of the machines will not
be generated by another one... so you can end up with
completely correlated series. Check the "scalable parallel random
number generator" (SPRNG), which is working well for us, using different
recurrence formulas on different machines.
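
For illustration only (this is not SPRNG itself, just the same
independent-streams principle, sketched with NumPy's SeedSequence
machinery): every node gets a statistically independent child stream
derived from one root seed, so two nodes cannot accidentally share or
overlap a sequence.

import numpy as np

N_NODES = 64
root = np.random.SeedSequence(20030106)     # one root seed for the whole run
children = root.spawn(N_NODES)              # one independent child per node

streams = [np.random.default_rng(child) for child in children]

# each "node" draws from its own stream; no coordination needed after setup
samples = [rng.random(3) for rng in streams]
print(samples[0], samples[1])               # different, uncorrelated sequences
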
--
Florent Calvayrac | Tel : 02 43 83 26 26
Laboratoire de Physique de l'Etat Condense | Fax : 02 43 83 35 18
UMR-CNRS 6087 | http://www.univ-lemans.fr/~fcalvay
Universite du Maine-Faculte des Sciences |
72085 Le Mans Cedex 9
Donald Becker
2003-01-04 17:29:40 UTC
Permalink
Post by Florent.Calvayrac
Post by Donald Becker
This is trivial. Once you have a good serial pseudo-random number
generator, you can use a cluster to generate more.
Sorry to disagree here, but you forget that you cannot be sure
that the initial seed on one of the machines will not
be generated by another one... so you can end up with
completely correlated series. Check the "scalable parallel random
number generator" (SPRNG), which is working well for us, using different
recurrence formulas on different machines.
I was assuming a sophisticated RNG. With such, the likelihood of
identical seeds is very low, exactly the same as correlation within the
number stream. Anyone who needs a cluster to generate random numbers
will be far beyond using an LFSR with a small seed.

I'll even put forth a hand-waving argument that multiple machines will
be working from a much richer entropy pool, and thus generate better
quality numbers.
--
Donald Becker ***@scyld.com
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Scyld Beowulf cluster system
Annapolis MD 21403 410-990-9993
Randall Jouett
2003-01-05 04:31:13 UTC
Permalink
Hi Donald (folks),
Post by Donald Becker
Post by Florent.Calvayrac
Post by Donald Becker
This is trivial. Once you have a good serial pseudo-random number
generator, you can use a cluster to generate more.
Sorry to disagree here, but you forget that you cannot be sure
that the initial seed on one of the machines will not
be generated by another one... so you can end up with
completely correlated series. Check the "scalable parallel random
number generator" (SPRNG), which is working well for us, using different
recurrence formulas on different machines.
I was assuming a sophisticated RNG. With such, the likelihood of
identical seeds is very low, exactly the same as correlation within the
number stream. Anyone who needs a cluster to generate random numbers
will be far beyond using an LFSR with a small seed.
Exactly along the lines I was thinking about, except that I
didn't mention this in my original posting. I already write volumes
in my postings (diatribes? :^), so I tend to let others read between the
lines, when applicable.
Post by Donald Becker
I'll even put forth a hand-waving argument that multiple machines will
be working from a much richer entropy pool, and thus generate better
quality numbers.
Yep, and let's not forget that a semi-decent sized, hardware-encryption
based cluster could be set aside to generate the initial seeds, and then
the seeds could propagate over the entire network, further reducing the
chance that an identical seed will rear its ugly head. Well, for a
certain amount of time, anyway.


To take things even further, another cluster could be used that would
try to guess the random sequences being generated (pattern matching), and
if it found something, it could report back to the entropy cluster and
tell it to change things up and get with the program :^). (Sorry, but I
had to say it :^)


Taking this method a step further still, MPI latency might even be able to
be used (delta time between the compute nodes and the head) to generate
seeds, too, although I'm not really sure how random something
like this would be. OTOH, with thousands and thousands of nodes
participating in the network, maybe a tactic like this could be
useful. (Shrug.)



BTW, this is all off of the top of my head, and I haven't really thought
about this very deeply, but it does seem to make a certain bit of sense
to my way of thinking. OTOH, maybe I'm weird or something, and what the
hell do I know :^). (Shrug.)

Best Regards,

Randall

--
Randall Jouett
Amateur Radio: AB5NI
Florent Calvayrac
2003-01-06 08:28:25 UTC
Permalink
Post by Randall Jouett
Hi Donald (folks),
Post by Donald Becker
Post by Florent.Calvayrac
Post by Donald Becker
This is trivial. Once you have a good serial pseudo-random number
generator, you can use a cluster to generate more.
Sorry to disagree here, but you forget that you cannot be sure
that the initial seed on one of the machines will not
be generated by another one... so you can end up with
completely correlated series. Check the "scalable parallel random
number generator" (SPRNG), which is working well for us, using different
recurrence formulas on different machines.
I was assuming a sophisticated RNG. With such, the likelihood of
identical seeds is very low, exactly the same as correlation within the
number stream. Anyone who needs a cluster to generate random numbers
will be far beyond using an LFSR with a small seed.
Well, theoretically at least, I am right. I agree that the probability
of identical seeds is the same as that of correlation (this is
trivial), but this probability increases with the number of nodes and
the length of the run. Robert G. Brown pointed that out as well.
Post by Randall Jouett
Post by Donald Becker
I'll even put forth a hand-waving argument that multiple machines will
be working from a much richer entropy pool, and thus generate better
quality numbers.
Is this a joke? For sure also, the more expensive the computer, the
better the results are! (especially with blinkenlights)
Post by Randall Jouett
Yep, and let's not forget that a semi-decent sized, hardware-encryption
based cluster could be set aside to generate the initial seeds, and then
the seeds could propagate over the entire network,
To take things even further, another cluster could be used that would
try guess the random sequences being generated (pattern matching), and
if it found something, it could report back to the entropy cluster and
Taking this method a step further still, MPI latency might even be able to
be used (delta time between the compute nodes and the head)
This is certainly much simpler to implement than downloading SPRNG
(from http://sprng.cs.fsu.edu/ )
or just implementing a node-number-dependent recurrence formula on each
machine, as RGB is also doing. Measuring an MPI latency time
(on the order of several thousand cycles), communicating back and forth,
and running the same calculation on another cluster to predict the
results, as you suggest, is also certainly a faster and more elegant way to
achieve independence of the streams.


check

http://www.npaci.edu/online/v3.7/SCAN1.html
--
Florent Calvayrac | Tel : 02 43 83 26 26
Laboratoire de Physique de l'Etat Condense | Fax : 02 43 83 35 18
UMR-CNRS 6087 | http://www.univ-lemans.fr/~fcalvay
Universite du Maine-Faculte des Sciences |
72085 Le Mans Cedex 9
Randall Jouett
2003-01-07 18:21:22 UTC
Permalink
Hello Florent, and thanks for the reply.
Post by Florent Calvayrac
This is certainly much simpler to implement than downloading SPRNG
(from http://sprng.cs.fsu.edu/ )
Thanks for the link. I'll check it out.
Post by Florent Calvayrac
or just implementing a node-number-dependent recurrence formula on each
machine, as RGB is also doing. Measuring an MPI latency time
(on the order of several thousand cycles), communicating back and forth,
and running the same calculation on another cluster to predict the
results, as you suggest, is also certainly a faster and more elegant way to
achieve independence of the streams.
Thanks. The way I see it, if you have latency, you might as well
see if you can somehow put it to good use :^).
Post by Florent Calvayrac
check
http://www.npaci.edu/online/v3.7/SCAN1.html
Will do. Maybe I might even learn something. Kewl :^).

Randall

--
Randall Jouett
Amateur Radio: AB5NI
Bryce Bockman
2003-01-14 19:07:52 UTC
Permalink
Donald Becker
2003-01-04 17:58:51 UTC
Permalink
On Sat, 4 Jan 2003, Randall Jouett wrote:

[[ Topic: POV-Ray modified to use BeoMPI. ]]
Post by Randall Jouett
Post by Donald Becker
Post by Randall Jouett
Post by Donald Becker
It completes the rendering even with crashed or slow nodes.
Ah. So it redistributes the work, huh? Kewl.
Here we use knowledge about the application semantics to implement
failure tolerance. When we have idle workers and the rendering isn't
finished, we send some of the remaining work to the idle machine.
Well, I hate to sound like a knothead here, Donald, and I don't
mean to be rude, but isn't this a de facto setup and standard in
a beowulf environment?? If not, what the hell are people thinking
about? :^) :^). To me, this just seems like the logical way to
write code, but what the heck do I know? :^)
Not at all! MPI does not handle faults. Most MPI applications just
fail when a node fails. A few periodically write checkpoint files, and
a subset ;-) of those can be re-run from the last checkpoint.

With the POV-Ray port I used application-specific knowledge and explicit
code to re-issue the work and handle duplicate results. You can use the
same idea (but unique code) with other MPI applications that don't have
side effects within the time step.

Although the program completes the rendering, there is still much
ugliness when a partially-failed MPI program tries to finish.
--
Donald Becker ***@scyld.com
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Scyld Beowulf cluster system
Annapolis MD 21403 410-990-9993
Randall Jouett
2003-01-05 07:27:35 UTC
Permalink
Post by Donald Becker
Not at all! MPI does not handle faults. Most MPI applications just
fail when a node fails. A few periodically write checkpoint files, and
a subset ;-) of those can be re-run from the last checkpoint.
Checkpoint files? BLAH!!! :^). Admittedly, I'm a total neophyte
when it comes to parallel processing and the beowulf architecture,
but computin' is computin', and I think I might have an "el-cheapo,
ham-operator solution." (Hams are infamous for being TOTAL cheapskates
:^).


Off the top of my head, why couldn't you just plug in an old
10Base-T card to each node? Add a server node that specifically
polls each machine via hardware latch and software response.
Just a quick, "Hey, I'm still here." This fault server would
then send the root/head node a quick "we're running, boss!"
message, or it would tell the root/head node that a particular
machine was down. If the root machine sees a fault message,
it parses the packet, ignores the broken node, then reschedules
the task for execution. It could also send an e-mail to the
sysadmin, page him, and even play a "RED ALERT!" sample from
Trek :^).
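
Something like this, sketched in Python over plain UDP (the hostnames and
ports here are made up): every node sends a tiny heartbeat every couple of
seconds, and the fault server flags anything that goes silent for too long
so the head node can reschedule its work.

import socket, time

HEARTBEAT_PORT = 9999
TIMEOUT = 10.0                               # seconds of silence before "down"

def fault_server(expected_nodes):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", HEARTBEAT_PORT))
    sock.settimeout(1.0)
    last_seen = {node: time.time() for node in expected_nodes}

    while True:
        try:
            data, _addr = sock.recvfrom(64)          # "I'm still here" packet
            last_seen[data.decode()] = time.time()
        except socket.timeout:
            pass
        for node, seen in last_seen.items():
            if time.time() - seen > TIMEOUT:
                print("RED ALERT: %s looks down; tell the head node" % node)

def node_heartbeat(my_name, server="faultserver"):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        sock.sendto(my_name.encode(), (server, HEARTBEAT_PORT))
        time.sleep(2)
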


Now, if you REALLY wanted to be cheap :^), you could do something
like this with a USB hub, although I'm pretty sure it wouldn't
be as fast as the 10Base-T setup. OTOH, 10Base-T gear (e.g. hub,
switch, NICs) can probably be had for the asking at most
institutions, I'd imagine.


BTW, has anyone bothered to calculate all the wasted cycles
used up by check-point files? :^). BTW, I guess you could
also implement something like this in software, having the
root node poll each compute node every so often, but I'm
pretty sure this would probably be kinda "chatty" on the
network and be a waste of bandwidth. I guess you could
monitor network traffic via tcpdump or something and set
the polls to a reasonable level, though. (Shrug.) Hey,
I'm not paid to do this, so I'm not going to get out the
calculator and strain the brain :^).
Post by Donald Becker
With the POV-Ray port I used application specific knowledge and explicit
code to re-issue the work and handle duplicate results.
Well, I like my "mainly hardware" version better :^p. :^) :^)
Post by Donald Becker
You can use the same idea (but unique code) with other MPI
applications that don't have side effects within the time step.
Kewl, and I'd imagine that time is everything in a beowulf
setup.
Post by Donald Becker
Although the program completes the rendering, there is still much
ugliness when a partially-failed MPI program tries to finish.
Hmmm. Why aren't folks flagging the node as dead and ignoring
any other output until the node is back up and saying it's
ready to run? This would have to be verified by the sysadmin,
of course.

Best Regards,

Randall
--
Randall Jouett
Amateur Radio: AB5NI

P.S.

The model I mentioned does have its flaws, of course, such
as a switch or hub going down, or maybe a busted CAT-5 cable
here or there. Something tells me, though, that it HAS to be
infinitely superior to check-point files and the like :^).
That is, if I'm understanding your meaning here of check-point
files. If I'm off base here, Donald, maybe you could clarify?
John Burton
2003-01-06 15:36:27 UTC
Permalink
Post by Randall Jouett
Post by Donald Becker
Not at all! MPI does not handle faults. Most MPI applications just
fail when a node fails. A few periodically write checkpoint files, and
a subset ;-) of those can be re-run from the last checkpoint.
Checkpoint files? BLAH!!! :^). Admittedly, I'm a total neophyte
when it comes to parallel processing and the beowulf architecture,
but computin' is computin', and I think I might have an "el-cheapo,
ham-operator solution." (Hams are infamous for being TOTAL cheapskates
:^).
Ummm...'all "computin" ain't equal'. While checkpoint files might not be
useful for what you do, they save thousands of machine and man hours in
my business. We have gigabytes of raw data from satellites being
recorded per day. Processing a day's worth of data requires 2 days on a
2.5 GHz P4. So, we divide the data into orbits and process the orbits in
parallel. The mathematical model is such that fine-grained parallel
processing is not practical at this time (massive redesign and the
scientists don't understand parallel). If a process dies, then we can go
back to the logs and correct the problem and restart from the last
checkpoint (which was a minute or so ago) instead of starting over at
the beginning, which could be as much as 24 hours ago...
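
Sketched roughly (with made-up names, not the real processing code), that
per-orbit scheme is ordinary coarse-grained parallelism, and a crude
restart falls out of it almost for free by skipping any orbit whose output
file already exists:

import os
from multiprocessing import Pool

def process_orbit(orbit_id):
    out = "orbit_%03d.out" % orbit_id
    if os.path.exists(out):              # already done in an earlier (crashed) run
        return out
    result = "processed orbit %d" % orbit_id   # stand-in for the real model run
    with open(out, "w") as f:
        f.write(result)
    return out

if __name__ == "__main__":
    orbits = range(14)                   # roughly one day's worth of orbits
    with Pool(processes=4) as pool:
        outputs = pool.map(process_orbit, orbits)
    print(len(outputs), "orbits processed")
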
Post by Randall Jouett
Off the top of my head, why couldn't you just plug in an old
10Base-T card to each node. Add a server node that specifically
polls each machine via hardware latch and software response.
Just a quick, "Hey, I'm still here." This fault server would
then send the root/head node a quick "we're running, boss!"
message, or it would tell the root/head node that a particular
machine was down. If the root machine sees a fault message,
it parses the packet, ignores the broken node, then reschedules
the task for execution. It could also send an e-mail to the
sysadmin, page him, and even play a "RED ALERT!" sample from
Trek :^).
Apparently you are not current on cluster technology, or you wouldn't be
proposing something that is common knowledge.
Post by Randall Jouett
Now, if your REALLY wanted to be cheap :^), you could do something
like this with a USB hub, although I'm pretty sure it wouldn't
be as fast as the 10Base-T setup. OTOH, 10Base-T gear (e.g. hub,
switch, NICs) can probably be had for the asking at most
institutions, I'd imagine.
10Base-T is too slow for a typical parallel application. Switched
100Base-T is almost as inexpensive.
Post by Randall Jouett
BTW, has anyone bothered to calculate all the wasted cycles
used up by check-point files? :^).
Yup, and it is significantly less than the number of cycles that would
be wasted having to rerun 24 hours worth of processing because a machine
hiccuped and the process died...
Post by Randall Jouett
Randall
--
Randall Jouett
Amateur Radio: AB5NI
P.S.
The model I mentioned does have its flaws, of course, such
as a switch or hub going down, or maybe a busted CAT-5 cable
here or there. Something tells me, though, that it HAS to be
infinitely superior to check-point files and the like :^).
That is, if I'm understanding your meaning here of check-point
files. If I'm off base here, Donald, maybe you could clarify?
In my world a checkpoint file is a "snapshot" of the state of a running
process at a given time. This "snapshot" is complete enough to restart
the process at that point should it fail at a later point.
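
A tiny sketch of that definition (the "state" here is just a counter and an
accumulator, and the filenames are made up): write the snapshot atomically
every N steps, and on startup resume from whatever snapshot exists.

import os, pickle

CKPT = "run.ckpt"

def save_checkpoint(state):
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)                # atomic rename: never a torn snapshot

def load_checkpoint():
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0.0}     # no snapshot yet: fresh start

state = load_checkpoint()
for step in range(state["step"], 1_000_000):
    state["total"] += step * 1e-6        # stand-in for processing one record
    state["step"] = step + 1
    if state["step"] % 10_000 == 0:      # checkpoint interval is the tuning knob
        save_checkpoint(state)
print("done:", state["total"])
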

John
Randall Jouett
2003-01-07 18:22:12 UTC
Permalink
Howdy John, and thanks for the reply.
Post by John Burton
Ummm...'all "computin" ain't equal'. While checkpoint files might not be
useful for what you do, they save thousands of machine and man hours in
my business.
With all kidding aside, I can see how (in some applications)
check-point files are an absolute necessity. My only beef
with the situation is that a large amount of time is being
spent doing I/O on a "maybe." I do, however, see how they
can be useful.
Post by John Burton
Apparently you are not current on cluster technology, or you wouldn't be
proposing something that is common knowledge.
As I've said from the beginning, I'm a complete and total
neophyte when it comes to parallel processing and beowulf.
I'm a contractual software engineer, and I specialize in
data-acquisition and process-control software for the oil biz.
I also work part time for Stewart Technology Associates.
(If interested, check out www.stewart-usa.com, although
be forewarned that it's a pretty geeky site :^). One of the
reasons I'm on this list is to look at the feasibility of
using a beowulf cluster for computational fluid dynamics.
Rather than send the work down road or have the calculations
take a few days to run on a single PC, I'd like to do this
internally and set up an in-house cluster on some old PCs we
have hanging around.


When I'm not busy doing all of that, I'm usually visiting
my family in Louisiana, and I also do some network administration
in the business world for some friends that own local businesses.
With networking comes the added responsibility of computer security,
of course, which I also help to implement at STA in Houston. I've also
been known to take on embedded-control projects or design a bit of
hardware to solve a given task.


BTW, I'm also on the list to see if I can learn a bit about cluster
administration. So, if the oil biz falls off in a serious way, I might
be able to find a decent job quickly. For the most part, though, I'm
here for FUN, and I enjoy brainstorming with the learned folks here to
see if I my "somtimes old news to the experienced" ideas have any form
of validity. If it's old hat, no big deal. If not, then maybe I might
come up with an idea that could actually help someone. Let's just say
that doing this is a creative outlet that also helps me to understand
all of this stuff quickly. Unfortunately, though, some may see this
as mucking up the signal-to-noise ratio, raising the noise floor to
an unappreciated level." In many respects, they'd be correct! :^). I can
only say that this isn't my intention, and I'm only doing this to
see if my layman ideas and thought processes are at least in the
ballpark when it comes to parallel processing and beowulf.
Post by John Burton
10Base-T is too slow for typical parallel application. Switched
100Base-T is almost as inexpensive.
Great. This is good to know. OTOH, I was trying to think of
ways to put really antiquated gear to use. After all, and
from what I've been reading, doing things cheaply is what
beowulf is all about.
Post by John Burton
In my world a check point file is a "snapshot" of the state of running
process at a given time. This "snapshot" is complete enough to restart
the process at that point should it fail at a later point.
OK. Thanks for the explanation, and that was the way I was thinking
about it when Donald and I were discussing this earlier.


Best Regards,

Randall

--
Randall Jouett
Amateur Radio: AB5NI
Greg Lindahl
2003-01-07 19:42:45 UTC
Permalink
Post by Randall Jouett
With all kidding aside, I can see how (in some applications)
check-point files are and absolute necessity. My only beef
with the situation is that a large amount of time is being
spent doing IO on a "maybe." I do, however, see how they
can be useful.
Most people don't waste a large amount of time. What they do is compare
the average loss of computation due to a failure with the loss of
computation due to the extra I/O.

Example: My machine fails on average every 24 hours. It takes me 1
hour to checkpoint. Therefore if I checkpoint every 8 hours, the
average loss from a failure is 4 hours, and I spend 3 hours doing I/O.

That's an ASCI-class example; most small clusters only need a few
minutes to checkpoint and have a failure every month.
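
The same arithmetic, worked as a quick calculation (assuming failures land
uniformly within a checkpoint interval, so the average recomputation lost
is about half the interval):

def checkpoint_overhead(mtbf_h, ckpt_cost_h, interval_h):
    io_cost = (mtbf_h / interval_h) * ckpt_cost_h   # hours of checkpoint I/O per failure period
    avg_loss = interval_h / 2.0                     # average recomputation after a failure
    return io_cost, avg_loss

# the ASCI-class example above: fails every 24 h, 1 h to checkpoint, checkpoint every 8 h
io, loss = checkpoint_overhead(24, 1, 8)
print("checkpoint I/O: %.0f h, average loss per failure: %.0f h" % (io, loss))   # 3 h and 4 h
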

-- greg
Joe Nellis
2003-01-07 20:22:02 UTC
Permalink
Greetings,

I purchased the Scyld CD-ROM from Linux Central around June 2001. Is the CD
currently being sold the same one or a new version since then?

Sincerely,
Joe
Mark Hahn
2003-01-04 18:19:40 UTC
Permalink
Post by Randall Jouett
Personally, I don't think so, especially if we consider the
fact that in the not-too-distant future, networking speeds
will be up to snuff with the various tasks at hand. With these
ah! I think this is the central fallacy that drives grid enthusiasm.

there simply is no coming breakthrough that will make all networking
fast, low-latency, cheap, ubiquitous and low-power. and grid
(in the grand sense) really does require *all* those properties.
oh, you will certainly manage to do some very interesting things
with wimpier networking, but with major compromises. I don't see
people doing parallel weather sims over 802.11*-connected nodes
any time soon. but ***@home-type applications (very loosely coupled
and coarse-grained) would be a fine way to keep my fridge's brain
busy. on the other hand, a fridge will always be a tiny fraction
of the compute power of a desktop, so is it worth it? not to mention
the fact that ***@fridge will jack up my monthly power bill...

ultrawideband is an interesting development for this kind of networking,
perhaps also in the optical range. anyone interested in this stuff should
read Robert Forward and Vernor Vinge's books (SF novels).

ps: I don't mean grid stuff isn't worthwhile, or that we can't do
any of it until the perfect network arrives. there's lots of great
work going on - p2p networking, java/jini/jxta, etc. I just don't see
it being relevant to the beowulf world very soon, or ever being as
grand as the starry-eyed gridophiliacs would like to predict...
Erik Paulson
2003-01-05 17:51:59 UTC
Permalink
Post by Mark Hahn
Post by Randall Jouett
Personally, I don't think so, especially if we consider the
fact that in the not-too-distant future, networking speeds
will be up to snuff with the various tasks at hand. With these
ah! I think this is the central fallacy that drives grid enthusiasm.
Then you clearly don't understand grid computing. That's understandable,
because "grid computing" has become a hijacked term - just like when
most people say "hacker" they really mean "cracker". Most of the press
coverage of grid computing in the past few months has simply been wrong
about what's really going on, and most of the new grid products and
grid companies are either useless or really don't have very much to do
with grid computing. They're just on the grid bandwagon because it's trendy,
and it's pretty frustrating for those of us who are doing real work in the
area, because the marketing drones are drowning out the real message.
Post by Mark Hahn
there simply is no coming breakthrough that will make all networking
fast, low-latency, cheap, ubiquitous and low-power. and grid
(in the grand sense) really does require *all* those properties.
Grid computing does not require any of this. Grid computing is all about
access and coordination. Grid computing is much more than just
running naturally (embarrassingly) parallel problems on spare cycles on
every computer people can find. Certainly ***@Home and distributed.net
have been successful and are probably the first sorts of applications
that will be able to make serious use of "the grid" (if there's ever a real
"the grid" that emerges, though it's more likely there will be several large
"the grids" and many internal ones)
Post by Mark Hahn
oh, you will certainly manage to do some very interesting things
with wimpier networking, but with major compromises. I don't see
people doing parallel weather sims over 803.11*-connected nodes
and coarse-grained) would be a fine way to keep my fridge's brain
busy. on the other hand, a fridge will always be a tiny fraction
of the compute power of a desktop, so is it worth it? not to mention
No, it's not. I'm convinced that there will never be a market for cycles
recovered from the home user - it's just not worth it. Pretend that compute
power is (physical) storage space - nearly everyone in the world has some
extra closet space (or at least I do, but I'm single :). Nearly every
company in the world has some storage requirements - maybe it's worthwhile
to rent some of that extra closet space - if I ever need my space back, I'll
just send back the box the company sent me to store. But now the company
needs to:
1. Be able to put their stuff into boxes small enough to fit into my closet
2. Handle keeping track of all of those boxes
3. Verify that I never messed with the contents of that box when I get it
back.

It's just not worth it - probably not for the company, and certainly not for
me - whatever they're paying me it's not likely enough for me to keep their
box safe and send it back to them on a whim. There may be a few things I'd
consider doing that for, though - if the Red Cross told me they wanted to
store a box of emergency supplies in my closet (and the closets of all of my
neighbors) and all I had to do was pull them out if they ever needed them,
then I'd probably do that - this is the model that ***@home and
***@home are using and it's working for them.

However, it's probably worth it for the company to do that internally -
they've got control over all of the closets anyway, so for some of their stuff
they should try and reclaim that space. (I think we've all got signs at work
that say "Physical Plant storage only" :)

Think of "grid computing" as a shipping and warehousing company (actually,
after the power crisis in California a while back naming something after
the electric grid is kind of a bad idea :). You don't want to have to
build and maintain your own warehouse, you'd rather get it from someone
else. This warehouse(grid) company will have the appropriate resources for
what I want to do - if I just want to store lots of little boxes then
maybe I don't care much if they use lots of little 10x10 storage units. Or,
maybe my stuff is too big (maybe I need to store the shuttle for a few days :)
so I need a huge, huge warehouse(cluster). I don't want to have separate
billing and addressing methods for this; I just want to say "I need to store
this" and have it happen. And not only do I need just space, but maybe I
want to consider location - if my factory is in northern Wisconsin, I'd
rather not ship my widgets to California if there's closer, unused space in
Milwaukee.

Some companies will setup their own, internal distribution/grids - think of
Walmart - and inside the company they'll deal with however the cost recovery
method needs to work. Others will get it from the big boys - you'll want
someone you can trust, so you're more likely to use FedEx than Fly-By-Night
Shipping, Inc. The important point is that access to it is basically the
same - you've got a box that needs to go somewhere - FedEx and UPS both
take packages with the same address, only the billing is a bit different.

There are some cooler things that Grid Computing will let you do that aren't
really covered in a shipping analogy. First off, it's easy to create
free-wheeling deals with other sites - maybe I can get access to another cluster
down the road, and I can use the standard grid interfaces to it, instead of
having to learn where all the software is installed, remember my username on
the machine, which batch system it's running, etc etc etc. There are also
possibilities for levels of indirection and middlemen - maybe the American
Association of Physicists will buy 1 million CPU hours for its members.
Physicists will just go to grid://aap.org* and submit their jobs. AAP.org will
source 750,000 hours from IBM.com, 10,000 from doe.gov, 100,000 from
GridStartup.com, and so forth. When the million hours start to run out,
AAP.org will deal with buying more.

I think that (at least for the next few years) linux-based beowulfs will
be the main building blocks of these sorts of Grids. This doesn't mean that
we'll stick 16-node clusters at 10 sites, and haphazardly schedule MPI jobs
across the 160 CPU's - clearly, tightly-coupled codes will stay together.
Consider TeraGrid (teragrid.org) - it's 4 (5 now that the Pittsburgh TCS-1
will be part of it) large clusters. Most jobs will run on one cluster at a
time - certainly some will span multiple clusters on the grid, probably as
two tightly coupled instances exchanging coarse-grained boundary info or some
such. (Teragrid does have a 40Gbps connection between the 4 sites, but it
still takes those 40Gb some time to cross between Illinois and California :)
If your job can tolerate the latency, then go ahead and schedule it
wherever. If it can't, then don't. Grid Computing doesn't throw 40 years
of parallel computing basics out the window.

Grid Computing is not going to replace cluster computing - it's a
complementary style of computing. Some people are going to (and with good
reason) still build their own clusters and keep them in house. I think that
in the future (when grid computing is more reliable) that many of the people
who currently buy 32 node clusters and then have them at about 5% utilization
will be better off going to one of the (someday to exist) grid providers. What
the grid community really hopes to see is that the (much larger) percentage of
scientists who currently don't use computing but should will be able to
get into it - even with all the wonderful work that Scyld and LinuxNetworX and
the like have been doing to make turn-key clustering easy, I'd still guess
that only 1 out of every 10 or 1 out of every 100 people who could do better
science with more computing will. (And of course, there are other people than
just scientists. The arts, business, etc)
Post by Mark Hahn
ultrawideband is an interesting development for this kind of networking,
perhaps also in the optical range. anyone interested in this stuff should
read Robert Forward and Vernor Vinge's books (SF novels).
ps: I don't mean grid stuff isn't worthwhile, or that we can't do
any of it until the perfect network arrives. there's lots of great
work going on - p2p networking, java/jini/jxta, etc.
The great thing about a downturn in the economy is that pure hype doesn't
survive. At SC01, P2P companies were all over the place, and at SC02
most of them no longer had a booth. Hopefully a good amount of
the deadwood in grid computing will be culled out this year (and we'll
have a different set of companies at SC03 that won't make it to the next one :)

-Erik
Post by Mark Hahn
I just don't see
it being relevant to the beowulf world very soon, or ever being as
grand as the starry-eyed gridophiliacs would like to predict...
Randall Jouett
2003-01-05 04:31:46 UTC
Permalink
Post by Mark Hahn
Post by Randall Jouett
Personally, I don't think so, especially if we consider the
fact that in the not-too-distant future, networking speeds
will be up to snuff with the various tasks at hand. With these
ah! I think this is the central fallacy that drives grid enthusiasm.
Maybe so, but the fact still remains that there are
certain tasks out there that will most definitely take
advantage of more nodes being available. Not only
that, but the application will most definitely not
care where the nodes are coming from. In this instance,
all that should really matter is "the more nodes the
better."


BTW, I still stand by my statement that networking
speeds will increase, and also that this increase
will add to overall computing productivity.
Post by Mark Hahn
there simply is no coming breakthrough that will make all networking
fast, low-latency, cheap, ubiquitous and low-power. and grid
(in the grand sense) really does require *all* those properties.
Well, I agree, to some extent, but where I differ from you
is with the word "latency." As I just said, there are applications
that will take advantage of the availability of more nodes.
So you don't get your answer in 2.3 microseconds. In a lot
of instances, nobody will actually care, and in some programming
situations, it will be downright miraculous that you can even
get an answer in a day! :^)
Post by Mark Hahn
oh, you will certainly manage to do some very interesting things
with wimpier networking, but with major compromises. I don't see
people doing parallel weather sims over 802.11*-connected nodes
any time soon.
Agreed, but they could be around the corner. IMHO, speculation
in this area would at least be a productive use of one's time,
and it sure as hell beats the crap out of watching TV :^).
Post by Mark Hahn
and coarse-grained) would be a fine way to keep my fridge's brain
busy.
NOW we're in the same ballpark, Mark :^).
Post by Mark Hahn
on the other hand, a fridge will always be a tiny fraction
of the compute power of a desktop, so is it worth it? not to mention
I don't think you'd even notice the fluctuation in your power bill,
especially when you consider the current draw of the compressor :^).
Well, you'd probably notice it a bit, but also remember that we're
using a currently viable application for comparison here and that
there is no telling what some ultra-creative geek might dream up
in the future.
Post by Mark Hahn
ultrawideband is an interesting development for this kind of networking,
perhaps also in the optical range. anyone interested in this stuff should
read Robert Forward and Vernor Vinge's books (SF novels).
I was thinking about using a setup like this above 300 GHz,
where the spectrum isn't monitored by the FCC, if I remember
correctly.
Post by Mark Hahn
ps: I don't mean grid stuff isn't worthwhile, or that we can't do
any of it until the perfect network arrives. there's lots of great
work going on - p2p networking, java/jini/jxta, etc. I just don't see
it being relevant to the beowulf world very soon, or ever being as
grand as the starry-eyed gridophiliacs would like to predict...
I totally agree here, Mark. OTOH, with enough people focusing
their efforts in this arena, there is probably no telling what
some of these people will come up with. Also note that we're
really talking time and distance here, and something like
a billion cubed nanobots cranking on a problem, taking up
an inch of physical space, has to be something that we
should all consider. Sure, it won't happen any time soon,
but the ramifications will surely be mind boggling. Can you
imagine saying something like this to your nano-bot cluster:

"Bots...You see that chick on TV? Well, I want you to duplicate
her overall looks, but please increase her bust size by 4 inches,
and make her appear right there, nude, and with a sexy grin on
her face." :^) :^). Move over holodeck :^) :^) :^).

Randall
--
Randall Jouett
Amateur Radio: AB5NI
Eugen Leitl
2003-01-04 19:20:50 UTC
Permalink
Post by Mark Hahn
there simply is no coming breakthrough that will make all networking
fast, low-latency, cheap, ubiquitous and low-power. and grid
(in the grand sense) really does require *all* those properties.
I'm not quite sure. The only hard limit on latency is relativistic (in
vacuum, 1 ns = 0.3 m; 10 ns = 3 m, 100 ns = 30 m; 1 us = 3 km; 10 us = 30
km, 100 us = 300 km). Right now, commercial networks based on GBit fiber
Ethernet backbones exist, delivering sub-ms latency to end consumers. 10
GBit fiber Ethernet will be starting to displace GBit Ethernet in that
niche. At 10 Gbps, fiber acts as a FIFO, containing ~50 bit/m (50 kbit/km)
of fiber allowing (admittedly, there is no impetus for developing
cut-through WAN transmission technology) almost purely photonically
switched networks where routing latency is negligible in regards to
relativistic latency. That assumes that the fiber(s) is unloaded, of
course, as store-and-forward will suddenly result in lousy latency. This
can't happen on a true crossbar-switched LAN.

This clearly can't compete with dedicated ultralocal interconnects like
Myrinet & Co, but it indicates GBit based clusters need not to be located
physically close.
Patrick Geoffray
2003-01-04 21:42:19 UTC
Permalink
Post by Eugen Leitl
Post by Mark Hahn
there simply is no coming breakthrough that will make all networking
fast, low-latency, cheap, ubiquitous and low-power. and grid
(in the grand sense) really does require *all* those properties.
I'm not quite sure. The only hard limit on latency is relativistic (in
vacuum, 1 ns = 0.3 m; 10 ns = 3 m, 100 ns = 30 m; 1 us = 3 km; 10 us = 30
km, 100 us = 300 km).
Isn't it 300 km = 1 ms?
In a fiber it's almost half of that.
Post by Eugen Leitl
cut-through WAN transmission technology) almost purely photonically
switched networks where routing latency is negligible in regards to
relativistic latency.
With a routing latency on the order of 1 us for current GigE hardware,
it's already negligible compared with the relativistic latency on a WAN or MAN
(> 300 m). On a LAN, this is another story.
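
A small back-of-the-envelope sketch in C of the same arithmetic (the ~0.30 m/ns
vacuum and ~0.20 m/ns fiber figures are the usual rough values, roughly two
thirds of c for fiber):

#include <stdio.h>

int main(void)
{
    /* assumed propagation speeds: ~0.30 m/ns in vacuum, ~0.20 m/ns in fiber */
    const double c_vac = 0.30e9;   /* metres per second */
    const double c_fib = 0.20e9;
    const double km[] = { 0.3, 3.0, 30.0, 300.0 };
    int i;

    for (i = 0; i < 4; i++) {
        double m = km[i] * 1000.0;
        printf("%6.1f km: vacuum %7.1f us, fiber %7.1f us one-way\n",
               km[i], 1e6 * m / c_vac, 1e6 * m / c_fib);
    }
    /* bits "in flight" on a 10 Gbit/s fiber: bandwidth / propagation speed */
    printf("10 Gbit/s fiber holds about %.0f bits per metre\n", 10e9 / c_fib);
    return 0;
}

At 300 km the one-way time is about 1 ms in vacuum and roughly 1.5 ms in fiber,
so a ~1 us switch really is lost in the noise on a WAN or MAN.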
Post by Eugen Leitl
This clearly can't compete with dedicated ultralocal interconnects like
Myrinet & Co, but it indicates GBit based clusters need not to be located
physically close.
With clusters becoming larger and larger, the local components may already
not be physically close. For example, you may need 2 or more floors,
or several machine rooms in different buildings on the same site
(a university). There is a real need for that, and that's why the next
Myrinet switching blade offers Long-Haul (100 km and more) fiber
ports.
Now, even for GigE, the gap between the routing latency and the
relativistic latency becomes too large for distances greater
than 100 km. So you can aggregate several clusters in a Grid way, but it
would only be interesting for embarrassingly parallel code if the clusters are
scattered across the US or Europe, for example.

My 2 coins.

Patrick
--
Patrick Geoffray, Phd
Myricom, Inc.
http://www.myri.com
Mark Hahn
2003-01-04 22:11:39 UTC
Permalink
Post by Eugen Leitl
Post by Mark Hahn
there simply is no coming breakthrough that will make all networking
fast, low-latency, cheap, ubiquitous and low-power. and grid
(in the grand sense) really does require *all* those properties.
I'm not quite sure. The only hard limit on latency is relativistic (in
vacuum, 1 ns = 0.3 m; 10 ns = 3 m, 100 ns = 30 m; 1 us = 3 km; 10 us = 30
km, 100 us = 300 km).
sure, though I admit I was thinking about 1 ns per foot ;)

I think the appeal of Grids is to farm really large collections of
underutilized processors. to me, that means more than just a couple
of buildings worth (which would be in ~5 us diameter - a typical
latency for serious non-loosely-coupled clustering today.)
Post by Eugen Leitl
Right now, commercial networks based on GBit fiber
Ethernet backbones exist, delivering sub-ms latency to end consumers.
1 ms! jeez, I think that stretches the definition of clustering,
even the incredibly loosely-coupled end of the field that people call Grid.
Post by Eugen Leitl
10 GBit fiber Ethernet will be starting to displace GBit Ethernet in that
niche.
a nice bandwidth improvement. no help for latency, and really, the existence
of a 10Gb campus backbone might not change real app-level performance much.
(the mere existence of 10Gb will generate competing consumers. for instance,
I expect any university with a 10Gb backbone already does VOIP and possibly
video over it, not to mention security card scanners, centralized file/backup
services, more undergrad web surfing...)
Post by Eugen Leitl
At 10 GBps fiber acts as a FIFO, containing ~50 bit/m (50 kBit/km)
of fiber allowing (admittedly, there is no impetus for developing
cut-through WAN transmission technology) almost purely photonically
switched networks where routing latency is negligible in regards to
relativistic latency. That assumes that the fiber(s) is unloaded, of
course, as store-and forward will suddenly result in lousy latency. This
can't happen on a true crossbar-switched LAN.
it's neat stuff. being a pessimist, I look at "street-level" bandwidth.
I guesstimate that over internet WANs in a typical non-podunk university,
I've seen 2-4x improvement in bandwidth versus 5 years ago. no real change
in latency, since it's geographic. I don't know about you, but my DSL
at home, considered a pretty good service, is capped at 100 KB/s down
and a measly 15 KB/s up. that's only moderately better than dialup
(even ignoring the fact that cable/dsl providers are usually a lot
more hops away than a university's own modem pool, meaning vastly more lossy)
Post by Eugen Leitl
This clearly can't compete with dedicated ultralocal interconnects like
Myrinet & Co, but it indicates GBit based clusters need not to be located
physically close.
for ***@home type stuff, which is extremely latency-tolerant. of course,
this is not news at all, since mainframes have been sharing files over
~10 mile distances for a long time. I still don't see that much has changed:
smallish improvements (10X or less in 5 years) and somewhat cheaper.

I think the main point is that networking is improving much more slowly
than many other metrics related to computers. though as usual, if you
only compare network latency to disk and dram latency, they all look
pretty similarly flat.

but that flatness was my original point: Grid, if it is to break out of the
***@home ghetto, must assume that networking will improve dramatically.
that doesn't seem to be happening, for physical, practical, political
and economic reasons...
Greg Lindahl
2003-01-05 08:36:07 UTC
Permalink
Post by Mark Hahn
but that flatness was my original point: Grid, if it is to break out of the
that doesn't seem to be happening, for physical, practical, political
and economic reasons...
I don't know what Grid you're talking about, but the one pioneered by
projects such as Legion and Globus has always realized that network
latency is limited by the speed of light. That and the fact that large
collections of machines have continual small failures are the two
features that separate grid computing from cluster or parallel computing.

Meanwhile, network bandwidth, which is needed by the actual Grid, is
improving quite rapidly.

-- greg
Halpern, Joshua
2003-01-04 22:32:23 UTC
Permalink
SNIP....
Post by Randall Jouett
That's what I was thinking. Can you say 1 billion bits of
RSA encryption? I knew you could :^). My only worry here is
that with a few hundred thousand nodes, would the NSA be able
to decrypt stuff relating to national security? That is, could
the government actually afford to purchase on-site equipment
that could keep up with P2P, Internet-based clustering, if you
will? Personally, I think they'd have to jump in on the bandwagon
with everyone else, yet with the current level of "Big Brother
is watching us all" attitude floating around, I seriously doubt
that people would knowingly allow them to use their free cycles
for decryption without some mega-serious watch-dogging geeks
watching their every move.
You are not nearly paranoid enough to work for NSA. They
would never borrow cycles out of house, for fear that
by looking at what was happening, someone could figure
out what they are doing...

Josh Halpern
Joseph Landman
2003-01-04 23:37:51 UTC
Permalink
Post by Halpern, Joshua
You are not nearly paranoid enough to work for NSA. They
would never borrow cycles out of house, for fear that
by looking at what was happening, someone could figure
out what they are doing...
Now with a slight extension of what you indicate... are they building
their own power plants for fear of tipping off the watchful eyes of
others by the size of their power bill? Are they building their own
supers (apart from clusters), or network hardware, or ...
--
Joseph Landman, Ph.D.
Scalable Informatics LLC
email: ***@scalableinformatics.com
web: http://scalableinformatics.com
voice: +1 734 612 4615
fax: +1 734 398 5774
Randall Jouett
2003-01-05 05:03:30 UTC
Permalink
Howdy Josh,
Post by Halpern, Joshua
You are not nearly paranoid enough to work for NSA. They
would never borrow cycles out of house, for fear that
by looking at what was happening, someone could figure
out what they are doing...
LOL! Thanks, I think :^). All kidding aside, I often
wonder if they'll have to out-source their cycles, though,
just to keep up with something like grid-type encryption,
if such a thing exists. If it doesn't, I'm sure it will
soon. Hmmm. Maybe they could create a really addictive,
massive-multi-player game (such as EverQuest), release it
for -- AHEM! -- "free," and hide their decryption code
inside of the thing, transmitting small groups of processed
packets back to their servers via an obfuscated path. Hmmm.
I like that idea. :^)


Ok, NSA. Are you guys impressed? If so, I'll come to work
for $300,000.00 a year, all expenses paid, and you'll also
have to find a Sandra Bolock (sic?) clone to fix me up with
:^) :^) :^).

Randall
--
Randall Jouett
Amateur Radio: AB5NI
Halpern, Joshua
2003-01-05 02:32:21 UTC
Permalink
Post by Halpern, Joshua
You are not nearly paranoid enough to work for NSA. They
would never borrow cycles out of house, for fear that
by looking at what was happening, someone could figure
out what they are doing...
Now with a slight extension of what you indicate... are they building
their own power plants for fear of tipping off the watchful eyes of
others by the size of their power bill? Are they building their own
supers (apart from clusters), or network hardware, or ...

Have you ever visited Ft. Meade? You do know the standing
joke that NSA stands for no such agency.

josh halpern
Joseph Landman
2003-01-05 02:54:56 UTC
Permalink
Post by Halpern, Joshua
Have you ever visited Ft. Meade? You do know the standing
joke that NSA stands for no such agency.
So I have heard...

Must have made appropriations rather interesting in the past.
--
Joseph Landman, Ph.D.
Scalable Informatics LLC
email: ***@scalableinformatics.com
web: http://scalableinformatics.com
voice: +1 734 612 4615
fax: +1 734 398 5774
Dean Johnson
2003-01-05 03:33:30 UTC
Permalink
Post by Joseph Landman
Post by Halpern, Joshua
You are not nearly paranoid enough to work for NSA. They
would never borrow cycles out of house, for fear that
by looking at what was happening, someone could figure
out what they are doing...
Now with a slight extension of what you indicate... are they building
their own power plants for fear of tipping off the watchful eyes of
others by the size of their power bill? Are they building their own
supers (apart from clusters), or network hardware, or ...
Have you ever visited Ft. Meade? You do know the standing
joke that NSA stands for no such agency.
Having been to the Fort a couple of times, I would be awfully surprised
if they didn't have their own power plants of some sort. Do you really
want the lion's share of the intelligence-gathering apparatus taken out
when some idiot hits the wrong lightpole? You also have to consider the
amount of toys that they employ in their day-to-day activities.

And yes, I would guess that they probably do build their own
supercomputers in some cases. They have very specific requirements and a
serious mandate to make it happen, even before Bin Laden decided to
rearrange any skylines. It's just a matter of efficiencies and cost
effectiveness. Regardless of how much they wish hardware fairies
existed, they are affected by the same timeline to develop hardware and
software that commercial companies are.

While I am not privy to their real magic stuff, the place is awfully
impressive and the people are rightfully steeped in paranoia. Not many
places do you hear "if you walk down that hall, you will be shot dead"
and have it be a serious statement.

-Dean
Donald Becker
2003-01-05 21:13:52 UTC
Permalink
Post by Randall Jouett
Post by Donald Becker
Not at all! MPI does not handle faults. Most MPI applications just
fail when a node fails. A few periodically write checkpoint files, and
a subset ;-) of those can be re-run from the last checkpoint.
Checkpoint files? BLAH!!! :^). Admittedly, I'm a total neophyte
Application-specific checkpoint files are sometimes the only effective
way to handle node crashes.
Post by Randall Jouett
Off the top of my head, why couldn't you just plug in an old
10Base-T card to each node. Add a server node that specifically
The problem isn't just detecting that a node has failed (which is either
trivial or impossible, depending on your criteria), the problems are
- handling a system failure during a multiple-day run.
- handling partially completed work issued to a node
- killing processes/nodes that you think have failed, lest they
complete their work later.
Post by Randall Jouett
BTW, has anyone bothered to calculate all the wasted cycles
used up by check-point files? :^).
Checkpointing is very expensive, and most of the time the checkpoint
isn't used. This is why only application-specific checkpointing makes
sense: the application writer knows which information is critical, and
when everything is consistent. Machines that save the entire memory
space have been known to take the better part of an hour to roll out a
large job.
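
In practice "application-specific" usually means the code itself periodically
dumps only the state it needs to restart. A minimal sketch of that pattern in
C, with a hypothetical state struct and file name (write to a temporary file
and rename it into place, so a crash mid-write can never clobber the last good
checkpoint; a production version would also fsync before the rename):

#include <stdio.h>

/* hypothetical application state -- only what is needed to restart the run */
struct state { long step; double field[1024]; };

/* write to a temp file, then rename: a crash mid-write never clobbers
   the last good checkpoint */
static int checkpoint(const struct state *s, const char *path)
{
    char tmp[256];
    FILE *fp;

    snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if ((fp = fopen(tmp, "wb")) == NULL) return -1;
    if (fwrite(s, sizeof(*s), 1, fp) != 1) { fclose(fp); return -1; }
    if (fclose(fp) != 0) return -1;
    return rename(tmp, path);
}

static int restore(struct state *s, const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL) return -1;    /* no checkpoint yet: start from scratch */
    if (fread(s, sizeof(*s), 1, fp) != 1) { fclose(fp); return -1; }
    return fclose(fp);
}

int main(void)
{
    static struct state s = { 0, { 0.0 } };
    if (restore(&s, "run.ckpt") != 0) s.step = 0;
    for (; s.step < 1000000; s.step++) {
        /* ... one unit of real work would go here ... */
        if (s.step % 100000 == 0) checkpoint(&s, "run.ckpt");
    }
    return checkpoint(&s, "run.ckpt");
}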
Post by Randall Jouett
Post by Donald Becker
Although the program completes the rendering, there is still much
ugliness when a partially-failed MPI program tries to finish.
Hmmm. Why aren't folks flagging the node as dead and ignoring
any other output until the node is back up and saying it's
ready to run. This would have to be verified by the sysadmin,
of course.
The issue is the internal structure of the MPI implementation: there is
no way to say "I'm exiting sucessfully even though I know some processes
might be dead." Instead what happens is that the library call waits
around for the dead children to return.

This brings us back to the liveness test for compute nodes. When do we
decide that a node has failed? If it doesn't respond for a second? A
transient Ethernet failure might take up to three seconds to restart
link (typical is 10 msec.) Thirty seconds? A machine running a 2.4
kernel before 2.4.17 might take minutes to respond when recovering from
a temporary memory shortage, but run just fine later.
--
Donald Becker ***@scyld.com
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Scyld Beowulf cluster system
Annapolis MD 21403 410-990-9993
Randall Jouett
2003-01-06 16:31:42 UTC
Permalink
Hello again, Donald/gang.
Post by Donald Becker
Post by Randall Jouett
Post by Donald Becker
Not at all! MPI does not handle faults. Most MPI applications just
fail when a node fails. A few periodically write checkpoint files, and
a subset ;-) of those can be re-run from the last checkpoint.
Checkpoint files? BLAH!!! :^). Admittedly, I'm a total neophyte
Application-specific checkpoint files are sometimes the only effective
way to handle node crashes.
Post by Randall Jouett
Off the top of my head, why couldn't you just plug in an old
10Base-T card to each node. Add a server node that specifically
The problem isn't just detecting that a node has failed (which is either
trivial or impossible, depending on your criteria), the problems are
- handling a system failure during a multiple-day run.
- handling partially completed work issued to a node
- killing processes/nodes that you think have failed, lest they
complete their work later.
Ah. Ok. I understand now. Thanks for the info.
Post by Donald Becker
Post by Randall Jouett
BTW, has anyone bothered to calculate all the wasted cycles
used up by check-point files? :^).
Checkpointing is very expensive, and most of the time the checkpoint
isn't used. This is why only application-specific checkpointing makes
sense: the application writer knows which information is critical, and
when everything is consistent. Machines that save the entire memory
space have been known to take the better part of an hour to roll out a
large job.
An hour? Dang.
Post by Donald Becker
Post by Randall Jouett
Post by Donald Becker
Although the program completes the rendering, there is still much
ugliness when a partially-failed MPI program tries to finish.
Hmmm. Why aren't folks flagging the node as dead and ignoring
any other output until the node is back up and saying it's
ready to run. This would have to be verified by the sysadmin,
of course.
The issue is the internal structure of the MPI implementation: there is
no way to say "I'm exiting sucessfully even though I know some processes
might be dead." Instead what happens is that the library call waits
around for the dead children to return.
I take it that you're talking about a compute node when you're saying
all of this, and I'm also reading processes here as "the other nodes."
Remember, I'm a neophyte, Donald :^).


Anywho, I was thinking that the lib call was written in an asynchronous
fashion, with various flags being set on the root node when a compute
node completed its computation. Also, the only way the root would
continue on with the application is when all nodes sent a response
saying that they're done.


I also don't see why you couldn't make a few test runs, average
out the response time of each node and the overall process (if
necessary), and stick this info into a database for a given app
on the root node. (Off the top of my head, of course. I'll have to think
about this a while.) To me, this seems like you'd be adding a certain
level of fault tolerance at the software level.


Now, if we set up a response-time window for an individual compute node
and the root node thinks that it has fallen out of the window, then it
seems to me that the root node could flag the node as having temporary
problems, and then it could shift that node's work over to the first node
that has completed its calculations/processing. Should the problematic
node straighten itself out and start responding again -- let's say it
finished its processing -- then the data is taken from that node
verbatim and stored off to disk. That is, if the recovery node didn't
finish its work already, of course. You'd also have to tell the original
node that straightened itself out "Never mind," of course. (Said with
Lilly Tomlin intonations :^) ).
Post by Donald Becker
This brings us back to the liveness test for compute nodes. When do we
decide that a node has failed? If it doesn't respond for a second? A
transient Ethernet failure might take up to three seconds to restart
link (typical is 10 msec.) Thirty seconds?
If it gets out of its window, you set things up so that the first
node to complete its computations takes over its work load. If the node
acting up straightens itself out, great. Then you just kill the
request for node recovery and things should just "keep on trucking."
(Wow. That last remark is really showing my age :^)). At this point,
I guess you'd also want to increase the size of the window on the
node that's acting up, too. Also, if the node doesn't respond within
twice the window size, I guess you could display a message
on the console, remove the node from computations, and let the
sysadmin take a look at the machine to see if anything is awry
with the node or the network. More than likely, hiccups would involve
latency, possibly due to fragmentation or the like. God forbid a memory
leak :^).
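
A very rough sketch of the bookkeeping that window scheme implies on the root
node; all the names and numbers are hypothetical, and the actual work hand-off
and message passing are left out:

#include <stdio.h>
#include <time.h>

#define NODES 8

/* per-node bookkeeping on the root: when the work unit was issued, how long
   we are willing to wait for it, and what state it is in */
struct slot { time_t issued; double window; int done; int reassigned; };

/* called periodically from the root's event loop */
static void check_windows(struct slot s[], int n, time_t now)
{
    int i;
    for (i = 0; i < n; i++) {
        if (s[i].done || s[i].reassigned) continue;
        if (difftime(now, s[i].issued) > s[i].window) {
            /* node i fell out of its window: hand its work unit to the first
               idle node, but keep listening in case node i finishes anyway */
            printf("node %d exceeded its %.0f s window, reassigning\n",
                   i, s[i].window);
            s[i].reassigned = 1;
            s[i].window *= 2.0;   /* give the slow node a wider window next time */
        }
    }
}

int main(void)
{
    struct slot s[NODES];
    time_t now = time(NULL);
    int i;

    for (i = 0; i < NODES; i++) {
        s[i].issued = now - 30;   /* pretend work went out 30 s ago */
        s[i].window = 20.0;
        s[i].done = (i == 3);     /* pretend node 3 already reported back */
        s[i].reassigned = 0;
    }
    check_windows(s, NODES, time(NULL));
    return 0;
}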


If you really wanted to get spiffy, I guess you could work a
neural net into the system, having it monitor network traffic
and such. A setup like this might be able to warn you if glitches
were getting ready to rear their ugly heads. :^)
Post by Donald Becker
A machine running a 2.4 kernel before 2.4.17 might take minutes to
respond when recovering from a temporary memory shortage, but run just
fine later.
In the model I just described, I don't think this would be a problem.
(Shrug.) I still have to think about all of this, of course. The one
thing I really like about a model like this is that it would be
asynchronous, and you could get away with simplistic levels of
message passing. Just open a socket, read packets, and write packets.


BTW, I have read and understood everything you've said, Donald,
and I thank you wholeheartedly for the explanation. The way I
responded, though, you'd think that I already knew what I was
talking about. Without any doubts -- I don't! :^). I wrote my
response this way, though, so that you and others can straighten
me out if I'm looking at parallel processing in a bass-ackwards
fashion. If not, then maybe the model I described is exactly how
things work already. If not, then maybe it might be worth looking
into further. (Shrug.)

Ok. I've been up all night with the flu. I need to try to get
some sleep, and answer the rest of my beowulf e-mails later
on tonight, if I'm feeling better.

73 (Best Regards in morse code...ham lingo :^),

Randall

--
Randall Jouett
Amateur Radio: AB5NI
Robert G. Brown
2003-01-05 22:50:11 UTC
Permalink
Post by Randall Jouett
Yep, and let's not forget that a semi-decent sized, hardware-encryption
based cluster could be set aside to generate the initial seeds, and then
the seeds could propagate over the entire network, further reducing the
chance that an identical seed will rear its ugly head. Well, for a
certain amount of time, anyway.
All of this very interesting discussion ignores one problem -- the
"intrinsic" parallel scaling of parallel uniform deviate generation.

On most hardware, generating "the next uniform deviate" costs (say) 3 to
5 or 6 operations. That is, a GHz-class CPU should be able to generate
tens to hundreds of millions of double precision uniform deviates per
second, depending on the algorithm used. On a 100 Mbps (12.5 MBps)
network, one obviously "loses" parallelizing this on even two hosts
unless one has a VERY long and complicated uniform deviate algorithm
(one that can time-dominate the serial fraction of the computation).
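
The claim is easy to sanity-check with round numbers. A small C sketch, where
the 5-ops-per-deviate figure and the 8-byte doubles are assumptions rather
than measurements:

#include <stdio.h>

int main(void)
{
    /* assumed, round numbers: a 1 GHz CPU spending ~5 ops per deviate,
       versus shipping 8-byte doubles over 100 Mbps and 1 Gbps links */
    const double local   = 1e9 / 5.0;     /* deviates/sec generated locally */
    const double net_100 = 12.5e6 / 8.0;  /* deviates/sec over 100 Mbps     */
    const double net_1g  = 125e6  / 8.0;  /* deviates/sec over 1 Gbps       */

    printf("local generation  : %6.1f M deviates/sec\n", local   / 1e6);
    printf("100 Mbps delivery : %6.2f M deviates/sec\n", net_100 / 1e6);
    printf("1 Gbps delivery   : %6.2f M deviates/sec\n", net_1g  / 1e6);
    return 0;
}

With these particular assumptions even gigabit delivery lags local generation
by an order of magnitude; a much more expensive generator shifts the balance,
which is the "VERY long and complicated" caveat above.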

To put it another way, on an expensive Gbps network, packing whole
blocks of UD's into large message for efficient delivery, I suppose that
one could conceivably "win" parallelizing uniform deviate generation on
a couple or three hosts, PROVIDED that one had an application that could
suck up the incoming stream as fast as the network DMA delivered them
without blocking. In my Monte Carlo problem it doesn't -- I have to do
a teeny bit of trig and matrix arithmetic based on the weighted outcome
produced by every uniform deviate -- at least as much computation as
generating the UD itself locally. Thus, if I worked very hard, I MIGHT
get a parallel speed up of two by pre-generating the next UD on another
node and having it totally ready to go when I need it, but -- surprise!
-- that's exactly the speedup I get from parallelizing the entire
embarrassingly parallel application on two nodes, each making its own
UD's with its own seed. Besides, real-world inefficiencies would
probably make this a lose-lose effort anyway -- MPI or PVM or even raw
sockets have some overhead.

My own problem isn't trivial, but neither is it THAT sensitive to
uniform deviate generator choice, provided that I avoid ones with
obvious and known problems. So far, I've found it easiest to simply run
a large number of independent simulations, each with its own RNG/seed
and a decently large shuffle table. I accumulate hundreds, even
thousands of these simulations, but this still leaves the probability of
a coincident run very, very small, much smaller than the statistical
error remaining. At that, all the damage one or two coincident seeds
will do is cause the unbiased statistical error estimate to be very
slightly underestimated (as "one" independent runs will be counted as
"two"). Totally negligible, as long as one doesn't make a habit of it
(and in any event, by recording the initial seeds and LOOKING, easily
eliminated).

A lot of other problems can be solved the same way. Probably MOST other
problems -- as noted above, local generation is likely to compete
favorably with remote generation unless and until uniform deviate
generation itself DOMINATES the serial fraction of the computation.
This is almost never the case -- decryption, monte carlo, almost
anything nontrivial requires some computation to be done WITH the
uniform deviate, and as soon as that computation time competes with
generation time, you are almost certain to lose. Note that (for
purists) I'm including all lattice-type decompositions of a problem
where UD's are generated AND USED on a node in the same category as EP
distribution of simulations -- this is just contrasting this sort of
parallelism with the proposed notion of a "UD-generating cluster".

rgb
--
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:***@phy.duke.edu
Mark Hahn
2003-01-06 14:46:56 UTC
Permalink
Post by Erik Paulson
Post by Mark Hahn
Post by Randall Jouett
Personally, I don't think so, especially if we consider the
fact that in the not-too-distant future, networking speeds
will be up to snuff with the various tasks at hand. With these
ah! I think this is the central fallacy that drives grid enthusiasm.
Then you clearly don't understand grid computing.
OK, perhaps you're right.
Post by Erik Paulson
Post by Mark Hahn
there simply is no coming breakthrough that will make all networking
fast, low-latency, cheap, ubiquitous and low-power. and grid
(in the grand sense) really does require *all* those properties.
Grid computing does not require any of this. Grid computing is all about
access and coordination.
but access and coordination require some kind of work, and from where
I sit, most interesting distributed work requires the net as described
(and which does not exist.)
Post by Erik Paulson
Grid computing is much more than just
running naturally (embarrassingly) parallel problems on spare cycles on
every computer people can find.
OK, so if it's more than loosely-coupled parallelism, then it must
inherently require a fairly tight network. that was the point.
Post by Erik Paulson
Some companies will setup their own, internal distribution/grids - think of
Walmart - and inside the company they'll deal with however the cost recovery
method needs to work. Others will get it from the big boys - you'll want
OK, this is the "Grid is a batch/queueing system with elaborate accounting"
explanation - exactly my understanding of grid ala Globus.

I don't really understand the appeal of this: on my clusters, I want users
to have actual user accounts, and to write, tune and compile their programs
for the cluster's specific hardware, running them under the cluster's
resource management (queueing/batch/accounting) software. AFAICT, the grid
approach would have them running sandbox'ed, interpreted java programs on a
generic proxy account.

OK, so grid is just cycle scavenging with its own meta-queueing,
its own meta-authentication and its own meta-accounting?
Erik Paulson
2003-01-06 15:21:16 UTC
Permalink
Post by Mark Hahn
Post by Erik Paulson
Some companies will setup their own, internal distribution/grids - think of
Walmart - and inside the company they'll deal with however the cost recovery
method needs to work. Others will get it from the big boys - you'll want
OK, this is the "Grid is a batch/queueing system with elaborate accounting"
explanation - exactly my understanding of grid ala Globus.
I don't really understand the appeal of this: on my clusters, I want users
to have actual user accounts, and to write, tune and compile their programs
for the cluster's specific hardware, running them under the cluster's
resource management (queueing/batch/accounting) software.
Do you really think that's what your users want, though? And what happens when
they need more compute power than you can give them, and want to use a 128
node cluster instead of a 64 node cluster? They go to a different cluster and
learn everything again?
Post by Mark Hahn
AFAIKT, the grid
approach would have them running sandbox'ed, interpreted java programs on a
generic proxy account.
Nothing in grid computing says that it's any sort of generic account - all
authorization to use a resource is entirely up to the resource owner - if they
want every user on the resource to have an actual, separate account they can.
If they want everyone to share a generic account they can. There's a separate
mapping of "grid identities" to local unix accounts, for systems where it
makes sense.

And who said anything about grid computing requiring interpreted Java
programs? If you wanna run x86 code go ahead and do so, provided you've got
access to some x86 resources.
Post by Mark Hahn
OK, so grid is just cycle scavenging with its own meta-queueing,
its own meta-authentication and its own meta-accounting?
It's not cycle scavenging.
It's a standard way of talking to queuing systems.
It's strong authentication and a separate authorization step
Accounting is accomplished however it makes sense for the resource owners who
have put together that grid.

-Erik
Robert G. Brown
2003-01-06 15:15:37 UTC
Permalink
Post by Florent Calvayrac
Post by Randall Jouett
Taking this method a step further still, MPI latency might even be able to
be used (delta time between the compute nodes and the head)
this is certainly much simpler to implement than downloading sprng,
(on http://sprng.cs.fsu.edu/ )
or just implementing a node-number dependent recurrence formula on each
machine, as RGB is also doing. Measuring an MPI latency time
(on the order of several thousand cycles), communicating back and forth,
and running the same calculation on another cluster to predict the
results as you suggest is also certainly faster and more elegant to
achieve independence of the streams.
Sure, but a) this sort of thing is very dangerous and b) why bother?
/dev/random is ALREADY doing a moderately sophisticated (and probably
much more sophisticated and better thought through) job of doing this
sort of thing.

If the issue is one of "random" (by which I mean "uncorrelated by all
measures of correlation") vs "uniform deviate generator" (a term that
avoids the oxymoronic descriptor "random number generator":-) then it is
extremely difficult to come up with entropy from ANY source or sources
available on OTC systems on an OTC network sufficient to provide "truly
random" numbers fast enough for a random number-hungry application.
There are all kinds of subtle time correlations in nearly anything you
choose to use as an entropy source short of a piece of specifically
engineered hardware designed for that purpose, and you have to wait for
several of the longest correlation times of each source in order for its
next output to be "random" relative to the previous one. By definition,
by the way.

The problem is: what are these correlation times? They turn out in many
cases to be order milliseconds to seconds, not nanoseconds to
microseconds (relative to 32 bits). Or if you prefer, order of 100
microseconds per "random" bit. And there's the rub.

If a pair of systems were executing some fairly deterministic MPI code
and passing a systematic pattern of same-size messages, you would have
to do a lengthy and nontrivial computation just to MEASURE the
correlation time, and the correlation time for a nominally periodic
process could be VERY LONG. If you fail to do this measurement and just
"assume" that you know when the times are adequately decorrelated, you
could end up with very significant correlation in your "random" number
stream. Worse, since your measurement will be heavily state dependent,
you have to ensure that a state with a longer correlation time than
whatever you measure can "never" occur, which is all but impossible on a
system running on a fixed clock. Basically, you should read "man 4
random" (which uses a variety of sources of entropy, not just one)
before deciding that you can do better, and you should recognize that
/dev/random is really only suitable for generating seeds or other
infrequently required "really random" numbers.

Or, to argue the other way, if you INSIST on using e.g. /dev/random as a
source of a stream of "really random" random numbers, accepting that it
will block until it accumulates enough entropy (by its internal
measures) to return the next one AND accepting that abusing it by USING
it to generate a stream in this way, where each number is almost by
definition "barely random enough" according to the standards of "barely"
implemented by the designers and may or may not be random "enough" for
your application, then this is indeed a suitable sort of thing to
implement on a cluster basis!

Inserting /dev/random reads into my cpu_rate timing harness, I measure a
/dev/random return rate on the order of 5 milliseconds. This is a very
believable number, suggesting correlation times averaging maybe 1
millisecond in the entropy sources used (relative to 32 bits -- ~100
microsecond bit correlation times). In order to saturate even 100BT
(call it two and a half million 32 bit random numbers per second) one
would need, lessee, 200 per second per node into 2.5x10^6, um, order of
10^4 nodes, PRESUMING that entropy accumulates as rapidly on nodes with
no keyboard, mouse, or significant multitasking as it does on my "noisy"
desktop where I just ran the benchmark.
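
For anyone who wants to repeat the measurement, a minimal timing loop in C
(this is not the cpu_rate harness, just the bare idea); expect the reads to
block, and expect the rate to swing wildly with how much keyboard, mouse, and
disk noise the machine sees:

#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    FILE *fp = fopen("/dev/random", "r");
    unsigned int x;
    int i, n = 20;                /* keep this small: each read can block */
    struct timeval t0, t1;
    double secs;

    if (fp == NULL) { perror("/dev/random"); return 1; }
    gettimeofday(&t0, NULL);
    for (i = 0; i < n; i++)
        if (fread(&x, sizeof(x), 1, fp) != 1) {
            fprintf(stderr, "short read from /dev/random\n");
            return 1;
        }
    gettimeofday(&t1, NULL);
    fclose(fp);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%d reads in %.3f s -> %.1f ms per 32-bit number\n",
           n, secs, 1e3 * secs / n);
    return 0;
}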

Somehow I think that buying a hardware RNG device based on e.g.
radioactivity or thermal noise would be cheaper and would probably work
better as well.

Now, if you only wish to generate "a few" really random (heh!;-) numbers
to use as seeds for UDG's so that they can produce distinct sequences of
uniform deviates (accepting whatever degree of high-dimensional
correlation associated with the method, as usual), you simply won't
"build" an entropy-based source of the random seed that is any MORE
suitable than /dev/random, and reading a seed from /dev/random is as simple
as (with FILE *fp and unsigned int seed):

if ((fp = fopen("/dev/random","r")) == NULL) {
if(verbose == 10) printf("Cannot open /dev/random, setting seed to 0\n");
seed = 0;
} else {
fread(&seed,sizeof(seed),1,fp);
if(verbose == 10) printf("Got seed %u from /dev/random\n",seed);
}
gsl_rng_set(random,seed);

Hard to get any simpler. Then, as noted before, you can either wear
your pants with belt AND suspenders and keep track of all the seeds used
to ensure that the gods of randomness don't whack you via the 1/2^32
chance of getting any given unsigned long int over the sequence of your
jobs (a very good chance -- basically unity -- if you plan to run say a
million sequences, btw) or you can accept the overlap and ignore it
(reasonable for a few hundred or even a few thousand runs) or if you are
REALLY going to use a LOT of runs (order 10^9), you can create a list of
all the ulong ints and shuffle it with a good shuffle algorithm for
starters.
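
The "basically unity" figure is just the birthday problem on a 32-bit seed
space. A small C sketch of the estimate (link with -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* probability that at least two of n runs draw the same 32-bit seed,
       birthday approximation: p ~ 1 - exp(-n*(n-1) / (2 * 2^32)) */
    const double space = 4294967296.0;    /* 2^32 */
    double n;

    for (n = 100.0; n <= 1e6; n *= 10.0)
        printf("n = %8.0f runs: P(collision) ~ %g\n",
               n, 1.0 - exp(-n * (n - 1.0) / (2.0 * space)));
    return 0;
}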

However, few UDG's will likely turn out to be adequately decorrelated if
you start saturating the space of potential seeds -- iterated maps
aren't really random and if you literally sample the space of starting
points densely, you will almost certainly get significant but very
difficult to detect correlations in the generated streams. So once
again, your error estimates (based on independence) will be incorrect,
and other problems can arise.

As always, if you are planning to do anything with very large numbers of
UDG's or random numbers or the like, you are well advised to do some
moderately significant research on UDG's and randomness before
implementing anything at all. For some applications (e.g. writing a
game:-) it won't matter -- "any" UDG will suffice, as long as it yields
independent play experiences. For others (doing a long running
simulation for the purposes of publication) it is essential -- there are
famous cases of results obtained at great expense and published in the
most respected journals only to discover (to the embarrassment of the
authors, the referees, everybody) that they were based on a "bad" UDG
(relative to the purpose) and were, alas, cosmic debris, incorrect,
totally erroneous, suitable for the trash can.

Truth be told, I'd say that most of us who actually work in this game
have this as one of our secret nightmares. The oxymoron is a real one.
There are no RNG's, only UDG's, and it is very, very difficult to
predict whether the correlations that you KNOW are present in a UDG will
significantly affect your answers at any given level of precision,
provided that you avoid the UDGs with known and obvious problems. Real
Computer Scientists (and theoretical physicists, mathematicians,
statisticians) spend careers studying this problem. There is almost a
heisenbergian feeling to it -- if you only generate a few UD's, you can
easily enough get adequately decorrelated ones but your statistical
accuracy (in a simulation) will suck. The more UD's you need, the
greater your statistical accuracy (presuming independence) but the
greater the contamination of those results with the generally occult
correlations.

It Is Not Easy to know when one will hit the optimum -- the best results
one can obtain that have a reliable estimate of statistical accuracy.
Indeed, one way of determining the PRESENCE of the correlations is to
look for statistically significant deviations from outcomes
theoretically predicted for model problems for perfectly random numbers
(e.g. the mean length of a random walk in N dimensions). When the
outcome isn't known a priori, and one is using a UDG that passes most of
the "simple" tests for correlation satisfactorily, one is basically
engaged in a crap shoot...maybe your problem is the one that
(eventually) will reveal a Weakness in your particular UDG, or UDGs.

rgb

Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:***@phy.duke.edu
Randall Jouett
2003-01-07 18:21:40 UTC
Permalink
Howdy Robert,
Post by Robert G. Brown
Sure, but a) this sort of thing is very dangerous and b) why bother?
/dev/random is ALREADY doing a moderately sophisticated (and probably
much more sophisticated and better thought through) job of doing this
sort of thing.
Well, this was off the top of my head, Robert. Also, I know
for a fact that you and the code crankers behind /dev/random
know infinitely more about random-number generation than I
ever will :^).


[Big snip of some stuff WAY over my head :^). I'll take your
word on it, though, Robert :^)]
Post by Robert G. Brown
Or, to argue the other way, if you INSIST on using e.g. /dev/random as a
source of a stream of "really random" random numbers, accepting that it
will block until it accumulates enough entropy (by its internal
measures) to return the next one AND accepting that abusing it by USING
it to generate a stream in this way, where each number is almost by
definition "barely random enough" according to the standards of "barely"
implemented by the designers and may or may not be random "enough" for
your application, then this is indeed a suitable sort of thing to
implement on a cluster basis!
Kewl! :^). Now think up a crazy-sounding acronym to call the process.
Hmmm. Let's see if we can somehow work words like CRAP, FARTS, or DORK
into the name, too. I think it would be totally hilarious to go to a
conference and hear some serious-sounding and well-meaning speaker
say something like, "The obvious solution to this type of random-
generation problem is CRAP." :^) :^). The major problem here, though,
is that the audience might be laughing too much to actually pay
attention to what was being said. They'd probably be hanging on
every word, waiting to hear the term used in a context that actually
meant something else :^). Hmmm. OTOH, maybe an acronym like this would
actually make the audience pay attention and listen :^).


Best Regards,

Randall

--
Randall Jouett
Amateur Radio: AB5NI
Randall Jouett
2003-01-07 18:22:21 UTC
Permalink
Hi folks,

One last thing before I do some serious parallel and beowulf
study. I was wondering if you could improve the latency-based random-number
generator I described earlier by doing something like this:

When compute nodes are idle, rather than just sit there, they
could open up /dev/random and generate random streams that could
be added to the entropy pool? As soon as they'd get some real
work to do, they'd stop doing this and get on with the task
at hand. Other types of random generation could be used in
place of /dev/random, of course.
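
One reading of that idea, as a rough C sketch with an entirely made-up jitter
source standing in for whatever an idle node would actually harvest. Note that
write()ing bytes to /dev/random mixes them into the pool but does not increase
the kernel's entropy estimate (crediting entropy needs root and the
RNDADDENTROPY ioctl), and timing jitter on a quiet compute node is a weak
source for exactly the correlation-time reasons rgb gives above:

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/time.h>

/* stand-in "jitter" source: low bits of the clock around a dummy loop.
   A real harvester would need a much better-characterized source. */
static unsigned char jitter_byte(void)
{
    struct timeval tv;
    volatile int i, s = 0;
    for (i = 0; i < 1000; i++) s += i;
    gettimeofday(&tv, NULL);
    return (unsigned char)(tv.tv_usec & 0xff);
}

int main(void)
{
    unsigned char buf[16];
    int i, fd = open("/dev/random", O_WRONLY);

    if (fd < 0) { perror("/dev/random"); return 1; }
    for (i = 0; i < (int)sizeof(buf); i++) buf[i] = jitter_byte();
    /* this mixes the bytes into the pool but does NOT credit any entropy;
       crediting requires root and the RNDADDENTROPY ioctl */
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) perror("write");
    close(fd);
    return 0;
}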

Hmmm. How about a random-number generating screen blanker in
a grid environment while we're at it, too? :^)

All my replies will be off list.

Randall
--
Randall Jouett
Amateur Radio: AB5NI
Robert G. Brown
2003-01-06 15:53:13 UTC
Permalink
Post by Mark Hahn
OK, so grid is just cycle scavenging with its own meta-queueing,
its own meta-authentication and its own meta-accounting?
And perhaps most important (but not yet significantly implemented,
although there is a very serious project here at Duke to implement it
called Computers On Demand -- COD) -- a meta-OS-environment and
meta-sandbox for the distributed users that can be loaded literally on
demand (at a suitable time granularity, of course:-).
Multiuser/multitasking with a vengeance, where "the network is the
computer" on a very broad scale indeed.

Mark, you shouldn't discount the economics of cycle scavenging or refer
to it as "just" that. In once sense, all multiuser/multitasking
computing is cycle scavenging, but who would deny its benefit? Even
now, things like Scyld can be booted from e.g. floppy on a node, leaving
the node's hard disk and primary install intact. Or, people can install
two or three or ten bootable images on a modern disk and choose between
them with grub. Surely it isn't crazy to develop tools to take the
individual craft and handwork out of these one-of-a-kind solutions and
make them generally and reliably implementable?

Just to give you a single example of the economics that drive this
process, Duke has gone from a couple of clusters (mine and one over in
CS) to literally more clusters than the University per se can track over
the last five years.
Robert G. Brown
2003-01-06 16:27:43 UTC
Permalink
Post by John Burton
Post by Randall Jouett
BTW, has anyone bothered to calculate all the wasted cycles
used up by check-point files? :^).
Yup, and it is significantly less than the number of cycles that would
be wasted having to rerun 24 hours worth of processing because a machine
hiccuped and the process died...
Or to emphasize the point even more strongly, it is easy enough to
estimate a priori and determine empirically when it makes sense to
checkpoint a process and when it makes sense not to, given that it may
be moderately difficult to accomplish.

Don's point about MPI jobs should not be taken lightly. Suppose one is
running a tightly coupled job (one where all the nodes advance
"together" and where failure of any node and the state data it contains
implies failure of the overall job) that will take one month to complete
on 100 nodes. Let us further suppose that (not unreasonably) the
probability that at least one node will "fail" and require at least a
restart during that month is essentially unity.

The time required to complete the project without checkpointing is
basically infinity. The time required to complete the project with a
checkpoint generated once a day, at the cost of 1/30'th of a day's work
(close to an hour!) is likely to be about 31 (best of all worlds, no
failure) and maybe 35 (1-3 failures) days, depending on the number of
actual failures that occur and how rapidly you are able to repair the
downed node(s) and restart the job.

BIG difference between 35 days and infinity...hmmmm
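
The arithmetic behind those numbers fits in a few lines of C; the half day of
lost work plus half a day of repair per failure is an assumption, just to make
the point:

#include <stdio.h>

int main(void)
{
    const double work_days = 30.0;        /* serial length of the job          */
    const double ckpt_cost = 1.0 / 30.0;  /* fraction of a day per daily ckpt  */
    const double per_fail  = 0.5 + 0.5;   /* assumed: lost work + repair, days */
    int failures;

    for (failures = 0; failures <= 3; failures++) {
        double total = work_days * (1.0 + ckpt_cost) + failures * per_fail;
        printf("%d failure(s): ~%.0f days\n", failures, total);
    }
    return 0;
}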

This, BTW, is one of the reasons that there are relatively few WinXX
clusters out there. At least some implementations and installations of
WinXX (where XX is nearly any flavor you like) have reportedly had
reliable uptimes on the order of a day under heavy load. If true, one
would damn near have to checkpoint every fifteen minutes to get through
the aforementioned computation at all and it would take a year. Even a
single failure per day per 100 nodes is enough to significantly affect
time of completion of synchronous tasks.

Without checkpointing (and a lot of folks do run without it as it IS
often a PITA to implement) one is basically gambling that one's cluster
will stay up through a computation cycle, and one sets one's
computational cycle accordingly, making it a form of "checkpointing".
Experience and arithmetic rapidly teaches one when this is a good bet --
and when it is not. The first time you run for a month, only to have a
node (and the entire computation!) crash a few hours before completion
when you were COUNTING on the results to complete the paper you're
presenting at a conference the following week the work to checkpoint may
not seem so very much after all...;-)

Last remark: Randy, you very definitely should take the time to skim
through the list archives, a book or two on parallel computing and
beowulfery in general, and maybe the howtos or FAQs before making hard
pronouncements on what does and doesn't make sense in cluster computing.
This is for a variety of reasons, and you should learn them. This is
not intended as a flame, just as a suggestion. Note the following Great
Truths:

a) nearly anything simple has been discussed a dozen times in full
detail and is in the list archives not once but a dozen times.

b) a great deal that is very complex and involved indeed has ALSO been
discussed a dozen times in full detail ditto. This list has been around
for what, eight years now (Don?) and good archives go back for at least
three or four.

c) your mileage may vary; your task is unique; the only good benchmark
is your own application; there is no substitute for careful thought
about the parallelization process; there is no simple one-size-fits-all
recipe for building a successful cluster (defined as one that does YOUR
work acceptably well with something approximating a linear speedup in
the number of nodes). These are all "list adages" that in summary mean
that any simple rule for cluster computing is probably "wrong" -- right
for THIS case, but wrong for THAT case -- which is why most of the list
experts will carefully qualify their answers rather than make sweeping
statements.

d) In that spirit, parallel computing isn't "like" serial computing in
too many ways. It is a deep and complex subject, and it is well worth
your while to (as suggested) read some books by Real Computer Scientists
on the subject. It's a case where there are often simple, obvious --
and wrong -- implementations of nearly any important numerical task in
parallel form. It's also the case that if you don't understand Amdahl's
Law (or know what it is!) and the related improved estimates associated
with parallel scaling, if you don't know what superlinear speedup is, if
you don't understand how and why both network latency and bandwidth are
important to parallel task completion -- some of the nitty-gritty
associated with homemade parallel computers to accomplish particular
tasks -- you'll end up wasting a lot of the list's time having these
ideas explained to you when you could just as easily read about them and
learn about them yourself.
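
For reference, the simplest form of Amdahl's Law (speedup limited by the
serial fraction) fits in a few lines of C:

#include <stdio.h>

/* Amdahl's Law: speedup on N nodes when a fraction p of the work
   parallelizes perfectly and the remaining (1 - p) stays serial */
static double amdahl(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    int n;
    for (n = 1; n <= 64; n *= 2)
        printf("p = 0.95, N = %2d: speedup %.1f\n", n, amdahl(0.95, n));
    return 0;
}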

One of many possible starting places (one that is free, in any event:-)
is http://www.phy.duke.edu/brahma. In particular, check out the online
book. I'm not a Real computer scientist, just a Sears computer
scientist, and I do need to update this book in the light of all sorts
of recent list discussions and wisdom as well as finish it off in terms
of planned topics, but even as it is it will make a lot of this stuff
clear to you. There are also links to other resources including (IIRC)
an online book on parallel algorithms.

rgb

Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:***@phy.duke.edu
Randall Jouett
2003-01-07 18:22:03 UTC
Permalink
Post by Robert G. Brown
Don's point about MPI jobs should not be taken lightly. Suppose one is
running a tightly coupled job (one where all the nodes advance
"together" and where failure of any node and the state data it contains
implies failure of the overall job) that will take one month to complete
on 100 nodes. Let us further suppose that (not unreasonably) the
probability that at least one node will "fail" and require at least a
restart during that month is essentially unity.
The time required to complete the project without checkpointing is
basically infinity. The time required to complete the project with a
checkpoint generated once a day, at the cost of 1/30'th of a day's work
(close to an hour!) is likely to be about 31 (best of all worlds, no
failure) and maybe 35 (1-3 failures) days, depending on the number of
actual failures that occur and how rapidly you are able to repair the
downed node(s) and restart the job.
BIG difference between 35 days and infinity...hmmmm
You bet! :^). Thanks for the thorough explanation, Robert.
Much appreciated!
Post by Robert G. Brown
This, BTW, is one of the reasons that there are relatively few WinXX
clusters out there. At least some implementations and installations of
WinXX (where XX is nearly any flavor you like) have reportedly had
reliable uptimes on the order of a day under heavy load. If true, one
would damn near have to checkpoint every fifteen minutes to get through
the aforementioned computation at all and it would take a year. Even a
single failure per day per 100 nodes is enough to significantly affect
time of completion of synchronous tasks.
LOL. I love it. OTOH, I'm sure all of us know that WinXX has
always been a total piece of garbage and should have never,
ever won the OS war :^). With Red Hat "trying" to integrate
KDE and Gnome, though, Linux and the like may someday unseat
the all-powerful and unknowing MicroSloth :^). That is, with
a standardized GUI, I think Linux/Unix has a much better chance of
breaking into the biz world in a serious way. I also think that
Mac OS X is helping a bit in this arena, too.
Post by Robert G. Brown
Without checkpointing (and a lot of folks do run without it as it IS
often a PITA to implement) one is basically gambling that one's cluster
will stay up through a computation cycle, and one sets one's
computational cycle accordingly, making it a form of "checkpointing".
Experience and arithmetic rapidly teaches one when this is a good bet --
and when it is not. The first time you run for a month, only to have a
node (and the entire computation!) crash a few hours before completion
when you were COUNTING on the results to complete the paper you're
presenting at a conference the following week the work to checkpoint may
not seem so very much after all...;-)
That hasta suck :^).
Post by Robert G. Brown
Last remark: Randy, you very definitely should take the time to skim
through the list archives, a book or two on parallel computing and
beowulfery in general, and maybe the howtos or FAQs before making hard
pronouncements on what does and doesn't make sense in cluster computing.
Well, if my statements were coming across this way, I most humbly
apologize to you and the list! Basically, I asked my original
questions so that I could find out what exactly people in the
real world were using their clusters for, hoping to use the
garnered information for research. Unfortunately, I somehow
got tied up in conversation on the list, answering this question
and that, making statements that are relevant in serial-based computing
and seeing if they could be tied to the parallel world in some way,
shape, or form.
Post by Robert G. Brown
This is for a variety of reasons, and you should learn them. This is
not intended as a flame, just as a suggestion.
No problemo, Robert! I'm rather thick skinned, so don't even
begin to worry about it. :^)
[ABC's snipped :^)]

OK. I'll shut up and do my homework, Robert :^).
I'll just answer a few more e-mails that were
posted to the list (mainly to complete my thoughts),
and then I'll be quiet and study :^). OTOH, one has
to admit that at least a few of my remarks have stimulated
list activity among members. Do I at least get a C+
for my random-number idea? :^)


Thanks for all the great input, Robert. Much appreciated!

Randall

--
Randall Jouett
Amateur Radio: AB5NI
Mark Hahn
2003-01-06 16:53:05 UTC
Permalink
Post by Erik Paulson
Post by Mark Hahn
Post by Erik Paulson
Some companies will setup their own, internal distribution/grids - think of
Walmart - and inside the company they'll deal with however the cost recovery
method needs to work. Others will get it from the big boys - you'll want
OK, this is the "Grid is a batch/queueing system with elaborate accounting"
explanation - exactly my understanding of grid ala Globus.
I don't really understand the appeal of this: on my clusters, I want users
to have actual user accounts, and to write, tune and compile their programs
for the cluster's specific hardware, running them under the cluster's
resource management (queueing/batch/accounting) software.
Do you really think that's what your users want, though?
yes, I KNOW they do. why? because it makes for efficient use of the
cluster. OK, so I add to my list of grid premises: assume cycles are
cheap and efficiency is not important.

there's nothing wrong with genericity that preserves efficiency.
the problem is that genericity often implies a noticeable loss there.
Post by Erik Paulson
And what happens when
they need more compute power than you can give them, and want to use a 128
node cluster instead of a 64 node cluster? They go to a different cluster and
learn everything again?
you seem to have users who have loosely-connected, non-compute/ram/io/ipc
intensive programs. you're right: for them, grid is going to be great.
for my main machine, the priority is specifically to run tightly-coupled,
compute and memory-intensive, long-running codes. for that class of
problems, grid is just a toy.
Post by Erik Paulson
Post by Mark Hahn
AFAICT, the grid
approach would have them running sandbox'ed, interpreted java programs on a
generic proxy account.
Nothing in grid computing says that it's any sort of generic account - all
authorization to use a resource is entirely up to the resource owner - if they
want every user on the resource to have an actual, separate account they can.
If they want everyone to share a generic account they can. There's a separate
mapping of "grid identities" to local unix accounts, for systems that it
makes sense on.
so what does it buy?
Post by Erik Paulson
And who said anything about grid computing requireing interpreted Java
programs? If you wanna run x86 code go ahead and do so, provided you've got
access to some x86 resources.
OK, I'm puzzled now. how does a grid user do that? he doesn't even know
what machines the code is going to run on, whether it's got 3dnow or sse2,
etc. not to mention libraries, etc.

maybe I'm totally confused. you seem to be saying that grid will
provide something new. users still have to ssh in to an account
that I still have to make for them. they still have to use our
queueing system, and obey our resource limits. grid gives them what,
just a plugin that lets them have a meta-scheduler across multiple
clusters?
Post by Erik Paulson
Post by Mark Hahn
OK, so grid is just cycle scavenging with its own meta-queueing,
its own meta-authentication and its own meta-accounting?
It's not cycle scavenging.
It's a standard way of talking to queuing systems.
ah, that's all.
Post by Erik Paulson
It's strong authentication and a separate authorization step
ah, interesting. I know Globus uses its own random/nih PK infrastructure,
but is it in some way better than the standard (ssh)?
Post by Erik Paulson
Accounting is accomplished however it makes sense for the resource owners who
have put together that grid.
if accounting is separate, it sure looks like a cycle scavenger to me.
oops: "cycle harvester".
Mark Hahn
2003-01-07 00:33:23 UTC
Permalink
Post by Randall Jouett
Anywho, I was thinking that the lib call was written in an asynchronous
fashion, with various flags being set on the root node when a compute
node completed its computation. Also, the only way the root would
continue on with the application is when all nodes sent a response
saying that they're done.
well, that means the master becomes a potential bottleneck.
also consider what happens if the master fails...
Post by Randall Jouett
verbatim and stored off to disk. That is, if the recovery node didn't
finish its work already, of course. You'd also have to tell the original
node that straightened itself out "Never mind," of course. (Said with
that's fine if each node has only trivial globally unique state.
but often, the reason you're using parallelism at all is because
you have a huge amount of global state, and each of N nodes owns 1/N of it.
can your program somehow survive when 1/N of its state disappears?

some codes don't have a lot of state. for instance, suppose you were
doing password cracking - a node's state is just its assigned subspace
within the set of possible cleartext passwords. if it dies, just
hand the space to some other node or distribute it among the survivors.

if your problem is like that, you're utterly and completely golden -
not only can you handle failures easily, but you can also run just
fine on a grid. like prime-cracking, ***@home, etc.
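
to make that concrete, here's a toy sketch (plain C, made-up chunk
count) of the bookkeeping: a node's entire state is "which slice of
the keyspace am I on", so a dead node costs you at most one slice of
redone work:

#include <stdio.h>

#define NCHUNKS 16     /* keyspace pre-cut into 16 slices (made-up number) */

enum { TODO, IN_PROGRESS };            /* "done" left out of the toy */
static int slice_state[NCHUNKS];       /* all slices start out TODO (0) */

static int claim_slice(void)           /* called by an idle node */
{
    int i;
    for (i = 0; i < NCHUNKS; i++)
        if (slice_state[i] == TODO) {
            slice_state[i] = IN_PROGRESS;
            return i;
        }
    return -1;                         /* nothing left to hand out */
}

static void node_died(int slice)       /* called when a node drops off the net */
{
    if (slice >= 0 && slice_state[slice] == IN_PROGRESS)
        slice_state[slice] = TODO;     /* just toss it back on the pile */
}

int main(void)
{
    int a = claim_slice();             /* node A grabs slice 0 */
    int b = claim_slice();             /* node B grabs slice 1 */
    node_died(a);                      /* node A crashes mid-slice */
    printf("node B finishes slice %d, then picks up slice %d\n",
           b, claim_slice());          /* slice 0 comes back around */
    return 0;
}

compare that to a tightly-coupled code, where the dead node was the
only holder of a gigabyte of field data.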
Randall Jouett
2003-01-07 18:21:58 UTC
Permalink
Howdy Mark, and thanks for the reply.
Post by Mark Hahn
Post by Randall Jouett
Anywho, I was thinking that the lib call was written in an asynchronous
fashion, with various flags being set on the root node when a compute
node completed its computation. Also, the only way the root would
continue on with the application is when all nodes sent a response
saying that they're done.
well, that means the master becomes a potential bottleneck.
Hmmm. I think that this would be application specific and really
depend on the situation at hand. That is, I can see where some
applications would only need to worry about the data they were
crunching internally, and they wouldn't have to talk to other
nodes, other than letting the root node know that they've completed.
For other applications where nodes have to communicate with each
other, it would seem that the model could just be duplicated on each
compute node, including the windowing and statistics. Also, a compute
node that's having a problem talking to another compute node could
report this problem to the root/head, making sure that the problem
node is watchdogged or removed from scheduling completely should it fail.
That is, if I understand all of this parallel stuff correctly.
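
Just so it's clear what picture I have in my head (and with the caveat
that I'm still guessing at the idiomatic MPI way to do this --
MPI_Barrier() or one of the collectives is probably the grown-up
answer), I mean something like the root sitting in a loop collecting
"I'm done" messages:

#include <stdio.h>
#include <mpi.h>

#define TAG_DONE 1

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* root: wait until every compute node has reported in */
        int reported, dummy;
        MPI_Status st;
        for (reported = 0; reported < size - 1; reported++) {
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE,
                     TAG_DONE, MPI_COMM_WORLD, &st);
            printf("node %d finished\n", st.MPI_SOURCE);
        }
        printf("all nodes done, carrying on\n");
    } else {
        /* compute node: do the work, then tell the root */
        int done = 1;
        /* ... crunch this node's piece of the problem ... */
        MPI_Send(&done, 1, MPI_INT, 0, TAG_DONE, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}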
Post by Mark Hahn
also consider what happens if the master fails...
No problem. The head/root node could be set up as a
failsafe cluster. Should the head node go down, another
machine just takes over. I think that this would work,
anyway. (Shrug.)

BTW, all these statements shouldn't be seen in an "I know
what the hell I'm talking about" context. Just brainstorming,
and the replies I'm getting on and off list seem to be helping
me understand all of this stuff, too. Thanks, guys!
Post by Mark Hahn
Post by Randall Jouett
verbatim and stored off to disk. That is, if the recovery node didn't
finish its work already, of course. You'd also have to tell the original
node that straightened itself out "Never mind," of course. (Said with
that's fine if each node has only trivial globally unique state.
but often, the reason you're using parallelism at all is because
you have a huge amount of global state, and each of N nodes owns 1/N of it.
can your program somehow survive when 1/N of its state disappears?
Personally, I'd think survivability would depend on the capabilities of
the people designing the code. I guess you could also set aside a node
or two and use them as failsafe/backup nodes (or whatever the terminology
is here), and should a node fail, one of them could take over. This all
depends on whether the node taking over would have access to the data. (Check-point
files? UGH! :^) ). BTW, when I say "UGH!" here, I'm really not trashing
check-point file usage. What I guess I'm really saying is that they're a
necessary evil, and also that they are something I'm sure all of us wish
didn't exist in any way, shape, or form. :^). I just hate seeing cycles
being used this way, and if someone could figure out a way to get rid
of them, I'm sure everyone in the field would be MUCH happier :^).
Post by Mark Hahn
some codes don't have a lot of state. for instance, suppose you were
doing password cracking - a node's state is just its assigned subspace
within the set of possible cleartext passwords. if it dies, just
hand the space to some other node or distribute it among the survivors.
Yep. Exactly what I was thinking.
Post by Mark Hahn
if your problem is like that, you're utterly and completely golden -
not only can you handle failures easily, but you can also run just
Yep :^).

Nice talking to you, Mark, and best regards,

Randall

--
Randall Jouett
Amateur Radio: AB5NI
Robert G. Brown
2003-01-07 19:34:21 UTC
Permalink
Post by Randall Jouett
OK. I'll shut up and do my homework, Robert :^).
I'll just answer a few more e-mails that were
posted to the list (mainly to complete my thoughts),
and then I'll be quiet and study :^). OTOH, one has
to admit that at least a few of my remarks has stimulated
list activity between members. Do I at least get a C+
for my random-number idea? :^)
As a genuine professor, I award you a C+ on the basis of noise,
effort, and (perhaps excessive:-) enthusiasm. It's not a C- not so much
because of the random-number idea per se as for the creative thinking,
however wrong.

Now, also as an honest-to-god professor who has to start preparing to
"teach" tomorrow morning any minute now, I'll tell you what I'm going to
tell my new crop of students: Spontaneous thought and
idea-kicking-around is indeed a component of learning, but:

a) Nobody can teach you anything. Not even me. At best we can help
you learn, but even that will work only to the extent that you have made
YOURSELF ready by the application of the fundamental precepts of (self)
discipline.

b) You therefore must first learn to teach, to discipline, yourself.

c) Teaching yourself, learning, discipline, is difficult (but
rewarding and fun!) work and a serious enterprise. One important step
is to control the interior monologue and think through your ideas on
your own before offering them up -- a bit of sorting and filtering here
decreases the noise level of the communications channel and is generally
a good thing. Another is to use YOUR time and study FIRST -- read up on
things, draw pictures, TRY to understand on your own. Get to the point
of marginal frustration, stop, study, and try again to the point of
marginal frustration for a cycle or two. Recognize that true satori is
always the result of, and satisfying in proportion to, this investment
of time.

In many cases, proper application of this ritual will lead one to a
steady stream of satori in this or any other discipline, especially if
one has a playground/cluster/computer to use for self-imposed
"homework". It is, by the way, also useful should you wish to study
zen, history, mathematics, physics, a language. ONLY WHERE IT HAS BEEN
TRIED AND APPLIED AND FAILS can the next step be fruitfully applied:

d) When a problem refuses to resolve, a thorny concept fails to become
clear AFTER you've worked to the point of frustration several times,
THEN ask for help from a Perfect Master (where you can assume that this
list contains many PM's and a few bozos, where it has long been
established -- in the FAQ yet -- that I'm a bozo;-).

At that point, with the ground fruitfully prepared by your efforts and
studies, Enlightenment can often be brought about with the proverbial
whack upon the head with a manual or a finger pointed at an Enter key.
Before that point, especially if the question has a trivial answer, you
are more likely to be whacked on the head with a sucker rod (read, e.g.
man syslogd) and told to RTFM.

Of course, I will summarize this tomorrow as: "If you don't read the
physics textbook like a bloody novel before you go to bed at night and
work on your impossibly difficult homework assignments with ritualistic
religious fervor, you ain't gonna learn Maxwell's Equations no matter
how brilliantly I lecture." My continuing students have pretty much all
figured this out after a semester of abuse at my hands. Now to torment
the incoming ones...;-)
Post by Randall Jouett
Thanks for all the great input, Robert. Much appreciated!
You are most welcome. Enjoy learning about cluster computing,
especially in a hands-on way.

rgb

Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:***@phy.duke.edu
Robert G. Brown
2003-01-07 20:09:12 UTC
Permalink
Post by Randall Jouett
Hi folks,
One last thing before I do some serious parallel and beowulf
study. I was wondering if you could improve the latency generator
When compute nodes are idle, rather than just sit there, they
could open up /dev/random and generate random streams that could
be added to the entropy pool? As soon as they'd get some real
work to do, they'd stop doing this and get on with the task
at hand. Other types of random generation could be used in
place of /dev/random, of course.
Hmmm. How about a random-number generating screen blanker in
a grid environment while we're at it, too? :^)
All my replies will be off list.
If that's a PROMISE I'll make one last reply on-list.

First of all, before addressing random numbers, which is a subject most
Ph.D.-wielding scientists and mathematicians are woefully ignorant of,
you need to bring yourself up to at least their level of woeful
ignorance. To help you, I can only suggest that you read Knuth or
Marsaglia for starters, or a chapter in a good book on numerical methods
(e.g. Forsythe, Malcolm and Moler -- a classic although alas in Fortran)
or, for a more up-to-date review, visit Marsaglia's website for the
Diehard (Battery of Random Number Tests) at:

http://stat.fsu.edu/~geo/diehard.html

The diehard sources come with a tool that can generate UD sequences from
some dozen different UD algorithms (in many cases, with several distinct
common configuration/initialization choices) for direct testing with
diehard, and diehard itself contains 15 distinct tests for randomness.
It is trivial to apply diehard to any sequence of data provided only
that you can generate an appropriately formatted file copy of the data.

Alas, I could find no copyright or copyleft -- I wrote Dr. Marsaglia
suggesting that he formally GPL the sources to make it a bit easier to
port them out of their current f2c translated state (ugly, ugly, ugly)
and package them up to make them easier to build. I'm hoping/assuming
he's still alive -- he was writing papers on this back in 1968 -- but he
has not yet replied.

Anyway, the answer to your "could you improve..." question above is --
who knows? Don't ask the list, find out! Get diehard, build it, study
the generators and learn a bit about how they work, learn how to apply
the whole suite of tools and generators to problems. Test the output
stream from /dev/random with diehard -- I have no idea myself whether or
not it "passes" Marsaglia's suite, but he mentions in the diehard
documentation that not even hardware devices tend to do well on the
entire suite.
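
If you want to try it, all diehard really needs is a largish flat
binary file of bits -- on the order of ten megabytes or so; check the
diehard documentation for the exact figure it wants. Something like
this is enough to capture a sample for testing (I read /dev/urandom
below only because /dev/random will block waiting for entropy on an
idle box -- substitute whichever stream you actually mean to test):

#include <stdio.h>
#include <stdlib.h>

/* Dump roughly NBYTES of a kernel random device into a flat binary
   file that diehard can read. */
#define NBYTES (12 * 1024 * 1024)

int main(void)
{
    FILE *in = fopen("/dev/urandom", "rb");
    FILE *out = fopen("urandom.bin", "wb");
    unsigned char buf[4096];
    size_t total = 0, n;

    if (!in || !out) { perror("fopen"); return 1; }
    while (total < NBYTES &&
           (n = fread(buf, 1, sizeof buf, in)) > 0) {
        fwrite(buf, 1, n, out);
        total += n;
    }
    fclose(in);
    fclose(out);
    printf("wrote %lu bytes\n", (unsigned long)total);
    return 0;
}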

Only when you've taken the time to become at least competent in the
basic X_n = F(X_{n-m}, X_{n-m+1}, ..., X_{n-1}) mod a idea underlying most UDG's and
familiar with the characteristics of random processes can you even THINK
about being able to answer your question, and nobody else is likely to
answer it for you. You also might want to do a bit of a literature
search on the statistics of network latencies. Until you understand
what "poissonian" means and how it differs from "uniform" or
"exponential" or "gaussian", until you understand correlation and
covariance, you also won't be able to answer your own question.
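
Just to put a little flesh on that recurrence: the m = 1 case, where
each value is a fixed function of the one before it, is the garden
variety linear congruential generator. A few lines of C (the multiplier
and increment below are one commonly quoted pair, working mod 2^32):

#include <stdio.h>

/* The m = 1 case of X_n = F(X_{n-m}, ..., X_{n-1}) mod a:
   a plain linear congruential generator.  The constants are one
   commonly quoted multiplier/increment pair; unsigned arithmetic
   gives the "mod 2^32" for free. */
static unsigned int x = 12345;          /* the seed is the whole state */

static unsigned int lcg(void)
{
    x = 1664525u * x + 1013904223u;     /* wraps mod 2^32 */
    return x;
}

int main(void)
{
    int i;
    for (i = 0; i < 10; i++)
        printf("%u\n", lcg());
    return 0;
}

Feed a few megabytes of its output into diehard next to a sample from
/dev/urandom and the comparison should be instructive.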

To summarize (and reiterate my previous reply), throwing open-ended
questions out to the list, especially on topics like random number
generation, isn't likely to be a productive way to learn cluster
computing OR about random number generation, compared to reading a
chapter or two in any decent book on the subjects.

In some cases (RNG's in particular) you are likely to find that you have
to start with a decent knowledge of e.g. probability and statistics or
networking in order to even properly frame your own question, and
reading one book will (and should) lead you to the next, and so on.
However, there is NO SUBSTITUTE for trying things out on your own, or
for taking a relevant course at e.g. a university or technical school,
where advanced technical subjects are concerned. Nobody on-list has the
time to summarize a course in statistics and random number generation
theory (and iterated maps, and chaos and fractals, and random processes,
and all the rest of the connected concepts). Hell, nobody on-list
probably COULD summarize all of these subjects -- lots of us know
something about some of the topics, but unless UDGs are your speciality
why would you know about them all?

Now remember, you promised!

rgb

Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:***@phy.duke.edu
Christoph Wasshuber
2003-01-07 20:54:09 UTC
Permalink
One practical pointer I can give is a book
I wrote about a different subject, but
it has a very practical chapter on random
number generation and the parallelization of
it.

The book is entitled
Computational Single-Electronics
Christoph Wasshuber
Springer Verlag; ISBN: 321183558X; (July 2001)

Of course there are dozens of great books
and articles of which some are referenced in my
book. Forgive me this plug. Over and out.

Chris....
Donald Becker
2003-01-07 21:54:44 UTC
Permalink
Post by Joe Nellis
I purchased the scyld CDROM from Linux Central about June 2001. Is the CD
currently being sold the same one or a new version since then?
I believe that they are still selling the older 27Bz-8 version.
The current version we ship to customers is 28Cz-5, with the 29 series
in development.

The Zero-Install Demo version that we distributed at SC2002, numbered
28Dz-5, is the most recent low-cost, unsupported version. However,
unlike the previous basic editions which were full installs with basic
tools, the new "D" demo disks are intended as single-application,
zero-install turn-key cluster packages.
--
Donald Becker ***@scyld.com
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Scyld Beowulf cluster system
Annapolis MD 21403 410-990-9993
Robert G. Brown
2003-01-14 22:00:53 UTC
Permalink
On Tue, 14 Jan 2003, Bryce Bockman wrote:
Bryce Bockman
2003-01-15 03:23:24 UTC
Permalink
Thanks for the info Robert. This is good stuff.

Cheers,
Bryce