Discussion:
GPU Beowulf Clusters
Jon Forrest
2010-01-28 17:38:14 UTC
Permalink
I'm about to spend ~$20K on a new cluster
that will be a proof-of-concept for doing
GPU-based computing in one of the research
groups here.

A GPU cluster is different from a traditional
HPC cluster in several ways:

1) The CPU speed and number of cores are not
that important because most of the computing will
be done inside the GPU.

2) Serious GPU boards are large enough that
they don't easily fit into standard 1U pizza
boxes. Plus, they require more power than the
standard power supplies in such boxes can
provide. I'm not familiar with the boxes
that therefore should be used in a GPU cluster.

3) Ideally, I'd like to put more than one GPU
card in each compute node, but then I hit the
issues in #2 even harder.

4) Assuming that a GPU can't be "time shared",
I'll have to set up my batch engine to treat the
GPU as a non-sharable resource. This means that
I'll only be able to run as many jobs on a compute
node as I have GPUs. It also means that it would be
wasteful to put CPUs in a compute node with more
cores than the number of GPUs in the node. (This
assumes that the jobs don't do anything parallel on
the CPUs - only on the GPUs.) Even if GPUs can be
time shared, given the expense of copying between
main memory and GPU memory, sharing GPUs among
several processes will degrade performance.

Are there any other issues I'm leaving out?

Cordially,
--
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
***@berkeley.edu
Jonathan Aquilina
2010-01-28 17:50:16 UTC
Permalink
are you goign for the nvidia teslas or you looking to squeeze 4 cards into
one box? getting them powered shouldnt be a problem there if you plan on
using plane custom built desktops 2000w psus out there if not more now a
days. im not sure though wiht the teslas you can quad sli them, and if sli
would make any difference in regards to gpu clustered computing
Michael Di Domenico
2010-01-28 17:53:42 UTC
Permalink
Here's the way I do it, but your mileage may vary...

We allocate two CPU cores per GPU and use the Nvidia Tesla S1070 1U
chassis product.

So: a standard quad-core, dual-socket server with four GPUs attached.

We've found that even though you expect the GPU to do most of the
work, it really takes a CPU core to drive the GPU and keep it busy.

Having a second core to pre-stage/post-stage the memory has worked
pretty well also.

For scheduling, we use SLURM and allocate one entire node per job, no sharing
Singh, Tajendra
2010-01-28 20:57:05 UTC
Permalink
This is not a problem in your setup, since you are assigning a whole
node at a time. In general, though, how does one deal with the problem
of binding a particular GPU device to the scheduler?

Sorry if I am asking something that is already well known and there
are existing ways to bind the devices within the scheduler.

Thanks,
TV




David Mathog
2010-01-28 20:11:54 UTC
Permalink
Post by Jon Forrest
Are there any other issues I'm leaving out?
Yes, the time and expense of rewriting your code from a CPU model to a
GPU model, and the learning curve for picking up this new skill. (Unless
you are lucky and somebody has already ported the software you use.)

Regards,

David Mathog
***@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
Micha Feigin
2010-01-30 12:31:45 UTC
Permalink
On Thu, 28 Jan 2010 09:38:14 -0800
Post by Jon Forrest
I'm about to spend ~$20K on a new cluster
that will be a proof-of-concept for doing
GPU-based computing in one of the research
groups here.
A GPU cluster is different from a traditional
1) The CPU speed and number of cores are not
that important because most of the computing will
be done inside the GPU.
The speed doesn't matter so much, but the number of cores does. You
should have at least one core per GPU, since the CPU is in charge of
scheduling and initiating memory transfers (and, if DMA is not set up,
handling the transfers itself).

When latency is an issue (especially for jobs with a lot of CPU-side
scheduling) the CPU polls the GPU for results, which can bump CPU
usage. Nehalem raises another issue: there is no separate northbridge,
and memory traffic goes via the CPU.
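(As an aside, if the polling bothers you, the runtime can be asked to
block instead of spin while waiting on the GPU. A minimal sketch,
illustrative only; the flag is cudaDeviceScheduleBlockingSync in
current toolkits, cudaDeviceBlockingSync in older ones:)

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    /* Must be called before anything else creates a CUDA context. */
    if (cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync) != cudaSuccess) {
        fprintf(stderr, "could not set device flags\n");
        return 1;
    }
    /* ... launch kernels as usual; synchronizing calls such as
       cudaMemcpy() will now sleep rather than burn a CPU core ... */
    return 0;
}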

It is recommended, BTW, that you have at least as much system memory
as GPU memory, so with a Tesla that is 4GB per GPU.
Post by Jon Forrest
2) Serious GPU boards are large enough that
they don't easily fit into standard 1U pizza
boxes. Plus, they require more power than the
standard power supplies in such boxes can
provide. I'm not familiar with the boxes
that therefore should be used in a GPU cluster.
You use dedicated systems. Either one 1U pizza box for the CPUs and a
matched 1U Tesla S1070 pizza box, which has 4 Tesla GPUs
http://www.nvidia.com/object/product_tesla_s1070_us.html
or there are several vendors out there that match two Tesla GPUs
(usually the Tesla M1060 in this case, which is a passively cooled
version of the C1060)
http://www.nvidia.com/object/product_tesla_m1060_us.html
to a dual-CPU Xeon in a 1U system.
You can start here (the links page from NVidia):
http://www.nvidia.com/object/tesla_preconfigured_clusters_wtb.html

There are other specialized options if you want, but most of them are
aimed at higher-budget clusters.

You can push it in terms of power, as each Tesla takes about 160W;
adding what the CPU and the rest of the system require, a 1000W power
supply should do.

The S1070 comes with a 1200W power supply on board.
Post by Jon Forrest
3) Ideally, I'd like to put more than one GPU
card in each compute node, but then I hit the
issues in #2 even harder.
You are looking for the Tesla S1070 or the previously mentioned solutions.
Post by Jon Forrest
4) Assuming that a GPU can't be "time shared",
this means that I'll have to set up my batch
engine to treat the GPU as a non-sharable resource.
This means that I'll only be able to run as many
jobs on a compute node as I have GPUs. This also means
that it would be wasteful to put CPUs in a compute
node with more cores than the number GPUs in the
node. (This is assuming that the jobs don't do
anything parallel on the CPUs - only on the GPUs).
Even if GPUs can be time shared, given the expense
of copying between main memory and GPU memory,
sharing GPUs among several processes will degrade
performance.
It doesn't have a swap-in/swap-out mechanism, so the way it may time
share is by alternating kernels, as long as there is enough memory.
This shouldn't be done for HPC anyway (same with the CPU, by the way,
due to NUMA/L2-cache and context-switching issues).

What you want to do is set the cards to compute-exclusive mode and
then tell the users not to choose a card explicitly. The context
creation call will then pick the next available card automatically.
With the Tesla S1070 you would then present the machine to the
scheduler as having four slots, one per GPU.
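(A minimal sketch of that selection logic, illustrative only: with the
cards in compute-exclusive mode -- set via nvidia-smi, whose exact
flags have varied between driver generations -- a job can simply try
devices in order until context creation succeeds on a free one:)

#include <cuda_runtime.h>

int pick_free_gpu(void)
{
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess) return -1;
    for (int i = 0; i < n; ++i) {
        if (cudaSetDevice(i) != cudaSuccess) continue;
        /* cudaFree(0) forces context creation; in exclusive mode it
           fails if another process already owns this GPU. */
        if (cudaFree(0) == cudaSuccess) return i;   /* got this GPU */
        cudaGetLastError();      /* clear the error and try the next */
    }
    return -1;                   /* all GPUs busy */
}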

The processes will be sharing the PCI bus for communication, though,
so you may prefer to set up the system as 1 job per machine, or at
least use a round-robin scheduler.
Post by Jon Forrest
Are there any other issues I'm leaving out?
Take note that the S1070 is ~$6k, so you are talking about at most two
to three machines here with your budget.

Also, don't even think about putting that S1070 anywhere but a server
room, or at least nowhere with users nearby, as it makes a lot of
noise.
Jon Forrest
2010-01-30 18:24:09 UTC
Permalink
Post by Micha Feigin
It is recommended BTW, that you have at least the same amount of system memory
as GPU memory, so with tesla it is 4GB per GPU.
I'm not going to get Teslas, for several reasons:

1) This is a proof of concept cluster. Spending $1200
per graphics card means that the GPUs alone, assuming
2 GPUs, would cost as much as a whole node with
2 consumer-grade cards. (See below)

2) We know that the Fermi cards are coming out
soon. If we were going to spend big bucks
on GPUs, we'd wait for them. But, our funding
runs out before the Fermis will be available.
This is too bad but there's nothing I can do
about it.

See below for comments regarding CPUs and cores.
Post by Micha Feigin
You use dedicated systems. Either one 1u pizza box for the CPU and a matched 1u
tesla s1070 pizza box which has 4 tesla GPUs
Since my first post I've learned about the Supermicro boxes
that have space for two GPUs
(http://www.supermicro.com/products/system/1U/6016/SYS-6016GT-TF.cfm?GPU=) .
This looks like a good way to go for a proof-of-concept cluster. Plus,
since we have to pay $10/U/month at the Data Center, it's a good
way to use space.

The GPU that looks the most promising is the GeForce GTX275.
(http://www.evga.com/products/moreInfo.asp?pn=017-P3-1175-AR)
It has 1792MB of RAM and is only ~$300. I realize that there
are better cards but for this proof-of-concept cluster we
want to get the best bang for the buck. Later, after we've
ported our programs, and have some experience optimizing them,
then we'll consider something better, probably using whatever
the best Fermi-based card is.

The research group that will be purchasing this cluster does
molecular dynamics simulations that usually take 24 hours or more
to complete using quad-core Xeons. We hope to bring down this
time substantially.
Post by Micha Feigin
It doesn't have a swap in/swap out mechanism, so the way it may time share is
by alternating kernels as long as there is enough memory. Shouldn't be done for
HPC (same with CPU by the way due to numa/l2 cache and context switching
issues).
Right. So this means 4 cores should be good enough for 2 GPUs.
I wish somebody made a motherboard that would take the 6-core
AMD Istanbuls, but they don't. Putting two 4-core CPUs on the
motherboard might not be worth the cost. I'm not sure.
Post by Micha Feigin
The processes will be sharing the pci bus though for communications so you may
prefer to setup the system as 1 job per machine or at least a round robin
scheduler.
This is another reason not to go crazy with lots of cores.
They'll be sitting idle most of the time, unless I also
create queues for normal non-GPU jobs.
Post by Micha Feigin
Take note that the s1070 is ~6k$ so you are talking at most two to three
machines here with your budget.
Ha, ha!! ~$6K should get me two compute nodes, complete
with graphics cards.

I appreciate everyone's comments, and I welcome more.

Cordially,
--
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
***@berkeley.edu
Jon Forrest
2010-01-31 01:30:31 UTC
Permalink
Hi Jon,
I must emphasize what David Mathog said about the importance of the gpu
programming model.
I don't doubt this at all. Fortunately, we have lots
of very smart people here at UC Berkeley. I have
the utmost confidence that they will figure this
stuff out. My job is to purchase and configure the
cluster.
My perspective (with hopefully not too much opinion added)
OpenCL vs CUDA - OpenCL is 1/10th as popular, lacks in features, more
tedious to write and in an effort to stay generic loses the potential to
fully exploit the gpu. At one point the performance of the drivers from
Nvidia was not equivalent, but I think that's been fixed. (This does not
mean all vendors are unilaterally doing a good job)
This is very interesting news. As far as I know, nobody is doing
anything with OpenCL in the College of Chemistry around here.
On the other hand, we've been following all the press about how
it's going to be the great unifier so that it won't be necessary
to use a proprietary API such as CUDA anymore. At this point it's too
early to do anything with OpenCL until our colleagues in
the Computer Science department have made a pass at it and
have experiences to talk about.
Have you considered sharing access with another research lab that has
already purchased something similar?
(Some vendors may also be willing to let you run your codes in exchange
for feedback.)
There's nobody else at UC Berkeley I know of who has a GPU
cluster.

I don't know of any vendor who'd be willing to volunteer
their cluster. If anybody would like to volunteer, step
right up.
1) sw thread synchronization chews up processor time
Right, but let's say right now 80% of the CPU time is spent
in routines that will eventually be done in the GPU (I'm
just making this number up). I don't see how having a faster
CPU would help overall.
2) Do you already know if your code has enough computational complexity
to outweigh the memory access costs?
In general, yes. A couple of grad students have ported some
of their code to CUDA with excellent results. Plus, molecular
dynamics is well suited to GPU programming, or so I'm told.
Several of the popular opensource MD packages have already
been ported also with excellent results.
3) Do you know if the GTX275 has enough vram? Your benchmarks will
suffer if you start going to gart and page faulting
The one I mentioned in my posting has 1.8GB of RAM. If this isn't
enough then we're in trouble. The grad student I mentioned
has been using the 898MB version of this card without problems.
4) I can tell you 100% that not all gpu are created equally when it
comes to handling cuda code. I don't have experience with the GTX275,
but if you do hit issues I would be curious to hear about them.
I've heard that it's much better than the 9500GT that we first
started using. Since the 9500GT is a much cheaper card we didn't expect
much performance out of it, but the grad student who was trying
to use it said that there were problems with it not releasing memory,
resulting in having to reboot the host. I don't know the details.
Some questions in return..
Is your code currently C, C++ or Fortran?
The most important program for this group is in Fortran.
We're going to keep it in Fortran, but we're going to
write C interfaces to the routines that will run on
the GPU, and then write these routines in C.
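(As an illustration only -- the names here are made up -- the C side
of such an interface might look roughly like this, with the trailing
underscore and pass-by-reference arguments matching the usual Fortran
calling convention; ISO_C_BINDING on the Fortran side would be the
cleaner way to do it:)

#include <cuda_runtime.h>

__global__ void scale_forces(float *f, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f[i] *= s;
}

/* Callable from Fortran as: call gpu_scale_forces(f, s, n) */
extern "C" void gpu_scale_forces_(float *f, float *s, int *n)
{
    float *d_f;
    size_t bytes = (size_t)(*n) * sizeof(float);
    cudaMalloc((void **)&d_f, bytes);
    cudaMemcpy(d_f, f, bytes, cudaMemcpyHostToDevice);
    scale_forces<<<(*n + 255) / 256, 256>>>(d_f, *s, *n);
    cudaMemcpy(f, d_f, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_f);
}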
Is there any interest in optimizations at the compiler level which could
benefit molecular dynamics simulations?
Of course, but at what price? I'm talking about both the price in
dollars and the price in non-standard directives.

I'm not a chemist so I don't know what would speed up MD calculations
more than a good GPU.

Cordially,
--
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
***@berkeley.edu
Micha Feigin
2010-01-31 17:33:58 UTC
Permalink
On Sat, 30 Jan 2010 17:30:31 -0800
Post by Jon Forrest
Hi Jon,
I must emphasize what David Mathog said about the importance of the gpu
programming model.
I don't doubt this at all. Fortunately, we have lots
of very smart people here at UC Berkeley. I have
the utmost confidence that they will figure this
stuff out. My job is to purchase and configure the
cluster.
My perspective (with hopefully not too much opinion added)
OpenCL vs CUDA - OpenCL is 1/10th as popular, lacks in features, more
tedious to write and in an effort to stay generic loses the potential to
fully exploit the gpu. At one point the performance of the drivers from
Nvidia was not equivalent, but I think that's been fixed. (This does not
mean all vendors are unilaterally doing a good job)
This is very interesting news. As far as I know, nobody is doing
anything with OpenCL in the College of Chemistry around here.
On the other hand, we've been following all the press about how
it's going to be the great unifier so that it won't be necessary
to use a proprietary API such as CUDA anymore. At this point it's too
early to doing anything with OpenCL until our colleagues in
the Computer Science department have made a pass at it and
have experiences to talk about.
People are starting to work with OpenCL, but I don't think it's ready
yet. The NVidia implementation is still buggy and not up to par with
CUDA in terms of performance. Code is longer and more tedious (it
mostly matches the NVidia driver API instead of the much easier-to-use
C runtime API). I know that although NVidia says it fully supports
OpenCL, it doesn't like it too much. NVidia techs told me that the
performance difference can be about 1:2.

Cuda has existed for 5 years (plus another 2 internally at NVidia).
Version 1 of OpenCL was released in December 2008, and they started
working on 1.1 immediately after that. It has also been broken almost
from the start due to too many companies controlling it (it's designed
by a consortium) and trying to solve the problem for too many
scenarios at the same time.

ATI has also started supporting OpenCL, but I don't have any
experience with that. Their upside is that it also allows compiling
CPU versions.

I would start with Cuda, as the move to OpenCL is very simple
afterwards if you wish, and Cuda is easier to start with.

Also note that OpenCL gives you functional portability but not
performance portability. You will not write the same OpenCL code for
NVidia, ATI, CPUs, etc. The vectorization is all different (NVidia
discourages vectorization, ATI requires it, SSE requires yet another
kind), the memory model is different, the size of the work groups
should be different, and so on.
Post by Jon Forrest
Have you considered sharing access with another research lab that has
already purchased something similar?
(Some vendors may also be willing to let you run your codes in exchange
for feedback.)
There's nobody else at UC Berkeley I know of who has a GPU
cluster.
I don't know of any vendor who'd be willing to volunteer
their cluster. If anybody would like to volunteer, step
right up.
Are you aware of the NVidia professor partnership program? We got a Tesla S1070
for free from them.

http://www.nvidia.com/page/professor_partnership.html
Post by Jon Forrest
1) sw thread synchronization chews up processor time
Right, but let's say right now 80% of the CPU time is spent
in routines that will eventually be done in the GPU (I'm
just making this number up). I don't see how having a faster
CPU would help overall.
My experience is that unless you wish to write hybrid code (code that partly
runs on the GPU and partly on the CPU in parallel to fully utilize the system)
you don't need to care too much about the CPU power.

Note that the Cuda model is asynchronous so you can run code in parallel
between the GPU and CPU.
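(A minimal sketch of that overlap, illustrative only: kernel launches
return immediately, so the host is free to do other work until it
synchronizes:)

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void square(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= x[i];
}

int main(void)
{
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    float *d;
    for (int i = 0; i < n; ++i) h[i] = 2.0f;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    square<<<(n + 255) / 256, 256>>>(d, n);   /* returns immediately */

    /* ... the CPU is free to do unrelated work here while the GPU runs ... */

    /* cudaMemcpy implies a sync, so this waits for the kernel to finish */
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);
    cudaFree(d);
    free(h);
    return 0;
}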
Post by Jon Forrest
2) Do you already know if your code has enough computational complexity
to outweigh the memory access costs?
In general, yes. A couple of grad students have ported some
of their code to CUDA with excellent results. Plus, molecular
dynamics is well suited to GPU programming, or so I'm told.
Several of the popular opensource MD packages have already
been ported also with excellent results.
The issue is not only computational complexity but also regularity of
memory access. Random memory accesses on the GPU can seriously kill
your performance.
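(To illustrate -- a sketch only: threads in a (half-)warp that read
consecutive addresses get their loads coalesced into a few wide memory
transactions, while a strided or scattered pattern turns into many
separate transactions and can be several times slower:)

__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* thread k reads element k */
    if (i < n) out[i] = in[i];
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;  /* scattered reads */
    if (i < n) out[i] = in[i];
}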

Also note that until Fermi comes out, double-precision performance is
horrible. If you can't use single precision then GPUs are probably not
for you at the moment. Double precision on the G200 is around 1/8 of
single precision, and the G80/G90 don't have double precision at all.

Fermi improves that by finally providing double precision running at
1/2 the single-precision speed (basically combining two FPUs into one
double-precision unit).
Post by Jon Forrest
3) Do you know if the GTX275 has enough vram? Your benchmarks will
suffer if you start going to gart and page faulting
You don't have page faulting on the GPU; GPUs don't have virtual
memory. If you don't have enough memory, the allocation will just
fail.
Post by Jon Forrest
The one I mentioned in my posting has 1.8GB of RAM. If this isn't
enough then we're in trouble. The grad student I mentioned
has been using the 898MB version of this card without problems.
4) I can tell you 100% that not all gpu are created equally when it
comes to handling cuda code. I don't have experience with the GTX275,
but if you do hit issues I would be curious to hear about them.
I've heard that it's much better than the 9500GT that we first
started using. Since the 9500GT is a much cheaper card we didn't expect
much performance out of it, but the grad student who was trying
to use it said that there were problems with it not releasing memory,
resulting in having to reboot the host. I don't know the details.
I don't have any issues with releasing memory. The big differences are
between the G80/G90 series (including the 9500GT), which are CUDA
compute capability 1.1, and the G200, which is compute capability 1.3.

Memory handling is much better on the 1.3 GPUs (the access patterns
needed to fully utilize the memory bandwidth are much more lenient).
The G200 also has double-precision support (although at about 1/8 the
speed of single precision). There is also more support for atomic
operations and a few other differences, although the biggest
difference is the memory bandwidth utilization.

Don't bother with the 8000 and 9000 series for HPC and Cuda. They're
cheaper for learning, but not so much for deployment.
Post by Jon Forrest
Some questions in return..
Is your code currently C, C++ or Fortran?
The most important program for this group is in Fortran.
We're going to keep it in Fortran, but we're going to
write C interfaces to the routines that will run on
the GPU, and then write these routines in C.
You may want to look into the PGI compiler. They introduced Cuda
support for Fortran, I believe in November:
http://www.pgroup.com/resources/cudafortran.htm
Post by Jon Forrest
Is there any interest in optimizations at the compiler level which could
benefit molecular dynamics simulations?
Of course, but at what price? I'm talking both about
both the price in dollars, and the price in non-standard
directives.
I'm not a chemist so I don't know what would speed up MD calculations
more than a good GPU.
On the CPU side you can utilize SSE. You can also use single precision
along with SSE and good cache utilization to greatly speed things up
on the CPU as well.
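(For example -- a sketch only -- the same scale-an-array loop written
with SSE intrinsics processes four single-precision values per
instruction; this assumes n is a multiple of 4 and x is 16-byte
aligned:)

#include <xmmintrin.h>

void scale_sse(float *x, float s, int n)
{
    __m128 vs = _mm_set1_ps(s);
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(&x[i]);           /* aligned 4-wide load */
        _mm_store_ps(&x[i], _mm_mul_ps(v, vs));  /* multiply and store  */
    }
}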

My personal experience, though, is that it's much harder to apply such
optimizations on the CPU than on the GPU for most problems.
C. Bergström
2010-01-31 18:15:12 UTC
Permalink
Post by Micha Feigin
On Sat, 30 Jan 2010 17:30:31 -0800
Post by Jon Forrest
Hi Jon,
I must emphasize what David Mathog said about the importance of the gpu
programming model.
I don't doubt this at all. Fortunately, we have lots
of very smart people here at UC Berkeley. I have
the utmost confidence that they will figure this
stuff out. My job is to purchase and configure the
cluster.
My perspective (with hopefully not too much opinion added)
OpenCL vs CUDA - OpenCL is 1/10th as popular, lacks in features, more
tedious to write and in an effort to stay generic loses the potential to
fully exploit the gpu. At one point the performance of the drivers from
Nvidia was not equivalent, but I think that's been fixed. (This does not
mean all vendors are unilaterally doing a good job)
This is very interesting news. As far as I know, nobody is doing
anything with OpenCL in the College of Chemistry around here.
On the other hand, we've been following all the press about how
it's going to be the great unifier so that it won't be necessary
to use a proprietary API such as CUDA anymore. At this point it's too
early to doing anything with OpenCL until our colleagues in
the Computer Science department have made a pass at it and
have experiences to talk about.
People are starting to work with OpenCL but I don't think that it's ready yet.
The nvidia implementation is still buggy and not up to par against cuda in
terms of performance. Code is longer and more tedious (mostly matches the
nvidia driver model instead of the much easier to use c api). I know that
although NVidia say that they fully support it, they don't like it too much.
NVidia techs told me that the performance difference can be about 1:2.
That used to be true, but I thought they fixed that? (How old is your
information?)
Post by Micha Feigin
Cuda exists for 5 years (and another 2 internally in NVidia). Version 1 of
OpenCL was released December 2008 and they started working on 1.1 immediately
after that. It has also been broken almost from the start due to too many
companies controling it (it's designed by a consortium) and trying to solve the
problem for too many scenarios at the same time.
The problem isn't too many companies.. It was IBM's cell requirements
afaik.. Thank god that's dead now..
Post by Micha Feigin
ATI also started supporting OpenCL but I don't have any experience with that.
Their upside is that it also allows compiling cpu versions.
I would start with cuda as the move to OpenCL is very simple afterwords if you
wish and Cuda is easier to start with.
I would start with a directive-based approach, which is entirely more
sane than CUDA or OpenCL.. especially if his code is primarily
Fortran. I think writing C interfaces so that you can call the GPU is
a maintenance nightmare and will not only be time consuming, but will
later make optimizing the application *a lot* harder. (I say this with
my gpu compiler hat on and am more than happy to go into specifics.)
Post by Micha Feigin
Also note that OpenCL gives you functional portability but not performance
portability. You will not write the same OpenCL code for NVidia, ATI, CPUs etc.
The vectorization should be all different (NVidia discourage vectorization, ATI
require vectorization, SSE requires different vectorization), the memory model
is different, the size of the work groups should be different, etc.
Please look at HMPP and see if it may solve this..
Post by Micha Feigin
Post by Jon Forrest
Have you considered sharing access with another research lab that has
already purchased something similar?
(Some vendors may also be willing to let you run your codes in exchange
for feedback.)
There's nobody else at UC Berkeley I know of who has a GPU
cluster.
I don't know of any vendor who'd be willing to volunteer
their cluster. If anybody would like to volunteer, step
right up.
Are you aware of the NVidia professor partnership program? We got a Tesla S1070
for free from them.
http://www.nvidia.com/page/professor_partnership.html
Post by Jon Forrest
1) sw thread synchronization chews up processor time
Right, but let's say right now 80% of the CPU time is spent
in routines that will eventually be done in the GPU (I'm
just making this number up). I don't see how having a faster
CPU would help overall.
My experience is that unless you wish to write hybrid code (code that partly
runs on the GPU and partly on the CPU in parallel to fully utilize the system)
you don't need to care too much about the CPU power.
Note that the Cuda model is asynchronous so you can run code in parallel
between the GPU and CPU.
Post by Jon Forrest
2) Do you already know if your code has enough computational complexity
to outweigh the memory access costs?
In general, yes. A couple of grad students have ported some
of their code to CUDA with excellent results. Plus, molecular
dynamics is well suited to GPU programming, or so I'm told.
Several of the popular opensource MD packages have already
been ported also with excellent results.
The issue is not only computation complexity but also regular memory accesses.
Random memory accesses on the GPU can seriously kill you performance.
I think I mentioned memory accesses.. Are you talking about page faults
or what specifically? (My perspective is skewed and I may be using a
different term.)
Post by Micha Feigin
Also note that until fermi comes out the double precision performance is
horrible. If you can't use single precision then GPUs are probably not for you
at the moment. Double precision on g200 is around an 1/8 of single precision
and g80/g90 don't have double precision at all.
Fermi improves that by finally providing double precision running an 1/2 the
single precision speed (basically combining two FPUs into on double precision
unit).
Post by Jon Forrest
3) Do you know if the GTX275 has enough vram? Your benchmarks will
suffer if you start going to gart and page faulting
You don't have page faulting on the GPU, GPUs don't have virtual memory. If you
don't have enough memory the allocation will just fail.
Whatever you want to label it, at a hardware level NVidia cards *do*
have vram and the drivers *can* swap to system memory. They use two
things to deal with this: a) a hw-based page-fault mechanism and b)
DMA copying to reduce CPU overhead. If you try to allocate more than
is available on the card, yes, it will probably just fail. (We are
working on the drivers.) My point was about what happens between the
context switches of kernels.
Post by Micha Feigin
Post by Jon Forrest
The one I mentioned in my posting has 1.8GB of RAM. If this isn't
enough then we're in trouble. The grad student I mentioned
has been using the 898MB version of this card without problems.
4) I can tell you 100% that not all gpu are created equally when it
comes to handling cuda code. I don't have experience with the GTX275,
but if you do hit issues I would be curious to hear about them.
I've heard that it's much better than the 9500GT that we first
started using. Since the 9500GT is a much cheaper card we didn't expect
much performance out of it, but the grad student who was trying
to use it said that there were problems with it not releasing memory,
resulting in having to reboot the host. I don't know the details.
I don't have any issues with releasing memory. The big differences are between
the g80/g90 series (including the 9500GT) which is a 1.1 Cuda model and the
g200 which uses the 1.3 cuda model.
Memory handling is much better on the 1.3 GPUs (memory accesses for fully
utilizing the memory bandwidth are much more lenient). The g200 also has double
precision support (although at about 1/8 the speed of single precision). There
is also more support for atomic operations and a few other differences,
although the biggest difference is the memory bandwidth utilization.
Don't bother with the 8000 and 9000 for HPC and Cuda. Cheaper for learning but
not so much for deployment.
Post by Jon Forrest
Some questions in return..
Is your code currently C, C++ or Fortran?
The most important program for this group is in Fortran.
We're going to keep it in Fortran, but we're going to
write C interfaces to the routines that will run on
the GPU, and then write these routines in C.
You may want to look into the pgi compiler. They introduced Cuda support for
Fortran, I believe since November.
http://www.pgroup.com/resources/cudafortran.htm
Can anyone give positive feedback? (Disclaimer: I'm biased, but since
we are making specific recommendations)
http://www.caps-entreprise.com/fr/page/index.php?id=49&p_p=36
Post by Micha Feigin
Post by Jon Forrest
Is there any interest in optimizations at the compiler level which could
benefit molecular dynamics simulations?
Of course, but at what price? I'm talking both about
both the price in dollars, and the price in non-standard
directives.
I'm not a chemist so I don't know what would speed up MD calculations
more than a good GPU.
On the cpu side you can utilize SSE. You can also use single precision on the
CPU along with SSE and good cache utilization to greatly speed up things also
on the CPU.
My personal experience though is that it's much harder to use such optimization
on the CPU than on the GPU for most problems.
CUDA/OpenCL and friends implicitly identify which areas can be
vectorized and then explicitly offload them. You are comparing
apples and oranges here..
Prentice Bisbal
2010-02-03 14:56:36 UTC
Permalink
Post by Micha Feigin
NVidia techs told me that the performance difference can be about 1:2.
That used to be true, but I thought they fixed that? (How old is your
information)
I heard this myself many times at SC09. And that was in reference to
Fermi, so I doubt it's changed much since then.
--
Prentice
Micha Feigin
2010-01-31 17:06:48 UTC
Permalink
On Sat, 30 Jan 2010 10:24:09 -0800
Post by Jon Forrest
Post by Micha Feigin
It is recommended BTW, that you have at least the same amount of system memory
as GPU memory, so with tesla it is 4GB per GPU.
1) This is a proof of concept cluster. Spending $1200
per graphics card means that the GPUs alone, assuming
2 GPUs, would cost as much as a whole node with
2 consumer-grade cards. (See below)
Be very, very sure that consumer GeForces can go into 1U boxes. It's
not so much the space as that I'm skeptical about their ability to
handle the thermal issues. They are just not designed for this kind of
work.

Note that GeForces are overclocked (my GTX 285 by 30% compared to a
Tesla with the same chip) and are actively cooled, which means that
you need to get air flowing into the side fan. That's exactly why they
put the Tesla M, and not the C, into those boxes.

The geforce driver also throttles the card under load to solve thermal issues.

You will probably want to underclock the cards to the Tesla spec and
be sure to monitor their thermal state.

I know someone who works with 3 gtx295 in a desktop box and he initially had
some thermal shutdown issues with older drivers. I'm guessing that the newer
drivers just throttle the cards more aggressively under load.
Post by Jon Forrest
2) We know that the Fermi cards are coming out
soon. If we were going to spend big bucks
on GPUs, we'd wait for them. But, our funding
runs out before the Fermis will be available.
This is too bad but there's nothing I can do
about it.
Check out the mad scientist program. It's supposed to end today, but
maybe if you talk to NVidia they can still get you into it (they are
rather flexible, especially with universities, and they also offer it
for companies):
http://www.nvidia.com/object/mad_science_promo.html
You can buy a current Tesla (T10 core) and upgrade it to a Fermi (T20
core) when it comes out, for the cost difference. This may be more
cost effective if you do plan to build a Fermi cluster later on. It is
designed to upgrade to the same line though (C, M, or S), so you may
want to consider now which one to go with.
Post by Jon Forrest
See below for comments regarding CPUs and cores.
Post by Micha Feigin
You use dedicated systems. Either one 1u pizza box for the CPU and a matched 1u
tesla s1070 pizza box which has 4 tesla GPUs
Since my first post I've learned about the Supermicro boxes
that have space for two GPUs
(http://www.supermicro.com/products/system/1U/6016/SYS-6016GT-TF.cfm?GPU=) .
This looks like a good way to go for a proof-of-concept cluster. Plus,
since we have to pay $10/U/month at the Data Center, it's a good
way to use space.
See my previous comment
Gerry Creager
2010-01-31 20:31:40 UTC
Permalink
I employ a GX285 in a dedicated remote-access graphics box for
data-local visualization and run into some of these issues, too. More
inline, but Micha has it right.
Post by Micha Feigin
On Sat, 30 Jan 2010 10:24:09 -0800
Post by Jon Forrest
Post by Micha Feigin
It is recommended BTW, that you have at least the same amount of system memory
as GPU memory, so with tesla it is 4GB per GPU.
1) This is a proof of concept cluster. Spending $1200
per graphics card means that the GPUs alone, assuming
2 GPUs, would cost as much as a whole node with
2 consumer-grade cards. (See below)
Be very very sure that consumer geforces can go in 1u boxes. It's not so much
the space as much as I'm skeptical with their ability of handling the thermal
issues. They are just not designed for this kind of work.
I've had to go to 2u and eventually to larger boxes because of power
supply and air-flow requirements. This is a big issue.
Post by Micha Feigin
Note that geforces are overclocked (my gtx 285 by 30% compared to a tesla with
the same chip) and are actively cooled, which means that you need to get air
flowing into the side fan. That's exactly why they put the tesla m and not the
c into those boxes.
Depends on where you get your gx from. I've got one that claims not to
be overclocked but also claims to be as fast as one that says it IS
overclocked. Since I'm not yet interested enough to actually look at
the onboard chip speeds, I don't know. However, the one I've now got
is in a 4U with additional forced air in the case to address an
overtemp problem we had that was primarily flow-related (we tried
extra fans in the 2U and 3U cases). We've not wandered too far into
GPGPU processing... our user community has not shown an interest in
it, but for graphics, it's useful.
Post by Micha Feigin
The geforce driver also throttles the card under load to solve thermal issues.
I believe this depends on onboard temp monitoring. Again, sufficient
airflow is your friend.
gerry
Mark Hahn
2010-01-31 22:06:34 UTC
Permalink
Post by Micha Feigin
Be very very sure that consumer geforces can go in 1u boxes. It's not so much
the space as much as I'm skeptical with their ability of handling the thermal
issues. They are just not designed for this kind of work.
I've had to go to 2u and eventually to larger boxes because of power supply
and air-flow requirements. This is a big issue.
I'm a bit puzzled here. Supermicro sells servers that take either two
M1060's, or two C1060's, or two of any PCIe 2.0 x16 cards. their airflow
design seems at least thought-about, and their PSU is 1400W.

C1060 specs merely say "200W max, 160W typical" - which is probably
about the same as gtx275 according to wikipedia. so something like
600W expected from 1U - not really that hard, especially if you don't
have a wall of 40u racks full of them...
Post by Micha Feigin
Note that geforces are overclocked (my gtx 285 by 30% compared to a tesla with
the same chip)
well, they're tuned differently: gf cards have substantially higher memory
clocks and lower shader clocks. tesla has higher shader and substantially
slower memory clocks (presumably because there are more loads on the bus.)
Post by Micha Feigin
and are actively cooled, which means that you need to get
air
flowing into the side fan. That's exactly why they put the tesla m and not the
c into those boxes.
why is this a problem with 1U? or do you really mean "double-wide cards
don't provide enough clearance in 1U to get air to the card's intake"?

-mark hahn
Micha
2010-02-01 00:31:30 UTC
Permalink
Post by Gerry Creager
Post by Micha Feigin
Be very very sure that consumer geforces can go in 1u boxes. It's not so much
the space as much as I'm skeptical with their ability of handling the thermal
issues. They are just not designed for this kind of work.
I've had to go to 2u and eventually to larger boxes because of power
supply and air-flow requirements. This is a big issue.
I'm a bit puzzled here. sumermicro sells servers that take either two
M1060's, or two C1060's, or two of any pcie 2 x16 cpus. their airflow
design seems at least thought-about, and their PSU is 1400W.
The PSU is enough
C1060 specs merely say "200W max, 160W typical" - which is probably
about the same as gtx275 according to wikipedia. so something like 600W
expected from 1U - not really that hard, especially if you don't
have a wall of 40u racks full of them...
Post by Gerry Creager
Post by Micha Feigin
Note that geforces are overclocked (my gtx 285 by 30% compared to a tesla with
the same chip)
well, they're tuned differently: gf cards have substantially higher memory
clocks and lower shader clocks. tesla has higher shader and substantially
slower memory clocks (presumably because there are more loads on the bus.)
Yes, they're tuned differently, but that's because they are meant for
different markets. GeForces are for the gamer market and are assumed
to run for several hours at a time, without too many of them in the
machine fighting for airflow (a gamer setup). Throttling there is no
big issue if needed.

tesla is a server product that needs to run 24/7 for days/months without
throttling (consistent output). Usually there are several of them in one machine
(or shared quadro + tesla)

Another issue is tolerance to memory errors. Higher temp/clock can cause more
memory errors. These may cause small unnoticeable glitches for game graphics but
will ruin hpc results.

The two main issues taken into account for tuning are running time and
leniency toward throttling.
Post by Gerry Creager
Post by Micha Feigin
and are actively cooled, which means that you need to get air
flowing into the side fan. That's exactly why they put the tesla m and not the
c into those boxes.
why is this a problem with 1U? or do you really mean "double-wide cards
don't provide enough clearance in 1U to get air to the card's intake"?
all the cards we are talking about are double wide. c1060 is actively cooled and
is designed for a desktop pc. m1060 is passively cooled and designed for a 1u
server.

the c1060 assumes side air intake and rear exhaust. m1060 expects through flow
and no external access to exhaust.

Different design based on different airflow paradigms.

I never built such systems, so I'm not talking from experience but
from assumption (we are using the C1060 in desktops and the S1070 in
servers). I'm not sure whether a double-wide card with a side air
intake in a 1U box allows any airflow to reach the intake and thus the
GPU. Maybe you can mod the card by taking the plastic shroud off to
improve airflow, though.

It looks from their site that they support double wide cards in their boxes so I
guess that they tested the cooling. They definitely have more experience than me
with such setups.

I didn't say that it doesn't work; I just advised that you make sure,
as it sounded borderline to me and, as noted previously by someone
else, it has caused problems for people.
Gerry Creager
2010-02-01 00:32:01 UTC
Permalink
Post by Mark Hahn
Post by Gerry Creager
Post by Micha Feigin
Be very very sure that consumer geforces can go in 1u boxes. It's not so much
the space as much as I'm skeptical with their ability of handling the thermal
issues. They are just not designed for this kind of work.
I've had to go to 2u and eventually to larger boxes because of power
supply and air-flow requirements. This is a big issue.
I'm a bit puzzled here. sumermicro sells servers that take either two
M1060's, or two C1060's, or two of any pcie 2 x16 cpus. their airflow
design seems at least thought-about, and their PSU is 1400W.
C1060 specs merely say "200W max, 160W typical" - which is probably
about the same as gtx275 according to wikipedia. so something like 600W
expected from 1U - not really that hard, especially if you don't
have a wall of 40u racks full of them...
A little over a year ago, a 1u 600w supply was a bit difficult to find
for < $400, and additional fans for one required buying a specialty 1u
case. I could have driven node price over $4k with CPUs, memory, a large
onboard scratch, etc. I, too, was building a proof-of-concept box at the
time. Now, it's used almost daily by several folks, and I'm thinking of
building a new POC to house 4x gx's...

And still no user interest in CUDA, and without user interest I don't
have enough time to play with it myself, given my own research
program, which isn't computational science.
Post by Mark Hahn
Post by Gerry Creager
Post by Micha Feigin
Note that geforces are overclocked (my gtx 285 by 30% compared to a tesla with
the same chip)
well, they're tuned differently: gf cards have substantially higher memory
clocks and lower shader clocks. tesla has higher shader and substantially
slower memory clocks (presumably because there are more loads on the bus.)
Didn't realize. Thanks.
Post by Mark Hahn
Post by Gerry Creager
Post by Micha Feigin
and are actively cooled, which means that you need to get air
flowing into the side fan. That's exactly why they put the tesla m and not the
c into those boxes.
why is this a problem with 1U? or do you really mean "double-wide cards
don't provide enough clearance in 1U to get air to the card's intake"?
That's what *I* found, anyway. Yeah, what you said. Sorry, I thought
that was obvious. When you turn one of those cards on its side, you do
have trouble with card-width clearance.

gerry
C. Bergström
2010-01-30 22:52:43 UTC
Permalink
Hi Jon,

I must emphasize what David Mathog said about the importance of the gpu
programming model.

My perspective (with, I hope, not too much opinion added):
OpenCL vs. CUDA - OpenCL is a tenth as popular, lacks features, is more
tedious to write, and in its effort to stay generic loses the potential to
fully exploit the GPU. At one point the performance of Nvidia's OpenCL
driver was not equivalent to CUDA's, but I think that's been fixed. (This does
not mean all vendors are doing a uniformly good job.)

On HMPP and everything else I'm far too biased to offer my comments
publicly. (Feel free to email me off-list if curious.)

Have you considered sharing access with another research lab that has
already purchased something similar?
(Some vendors may also be willing to let you run your codes in exchange
for feedback.)

I'd not completely disregard the importance of the host processor.

1) Software thread synchronization chews up processor time.
2) Do you already know if your code has enough computational
complexity to outweigh the memory access costs?
3) Do you know if the GTX275 has enough VRAM? Your benchmarks will
suffer if you start going to GART memory and page faulting. (A quick
way to check a card's memory is sketched after this list.)
4) I can tell you 100% that not all GPUs are created equal when it
comes to handling CUDA code. I don't have experience with the GTX275,
but if you do hit issues I would be curious to hear about them.
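
[Editor's sketch, not from the thread] On point 3, here is a minimal example of
how you might query each card's total memory with the standard CUDA runtime API
before committing to a problem size; the file name and output format are purely
illustrative.

// vram_check.cu -- print each device's name, global memory, and SM count.
// Build with something like: nvcc vram_check.cu -o vram_check
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int ndev = 0;
    if (cudaGetDeviceCount(&ndev) != cudaSuccess || ndev == 0) {
        printf("no CUDA devices found\n");
        return 1;
    }
    for (int d = 0; d < ndev; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("device %d: %s, %.0f MB global memory, %d multiprocessors\n",
               d, prop.name, prop.totalGlobalMem / (1024.0 * 1024.0),
               prop.multiProcessorCount);
    }
    return 0;
}

If the working set of a single job approaches the reported global memory, the
benchmark numbers will say more about paging than about the kernel.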

Some questions in return..
Is your code currently C, C++ or Fortran?
Is there any interest in optimizations at the compiler level which could
benefit molecular dynamics simulations?


Best,

./Christopher
r***@comcast.net
2010-02-01 15:24:13 UTC
Permalink
Post by David Mathog
Post by Jon Forrest
Are there any other issues I'm leaving out?
Yes, the time and expense of rewriting your code from a CPU model to a
GPU model, and the learning curve for picking up this new skill. (Unless
you are lucky and somebody has already ported the software you use.)
Coming in on this late, but to reduce this workload there is PGI's version
10.0 compiler suite, which supports accelerator compiler directives. This
will reduce the coding effort, but probably suffers from the classical "if it is
easy, it won't perform as well" trade-off. My experience is limited, but
a nice intro can be found at:


http://www.pgroup.com/lit/articles/insider/v1n1a1.htm
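
[Editor's sketch, not from the thread] For a flavor of the directive approach,
here is a minimal example in the style of the PGI Accelerator model that article
describes; the routine name is hypothetical, and the exact pragma spelling and
compiler flags should be checked against PGI's own documentation.

/* saxpy_acc.c -- a loop offloaded via a PGI accelerator region (sketch).
 * Built with something along the lines of: pgcc -ta=nvidia -Minfo=accel
 * The restrict qualifiers help the compiler prove the loop iterations
 * are independent so it can generate GPU code. */
void saxpy_acc(int n, float a, const float *restrict x, float *restrict y)
{
    int i;
#pragma acc region
    {
        for (i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }
}

The appeal is that the same source still compiles as ordinary serial C when the
directives are ignored; the question, as rbw says, is how much performance that
convenience leaves on the table.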


You might also inquire with PGI about their SC09 course and class notes
or Google for them.


rbw


Jon Forrest
2010-02-01 19:53:30 UTC
Permalink
Post by r***@comcast.net
Coming in on this late, but to reduce this work load there is PGI's version
10.0 compiler suite which supports accelerator compiler directives. This
will reduce the coding effort, but probably suffer from the classical "if it is
easy, it won't perform as well" trade-off. My experience is limited, but
I'm not sure how much traction such a thing will get.
Let's say you have a big Fortran program that you want
to port to CUDA. Let's assume you already know where the
program spends its time, so you know which routines
are good candidates for running on the GPU.

Rather than rewriting the whole program in C[++],
wouldn't it be easiest to leave all the non-CUDA
parts of the program in Fortran, and then call
CUDA routines written in C[++]? Since the CUDA
routines will have to be rewritten anyway, why
write them in a language which would require
purchasing yet another compiler?
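
[Editor's sketch, not from the thread] A minimal illustration of the mixed
approach Jon describes, assuming a typical Fortran compiler of the era that
passes arguments by reference and appends a trailing underscore to external
names; the routine and file names are hypothetical.

// saxpy_gpu.cu -- CUDA wrapper callable from Fortran (sketch).
// Fortran passes arguments by reference; many Fortran compilers expect
// external C symbols to carry a trailing underscore, hence saxpy_gpu_.
#include <cuda_runtime.h>

__global__ void saxpy_kernel(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

extern "C" void saxpy_gpu_(int *n, float *a, float *x, float *y)
{
    size_t bytes = (size_t)(*n) * sizeof(float);
    float *dx, *dy;

    cudaMalloc((void **)&dx, bytes);
    cudaMalloc((void **)&dy, bytes);
    cudaMemcpy(dx, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (*n + threads - 1) / threads;
    saxpy_kernel<<<blocks, threads>>>(*n, *a, dx, dy);

    cudaMemcpy(y, dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
}

On the Fortran side the call is simply call saxpy_gpu(n, a, x, y), and the
nvcc-compiled object is linked into the Fortran build together with the CUDA
runtime library; the exact link flags depend on your toolchain.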

Cordially,
--
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
***@berkeley.edu
Patrick LeGresley
2010-01-31 19:45:49 UTC
Permalink
I've found this presentation from John Stone at SC09 to be a very
good comparison of CUDA versus OpenCL performance on real code:
http://www.ks.uiuc.edu/Research/gpu/files/openclbof_stone2009.pdf

My take away from this presentation, which matches my personal
experience comparing the two, is that CUDA and OpenCL performance on
NVIDIA hardware are within a few percent. Trying to use the same
source code on hardware from different vendors obviously has the
expected performance pitfalls.

The biggest thing to watch out for may be performance regressions
from one release of CUDA to the next, and even among slightly
different driver versions. You can see an example of this from John
on slide 17.

Cheers,

Patrick
r***@comcast.net
2010-02-01 20:54:45 UTC
Permalink
Post by Jon Forrest
Post by r***@comcast.net
Coming in on this late, but to reduce this work load there is PGI's version
10.0 compiler suite which supports accelerator compiler directives. This
will reduce the coding effort, but probably suffer from the classical
"if it is
easy, it won't perform as well" trade-off. My experience is limited, but
I'm not sure how much traction such a thing will get.
Let's say you have a big Fortran program that you want
to port to CUDA. Let's assume you already know where the
program spends its time, so you know which routines
are good candidates for running on the GPU.
Rather than rewriting the whole program in C[++],
wouldn't it be easiest to leave all the non-CUDA
parts of the program in Fortran, and then to call
CUDA routines written in C[++]. Since the CUDA
routines will have to be rewritten anyway, why
write them in a language which would require
purchasing yet another compiler?
Mmm ... not sure I understand the response; perhaps it was meant for a
different message? In any case, the PGI software supports
accelerator directives for both C and Fortran, so for those languages I do
not see a need to rewrite whole applications. The question presented is
the same as always: what does the performance vs. programming-effort curve
look like, and how well does your code perform with directives to start
with? The PGI model is also hardware generic, and I believe the code runs
in parallel on the CPU when there is no GPU around. What will
gate interest is how well the PGI compiler group does at delivering performance
and how important portability is to the person developing the code.


HMPP offers a similar proposition ...


rbw

Micha
2010-02-01 23:56:44 UTC
Permalink
Post by Micha Feigin
Post by Jon Forrest
Post by r***@comcast.net
Coming in on this late, but to reduce this work load there is PGI's
version
Post by Jon Forrest
Post by r***@comcast.net
10.0 compiler suite which supports accelerator compiler directives. This
will reduce the coding effort, but probably suffer from the classical
"if it is
easy, it won't perform as well" trade-off. My experience is limited, but
I'm not sure how much traction such a thing will get.
Let's say you have a big Fortran program that you want
to port to CUDA. Let's assume you already know where the
program spends its time, so you know which routines
are good candidates for running on the GPU.
Rather than rewriting the whole program in C[++],
wouldn't it be easiest to leave all the non-CUDA
parts of the program in Fortran, and then to call
CUDA routines written in C[++]. Since the CUDA
routines will have to be rewritten anyway, why
write them in a language which would require
purchasing yet another compiler?
Mmm ... not sure I understand the response, but perhaps this response
was to a different message ... ?? In any case, the PGI software supports
accelerator directives for both C and Fortran, so for those languages I do
not see a need to rewrite whole applications. The question presented is
the same as always, what does the performance-programming effort function
look like and how well does your code perform with directives to start
with. The PGI models is also hardware generic and the code runs on
the CPU in parallel when there is no GPU around I believe. What will
gate interest is how well PGI compiler group does at delivering performance
and how important portability is to the person developing the code.
As far as I know, PGI also has CUDA Fortran, similar to CUDA C, not only a
directive-based approach, but I have to admit that I don't have any experience
with it.

As for why you would spend money on a compiler when the code has to be
rewritten anyway: even an expensive compiler is cheap compared with a
programmer's time. Even at the salary of a cheap programmer, the compiler
costs at most two weeks' worth of salary.

On the other hand, you have a programmer who already knows Fortran and a piece
of code that is already written and debugged in Fortran. For quite a few
programs, a first unoptimized version can be produced with very little work.

Just sorting through index-base bugs and memory-ordering bugs can cost you a lot
more than the compiler. Fortran is 1-based while C is 0-based
(actually Fortran 90/95 can use any index range for arrays). Fortran is
column-major while C is row-major. Do you know how much headache that can
bring into the porting?
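
[Editor's sketch, not from the thread] To make that bookkeeping concrete, here
is a small illustrative helper showing how the "same" logical element lands at
different flat offsets under Fortran's and C's conventions; the names are made
up for this sketch.

/* index_map.c -- mapping Fortran's A(i,j) (1-based, column-major, leading
 * dimension m) and C's a[i][j] (0-based, row-major, row length n) onto a
 * flat buffer shared with GPU code. */
#include <stdio.h>

/* offset of Fortran element A(i, j) in a flat buffer */
static int fortran_offset(int i, int j, int m)
{
    return (i - 1) + (j - 1) * m;
}

/* offset of C element a[i][j] in a flat buffer */
static int c_offset(int i, int j, int n)
{
    return i * n + j;
}

int main(void)
{
    int m = 3, n = 4;  /* a 3x4 matrix */
    /* Row 2, column 3 is A(2,3) in Fortran but a[1][2] in C; the two
     * conventions put it at different flat offsets (7 vs. 6), which is
     * exactly the porting hazard Micha is describing. */
    printf("Fortran A(2,3)  -> flat offset %d\n", fortran_offset(2, 3, m));
    printf("C       a[1][2] -> flat offset %d\n", c_offset(1, 2, n));
    return 0;
}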

Translating Matlab code into Fortran is also much easier than into C because
of these issues.
Post by Micha Feigin
HMPP make offers a similar proposition ...
rbw