Discussion:
scratch File system for small cluster
Glen Beane
2008-09-25 13:40:54 UTC
I am considering adding a small parallel file system (~5-10 TB) to my small
cluster (~32 2x dual-core Opteron nodes) that is used mostly by a handful of
regular users. Currently the only storage accessible to all nodes is home
directory space which is provided by the Lab's IT department (this is a SAN
volume connected to the head node by 2x FC links, and NFS exported to the
compute nodes). I don't have to "worry" about the IT provided SAN space -
they back it up, provide redundant hardware, etc. The parallel file system
would be scratch space (and not backed up by IT). We have a mix of home
grown apps doing a pretty wide range of things (some do a lot of I/O, others
don't), and things like BLAST and BLAT.

Can anyone out there provide recommendations for a good solution for fast
scratch space for a cluster of this size?

Right now I was thinking about PVFS2. How many I/O servers should I have,
and how many cores and RAM per I/O server?
Are there other recommendations for fast scratch space (it doesn't have to
be a parallel file system, something with less hardware would be nice)?

--
Glen L. Beane
Software Engineer
The Jackson Laboratory
http://www.jax.org
Joe Landman
2008-09-25 14:19:26 UTC
Post by Glen Beane
I am considering adding a small parallel file system ~(5-10TB) my small
cluster (~32 2x dual core Opteron nodes) that is used mostly by a handful of
regular users. Currently the only storage accessible to all nodes is home
directory space which is provided by the Lab's IT department (this is a SAN
volume connected to the head node by 2x FC links, and NFS exported to the
compute nodes). I don't have to "worry" about the IT provided SAN space -
they back it up, provide redundant hardware, etc. The parallel file system
would be scratch space (and not backed up by IT). We have a mix of home
grown apps doing a pretty wide range of things (some do a lot of I/O, others
don't), and things like BLAST and BLAT.
Hi Glen:

BLAST uses mmap'ed IO. This has some interesting ... interactions
... with parallel file systems.
Post by Glen Beane
Can anyone out there provide recommendations for a good solution for fast
scratch space for a cluster of this size?
Yes, but we are biased, as this is in part what we
design/build/sell/support. Linky in .sig .
Post by Glen Beane
Right now I was thinking about PVFS2. How many I/O servers should I have,
and how many cores and RAM per I/O server?
It turns out that PVFS2 sadly has a significant problem with BLAST
and mpiBLAST due to the mmap'ed files. We found this out when trying
to help a customer with a small tier-1 cluster deal with file system
instability. We saw this in PVFS2 2.6.9, 2.7.0 on 32 and 64 bit
platforms. The customer was going to update the PVFS2 group; I haven't
heard whether they have had a chance to do anything to trace this down and
fix it (I don't think it is a priority, as BLAST doesn't use MPI-IO,
which PVFS2 is quite good at).
Post by Glen Beane
Are there other recommendations for fast scratch space (it doesn't have to
be a parallel file system, something with less hardware would be nice)
Pure software: GlusterFS currently, Ceph in the near future. GFS won't
give you very good performance (meta-data shuttling limits what you can
do). You could go Lustre, but then you need to build MDS/OSS setups, so
this is a hybrid.

Pure hardware: Panasas (awesome kit, but not for the light-of-wallet),
DDN, BlueArc (same comments for these as well).

Reasonable cost HW with good performance: us and a few others. Put any
parallel FS atop this, or pure NFS. We have measured NFSoverRDMA speeds
(on SDR IB at that) at 460 MB/s, on an RDMA adapter reporting 750 MB/s
(in a 4x PCIe slot, so ~860 MB/s max is what we should expect for this).
Faster IB hardware should result in better performance, though you
still have to walk through the various software stacks, and they ...
remove efficiency ... (nice PC way to say that they slow things down a
bit :( )
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: ***@scalableinformatics.com
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Scott Atchley
2008-09-25 14:58:59 UTC
We have measured NFSoverRDMA speeds (on SDR IB at that) at 460 MB/s,
on an RDMA adapter reporting 750 MB/s (in a 4x PCIe slot, so ~860 MB/s
max is what we should expect for this). Faster IB hardware should
result in better performance, though you still have to walk through
the various software stacks, and they ... remove efficiency ...
(nice PC way to say that they slow things down a bit :( )
Joe,

Even though recent kernels allow rsize and wsize of 1 MB for TCP,
RPCRDMA only supports 32 KB. This will limit your throughput somewhat,
regardless of faster hardware.

Scott
Joe Landman
2008-09-25 15:03:26 UTC
Post by Scott Atchley
We have measured NFSoverRDMA speeds (on SDR IB at that) at 460 MB/s,
on an RDMA adapter reporting 750 MB/s (in a 4x PCIe slot, so ~860 MB/s
max is what we should expect for this). Faster IB hardware should
result in better performance, though you still have to walk through
the various software stacks, and they ... remove efficiency ... (nice
PC way to say that they slow things down a bit :( )
Joe,
Even though recent kernels allow rsize and wsize of 1 MB for TCP,
RPCRDMA only supports 32 KB. This will limit your throughput some
regardless of faster hardware.
I saw some messages to that effect a while ago (from you as I remember)
on another list.

10 or more Gb network, and the RDMA version lets you do a spoonful of
data at a time ... D'oh!
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: ***@scalableinformatics.com
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Tim Cutts
2008-09-25 15:06:01 UTC
Post by Joe Landman
BLAST uses mmap'ed IO. This has some interesting ...
interactions ... with parallel file systems.
It's not *too* bad on Lustre. We use it in production that way.
Post by Joe Landman
Post by Glen Beane
Are there other recommendations for fast scratch space (it doesn't have to
be a parallel file system, something with less hardware would be nice)
Pure software: GlusterFS currently, ceph in the near future. GFS
won't give you very good performance (meta-data shuttling limits
what you can do). You could go Lustre, but then you need to build
MDS/ODS setups so this is hybrid.
Lustre still has some interesting performance corners. Random access
with small reads is weak, so don't try putting DBM files on it, for
example.
Post by Joe Landman
Pure hardware: Panasas (awesome kit, but not for the light-of-wallet),
DDN, Bluearc (same comments for these as well).
We have seen some scaling/stability issues with BlueArc NFS heads, at
least on our SAN hardware. At the scale the OP is suggesting though,
it'll be fine (and they certainly are fast).

Regards,

Tim
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
Greg Lindahl
2008-09-25 17:46:32 UTC
BLAST uses mmap'ed IO. This has some interesting ... interactions ...
with parallel file systems.
The PathScale compilers use mmap on their temporary files. This led to
some interesting bugs being reported... fortunately, we were able to
pinpoint the parallel filesystems as being the guilty parties without
too much work.

-- greg
Joe Landman
2008-09-25 18:08:23 UTC
Post by Greg Lindahl
BLAST uses mmap'ed IO. This has some interesting ... interactions ...
with parallel file systems.
The PathScale compilers use mmap on their temporary files. This led to
some interesting bugs being reported... fortunately, we were able to
pinpoint the parallel filesystems as being the guilty parties without
too much work.
It looks like people use mmap'ed files to explicitly avoid seeks,
replacing file IO semantics with memory access semantics. We have a
customer who uses mmap for some large files (multiple GB). Sadly, mmap
on Linux uses the paging mechanism, which is pretty much stuck at 4 kB
pages for most distributions. I think the SiCortex folks and a few
others are working with 64 kB page kernels.

I am sure there are good reasons for using mmap. I just don't know
what they are, and in what contexts. I would rather have
direct/explicit control over the IO if possible.
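
To make the contrast concrete, here is a minimal sketch (the file name
and offsets are invented; this is not from any particular application)
of explicit pread()-style IO next to mmap'ed access of the same file:

/* Explicit IO vs. mmap'ed IO, side by side. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
    int fd = open("/scratch/bigfile.dat", O_RDONLY);  /* hypothetical path */
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);

    /* Explicit IO: every access is a syscall the program issues itself. */
    char buf[4096];
    if (pread(fd, buf, sizeof(buf), st.st_size / 2) < 0)
        perror("pread");

    /* mmap'ed IO: the file becomes memory; "reads" are 4 kB page faults
       serviced by the kernel (and, on a parallel FS, by its client code). */
    const char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    char byte = p[st.st_size / 2];   /* same data, no read() call visible */
    (void)byte;

    munmap((void *)p, st.st_size);
    close(fd);
    return 0;
}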
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: ***@scalableinformatics.com
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Greg Lindahl
2008-09-25 23:08:23 UTC
Post by Joe Landman
It looks like people use mmap files to explicitly avoid seeks,
replacing semantics of file IO with memory access semantics.
Well, it explicitly avoids having to call I/O functions all the time
as you skip around a file. It is also better at sharing, if several
processes have the same file open, or if the file was just written.
Remember that read() usually has to copy the data; mmap() sometimes
reduces the number of copies.
Post by Joe Landman
Sadly, mmap
on linux uses the paging mechanism which is pretty much stuck at 4kB
pages for most distributions. I think the SiCortex folks and a few
others are working with 64 kB page kernels.
That's a function of architecture. The Linux MIPS and PowerPC ports
have supported larger-than-4k pages for a long time -- it's easy to do
if you pick a single page size at boot. x86 only has Overly Small and
Really Too Huge pages.

-- greg

Glen Beane
2008-09-25 14:22:19 UTC
Post by Glen Beane
I am considering adding a small parallel file system ~(5-10TB) my small
cluster (~32 2x dual core Opteron nodes) that is used mostly by a handful of
regular users. Currently the only storage accessible to all nodes is home
directory space which is provided by the Lab's IT department (this is a SAN
volume connected to the head node by 2x FC links, and NFS exported to the
compute nodes). I don't have to "worry" about the IT provided SAN space -
they back it up, provide redundant hardware, etc. The parallel file system
would be scratch space (and not backed up by IT). We have a mix of home
grown apps doing a pretty wide range of things (some do a lot of I/O, others
don't), and things like BLAST and BLAT.
Hi Glen:

BLAST uses mmap'ed IO. This has some interesting ... interactions
... with parallel file systems.


For what it's worth, we use Paracel BLAST and are also considering mpiBLAST-pio to take advantage of a parallel file system.
Post by Glen Beane
Can anyone out there provide recommendations for a good solution for fast
scratch space for a cluster of this size?
Yes, but we are biased, as this is in part what we
design/build/sell/support. Linky in .sig .
Post by Glen Beane
Right now I was thinking about PVFS2. How many I/O servers should I have,
and how many cores and RAM per I/O server?
It turns out that PVFS2 sadly has a significant problem with BLAST
and mpiBLAST due to the mmap'ed files. We found this out when trying
to help a customer with a small tier-1 cluster deal with file system
instability. We saw this in PVFS2 2.6.9, 2.7.0 on 32 and 64 bit
platforms. The customer was going to update the PVFS2 group, haven't
heard if they have had a chance to do anything to trace this down and
fix it (I don't think it is a priority, as BLAST doesn't use MPI-IO,
which PVFS2 is quite good at).
Post by Glen Beane
Are there other recommendations for fast scratch space (it doesn't have to
be a parallel file system, something with less hardware would be nice)
Pure software: GlusterFS currently, ceph in the near future. GFS won't
give you very good performance (meta-data shuttling limits what you can
do). You could go Lustre, but then you need to build MDS/ODS setups so
this is hybrid.

Pure hardware: Panasas (awesome kit, but not for the light-of-wallet),
DDN, Bluearc (same comments for these as well).

Reasonable cost HW with good performance: us and a few others. Put any
parallel FS atop this, or pure NFS. We have measured NFSoverRDMA speeds
(on SDR IB at that) at 460 MB/s, on an RDMA adapter reporting 750 MB/s
(in a 4x PCIe slot, so ~860 MB/s max is what we should expect for this).
Faster IB hardware should result in better performance, though you
still have to walk through the various software stacks, and they ...
remove efficiency ... (nice PC way to say that they slow things down a
bit :( )

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: ***@scalableinformatics.com
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615


--
Glen L. Beane
Software Engineer
The Jackson Laboratory
Phone (207) 288-6153
Joe Landman
2008-09-25 14:26:51 UTC
Glen Beane wrote:
[...]
Post by Joe Landman
BLAST uses mmap'ed IO. This has some interesting ... interactions
... with parallel file systems.
for what its worth, we use Paracel BLAST and are also considering
mpiBLAST-pio to take advantage of a parallel file system
Cool. mpiBLAST also uses mmap'ed IO though. That hasn't changed. They
just spread the IO to each process to take advantage of the parallel
file systems.
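
To illustrate what "spread the IO to each process" looks like in
practice (a sketch only, not mpiBLAST source; the fragment naming and
path are invented), each MPI rank can mmap its own database fragment:

#include <mpi.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Hypothetical fragment layout: /scratch/db/nt.00, /scratch/db/nt.01, ... */
    char path[256];
    snprintf(path, sizeof(path), "/scratch/db/nt.%02d", rank);

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror(path); MPI_Abort(MPI_COMM_WORLD, 1); }
    struct stat st;
    fstat(fd, &st);

    /* The whole fragment becomes addressable memory; pages are faulted in
       on demand, which is exactly where a parallel FS client can get upset. */
    const char *db = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (db == MAP_FAILED) { perror("mmap"); MPI_Abort(MPI_COMM_WORLD, 1); }

    /* ... scan db[0 .. st.st_size-1] against the query here ... */

    munmap((void *)db, st.st_size);
    close(fd);
    MPI_Finalize();
    return 0;
}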

We have a nice RPM build for mpi-BLAST-pio that a number of folks are
using. You can grab it from
http://downloads.scalableinformatics.com/downloads/mpiblast/ . And a
shameless plug for mpiHMMer: http://www.mpihmmer.org/ while we are at it :)

Joe
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: ***@scalableinformatics.com
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Marian Marinov
2008-09-25 16:03:30 UTC
Post by Joe Landman
[...]
Post by Joe Landman
BLAST uses mmap'ed IO. This has some interesting ... interactions
... with parallel file systems.
for what its worth, we use Paracel BLAST and are also considering
mpiBLAST-pio to take advantage of a parallel file system
Cool. mpiBLAST also uses mmap'ed IO though. That hasn't changed. They
just spread the IO to each process to take advantage of the parallel
file systems.
We have a nice RPM build for mpi-BLAST-pio that a number of folks are
using. You can grab it from
http://downloads.scalableinformatics.com/downloads/mpiblast/ . And a
shameless plug for mpiHMMer: http://www.mpihmmer.org/ while we are at it :)
Joe
Have you looked at GlusterFS: http://www.gluster.org/docs/index.php/GlusterFS

Or maybe GFarm: http://datafarm.apgrid.org/

They are both good solutions.

Marian
Greg Keller
2008-09-25 16:52:40 UTC
Glen,

I have had great success with the *right* 10GbE NIC and NFS. The
important things to consider are:

How much bandwidth will your backend storage provide? With 2x FC4 links I'm
guessing best case is ~600 MB/s, but likely less.
What access patterns do the "typical apps" have?
All nodes read from a single file (no prob for NFS, and fscache may
help even more)
All nodes write to a single file (NFS may need some help or may be too
slow when tuned for this)
All nodes read and write to separate files (NFS is fine if the files
aren't too big for the OS to cache reasonably).

The number of IO servers really is a function of how much disk
throughput you have on the backend, frontend, and through the
kernel/filesystem goo. My experience is a 10GbE NIC from Myricom can easily
sustain 500-700 MB/s if the storage behind it can and the access
patterns aren't evil. Other NICs from large and small vendors can
fall apart at 3-4 Gb/s, so be careful and test the network first before
assuming your FS is the troublemaker. There are cheap switches with 2
or 4 10GbE CX4 connectors that make this much simpler and safer with
or without the parallel FS options.

Depending on how big/small and how "scratch" the need is... a big
tmpfs/ramdisk can be fun :)

Good luck!
Greg
Date: Thu, 25 Sep 2008 09:40:54 -0400
Subject: [Beowulf] scratch File system for small cluster
I am considering adding a small parallel file system ~(5-10TB) my small
cluster (~32 2x dual core Opteron nodes) that is used mostly by a handful of
regular users. Currently the only storage accessible to all nodes is home
directory space which is provided by the Lab's IT department (this is a SAN
volume connected to the head node by 2x FC links, and NFS exported to the
compute nodes). I don't have to "worry" about the IT provided SAN space -
they back it up, provide redundant hardware, etc. The parallel file system
would be scratch space (and not backed up by IT). We have a mix of home
grown apps doing a pretty wide range of things (some do a lot of I/O, others
don't), and things like BLAST and BLAT.
Can anyone out there provide recommendations for a good solution for fast
scratch space for a cluster of this size?
Right now I was thinking about PVFS2. How many I/O servers should I have,
and how many cores and RAM per I/O server?
Are there other recommendations for fast scratch space (it doesn't have to
be a parallel file system, something with less hardware would be nice)
--
Glen L. Beane
Software Engineer
The Jackson Laboratory
http://www.jax.org
Jan Heichler
2008-09-25 17:44:24 UTC
Hello Greg,

On Thursday, 25 September 2008, you wrote:
Post by Greg Keller
Glen,
I have had great success with the *right* 10GbE NIC and NFS. The important things to consider are:
I have to say my experience was different.
Post by Greg Keller
How much bandwidth will your backend storage provide? With 2x FC4 links I'm guessing best case is ~600 MB/s, but likely less.
600 MB/s is already a good value for SAN-based storage ;-)
Post by Greg Keller
What access patterns do the "typical apps" have?
All nodes read from a single file (no prob for NFS, and fscache may help even more)
All nodes write to a single file (NFS may need some help or may be too slow when tuned for this)
All nodes read and write to separate files (NFS is fine if the files aren't too big for the OS to cache reasonably).
The number of IO servers really is a function of how much disk throughput you have on the backend, frontend, and through the kernel/filesystem goo. My experience is a 10GbE NIC from Myricom can easily sustain 500-700 MB/s if the storage behind it can and the access patterns aren't evil. Other NICs
My experience was this: you get approx. half of what you have at the block-device level out over the network. I had a setup with 16 x 15k rpm SAS drives. RAID5 on them showed 1.1 GB/s read (limited by PCIe x8 probably) and 550 MB/s write (the controller was an LSI 8888ELP). Exporting this to a number of clients, I was not able to get more than approx. 500 MB/s read and 400 MB/s write with multiple clients. I could show the real measurements if that is of interest.

If you look at the hardware that was thrown at the problem, the result is a little pathetic.

My experience with Lustre is that it eats up 10 to 15% of the block-device speed, and the rest is what you get over the network.

So a cheap Lustre setup for scratch would probably consist of 2 servers with internal storage, exported to the cluster over 10GbE or IB. Internal storage is cheap and it is easy to achieve 500+ MB/s on SATA drives. That way you can reach 1 GB/s with just 2 servers and 32 to 48 disks involved.
Post by Greg Keller
from large and small vendors can fall apart at 3-4 Gb/s, so be careful and test the network first before assuming your FS is the troublemaker. There are cheap switches with 2 or 4 10GbE CX4 connectors that make this much simpler and safer with or without the parallel FS options.
I never tested anything but Myricom 10GbE, but you can find cheap Intel-based cards with CX4 (and I doubt that they are bad). The Dell PowerConnect 62xx series can give you cheap CX4 uplinks - and you get a decent switch that is stackable.
Post by Greg Keller
Depending on how big/small and how "scratch" the need is... a big tmpfs/ramdisk can be fun :)
I once tried to export tmpfs via NFS - it didn't work out of the box.

Bye Jan
Huw Lynes
2008-09-25 14:31:58 UTC
Post by Glen Beane
I am considering adding a small parallel file system ~(5-10TB) my small
cluster (~32 2x dual core Opteron nodes) that is used mostly by a handful of
regular users. Currently the only storage accessible to all nodes is home
directory space which is provided by the Lab's IT department (this is a SAN
volume connected to the head node by 2x FC links, and NFS exported to the
compute nodes). I don't have to "worry" about the IT provided SAN space -
they back it up, provide redundant hardware, etc. The parallel file system
would be scratch space (and not backed up by IT). We have a mix of home
grown apps doing a pretty wide range of things (some do a lot of I/O, others
don't), and things like BLAST and BLAT.
For a cluster this small you have to wonder whether the complexity and
expense of a clustered filesystem is worth it. I would be very surprised
if a decent NFS server couldn't keep up with your demands.

I'm sure Joe will be along shortly to recommend Jackrabbit.

Thanks,
Huw
--
Huw Lynes | Advanced Research Computing
HEC Sysadmin | Cardiff University
| Redwood Building,
Tel: +44 (0) 29208 70626 | King Edward VII Avenue, CF10 3NB
David Mathog
2008-09-25 19:00:37 UTC
Post by Joe Landman
Post by Glen Beane
I am considering adding a small parallel file system ~(5-10TB) my small
cluster (~32 2x dual core Opteron nodes) that is used mostly by a handful of
regular users. Currently the only storage accessible to all nodes is home
directory space which is provided by the Lab's IT department (this is a SAN
volume connected to the head node by 2x FC links, and NFS exported to the
compute nodes). I don't have to "worry" about the IT provided SAN space -
they back it up, provide redundant hardware, etc. The parallel file system
would be scratch space (and not backed up by IT). We have a mix of home
grown apps doing a pretty wide range of things (some do a lot of I/O, others
don't), and things like BLAST and BLAT.
Post by Joe Landman
BLAST uses mmap'ed IO. This has some interesting ... interactions
... with parallel file systems.
Right, and it isn't just the mapping of the databases and input file.
One must also be careful with how BLAST output is directed. Sending it
all to the same NFS mounted file system as "node01.out", "node02.out",
etc. will do very unpleasant things to both your network and the file
server. Far better to write those locally to /tmp/nodeXX.out, and then
take some care in moving them back to the central file system later, so
that the data transfer can proceed without interference.

This doesn't mean you have to wait until the end of the run and send
each node's entire output file back at once. It can be more efficient,
but more complicated, to write the output files on each node in
reasonable sized chunks and then interleave the transfer of those to the
central store with the ongoing run. Whether this is worth the extra
effort depends mostly on the number of queries in the input file and
the verbosity of the output file.
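
As a rough sketch of that interleaving (the paths, chunk size, and
helper function are invented for illustration), something along these
lines, run on each node between queries, would ship only the newly
written bytes of the local output file to the central store:

#include <stdio.h>
#include <stdlib.h>

/* Append everything in 'local' past offset 'done' to 'central' and
   return the new offset. In a real run you would only ship up to the
   end of the last completed query, not a half-written record. */
long ship_new_output(const char *local, const char *central, long done)
{
    FILE *in = fopen(local, "rb");
    FILE *out = fopen(central, "ab");        /* append to the central copy */
    if (!in || !out) { perror("fopen"); exit(1); }

    fseek(in, done, SEEK_SET);
    static char buf[1 << 20];                /* move data in 1 MB chunks */
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
        fwrite(buf, 1, n, out);
        done += (long)n;
    }
    fclose(in);
    fclose(out);
    return done;                             /* remember for the next call */
}

int main(void)
{
    long shipped = 0;
    /* Called periodically, e.g. after every N queries. */
    shipped = ship_new_output("/tmp/node01.out", "/nfs/results/node01.out", shipped);
    printf("shipped %ld bytes so far\n", shipped);
    return 0;
}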

Regards,

David Mathog
***@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech