Discussion:
[Beowulf] Clearing out scratch space
John Hearns via Beowulf
2018-06-12 08:06:06 UTC
Permalink
Our trick in Slurm is to use the slurmdprolog script to set an XFS project
quota for that job ID on the per-job directory (created by a plugin which
also makes subdirectories there that it maps to /tmp and /var/tmp for the
job) on the XFS partition used for local scratch on the node.
I had never thought of that, and it is a very neat thing to do.
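
For anyone who wants to try the same thing, a minimal sketch of such a prolog might look like this (the directory layout, quota size and project naming are my guesses, not the actual script):

#!/bin/bash
# Slurm prolog sketch -- runs on the node as root before the job starts.
SCRATCH=/local/scratch              # XFS filesystem mounted with prjquota
JOBDIR=$SCRATCH/$SLURM_JOB_ID
PROJID=$SLURM_JOB_ID                # reuse the job ID as the XFS project ID

mkdir -p "$JOBDIR"/tmp "$JOBDIR"/var_tmp
chown -R "$SLURM_JOB_UID" "$JOBDIR"

# Register the per-job directory as an XFS project and cap it at, say, 100 GB
# (a real script would remove these entries again in the epilog).
echo "$PROJID:$JOBDIR" >> /etc/projects
echo "job$PROJID:$PROJID" >> /etc/projid
xfs_quota -x -c "project -s job$PROJID" "$SCRATCH"
xfs_quota -x -c "limit -p bhard=100g job$PROJID" "$SCRATCH"
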
What I would like to discuss is the more general topic of clearing files
from 'fast' storage.
Many sites I have seen have dedicated fast/parallel storage which is
referred to as scratch space.
The intention is to use this scratch space for the duration of a project,
as it is expensive.
However I have often seen that the scratch space is used as permanent
storage, contrary to the intentions of whoever sized it, paid for it and
installed it.

I feel that the simplistic 'run a cron job and delete files older than N
days' approach is outdated.

My personal take is that hierarchical storage is the answer, automatically
pushing files to slower and cheaper tiers.

But the thought struck me - in the Slurm prolog script create a file called
THESE-FILES-WILL-SELF-DESTRUCT-IN-14-DAYS
Then run a cron job to decrement the figure 14
I guess that doesn't cope with running multiple jobs on the same data set -
but then again running a job marks that data as 'hot' and you reset the
timer to 14 days.
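
To make that concrete, the nightly cron job could be as dumb as this (entirely illustrative, marker-file convention made up on the spot):

#!/bin/bash
# Nightly: count down every THESE-FILES-WILL-SELF-DESTRUCT-IN-N-DAYS marker
# and purge the directory when it reaches zero.  A job prolog would recreate
# the marker at 14 to mark the data 'hot' again.
for marker in /scratch/*/THESE-FILES-WILL-SELF-DESTRUCT-IN-*-DAYS; do
    [ -e "$marker" ] || continue
    dir=$(dirname "$marker")
    days=$(basename "$marker" | sed 's/.*-IN-\([0-9]*\)-DAYS/\1/')
    if [ "$days" -le 0 ]; then
        rm -rf "$dir"
    else
        mv "$marker" "$dir/THESE-FILES-WILL-SELF-DESTRUCT-IN-$((days - 1))-DAYS"
    fi
done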

What do most sites do for scratch space?
Dmitri Chubarov
2018-06-12 09:06:22 UTC
Permalink
Hello, John,

At HLRS they have what they call a Workspace mechanism (
https://wickie.hlrs.de/platforms/index.php/Workspace_mechanism) where each
user creates a scratch directory for their project under $SCRATCH_ROOT that
has an end-of-life time encoded in the name, and a symlink to this directory
in their persistent storage directory tree. A cron job enforces the
end-of-life policy.
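
The HLRS tools (ws_allocate and friends, if I remember right) do this properly; the cron-side enforcement boils down to something like the sketch below, with a made-up directory-naming convention:

#!/bin/bash
# Cron sketch: directories are named <user>-<project>-<YYYYMMDD expiry>,
# e.g. $SCRATCH_ROOT/jhearns-cfd-20180701, with a symlink from $HOME.
today=$(date +%Y%m%d)
for d in "$SCRATCH_ROOT"/*-*-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]; do
    [ -d "$d" ] || continue
    expiry=${d##*-}
    # Extending the lifetime of everything underneath is a single rename.
    if [ "$expiry" -lt "$today" ]; then
        rm -rf "$d"
    fi
done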

One advantage is that it is very easy for the admin to extend the lifespan
when it is absolutely needed: extending the lifetime of, for example,
millions of files from genomic applications requires only renaming one
directory.

Here at Novosibirsk University, where users get their resources for
free, this mechanism has been reimplemented to ensure that shared storage
does not turn into a file archive.
The main shared storage is an expensive PanFS system that is split into two
partitions: a larger scratch partition with a directory lifetime limit of
90 days and a smaller $HOME partition.

Some users are in fact abusing the system by recreating a new scratch
directory every 90 days and copying the data along, effectively creating
persistent storage. However, most users do evacuate their valuable data
on time.

Greetings from sunny Siberia,
Dima
Jeff White
2018-06-12 15:49:14 UTC
Permalink
We also use a "workspace" mechanism to control our scratch filesystem. 
It's a home-grown system which works like so:

0. /scratch filesystem configured with root as its owner and no other
write permission allowed.

1. User calls mkworkspace which does a setuid, creates the directory,
chowns it to the user, then writes a small record file about the
workspace somewhere outside of it.

2. From cron, rmworkspace runs; it reads each record file, determines
which workspaces have expired, then removes the expired directories and
record files.
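
For illustration, step 2 boils down to something like this (the record format and paths here are simplified guesses, not our actual code):

#!/bin/bash
# rmworkspace sketch: each record file holds "path expiry-epoch owner".
RECORDS=/var/lib/workspaces
now=$(date +%s)
for rec in "$RECORDS"/*; do
    [ -f "$rec" ] || continue
    read -r path expiry owner < "$rec"
    if [ "$expiry" -lt "$now" ]; then
        rm -rf "$path"       # remove the expired workspace...
        rm -f "$rec"         # ...and its record file
    fi
done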

It works well and we have not had a problem with our scratch space
filling up after 2-3 years of running it.  Each compute node has a local
SSD and mkworkspace controls access to that too. Almost nobody uses this
"local scratch" though.

The major downside with this is that it confuses a small portion of
users.  They would rather just use mkdir by itself instead of using our
mkworkspace, despite it just being a wrapper for mkdir with some extra
options.  We opted not to use prolog/epilog scripts or other Slurm
features to automatically create or remove workspaces.


(it should run as a special user instead of root but I would need to
give it CAP_CHOWN to be able to chown and my filesystem doesn't support
Linux capabilities)


Jeff White
HPC Systems Engineer - ITS
Question about or help with Kamiak? Please submit a Service Request
<https://hpc.wsu.edu/support/service-requests/>.
_______________________________________________
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Chris Samuel
2018-06-12 09:52:07 UTC
Permalink
Post by John Hearns via Beowulf
What do most sites do for scratch space?
At ${JOB-1} we used GPFS and so for the scratch filesystem we used the GPFS
policy engine to identify and remove files that had not been read/written for
more than the defined number of days (twice the length of time of our longest
permitted job, so 2 x 30 days).

With the policy engine you can parallelise that across NSD servers too (we had
separate metadata and data NSD servers, so identifying the files was just
parallelised across the former which accessed the same shared SSD array).
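
For the curious, the rule boils down to something like the following (the 60-day threshold, paths and node names are illustrative, not the exact policy we ran):

# Illustrative only -- write the ILM rule and apply it across the NSD servers
cat > /tmp/purge-scratch.pol <<'EOF'
RULE 'purge_old_scratch' DELETE
  WHERE PATH_NAME LIKE '/scratch/%'
    AND (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 60
EOF
# -P names the policy file, -N the nodes to parallelise the scan over
mmapplypolicy /scratch -P /tmp/purge-scratch.pol -N nsd01,nsd02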

Our compute nodes were diskless too so for jobs we used a plugin which created
per-job directories under scratch and then the epilog would clean them up.

cheers!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

_______________________________________________
Beowulf mailing list, ***@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Skylar Thompson
2018-06-12 12:21:04 UTC
Permalink
Post by John Hearns via Beowulf
What do most sites do for scratch space?
We give users access to local disk space on nodes (spinning disk for older
nodes, SSD for newer nodes), which (for the most part) GE will address with
the $TMPDIR job environment variable. We have a "ssd" boolean complex that
users can place in their job to request SSD nodes if they know they will
benefit from them.
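
For example, a job script that requests one of the SSD nodes and stages through $TMPDIR might look roughly like this (the file names and exact complex syntax are placeholders):

#!/bin/bash
#$ -l ssd=1        # ask for a node advertising the "ssd" boolean complex
#$ -cwd
# $TMPDIR is the per-job directory GE creates on the node-local disk
cp big-input.dat "$TMPDIR"/
./my_analysis "$TMPDIR"/big-input.dat > "$TMPDIR"/result.dat
cp "$TMPDIR"/result.dat "$SGE_O_WORKDIR"/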

We also have labs that use non-backed up portions of their network storage
(Isilon for the older storage, DDN/GPFS for the newer) for scratch space
for processing of pipeline data, where different stages of the pipeline run
on different nodes.
--
Skylar
_______________________________________________
Beowulf mailing list, ***@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Nick Evans
2018-06-12 12:54:14 UTC
Permalink
At ${JOB-1} we used local scratch and tmpwatch. This had a wrapper script
that would exclude files and folders for any user currently running a
job on the node.

This way nothing got removed until the user's job had finished, even if they
hadn't accessed the files for a while, and you don't have to predict how long
a job could run for.
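
Roughly along these lines (the paths, the 10-day threshold, tmpwatch's --exclude-user option and the Slurm query are all from memory and purely illustrative; substitute your scheduler's equivalent):

#!/bin/bash
# clean-local-scratch.sh -- illustrative only
SCRATCH=/local/scratch

# Users with jobs currently running on this node (Slurm shown here;
# the same idea works with any scheduler's query tool).
EXCLUDES=()
for u in $(squeue -h -w "$(hostname -s)" -o '%u' | sort -u); do
    EXCLUDES+=(--exclude-user="$u")
done

# Remove anything not accessed in 10 days (240 hours), except files
# owned by users who still have jobs on the node.
tmpwatch "${EXCLUDES[@]}" --atime 240 "$SCRATCH"
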
Ellis H. Wilson III
2018-06-12 13:41:35 UTC
Permalink
Post by John Hearns via Beowulf
Our trick in Slurm is to use the slurmdprolog script to set an XFS project
quota for that job ID on the per-job directory (created by a plugin which
also makes subdirectories there that it maps to /tmp and /var/tmp for the
job) on the XFS partition used for local scratch on the node.
I had never thought of that, and it is a very neat thing to do.
What I would like to discuss is the more general topic of clearing files
from 'fast' storage.
Many sites I have seen have dedicated fast/parallel storage which is
referred to as scratch space.
The intention is to use this scratch space for the duration of a
project, as it is expensive.
However I have often seen that the scratch space is used as permanent
storage, contrary to the intentions of whoever sized it, paid for it and
installed it.
I feel that the simplistic 'run a cron job and delete files older than N
days' approach is outdated.
My personal take is that hierarchical storage is the answer,
automatically pushing files to slower and cheaper tiers.
Disclaimer: I work for one such parallel and fast (and often used for
scratch) company called Panasas.

I disagree with the notion that hierarchical storage is a silver bullet
in many HPC-oriented cases. In fact, I would argue in some environments
it poses serious risks to being able to keep a lid on your storage cost
and DC footprint, whether that's for scratch, home, or archive storage.
People (including myself in my own systems testing) can generate an
enormous amount of data that has near-zero value past the project
currently being worked on, which may only be measured in weeks or low
numbers of months. In many of my cases it's cheaper to regenerate those
many TBs of data than to hold onto it for a year or more. Auto-tiering
scratch data to cheaper storage as it gets colder seems like an easy
answer as it takes some of this responsibility away from the users, but
you'll still want to /someday/ ditch that data entirely (for
scratch-like data that is). Culling through piles of likely
mechanically named files you haven't looked at in a long time is
difficult as a human exercise, and without sufficiently complex media
asset management it's also difficult from a storage perspective as your
data may take a /long/ time to even list, much less grep through, when
pulling from true archive storage.

For true scratch I think the approach presented by many of the posters,
automatic deletion policies managed by administrators, which develops and
forces good data habits, is ultimately the cleanest solution in the
long term.

Now, tiering within a storage layer based on different types or access
frequencies of data is perfectly reasonable and is something we do in
our systems. Also, using external software to automatically tier cold
but persistent data (i.e., home dir data) from fast to archive storage
is also reasonable. But there are a lot of pitfalls from trying to
automatically tier data that isn't supposed to be (eternally) persistent
IMHO.

Best,

ellis
--
Ellis H. Wilson III, Ph.D.
www.ellisv3.com
_______________________________________________
Beowulf mailing list, ***@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
John Hearns via Beowulf
2018-06-12 14:25:04 UTC
Permalink
Post by Ellis H. Wilson III
Disclaimer: I work for one such parallel and fast (and often used for
scratch) company called Panasas.
Ellis, I know Panasas well of course. You are a great bunch of guys and
girls, and have pulled my chestnuts from the fire many times (such as the
plaintive call from the customer - we can't access our data, what are all
these red lights for, we have seen them for weeks. Me - puts in a call to
800-PANASAS immediately).
I have been there on my hands and knees installing Panasas systems for UK
Government defence customers.
I installed Panasas in Formula 1, where it had a dramatic effect on our solver
times.
I also installed the largest Panasas setup in the world at the UK Rutherford
Appleton Laboratory, which is being used for climate research (JASMIN).
Been there, seen it, unboxed hundreds of blades.
Post by Ellis H. Wilson III
Culling through piles of likely mechanically named files you haven't
looked at in a long time is difficult as a human exercise, and without
sufficiently complex media asset management it's also difficult from a
storage perspective > > as your data may take a /long/ time to even list,
much less grep through, when pulling from true archive storage.

Excellent point. In Formula 1 I made a serious study of Arcitecta
Mediaflux for this very purpose (or rather the SGI rebadge)
http://www.arcitecta.com/Products
In the end we did not implement it.
I recall being asked to put terabytes of wind tunnel data on our system,
which consisted of thousands of high resolution frame grabs. In reality
no-one was ever going to look at that data again.

Regarding information lifecycle management, may I go back a long way to
data acquisition at CERN. My experiment, like the other experiments of
that generation, stored data on round magnetic tapes.
We knew that there was a limited capacity to keep data long term, so raw
data was quickly filtered within the detector. Full readouts were kept, but
quickly distilled into compact 'Data Summary Tapes'.
For instance, vertices would be reconstructed into tracks and the track
information kept on the DST. OK, that meant that the full data set could
never be re-analyzed in that level of detail,
but the important aspects for the physics are stored long term.
Ellis H. Wilson III
2018-06-12 14:51:54 UTC
Permalink
Post by John Hearns via Beowulf
Post by Ellis H. Wilson III
Disclaimer: I work for one such parallel and fast (and often used for
scratch) company called Panasas.
Ellis, I know Panasas well of course.  You are a great bunch of guys and
girls, and have pulled my chestnuts from the fire many times (such as
the plaintive call from the customer - we can't access our data, what
are all these red lights for, we have seen them for weeks. Me - puts
in a call to 800-PANASAS immediately).
I have been there on my hands and knees installing Panasas systems for
UK Government defence customers.
I installed Panasas in Formula 1, where it had a dramatic effect on our
solver times.
I also installed the largest Panasas setup in the world at the UK
Rutherford Appleton Laboratory, which is being used for climate research
(JASMIN).
Been there, seen it, unboxed hundreds of blades.
Very glad you've had such good experiences John. As we like to say
internally, people buy for the filesystem engineering at Panasas, but
stay for the service, and that's still probably giving our architecture
and engineering teams too much credit. Our support team puts the rest
of us to shame.

Returning to the subject at hand, I'm wondering aloud whether people prefer
automatic deletion policies for scratch volumes to be incorporated into
the filesystem or not. I ask because while external software is more
flexible, in that it should run fine on any POSIX-compliant filesystem,
it has no good way of ensuring that the thousands of deletions per second
issued when a batch of files ages out aren't interfering with foreground
performance. If instead such aging policies for a scratch volume were more
tightly integrated into the filesystem in question, it could balance issuing
the deletes with maintaining a QoS level for foreground performance.

We don't have such a feature, but we have much of the infrastructure to
support a specialized volume type like that without a huge lift. If
people were sufficiently interested in it I could definitely see us
dedicating some engineering resources to it.

Best,

ellis
--
Ellis H. Wilson III, Ph.D.
www.ellisv3.com
_______________________________________________
Beowulf mailing list, ***@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Matt Wallis
2018-06-13 08:45:14 UTC
Permalink
Post by John Hearns via Beowulf
My personal take is that hierarchical storage is the answer,
automatically pushing files to slower and cheaper tiers.
This is my preference as well; if manual intervention is required, it
won't get done. But you do need to tune it a fair bit to ensure things
are not being pushed between tiers inappropriately. You also want to make
sure that data doesn't end up on tape unless it is essentially considered
archived.
Post by John Hearns via Beowulf
What do most sites do for scratch space?
I personally haven't used this one as yet, but there's a lot of interest
around BeeGFS and BeeOND.
BeeGFS is a parallel file system from the Fraunhofer Institute in
Germany, originally called FhGFS. Very fast, very simple, easy to manage.

BeeOND, or BeeGFS On Demand, allows you to create a temporary file system
per job, typically on something like node-local SSD/NVMe devices. I
believe this is being done on TSUBAME 3.0, and one of my customers in
Queensland is running it as well.
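
The lifecycle is pleasantly simple; in a job prolog/epilog it looks roughly like this (paths are illustrative, and do check the BeeOND docs for the exact flags rather than trusting my memory):

#!/bin/bash
# Illustrative BeeOND lifecycle around a job -- paths are examples only.
NODEFILE=/tmp/job-${SLURM_JOB_ID}.nodes
scontrol show hostnames "$SLURM_JOB_NODELIST" > "$NODEFILE"

# Prolog: build a temporary BeeGFS across the job's node-local NVMe drives
beeond start -n "$NODEFILE" -d /mnt/nvme/beeond -c /mnt/beeond

# ... job runs, reading and writing /mnt/beeond ...

# Epilog: tear it down; the per-job scratch disappears with it
beeond stop -n "$NODEFILE" -L -d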

BeeGFS by itself is a pretty interesting PFS; BeeOND sounds great as a
concept; my concern would be expectation management around yet another
resource. As in, users getting upset because the number of nodes in your
job now also impacts the amount of scratch space you can write to and at
what speed. Then add to that staging data in and out of the space.

That said, my Queensland customer is absolutely stoked with the
performance he's getting, and it does eliminate the whole question of
cleaning up scratch space: when the job is over, the scratch space is gone.

I have another system based on BeeGFS coming online in the second half
of the year that I can't talk about right now, but I will be looking
for new adjectives for speed when it hits.

Matt.
_______________________________________________
Beowulf mailing list, ***@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf