Discussion:
[Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?
Chris Samuel
2018-06-08 07:21:56 UTC
Permalink
Hi all,

I'm curious to know what/how/where/if sites do to try and reduce the impact of
fragmentation of resources by small/narrow jobs on systems where you also have
to cope with large/wide parallel jobs?

For my purposes a small/narrow job is anything that will fit on one node
(whether a single core job, multi-threaded or MPI).

One thing we're considering is to use overlapping partitions in Slurm to have
a subset of nodes that are available to these types of jobs and then have
large parallel jobs use a partition that can access any node.

This has the added benefit of letting us set a higher priority on that
partition to let Slurm try and place those jobs first, before smaller ones.

We're already using a similar scheme for GPU jobs where they get put into a
partition that can access all 36 cores on a node whereas non-GPU jobs get put
into a partition that can only access 32 cores on a node, so effectively we
reserve 4 cores a node for GPU jobs.
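
A minimal slurm.conf sketch of this sort of layout, with invented partition and
node names rather than the actual configuration described here, using
PriorityTier for the partition priority and MaxCPUsPerNode for the per-node
core reservation:

# wide parallel jobs can use every node and are placed first
PartitionName=big   Nodes=node[001-100] PriorityTier=10
# small/narrow jobs are confined to a subset of nodes
PartitionName=small Nodes=node[001-020] PriorityTier=1 Default=YES
# non-GPU jobs on the GPU nodes are capped at 32 of the 36 cores
PartitionName=cpu   Nodes=gpu[01-10] MaxCPUsPerNode=32
# GPU jobs can use all 36 cores
PartitionName=gpu   Nodes=gpu[01-10] MaxCPUsPerNode=36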

But really I'm curious to know what people do about this, or do you not worry
about it at all and just let the scheduler do its best?

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

John Hearns via Beowulf
2018-06-08 07:55:19 UTC
Permalink
Chris, good question. I can't give a direct answer there, but let me share
my experiences.

In the past I managed SGI ICE clusters and a large memory UV system with
PBSPro queuing.
The engineers submitted CFD solver jobs using scripts, and we only allowed
them to use a multiple of N CPUs;
in fact there were queues named after, let's say, 2N or 4N CPU cores. The
number of cores was cunningly arranged to fit into
what SGI terms an IRU, or what everyone else would call a blade chassis.
We had job exclusivity, and engineers were not allowed to choose how many
CPUs they used.
This is a very efficient way to run HPC - as you have a clear view of how
many jobs fit on a cluster.

Yes, before you say it, this does not cater for the mixed workload with lots
of single-CPU jobs, Matlab, Python etc.

When the UV arrived I configured bladesets (placement sets) such that the
scheduler tried to allocate CPUs and memory from blades
adjacent to each other. Again, much better efficiency. If I'm not wrong, you
do that in Slurm by defining switches.
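
The Slurm equivalent is the topology/tree plugin; a minimal sketch, with
invented switch and node names, would be TopologyPlugin=topology/tree in
slurm.conf plus a topology.conf along these lines:

# one line per leaf switch, listing the nodes hanging off it
SwitchName=leaf1 Nodes=node[001-018]
SwitchName=leaf2 Nodes=node[019-036]
# the top-level switch connects the leaves
SwitchName=spine Switches=leaf[1-2]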

When the high-core-count AMDs came along I again configured blade sets, and
the number of CPUs per job was increased to cope with the
larger core counts, but again cunningly arranged to equal the number of
cores in a placement set (placement sets were configured to be
half an IRU, a full IRU or two IRUs).

At another place of employment recently we had a hugely mixed workload,
ranging from interactive graphics, to the Matlab type jobs,
to multinode CFD jobs. In addition to that we had different CPU generations
and GPUs in the mix.
That setup was a lot harder to manage and keep up the efficiency of use, as
you can imagine.

I agree with you about the overlapping partitions. If I were to arrange
things in my ideal world, I would have a set of the latest generation CPUs
using the latest generation interconnect and reserve them for 'job
exclusive' jobs - i.e. parallel jobs - and leave other nodes exclusively for
one-node or one-core jobs.
Then have some mechanism to grow/shrink the partitions.




One thing again which I found difficult in my last job was users 'hard
wiring' the number of CPUs they use. In fact I have seen that quite often
on other projects.
What happens is that a new PhD student, postdoc or new engineer is gifted a job
submission script from someone who is leaving, or moving on.
The new person doesn't really understand why (say) six nodes with eight CPU
cores are requested.
But (a) they just want to get on and do the job and (b) they are scared of
breaking things by altering the script.
So the number of CPUs doesn't change, and with the latest generation of 20-plus
cores on a node you get wasted cores.
Also having mixed generations of CPUs with different core counts does not
help here.

Yes I know we as HPC admins can easily adjust job scripts to mpirun with N
equal to the number of cores on a node (etc).
In fact when I have worked with users and showed them how to do this it has
been a source of satisfaction to me.
Paul Edmon
2018-06-08 14:16:16 UTC
Permalink
Yeah this one is tricky.  In general we take the wild west approach here,
but I've had users use --contiguous and their job takes forever to run.

I suppose one method would be to enforce that each job take a full
node and parallel jobs always have contiguous.  As I recall Slurm will
preferentially fill up nodes to try to leave contiguous blocks
as large as it can.

The other option would be to use requeue to your advantage.
Namely just have a high priority queue only for large contiguous jobs
and it just requeues all the jobs it needs to run.  That would depend
on your single node/core users' tolerance for being requeued.
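
A minimal slurm.conf sketch of that requeue idea, assuming partition-priority
preemption and invented partition names (jobs would also need to be
requeueable, e.g. JobRequeue=1 or submitted with --requeue):

PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
# small jobs can be requeued to make room
PartitionName=small Nodes=node[001-100] PriorityTier=1 PreemptMode=REQUEUE
# large contiguous jobs preempt from the higher-priority partition
PartitionName=large Nodes=node[001-100] PriorityTier=10 PreemptMode=OFF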

-Paul Edmon-
Bill Abbott
2018-06-08 14:39:02 UTC
Permalink
We set PriorityFavorSmall=NO and PriorityWeightJobSize to some
appropriately large value in slurm.conf, which helps.
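
For reference, that corresponds to something like this in slurm.conf (the
weight values are placeholders, not Bill's actual numbers):

PriorityType=priority/multifactor
PriorityFavorSmall=NO
PriorityWeightJobSize=100000
PriorityWeightAge=1000
PriorityWeightFairshare=10000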

We also used to limit the number of total jobs a single user could run
to something like 30% of the cluster, so a user could run a single mpi
job that takes all nodes, but couldn't run single-core jobs that take
all nodes.  We switched away from that to an owner/preemption system.
Now if a user pays for access they can run whatever they want on their
allocation, and if they don't pay we don't have to care what happens to
them. Sort of.

One idea we're working towards is to have a vm cluster in one of the
commercial cloud providers that only accepts small jobs, and use slurm
federation to steer the smaller jobs there, leaving the on-prem nodes
for big mpi jobs.  We're not there yet but it shouldn't be a problem to
implement technically.
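
As a sketch, the federation plumbing is created with sacctmgr (federation and
cluster names invented here); the policy that actually steers small jobs to
the cloud side would still need something like a job_submit filter or explicit
routing:

sacctmgr add federation ourfed clusters=onprem,cloudburst
# a job can also be pointed at a member cluster explicitly
sbatch --clusters=cloudburst small_job.sh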

Bill
Chris Samuel
2018-06-09 06:56:20 UTC
Permalink
Post by Bill Abbott
We set PriorityFavorSmall=NO and PriorityWeightJobSize to some
appropriately large value in slurm.conf, which helps.
I guess that helps getting jobs going (and we use something similar), but my
question was more about placement. It's a hard one...
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Scott Atchley
2018-06-09 15:22:07 UTC
Permalink
Hi Chris,

We have looked at this _a_ _lot_ on Titan:

A Multi-faceted Approach to Job Placement for Improved Performance on
Extreme-Scale Systems

https://ieeexplore.ieee.org/document/7877165/

This issue we have is small jobs "inside" large jobs interfering with the
larger jobs. The item that was easy to implement with our scheduler was
"Dual-Ended Scheduling". We set a threshold of 16 nodes to demarcate small.
Jobs using more than 16 nodes schedule from the top/front of the list and
smaller jobs schedule from the bottom/back of the list.

Scott
Chris Samuel
2018-06-10 08:53:24 UTC
Permalink
Post by Scott Atchley
Hi Chris,
Hey Scott,
Post by Scott Atchley
A Multi-faceted Approach to Job Placement for Improved Performance on
Extreme-Scale Systems
https://ieeexplore.ieee.org/document/7877165/
Thanks! IEEE has it paywalled but it turns out ACM members can read it here:

https://dl.acm.org/citation.cfm?id=3015021
Post by Scott Atchley
This issue we have is small jobs "inside" large jobs interfering with the
larger jobs. The item that is easy to implement with our scheduler was
"Dual-Ended Scheduling". We set a threshold of 16 nodes to demarcate small.
Jobs using more than 16 nodes, schedule from the top/front of the list and
smaller schedule from the bottom/back of the list.
I'm guessing for "list" you mean a list of nodes? It's an interesting idea,
and possibly something that might be doable in Slurm with some patching; for
us it might be more like allocating sub-node jobs from the start of the list
(to hopefully fill up holes left by other small jobs) and full-node jobs from
the end of the list (where the list is a set of nodes of the same weight).
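
A crude approximation of that split is already possible with node weights in
slurm.conf, since lower-weight nodes are allocated first (the weights and node
names below are illustrative):

# nodes preferred for small jobs fill up first
NodeName=node[001-080] CPUs=32 Weight=10
# higher-weight nodes are left free for whole-node parallel work as long as possible
NodeName=node[081-100] CPUs=32 Weight=100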

You've got me thinking... ;-)

All the best!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Scott Atchley
2018-06-10 12:33:22 UTC
Permalink
Post by Chris Samuel
Post by Scott Atchley
Hi Chris,
Hey Scott,
Post by Scott Atchley
A Multi-faceted Approach to Job Placement for Improved Performance on
Extreme-Scale Systems
https://ieeexplore.ieee.org/document/7877165/
https://dl.acm.org/citation.cfm?id=3015021
Post by Scott Atchley
This issue we have is small jobs "inside" large jobs interfering with the
larger jobs. The item that was easy to implement with our scheduler was
"Dual-Ended Scheduling". We set a threshold of 16 nodes to demarcate small.
Jobs using more than 16 nodes schedule from the top/front of the list and
smaller jobs schedule from the bottom/back of the list.
I'm guessing for "list" you mean a list of nodes?
Yes. It may be specific to Cray/Moab.
Post by Chris Samuel
It's an interesting idea
and possibly something that might be doable in Slurm with some patching, for
us it might be more like allocate sub-node jobs from the start of the list (to
hopefully fill up holes left by other small jobs) and full node jobs from the
end of the list (where here list is a set of nodes of the same weight).
You've got me thinking... ;-)
All the best!
Chris
Good luck. If you want to discuss, please do not hesitate to ask. We have
another paper pending along the same lines.

Scott
Chris Samuel
2018-06-11 11:54:08 UTC
Permalink
On Sunday, 10 June 2018 10:33:22 PM AEST Scott Atchley wrote:

[lists]
Post by Scott Atchley
Yes. It may be specific to Cray/Moab.
No, I think that applies quite nicely to Slurm too.
Post by Scott Atchley
Good luck. If you want to discuss, please do not hesitate to ask. We have
another paper pending along the same lines.
Thanks! Much appreciated.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Chris Samuel
2018-06-09 06:54:19 UTC
Permalink
Post by Paul Edmon
Yeah this one is tricky. In general we take the wild west approach here, but
I've had users use --contiguous and their job takes forever to run.
:-)
Post by Paul Edmon
I suppose one method would be to enforce that each job take a full node
and parallel jobs always have contiguous.
For us that would be wasteful though if their job can only scale to a small
number of cores.
Post by Paul Edmon
As I recall Slurm will
preferentially fill up nodes to try to leave contiguous blocks as large
as it can.
That's what I thought, but I'm not sure that's actually true. In this Slurm
bug report one of the support folks says:

https://bugs.schedmd.com/show_bug.cgi?id=3505#c16

# After the available nodes and cores are identified, the _eval_nodes()
# function is not preferring idle nodes, but sequentially going through
# the lowest weight nodes and accumulating cores.
Post by Paul Edmon
The other option would be to use requeue to your advantage. Namely
just have a high priority queue only for large contiguous jobs and it just
requeues all the jobs it needs to run. That would depend on your single
node/core users' tolerance for being requeued.
Yeah, I suspect with the large user base we have that's not much of an option.
This is one of the times where migration of tasks would be really handy. It's
one of the reasons I was really interested in the presentation from the Spanish
group working on DMTCP checkpoint/restart at the Slurm User Group last year,
which claims to be able to do a lot of this:

https://slurm.schedmd.com/SLUG17/ciemat-cr.pdf

cheers!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Andrew Mather
2018-06-08 12:59:03 UTC
Permalink
Hi Chris,
Post by Chris Samuel
Hi all,
I'm curious to know what/how/where/if sites do to try and reduce the impact of
fragmentation of resources by small/narrow jobs on systems where you also have
to cope with large/wide parallel jobs?
For my purposes a small/narrow job is anything that will fit on one node
(whether a single core job, multi-threaded or MPI).
Somewhat ancient history for me now and we didn't have to deal with
multi-node jobs... :)

Hopefully my memory doesn't let me down and no doubt my successor has
tweaked things :)

But on our Torque/Moab system at $JOB -1 we just used to place a higher
priority multiplier on larger jobs, which had the effect of the scheduler
shuffling things around so they'd run as soon as possible. We had a fairly
complex prioritisation setup there, so job size was only one factor.


........ 8< Snip .....

But really I'm curious to know what people do about this, or do you not worry
about it at all and just let the scheduler do its best?
Pretty much this really, given the other priority multipliers
Andrew
(wondering why he can't let this list go)
-
https://picasaweb.google.com/107747436224613508618
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
“We need the tonic of wildness.... We can never have enough of nature.”― Henry
David Thoreau
<http://www.goodreads.com/author/show/10264.Henry_David_Thoreau>, Walden:
Or, Life in the Woods <http://www.goodreads.com/work/quotes/2361393>
-
David Mathog
2018-06-08 16:44:53 UTC
Permalink
This isn't quite the same issue, but several times I have observed a
large multi-CPU machine lock up because the accounting records associated
with a zillion tiny, rapidly launched jobs made an enormous
/var/account/pacct file and filled the small root filesystem. Actually
it wasn't usually pacct itself that brought the system to its knees but
the cron-scheduled gzip of that file which applied the coup de grace.
That left the original big pacct and a very large partial pacct-$DATE.gz
which used up the last few free bytes.

As far as I know there is no way to selectively disable saving process
accounting records for "all children of process PID". Accounting is
either on or off. Now, when I run scripts prone to this, accounting is
turned off first.
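
A minimal sketch of that workaround (paths as on a typical CentOS 6 psacct
install; depending on the accton version, disabling is either accton with no
arguments or an explicit "off", and the script name here is just a placeholder):

/usr/sbin/accton                      # turn process accounting off (or: accton off)
./launch_many_tiny_jobs.sh            # the script that spawns the zillion small jobs
/usr/sbin/accton /var/account/pacct   # turn accounting back on, logging to the usual file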

This was on Centos 6.9, on machines reporting (via /proc/cpuinfo) 48 and
56 cpus.

Regards,

David Mathog
***@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
Skylar Thompson
2018-06-09 15:48:18 UTC
Permalink
We're a Grid Engine shop, and we have the execd/shepherds place each job in
its own cgroup with CPU and memory limits in place. This lets our users
make efficient use of our HPC resources whether they're running single-slot
jobs, or multi-node jobs. Unfortunately we don't have a mechanism to limit
network usage or local scratch usage, but the former is becoming less of a
problem with faster edge networking, and we have an opt-in bookkeeping mechanism
for the latter that isn't enforced but works well enough to keep people
happy.
--
Skylar
Chris Samuel
2018-06-10 08:46:04 UTC
Permalink
Post by Skylar Thompson
We're a Grid Engine shop, and we have the execd/shepherds place each job in
its own cgroup with CPU and memory limits in place.
Slurm supports cgroups as well (and we use them extensively); the idea here
is more to try and avoid/minimise unnecessary inter-node MPI traffic.
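
For completeness, the Slurm side of that is roughly TaskPlugin=task/cgroup in
slurm.conf plus a cgroup.conf along these lines (a minimal sketch, not any
particular site's settings):

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes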

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Skylar Thompson
2018-06-10 10:26:15 UTC
Permalink
Post by Chris Samuel
Post by Skylar Thompson
We're a Grid Engine shop, and we have the execd/shepherds place each job in
its own cgroup with CPU and memory limits in place.
Slurm has supports cgroups as well (and we use it extensively), the idea here
is more to try and avoid/minimise unnecessary inter-node MPI traffic.
We have very little MPI, but if I had to solve this in GE, I would try to
fill up one node before sending jobs to another. The queue sort order
(defaults to instance load, but can be set to a simple sequence number) is
a general way, while the allocation rule for parallel environments
(defaults to round_robin, but can be set to fill_up) is another specific to
multi-slot jobs.
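
In Grid Engine configuration terms that is roughly the scheduler's
queue_sort_method plus the parallel environment's allocation_rule; a sketch
(the PE name is invented):

# qconf -msconf: sort queue instances by sequence number instead of load
queue_sort_method    seqno
# qconf -mp mpi_fill: pack slots onto as few hosts as possible
allocation_rule      $fill_up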

Not sure the specifics for Slurm, though.
--
Skylar
John Hearns via Beowulf
2018-06-11 12:36:14 UTC
Permalink
Post by Skylar Thompson
Unfortunately we don't have a mechanism to limit
network usage or local scratch usage, but the former is becoming less of a
problem with faster edge networking, and we have an opt-in bookkeeping mechanism
for the latter that isn't enforced but works well enough to keep people
happy.
That is interesting to me. At ASML I worked on setting up Quality of
Service, i.e. bandwidth limits, for GPFS storage and MPI traffic.
GPFS does have QoS limits inbuilt, but these are intended to limit the
background housekeeping tasks rather than to limit user processes.
But it does have the concept.
With MPI you can configure different QoS levels for different traffic.

More relevantly, I did have a close discussion with Parav Pandit, who is
working on the network QoS stuff.
I am sure there is something more up to date than this
https://www.openfabrics.org/images/eventpresos/2016presentations/115rdmacont.pdf
Sadly this RDMA stuff needs a recent 4-series kernel. I guess the
discussion on whether or not you should go with a bleeding edge kernel is
for another time!
But yes cgroups have configurable network limits with the latest kernels.

Also being cheeky, and I probably have mentioned them before, here is a
plug for Ellexus https://www.ellexus.com/
Worth mentioning I have no connection with them!
Skylar Thompson
2018-06-11 13:18:58 UTC
Permalink
Thanks for the pointer to Ellexus - their I/O profiling does look like
something that could be useful for us. Since we're a bioinformatics shop
and mostly storage-bound rather than network-bound, we haven't really
needed to worry about node network limitations (though occasionally have
had to worry about ToR or chassis switch limitations), but have really
suffered at times when people assume that disk performance is limitless,
and random access is the same as sequential access.
--
Skylar
Chris Samuel
2018-06-12 04:28:25 UTC
Permalink
Post by Skylar Thompson
Unfortunately we don't have a mechanism to limit
network usage or local scratch usage
Our trick in Slurm is to use the slurmd prolog script to set an XFS project
quota for that job ID on the per-job directory (created by a plugin which
also makes subdirectories there that it maps to /tmp and /var/tmp for the
job) on the XFS partition used for local scratch on the node.

If they don't request an amount via the --tmp= option then they get a default
of 100MB. Snipping the relevant segments out of our prolog...

# Per-job scratch directory created by the plugin mentioned above
JOBSCRATCH=/jobfs/local/slurm/${SLURM_JOB_ID}.${SLURM_RESTART_COUNT}

if [ -d ${JOBSCRATCH} ]; then
    # Pull the job's --tmp request (MinTmpDiskNode) out of the scontrol output
    QUOTA=$(/apps/slurm/latest/bin/scontrol show JobId=${SLURM_JOB_ID} | egrep 'MinTmpDiskNode=[0-9]' | awk -F= '{print $NF}')
    # No --tmp request means the 100MB default
    if [ "${QUOTA}" == "0" ]; then
        QUOTA=100M
    fi
    # Tie the per-job directory to an XFS project named after the job ID
    # and put a hard block quota on it
    /usr/sbin/xfs_quota -x -c "project -s -p ${JOBSCRATCH} ${SLURM_JOB_ID}" /jobfs/local
    /usr/sbin/xfs_quota -x -c "limit -p bhard=${QUOTA} ${SLURM_JOB_ID}" /jobfs/local
fi

Hope that is useful!

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Skylar Thompson
2018-06-12 12:15:26 UTC
Permalink
Thanks, Chris! We've been considering doing this with GE prolog/epilog
scripts (and boot-time logic to clean up if a node dies with scratch space
still allocated) but haven't gotten around to it. I think we might also
need to get buy-in from some groups that are happy with the unenforced
state right now.
--
Skylar
Prentice Bisbal
2018-06-11 18:11:55 UTC
Permalink
Chris,

I'm dealing with this problem myself right now. We use Slurm here. We
really have one large, very heterogeneous cluster that's treated as
multiple smaller clusters through creating multiple partitions, each
with its own QOS. We also have some users who don't understand the
difference between -n and -N when specifying a job size. This has led
to jobs specified with -N staying in the queue for an unusually long
time. Yes, part of the solution is definitely user education, but there
are still times when a user should request nodes rather than tasks
(using OpenMP within a node, etc.).

Here's how I'm going to tackle this problem: most of our nodes are
32-core, but some older nodes still in use are 16-core, so we're going
to make sure that jobs going to our larger partitions request a multiple
of 16 tasks. That way, a job will either occupy whole nodes or leave
half a node available.

We have one partition meant for single-node or smaller jobs. That
partition has only Ethernet, since it shouldn't be supporting inter-node
jobs. On that partition, jobs can use 16 cores or fewer.

To make this work, I will be using job_submit.lua to apply this logic
and assign a job to a partition. If a user requests a specific partition
not in line with these specifications, job_submit.lua will reassign the
job to the appropriate QOS.
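
A minimal job_submit.lua sketch of that logic (the partition names, the
16-task multiple and the message are illustrative, and a real filter would
need more checks):

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- leave the job alone if no task count was given
    if job_desc.num_tasks == nil or job_desc.num_tasks == slurm.NO_VAL then
        return slurm.SUCCESS
    end
    if job_desc.num_tasks <= 16 then
        -- small jobs get routed to the single-node partition
        job_desc.partition = "serial"
    elseif job_desc.num_tasks % 16 ~= 0 then
        slurm.log_user("Please request a multiple of 16 tasks for multi-node jobs")
        return slurm.ERROR
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
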

I'll be happy to share how this works after it's been in place for a few
months.
Chris Samuel
2018-06-12 04:33:20 UTC
Permalink
Hi Prentice!
Post by Prentice Bisbal
To make this work, I will be using job_submit.lua to apply this logic
and assign a job to a partition. If a user requests a specific partition
not in line with these specifications, job_submit.lua will reassign the
job to the appropriate QOS.
Yeah, that's very much like what we do for GPU jobs (redirect them to the
partition with access to all cores, and ensure non-GPU jobs go to the
partition with fewer cores) via the submit filter at present.

I've already coded up something similar in Lua for our submit filter (that only
affects my jobs for testing purposes) but I still need to handle memory
correctly, in other words only pack jobs when the per-task memory request *
tasks per node < node RAM (for now we'll let jobs where that's not the case go
through to the keeper for Slurm to handle as now).

However, I do think Scott's approach is potentially very useful, by directing
jobs < full node to one end of a list of nodes and jobs that want full nodes
to the other end of the list (especially if you use the partition idea to
ensure that not all nodes are accessible to small jobs).

cheers!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

John Hearns via Beowulf
2018-06-12 06:16:39 UTC
Permalink
Post by Chris Samuel
However, I do think Scott's approach is potentially very useful, by directing
jobs < full node to one end of a list of nodes and jobs that want full nodes
to the other end of the list (especially if you use the partition idea to
ensure that not all nodes are accessible to small jobs).
Yes, not First In Last Out scheduling, more like
Fragmentary Entry Fractional Incoming First Out Full Unreserved for MPI
FEFIFOFUM

I shall get my coat on the way out.
Kilian Cavalotti
2018-06-12 08:13:49 UTC
Permalink
Post by Chris Samuel
However, I do think Scott's approach is potentially very useful, by directing
jobs < full node to one end of a list of nodes and jobs that want full nodes
to the other end of the list (especially if you use the partition idea to
ensure that not all nodes are accessible to small jobs).
Slurm has a scheduler option that could probably help with that:
https://slurm.schedmd.com/slurm.conf.html#OPT_pack_serial_at_end
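
i.e. in slurm.conf, alongside whatever other scheduler parameters are in use,
something like:

SchedulerType=sched/backfill
SchedulerParameters=pack_serial_at_end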

Cheers,
--
Kilian
Chris Samuel
2018-06-12 10:26:46 UTC
Permalink
Post by Kilian Cavalotti
https://slurm.schedmd.com/slurm.conf.html#OPT_pack_serial_at_end
Ah I knew I'd seen something like that before! I got fixated on CR_Pack_Nodes
which is not for scheduling jobs but for laying out job steps inside of a job.

Very much obliged!

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Prentice Bisbal
2018-06-12 15:08:44 UTC
Permalink
Post by Chris Samuel
However, I do think Scott's approach is potentially very useful, by directing
jobs < full node to one end of a list of nodes and jobs that want full nodes
to the other end of the list (especially if you use the partition idea to
ensure that not all nodes are accessible to small jobs).
This was something that was very easy to do with SGE. It's been a while
since I worked with SGE so I forget all the details, but in essence, you
could assign nodes a 'serial number' which would specify the preferred
order in which nodes would be assigned to jobs, and I believe that order
was specific to each queue, so if you had 64 nodes, one queue could
assign jobs starting at node 1 and work its way up to node 64, while
another queue could start at node 64 and work its way down to node 1.
This technique was mentioned in the SGE documentation to allow MPI and
shared memory jobs to share the cluster.

At the time, I used it, for exactly that purpose, but I didn't think it
was that big a deal. Now that I don't have that capability, I miss it.

Prentice

Ryan Novosielski
2018-06-12 15:11:08 UTC
Permalink
Post by Prentice Bisbal
At the time, I used it, for exactly that purpose, but I didn't think it was
that big a deal. Now that I don't have that capability, I miss it.
SLURM has the ability to do priority "weights" for nodes as well, to somewhat
the same effect, so far as I know. At our site, though, that does not work, as
it apparently conflicts with the topology plugin (which we also use) instead
of layering with it or something more useful.

--
____
|| \\UTGERS, |---------------------------*O*---------------------------
||_// the State | Ryan Novosielski - ***@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
|| \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'
Skylar Thompson
2018-06-12 15:15:45 UTC
Permalink
Post by Prentice Bisbal
This was something that was very easy to do with SGE. It's been a while
since I worked with SGE so I forget all the details, but in essence, you
could assign nodes a 'serial number' which would specify the preferred order
in which nodes would be assigned to jobs, and I believe that order was
specific to each queue, so if you had 64 nodes, one queue could assign jobs
starting at node 1 and work it's way up to node 64, while another queue
could start at node 64 and work its way down to node 1. This technique was
mentioned in the SGE documentation to allow MPI and shared memory jobs to
share the cluster.
At the time, I used it, for exactly that purpose, but I didn't think it was
that big a deal. Now that I don't have that capability, I miss it.
Yep, this is still the case. It's not actually a setting of the exec host,
but of each queue instance that the exec host is providing. By default GE
sorts queue instances by load but you can set sequence number in the
scheduler configuration. Unfortunately, this is a cluster-wide setting, so
you can't have some queues sorted by load and others sorted by sequence
number.
--
Skylar