Discussion:
[Beowulf] latency and bandwidth micro benchmarks
Lawrence Stewart
2006-08-15 13:02:12 UTC
Permalink
As has been mentioned here, the canonical bandwidth benchmark is
streams.

AFAIK, the canonical latency benchmark is lat_mem_rd, which is part of
the lmbench suite.


Streams is ultimately a test of the bandwidth path between the DRAMs and the core: if you turn the buffer size up high enough, you overflow any cache, and if you keep turning it up well beyond that, you wash out edge effects such as not needing to write back the dirty cache lines at the end of the test. Secondarily, Streams is a compiler test of loop unrolling, software pipelining, and prefetch.

Streams is easy meat for hardware prefetch units, since the access patterns are sequential, but that is OK. It is a bandwidth test.
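
For concreteness, here is a bare-bones sketch of the kind of kernel Streams times. This is my own toy, not the real STREAM harness (which runs its Copy/Scale/Add/Triad kernels repeatedly and reports the best pass), but it shows why the buffer size matters: three 160 MB arrays cannot live in any cache.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 20000000L          /* 160 MB per array: far larger than any cache */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        double q = 3.0;
        struct timespec t0, t1;

        if (!a || !b || !c) return 1;
        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)          /* triad: two loads + one store per element */
            a[i] = b[i] + q * c[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("triad: %.1f MB/s (a[0]=%g)\n",              /* print a[0] so the loop  */
               3.0 * N * sizeof(double) / sec / 1e6, a[0]); /* can't be optimized away */
        return 0;
    }

Whether the compiler unrolls, software-pipelines, and prefetches that inner loop is exactly the "secondarily" part above.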

Latency is much harder to get at. lat_mem_rd tries fairly hard to defeat hardware prefetch units by threading a chain of pointers through a random set of cache blocks. Other tests that don't do this get screwy results.

lat_mem_rd produces a graph, and it is easy to see the L1, L2, and
main memory plateaus.

This is all lead-up to asking for lat_mem_rd results for Woodcrest (and Conroe, if there are any out there), and for dual-core Opterons (275).

With both streams and lat_mem_rd, one can run one copy or multiple copies, or use a single copy in multithread mode. Many cited test results I have been able to find use very vague English to describe exactly what they have tested. I prefer running two copies of stream rather than using OpenMP - I want to measure bandwidth, not inter-core synchronization. For lat_mem_rd, the -P 2 switch seems fine; it just forks two copies of the test.

I'm interested in results for a single thread, but I am also interested in results for multiple threads on dual-core chips and in machines with multiple sockets of single- or dual-core chips.

The bandwidth of a two-socket single-core machine, for example, should be nearly twice the bandwidth of a single-socket dual-core machine simply because the threads are using different memory controllers. Is this borne out by tests? Four threads on a dual-socket dual-core should give similar bandwidth per core to a single-socket dual-core. True?

Next, considering a dual-core chip: to the extent that a single core can saturate the memory controller, when both cores are active there should be a substantial drop in bandwidth per core.
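
As a back-of-the-envelope check on the size of that drop (my own arithmetic, assuming a dual-channel DDR-400 controller of the Opteron 248 generation, not anything measured here):

\[
BW_{\mathrm{peak}} = 2\ \mathrm{channels} \times 8\ \mathrm{B} \times 400\ \mathrm{MT/s} = 6.4\ \mathrm{GB/s},
\qquad
BW_{\mathrm{per\,core}} \le 6.4 / 2 = 3.2\ \mathrm{GB/s},
\]

i.e. each core sees at most half of what it would get with the controller to itself.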

Latency is much more difficult. I would expect that dual-core lat_mem_rd results with both cores active should show only a slight degradation of latency, due to occasional bus contention or resource scheduling conflicts between the cores. A single memory controller should be able to handle pointer chasing activity from multiple cores. True?
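
A rough way to see why (again my own estimate, assuming roughly 100 ns per miss, 64-byte lines, and one outstanding miss per chasing core):

\[
\frac{64\ \mathrm{B}}{100\ \mathrm{ns}} \approx 0.64\ \mathrm{GB/s\ per\ core},
\]

a small fraction of the controller's peak bandwidth, so two pointer chasers should only rarely collide at the controller.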

Our server farm here is all dual-processor single-core (Opteron 248) and they seem to behave as expected: running two copies of stream gives nearly double performance, and the latency degradation due to running two copies of lat_mem_rd is nearly undetectable. We don't have any dual-core chips or any Intel chips.

-Larry
Bill Broadley
2006-08-29 05:47:51 UTC
Permalink
Post by Lawrence Stewart
As has been mentioned here, the canonical bandwidth benchmark is
streams.
Agreed.
Post by Lawrence Stewart
AFAIK, the canonical latency benchmark is lat_mem_rd, which is part of
the lmbench suite.
Really? Seems like more of a prefetch test than a latency benchmark. A fixed stride lets the hardware guess the (n+1)th address before the nth load has even completed.

I ran the full lmbench:
Host       OS             Description       Mhz   tlb pages  line bytes  mem par  scal load
---------  -------------  ----------------  ----  ---------  ----------  -------  ---------
amd-2214   Linux 2.6.9-3  x86_64-linux-gnu  2199  32         128         4.4800   1
xeon-5150  Linux 2.6.9-3  x86_64-linux-gnu  2653  8          128         5.5500   1

Strangely, the Linux kernel disagrees with lmbench on the cache line size for the AMD (from dmesg):
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
Post by Lawrence Stewart
Secondarily, streams is a compiler test of loop unrolling, software
pipelining, and prefetch.
Indeed.
Post by Lawrence Stewart
Streams is easy meat for hardware prefetch units, since the access
patterns are
sequential, but that is OK. It is a bandwidth test.
Agreed.
Post by Lawrence Stewart
latency is much harder to get at. lat_mem_rd tries fairly hard to
defeat hardware
prefetch units by threading a chain of pointers through a random set
of cache
blocks. Other tests that don't do this get screwy results.
A random set of cache blocks?

You mean:
http://www.bitmover.com/lmbench/

I got the newest lmbench3.
The benchmark runs as two nested loops. The outer loop is the stride
size. The inner loop is the array size.

The memory results:
Memory latencies in nanoseconds - smaller is better
(WARNING - may not be correct, check graphs)
------------------------------------------------------------------------------
Host OS Mhz L1 $ L2 $ Main mem Rand mem Guesses
--------- ------------- --- ---- ---- -------- -------- -------
amd-2214 Linux 2.6.9-3 2199 1.3650 5.4940 68.4 111.3
xeon-5150 Linux 2.6.9-3 2653 1.1300 5.3000 101.5 114.2
Post by Lawrence Stewart
lat_mem_rd produces a graph, and it is easy to see the L1, L2, and
main memory plateaus.
This is all leadup to asking for lat_mem_rd results for Woodcrest
(and Conroe, if there
are any out there), and for dual-core Opterons (275)
The above amd-2214 is the DDR2 version of the Opteron 275.

My latency numbers with plat are 98.5 ns for a 38 MB array, a bit better than lmbench.
Post by Lawrence Stewart
With both streams and lat_mem_rd, one can run one copy or multiple
copies, or use a
single copy in multithread mode. Many cited test results I have been
able to find use
very vague english to describe exactly what they have tested. I
My code is pretty simple; for an array of N ints I do:

    while (p != 0) {
        p = a[p];        /* the next address depends on the value just loaded */
    }

That, to me, is random memory latency. Although a two-stage loop:

    for 0 to N pages
        pick a random page
        for 0 to M (cache lines per page)
            pick a random cache line

would minimize the time spent on page overhead (a sketch of this follows).
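
A sketch of that two-stage construction in C (my own code, not Bill's benchmark): pages are visited in random order and, within each page, the cache lines in random order, so each TLB entry gets reused for a whole page's worth of misses before moving on. The page and line sizes are assumptions.

    #include <stdlib.h>

    #define PAGE 4096
    #define LINE 64
    #define LPP  (PAGE / LINE)             /* cache lines per page */

    /* Fisher-Yates shuffle of an index array. */
    static void shuffle(size_t *v, size_t n)
    {
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = v[i]; v[i] = v[j]; v[j] = t;
        }
    }

    /* Thread a pointer chain through buf (at least npages * PAGE bytes):
     * random page order, random line order within each page, one pointer
     * stored at the start of every cache line, NULL-terminated. */
    char **build_chain(char *buf, size_t npages)
    {
        size_t *pg = malloc(npages * sizeof *pg);
        size_t ln[LPP];
        char **prev = NULL, **head = NULL;

        for (size_t i = 0; i < npages; i++) pg[i] = i;
        shuffle(pg, npages);

        for (size_t i = 0; i < npages; i++) {
            for (size_t j = 0; j < LPP; j++) ln[j] = j;
            shuffle(ln, LPP);
            for (size_t j = 0; j < LPP; j++) {
                char **slot = (char **)(buf + pg[i] * PAGE + ln[j] * LINE);
                if (prev) *prev = (char *)slot; else head = slot;
                prev = slot;
            }
        }
        *prev = NULL;                      /* the chase loop stops here */
        free(pg);
        return head;
    }

The timed part is then just the while (p != 0) walk over the chain returned by build_chain().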
Post by Lawrence Stewart
prefer running
two copies of stream rather than using OpenMP - I want to measure
bandwidth, not
inter-core synchronization.
I prefer it synchronized. Otherwise the two stream copies might drift out of sync, and while one gets 8 GB/sec and the other gets 8 GB/sec, they didn't do it at the same time. In my benchmark I take the min of all start times and the max of all stop times, so there is no cheating.
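
A minimal sketch of that bookkeeping (my own code, not Bill's benchmark; the memset stands in for whatever kernel is being measured): every thread records its own start and stop time, everyone is released from a barrier together, and the aggregate rate is computed from max(stop) - min(start) so nobody gets credit for bandwidth achieved while the other thread was idle.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define NTHREADS 2
    #define BYTES    (256UL * 1024 * 1024)     /* per-thread working set */

    static pthread_barrier_t bar;
    static double t_start[NTHREADS], t_stop[NTHREADS];
    static char *buf[NTHREADS];

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    static void *worker(void *arg)
    {
        long id = (long)arg;
        pthread_barrier_wait(&bar);            /* all threads start together */
        t_start[id] = now();
        memset(buf[id], (int)id, BYTES);       /* stand-in for the real kernel */
        t_stop[id] = now();
        return NULL;
    }

    int main(void)
    {
        pthread_t th[NTHREADS];
        pthread_barrier_init(&bar, NULL, NTHREADS);
        for (long i = 0; i < NTHREADS; i++) buf[i] = malloc(BYTES);
        for (long i = 0; i < NTHREADS; i++) pthread_create(&th[i], NULL, worker, (void *)i);
        for (long i = 0; i < NTHREADS; i++) pthread_join(th[i], NULL);

        double first = t_start[0], last = t_stop[0];
        for (int i = 1; i < NTHREADS; i++) {
            if (t_start[i] < first) first = t_start[i];
            if (t_stop[i] > last)   last  = t_stop[i];
        }
        /* min(start) .. max(stop): no credit for work done while the
         * other thread wasn't running */
        printf("%.1f MB/s aggregate\n", (double)NTHREADS * BYTES / (last - first) / 1e6);
        return 0;
    }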
Post by Lawrence Stewart
I'm interested in results for a single thread, but I am also
interested in results for
multiple threads on dual-core chips and in machines with multiple
sockets of single
or dual core chips.
Since you're spending most of your time waiting on DRAM, there isn't much contention:
http://cse.ucdavis.edu/~bill/intel-1vs4t.png
Post by Lawrence Stewart
The bandwidth of a two-socket single-core machine, for example,
should be nearly twice
the bandwidth of a single-socket dual-core machine simply because the
threads are
using different memory controllers.
Judge for yourself:
http://cse.ucdavis.edu/~bill/quad-numa.png (quad opteron)
http://cse.ucdavis.edu/~bill/altix-dplace.png
http://cse.ucdavis.edu/~bill/intel-5150.png (woodcrest + ddr2-667)
Post by Lawrence Stewart
Is this borne out by tests?
Four threads on
a dual-dual should give similar bandwidth per core to a single socket
dual-core. True?
Yes, alas I don't have graphs of single-socket dual-core systems handy.
Post by Lawrence Stewart
Next, considering a dual-core chip, to the extent that a single core
can saturate the memory
controller, when both cores are active, there should be a substantial
drop in bandwidth
per core.
Right.
Post by Lawrence Stewart
Latency is much more difficult. I would expect that dual-core
lat_mem_rd results with
both cores active should show only a slight degradation of latency,
due to occasional
bus contention or resource scheduling conflicts between the cores. A
single memory
controller should be able to handle pointer chasing activity from
multiple cores. True?
Right, see above graphs for 1 vs 4t.
Post by Lawrence Stewart
Our server farm here is all dual-processor single core (Opteron 248)
and they seem
to behave as expected: running two copies of stream gives nearly
double performance,
and the latency degradation due to running two copies of lat_mem_rd
is nearly
indetectable. We don't have any dual-core chips or any Intel chips.
Right.
--
Bill Broadley
Computational Science and Engineering
UC Davis
Robert G. Brown
2006-08-29 11:28:46 UTC
Permalink
Post by Bill Broadley
Post by Lawrence Stewart
latency is much harder to get at. lat_mem_rd tries fairly hard to
defeat hardware
prefetch units by threading a chain of pointers through a random set
of cache
blocks. Other tests that don't do this get screwy results.
A random set of cache blocks?
http://www.bitmover.com/lmbench/
I got the newest lmbench3.
The benchmark runs as two nested loops. The outer loop is the stride
size. The inner loop is the array size.
Memory latencies in nanoseconds - smaller is better
(WARNING - may not be correct, check graphs)
------------------------------------------------------------------------------
Host OS Mhz L1 $ L2 $ Main mem Rand mem Guesses
--------- ------------- --- ---- ---- -------- -------- -------
amd-2214 Linux 2.6.9-3 2199 1.3650 5.4940 68.4 111.3
xeon-5150 Linux 2.6.9-3 2653 1.1300 5.3000 101.5 114.2
In benchmaster there is a memory read test that does indeed have a random "shuffled access order" option. It begins by filling a vector with its own index list, then shuffling its contents. Then it reads its way through the list, using each index that it reads as the index of the next read. It shows a really dramatic difference in timing, as in (on my admittedly slow and plodgy 1.87 GHz laptop) 0.4 nsec in streaming access mode and 70 nsec in random access mode. Yes, that is seven-zero.

Note that this is for a fairly long main (target) block (10,000,000 ints), and that the code is identical in the two cases -- the streaming access mode does the same test but omits the pre-test shuffle. The shuffle defeats both prefetch and cache.

Make of it what you will. Modern hardware "likes" streaming access. Barring that, it likes local access that fits into L2. When your code bounces all over God's own kingdom with its memory accesses, the very prefetch units that optimize streaming very likely work against you, constantly pulling in more memory than you are actually going to use in the vicinity of the last memory reference and having to flush it all away as it eventually languishes and the cache space is reclaimed.

rgb
Post by Bill Broadley
while (p != 0)
{
p = a[p];
}
That to me is random memory latency. Although doing a 2 stage loop
for 0 to N pages
pick a random page
for 0 to M (cachelines per page)
pick a random cacheline
Would minimize time spent with the page overhead.
Ya, something like this but with a[p] shuffled. I'm trying to remember if I wrote the shuffle in such a way as to guarantee no short cycles (all elements visited exactly once) and think that I did, but I would have to check the source to be sure (even though it is my own source, sigh). If I ever do get back to work on benchmaster, I'll definitely make it work this way if it doesn't already.
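
For what it's worth, the standard way to guarantee that property (one big cycle, every element visited exactly once before the walk repeats) is Sattolo's variant of the Fisher-Yates shuffle; a sketch, not benchmaster's actual code:

    #include <stdlib.h>

    /* Sattolo's algorithm: permute idx[0..n-1] so that following
     * i -> idx[i] traces a single cycle through all n elements. */
    void sattolo(size_t *idx, size_t n)
    {
        for (size_t i = 0; i < n; i++) idx[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;    /* j < i, never j == i: that is the whole trick */
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
    }

Chasing p = idx[p] from any starting point then touches all n entries before it comes back around, which is exactly the "no loops" property in question.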

rgb
--
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:***@phy.duke.edu
Lawrence Stewart
2006-08-29 14:36:39 UTC
Permalink
Post by Bill Broadley
Post by Lawrence Stewart
AFAIK, the canonical latency benchmark is lat_mem_rd, which is part of
the lmbench suite.
Really? Seems like more of a prefetch test then a latency benchmark.
A fixed stride allows a guess at where the n+1 address before the n'th
address is loaded.
So after some study of the lmbench sources... The basic idea is to follow a chain of pointers, causing the loads to be serialized. There are three different initialization routines:

stride_initialize - steps through memory in a predictable pattern
thrash_initialize - puts the cache lines of the entire block in random order, which can (should) cause both a TLB miss and a cache miss on every load
mem_initialize - threads through the cache lines of one page in random order before going on to the next page

Evidently the mem_initialize routine was the one I was thinking of. It seems to be used by lat_dram_page rather than by lat_mem_rd. I'll stare at this some more. So far I am having trouble getting gmake's attention.

Does your program have just one touch of each cache block? Or does it, in random order, touch all the words in the line? The latter case should get a somewhat lower access time than the latency all the way to the DRAMs.
Post by Bill Broadley
Host       OS             Description       Mhz   tlb pages  line bytes  mem par  scal load
---------  -------------  ----------------  ----  ---------  ----------  -------  ---------
amd-2214   Linux 2.6.9-3  x86_64-linux-gnu  2199  32         128         4.4800   1
xeon-5150  Linux 2.6.9-3  x86_64-linux-gnu  2653  8          128         5.5500   1
Strangely, the linux kernel disagrees on the cache line size for the amd
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
Post by Lawrence Stewart
Secondarily, streams is a compiler test of loop unrolling, software
pipelining, and prefetch.
Indeed.
Post by Lawrence Stewart
Streams is easy meat for hardware prefetch units, since the access
patterns are
sequential, but that is OK. It is a bandwidth test.
Agreed.
Post by Lawrence Stewart
latency is much harder to get at. lat_mem_rd tries fairly hard to
defeat hardware
prefetch units by threading a chain of pointers through a random set
of cache
blocks. Other tests that don't do this get screwy results.
A random set of cache blocks?
Exactly. Also, the prefetcher isn't entirely useless, since there is some chance that the prefetch will load a line that hasn't yet been touched and that won't be evicted before it is used.

The inner loop is an unwound p = (char **) *p;
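
From what I remember of the lmbench source (worth double-checking against your tree), the unwinding is done with nested macros; a sketch of the shape, assuming a chain that has been made circular so the walk never falls off the end:

    #define ONE     p = (char **)*p;
    #define FIVE    ONE ONE ONE ONE ONE
    #define TEN     FIVE FIVE
    #define FIFTY   TEN TEN TEN TEN TEN
    #define HUNDRED FIFTY FIFTY

    /* 100 serialized, data-dependent loads per pass, so the loop
     * overhead is amortized to about 1% of what is being timed. */
    static char **walk(char **p, long passes)
    {
        while (passes-- > 0) {
            HUNDRED
        }
        return p;        /* returning p keeps the compiler from discarding the loop */
    }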