Discussion:
[Beowulf] Theoretical vs. Actual Performance
Prentice Bisbal
2018-02-22 14:37:54 UTC
Beowulfers,

In your experience, how close does the actual performance of your
processors come to their theoretical performance? I'm investigating a
performance issue on some of my nodes. These are older systems using
AMD Opteron 6274 processors. I found literature from AMD stating the
theoretical performance of these processors is 282 GFLOPS, and my
LINPACK performance isn't coming close to that (I get approximately 33%
of that). The number I often hear mentioned is that actual performance
should be ~85% of theoretical performance. Is that a realistic number,
in your experience?

I don't want this to be a discussion of what could be wrong at this
point, we will get to that in future posts, I assure you!
--
Prentice

Joe Landman
2018-02-22 14:45:00 UTC
Post by Prentice Bisbal
[snip]
85% assumes that you have the systems configured in an optimal manner,
that the compiler doesn't do anything wonky, and that, to some degree,
you isolate the OS portion of the workload off of most of the cores to
reduce jitter. Among other things.
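For example, one common way to do that isolation (a sketch, not a
recipe; the core numbering, the 16-core count, and the xhpl path are
assumptions for illustration):

# Keep the benchmark off core 0, where most OS/interrupt activity lands,
# and bind it to the remaining cores of an assumed 16-core node:
$ numactl --physcpubind=1-15 ./xhpl

Note that numactl can fight with an MPI launcher's own binding; with
Open MPI you would normally use its --bind-to/--map-by options instead.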

At Scalable, I'd regularly hit 60-90% of theoretical max computing
performance, with progressively more heroic tuning. Storage, I'd
typically hit 90-95% of theoretical max (good architectures almost
always beat bad ones). Networking, fairly similar, though tuning per
use case mattered significantly.
--
Joe Landman
t: @hpcjoe
w: https://scalability.org

John Hearns via Beowulf
2018-02-22 15:16:33 UTC
Prentice, I echo what Joe says.
When doing benchmarking with HPL or SPEC benchmarks, I would optimise the
BIOS settings to the highest degree I could.
Switch off processor C-states.
As Joe says, you need to look at what the OS is running in the background.
I would disable the Bright Cluster Manager daemon, for instance.


85% of theoretical peak on an HPL run sounds reasonable to me, and I would
get figures in that ballpark.

For your AMDs, I would start by choosing one system, with no interconnect
to muddy the waters. See what you can get out of that.
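If it helps, here is a sketch of what I would check from Linux before
(or after) touching the BIOS (cpupower ships with the kernel tools;
exact output and availability vary by distro):

# Show which idle (C) states the kernel can enter:
$ cpupower idle-info
# Pin the frequency governor to performance for the benchmark run:
$ cpupower frequency-set -g performance
# Confirm the cores are actually sitting at their rated frequency:
$ grep MHz /proc/cpuinfo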
Benson Muite
2018-02-22 15:42:51 UTC
There is a very nice and simple max-FLOPS code that requires much less
tuning than LINPACK. It is described on p. 57 of:

Rahman "Intel® Xeon Phi™ Coprocessor Architecture and Tools"
https://link.springer.com/book/10.1007%2F978-1-4302-5927-5

An example Fortran code is here:
https://github.com/bkmgit/intel-xeon-phi-coprocessor-architecture-tools/tree/master/ch05
Chris Samuel
2018-02-23 06:57:11 UTC
Post by Joe Landman
85% makes the assumption that you have the systems configured in an
optimal manner, that the compiler doesn't do anything wonky, and that,
to some degree, you isolate the OS portion of the workload off of most
of the cores to reduce jitter. Among other things.
Interesting. Purchases I've done before for IB clusters have had an HPL
Rmax of 80% of Rpeak as the acceptance-testing requirement, and we've
not had much problem hitting it.

The worst issue we had was on SandyBridge, where the kernel ignored the
UEFI settings, said "hey, I know these CPUs, I'll enable ALL the power
saving", and that killed performance until we disabled those states via
the kernel boot parameters.

It can be worth running powertop to see what states your CPUs are sitting in
whilst running HPL, and also "perf top" to see what the system is up to.
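Something like this, run in a second terminal while HPL is going (a
sketch; both tools generally need root or suitable perf_event settings):

# C-state and frequency residency per core:
$ powertop
# Where the CPU time is actually going; xhpl should dominate, with most
# samples landing inside the BLAS dgemm kernels:
$ perf top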

Good luck!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Michael Di Domenico
2018-02-22 15:44:25 UTC
I can't speak to AMD, but using HPL 2.1 on Intel with the Intel compiler
and the Intel MKL, I can hit 90% without issue, and no major tuning
either.

If you're at 33%, I would be suspect of your math library.
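A quick sanity check is to confirm which BLAS the binary actually
linked against (a sketch; the binary and library names are
illustrative):

# Verify xhpl is picking up the BLAS you built it against:
$ ldd ./xhpl | grep -i -e blas -e acml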
John Hearns via Beowulf
2018-02-22 15:52:54 UTC
Oh, and use the Adaptive computing HPL calculator to get your input file.
Thanks Adaptive guys!
Chris Samuel
2018-02-23 06:58:41 UTC
Post by John Hearns via Beowulf
Oh, and use the Adaptive computing HPL calculator to get your input file.
Thanks Adaptive guys!
I think you mean Advanced Clustering.. :-)

http://www.advancedclustering.com/act_kb/tune-hpl-dat-file/
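For reference, the rule of thumb behind that calculator: the double
precision matrix takes ~8*N^2 bytes, so you pick the problem size N to
fill most of memory while leaving room for the OS, roughly

N ~= sqrt(mem_fraction * total_mem_bytes / 8)

e.g. for (say) a 64 GB node at 85% of memory, N ~= sqrt(0.85 * 64e9 / 8)
~= 82000, usually rounded to a multiple of the block size NB.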

cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Prentice Bisbal
2018-02-22 16:48:31 UTC
Post by Michael Di Domenico
If you're at 33%, I would be suspect of your math library.
I'm using OpenBLAS 0.2.19 with dynamic architecture support, but I'm
thinking of switching to ACML for this test, to remove the possibility
that it's a problem with my OpenBLAS build.

Joe Landman
2018-02-22 16:58:51 UTC
Which compiler are you using, and what options are you compiling it with?
--
Joe Landman
e: ***@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

Prentice Bisbal
2018-02-22 18:56:23 UTC
For OpenBLAS, or HPL?

For HPL, I used GCC 6.1.0 with these flags:

$ egrep -i "flags|defs" Make.gcc-6.1.0_openblas-0.2.19
F2CDEFS      = -DAdd__ -DF77_INTEGER=int -DStringSunStyle
HPL_DEFS     = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
CCNOOPT  = $(HPL_DEFS)
OMP_DEFS = -openmp
CCFLAGS  = $(HPL_DEFS) -march=barcelona -O3 -Wall
LINKFLAGS    = $(CCFLAGS) $(OMP_DEFS)
ARFLAGS      = r

For OpenBLAS:

make DYNAMIC_ARCH=1 CC=gcc FC=gfortran

# This little summary is printed out at end of build:

 OpenBLAS build complete. (BLAS CBLAS LAPACK LAPACKE)

  OS               ... Linux
  Architecture     ... x86_64
  BINARY           ... 64bit
  C compiler       ... GCC  (command line : gcc)
  Fortran compiler ... GFORTRAN  (command line : gfortran)
  Library Name     ... libopenblasp-r0.2.19.a (Multi threaded; Max num-threads is 8)

Prentice
David Mathog
2018-02-22 17:52:20 UTC
Post by Prentice Bisbal
I found literature from AMD stating the theoretical performance of these
processors is 282 GFLOPS, and my LINPACK performance isn't coming close
to that (I get approximately 33% of that).
That does seem low. Check the usual culprits:

1. CPU frequency scaling locked to the lowest setting, or set to a
governor which adjusts frequency and then interacts poorly with the test
software. The rated performance will have been measured with the CPU
locked to its highest frequency. (See the commands sketched after this
list.)

2. something else running, especially something which forces the test
program out of memory or file caches. I wouldn't expect this sort of
test to be IO bound to disk, but if it is, and hugepages are used,
enormous performance drops may be observed when the system decides to
move those around. I wouldn't put it past AMD or Intel to run these
sorts of tests with the test system stripped down to the bones. No
network, no logging, single user, etc. That is, absolutely nothing that
would compete for CPU time. (Just checked on one of our big systems.
ps -ef | wc shows 953 processes: 48 migration, 48 ksoftirqd, 49
stopper, 49 watchdog, 49 kintegrityd, 49 kblockd, 49 ata_sff, 49 md, 49
md_misc, 49 aio, 49 crypto, 49 kthrotld, 49 rpciod, 19 gdm (console
processes, even with no display attached at the moment and nobody logged
in there), 193 events, 12 of my processes, and 107 miscellaneous OS
processes.)

3. ulimit settings; check /etc/security/limits.conf.

4. NUMA issues. Some multithreaded programs allocate a large block of
memory once, which ends up on one side of a NUMA system, and then start
some or all of their threads on the other side. Threads on the wrong
side will run a variable amount slower than those on the right side. If
this is what is going on, locking all threads to the same side of the
system (if it has just two sides) can speed things up a bit, assuming
the test isn't supposed to use all threads.

5. Different compiler/optimization. The vendor may have used a binary
which was tweaked to the Nth degree, perhaps even using profiling from
earlier runs to optimize the final run. If you are using a benchmark
number from AMD, see if you can obtain the exact same version of the
test software that they used (which may be available), so that you can
eliminate this variable. Perhaps wherever they keep that, they also have
a detailed description of the test system?
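A sketch of quick checks for items 1, 2, and 4 above (tool availability
varies by distro; the exact commands are illustrative):

# 1. Frequency: current governor, and whether the clock is at rated speed
$ cpupower frequency-info
$ grep MHz /proc/cpuinfo
# 2. Competing load: anything else eating CPU while the test runs
$ top -b -n 1 | head -20
# 4. NUMA layout: node count, and free memory per node
$ numactl --hardware
$ numastat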

Regards,

David Mathog
***@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
Dmitri Chubarov
2018-02-22 18:18:14 UTC
Hi,

I'm not sure the 282 GFLOPS number is correct.

We have 16 Bulldozer/Interlagos cores at 2.2 GHz. Each pair of cores
forms a CMT module, and the two cores in a module share an FPU with two
128-bit FMAC units.

In terms of double-precision FLOPS, that should make

16 cores * 2.2 GHz * 2 doubles per 128-bit SIMD register * 2 FLOPs per FMA
= 140.8 GFLOPS

It looks like the 282 GFLOPS number is for a 2P node (2 sockets * 140.8
= 281.6 ~= 282 GFLOPS).

Dima
Prentice Bisbal
2018-02-22 18:50:54 UTC
This is my source for those theoretical numbers:

http://dewaele.org/~robbe/thesis/writing/references/49747D_HPC_Processor_Comparison_v3_July2012.pdf

If those numbers are off, that makes my job a bit easier. And it looks
like you're right. In the text above the table, it does mention 2-socket
servers, and then below the table, in fine print, it states:

"For AMD Opteron Processors, theoretical FLOPS = Core Count x Core
Frequency x number of processors per server x 4."

Why can't the table just show single-socket performance? Grrrr....

Regardless of bad marketing and graphic design, I'm still at square one.
My system has 2 sockets, and the best I've been able to do is get ~115
GFLOPS, and that's one of the 'instantaneous' values LINPACK spits out
every few seconds. At the end of the test, the actual GFLOPS result is
more like 77 GFLOPS:

===========================================
T/V                N    NB     P     Q Time                 Gflops
--------------------------------------------------------------------------------
WR00L2L2       82775    40     4     8 4924.71              7.678e+01

This is a two-socket system, so that's only 27% of theoretical max.

Prentice
Prentice Bisbal
2018-02-22 22:27:27 UTC
So I just rebuilt HPL using the ACML 6.1.0 libraries with GCC 6.1.0, and
I'm now getting 197 GFLOPS, so clearly there's a problem with my
OpenBLAS build. I'm going to try building OpenBLAS without the dynamic
arch support on the machine where I plan on running my tests, and see if
that version of the library is any better.

Prentice
Prentice Bisbal
2018-02-22 22:48:36 UTC
Just rebuilt OpenBLAS 0.2.20 locally on the test system with GCC 6.1.0,
and I'm only getting 91 GFLOPS. I'm pretty sure OpenBLAS performance
should be close to ACML performance, if not better. I'll have to dig
into this later. For now, I'm going to continue my testing using the
ACML-based build and revisit the OpenBLAS performance later.

Prentice
Benson Muite
2018-02-22 22:56:12 UTC
Consider trying:
https://github.com/amd/blis
https://github.com/clMathLibraries/clBLAS

as well.
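A minimal sketch of trying the BLIS route (configuration names change
between releases; "auto" lets it detect the core, but verify what it
picks):

$ git clone https://github.com/amd/blis
$ cd blis
$ ./configure auto
$ make -j && make check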
Joe Landman
2018-02-22 23:01:06 UTC
ACML is hand-coded assembly. It's not likely that OpenBLAS will be much
better; it could be similar. Cf.
http://gcdart.blogspot.co.uk/2013/06/fast-matrix-multiply-and-ml.html
--
Joe Landman
e: ***@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

Prentice Bisbal
2018-02-23 00:19:16 UTC
Joe,

Thanks for the link. Based on that, they should be pretty close in
performance, and mine are not, so I must be doing something wrong with
my OpenBLAS build. Since ACML is dead, I was hoping I could use OpenBLAS
moving forward.

Prentice