Discussion:
[Beowulf] Theoretical vs. Actual Performance
Prentice Bisbal
2018-02-22 14:37:54 UTC
Beowulfers,

In your experience, how close does the actual performance of your
processors come to their theoretical performance? I'm investigating a
performance issue on some of my nodes. These are older systems using
AMD Opteron 6274 processors. I found literature from AMD stating the
theoretical performance of these processors is 282 GFLOPS, and my
LINPACK performance isn't coming close to that (I get approximately 33%
of that). The number I often hear mentioned is that actual performance
should be ~85% of theoretical performance. Is that a realistic number,
in your experience?

I don't want this to be a discussion of what could be wrong at this
point, we will get to that in future posts, I assure you!
--
Prentice

Joe Landman
2018-02-22 14:45:00 UTC
Post by Prentice Bisbal
[snip]
85% assumes that you have the systems configured in an optimal manner,
that the compiler doesn't do anything wonky, and that, to some degree,
you isolate the OS portion of the workload off of most of the cores to
reduce jitter. Among other things.
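For example, one common way to do that isolation (a sketch, not a
recipe; the core numbering, the 16-core count, and the xhpl path are
assumptions for illustration):

# Keep the benchmark off core 0, where most OS/interrupt activity lands,
# and bind it to the remaining cores of an assumed 16-core node:
$ numactl --physcpubind=1-15 ./xhpl

Note that numactl can fight with an MPI launcher's own binding; with
Open MPI you would normally use its --bind-to/--map-by options instead.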

At Scalable, I'd regularly hit 60-90% of theoretical max computing
performance, with progressively more heroic tuning. Storage, I'd
typically hit 90-95% of theoretical max (good architectures almost
always beat bad ones). Networking, fairly similar, though tuning per
use case mattered significantly.
--
Joe Landman
t: @hpcjoe
w: https://scalability.org

John Hearns via Beowulf
2018-02-22 15:16:33 UTC
Prentice, I echo what Joe says.
When doing benchmarking with HPL or SPEC benchmarks, I would optimise the
BIOS settings to the highest degree I could.
Switch off processor C-states.
As Joe says, you need to look at what the OS is running in the background.
I would disable the Bright Cluster Manager daemon, for instance.


85% of theoretical peak on an HPL run sounds reasonable to me, and I would
get figures in that ballpark.

For your AMDs, I would start by choosing one system, with no interconnect
to muddy the waters. See what you can get out of that.
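If it helps, here is a sketch of what I would check from Linux before
(or after) touching the BIOS (cpupower ships with the kernel tools;
exact output and availability vary by distro):

# Show which idle (C) states the kernel can enter:
$ cpupower idle-info
# Pin the frequency governor to performance for the benchmark run:
$ cpupower frequency-set -g performance
# Confirm the cores are actually sitting at their rated frequency:
$ grep MHz /proc/cpuinfo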
Benson Muite
2018-02-22 15:42:51 UTC
There is a very nice and simple max-FLOPS code that requires much less
tuning than LINPACK. It is described on p. 57 of:

Rahman "Intel® Xeon Phi™ Coprocessor Architecture and Tools"
https://link.springer.com/book/10.1007%2F978-1-4302-5927-5

An example Fortran code is here:
https://github.com/bkmgit/intel-xeon-phi-coprocessor-architecture-tools/tree/master/ch05
Chris Samuel
2018-02-23 06:57:11 UTC
Post by Joe Landman
85% makes the assumption that you have the systems configured in an
optimal manner, that the compiler doesn't do anything wonky, and that,
to some degree, you isolate the OS portion of the workload off of most
of the cores to reduce jitter. Among other things.
Interesting. Purchases I've done before for IB clusters have had an HPL
Rmax of 80% of Rpeak as the acceptance-testing requirement, and we've
not had much problem hitting it.

The worst issue we had was on SandyBridge, where the kernel ignored the
UEFI settings, said "hey, I know these CPUs, I'll enable ALL the power
saving", and that killed performance until we disabled those states via
the kernel boot parameters.

It can be worth running powertop to see what states your CPUs are sitting in
whilst running HPL, and also "perf top" to see what the system is up to.
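Something like this, run in a second terminal while HPL is going (a
sketch; both tools generally need root or suitable perf_event settings):

# C-state and frequency residency per core:
$ powertop
# Where the CPU time is actually going; xhpl should dominate, with most
# samples landing inside the BLAS dgemm kernels:
$ perf top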

Good luck!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Michael Di Domenico
2018-02-22 15:44:25 UTC
I can't speak to AMD, but using HPL 2.1 on Intel with the Intel compiler
and the Intel MKL, I can hit 90% without issue, and no major tuning
either.

If you're at 33%, I would be suspect of your math library.
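A quick sanity check is to confirm which BLAS the binary actually
linked against (a sketch; the binary and library names are
illustrative):

# Verify xhpl is picking up the BLAS you built it against:
$ ldd ./xhpl | grep -i -e blas -e acml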
John Hearns via Beowulf
2018-02-22 15:52:54 UTC
Oh, and use the Adaptive computing HPL calculator to get your input file.
Thanks Adaptive guys!
Chris Samuel
2018-02-23 06:58:41 UTC
Post by John Hearns via Beowulf
Oh, and use the Adaptive computing HPL calculator to get your input file.
Thanks Adaptive guys!
I think you mean Advanced Clustering.. :-)

http://www.advancedclustering.com/act_kb/tune-hpl-dat-file/
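For reference, the rule of thumb behind that calculator: the double
precision matrix takes ~8*N^2 bytes, so you pick the problem size N to
fill most of memory while leaving room for the OS, roughly

N ~= sqrt(mem_fraction * total_mem_bytes / 8)

e.g. for (say) a 64 GB node at 85% of memory, N ~= sqrt(0.85 * 64e9 / 8)
~= 82000, usually rounded to a multiple of the block size NB.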

cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Prentice Bisbal
2018-02-22 16:48:31 UTC
Post by Michael Di Domenico
If you're at 33%, I would be suspect of your math library.
I'm using OpenBLAS 0.2.19 with dynamic architecture support, but I'm
thinking of switching to ACML for this test, to remove the possibility
that it's a problem with my OpenBLAS build.

Joe Landman
2018-02-22 16:58:51 UTC
Which compiler are you using, and what options are you compiling it with?
--
Joe Landman
e: ***@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

Prentice Bisbal
2018-02-22 18:56:23 UTC
For OpenBLAS, or HPL?

For HPL, I used GCC 6.1.0 with these flags:

$ egrep -i "flags|defs" Make.gcc-6.1.0_openblas-0.2.19
F2CDEFS      = -DAdd__ -DF77_INTEGER=int -DStringSunStyle
HPL_DEFS     = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
CCNOOPT  = $(HPL_DEFS)
OMP_DEFS = -openmp
CCFLAGS  = $(HPL_DEFS) -march=barcelona -O3 -Wall
LINKFLAGS    = $(CCFLAGS) $(OMP_DEFS)
ARFLAGS      = r

For OpenBLAS:

make DYNAMIC_ARCH=1 CC=gcc FC=gfortran

# This little summary is printed out at end of build:

 OpenBLAS build complete. (BLAS CBLAS LAPACK LAPACKE)

  OS               ... Linux
  Architecture     ... x86_64
  BINARY           ... 64bit
  C compiler       ... GCC  (command line : gcc)
  Fortran compiler ... GFORTRAN  (command line : gfortran)
  Library Name     ... libopenblasp-r0.2.19.a (Multi threaded; Max num-threads is 8)

Prentice
David Mathog
2018-02-22 17:52:20 UTC
Post by Prentice Bisbal
I found literature from AMD stating the theoretical performance of these
processors is 282 GFLOPS, and my LINPACK performance isn't coming close
to that (I get approximately 33% of that).
That does seem low. Check the usual culprits:

1. CPU frequency scaling locked to the lowest setting, or set to a
governor which adjusts frequency and then interacts poorly with the test
software. The rated performance will have been measured with the CPU
locked to its highest frequency. (See the commands sketched after this
list.)

2. something else running, especially something which forces the test
program out of memory or file caches. I wouldn't expect this sort of
test to be IO bound to disk, but if it is, and hugepages are used,
enormous performance drops may be observed when the system decides to
move those around. I wouldn't put it past AMD or Intel to run these
sorts of tests with the test system stripped down to the bones. No
network, no logging, single user, etc. That is, absolutely nothing that
would compete for CPU time. (Just checked on one of our big systems.
ps -ef | wc shows 953 processes: 48 migration, 48 ksoftirqd, 49
stopper, 49 watchdog, 49 kintegrityd, 49 kblockd, 49 ata_sff, 49 md, 49
md_misc, 49 aio, 49 crypto, 49 kthrotld, 49 rpciod, 19 gdm (console
processes, even with no display attached at the moment and nobody logged
in there), 193 events, 12 of my processes, and 107 miscellaneous OS
processes.)

3. ulimit settings; check /etc/security/limits.conf.

4. NUMA issues. Some multithreaded programs allocate a large block of
memory once, which ends up on one side of a NUMA system, and then start
some or all of their threads on the other side. Threads on the wrong
side will run a variable amount slower than those on the right side. If
this is what is going on, locking all threads to the same side of the
system (if it has just two sides) can speed things up a bit, assuming
the test isn't supposed to use all threads.

5. Different compiler/optimization. The vendor may have used a binary
which was tweaked to the Nth degree, perhaps even using profiling from
earlier runs to optimize the final run. If you are using a benchmark
number from AMD, see if you can obtain the exact same version of the
test software that they used (which may be available), so that you can
eliminate this variable. Perhaps wherever they keep that, they also have
a detailed description of the test system?
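A sketch of quick checks for items 1, 2, and 4 above (tool availability
varies by distro; the exact commands are illustrative):

# 1. Frequency: current governor, and whether the clock is at rated speed
$ cpupower frequency-info
$ grep MHz /proc/cpuinfo
# 2. Competing load: anything else eating CPU while the test runs
$ top -b -n 1 | head -20
# 4. NUMA layout: node count, and free memory per node
$ numactl --hardware
$ numastat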

Regards,

David Mathog
***@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
Dmitri Chubarov
2018-02-22 18:18:14 UTC
Hi,

I'm not sure the 282 GFLOPS number is correct.

We have 16 Bulldozer/Interlagos cores at 2.2 GHz. Each pair of cores
forms a CMT module, and the two cores in a module share an FPU with two
128-bit FMAC units.

In terms of double-precision FLOPS, that should make

16 cores * 2.2 GHz * 2 doubles per 128-bit SIMD register * 2 FLOPs per FMA
= 140.8 GFLOPS

It looks like the 282 GFLOPS number is for a 2P node (2 sockets * 140.8
= 281.6 ~= 282 GFLOPS).

Dima
Prentice Bisbal
2018-02-22 18:50:54 UTC
This is my source for those theoretical numbers:

http://dewaele.org/~robbe/thesis/writing/references/49747D_HPC_Processor_Comparison_v3_July2012.pdf

If those numbers are off, that makes my job a bit easier. And it looks
like you're right. In the text above the table, it does mention 2-socket
servers, and then below the table, in fine print, it states:

"For AMD Opteron Processors, theoretical FLOPS = Core Count x Core
Frequency x number of processors per server x 4."

Why can't the table just show single-socket performance? Grrrr....

Regardless of bad marketing and graphic design, I'm still at square one.
My system has 2 sockets, and the best I've been able to do is get ~115
GFLOPS, and that's one of the 'instantaneous' values LINPACK spits out
every few seconds. At the end of the test, the actual GFLOPS result is
more like 77 GFLOPS:

===========================================
T/V                N    NB     P     Q Time                 Gflops
--------------------------------------------------------------------------------
WR00L2L2       82775    40     4     8 4924.71              7.678e+01

This is a two-socket system, so that's only 27% of theoretical max.

Prentice
Prentice Bisbal
2018-02-22 22:27:27 UTC
So I just rebuilt HPL using the ACML 6.1.0 libraries with GCC 6.1.0, and
I'm now getting 197 GFLOPS, so clearly there's a problem with my
OpenBLAS build. I'm going to try building OpenBLAS without the dynamic
arch support on the machine where I plan on running my tests, and see if
that version of the library is any better.

Prentice
Prentice Bisbal
2018-02-22 22:48:36 UTC
Just rebuilt OpenBLAS 0.2.20 locally on the test system with GCC 6.1.0,
and I'm only getting 91 GFLOPS. I'm pretty sure OpenBLAS performance
should be close to ACML performance, if not better. I'll have to dig
into this later. For now, I'm going to continue my testing using the
ACML-based build and revisit the OpenBLAS performance later.

Prentice
Benson Muite
2018-02-22 22:56:12 UTC
Consider trying:
https://github.com/amd/blis
https://github.com/clMathLibraries/clBLAS

as well.
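A minimal sketch of trying the BLIS route (configuration names change
between releases; "auto" lets it detect the core, but verify what it
picks):

$ git clone https://github.com/amd/blis
$ cd blis
$ ./configure auto
$ make -j && make check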
Joe Landman
2018-02-22 23:01:06 UTC
ACML is hand-coded assembly. It's not likely that OpenBLAS will be much
better; it could be similar. Cf.
http://gcdart.blogspot.co.uk/2013/06/fast-matrix-multiply-and-ml.html
--
Joe Landman
e: ***@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

Prentice Bisbal
2018-02-23 00:19:16 UTC
Joe,

Thanks for the link. Based on that, they should be pretty close in
performance, and mine are not, so I must be doing something wrong with
my OpenBLAS build. Since ACML is dead, I was hoping I could use OpenBLAS
moving forward.

Prentice