Times in seconds on P2SC thin and wide nodes.

Code | Compiler Options | Thin: Fortran | Thin: Fortran -qarch=pwr2 -qhot | Thin: C | Wide: Fortran | Wide: Fortran -qarch=pwr2 -qhot | Wide: C |
---|---|---|---|---|---|---|---|
mma | (none) | 48.14 | 48.35 | 51.17 | 47.28 | 47.47 | 46.58 |
mma | -O3 | 37.41 | .90 | *39.21 | 41.52 | .83 | *40.00 |
mma | -Pk | 13.61 | 16.50 | 18.31 | 12.18 | 14.74 | 16.34 |
mma | -O3 -Pk | 1.71 | 1.61 | 1.61 | 1.59 | 1.55 | 1.56 |
mmr | -O3 | 36.16 | .91 | na | 39.37 | .83 | na |
mmu | -O3 | 9.71 | 9.56 | *9.58 | 11.71 | 11.35 | *11.45 |
mmb | -O3 | 38.38 | 1.34 | 35.48 | 41.65 | 1.23 | 39.50 |
mmp | -O3 | 47.43 | 1.94 | *49.13 | 52.98 | 2.04 | *52.39 |
mmbp | -O3 | 4.65 | 1.78 | 4.21 | 4.76 | 1.86 | 4.36 |
mmup | -O3 | 13.21 | 13.16 | *13.13 | 14.27 | 14.12 | *14.23 |
mmbup | -O3 | 4.03 | 3.96 | 4.41 | 4.29 | 4.33 | 4.68 |
mmbu | -O3 | 9.98 | 9.97 | 10.14 | 11.29 | 11.19 | 11.40 |
mmp | -O3 -Pk | 1.81 | 1.79 | 1.79 | 1.82 | 1.87 | 1.93 |
mmep | -O3 -lessl | .70 | .70 | .69 | .61 | .66 | .61 |
mme | -O3 -lessl | .69 | .70 | .69 | .62 | .64 | .63 |
mmep | -O3 -lesslp2 | .62 | .62 | .63 | .58 | .57 | .54 |
mme | -O3 -lesslp2 | .64 | .65 | .63 | .58 | .58 | .57 |
* These results were obtained using -qstrict option with -O3.
All results were obtained using xlf and cc under AIX 4.2.1.
Runs were performed on May 11-12, 1998.
THEORETICAL MINIMUM: .558 seconds (thin node), .496 seconds (wide node)
The "theoretical minimum" time is computed under the assumptions
that
(a) all memory operations (loads, stores, cache misses) take no time
and
(b) two FMAs (fused multiply-adds) can be executed in each and every
cycle. To calculate a single element in the matrix product of two NxN
matrices, roughly N FMAs must be executed. There are N^2 such elements
in the product, giving an overall total of N^3 FMAs. When N=512, this
works out to 1.34x10^8 FMAs. Since the SP's P2SC nodes have a clock
speed of 120 MHz for thin nodes and 135 MHz for wide nodes, one can
ideally do 2.40x10^8 FMAs per second on thin nodes and 2.70x10^8 FMAs
per second on wide nodes. Dividing the FMA count by each of these
rates gives theoretical minimum times of .558 seconds for thin nodes
and .496 seconds for wide nodes. This is of course never realized in
practice: it is very tricky to prefetch the data into cache in such a
way that the FMA pipelines are kept fully loaded.
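
The arithmetic behind the theoretical minimum can be checked with a short script. This is only a sketch of the calculation described above, written in Python for convenience; the function names (`fma_count`, `count_fmas_naive`, `theoretical_min_seconds`, `fraction_of_peak`) are hypothetical and do not come from the benchmark codes.

```python
def fma_count(n):
    """FMAs needed for an n x n matrix product: n per element, n*n elements."""
    return n ** 3


def count_fmas_naive(n):
    """Run the textbook triple loop on small matrices and count the FMAs."""
    a = [[1.0] * n for _ in range(n)]
    b = [[1.0] * n for _ in range(n)]
    c = [[0.0] * n for _ in range(n)]
    count = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]  # one multiply-add, counted as one FMA
                count += 1
    return count


def theoretical_min_seconds(n, clock_hz, fmas_per_cycle=2):
    """Ideal run time: memory is free and every cycle retires fmas_per_cycle FMAs."""
    return fma_count(n) / (clock_hz * fmas_per_cycle)


def fraction_of_peak(measured_seconds, n, clock_hz):
    """Ratio of the ideal time to a measured time (1.0 would be peak)."""
    return theoretical_min_seconds(n, clock_hz) / measured_seconds


if __name__ == "__main__":
    N = 512  # problem size quoted in the text
    for label, clock_hz in (("thin", 120e6), ("wide", 135e6)):
        print(f"{label} node: {theoretical_min_seconds(N, clock_hz):.3f} s")
```

With the exact count 512^3 = 134,217,728, the script gives about .559 seconds for thin nodes and .497 seconds for wide nodes. The figures .558 and .496 quoted above follow from using the rounded value 1.34x10^8. As a usage example, `fraction_of_peak(0.62, 512, 120e6)` puts the .62 second thin-node Fortran time for mmep with -lesslp2 at roughly 90% of the ideal rate.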