Timing Results for Matrix Multiply Examples

Timing Results for Lab Exercise for
Single Processor Performance Considerations

Code	Compiler Options	P2SC thin node, time (seconds)			P2SC wide node, time (seconds)
		Fortran	Fortran -qarch=pwr2 -qhot	C	Fortran	Fortran -qarch=pwr2 -qhot	C
mma	(none)	48.14	48.35	51.17	47.28	47.47	46.58
mma	-O3	37.41	.90	*39.21	41.52	.83	*40.00
mma	-Pk	13.61	16.50	18.31	12.18	14.74	16.34
mma	-O3 -Pk	1.71	1.61	1.61	1.59	1.55	1.56
mmr	-O3	36.16	.91	na	39.37	.83	na
mmu	-O3	9.71	9.56	*9.58	11.71	11.35	*11.45
mmb	-O3	38.38	1.34	35.48	41.65	1.23	39.50
mmp	-O3	47.43	1.94	*49.13	52.98	2.04	*52.39
mmbp	-O3	4.65	1.78	4.21	4.76	1.86	4.36
mmup	-O3	13.21	13.16	*13.13	14.27	14.12	*14.23
mmbup	-O3	4.03	3.96	4.41	4.29	4.33	4.68
mmbu	-O3	9.98	9.97	10.14	11.29	11.19	11.40
mmp	-O3 -Pk	1.81	1.79	1.79	1.82	1.87	1.93
mmep	-O3 -lessl	.70	.70	.69	.61	.66	.61
mme	-O3 -lessl	.69	.70	.69	.62	.64	.63
mmep	-O3 -lesslp2	.62	.62	.63	.58	.57	.54
mme	-O3 -lesslp2	.64	.65	.63	.58	.58	.57

* These results were obtained using the -qstrict option together with -O3.

All results were obtained using xlf and cc under AIX 4.2.1.
Runs were performed on May 11-12, 1998.

THEORETICAL MINIMUM: .558 seconds on a thin node, .496 seconds on a wide node

The "theoretical minimum" time is computed under the assumptions that
(a) all memory operations (loads, stores, cache misses) take no time and
(b) two FMAs (fused multiply-adds) can be executed in each and every cycle.

To calculate a single element in the matrix product of two NxN matrices, roughly N FMAs must be executed. Since there are N^2 such elements in the product, the overall total is N^3 FMAs. When N=512, this works out to 1.34x10^8 FMAs. The SP's P2SC nodes have a clock speed of 120 MHz for thin nodes and 135 MHz for wide nodes, so at two FMAs per cycle one can ideally do 2.40x10^8 FMAs per second on thin nodes and 2.70x10^8 FMAs per second on wide nodes. Dividing the FMA count by each of these rates gives our theoretical minimum times of .558 seconds for thin nodes and .496 seconds for wide nodes. This is of course never realized in practice: it is very tricky to prefetch the data into cache in such a way that the FMA pipelines are kept fully loaded.
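The arithmetic behind the theoretical minimum can be sketched in a few lines of C. This is only an illustration of the calculation described above, not part of the lab codes: the function names fma_count, min_seconds, and fraction_of_peak are hypothetical, and the model assumes exactly N^3 FMAs, zero-cost memory operations, and a fixed number of FMAs per cycle. Because it uses the exact count 512^3 = 134,217,728 rather than the rounded 1.34x10^8, it yields about .559 and .497 seconds, marginally above the .558 and .496 quoted.

```c
#include <assert.h>
#include <stdio.h>

/* FMAs in a straightforward N x N matrix product:
   N FMAs per output element, N*N output elements. */
double fma_count(int n)
{
    return (double)n * (double)n * (double)n;
}

/* Idealized lower bound on run time in seconds, assuming memory
   operations are free and fmas_per_cycle FMAs retire every cycle.
   (Hypothetical helper; the parameters are assumptions of the model.) */
double min_seconds(int n, double clock_hz, double fmas_per_cycle)
{
    assert(clock_hz > 0.0 && fmas_per_cycle > 0.0);
    return fma_count(n) / (clock_hz * fmas_per_cycle);
}

/* Fraction of the idealized peak reached by a measured time. */
double fraction_of_peak(double min_s, double measured_s)
{
    assert(measured_s > 0.0);
    return min_s / measured_s;
}

/* Print the bound for one node type at two FMAs per cycle. */
void print_bound(const char *label, int n, double clock_hz)
{
    printf("%s: %.3f s\n", label, min_seconds(n, clock_hz, 2.0));
}
```

For example, print_bound("thin", 512, 120e6) and print_bound("wide", 512, 135e6) report the two bounds, and fraction_of_peak(0.558, 0.62) gives 0.90, which says the .62-second thin-node time in the table is running at roughly 90% of the idealized rate.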

Last modified: December 1998
