SC740 SEMINAR REPORT 01 for Frederick L. Jones

PRESENTER: Dr. Marcin Paprzycki

TOPIC: Introduction to High Performance Numerical Linear Algebra - III

OVERVIEW

This talk is a continuation of two earlier talks on high performance numerical linear algebra systems given by Dr. Marcin Paprzycki in Fall 2000. The talk began with the characteristics of shared memory computers and ended with the performance of numerical linear algebra workloads, namely LU and Cholesky factorizations and the computation of eigenvalues and eigenvectors.

SHARED MEMORY SYSTEMS

Dr. Paprzycki started his talk by discussing the characteristics of shared memory computer systems, which have become building blocks for larger machines. For example, he noted that USM had purchased a 4-way INTEL Pentium system (an INTEL Quad processor) that could serve as such a building block.

The challenge with shared memory systems, as Dr. Paprzycki noted, is that their inherent memory hierarchy makes optimization more difficult. However, some software packages have been written to take advantage of shared memory systems. One example of a package written to run on shared memory systems with vector processors is LAPACK.
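To illustrate why memory-hierarchy-aware libraries such as LAPACK matter, the sketch below (my own addition, not part of the talk) compares a naive triple-loop matrix multiplication in Python with NumPy's matmul operator, which dispatches to an optimized, blocked BLAS routine. The matrix size and timings are only illustrative, and part of the gap comes from interpreter overhead, but even in compiled code blocked BLAS kernels far outperform naive loops.

    import time
    import numpy as np

    n = 200
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    # Naive triple loop: no blocking, poor reuse of data in cache.
    def naive_matmul(a, b):
        n = a.shape[0]
        c = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                s = 0.0
                for k in range(n):
                    s += a[i, k] * b[k, j]
                c[i, j] = s
        return c

    t0 = time.time()
    c_naive = naive_matmul(a, b)
    t1 = time.time()
    c_blas = a @ b          # dispatches to an optimized (blocked) BLAS routine
    t2 = time.time()

    print("naive:", t1 - t0, "s;  BLAS:", t2 - t1, "s")
    print("max difference:", np.abs(c_naive - c_blas).max())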

LAPACK

Linear algebra packages such as LAPACK were written around the three-level Basic Linear Algebra Subprograms (BLAS) model. The three levels are: (1) vector-vector operations, (2) matrix-vector operations, and (3) matrix-matrix operations. Shared memory parallelization of LAPACK is based on parallel BLAS kernels.
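As a concrete (if anachronistic) illustration of the three levels, the snippet below uses SciPy's wrappers around the BLAS to call one routine from each: daxpy (Level 1, vector-vector), dgemv (Level 2, matrix-vector), and dgemm (Level 3, matrix-matrix). This is my own sketch; the talk itself concerned the Fortran interfaces.

    import numpy as np
    from scipy.linalg import blas

    n = 4
    alpha = 2.0
    x = np.random.rand(n)
    y = np.random.rand(n)
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    # Level 1 BLAS: vector-vector, z <- alpha*x + y (daxpy)
    z = blas.daxpy(x, y, a=alpha)

    # Level 2 BLAS: matrix-vector, v <- alpha*A*x (dgemv)
    v = blas.dgemv(alpha, a, x)

    # Level 3 BLAS: matrix-matrix, C <- alpha*A*B (dgemm)
    c = blas.dgemm(alpha, a, b)

    print(np.allclose(z, alpha * x + y),
          np.allclose(v, alpha * a @ x),
          np.allclose(c, alpha * a @ b))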

Dr. Paprzycki did note that work at the University of Tennessee by Dr. Jack Dongarra has been done on converting LAPACK to run on distributed-memory concurrent computers. (See "ScaLAPACK: Linear Algebra Software for Distributed Memory Architectures" by James Demmel, Jack Dongarra, Robert van de Geijn, and David Walker, in Parallel Computers: Theory and Practice, edited by Thomas L. Casavant, Pavel Tvrdik, and Frantisek Plasil, 1996, pp. 267-282.)

PERFORMANCE EXAMPLES

The first performance example Dr. Paprzycki presented involved LU and Cholesky factorization. An eight-processor CRAY X-MP supercomputer with a peak performance of 315 Mflops was used for the factorizations. Blocked versions of the SAXPY, GAXPY, and DOT variants were run; however, only the SAXPY and DOT results were reported.
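For reference, the blocked LU and Cholesky factorizations discussed in this example correspond to LAPACK's dgetrf and dpotrf routines. A minimal sketch of calling them through SciPy's wrappers (my own addition, not from the talk) is shown below.

    import numpy as np
    from scipy.linalg import lu_factor, cho_factor

    n = 1000
    a = np.random.rand(n, n)
    spd = a @ a.T + n * np.eye(n)   # symmetric positive definite matrix

    # LU factorization with partial pivoting (LAPACK dgetrf)
    lu, piv = lu_factor(a)

    # Cholesky factorization (LAPACK dpotrf)
    c, lower = cho_factor(spd, lower=True)

    print(lu.shape, c.shape)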

An eightfold speedup was achieved for LAPACK on the eight processors of the CRAY X-MP. The optimal block size varied inversely with the matrix size: for a 200 x 200 matrix the optimal block size was 512 elements, for a 1400 x 1400 matrix it was 320 elements, and for a 2600 x 2600 matrix it was 192 elements.
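To show where the block size enters the algorithm, here is a minimal sketch (my own, in NumPy/SciPy rather than the Fortran discussed in the talk) of a right-looking blocked Cholesky factorization. The parameter nb is the block size; most of the work falls in the Level 3 BLAS trailing-matrix update, which is why tuning nb against the matrix size matters.

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def blocked_cholesky(a, nb=64):
        """Right-looking blocked Cholesky: returns lower-triangular L with A = L L^T."""
        a = np.array(a, dtype=float)
        n = a.shape[0]
        for k in range(0, n, nb):
            e = min(k + nb, n)
            # Factor the small diagonal block (unblocked work).
            a[k:e, k:e] = cholesky(a[k:e, k:e], lower=True)
            if e < n:
                # Panel update: L21 = A21 * L11^{-T}, done as a triangular solve.
                a[e:, k:e] = solve_triangular(a[k:e, k:e], a[e:, k:e].T, lower=True).T
                # Trailing-matrix update (Level 3 BLAS): A22 <- A22 - L21 * L21^T.
                a[e:, e:] -= a[e:, k:e] @ a[e:, k:e].T
        return np.tril(a)

    m = np.random.rand(500, 500)
    spd = m @ m.T + 500 * np.eye(500)
    L = blocked_cholesky(spd, nb=64)
    print(np.allclose(L @ L.T, spd))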

The second performance example involved finding the eigenvalues and eigenvectors of a complex Hermitian matrix. Both a CRAY J-916 supercomputer and an SGI Power Challenge 1000 supercomputer were used. The speedups achieved were not as high as hoped: depending on the options used, they ranged from essentially no speedup (a factor of 1) to a little over 3.
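For completeness, the complex Hermitian eigenproblem in this example corresponds to LAPACK routines such as zheev. A minimal sketch of solving it through SciPy (again my own addition, not part of the talk) follows.

    import numpy as np
    from scipy.linalg import eigh

    n = 300
    m = np.random.rand(n, n) + 1j * np.random.rand(n, n)
    h = (m + m.conj().T) / 2          # make the matrix Hermitian

    # Eigenvalues and eigenvectors via LAPACK's Hermitian drivers.
    w, v = eigh(h)

    # Check the residual: H v_i should equal w_i v_i for each eigenpair.
    print(np.allclose(h @ v, v * w))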

SUMMARY AND CONCLUSIONS

Dr. Marcin Paprzycki’s talk thus began with the characteristics of shared memory computers and ended with the performance of numerical linear algebra workloads, namely LU and Cholesky factorizations and the computation of eigenvalues and eigenvectors.

Dr. Paprzycki noted that LU and Cholesky factorizations parallelize well, but that other dense matrix problems, such as eigenvalue and eigenvector computations or least squares problems, do not parallelize as well. Thus, as Dr. Paprzycki noted, parallelizing the latter problems requires more work, and I agree.