
Sparse Matrices

 

The transformational approach and the MathBus will be used in a restructuring compiler that will generate highly optimized parallel code for sparse matrix problems. Sparse matrix computations are ubiquitous in computational science since they arise whenever differential equations are solved using numerical techniques such as the finite element and finite difference methods. It is therefore ironic that almost all of the work to date on automatic program restructuring and parallelization, such as the HPF effort, has focused on dense matrix problems. The main reason for this is that restructuring compilers cannot analyze sparse matrix programs. Sparse matrices are stored in special formats, such as compressed row or column storage (CRS, CCS), and these formats complicate dependence analysis. Consider matrix-vector multiplication, which is a key step in conjugate gradient solvers, as well as in other iterative matrix algorithms [44]. If A is a sparse matrix in CCS form, the following code computes Ax, accumulating the result into vector v (assumed to be initialized to zero).

for i := 1 to N do
    for j := A.column-pointer(i) to A.column-pointer(i+1) - 1 do
        v(A.row-index(j)) := v(A.row-index(j)) + A.val(j) * x(i)
    od
od
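
For concreteness, the same computation can be written in C roughly as follows. This is a minimal sketch assuming a hypothetical ccs_matrix record with 0-based indexing; it is not the data structure actually used by the project.

#include <stddef.h>

/* A hypothetical CCS representation with 0-based indexing; the names are
   illustrative, not the project's actual data structures. */
typedef struct {
    size_t  n;           /* number of columns                          */
    size_t *column_ptr;  /* n+1 entries; column i occupies positions   */
                         /* column_ptr[i] .. column_ptr[i+1]-1         */
    size_t *row_index;   /* row index of each stored nonzero           */
    double *val;         /* value of each stored nonzero               */
} ccs_matrix;

/* v := A*x, assuming the caller has zeroed v beforehand. */
void ccs_matvec(const ccs_matrix *A, const double *x, double *v)
{
    for (size_t i = 0; i < A->n; i++)                      /* over columns */
        for (size_t j = A->column_ptr[i]; j < A->column_ptr[i + 1]; j++)
            v[A->row_index[j]] += A->val[j] * x[i];        /* scatter add  */
}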

This small code fragment illustrates two key points regarding sparse matrix programs and their parallelization using automatic code restructuring tools. First, there is substantial parallelism in sparse matrix problems. A sparse matrix program performs a subset of the operations performed by the corresponding dense program; if two operations can be done in parallel in the dense program, they can be done in parallel in the sparse program. In the matrix-vector product, this means that different rows of the matrix can be multiplied with the vector in parallel; for each row, the partial products can be accumulated in parallel. In fact, in algorithms like the multi-frontal algorithm for sparse Cholesky factorization, the zero entries in sparse matrices can be exploited to extract more parallelism than is present in the corresponding dense program [72].
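
The row-level parallelism is easiest to see in the CRS form, where each row's contribution to the result is independent of the others. The following C sketch is only an illustration, not output of the compiler described here: the crs_matrix record is hypothetical, and the OpenMP pragma merely stands in for the parallel loop a restructuring compiler might emit.

#include <stddef.h>

/* A hypothetical CRS representation with 0-based indexing (illustrative names). */
typedef struct {
    size_t  n;          /* number of rows                                      */
    size_t *row_ptr;    /* n+1 entries; row i spans row_ptr[i] .. row_ptr[i+1]-1 */
    size_t *col_index;  /* column index of each stored nonzero                 */
    double *val;        /* value of each stored nonzero                        */
} crs_matrix;

/* v := A*x.  Each row's inner product is independent of the others, so the
   outer loop can be distributed across processors. */
void crs_matvec_parallel(const crs_matrix *A, const double *x, double *v)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)A->n; i++) {
        double sum = 0.0;                        /* per-row accumulation */
        for (size_t j = A->row_ptr[i]; j < A->row_ptr[i + 1]; j++)
            sum += A->val[j] * x[A->col_index[j]];
        v[i] = sum;
    }
}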

Second, the parallelism in sparse matrix programs is obscured by the storage scheme for sparse matrices. In particular, the use of indirection arrays, such as row-index in CCS, can foil the ability of a restructuring compiler to generate parallel code. These compilers perform dependence analysis to determine a partial order on loop nest iterations, but the analysis techniques work only when array subscripts are linear functions of loop index variables, allowing standard tests such as the GCD and Banerjee tests to be used [126,91,18]. When complicated subscripts such as those involving indirection arrays are used, dependence analysis fails. With conventional compiler technology, the matrix-vector product code shown above is doomed to run sequentially.
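
The contrast between analyzable and unanalyzable subscripts can be made concrete with two small C loops; these are illustrative examples, not code taken from the project.

/* Affine subscripts: a[2*i] and a[2*i+1] never refer to the same element,
   and tests such as the GCD test can prove it, so the iterations are
   independent and the loop can be parallelized. */
void affine_example(double *a, int n)
{
    for (int i = 0; i < n; i++)
        a[2 * i] = a[2 * i + 1] + 1.0;
}

/* Indirect subscripts: whether two iterations touch the same element of v
   depends on the run-time contents of idx, which is exactly the situation
   created by row-index in the CCS matrix-vector product.  A compiler that
   sees only this code must conservatively assume a dependence and keep the
   loop sequential. */
void indirect_example(double *v, const double *val, const int *idx, int n)
{
    for (int j = 0; j < n; j++)
        v[idx[j]] += val[j];
}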

The Bernoulli project at Cornell is addressing this problem through the use of program transformations to generate sparse matrix codes from dense matrix programs. Intuitively, this process can be viewed as a form of partial evaluation in which the compiler exploits information about zero and nonzero entries in matrices to eliminate unnecessary computations, and then chooses an appropriate representation for the sparse matrices. Since the input to the compiler is a dense matrix program, the ability to analyze and restructure programs, for example by using the Lambda loop transformation tool-kit developed at Cornell [65], is retained. However, the transformation to sparse code requires information about the sparsity structure of arrays. To get this information, we are integrating our compiler with a discretization system being written by Rich Zippel. This system uses Paul Chew's mesh generator, and incorporates symbolic and numerical routines for implementing the weighted residual method of discretizing differential equations. The discretization system has complete knowledge of the sparsity structure of matrices since this is fully determined by the meshing and discretization process. This information will be passed over the MathBus to the restructuring compiler to generate parallel code. Any bottlenecks detected by the restructuring compiler can be passed back to the synthesizer so that a revised mesh or different set of basis functions can be used for better parallel performance. The two-way flow of information between the discretization system and the restructuring compiler is an instance of the synergy that results from an integrated system, a synergy that is absent in current approaches. Distribution of sparse matrices across the nodes of the parallel machine is done using standard methods like recursive spectral bisection [92] and geometric partitioning [79].
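
The partial-evaluation intuition can be sketched as follows: the programmer writes the dense loop nest, and a compiler that knows which entries may be nonzero drops the zero terms and introduces a compressed representation. The function names and the CRS-like layout in this C sketch are illustrative only; it is not the Bernoulli compiler's actual output.

#include <stddef.h>

/* What the programmer writes: a dense matrix-vector product.
   For a sparse A, most of the products below are multiplications by zero. */
void dense_matvec(int n, const double A[n][n], const double *x, double *v)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            v[i] += A[i][j] * x[j];
}

/* What specialization with respect to the sparsity structure might yield:
   only the stored nonzeros of each row are visited, through a compressed
   (CRS-like) representation chosen during the transformation. */
void specialized_matvec(int n, const size_t *row_ptr, const size_t *col_index,
                        const double *val, const double *x, double *v)
{
    for (int i = 0; i < n; i++)
        for (size_t j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            v[i] += val[j] * x[col_index[j]];
}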

To summarize, the compiler generates efficient parallel sparse code by partially evaluating the dense program with respect to the sparsity structure of the matrices, which is obtained over the MathBus from the discretization system.


