Concurrent Number Cruncher : An Efficient Sparse Linear Solver on the GPU

in: High Performance Computation Conference - HPCC'07, University of Houston, pages 358-371, Springer Berlin / Heidelberg

Abstract

A wide class of geometry processing and PDE resolution methods needs to solve a linear system, where the non-zero pattern of the matrix is dictated by the connectivity matrix of the mesh. The advent of GPUs with their ever-growing amount of parallel horsepower makes them a tempting resource for such numerical computations. This can be helped by new APIs (CTM from ATI and CUDA from NVIDIA) which give a direct access to the multithreaded computational resources and associated memory bandwidth of GPUs; CUDA even provides a BLAS implementation but only for dense matrices (CuBLAS). However, existing GPU linear solvers are restricted to specific types of matrices, or use non-optimal compressed row storage strategies. By combining recent GPU programming techniques with supercomputing strategies (namely block compressed row storage and register blocking), we implement a sparse generalpurpose linear solver which outperforms leading-edge CPU counterparts (MKL / ACML).

Download / Links

BibTeX Reference

@inproceedings{buatois:inria-00186833,
 abstract = {A wide class of geometry processing and PDE resolution methods needs to solve a linear system, where the non-zero pattern of the matrix is dictated by the connectivity matrix of the mesh. The advent of GPUs with their ever-growing amount of parallel horsepower makes them a tempting resource for such numerical computations. This can be helped by new APIs (CTM from ATI and CUDA from NVIDIA) which give a direct access to the multithreaded computational resources and associated memory bandwidth of GPUs; CUDA even provides a BLAS implementation but only for dense matrices (CuBLAS). However, existing GPU linear solvers are restricted to specific types of matrices, or use non-optimal compressed row storage strategies. By combining recent GPU programming techniques with supercomputing strategies (namely block compressed row storage and register blocking), we implement a sparse generalpurpose linear solver which outperforms leading-edge CPU counterparts (MKL / ACML).},
 address = {Houston, United States},
 author = {Buatois, Luc and Caumon, Guillaume and L{\'e}vy, Bruno},
 booktitle = {{High Performance Computation Conference - HPCC'07}},
 doi = {10.1007/978-3-540-75444-2\_37},
 editor = {Ronald Perrott and Barbara M. Chapman and Jaspal Subhlok and Rodrigo Fernandes de Mello and Laurence T. Yang},
 hal_id = {inria-00186833},
 hal_version = {v1},
 month = {September},
 note = {The original publication is available at www.springerlink.com},
 organization = {{University of Houston}},
 pages = {358-371},
 pdf = {https://inria.hal.science/inria-00186833/file/HPCC_number_cruncher.pdf},
 publisher = {{Springer Berlin / Heidelberg}},
 series = {Lecture Notes in Computer Science},
 title = {{Concurrent Number Cruncher : An Efficient Sparse Linear Solver on the GPU}},
 url = {https://inria.hal.science/inria-00186833},
 volume = {4782},
 year = {2007}
}