English
 
Help Privacy Policy Disclaimer
  Advanced SearchBrowse

Item

ITEM ACTIONSEXPORT

Released

Conference Paper

Cache Oblivious Parallelograms in Iterative Stencil Computations

MPS-Authors
/persons/resource/persons45566

Strzodka,  Robert
Computer Graphics, MPI for Informatics, Max Planck Society;
Graphics - Optics - Vision, MPI for Informatics, Max Planck Society;

/persons/resource/persons45463

Shaheen,  Mohammed
Computer Graphics, MPI for Informatics, Max Planck Society;
International Max Planck Research School, MPI for Informatics, Max Planck Society;

/persons/resource/persons45154

Pajak,  Dawid
Computer Graphics, MPI for Informatics, Max Planck Society;

/persons/resource/persons45449

Seidel,  Hans-Peter
Computer Graphics, MPI for Informatics, Max Planck Society;

External Resource
No external resources are shared
Fulltext (public)
There are no public fulltexts stored in PuRe
Supplementary Material (public)
There is no public supplementary material available
Citation

Strzodka, R., Shaheen, M., Pajak, D., & Seidel, H.-P. (2010). Cache Oblivious Parallelograms in Iterative Stencil Computations. In ICS'10 (pp. 49-59). New York, NY: ACM. doi:10.1145/1810085.1810096.


Cite as: http://hdl.handle.net/11858/00-001M-0000-000F-1742-0
Abstract
We present a new cache oblivious scheme for iterative stencil computations that performs beyond system bandwidth limitations as though gigabytes of data could reside in an enormous on-chip cache. We compare execution times for 2D and 3D spatial domains with up to 128 million double precision elements for constant and variable stencils against hand-optimized naive code and the automatic polyhedral parallelizer and locality optimizer PluTo and demonstrate the clear superiority of our results. The performance benefits stem from a tiling structure that caters for data locality, parallelism and vectorization simultaneously. Rather than tiling the iteration space from inside, we take an exterior approach with a predefined hierarchy, simple regular parallelogram tiles and a locality preserving parallelization. These advantages come at the cost of an irregular work-load distribution but a tightly integrated load-balancer ensures a high utilization of all resources.