Cache Oblivious Parallelograms in Iterative Stencil Computations

Strzodka, Robert; Shaheen, Mohammed; Pajak, Dawid; Seidel, Hans-Peter

doi:10.1145/1810085.1810096

Released

Conference Paper

MPS-Authors

Strzodka, Robert
Computer Graphics, MPI for Informatics, Max Planck Society
Graphics - Optics - Vision, MPI for Informatics, Max Planck Society

Shaheen, Mohammed
Computer Graphics, MPI for Informatics, Max Planck Society
International Max Planck Research School, MPI for Informatics, Max Planck Society

Pajak, Dawid
Computer Graphics, MPI for Informatics, Max Planck Society

Seidel, Hans-Peter
Computer Graphics, MPI for Informatics, Max Planck Society

Citation

Strzodka, R., Shaheen, M., Pajak, D., & Seidel, H.-P. (2010). Cache Oblivious Parallelograms in Iterative Stencil Computations. In ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing (pp. 49-59). New York, NY: ACM. doi:10.1145/1810085.1810096.

Cite as: https://hdl.handle.net/11858/00-001M-0000-000F-1742-0

Abstract

We present a new cache oblivious scheme for iterative stencil computations that
performs beyond system bandwidth limitations as though gigabytes of data could
reside in an enormous on-chip cache. We compare execution times for 2D and 3D
spatial domains with up to 128 million double-precision elements for constant
and variable stencils against hand-optimized naive code and the automatic
polyhedral parallelizer and locality optimizer PluTo, and demonstrate the clear
superiority of our results. The performance benefits stem from a tiling
structure that caters for data locality, parallelism and vectorization
simultaneously. Rather than tiling the iteration space from inside, we take an
exterior approach with a predefined hierarchy, simple regular parallelogram
tiles and a locality-preserving parallelization. These advantages come at the
cost of an irregular workload distribution, but a tightly integrated
load balancer ensures a high utilization of all resources.
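
To make the idea of parallelogram tiles in a stencil iteration space more concrete, here is a minimal sketch for a 1D three-point averaging stencil. It is not the scheme from the paper: it uses fixed-size tiles instead of a predefined hierarchy, runs serially, and has no load balancer. The function names, the stencil weights, and the parameters tile_w and tile_h are all assumptions chosen for illustration. The sketch only shows that skewing the space coordinate by the time step (s = x + t) yields parallelogram tiles that can be processed one after another while still producing the same result as a plain sweep per time step.

```python
def step_naive(u, steps):
    """Reference: sweep the whole 1D grid once per time step."""
    cur, nxt = list(u), list(u)
    for _ in range(steps):
        for x in range(1, len(u) - 1):
            nxt[x] = (cur[x - 1] + cur[x] + cur[x + 1]) / 3.0
        cur, nxt = nxt, cur
    return cur


def step_parallelogram(u, steps, tile_w=8, tile_h=4):
    """Same stencil, traversed in parallelogram tiles of the (t, x) space.

    Hypothetical parameters: tile_w is the tile width in the skewed
    coordinate s = x + t, tile_h is the number of time steps per tile.
    """
    n = len(u)
    # buf[t % 2] holds time level t; the two boundary cells are never written.
    buf = [list(u), list(u)]
    for t0 in range(1, steps + 1, tile_h):           # bands of time steps
        t1 = min(t0 + tile_h, steps + 1)
        s_min = 1 + t0                               # smallest s in this band
        s_max = (n - 2) + (t1 - 1)                   # largest s in this band
        for s0 in range(s_min, s_max + 1, tile_w):   # tiles, left to right
            for t in range(t0, t1):                  # time advances inside a tile
                src, dst = buf[(t - 1) % 2], buf[t % 2]
                # Undo the skew: s in [s0, s0 + tile_w) means x = s - t,
                # clipped to the interior of the grid.
                lo = max(1, s0 - t)
                hi = min(n - 2, s0 + tile_w - 1 - t)
                for x in range(lo, hi + 1):
                    dst[x] = (src[x - 1] + src[x] + src[x + 1]) / 3.0
    return buf[steps % 2]
```

The traversal order is valid because a point (t, x) reads the points (t-1, x-1), (t-1, x) and (t-1, x+1), whose skewed coordinates are s-2, s-1 and s. All three lie in the same tile or in a tile further left, at an earlier time step, so they have already been computed when the tile is processed. Each tile touches only about tile_w + tile_h neighboring cells for tile_h time steps, which is the source of the data reuse.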