Keywords:
Turbulence; Mixing; High Schmidt number; Compact finite differences; Nested OpenMP; Blue Waters
Abstract:
A new dual-communicator algorithm with very favorable performance characteristics has been developed for direct numerical simulation (DNS) of turbulent mixing of a passive scalar governed by an advection-diffusion equation. We focus on the regime of high Schmidt number (Sc), where, because of low molecular diffusivity, the grid-resolution requirements for the scalar field are stricter than those for the velocity field by a factor of √Sc. Computational throughput is improved by simulating the velocity field on a coarse grid of N_v^3 points with a Fourier pseudo-spectral (FPS) method, while the passive scalar is simulated on a fine grid of N_θ^3 points with a combined compact finite difference (CCD) scheme which computes first and second derivatives at eighth-order accuracy. A static three-dimensional domain decomposition and a parallel solution algorithm for the CCD scheme are used to avoid the heavy communication cost of memory transposes. A kernel is used to evaluate several approaches to optimizing the performance of the CCD routines, which account for 60% of the overall simulation cost. On the petascale supercomputer Blue Waters at the University of Illinois at Urbana-Champaign, scalability is improved substantially with a hybrid MPI-OpenMP approach in which a dedicated thread per NUMA domain overlaps communication calls with computational tasks performed by a separate team of threads spawned using OpenMP nested parallelism. At a target production problem size of 8192^3 (0.5 trillion) grid points on 262,144 cores, CCD timings are reduced by 34% compared to a pure-MPI implementation. Timings for 16384^3 (4 trillion) grid points on 524,288 cores encouragingly maintain scalability greater than 90%, although the wall-clock time is too high for production runs at this size. Performance monitoring with CrayPat for problem sizes up to 4096^3 shows that the CCD routines can achieve nearly 6% of the peak flop rate. The new DNS code is built upon two existing FPS and CCD codes. With the grid ratio N_θ/N_v = 8, the disparity in the computational requirements for the velocity and scalar problems is addressed by splitting the global communicator MPI_COMM_WORLD into disjoint communicators for the velocity and scalar fields. Inter-communicator transfer of the velocity field from the velocity communicator to the scalar communicator is handled with discrete send and non-blocking receive calls, which are overlapped with other operations on the scalar communicator. For production simulations at N_θ = 8192 and N_v = 1024 on 262,144 cores for the scalar field, the DNS code achieves 94% strong scaling relative to 65,536 cores and 92% weak scaling relative to N_θ = 1024 and N_v = 128 on 512 cores.
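
Note added for context (not part of the abstract): since the stated resolution penalty scales as √Sc, a scalar grid finer than the velocity grid by a factor of N_θ/N_v = 8 nominally accommodates Schmidt numbers up to 8^2 = 64 times larger than the velocity grid alone would support.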
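
The following is a minimal illustrative sketch in C (not the authors' code) of the dual-communicator construction described above: MPI_COMM_WORLD is split with MPI_Comm_split into disjoint velocity and scalar communicators. The 1:8 rank split and the color assignment are assumptions made only for this example; the real group sizes follow the velocity and scalar problem sizes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, world_size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Illustrative 1:8 split between velocity and scalar groups; the
     * production partitioning is not reproduced here. */
    int nv_ranks = (world_size + 8) / 9;
    int color    = (world_rank < nv_ranks) ? 0 : 1;   /* 0: velocity, 1: scalar */

    MPI_Comm sub_comm;                                /* disjoint sub-communicator */
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);

    int sub_rank, sub_size;
    MPI_Comm_rank(sub_comm, &sub_rank);
    MPI_Comm_size(sub_comm, &sub_size);
    printf("world rank %d -> %s communicator, rank %d of %d\n",
           world_rank, color == 0 ? "velocity" : "scalar", sub_rank, sub_size);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}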
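
Below is a minimal sketch of the overlapped transfer of the velocity field from the velocity group to the scalar group, assuming the point-to-point messages between the two groups travel over MPI_COMM_WORLD and that a non-blocking receive is posted on the scalar side before other scalar-communicator work proceeds. The two-rank setup, message size, tag, and placeholder work loop are illustrative assumptions, not the production communication pattern.

#include <mpi.h>
#include <stdio.h>

#define NU    4096          /* illustrative size of one velocity-field chunk */
#define TAG_U 100           /* illustrative message tag                       */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) MPI_Abort(MPI_COMM_WORLD, 1);   /* needs a sender and a receiver */

    static double u_chunk[NU];

    if (rank == 0) {                     /* stands in for a velocity-communicator rank */
        for (int i = 0; i < NU; i++) u_chunk[i] = (double)i;
        MPI_Send(u_chunk, NU, MPI_DOUBLE, 1, TAG_U, MPI_COMM_WORLD);
    } else if (rank == 1) {              /* stands in for a scalar-communicator rank   */
        MPI_Request req;

        /* Post the non-blocking receive for the velocity data first. */
        MPI_Irecv(u_chunk, NU, MPI_DOUBLE, 0, TAG_U, MPI_COMM_WORLD, &req);

        /* Overlap with scalar-side operations that do not need the velocity
         * field (trivial placeholder work here). */
        double s = 0.0;
        for (int i = 0; i < 1000000; i++) s += 1e-6;

        /* Block only when the velocity field is actually required. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("scalar rank received %d velocity values (s = %.1f)\n", NU, s);
    }

    MPI_Finalize();
    return 0;
}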
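
Finally, a minimal sketch of the hybrid MPI-OpenMP overlap strategy mentioned in the abstract: one thread of a two-thread outer team (here the master thread, so MPI_THREAD_FUNNELED suffices) carries out a non-blocking exchange while a nested inner team advances interior work that does not depend on the incoming data. The pairwise neighbor choice, buffer sizes, inner thread count, and the stand-in compute loop are assumptions made for illustration; the production code's dedicated thread per NUMA domain and its CCD kernels are not reproduced here.

#include <mpi.h>
#include <omp.h>

/* Overlap a non-blocking halo exchange with interior computation using a
 * two-thread outer OpenMP team and a nested inner team for the compute work. */
void exchange_and_compute(double *halo_send, double *halo_recv, int halo_n,
                          double *u, long n, int nbr, int inner_threads,
                          MPI_Comm comm)
{
    omp_set_max_active_levels(2);              /* enable nested parallelism */

    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            /* Master thread acts as the dedicated communication thread. */
            MPI_Request req[2];
            MPI_Irecv(halo_recv, halo_n, MPI_DOUBLE, nbr, 0, comm, &req[0]);
            MPI_Isend(halo_send, halo_n, MPI_DOUBLE, nbr, 0, comm, &req[1]);
            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        } else {
            /* The other outer thread spawns a nested team for interior work
             * that does not depend on the incoming halo data. */
            #pragma omp parallel for num_threads(inner_threads)
            for (long i = 0; i < n; i++)
                u[i] = 2.0 * u[i];             /* stand-in for CCD stencil work */
        }
    }   /* implicit barrier: communication and interior work both complete */
}

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { HALO = 1024, N = 1 << 20 };
    static double send_buf[HALO], recv_buf[HALO], u[N];

    int nbr = rank ^ 1;                        /* pairwise partner (illustrative) */
    if (nbr >= size) nbr = rank;               /* odd rank count: exchange with self */

    exchange_and_compute(send_buf, recv_buf, HALO, u, N, nbr, 4, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}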