Performance Evaluation of Large Scale Electron Dynamics Simulation under Many-core Cluster based on Knights Landing

Hirokawa, Y.; Boku, T.; Sato, S.; Yabana, K.

doi:10.1145/3149457.3149465

Released

Conference paper

MPG authors

Sato, S.
Theory Group, Theory Department, Max Planck Institute for the Structure and Dynamics of Matter, Max Planck Society;

External resources

https://dx.doi.org/10.1145/3149457.3149465
(publisher version)

Full texts (freely accessible)

p183-hirokawa.pdf
(publisher version), 418 KB

Supplementary material (freely accessible)

No freely accessible supplementary material is available.

Citation

Hirokawa, Y., Boku, T., Sato, S., & Yabana, K. (2018). Performance Evaluation of Large Scale Electron Dynamics Simulation under Many-core Cluster based on Knights Landing. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2018) (pp. 183-191). New York: ACM. doi:10.1145/3149457.3149465.

Cite link: https://hdl.handle.net/21.11116/0000-0002-CCE5-6

Abstract

We have been developing an advanced scientific code called "ARTED" for electron dynamics simulation based on first-principles calculations of materials, and porting it to various large-scale parallel systems including the "K" Computer, which was previously Japan's fastest supercomputer. In this paper, we describe the implementation and performance evaluation of the ARTED code on a stand-alone cluster of Intel's latest many-core processor, Knights Landing (KNL), building on past research on porting the code to the Knights Corner (KNC) accelerator. Our target system is Oakforest-PACS, which is currently the fastest supercomputer in Japan. For performance tuning on KNL, the largest issue is how to utilize multiple levels of parallelism: the instruction level (512-bit SIMD instructions), hardware threads (4 threads/core), and the large number of cores. We focus on the dominant computation part of the code, which is a 25-point 3D stencil computation.
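To make the shape of a 25-point 3D stencil concrete, here is a minimal pure-Python sketch. It assumes the 25 points are the center plus four neighbors on each side along each axis, as in an 8th-order finite-difference Laplacian. The coefficients, the periodic boundaries, the real-valued field, and the names `stencil_offsets`, `laplacian_25pt`, and `plane_wave` are illustrative assumptions, not taken from ARTED, and the loop nest makes no attempt at the SIMD or threading optimizations the paper evaluates.

```python
import math

# Standard 8th-order central-difference weights for a second derivative.
# Illustrative assumption: the abstract does not give ARTED's coefficients.
C = [-205.0 / 72.0, 8.0 / 5.0, -1.0 / 5.0, 8.0 / 315.0, -1.0 / 560.0]


def stencil_offsets():
    """Return the 25 (dx, dy, dz) offsets: center plus 4 neighbors per side per axis."""
    offsets = [(0, 0, 0)]
    for axis in range(3):
        for m in range(1, 5):
            for sign in (1, -1):
                off = [0, 0, 0]
                off[axis] = sign * m
                offsets.append(tuple(off))
    return offsets


def laplacian_25pt(f, n, h=1.0):
    """Apply the 25-point stencil to an n*n*n periodic grid stored as f[x][y][z]."""
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for x in range(n):
        for y in range(n):
            for z in range(n):
                # The center point contributes once per axis.
                acc = 3.0 * C[0] * f[x][y][z]
                for m in range(1, 5):
                    acc += C[m] * (
                        f[(x + m) % n][y][z] + f[(x - m) % n][y][z]
                        + f[x][(y + m) % n][z] + f[x][(y - m) % n][z]
                        + f[x][y][(z + m) % n] + f[x][y][(z - m) % n]
                    )
                out[x][y][z] = acc / (h * h)
    return out


def plane_wave(n):
    """Build f = cos(k*x) with one full period across the grid; return (k, f)."""
    k = 2.0 * math.pi / n
    f = [[[math.cos(k * x) for _ in range(n)] for _ in range(n)] for x in range(n)]
    return k, f
```

For the field returned by `plane_wave`, the exact Laplacian is -k² f, so comparing `laplacian_25pt(f, n)` against that value gives a quick accuracy check. Each output point reads 25 inputs, which is the kind of memory-access pattern that makes such kernels sensitive to vectorization and cache behavior.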

We successfully optimize this part to achieve 758.4 GFLOPS per node, which corresponds to 24.8% of the theoretical peak of an Oakforest-PACS node, which uses an Intel Xeon Phi 7250 (3046 GFLOPS peak). We also show that the sustained performance of KNL is better than that of two KNC accelerator cards. The entire ARTED code includes other computations in each time step and is designed for large-scale parallel execution using MPI, whereas single-node parallelization is achieved using OpenMP. Finally, we evaluate the performance of the entire parallel execution with up to 128 nodes.
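The peak and efficiency figures can be reproduced with a short calculation. The sketch below assumes the publicly listed Xeon Phi 7250 parameters (68 cores, 1.4 GHz base clock, two 512-bit vector units per core, fused multiply-add counted as two floating-point operations); none of these are stated in the abstract, and the function names are hypothetical.

```python
# Assumed Xeon Phi 7250 parameters (from public specifications, not the abstract).
CORES = 68
CLOCK_GHZ = 1.4
VPUS_PER_CORE = 2        # 512-bit vector units per core
DP_LANES = 512 // 64     # double-precision lanes per 512-bit SIMD register
FLOPS_PER_FMA = 2        # one fused multiply-add counts as two operations
THREADS_PER_CORE = 4     # hardware threads per core, as stated in the abstract


def peak_gflops():
    """Theoretical double-precision peak of one node in GFLOPS."""
    return CORES * CLOCK_GHZ * VPUS_PER_CORE * DP_LANES * FLOPS_PER_FMA


def hardware_threads():
    """Total hardware threads available on one node."""
    return CORES * THREADS_PER_CORE


def efficiency(sustained_gflops, peak):
    """Fraction of the theoretical peak that is actually sustained."""
    return sustained_gflops / peak


if __name__ == "__main__":
    print(f"peak: {peak_gflops():.1f} GFLOPS")
    print(f"hardware threads: {hardware_threads()}")
    print(f"efficiency: {efficiency(758.4, 3046.0):.3f}")
```

Under these assumptions the peak works out to about 3046 GFLOPS, and 758.4 / 3046 is roughly 0.249, close to the 24.8% quoted in the abstract; the small difference presumably comes from rounding or truncation.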