Performance Optimization and Evaluation of Scalable Optoelectronics Application 
on Large Scale KNL Cluster

Hirokawa, Y.; Boku, T.; Uemoto, M.; Sato, S.; Yaban, K.

doi:10.1007/978-3-319-92040-5_11

Released

Conference Paper

MPG Authors

Sato, S.
Theory Group, Theory Department, Max Planck Institute for the Structure and Dynamics of Matter, Max Planck Society;

External Resources

https://dx.doi.org/10.1007/978-3-319-92040-5_11
(publisher version)

Full Texts (freely accessible)

No freely accessible full texts are available in PuRe.

Supplementary Material (freely accessible)

No freely accessible supplementary materials are available.

Citation

Hirokawa, Y., Boku, T., Uemoto, M., Sato, S., & Yaban, K. (2018). Performance Optimization and Evaluation of Scalable Optoelectronics Application on Large Scale KNL Cluster. In R. Yokota, M. Weiland, J. Shalf, & S. Alam (Eds.), High Performance Computing. Basel, Switzerland: Springer International Publishing. doi:10.1007/978-3-319-92040-5_11.

Citation link: https://hdl.handle.net/21.11116/0000-0002-8520-3

Abstract

“ARTED” is an advanced scientific code for electron dynamics simulation. It has been ported to various large-scale parallel systems, including the “K” Computer, formerly the fastest supercomputer in the world, and many other MPP and cluster systems.

In this paper, we describe ARTED’s code optimization and performance evaluation on a large-scale cluster with Intel’s latest many-core processor, KNL (Knights Landing), building on past research on porting ARTED to the KNC (Knights Corner) coprocessor. The dominant computation has been thoroughly optimized on KNL to achieve the highest performance, through detailed tuning of memory access, vectorization for the AVX-512 instruction set, cache utilization, etc. For further tuning, we investigated various KNL-dedicated techniques such as combining MCDRAM/DDR4 memories and parallel vector summation.
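
To make the kind of kernel discussed here concrete, the following is a minimal sketch of a 3-D stencil computation in plain Python. It is not ARTED's kernel: the 7-point second-order Laplacian, the periodic boundaries, the flat row-major layout, and the names `laplacian_3d`, `idx`, and `h` are all assumptions chosen for illustration only.

```python
# Illustrative 7-point 3-D stencil with periodic boundaries.
# Assumption: this is a simplified stand-in, not the ARTED Hamiltonian kernel.

def laplacian_3d(field, nx, ny, nz, h=1.0):
    """Apply a second-order Laplacian to a flat, row-major nx*ny*nz grid."""

    def idx(i, j, k):
        # Periodic wrap-around in every direction; k is the fastest index.
        return ((i % nx) * ny + (j % ny)) * nz + (k % nz)

    out = [0.0] * (nx * ny * nz)
    inv_h2 = 1.0 / (h * h)
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):  # innermost loop walks contiguous memory
                centre = field[idx(i, j, k)]
                neighbours = (
                    field[idx(i - 1, j, k)] + field[idx(i + 1, j, k)]
                    + field[idx(i, j - 1, k)] + field[idx(i, j + 1, k)]
                    + field[idx(i, j, k - 1)] + field[idx(i, j, k + 1)]
                )
                out[idx(i, j, k)] = (neighbours - 6.0 * centre) * inv_h2
    return out


# Usage: a single unit spike on a 4x4x4 grid spreads to its six neighbours.
n = 4
spike = [0.0] * (n * n * n)
spike[0] = 1.0
result = laplacian_3d(spike, n, n, n)
```

In a compiled implementation, the contiguous innermost loop is the natural candidate for SIMD vectorization, and the reuse of neighbouring grid points is what makes cache behaviour matter; the sketch only shows the access pattern, not the tuning itself.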

After detailed performance tuning on each core to achieve up to 25% of theoretical peak in the kernel part with 3-D stencil computation, we evaluated the application performance on the full system (25 PFLOPS of theoretical peak) of the KNL cluster “Oakforest-PACS”, which is the largest KNL-based cluster in the world using the Intel Omni-Path Architecture. It shows excellent weak scaling, with a dominant Hamiltonian performance of up to 4 PFLOPS (16% efficiency of the system) in double precision irrespective of simulation size, as well as reasonable strong scaling on material simulations requiring a high degree of parallelism.
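
The efficiency figure quoted above follows directly from the two numbers in the abstract (4 PFLOPS achieved against a 25 PFLOPS theoretical peak). The short sketch below reproduces that arithmetic and spells out the usual definitions of weak and strong scaling efficiency. The function names are hypothetical, and the scaling formulas are the standard textbook definitions, assumed here rather than taken from the paper.

```python
def efficiency(achieved_pflops, peak_pflops):
    """Fraction of theoretical peak that was actually sustained."""
    return achieved_pflops / peak_pflops


def weak_scaling_efficiency(t_base, t_scaled):
    """Problem size grows with node count, so the ideal run time is constant."""
    return t_base / t_scaled


def strong_scaling_efficiency(t_base, n_base, t_scaled, n_scaled):
    """Problem size is fixed, so the ideal run time shrinks with node count."""
    return (t_base * n_base) / (t_scaled * n_scaled)


# Figures from the abstract: 4 PFLOPS sustained, 25 PFLOPS theoretical peak.
system_eff = efficiency(4.0, 25.0)
print(f"System efficiency: {system_eff:.0%}")
```

Under these definitions, "excellent weak scaling" means the first ratio stays close to 1 as nodes and problem size grow together, while "reasonable strong scaling" means the second ratio degrades only moderately as more nodes are applied to a fixed problem.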