
Released

Conference Paper

Optimizing Application Performance with BlueField: Accelerating Large-Message Blocking and Nonblocking Collective Operations

MPS-Authors
/persons/resource/persons243356

Ohlmann, Sebastian
Max Planck Computing and Data Facility, Max Planck Society;

/persons/resource/persons110221

Rampp, Markus
Max Planck Computing and Data Facility, Max Planck Society;

External Resource
There are no locators available
Fulltext (restricted access)
There are currently no full texts shared for your IP range.
Fulltext (public)
There are no public full texts available
Supplementary Material (public)
There is no public supplementary material available
Citation

Graham, R., Bosilca, G., Qin, Y., Settlemyer, B., Shainer, G., Stunkel, C., Vallee, G., Williams, B., Cisneros-Stoianowski, G., Ohlmann, S., & Rampp, M. (2024). Optimizing Application Performance with BlueField: Accelerating Large-Message Blocking and Nonblocking Collective Operations. In ISC High Performance 2024 Research Paper Proceedings (39th International Conference). Prometeus GmbH. doi:10.23919/ISC.2024.10528935.


Cite as: https://hdl.handle.net/21.11116/0000-000F-5B88-2
Abstract
With the end of Dennard scaling, specializing and distributing compute engines throughout the system is a promising technique to improve application performance. For example, NVIDIA's BlueField Data Processing Unit (DPU) integrates programmable processing elements within the network and offers specialized network processing capabilities. These capabilities enable communication via offloads onto DPUs and open new opportunities for offloading nonblocking or complex communication patterns, such as collective communication operations. This paper discusses the lessons learned in enabling DPU-based acceleration of collective communication algorithms by describing the impact of such offloaded collective operations on two applications: Octopus and P3DFFT++. We present and evaluate new algorithms for the nonblocking MPI_Ialltoallv and blocking MPI_Allgatherv collective operations, used by the above applications, that leverage DPU offloading. Our experiments show a performance improvement in the range of 14% to 49% for P3DFFT++ and of 17% for Octopus, even though in well-balanced OSU latency benchmarks these collectives perform comparably to well-optimized host-based implementations. This demonstrates that accounting for load imbalance in communication algorithms can improve application performance in settings where such imbalance is common and large in magnitude.