  Optimizing Application Performance with BlueField: Accelerating Large-Message Blocking and Nonblocking Collective Operations

Graham, R., Bosilca, G., Qin, Y., Settlemyer, B., Shainer, G., Stunkel, C., et al. (2024). Optimizing Application Performance with BlueField: Accelerating Large-Message Blocking and Nonblocking Collective Operations. In ISC High Performance 2024 Research Paper Proceedings (39th International Conference). Prometeus GmbH. doi:10.23919/ISC.2024.10528935.

Basic

Genre: Conference Paper

Files

Optimizing Application Performance with BlueField Accelerating Large-Message Blocking and Nonblocking Collective Operations.pdf (Any fulltext), 375KB
 
File Permalink: -
Name: Optimizing Application Performance with BlueField Accelerating Large-Message Blocking and Nonblocking Collective Operations.pdf
Description: -
OA-Status:
Visibility: Private
MIME-Type / Checksum: application/pdf
Technical Metadata:
Copyright Date: -
Copyright Info: -
License: -

Creators

 Creators:
Graham, Richard, Author
Bosilca, George, Author
Qin, Yong, Author
Settlemyer, Bradley, Author
Shainer, Gilad, Author
Stunkel, Craig, Author
Vallee, Geoffroy, Author
Williams, Brody, Author
Cisneros-Stoianowski, Gerardo, Author
Ohlmann, Sebastian (1), Author
Rampp, Markus (1), Author
Affiliations:
(1) Max Planck Computing and Data Facility, Max Planck Society, ou_2364734

Content

Free keywords: -
 Abstract: With the end of Dennard scaling, specializing and distributing compute engines throughout the system is a promising technique to improve application performance. For example, NVIDIA's BlueField Data Processing Unit (DPU) integrates programmable processing elements within the network and offers specialized network processing capabilities. These capabilities enable communication to be offloaded onto DPUs and present new opportunities for applications to offload nonblocking or complex communication patterns such as collective communication operations. This paper discusses the lessons learned in enabling DPU-based acceleration for collective communication algorithms by describing the impact of such offloaded collective operations on two applications: Octopus and P3DFFT++. We present and evaluate new algorithms that leverage DPU offloading for the nonblocking MPI_Ialltoallv and blocking MPI_Allgatherv collective operations, both of which are used by these applications. Our experiments show a performance improvement in the range of 14% to 49% for P3DFFT++ and 17% for Octopus, even though in well-balanced OSU latency benchmarks these collectives perform comparably to well-optimized host-based implementations. This demonstrates that taking load imbalance into account in communication algorithms can help improve application performance where such imbalance is common and large in magnitude.

Details

Language(s):
 Dates: 2024-05-10
 Publication Status: Published online
 Pages: -
 Publishing info: -
 Table of Contents: -
 Rev. Type: -
 Identifiers: DOI: 10.23919/ISC.2024.10528935
 Degree: -

Event

Title: 39th International Conference
Place of Event: Hamburg, Germany
Start-/End Date: 2024-05-12 - 2024-05-14

Source 1

Title: ISC High Performance 2024 Research Paper Proceedings (39th International Conference)
Source Genre: Proceedings
 Creator(s):
Affiliations:
Publ. Info: Prometeus GmbH
Pages: -
Volume / Issue: -
Sequence Number: -
Start / End Page: -
Identifier: ISBN: 978-3-9826336-0-2