# AWS OFI RCCL

**AWS OFI RCCL is deprecated.** Please use the upstream [aws-ofi-nccl](https://github.com/aws/aws-ofi-nccl), which supports ROCm and AMD GPUs.

AWS OFI RCCL is a plug-in that enables EC2 developers to use [libfabric](https://github.com/ofiwg/libfabric) as a network provider while running applications based on [AMD's RCCL](https://github.com/ROCmSoftwarePlatform/rccl).

## Overview

Machine learning frameworks running on top of AMD GPUs use a library called [RCCL](https://github.com/ROCmSoftwarePlatform/rccl), which provides standard collective communication routines for an arbitrary number of GPUs installed across one or more nodes.

This project implements a plug-in that maps RCCL's connection-oriented transport APIs to [libfabric's](https://ofiwg.github.io/libfabric/) connection-less reliable interface. This allows RCCL applications to benefit from libfabric's transport-layer services, such as reliable message delivery and operating-system bypass.

## Requirements

The plug-in currently supports the following distributions:

* Amazon Linux
* Amazon Linux 2
* Red Hat Enterprise Linux 7 and 8
* Ubuntu 18.04 and 20.04 LTS
* CentOS 7 and 8

It requires [libfabric](http://github.com/ofiwg/libfabric/) and [RCCL](https://github.com/ROCmSoftwarePlatform/rccl). Please see the [release notes](http://github.com/aws/aws-ofi-rccl/releases) for information on version compatibility.

Libfabric supports various providers.
The plug-in can only use providers that support the following features, as defined in the [libfabric API documentation](https://github.com/ofiwg/libfabric/tree/master/man/):

* Tagged messaging (`FI_TAGGED`, `FI_MSG`)
* Data transfer context structures (`FI_CONTEXT`)
* Reliable datagram endpoints (`FI_EP_RDM`)
* Send-after-Send ordering semantics (`FI_ORDER_SAS`)
* Communication with remote endpoints (`FI_REMOTE_COMM`)

GPUDirect RDMA support requires these additional features from libfabric providers. If no provider on the system supports them, the plug-in turns off GPUDirect RDMA support.

* Transfers to/from device memory (`FI_HMEM`)
* Remote memory operations (`FI_RMA`, `FI_READ`)

## Getting Started

### Dependencies

`aws-ofi-rccl` requires working installations of RCCL and libfabric. You can find installation instructions at [RCCL installation](https://github.com/ROCmSoftwarePlatform/rccl) and [libfabric installation](https://github.com/ofiwg/libfabric), respectively.

### Build latest RCCL develop branch

```
cd ~
git clone https://github.com/ROCmSoftwarePlatform/rccl.git
cd rccl
mkdir build
cd build/
CXX=/opt/rocm-5.2.0/bin/hipcc cmake -DCMAKE_PREFIX_PATH=/opt/rocm/ ..
make -j
```

### Plugin build instructions

The plugin uses GNU Autotools for its build system. You can build it as follows:

```
$ ./autogen.sh
$ CC=cc ./configure --with-libfabric=/opt/cray/libfabric/1.15.0.0 --with-hip=/opt/rocm-5.2.0 --with-rccl=path-to-rccl-build-folder
$ make
$ sudo make install
```

For `--with-rccl=path-to-rccl-build-folder`: if RCCL was built at `/home/username/rccl/build`, pass `--with-rccl=/home/username/rccl/build`.

If you want to install the plugin in a custom path, use the `--prefix` configure flag to provide the path.
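As a worked example, the following sketch combines `--prefix` with the dependency flags shown above; every path here is an illustrative assumption, not a default — substitute the locations on your system:

```shell
# Illustrative sketch only; all paths below are assumptions.
./autogen.sh
CC=cc ./configure \
    --prefix="$HOME/aws-ofi-rccl" \
    --with-libfabric=/opt/cray/libfabric/1.15.0.0 \
    --with-hip=/opt/rocm-5.2.0 \
    --with-rccl="$HOME/rccl/build"
make -j
make install    # no sudo needed when installing under $HOME
```

Installing under a home directory avoids needing root for `make install`, at the cost of having to add the prefix to `LD_LIBRARY_PATH` at run-time.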
You can also point the build to custom dependencies with the following flags:

```
--with-libfabric=PATH   Path to non-standard libfabric installation
--with-hip=PATH         Path to non-standard ROCm installation
--with-rccl=PATH        Path to non-standard RCCL installation
--with-mpi=PATH         Path to non-standard MPI installation
```

To enable trace messages for debugging (disabled by default), use the following configure option:

```
--enable-trace          Enable printing trace messages
```

Tests are built by default. To disable building them, use the following configure option:

```
--disable-tests         Disable build of tests
```

### Plugin Configurations

The plugin allows the following variables to be configured at run-time according to your environment.
| Parameter | Description | Type | Accepted Value |
|---|---|---|---|
| `OFI_NCCL_USE_IPV6_TCP` | Allow using endpoints with IPv6 addressing format for the TCP provider. Users can select a preferred libfabric provider with the `FI_PROVIDER` environment variable. | Boolean | 0/1 (Default: 0) |
| `OFI_NCCL_TCP_EXCLUDE_IF` | List of interface names to be filtered out for the TCP provider. Users can select a preferred libfabric provider with the `FI_PROVIDER` environment variable. | String | Comma-separated list of interface names (Default: "lo,docker0") |
| `OFI_NCCL_GDR_FLUSH_DISABLE` | Disable the flush operation when using GPUDirect. | Boolean | 0/1 (Default: 1) |
| `OFI_NCCL_CUDA_FLUSH_ENABLE` | When using GPUDirect, use `cudaDeviceFlushGPUDirectRDMAWrites` to enforce data consistency at the receiving GPU. Requires CUDA 11.3 or later. Note that this function only provides a GPU memory fence and requires that data has already been delivered to GPU memory. Some networks and PCIe configurations require an additional network-level flush that is not provided by this option. | Boolean | 0/1 (Default: 0) |
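The parameters above are read from the environment at run-time, so a launch script can simply export them alongside `FI_PROVIDER`. The sketch below shows one possible setup; the plugin install path and the provider name are assumptions for illustration:

```shell
# Make the plugin visible to the dynamic loader (install path is an assumption).
export LD_LIBRARY_PATH="/opt/aws-ofi-rccl/lib:$LD_LIBRARY_PATH"

# Pin a preferred libfabric provider (provider name is an assumption).
export FI_PROVIDER=cxi

# Leave IPv6 TCP endpoints off and keep the GPUDirect flush disabled (both defaults).
export OFI_NCCL_USE_IPV6_TCP=0
export OFI_NCCL_GDR_FLUSH_DISABLE=1

# Skip loopback and docker interfaces if the TCP provider ends up selected.
export OFI_NCCL_TCP_EXCLUDE_IF="lo,docker0"
```

These exports take effect for any RCCL application subsequently launched from the same shell (or propagated by the MPI launcher to remote ranks).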