# my-msccl
**Repository Path**: jiaoff_gitee/my-msccl
## Basic Information
- **Project Name**: my-msccl
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-01-10
- **Last Updated**: 2024-01-10
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# MSCCL
Microsoft Collective Communication Library (MSCCL) is a platform for executing custom collective communication algorithms on heterogeneous accelerators supported by Microsoft Azure. MSCCL currently supports NVIDIA and AMD GPUs. The research prototype of this project is [microsoft/msccl](https://github.com/microsoft/msccl).
## Introduction
MSCCL's vision is to provide a unified, efficient, and scalable framework for executing collective communication algorithms on heterogeneous accelerators. To achieve this, MSCCL has multiple components:
- [MSCCL toolkit](https://github.com/microsoft/msccl-tools): Interconnects among accelerators have different latencies and bandwidths, so a generic collective communication algorithm does not necessarily perform well for all topologies and buffer sizes. To provide this flexibility, the MSCCL toolkit lets a user write a hyper-optimized collective communication algorithm for a given topology and buffer size. The toolkit contains a high-level DSL (MSCCLang) and a compiler that generates an IR for the MSCCL executor to run on the backend. [Example](#Example) shows how the MSCCL toolkit works together with the runtime. Please refer to the [MSCCL toolkit](https://github.com/microsoft/msccl-tools) for more information.
- [MSCCL scheduler](https://github.com/Azure/msccl-scheduler): MSCCL scheduler provides an example design and implementation of how to select optimal MSCCL algorithms for MSCCL executors.
- MSCCL executor: The MSCCL executor is a set of libraries responsible for running custom-written collective communication algorithms on heterogeneous accelerators. Each kind of accelerator has a corresponding executor library that is specifically optimized for it. The different executor libraries share the same interface for running the MSCCL algorithm IR produced by the MSCCL toolkit and for talking to the MSCCL scheduler. For NVIDIA GPUs, this is [msccl-executor-nccl](https://github.com/Azure/msccl-executor-nccl), which is built on top of [NCCL](https://github.com/nvidia/nccl). For AMD GPUs, it is [RCCL](https://github.com/ROCmSoftwarePlatform/rccl), which has already integrated all MSCCL executor features.
- MSCCL test toolkit ([msccl-tests-nccl](https://github.com/Azure/msccl-tests-nccl)): These tests check both the performance and the correctness of MSCCL operations.
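The scheduler's job, choosing an algorithm for a given collective, topology, and message size, can be sketched as a lookup over registered algorithms. Everything below (the `Algorithm` record, algorithm names, and size thresholds) is hypothetical and only illustrates the idea; it is not the actual msccl-scheduler API:

```python
# Hypothetical sketch of algorithm selection by message size.
# Names and thresholds are illustrative only, not the real msccl-scheduler.
from dataclasses import dataclass

@dataclass
class Algorithm:
    name: str
    collective: str
    min_bytes: int   # inclusive lower bound of the size range this algorithm targets
    max_bytes: int   # inclusive upper bound

# A registry of compiled algorithms, e.g. loaded from MSCCL toolkit XML output.
REGISTRY = [
    Algorithm("allreduce_allpairs_ll", "allreduce", 0, 1 << 20),
    Algorithm("allreduce_ring_ll128", "allreduce", (1 << 20) + 1, 1 << 40),
]

def select(collective: str, nbytes: int) -> str:
    """Return the first registered algorithm covering this size, else a fallback."""
    for algo in REGISTRY:
        if algo.collective == collective and algo.min_bytes <= nbytes <= algo.max_bytes:
            return algo.name
    return "nccl_default"

print(select("allreduce", 4096))      # a small message picks the all-pairs algorithm
print(select("allreduce", 64 << 20))  # a large message picks the ring algorithm
```

In the real system the executor loads the compiled algorithm IR and the scheduler decides, per call, which loaded algorithm (if any) to dispatch instead of the built-in NCCL/RCCL path.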
## Performance
For reference, FP16 All-Gather latency was measured on an ND H100 v5 VM using msccl-tests-nccl, comparing NCCL against MSCCL:
| Message Size | NCCL Latency (us) | MSCCL Latency (us) | MSCCL Speedup |
| --- | --- | --- | --- |
| 1KB | 9.54 | 5.65 | 1.69x |
| 2KB | 9.8 | 5.7 | 1.72x |
| 4KB | 9.78 | 5.43 | 1.80x |
| 8KB | 9.78 | 5.47 | 1.81x |
| 16KB | 10.29 | 5.53 | 1.86x |
| 32KB | 12.49 | 5.75 | 2.17x |
| 64KB | 12.87 | 5.95 | 2.16x |
| 128KB | 13.16 | 6.38 | 2.06x |
| 256KB | 13.23 | 7.26 | 1.82x |
| 512KB | 13.39 | 8.71 | 1.54x |
| 1MB | 18.33 | 12.3 | 1.49x |
| 2MB | 23.18 | 17.75 | 1.31x |
| 4MB | 33.66 | 23.37 | 1.44x |
| 8MB | 44.7 | 38.54 | 1.16x |
| 16MB | 67.19 | 67.16 | 1.00x |
| 32MB | 104.7 | 98.4 | 1.06x |
| 64MB | 192.4 | 181.9 | 1.06x |
| 128MB | 368.3 | 348.4 | 1.06x |
| 256MB | 699.5 | 680.7 | 1.03x |
| 512MB | 1358.6 | 1339.3 | 1.01x |
| 1GB | 2663.8 | 2633 | 1.01x |
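The speedup column is simply the ratio of NCCL latency to MSCCL latency; a quick check against a few rows of the table above:

```python
# Speedup = NCCL latency / MSCCL latency, using values from the table above.
rows = [
    ("1KB", 9.54, 5.65),    # table reports 1.69x
    ("2KB", 9.8, 5.7),      # table reports 1.72x
    ("32MB", 104.7, 98.4),  # table reports 1.06x
]
for size, nccl_us, msccl_us in rows:
    print(f"{size}: {nccl_us / msccl_us:.2f}x")
```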
## Example
To use MSCCL, you may follow these steps to use two different MSCCL algorithms for AllReduce on an Azure NDv4 VM, which has 8x A100 GPUs:
#### 1. Download the MSCCL source code and its submodules
```sh
$ git clone https://github.com/Azure/msccl.git --recurse-submodules
```
#### 2. Install the MSCCL executor:
```sh
$ cd msccl/executor/msccl-executor-nccl
$ make -j src.build
$ cd ../../
```
#### 3. Build msccl-tests-nccl for performance evaluation:
```sh
$ cd tests/msccl-tests-nccl/
$ make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=$HOME/msccl/executor/msccl-executor-nccl/build/ -j
$ cd ../
$ cd ../
```
#### 4. Apply an MSCCL algorithm via the MSCCL scheduler
- For NDv4, optimized algorithms are already available, so you can use the MSCCL scheduler to apply them directly to the executor. Below are the steps to build and install the scheduler:
```sh
$ sudo apt-get install libcurl4-openssl-dev nlohmann-json3-dev
$ cd scheduler/msccl-scheduler
# for nccl:
$ CXX=/path/to/nvcc BIN_HOME=/path/to/nccl/binary SRC_HOME=/path/to/nccl/source make
# for rccl:
$ CXX=/path/to/hipcc BIN_HOME=/path/to/rccl/binary SRC_HOME=/path/to/rccl/source make PLATFORM=RCCL
$ make install
```
- To customize an MSCCL algorithm for your system, install the [MSCCL toolkit](https://github.com/microsoft/msccl-tools) and compile a custom algorithm:
```sh
$ git clone https://github.com/microsoft/msccl-tools.git
$ cd msccl-tools/
$ pip install .
$ cd ../
$ python msccl-tools/examples/mscclang/allreduce_a100_allpairs.py --protocol=LL 8 2 > test.xml
```
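The compiled XML then has to be staged where the executor scans for algorithms. A minimal sketch, assuming the build layout from step 2 (adjust the paths to your checkout):

```python
# Stage the compiled algorithm XML into the executor's algorithm directory.
# The paths below assume the build layout from step 2 of this example.
import pathlib
import shutil

algo_dir = pathlib.Path("msccl/executor/msccl-executor-nccl/build/lib/msccl-algorithms")
algo_dir.mkdir(parents=True, exist_ok=True)  # normally created by the build

src = pathlib.Path("test.xml")
if src.exists():
    shutil.copy(src, algo_dir / src.name)
    print(f"staged {src} -> {algo_dir}")
else:
    print("test.xml not found; run the msccl-tools compile step first")
```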
The compiler generates an XML file (`test.xml`) that is fed to the MSCCL runtime. To evaluate its performance, copy `test.xml` to `msccl/executor/msccl-executor-nccl/build/lib/msccl-algorithms/` and run the following command on an Azure NDv4 node or any 8x A100 system:
#### 5. Run the test using msccl-executor-nccl:
```sh
$ mpirun -np 8 -x LD_LIBRARY_PATH=msccl/executor/msccl-executor-nccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV tests/msccl-tests-nccl/build/all_reduce_perf -b 128 -e 32MB -f 2 -g 1 -c 1 -n 100 -w 100 -G 100 -z 0
```
#### 6. If everything is installed correctly, you should see the following line in the log:
```sh
[0] NCCL INFO Connected 1 MSCCL algorithms
```
You may evaluate the performance of `test.xml` by comparing in-place (the new algorithm) against out-of-place (the default ring algorithm); it should be up to 2-3x faster on 8x A100 NVLink-interconnected GPUs. The [MSCCL toolkit](https://github.com/microsoft/msccl-tools) has a rich set of algorithms for different Azure SKUs and collective operations, with significant speedups over vanilla NCCL.
## Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit [CLA](https://cla.opensource.microsoft.com).
When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
## Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos is subject to those third parties' policies.