# NVIDIA_gdrcopy

**Repository Path**: tingate/gdrcopy

## Basic Information

- **Project Name**: NVIDIA_gdrcopy
- **Description**: A low-latency GPU memory copy library based on NVIDIA GPUDirect RDMA technology.
- **Primary Language**: C
- **License**: MIT
- **Default Branch**: master

## Statistics

- **Created**: 2022-08-01
- **Last Updated**: 2022-08-02

## README

# GDRCopy

A low-latency GPU memory copy library based on NVIDIA GPUDirect RDMA technology.

## Introduction

While GPUDirect RDMA is meant for direct access to GPU memory from third-party devices, it is possible to use these same APIs to create perfectly valid CPU mappings of the GPU memory.

The advantage of a CPU-driven copy is the very small overhead involved. That might be useful when low latencies are required.

## What is inside

GDRCopy offers the infrastructure to create user-space mappings of GPU memory, which can then be manipulated as if it were plain host memory (caveats apply here).

A simple by-product of it is a copy library with the following characteristics:

- Very low overhead, as it is driven by the CPU. As a reference, a cudaMemcpy can currently incur a 6-7 us overhead.
- An initial memory *pinning* phase is required, which is potentially expensive: 10 us to 1 ms depending on the buffer size.
- Fast H-D (host-to-device) copies, because of write-combining. H-D bandwidth is 6-8 GB/s on Ivy Bridge Xeon but it is subject to NUMA effects.
- Slow D-H (device-to-host) copies, because the GPU BAR, which backs the mappings, can't be prefetched and so burst read transactions are not generated through PCIe.

The library comes with a few tests like:

- sanity, which contains unit tests for the library and the driver.
- copybw, a minimal application which calculates the R/W bandwidth for a specific buffer size.
- copylat, a benchmark application which calculates the R/W copy latency for a range of buffer sizes.
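The latency figures that copylat reports can be turned into an effective bandwidth with simple arithmetic, which makes the fast-write, slow-read asymmetry listed above easy to quantify. The following is a minimal sketch, not code from GDRCopy: the function name `bandwidth_mb_per_s` is hypothetical, and it assumes 1 MB = 10^6 bytes, which may differ from the convention the bundled tests use when printing MB/s.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of the GDRCopy API.
 *
 * Converts a copy size in bytes and an average per-copy time in
 * microseconds into an effective bandwidth in MB/s.
 *
 * Assumption: 1 MB = 10^6 bytes, so bytes per microsecond is
 * numerically equal to MB per second.
 */
double bandwidth_mb_per_s(size_t size_bytes, double avg_time_us)
{
    if (avg_time_us <= 0.0) {
        /* Guard against division by zero or a negative time. */
        return 0.0;
    }
    return (double)size_bytes / avg_time_us;
}
```

Applied to the sample copylat output in the Tests section, a 131072-byte `gdr_copy_to_mapping` at 12.9761 us works out to roughly 10100 MB/s, while the same size with `gdr_copy_from_mapping` at 234.9382 us works out to roughly 558 MB/s.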

## Requirements

GPUDirect RDMA requires NVIDIA Tesla or Quadro class GPUs based on Kepler, Pascal, Volta, or Turing, see [GPUDirect RDMA](http://developer.nvidia.com/gpudirect). For more technical information, please refer to the official GPUDirect RDMA [design document](http://docs.nvidia.com/cuda/gpudirect-rdma).

The device driver requires GPU display driver >= 418.40 on ppc64le and >= 331.14 on other platforms. The library and tests require CUDA >= 6.0. Additionally, the _sanity_ test requires check >= 0.9.8 and subunit.

DKMS is a prerequisite for installing the GDRCopy kernel module package. On RHEL, however, users have the option to build a kmod package and install it instead of the DKMS package. See the [Build and installation](#build-and-installation) section for more details.

```shell
# On RHEL
# dkms can be installed from epel-release. See https://fedoraproject.org/wiki/EPEL.
$ sudo yum install dkms check check-devel subunit subunit-devel

# On Debian
$ sudo apt install check libsubunit0 libsubunit-dev
```

CUDA and the GPU display driver must be installed before building and/or installing GDRCopy. The installation instructions can be found at https://developer.nvidia.com/cuda-downloads.

GPU display driver header files are also required. They are installed as a part of the driver (or CUDA) installation with *runfile*. If you install the driver via package management, we suggest:

- On RHEL, `sudo dnf module install nvidia-driver:latest-dkms`.
- On Debian, `sudo apt install nvidia-dkms-`.

The supported architectures are Linux x86_64, ppc64le, and arm64. The supported platforms are RHEL7, RHEL8, Ubuntu16_04, Ubuntu18_04, and Ubuntu20_04.

Root privileges are necessary to load/install the kernel-mode device driver.

## Build and installation

We provide three ways for building and installing GDRCopy.

### rpm package

```shell
$ sudo yum groupinstall 'Development Tools'
$ sudo yum install dkms rpm-build make check check-devel subunit subunit-devel
$ cd packages
$ CUDA= ./build-rpm-packages.sh
$ sudo rpm -Uvh gdrcopy-kmod-dkms.noarch..rpm
$ sudo rpm -Uvh gdrcopy-...rpm
$ sudo rpm -Uvh gdrcopy-devel-.noarch..rpm
```

The DKMS package is the default kernel module package that `build-rpm-packages.sh` generates. To create a kmod package, the `-m` option must be passed to the script. Unlike the DKMS package, the kmod package contains a prebuilt GDRCopy kernel module which is specific to the NVIDIA driver version and the Linux kernel version used to build it.

### deb package

```shell
$ sudo apt install build-essential devscripts debhelper check libsubunit-dev fakeroot pkg-config dkms
$ cd packages
$ CUDA= ./build-deb-packages.sh
$ sudo dpkg -i gdrdrv-dkms__..deb
$ sudo dpkg -i libgdrapi__..deb
$ sudo dpkg -i gdrcopy-tests__..deb
$ sudo dpkg -i gdrcopy__..deb
```

### from source

```shell
$ make prefix= CUDA= all install
$ sudo ./insmod.sh
```

If `libcheck` is installed in a non-standard path and is therefore not picked up by `pkg-config`, you can set the `PKG_CONFIG_PATH` environment variable to the directory which contains the `check.pc` file and pass it down to make:

```shell
$ PKG_CONFIG_PATH=/check_install_path/lib/pkgconfig/ make <...>
```

## Tests

Execute the provided tests:

```shell
$ sanity
Running suite(s): Sanity
100%: Checks: 27, Failures: 0, Errors: 0

$ copybw
GPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0000:06:00
GPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0000:07:00
GPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0000:0a:00
GPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0000:0b:00
GPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0000:85:00
GPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0000:86:00
GPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0000:89:00
GPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0000:8a:00
selecting device 0
testing size: 131072
rounded size: 131072
gpu alloc fn: cuMemAlloc
device ptr: 7f1153a00000
map_d_ptr: 0x7f1172257000
info.va: 7f1153a00000
info.mapped_size: 131072
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer:0x7f1172257000
writing test, size=131072 offset=0 num_iters=10000
write BW: 9638.54MB/s
reading test, size=131072 offset=0 num_iters=100
read BW: 530.135MB/s
unmapping buffer
unpinning buffer
closing gdrdrv

$ copylat
GPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0000:06:00
GPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0000:07:00
GPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0000:0a:00
GPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0000:0b:00
GPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0000:85:00
GPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0000:86:00
GPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0000:89:00
GPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0000:8a:00
selecting device 0
device ptr: 0x7fa2c6000000
allocated size: 16777216
gpu alloc fn: cuMemAlloc

map_d_ptr: 0x7fa2f9af9000
info.va: 7fa2c6000000
info.mapped_size: 16777216
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer: 0x7fa2f9af9000

gdr_copy_to_mapping num iters for each size: 10000
WARNING: Measuring the API invocation overhead as observed by the CPU. Data might not be ordered all the way to the GPU internal visibility.
Test                        Size(B)    Avg.Time(us)
gdr_copy_to_mapping               1          0.0889
gdr_copy_to_mapping               2          0.0884
gdr_copy_to_mapping               4          0.0884
gdr_copy_to_mapping               8          0.0884
gdr_copy_to_mapping              16          0.0905
gdr_copy_to_mapping              32          0.0902
gdr_copy_to_mapping              64          0.0902
gdr_copy_to_mapping             128          0.0952
gdr_copy_to_mapping             256          0.0983
gdr_copy_to_mapping             512          0.1176
gdr_copy_to_mapping            1024          0.1825
gdr_copy_to_mapping            2048          0.2549
gdr_copy_to_mapping            4096          0.4366
gdr_copy_to_mapping            8192          0.8141
gdr_copy_to_mapping           16384          1.6155
gdr_copy_to_mapping           32768          3.2284
gdr_copy_to_mapping           65536          6.4906
gdr_copy_to_mapping          131072         12.9761
gdr_copy_to_mapping          262144         25.9459
gdr_copy_to_mapping          524288         51.9100
gdr_copy_to_mapping         1048576        103.8028
gdr_copy_to_mapping         2097152        207.5990
gdr_copy_to_mapping         4194304        415.2856
gdr_copy_to_mapping         8388608        830.6355
gdr_copy_to_mapping        16777216       1661.3285

gdr_copy_from_mapping num iters for each size: 100
Test                        Size(B)    Avg.Time(us)
gdr_copy_from_mapping             1          0.9069
gdr_copy_from_mapping             2          1.7170
gdr_copy_from_mapping             4          1.7169
gdr_copy_from_mapping             8          1.7164
gdr_copy_from_mapping            16          0.8601
gdr_copy_from_mapping            32          1.7024
gdr_copy_from_mapping            64          3.1016
gdr_copy_from_mapping           128          3.4944
gdr_copy_from_mapping           256          3.6400
gdr_copy_from_mapping           512          2.4394
gdr_copy_from_mapping          1024          2.8022
gdr_copy_from_mapping          2048          4.6615
gdr_copy_from_mapping          4096          7.9783
gdr_copy_from_mapping          8192         14.9209
gdr_copy_from_mapping         16384         28.9571
gdr_copy_from_mapping         32768         56.9373
gdr_copy_from_mapping         65536        114.1008
gdr_copy_from_mapping        131072        234.9382
gdr_copy_from_mapping        262144        496.4011
gdr_copy_from_mapping        524288        985.5196
gdr_copy_from_mapping       1048576       1970.7057
gdr_copy_from_mapping       2097152       3942.5611
gdr_copy_from_mapping       4194304       7888.9468
gdr_copy_from_mapping       8388608      18361.5673
gdr_copy_from_mapping      16777216      36758.8342
unmapping buffer
unpinning buffer
closing gdrdrv

$ apiperf -s 8
GPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0000:06:00
GPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0000:07:00
GPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0000:0a:00
GPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0000:0b:00
GPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0000:85:00
GPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0000:86:00
GPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0000:89:00
GPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0000:8a:00
selecting device 0
device ptr: 0x7f1563a00000
allocated size: 65536
Size(B)  pin.Time(us)  map.Time(us)  get_info.Time(us)  unmap.Time(us)  unpin.Time(us)
65536    1346.034060   3.603800      0.340270           4.700930        676.612800
Histogram of gdr_pin_buffer latency for 65536 bytes
[1303.852000    -       2607.704000]    93
[2607.704000    -       3911.556000]    0
[3911.556000    -       5215.408000]    0
[5215.408000    -       6519.260000]    0
[6519.260000    -       7823.112000]    0
[7823.112000    -       9126.964000]    0
[9126.964000    -       10430.816000]   0
[10430.816000   -       11734.668000]   0
[11734.668000   -       13038.520000]   0
[13038.520000   -       14342.372000]   2
closing gdrdrv
```

## NUMA effects

Depending on the platform architecture, such as where the GPUs are placed in the PCIe topology, performance may suffer if the processor which is driving the copy is not the one which is hosting the GPU, for example in a multi-socket server.

In the example below, GPU ID 0 is hosted by CPU socket 0.
By explicitly controlling the OS process and memory affinity, it is possible to run the test on the optimal processor:

```shell
$ numactl -N 0 -l copybw -d 0 -s $((64 * 1024)) -o $((0 * 1024)) -c $((64 * 1024))
GPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0000:06:00
GPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0000:07:00
GPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0000:0a:00
GPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0000:0b:00
GPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0000:85:00
GPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0000:86:00
GPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0000:89:00
GPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0000:8a:00
selecting device 0
testing size: 65536
rounded size: 65536
gpu alloc fn: cuMemAlloc
device ptr: 7f5817a00000
map_d_ptr: 0x7f583b186000
info.va: 7f5817a00000
info.mapped_size: 65536
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer:0x7f583b186000
writing test, size=65536 offset=0 num_iters=1000
write BW: 9768.3MB/s
reading test, size=65536 offset=0 num_iters=1000
read BW: 548.423MB/s
unmapping buffer
unpinning buffer
closing gdrdrv
```

or on the other socket:

```shell
$ numactl -N 1 -l copybw -d 0 -s $((64 * 1024)) -o $((0 * 1024)) -c $((64 * 1024))
GPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0000:06:00
GPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0000:07:00
GPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0000:0a:00
GPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0000:0b:00
GPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0000:85:00
GPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0000:86:00
GPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0000:89:00
GPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0000:8a:00
selecting device 0
testing size: 65536
rounded size: 65536
gpu alloc fn: cuMemAlloc
device ptr: 7fbb63a00000
map_d_ptr: 0x7fbb82ab0000
info.va: 7fbb63a00000
info.mapped_size: 65536
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer:0x7fbb82ab0000
writing test, size=65536 offset=0 num_iters=1000
write BW: 9224.36MB/s
reading test, size=65536 offset=0 num_iters=1000
read BW: 521.262MB/s
unmapping buffer
unpinning buffer
closing gdrdrv
```

## Restrictions and known issues

GDRCopy works with regular CUDA device memory only, as returned by cudaMalloc. In particular, it does not work with CUDA managed memory.

`gdr_pin_buffer()` accepts any address returned by cudaMalloc and its family. In contrast, `gdr_map()` requires that the pinned address is aligned to the GPU page size. Neither the CUDA Runtime nor the Driver API guarantees that GPU memory allocation functions return aligned addresses. Users are responsible for proper alignment of addresses passed to the library.

Two cudaMalloc'd memory regions may be contiguous. Users may call `gdr_pin_buffer` and `gdr_map` with an address and size that extend across these two regions. This use case is not well supported in GDRCopy. On rare occasions, users may experience (1) an error in `gdr_map`, or (2) low copy performance because `gdr_map` cannot provide a write-combined mapping.

In some GPU driver versions, pinning the same GPU address multiple times consumes additional BAR1 space. This is because the space is not properly reused. If you encounter this issue, we suggest that you try the latest version of the NVIDIA GPU driver.

On POWER9, where the CPU and GPU are connected via NVLink, CUDA 9.2 and GPU driver v396.37 are the minimum requirements to achieve full performance. GDRCopy works with earlier CUDA and GPU driver versions, but the achievable bandwidth is substantially lower.

## Bug filing

To report issues with any NVIDIA software, or to report suspected bugs, we recommend using the bug filing system which is available to NVIDIA registered developers on the developer site.

If you are not a member you can [sign up](https://developer.nvidia.com/accelerated-computing-developer).
Once you are a member, you can submit issues using [this form](https://developer.nvidia.com/nvbugs/cuda/add). Be sure to select GPUDirect in the "Relevant Area" field.

You can later track their progress using the __My Bugs__ link on the left of this [view](https://developer.nvidia.com/user).

## Acknowledgment

If you find this software useful in your work, please cite:

R. Shi et al., "Designing efficient small message transfer mechanism for inter-node MPI communication on InfiniBand GPU clusters," 2014 21st International Conference on High Performance Computing (HiPC), Dona Paula, 2014, pp. 1-10, doi: 10.1109/HiPC.2014.7116873.