# μ-cuDNN

μ-cuDNN is a transparent wrapper for the [NVIDIA cuDNN](https://developer.nvidia.com/cudnn) library that splits a mini-batch into micro-batches to speed up computation. μ-cuDNN is intended to be combined with deep learning frameworks written in C++, such as [Caffe](https://github.com/BVLC/caffe) and [TensorFlow](https://github.com/tensorflow/tensorflow).

## Reference

This repository contains the code used in

> Yosuke Oyama, Tal Ben-Nun, Torsten Hoefler, Satoshi Matsuoka, μ-cuDNN: Accelerating Deep Neural Networks with Micro-Batching, arXiv e-prints, 2018. \[[URL](https://arxiv.org/abs/1804.04806)\]

Please cite as:

```
@article{ucudnn,
  author        = {Yosuke Oyama and Tal Ben-Nun and Torsten Hoefler and Satoshi Matsuoka},
  title         = {{{\(\mu\)}-cuDNN}: Accelerating Deep Learning Frameworks with Micro-Batching},
  journal       = {CoRR},
  volume        = {abs/1804.04806},
  year          = {2018},
  url           = {http://arxiv.org/abs/1804.04806},
  archivePrefix = {arXiv},
  eprint        = {1804.04806},
}
```

## Requirements

* GCC >= 4.8.5 (must support `-std=c++11`)
* [CMake](https://cmake.org/) >= 3.9.2
* [CUDA](https://developer.nvidia.com/cuda-downloads) >= 8
* [cuDNN](https://developer.nvidia.com/cudnn) >= 6
* [GLPK](https://www.gnu.org/software/glpk/#downloading) >= 4.63 (optional)
* [SQLite](https://www.sqlite.org/download.html) >= 3.21 (optional)

## Performance

### DeepBench

This figure shows the relative speedups of [DeepBench](https://github.com/baidu-research/DeepBench)'s 3x3 and 5x5 convolution layers on the [NVIDIA Tesla P100-SXM2](https://www.nvidia.com/en-us/data-center/tesla-p100/) GPU. We use a mini-batch size of 256 and workspace limits of 128, 256, and 512 MiB. μ-cuDNN achieves speedups of up to 2.31x for 3x3 layers and 3.85x for 5x5 layers.

### CIFAR-10 Training

This figure shows learning curves of [a CIFAR-10 CNN defined in Caffe](https://github.com/BVLC/caffe/tree/master/examples/cifar10) with three different micro-batch policies. We use a mini-batch size of 1024 and a workspace limit of 64 MiB. The CNN achieves ~80% test accuracy, similar to [the official result](https://github.com/BVLC/caffe/tree/master/examples/cifar10).

## Installation

1. Compile μ-cuDNN with [CMake](https://cmake.org/):
```
mkdir build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX:PATH="/path/to/ucudnn"
make
make install
```
2. Add the μ-cuDNN paths:
```
export CPLUS_INCLUDE_PATH=/path/to/ucudnn/include:$CPLUS_INCLUDE_PATH
export LD_LIBRARY_PATH=/path/to/ucudnn/lib:$LD_LIBRARY_PATH
```
3. Modify your deep learning framework (see the sketch below this step):
   * Add `#include <ucudnn.h>` to the `*.cpp`, `*.cu`, and `*.h` files that contain `cudnnHandle_t`.
   * Replace `cudnnHandle_t` with `UcudnnHandle_t`.
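A minimal sketch of what step 3 looks like at a typical convolution call site is shown below. The layer class is illustrative and not taken from any particular framework; it assumes that `<ucudnn.h>` makes `UcudnnHandle_t` a drop-in replacement for `cudnnHandle_t`, as described above.

```c++
#include <cudnn.h>
#include <ucudnn.h> // added per step 3; declares UcudnnHandle_t

// Illustrative convolution layer (hypothetical, for demonstration only).
class ConvolutionLayer {
  UcudnnHandle_t handle_; // was: cudnnHandle_t handle_;

public:
  void forward(const cudnnTensorDescriptor_t xDesc, const void* x,
               const cudnnFilterDescriptor_t wDesc, const void* w,
               const cudnnConvolutionDescriptor_t convDesc,
               cudnnConvolutionFwdAlgo_t algo,
               void* workspace, size_t workspaceSize,
               const cudnnTensorDescriptor_t yDesc, void* y) {
    const float one = 1.0f, zero = 0.0f;
    // The call site itself is unchanged: μ-cuDNN intercepts the call
    // through the handle type and may execute it as several
    // micro-batched convolutions within the given workspace limit.
    cudnnConvolutionForward(handle_, &one, xDesc, x, wDesc, w, convDesc,
                            algo, workspace, workspaceSize,
                            &zero, yDesc, y);
  }
};
```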
4. Compile the framework:
   * You need to link `libucudnn.so` to the framework explicitly by adding the `-lucudnn` flag.
   * In some frameworks you also need to specify the `-std=c++11` flag.
   * For example, the following CMake flags are needed to compile a μ-cuDNN-enabled [Caffe](https://github.com/BVLC/caffe):
     * `-DCMAKE_SHARED_LINKER_FLAGS="-lucudnn"`
     * `-DCMAKE_EXE_LINKER_FLAGS="-lucudnn"`
     * `-DCMAKE_CXX_FLAGS="-std=c++11"`

Here you can find μ-cuDNN-enabled forks of [Caffe](?) and [TensorFlow](?), which can be used instead of running steps 3 and 4.

### CMake options

| Option | Default | Description |
|--------|---------|-------------|
| `UCUDNN_USE_GLPK` | OFF | Use the [GNU Linear Programming Kit (GLPK)](https://www.gnu.org/software/glpk/) to run ILP-based optimization. |
| `UCUDNN_USE_SQLITE` | OFF | Use [SQLite](https://www.sqlite.org/) to cache benchmark results in the file system. |
| `UCUDNN_DEBUG_GLPK_PRINT` | OFF | Output ILP information after solving (as `glpk_{sol,mip,prob}_(UNIX time)`) to the current directory. The output formats are based on the `glp_print_sol`, `glp_print_mip`, and `glp_write_prob` functions of GLPK, respectively. |
| `UCUDNN_DEBUG_OUTPUT` | ON | Output optimization results to stderr. |
| `UCUDNN_DEBUG_OUTPUT_ENV` | OFF | Output the environment variables used to stderr. |
| `UCUDNN_DEBUG_EQUIVALENCE_TEST` | OFF | Compute the normalized L2 distance between the output tensors of cuDNN and μ-cuDNN to check whether the convolutions are equivalent. |
| `UCUDNN_TEST` | ON | Build tests. |

* Notes on `UCUDNN_DEBUG_EQUIVALENCE_TEST`:
  * The normalized distance may be larger than zero due to numerical error and the nondeterministic behavior of some algorithms.
  * In practice, the normalized distance should be less than 1e-6.
  * Since this option computes the distance on every convolution, it considerably slows down computation. Turn it off for practical training/inference.

### Runtime options (environment variables)

| Variable | Acceptable values | Default | Description |
|----------|-------------------|---------|-------------|
| `UCUDNN_BATCH_SIZE_POLICY` | one of `undivided`, `powerOfTwo`, `all` | `powerOfTwo` | The μ-batch size policy. This can also be set via `UcudnnHandle_t::setOptimizerBatchSizePolicy`. |
| `UCUDNN_BENCHMARK_DEVICES` | comma-separated integers (e.g. `0,1,2,3`) or `all` | the current device of the process | GPU device ID(s) used for benchmarking in parallel. Note that the optimization will be incorrect if different kinds of GPUs are used at the same time. |
| `UCUDNN_DEFAULT_WORKSPACE_LIMIT` | integer | `67108864` (i.e. 64 MiB) | The default workspace limit __in bytes__ for convolutional layers. This is used only if the framework tries to get the workspace size before the workspace limit is provided. |
| `UCUDNN_TOTAL_WORKSPACE_SIZE` | integer | N/A | If set, μ-cuDNN creates one static workspace of the specified size __in bytes__ for ILP-based optimization. Workspace limits passed via the conventional cuDNN functions are ignored. |
| `UCUDNN_DATABASE` | path | N/A | If set, μ-cuDNN uses a SQLite 3 database at the specified path to cache benchmark results. |
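These variables can also be set from the host program before μ-cuDNN reads them. The sketch below uses only POSIX `setenv` and the documented variable names; the helper function, the chosen values, and the database path are illustrative.

```c++
#include <cstdlib>

// Hypothetical initialization helper: sets the documented μ-cuDNN runtime
// options before the first convolution is configured and benchmarked.
void configureUcudnn() {
  // Try all μ-batch sizes, not only powers of two.
  setenv("UCUDNN_BATCH_SIZE_POLICY", "all", /*overwrite=*/1);
  // Benchmark on GPUs 0-3 in parallel (they must be the same GPU model).
  setenv("UCUDNN_BENCHMARK_DEVICES", "0,1,2,3", 1);
  // Default per-layer workspace limit of 256 MiB, in bytes.
  setenv("UCUDNN_DEFAULT_WORKSPACE_LIMIT", "268435456", 1);
  // Cache benchmark results (requires building with UCUDNN_USE_SQLITE=ON).
  setenv("UCUDNN_DATABASE", "/path/to/ucudnn.db", 1);
}
```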