# ps-lite

**Repository Path**: ByteDance/ps-lite

## Basic Information

- **Project Name**: ps-lite
- **Description**: A lightweight parameter server interface
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: byteps
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-08-26
- **Last Updated**: 2025-10-18

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

This is the communication library for [BytePS](https://github.com/bytedance/byteps). It is designed for high-performance RDMA, but it also supports TCP.

## Build

```bash
git clone -b byteps https://github.com/bytedance/ps-lite
cd ps-lite
make -j USE_RDMA=1
```

- Remove `USE_RDMA=1` if you don't want to build with RDMA ibverbs support.
- Add `USE_FABRIC=1` if you want to build with RDMA libfabric support for the AWS Elastic Fabric Adapter.

To build ps-lite with UCX:

```
# dependencies
sudo apt install -y build-essential libtool autoconf automake libnuma-dev unzip pkg-config

# build ucx
wget https://github.com/openucx/ucx/archive/refs/tags/v1.11.1.tar.gz
tar -xf v1.11.1.tar.gz
cd ucx-1.11.1
(./autogen.sh || ./autogen.sh) && ./configure --enable-logging --enable-mt --with-verbs --with-rdmacm --with-cuda=/usr/local/cuda
make clean && make -j && sudo make install -j

# build ps-lite
cd ..
make clean; USE_UCX=1 CUDA_HOME=/usr/local/cuda USE_CUDA=1 make -j
```

BytePS relies on UCXVan for GPU-related communication, such as intra-node CUDA IPC and inter-node GPU-to-GPU / GPU-to-CPU communication with GPUDirect RDMA. For the list of transports UCX supports, see [link](https://openucx.readthedocs.io/en/master/faq.html?highlight=UCX_TLS#list-of-main-transports-and-aliases).

## Concepts

In ps-lite there are three roles: worker, server, and scheduler. Each role is an independent process. The scheduler is responsible for setting up the connections between workers and servers at initialization, and there must be exactly one scheduler process. A worker process only communicates with server processes, and vice versa; there is no worker-to-worker or server-to-server traffic.

## Tutorial

After the build, you will find two test applications under the `tests/` directory: `test_benchmark` and `test_ipc_benchmark`. Below we explain how to run them.

To debug, set `PS_VERBOSE=1` to see the important logs during connection setup, and `PS_VERBOSE=2` to see a log for every message.

### 1. Basic benchmark

Suppose you want to run 1 worker and 1 server on different machines. You therefore need to launch 3 processes in total (including the scheduler). The scheduler process can be launched on any machine, as it does not affect performance.
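All three processes share the same common environment variables and differ only in `DMLC_ROLE`, so you may find it convenient to wrap the setup in a small launcher script. Below is a minimal sketch; the script name and the IP, port, and interface values are placeholders for illustration, not files or settings shipped with the repository.

```bash
#!/bin/bash
# run_benchmark.sh (hypothetical helper)
# Usage: ./run_benchmark.sh (scheduler|server|worker)
# Adjust the scheduler IP, port, and interface to match your own cluster.
set -e

export DMLC_ENABLE_RDMA=ibverbs    # unset for TCP, or set to "fabric" for EFA
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.2   # scheduler's RDMA interface IP (placeholder)
export DMLC_PS_ROOT_PORT=8123      # scheduler's port (placeholder)
export DMLC_INTERFACE=eth5         # local RDMA interface (placeholder)

# launch the requested role
DMLC_ROLE=$1 ./tests/test_benchmark
```

The equivalent per-role commands are listed below.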
For the scheduler:

```
# common setup
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.2   # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=8123      # scheduler's port (can be chosen freely)
export DMLC_INTERFACE=eth5         # my RDMA interface

# launch scheduler
DMLC_ROLE=scheduler ./tests/test_benchmark
```

For the server:

```
# common setup
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.2   # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=8123      # scheduler's port (can be chosen freely)
export DMLC_INTERFACE=eth5         # my RDMA interface

# launch server
DMLC_ROLE=server ./tests/test_benchmark
```

For the worker:

```
# common setup
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.2   # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=8123      # scheduler's port (can be chosen freely)
export DMLC_INTERFACE=eth5         # my RDMA interface

# launch worker
DMLC_ROLE=worker ./tests/test_benchmark
```

If you want to use libfabric with the Amazon Elastic Fabric Adapter, make sure to set `DMLC_ENABLE_RDMA=fabric` for all processes. If you are using libfabric < 1.10, also set `FI_EFA_ENABLE_SHM_TRANSFER=0` to avoid a bug in the EFA shm provider. If you just want to use TCP, make sure `DMLC_ENABLE_RDMA` is unset for all processes.

### 2. Benchmark with IPC support

The `test_ipc_benchmark` demonstrates how inter-process communication (IPC) helps improve RDMA performance when the server is co-located with the worker.

Suppose you have two machines. Each machine should launch a worker process and a server process.

For the scheduler (you can launch it on either machine-0 or machine-1):

```
# common setup
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=10.0.0.2   # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=8123      # scheduler's port (can be chosen freely)
export DMLC_INTERFACE=eth5         # my RDMA interface

# launch scheduler
DMLC_ROLE=scheduler ./tests/test_ipc_benchmark
```

For machine-0 and machine-1:

```
# common setup
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=10.0.0.2   # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=8123      # scheduler's port (can be chosen freely)
export DMLC_INTERFACE=eth5         # my RDMA interface

# launch server and worker
DMLC_ROLE=server ./tests/test_ipc_benchmark &
DMLC_ROLE=worker ./tests/test_ipc_benchmark
```

Note: this benchmark is only valid for RDMA.

### 3. Other GPU-related benchmarks

```
cd tests; NODE_ONE_IP=xxx NODE_TWO_IP=yyy bash test.sh (local|remote|joint) bytes_per_msg msg_count (push_only|pull_only|push_pull) (cpu2cpu|cpu2gpu|gpu2gpu|gpu2cpu)
```
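For example, assuming the two nodes have IPs 10.0.0.3 and 10.0.0.4 (illustrative values, as are the message size and count), a joint push-pull GPU-to-GPU run would look like:

```
# illustrative invocation; IPs, bytes_per_msg, and msg_count are placeholders
cd tests
NODE_ONE_IP=10.0.0.3 NODE_TWO_IP=10.0.0.4 bash test.sh joint 1024000 100 push_pull gpu2gpu
```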