# megakv

**Repository Path**: z_matrix/megakv

## Basic Information

- **Project Name**: megakv
- **Description**: clone https://github.com/pzrq/megakv
- **Primary Language**: C
- **License**: MIT
- **Default Branch**: master

## README

Mega-KV is a high-throughput in-memory key-value store (cache) that takes a novel approach: it offloads the index data structure and the corresponding operations to the GPU. Mega-KV is currently implemented on top of the NVIDIA CUDA APIs and Intel DPDK on Linux, but it can be ported to other GPGPU programming frameworks, such as OpenCL, and to other operating systems as well.

## GETTING STARTED

If you intend to run Mega-KV on AWS `p2.xlarge` instances using the AMIs listed on [Deep Learning AMI CUDA 9 Ubuntu Version][aws-deep-learning-cuda-9], the script in [`bin/setup.sh`](bin/setup.sh) may work for you. If you have a different environment to set up, or wish to understand better what is going on, please follow the USAGE instructions below.

### AWS P2 instance type results

We were able to rent an AWS `p2.xlarge` instance at the almost [too cheap to meter][too-cheap-to-meter] AWS [spot price][spot-price] of $0.1301 per hour, many times cheaper than purchasing the equivalent CPUs, GPUs, motherboard, RAM, PSU, case/rack and other system components. This fulfils the cloud's standard promise of a dramatically lower capex outlay.

#### MegaKV Tesla K80 GPU utilisation out of the box was between 1-5%

Using [`htop`][htop] and [`nvidia-smi`][nvidia-smi] combined with the Mega-KV `src` (see USAGE Steps 4-6), we found that for the default workload both `insert` and `search` performance was still CPU-bound, with GPU utilisation at 1-2% and 4-5% for the `insert` and `search` phases respectively.
Future work could look into why this is the case and investigate how to attain higher utilisation of the available GPU resource, for example by offloading more of the CPU-bound work onto the GPU itself, onto other parts of the system such as the networking stack, or even onto other AWS instances.

#### Example improvement - Modifying `NUM_QUEUE_PER_PORT` and `MAX_WORKER_NUM`

We found that changing `NUM_QUEUE_PER_PORT` and `MAX_WORKER_NUM` in [macros.h][macros-h-opt] from 7 and 12 respectively to 1 and 1 improved out-of-the-box MegaKV insert phase throughput on our `p2.xlarge` instance from ~0.5 to ~2 Mops, and search phase throughput from ~8 to ~18 Mops, a ~4x and ~2x improvement respectively on our machine.

#### The benchmark `rte_eth_dev_count()` returns 0

This most likely means DPDK-enabled network interfaces are not available on P2 instances, [only X1 instances, at the time of writing][aws-dpdk-ena-x1-only]. Future work could wait for a [P2, P3 or CG1 CUDA-enabled][aws-cuda-instances] instance to also have DPDK / ENA support or, as Kai Zhang et al. suggested earlier in this README, consider ports to other GPGPU programming frameworks, such as OpenCL, or support for other operating systems.

## HISTORY

1. Jun 1, 2015: megakv-0.1-alpha. Initial release; basic interfaces for an in-memory key-value store. This is a demo and is not ready for production use yet. Bugs are expected.
2. Nov 1, 2017: For [MongoDB Skunkworks][mongodb-skunkworks] - Updates to run on AWS `p2.xlarge` instances, Intel DPDK v16.11, CUDA 9 and Ubuntu gcc 5.4.0

## PROTOCOL

Mega-KV currently uses a simple self-defined protocol for efficient communication.

* A request packet has a 16-bit magic number in the beginning: 0x1234.
* A request packet has a 16-bit ending mark in the end: 0xFFFF.
* Each GET query in the packet has the format: 16-bit Job Type (0x2), 16-bit Key Length, and the key.
* Each SET query in the packet has the format: 16-bit Job Type (0x3), 16-bit Key Length, 32-bit Value Length, and the key and value.

Anyone can improve or modify this protocol according to practical needs.

## HARDWARE

* NIC: Intel 10 Gigabit NIC that is supported by the Intel DPDK SDK.
* CPU: Intel CPU that supports the SSE instruction set used by the Intel DPDK SDK.
* GPU: NVIDIA GPU newer than GTX680. We have conducted experiments on GTX780.

## USAGE

1. Set up the network with Intel DPDK. We recommend installing Intel DPDK 1.7.1, which is known to work with Mega-KV. Newer versions of DPDK may have compilation problems with Mega-KV. Then run `export RTE_SDK=$(PATH_TO_DPDK)`, replacing `$(PATH_TO_DPDK)` with the path of the DPDK directory.
2. Go to the `libgpuhash` directory and edit `Makefile` to set the correct CUDA installation path. We recommend installing CUDA SDK 6.5, which is known to work with Mega-KV. Some important macros in `gpu_hash.h`:
    * MEM_P: 2^MEM_P bytes of GPU device memory space for the hash table.
    * HASH_CUCKOO/HASH_2CHOICE: cuckoo hash or two-choice hash.
3. Run `make`. This should compile the CUDA hash table library, including cuckoo hash or two-choice hash. Macros can be set in `gpu_hash.h`. This will generate `libgpuhash.a` in the `lib` directory, which is used by Mega-KV as the GPU hash table library.
4. Go to the `src` directory and edit `Makefile` to set the correct CUDA installation path. Set up other macros in `Makefile` and `macros.h` for test or production use. Edit the config variables in `mega.c` for different GPUs or configurations. In the `Makefile`, a macro is disabled with the `_0` suffix. You can enable the macro by removing the suffix. Some important macros in `Makefile`:
    * PREFETCH_BATCH: enable batch prefetching to improve performance.
    * PRELOAD: preload key/value items into Mega-KV before test.
    * LOCAL_TEST: run Mega-KV locally, just for testing.
    * SIGNATURE: enable a simple signature algorithm instead of the one used for testing.
      You can implement a new signature algorithm under this macro.

   Some important macros in `macros.h`:
    * CPU_FREQUENCY_US: set the CPU frequency for the timers.
    * MEM_LIMIT: set the memory limit to avoid using virtual memory.
    * NUM_QUEUE_PER_PORT: number of queues per NIC port. Each queue will have one receiver and one sender.
5. Edit the CPU core mappings in `mega.c`. Three functions launch the Receivers, Senders, and the Scheduler: `mega_launch_receivers`, `mega_launch_senders`, and `mega_launch_scheduler`. You can edit the `context->core_id` assignment to change the core mapping for these threads. To maximize resource and system utilization, Hyper-Threading is recommended. The Nth Receiver and the Nth Sender can be assigned to two virtual cores located on the same physical core. Please note that one physical core should be reserved for the Scheduler so that it is not affected by other threads. The corresponding DPDK parameters may also need to be modified in line 527.
6. Run `make`. This should compile Mega-KV. Then Mega-KV can be run with `./build/megakv`. The above currently defaults to `insert` jobs for about a minute, prints `========================== Hash table has been loaded ==========================` and then switches to `search` jobs, periodically reporting statistics to the terminal.
7. Benchmark. Go to the `benchmark` directory. This is also based on Intel DPDK 1.7.1. Modify macros in `benchmark.h`, and modify the CPU core mappings between line 792 and line 815. Run `make`, then run `sudo ./build/benchmark`. This benchmark currently only supports 8-byte key and 8-byte value generation. NOTE: LOAD_FACTOR, PRELOAD_CNT, and TOTAL_CNT should be the same as in Mega-KV if Mega-KV preloads key-value items locally for testing. Some important macros in `benchmark.h`:
    * DIS_ZIPF/DIS_UNIFORM: key popularity distribution.
    * WORKLOAD_ID: 100% GET or 95% GET

## PERFORMANCE BOTTLENECKS

It should be possible to run the following Linux system utility programs to identify the system's performance bottlenecks:

1. CPU/RAM bottlenecks - [`top`][top] or [`htop`][htop]
2. GPU bottlenecks - [`nvidia-smi`][nvidia-smi]

Additional, more specific tools may also be needed to investigate performance bottlenecks; for a brief overview, please see [this AskUbuntu question][ask-ubuntu-performance].

## LIMITATIONS

1. The UPDATE command is not supported yet.
2. Other memcached fields, such as expiration time, are not supported. However, they are easy to implement and are planned in the roadmap.
3. LOCAL_TEST may not be accurate, because the overhead of key generation is very high, especially with zipf key generation.

## DEVELOPMENT

Go to [http://kay21s.github.io/megakv](http://kay21s.github.io/megakv) for documentation and other development notices. You can contact the author at `kay21s [AT] gmail [DOT] com`.

## DISCLAIMER

This software is not supported by [MongoDB, Inc.](https://www.mongodb.com) under any of their commercial support subscriptions or otherwise. Any usage of Mega-KV is at your own risk. Bug reports, feature requests and questions can be posted in the [Issues](https://github.com/pzrq/megakv/issues?state=open) section on GitHub.


[mongodb-skunkworks]: https://www.mongodb.com/careers/departments/engineering
[aws-deep-learning-cuda-9]: https://aws.amazon.com/marketplace/pp/B076TGJHY1
[too-cheap-to-meter]: https://en.wikipedia.org/wiki/Too_cheap_to_meter
[spot-price]: https://aws.amazon.com/ec2/spot/pricing/
[top]: https://linux.die.net/man/1/top
[htop]: https://linux.die.net/man/1/htop
[nvidia-smi]: https://developer.nvidia.com/nvidia-system-management-interface
[macros-h-opt]: https://github.com/pzrq/megakv/blob/308713f364d4cb66f690722759d3d278a392ee8f/src/macros.h#L29-L31
[aws-dpdk-ena-x1-only]: https://aws.amazon.com/blogs/aws/elastic-network-adapter-high-performance-network-interface-for-amazon-ec2/
[aws-cuda-instances]: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/accelerated-computing-instances.html
[ask-ubuntu-performance]: https://askubuntu.com/questions/1540/how-can-i-find-out-if-a-process-is-cpu-memory-or-disk-bound
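
## PROTOCOL EXAMPLE

The request layout described under PROTOCOL can be illustrated with a minimal C sketch that packs one SET query and one GET query between the magic number and the ending mark. This is not code from the Mega-KV sources: the macro and function names (`REQ_MAGIC`, `put_u16`, `put_get`, `put_set`, `build_example_request`, and so on) are hypothetical. The sketch also assumes that fields are written in host byte order and that queries follow each other without padding, neither of which this README states. It performs no bounds checking, so the caller must supply a large enough buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values quoted in the PROTOCOL section; the macro names are hypothetical. */
#define REQ_MAGIC    0x1234u
#define REQ_END_MARK 0xFFFFu
#define JOB_GET      0x2u
#define JOB_SET      0x3u

/* Append a 16-bit field at offset off and return the new offset.
 * ASSUMPTION: host byte order (the README does not specify one). */
static size_t put_u16(uint8_t *buf, size_t off, uint16_t v)
{
    memcpy(buf + off, &v, sizeof(v));
    return off + sizeof(v);
}

/* Append a 32-bit field at offset off, same byte-order assumption. */
static size_t put_u32(uint8_t *buf, size_t off, uint32_t v)
{
    memcpy(buf + off, &v, sizeof(v));
    return off + sizeof(v);
}

/* GET query: 16-bit job type, 16-bit key length, then the key bytes. */
static size_t put_get(uint8_t *buf, size_t off,
                      const void *key, uint16_t klen)
{
    assert(buf != NULL && key != NULL);
    off = put_u16(buf, off, JOB_GET);
    off = put_u16(buf, off, klen);
    memcpy(buf + off, key, klen);
    return off + klen;
}

/* SET query: 16-bit job type, 16-bit key length, 32-bit value length,
 * then the key bytes followed by the value bytes. */
static size_t put_set(uint8_t *buf, size_t off,
                      const void *key, uint16_t klen,
                      const void *val, uint32_t vlen)
{
    assert(buf != NULL && key != NULL && val != NULL);
    off = put_u16(buf, off, JOB_SET);
    off = put_u16(buf, off, klen);
    off = put_u32(buf, off, vlen);
    memcpy(buf + off, key, klen);
    off += klen;
    memcpy(buf + off, val, vlen);
    return off + vlen;
}

/* Build: magic | SET(key, val) | GET(key) | ending mark.
 * Returns the number of bytes written to buf. */
size_t build_example_request(uint8_t *buf,
                             const void *key, uint16_t klen,
                             const void *val, uint32_t vlen)
{
    size_t off = 0;
    off = put_u16(buf, off, REQ_MAGIC);
    off = put_set(buf, off, key, klen, val, vlen);
    off = put_get(buf, off, key, klen);
    off = put_u16(buf, off, REQ_END_MARK);
    return off;
}
```

With an 8-byte key and an 8-byte value, the sizes the benchmark generates, the SET query takes 2 + 2 + 4 + 8 + 8 = 24 bytes and the GET query takes 2 + 2 + 8 = 12 bytes, so the whole request is 2 + 24 + 12 + 2 = 40 bytes.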