
fastblock

A distributed block storage system that uses the mature Raft protocol and is designed for all-flash scenarios.

Description

The current distributed block storage system (Ceph) faces challenges that hinder its ability to meet requirements for performance, latency, cost, and stability. The main issues are:

  • High CPU Cost: A significant amount of CPU resources is consumed, with the CPU becoming a bottleneck in NVMe SSD clusters.
  • Suboptimal Availability: The implementation of a master-slave strong synchronous replication strategy is proving to be a liability. It leads to I/O operations being suspended during instances of cluster jitter.
  • Limited Performance on a Per-Volume Basis: Integration with QEMU reveals a marked degradation in performance. Stress testing also indicates the necessity for multiple volumes to maximize the cluster's throughput.
  • Elevated Latency for Single Volumes: It is unable to capitalize on the low-latency advantages of NVMe devices. RBD typically exhibits latency in the millisecond range.
  • Insufficient Concurrency Performance: There is a noticeable disparity between the IOPS and throughput achievable by the system compared to the potential offered by the underlying hardware.

fastblock is designed to tackle the performance and latency problems of distributed block storage systems. Its key features include:

  • SPDK Framework Usage: Using the SPDK framework, which leverages user-space NVMe drivers and lock-free queues to minimize I/O latency.
  • RDMA Network Cards Integration: Incorporating RDMA network cards to facilitate zero-copy, kernel bypassing, and CPU-independent network communication.
  • Multi-Raft Data Replication: Employing a multi-raft algorithm for data replication to ensure data reliability.
  • Reliable Cluster Metadata Management: Offering a straightforward, reliable, and easily customizable approach to managing cluster metadata.

Software Architecture

The architecture closely resembles that of Ceph and shares many of its concepts, such as monitor, OSD, and PG. For quick understanding, the architecture diagram is shown below.

(Figure: arch)

  • Compute: represents the compute services.
  • Monitor cluster: Responsible for maintaining cluster metadata, including osdMap, pg, pgMap, pool, and image.
  • Storage cluster: Each storage cluster comprises multiple Storage Nodes, and each Storage Node operates several OSDs.
  • Control RPC: Using TCP sockets to transmit metadata.
  • Data RPC and Raft RPC: Data RPC is for transferring data requests between clients and OSDs, while Raft RPC is used for transferring RPC messages between OSDs. Both Data RPC and Raft RPC employ protobuf and RDMA for communication.
  • Monitor Client: A client module for communication with the monitor.
  • Command Dispatcher: A message processing module that receives and handles data requests from clients.
  • Raft Protocol Processor: Handles Raft RPC messages, elections, membership changes, and other operations as stipulated by the Raft protocol.
  • Raft Log Manager: Manages and persists the Raft log, using SPDK blob for the persistence of the Raft log.
  • Data State Machine: Stores user data using SPDK blobstore.
  • Raft Log Entry Cache: Used for caching the Raft log to improve performance.
  • KV System: Provides a key-value API, using SPDK blob for persistence.

Components and Interaction

monitor

The monitor maintains the status of storage nodes and manages node additions and removals. It also handles the metadata for storage volumes, maintains the cluster topology, responds to user requests for creating pools, and creates Raft groups on OSDs based on the current topology. As a cluster management service, the monitor does not store user data and does not aim for extreme performance, so it is implemented in Golang and uses etcd for multi-replica storage.

The monitor cluster is crucial for consistency because it provides a unified view to both clients and OSDs. For client I/O operations, only the PG layer is visible. Both OSDs and clients start a timer at startup to periodically fetch osdmap and pgmap information from the monitor, so all OSDs and clients observe the same changes in PG status and respond accordingly. This design prevents write operations targeted at a specific PG from being misdirected. For more details, refer to the monitor documentation.
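The periodic map refresh described above can be sketched roughly as follows. This is a minimal illustration, not fastblock's code: `ClusterMap`, `MapCache`, and the `version` field are hypothetical names, and the idea that maps carry a version number that decides whether a fetched copy replaces the cached one is an assumption the README does not state.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ClusterMap:
    """Hypothetical snapshot of monitor state; `version` is an assumed field."""
    version: int
    pg_members: Dict[str, List[int]] = field(default_factory=dict)  # PG id -> OSD ids


class MapCache:
    """Keeps the newest map seen so far; a timer would call refresh() periodically."""

    def __init__(self, fetch: Callable[[], ClusterMap]):
        self._fetch = fetch  # stand-in for a Control RPC to the monitor
        self.current = ClusterMap(version=0)

    def refresh(self) -> bool:
        """Fetch once and adopt the result only if it is newer. Returns True on change."""
        latest = self._fetch()
        if latest.version > self.current.version:
            self.current = latest
            return True
        return False


# Simulated monitor replies: the second poll sees no change, the third sees a new map.
replies = iter([
    ClusterMap(1, {"1.0": [1, 2, 3]}),
    ClusterMap(1, {"1.0": [1, 2, 3]}),
    ClusterMap(2, {"1.0": [1, 2, 4]}),
])
cache = MapCache(lambda: next(replies))
changes = [cache.refresh() for _ in range(3)]
```

Because every OSD and client applies the same "newer wins" rule in this sketch, they converge on the same PG membership once each has polled after a change.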

OSD RPC

The RPC subsystem in fastblock interconnects the various modules. To accommodate heterogeneous networks, it is implemented in two ways: socket-based (Control RPC) and RDMA-based (Data RPC and Raft RPC). Socket-based RPC follows the classic Linux socket programming model, while RDMA-based RPC uses asynchronous RDMA (i.e., RDMA write) semantics.

(Figure: OSD RPC subsystem)

There are three types of RPC interactions, as depicted in the diagram:

  • Control RPC: Used for transferring osdmap, pgmap, and image information between clients and monitor, and between OSDs and monitor. Since these data are not large in volume and are not frequently transmitted, a socket-based implementation is suitable.
  • Data RPC: Facilitates the transfer of object data operations and results between clients and OSDs. Due to the larger size and higher frequency of this data, RDMA-based methods are employed.
  • Raft RPC: Used for transferring Raft protocol messages, which include object data, among OSDs. As with Data RPC, the larger data volumes and high frequency call for RDMA-based methods.

Both Data RPC and Raft RPC use RDMA for network communication and protobuf for RPC data serialization.

OSD Raft

Raft achieves consistency in distributed systems by electing a leader and making it responsible for managing the replicated log. The leader receives log entries from clients, replicates these entries to other servers, and coordinates when these entries can be safely applied to their state machines. There are many open-source implementations of Raft; we referenced willemt's C implementation of Raft and additionally implemented multi-raft, which mainly includes:

  • Management of Raft Groups: This involves the creation, modification, and deletion of Raft groups.
  • Raft Election and Election Timeout Handling.
  • Raft Log Processing: This includes caching logs, persisting logs to disk, and replicating logs to follower nodes.
  • Data State Machine Processing: Managing the persistence of data to disk.
  • Raft Snapshot Management and Log Recovery.
  • Raft Membership Changes Management (not yet implemented).
  • Raft Heartbeat Merging.
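As background for the log-processing item in the list above, the standard Raft rule for deciding when replicated entries are safe to apply can be sketched as follows. This is a generic illustration of the Raft commit rule, not code from fastblock or from willemt's library; the function names and the list-based log representation are assumptions made for the example.

```python
from typing import List


def majority_match_index(match_index: List[int]) -> int:
    """Highest log index stored on a majority of replicas.

    `match_index` holds one entry per replica, including the leader's own last index.
    """
    ordered = sorted(match_index, reverse=True)
    majority = len(ordered) // 2 + 1
    return ordered[majority - 1]


def advance_commit(commit_index: int, match_index: List[int],
                   term_at: List[int], current_term: int) -> int:
    """Return the new commit index under Raft's rule that a leader only commits
    entries from its own term by counting replicas.

    `term_at[i]` is the term of log entry i (index 0 is an unused placeholder).
    """
    candidate = majority_match_index(match_index)
    if candidate > commit_index and term_at[candidate] == current_term:
        return candidate
    return commit_index


# Three replicas whose logs reach indexes 5, 3, and 4; entries 3..5 are from term 2.
terms = [0, 1, 1, 2, 2, 2]
new_commit = advance_commit(2, [5, 3, 4], terms, current_term=2)
stale_commit = advance_commit(2, [5, 3, 4], terms, current_term=3)
```

In the second call the leader is in term 3 but the majority-replicated entry is from term 2, so the commit index stays where it was until an entry from the current term is replicated.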

In a multi-group Raft setup, where multiple Raft groups coexist, each group's leader must send heartbeats to its followers. With many Raft groups, this could lead to an excessive number of heartbeats, consuming significant bandwidth and CPU resources. The solution is simple: since an OSD might belong to multiple Raft groups, heartbeats can be consolidated for groups with the same leader and followers, which significantly reduces the number of heartbeat messages. For instance, consider two PGs, pg1 and pg2, that both include osd1, osd2, and osd3, with osd1 as the leader of both. Without consolidation, osd1 would need to send separate heartbeat messages for pg1 and pg2 to both osd2 and osd3. With heartbeat consolidation, osd1 only needs to send one combined heartbeat message to each of osd2 and osd3.

(Figure: Heartbeat Merging)
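The pg1/pg2 example above can be expressed as a small grouping step. The sketch below is only an illustration of the idea; the `Groups` layout and the `merge_heartbeats` function are hypothetical and do not reflect fastblock's actual data structures or message format.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical description of each Raft group: PG name -> (leader OSD, follower OSDs).
Groups = Dict[str, Tuple[str, List[str]]]


def merge_heartbeats(groups: Groups) -> Dict[Tuple[str, str], List[str]]:
    """Group pending heartbeats by (leader, follower) so that each pair
    exchanges one message that lists every PG the two OSDs share."""
    merged: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for pg, (leader, followers) in sorted(groups.items()):
        for follower in followers:
            merged[(leader, follower)].append(pg)
    return dict(merged)


groups = {
    "pg1": ("osd1", ["osd2", "osd3"]),
    "pg2": ("osd1", ["osd2", "osd3"]),
}
# Without merging: one message per PG per follower.
unmerged_count = sum(len(followers) for _, followers in groups.values())
merged = merge_heartbeats(groups)
```

Here `unmerged_count` is 4, while `merged` holds two entries, one for each follower, each covering both PGs. The saving grows with the number of PGs that share the same leader and followers.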

OSD KV

The KV (Key-Value) subsystem is tailored for storing both the metadata of Raft and the data of the storage system itself. The design caters to the small scale of the data involved. It employs an in-memory hash map to store all the data, offering basic operations like put, remove, and get. To ensure data persistence, the modified data in the hash map is written to the disk at intervals of 10 milliseconds.
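A minimal sketch of this pattern, an in-memory map that tracks modified keys and writes them out on a timer, might look like the following. The class name, the dirty-set bookkeeping, the key names, and the dictionary standing in for the on-disk copy are all assumptions for illustration; the real subsystem persists to an SPDK blob.

```python
from typing import Dict, Optional, Set


class DirtyTrackingKV:
    """In-memory map that remembers which keys changed since the last flush."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self._dirty: Set[str] = set()
        self.persisted: Dict[str, bytes] = {}  # stand-in for the on-disk copy

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value
        self._dirty.add(key)

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def remove(self, key: str) -> None:
        if key in self._data:
            del self._data[key]
            self._dirty.add(key)

    def flush(self) -> int:
        """Write changed keys to the persistent copy; a timer would call this
        every 10 ms. Returns the number of keys written or deleted."""
        flushed = len(self._dirty)
        for key in self._dirty:
            if key in self._data:
                self.persisted[key] = self._data[key]
            else:
                self.persisted.pop(key, None)
        self._dirty.clear()
        return flushed


kv = DirtyTrackingKV()
kv.put("raft/pg1/term", b"3")      # hypothetical key names
kv.put("raft/pg1/vote", b"osd1")
first = kv.flush()                 # both new keys are written
kv.remove("raft/pg1/vote")
second = kv.flush()                # only the removal is written
idle = kv.flush()                  # nothing changed, nothing to write
```

Tracking only the modified keys keeps each periodic flush proportional to recent changes instead of to the size of the whole map.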

OSD Localstore

OSD localstore utilizes SPDK blobstore for data storage and is comprised of three key storage modules:

  • disk_log: This module is responsible for storing the Raft log. Each PG corresponds to a Raft group and is associated with a specific SPDK blob.
  • object_store: This module handles the storage of object data, where each object is mapped to an SPDK blob.
  • kv_store: Each CPU core has its own SPDK blob, which stores all KV data required by that core. This includes Raft's metadata and the data of the storage system itself.

For instance, if two Raft groups are running, the localstore provides these three types of storage (log, object, and KV) for both groups.

(Figure: localstore)
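The blob layout described above (one blob per PG log, one per object, one per CPU core for KV data) can be pictured with a toy index. `BlobDirectory` and its methods are hypothetical, and the integer blob IDs are placeholders; in SPDK the blobstore itself assigns blob IDs.

```python
from itertools import count
from typing import Dict, Hashable


class BlobDirectory:
    """Toy index of which blob backs which log, object, or per-core KV store."""

    def __init__(self) -> None:
        self._next_id = count(1)                 # placeholder ID allocator
        self.log_blobs: Dict[str, int] = {}      # PG (Raft group) -> blob
        self.object_blobs: Dict[str, int] = {}   # object name -> blob
        self.kv_blobs: Dict[int, int] = {}       # CPU core -> blob

    def _lookup(self, table: dict, key: Hashable) -> int:
        # Allocate a blob the first time a key is seen, then reuse it.
        if key not in table:
            table[key] = next(self._next_id)
        return table[key]

    def log_blob(self, pg: str) -> int:
        return self._lookup(self.log_blobs, pg)

    def object_blob(self, name: str) -> int:
        return self._lookup(self.object_blobs, name)

    def kv_blob(self, core: int) -> int:
        return self._lookup(self.kv_blobs, core)


# Two Raft groups running on the same core, each writing two objects.
directory = BlobDirectory()
for pg in ("pg1", "pg2"):
    directory.log_blob(pg)
    directory.kv_blob(0)
    for i in range(2):
        directory.object_blob(f"{pg}-obj{i}")
```

In this example the two groups end up with two log blobs and four object blobs, but share a single KV blob because they run on the same core.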

Client

The client creates, modifies, and deletes images. It converts user operations on images into operations on objects (the basic data units processed by OSDs), packages them as Data RPC messages, and sends them to the leader OSD of the PG. It also receives and processes responses from the leader OSD and returns the results to the user.

Clients operate in several modes to suit different environments: SPDK vhost for virtual machines, NBD for bare metal, and CSI for containers. In all of these modes, the libfastblock library converts image operations into object operations, encapsulates them into Data RPC messages, sends them to the leader OSD of the PG, and handles the responses.

The rest of this section focuses on the SPDK vhost mode. The client uses the SPDK library to create a vhost app. After initializing SPDK resources, it sets up a timer to fetch osdmap, pgmap, and image information from the monitor.

An SPDK script (rpc.py) is used to send a request to the vhost app to create a bdev (bdev_fastblock_create). Upon receiving the request, the vhost app creates an image, sends its information to the monitor, creates a bdev device, and registers its operation interface, which calls the libfastblock library.

The rpc.py script is also used to send a request to the vhost app to create a vhost-blk controller (vhost_create_blk_controller). Upon receiving the request, the vhost app opens the bdev device and registers a vhost driver to handle vhost messages. This involves creating a socket for client connections (such as QEMU) and serving connections according to the vhost protocol, a feature already established in DPDK.
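The conversion from image operations to object operations can be illustrated with a short sketch. It is not libfastblock's implementation: the 4 MiB object size, the `image.index` naming scheme, and the CRC32-based PG selection are all assumptions chosen for the example, since the README does not specify them.

```python
import zlib
from typing import List, Tuple

OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size; the README does not state one


def split_into_objects(image: str, offset: int, length: int,
                       object_size: int = OBJECT_SIZE) -> List[Tuple[str, int, int]]:
    """Split an image byte range into (object name, offset in object, length) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        index = offset // object_size
        in_object = offset % object_size
        chunk = min(object_size - in_object, end - offset)
        pieces.append((f"{image}.{index}", in_object, chunk))
        offset += chunk
    return pieces


def pg_for_object(name: str, pg_count: int) -> int:
    """Pick a PG by hashing the object name (CRC32 is an arbitrary choice here)."""
    return zlib.crc32(name.encode()) % pg_count


MIB = 1024 * 1024
# A 2 MiB write starting 3 MiB into the image crosses one object boundary.
pieces = split_into_objects("vol1", 3 * MIB, 2 * MIB)
targets = [pg_for_object(name, pg_count=8) for name, _, _ in pieces]
```

Each piece would then be wrapped in a Data RPC message and sent to the leader OSD of the PG it maps to, which the client looks up in its cached pgmap.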

Build and Compile

The source code is primarily organized into three directories: src, monitor, and msg, each serving distinct functions:

  • src directory primarily contains the implementation of Raft, RDMA communication, the underlying storage engine, and block layer API encapsulations. For more details, you can refer to the documentation of src.
  • monitor directory includes features related to cluster metadata storage management, monitor elections, PG allocation, and distribution of clustermap. Detailed information about this directory can be found in the documentation of monitor.
  • msg directory contains all the implementations of RDMA RPC, along with a simple demo illustrating its use.

For the first compilation, you need to acquire dependencies such as SPDK and abseil-cpp by running the following command:

./install-deps.sh

You can compile the Release versions of monitor and osd by running the following command:

./build.sh -t Release -c monitor
./build.sh -t Release -c osd

After compilation, the binaries for fastblock-mon and fastblock-client will be located in the mon/ directory, while fastblock-osd and fastblock-vhost binaries will be found in the build/src/osd/ and build/src/bdev directories, respectively. For subsequent modifications, if there are code changes in the OSD or vhost, recompilation can be done directly in the build/ directory. Similarly, for any updates to the monitor, a simple make command in the mon/ directory suffices.

Deploy and Test

Please refer to the Deployment and Performance Test Report.

Future Works

  • Volume snapshot and snapshot group features implementation.
  • Volume QoS implementation.
  • Multi-core performance optimization for OSD and client.
  • Recoverability and optimization of local storage engine.
  • Testing system for unit tests, integration tests, and particularly fault testing of the Raft layer and local storage engine.
  • CI system integration.
  • Customizable PG allocation plugin.
  • Raft membership changes and coordination with PG allocation.
  • Optimization of RDMA connection management in OSD client.
  • Support for DPU offload of vhost.
  • Monitoring data export and cluster runtime data display.
  • Deployment tool development and simplification of system configuration files.
  • Volume encryption and decryption support.
  • Volume sharing support.

Contribution

  1. Fork the repository
  2. Create Feat_xxx branch
  3. Commit your code
  4. Create Pull Request

Mulan Permissive Software License,Version 2 Mulan Permissive Software License,Version 2 (Mulan PSL v2) January 2020 http://license.coscl.org.cn/MulanPSL2 Your reproduction, use, modification and distribution of the Software shall be subject to Mulan PSL v2 (this License) with the following terms and conditions: 0. Definition Software means the program and related documents which are licensed under this License and comprise all Contribution(s). Contribution means the copyrightable work licensed by a particular Contributor under this License. Contributor means the Individual or Legal Entity who licenses its copyrightable work under this License. Legal Entity means the entity making a Contribution and all its Affiliates. Affiliates means entities that control, are controlled by, or are under common control with the acting entity under this License, ‘control’ means direct or indirect ownership of at least fifty percent (50%) of the voting power, capital or other securities of controlled or commonly controlled entity. 1. Grant of Copyright License Subject to the terms and conditions of this License, each Contributor hereby grants to you a perpetual, worldwide, royalty-free, non-exclusive, irrevocable copyright license to reproduce, use, modify, or distribute its Contribution, with modification or not. 2. Grant of Patent License Subject to the terms and conditions of this License, each Contributor hereby grants to you a perpetual, worldwide, royalty-free, non-exclusive, irrevocable (except for revocation under this Section) patent license to make, have made, use, offer for sale, sell, import or otherwise transfer its Contribution, where such patent license is only limited to the patent claims owned or controlled by such Contributor now or in future which will be necessarily infringed by its Contribution alone, or by combination of the Contribution with the Software to which the Contribution was contributed. 
The patent license shall not apply to any modification of the Contribution, and any other combination which includes the Contribution. If you or your Affiliates directly or indirectly institute patent litigation (including a cross claim or counterclaim in a litigation) or other patent enforcement activities against any individual or entity by alleging that the Software or any Contribution in it infringes patents, then any patent license granted to you under this License for the Software shall terminate as of the date such litigation or activity is filed or taken. 3. No Trademark License No trademark license is granted to use the trade names, trademarks, service marks, or product names of Contributor, except as required to fulfill notice requirements in Section 4. 4. Distribution Restriction You may distribute the Software in any medium with or without modification, whether in source or executable forms, provided that you provide recipients with a copy of this License and retain copyright, patent, trademark and disclaimer statements in the Software. 5. Disclaimer of Warranty and Limitation of Liability THE SOFTWARE AND CONTRIBUTION IN IT ARE PROVIDED WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL ANY CONTRIBUTOR OR COPYRIGHT HOLDER BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE SOFTWARE OR THE CONTRIBUTION IN IT, NO MATTER HOW IT’S CAUSED OR BASED ON WHICH LEGAL THEORY, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 6. Language THIS LICENSE IS WRITTEN IN BOTH CHINESE AND ENGLISH, AND THE CHINESE VERSION AND ENGLISH VERSION SHALL HAVE THE SAME LEGAL EFFECT. IN THE CASE OF DIVERGENCE BETWEEN THE CHINESE AND ENGLISH VERSIONS, THE CHINESE VERSION SHALL PREVAIL. 
END OF THE TERMS AND CONDITIONS How to Apply the Mulan Permissive Software License,Version 2 (Mulan PSL v2) to Your Software To apply the Mulan PSL v2 to your work, for easy identification by recipients, you are suggested to complete following three steps: i Fill in the blanks in following statement, including insert your software name, the year of the first publication of your software, and your name identified as the copyright owner; ii Create a file named “LICENSE” which contains the whole context of this License in the first directory of your software package; iii Attach the statement to the appropriate annotated syntax at the beginning of each source file. Copyright (c) [Year] [name of copyright holder] [Software Name] is licensed under Mulan PSL v2. You can use this software according to the terms and conditions of the Mulan PSL v2. You may obtain a copy of Mulan PSL v2 at: http://license.coscl.org.cn/MulanPSL2 THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY OR FIT FOR A PARTICULAR PURPOSE. See the Mulan PSL v2 for more details.
