# Introduction to FastBlock

The distributed block storage system in use today (Ceph) can no longer meet requirements for performance, latency, cost, and stability, mainly in the following ways:

- Poor CPU economy: a large amount of CPU is consumed, and the CPU becomes the bottleneck in an NVMe SSD cluster
- Poor availability: with the primary-replica strong synchronous replication strategy, I/O hangs whenever the cluster jitters
- Insufficient single-volume performance: single-volume performance is even worse when attached to QEMU, and multiple volumes are required to drive the whole cluster during stress testing
- Excessive single-volume latency: the low-latency characteristics of NVMe devices are not fully exploited, and RBD block devices typically have millisecond latency
- Insufficient total concurrent performance: IOPS and throughput fall far short of what the hardware can provide

FastBlock is designed to solve these performance and latency problems. Its features:

- It is written on the SPDK programming framework and uses features such as user-mode NVMe drivers and lock-free queues to reduce I/O path latency
- RDMA network cards are introduced for zero-copy, kernel-bypass network communication without CPU intervention
- Multi-Raft is used for data replication to ensure data reliability
- Simple, reliable, and easy-to-customize cluster metadata management

# Fastblock design and architecture

The architecture of FastBlock is very similar to that of Ceph, and many concepts such as Monitor, OSD, and PG have the same meaning as in Ceph for quick understanding, as shown in the following figure:

![arch](docs/architecture.png)

In the figure:

- Compute stands for the compute service.
- The monitor cluster is responsible for maintaining cluster metadata (including the osdmap, pgmap, pool information, and image information), as well as managing pools and PGs.
- Each storage cluster contains multiple storage nodes running multiple object storage daemons (OSDs).
- Control RPC is used to transfer metadata and uses TCP sockets; Data RPC is used to transfer data requests between the client and the OSDs; Raft RPC is used to transfer the RPC messages defined in the Raft paper between OSDs. Data RPC and Raft RPC use protobuf and RDMA (a sketch of this layering follows the list).
- The monitor client is the client module that communicates with the monitor.
- The command dispatcher is the message-processing module that receives and handles data requests from clients.
- The Raft protocol processor handles Raft RPC messages, elections, membership changes, and the other behavior specified by the Raft protocol.
- The Raft log manager manages and persists the Raft log; persisted Raft logs use SPDK blobs.
- The data state machine stores user data and uses the SPDK blobstore.
- The Raft log entry cache caches Raft log entries to improve performance.
- The KV system provides a KV API and uses an SPDK blob for persistence.
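As noted in the list above, Data RPC and Raft RPC keep protobuf for message definition and serialization while moving the byte transport onto RDMA. Below is a minimal sketch of how such a layering can be expressed with protobuf's generic service support; `rdma_connection` and its method are placeholders for illustration, not the actual API in fastblock's msg/ directory.

```cpp
#include <string>

#include <google/protobuf/descriptor.h>
#include <google/protobuf/message.h>
#include <google/protobuf/service.h>

// Placeholder RDMA transport handle; the real RDMA RPC lives in fastblock's
// msg/ directory and its API is not reproduced here.
class rdma_connection {
public:
    // Sends a serialized request and blocks until the serialized reply arrives.
    void send_and_wait(const std::string& method, const std::string& request,
                       std::string* reply) {
        (void)method;
        (void)request;
        reply->clear();  // a real implementation posts RDMA work requests here
    }
};

// A protobuf RpcChannel that keeps protobuf for serialization and hands the
// bytes to an RDMA transport, mirroring the Data RPC / Raft RPC layering.
class rdma_rpc_channel : public google::protobuf::RpcChannel {
public:
    explicit rdma_rpc_channel(rdma_connection* conn) : conn_(conn) {}

    void CallMethod(const google::protobuf::MethodDescriptor* method,
                    google::protobuf::RpcController* /*controller*/,
                    const google::protobuf::Message* request,
                    google::protobuf::Message* response,
                    google::protobuf::Closure* done) override {
        std::string reply;
        // protobuf does the (de)serialization; RDMA only moves bytes.
        conn_->send_and_wait(method->full_name(), request->SerializeAsString(),
                             &reply);
        response->ParseFromString(reply);
        if (done != nullptr) done->Run();
    }

private:
    rdma_connection* conn_;
};
```

A generated protobuf service stub (with `option cc_generic_services = true;` in the .proto file) can then be constructed on top of such a channel.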
# Fastblock components and interaction logic

## Monitor

The monitor service is responsible for maintaining the state of the storage nodes, handling node additions and deletions, storing volume metadata, maintaining the cluster topology, responding to user operations such as creating pools, and spreading Raft groups evenly across OSDs based on the current topology. As a cluster management component, the monitor does not store user data and does not need to pursue extreme performance, so it is implemented in Golang and uses etcd for multi-copy storage.

The monitor cluster is an important guarantee of consistency, because the client and the OSDs see the same view. Client I/O operates only at the PG layer, and both the OSDs and the clients start a timer at startup to fetch osdmap and pgmap information from the monitor, so all OSDs and clients see the same PG state changes and react in the same way, and a write to a specific PG will not land in the wrong place.

For more information, see the Monitor Overview.

## OSD RPC subsystem

The RPC subsystem is the system that connects the modules. To cope with heterogeneous networks, it is implemented in two ways: socket-based (Control RPC) and RDMA-based (Data RPC and Raft RPC). The socket-based path is the classic Linux socket application scenario, while the RDMA-based RPC is implemented using asynchronous RDMA (i.e., RDMA Write) semantics.

![RPC subsystem](docs/rpc_subsystem.png)

The figure above shows how the modules in fastblock are connected. Three types of RPC are used: Control RPC, Data RPC, and Raft RPC.

- Control RPC: passes data such as the osdmap, pgmap, and image information between the client and the monitor, and between the OSDs and the monitor; a socket-based implementation is sufficient for this traffic.
- Data RPC: transfers object data operations and their results between the client and the OSDs. The traffic is relatively large and frequent, so an RDMA-based implementation is required.
- Raft RPC: transfers the Raft protocol messages that protect object data between OSDs. The traffic is relatively large and frequent, so an RDMA-based implementation is required.

Data RPC and Raft RPC use protobuf's RPC framework: the network interaction code uses RDMA, and the data transmitted by the RPC is serialized with protobuf.

## OSD Raft subsystem

Raft achieves consistency by electing a leader and giving it full responsibility for managing the replicated log. The leader receives log entries from clients, replicates them to the other servers, and tells those servers when it is safe to apply the entries to their state machines. There are already many open-source Raft implementations; we refer to willemt's C implementation of Raft and additionally implement multi-raft. This module mainly includes:

- Management of Raft groups, including creating, modifying, and deleting Raft groups (see the dispatch sketch after this list);
- Raft elections and election timeouts;
- Raft log processing, including log caching, log persistence, and log replication to follower nodes;
- Data state machine processing, i.e., persisting data to disk;
- Raft snapshot management and Raft log recovery;
- Raft membership change management (not yet implemented);
- Raft heartbeat merging.
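Multi-raft means that a single OSD hosts one Raft instance per PG and must route every incoming Raft RPC to the right instance, as well as drive all of their timers. The following C++ sketch shows what such a per-OSD dispatch layer can look like; all class and method names here are illustrative, not fastblock's actual code (fastblock builds on willemt's C raft implementation).

```cpp
#include <cstdint>
#include <memory>
#include <unordered_map>

// Hypothetical per-PG raft instance; the real type wraps willemt's C raft
// and therefore has different entry points.
class raft_group {
public:
    explicit raft_group(uint64_t pg_id) : pg_id_(pg_id) {}
    void handle_append_entries() { /* apply an AppendEntries request */ }
    void handle_vote_request()   { /* apply a RequestVote request */ }
    void tick()                  { /* drive election timeout / heartbeat */ }
private:
    uint64_t pg_id_;
};

// One manager per OSD: owns every raft group (PG) placed on this OSD and
// routes incoming Raft RPC messages to the matching group.
class raft_manager {
public:
    void create_group(uint64_t pg_id) {
        groups_.emplace(pg_id, std::make_unique<raft_group>(pg_id));
    }
    void delete_group(uint64_t pg_id) { groups_.erase(pg_id); }

    // Called by the Raft RPC layer; the PG id is carried in every Raft RPC.
    void on_append_entries(uint64_t pg_id) {
        if (auto it = groups_.find(pg_id); it != groups_.end())
            it->second->handle_append_entries();
    }

    // Periodic timer: advance the election/heartbeat state of every group.
    void tick_all() {
        for (auto& [pg_id, group] : groups_) group->tick();
    }

private:
    std::unordered_map<uint64_t, std::unique_ptr<raft_group>> groups_;
};
```

In a layout like this, the heartbeat merging described below fits naturally into the manager, since `tick_all` already sees every group for which this OSD is the leader.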
The implementation of multi-group Raft means that multiple Raft groups coexist, and the leader of each group needs to send heartbeat packets to its followers. With many groups there are many heartbeat packets, which consume a lot of bandwidth and CPU. The solution is simple: since each OSD may belong to multiple Raft groups, heartbeats for groups that share the same leader and the same follower can be merged to reduce the number of packets. As shown in the figure below, there are two PGs (Raft groups), pg1 and pg2, both containing osd1, osd2, and osd3, with osd1 as the leader. Without merging, osd1 needs to send heartbeat(pg1) to osd2 and osd3, and also heartbeat(pg2) to osd2 and osd3. After the heartbeats are merged, osd1 only needs to send a single heartbeat(pg1, pg2) to osd2 and to osd3.

![Heartbeat merging](docs/heartbeat_merge.png)

## OSD KV subsystem

The KV subsystem stores the Raft metadata and the storage system's own data. Because the amount of data is small, an in-memory hash map can hold all of it. The subsystem provides put, remove, and get interfaces and writes the modified entries of the hash map to disk every 10 ms.

## OSD localstore subsystem

Local storage is based on the SPDK Blobstore and consists of three storage modules:

- disk_log: stores the Raft log; each PG (corresponding to one Raft group) corresponds to one SPDK blob.
- object_store: stores object data; each object corresponds to one SPDK blob.
- kv_store: each CPU core has one SPDK blob that holds all KV data to be saved on that core, including the Raft metadata and the storage system's own data.

As shown in the figure below, suppose we run two Raft groups; the localstore provides three storage functions for them: log, object, and kv.

![Local storage engine](docs/osd_localstore.png)

## Client

The client is used to create, modify, and delete images. It converts the user's data operations on an image into operations on objects (the basic data unit handled by the OSD), encapsulates them as Data RPC messages sent to the leader OSD of the corresponding PG, receives the response returned by the leader OSD, and returns the result to the user.

The client can be used in several modes: through SPDK vhost for virtual machines, through NBD for bare metal, and through CSI for container environments. All three modes ultimately call the libfastblock library to convert image operations into object operations and communicate with the OSDs.

The following describes the mode that uses SPDK vhost to serve virtual machines:

After the SPDK resources are initialized, a timer is started to fetch the osdmap, pgmap, and image information from the monitor. Use the SPDK rpc.py script to send a request to the vhost app to create a bdev (bdev_fastblock_create); on receiving the request, the vhost app creates an image, sends the image information to the monitor, creates a bdev device, and then registers the device's operation interface (this interface calls the libfastblock library).
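The operation interface registered above ends up calling libfastblock, which splits every image I/O into per-object requests that are then sent to the leader OSD of each object's PG. The following self-contained C++ sketch shows that kind of translation; the 4 MiB object size and the `image.index` naming scheme are assumptions made for illustration, not fastblock's actual layout.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Illustrative only: fastblock's real object size and naming scheme may differ.
constexpr uint64_t kObjectSize = 4ull * 1024 * 1024;  // assumed 4 MiB objects

struct ObjectRequest {
    std::string object_name;     // object addressed by the Data RPC message
    uint64_t offset_in_object;   // byte offset inside that object
    uint64_t length;             // bytes covered by this request
};

// Split a linear image I/O (offset, length) into per-object requests, which
// the client would then send to the leader OSD of each object's PG.
std::vector<ObjectRequest> split_image_io(const std::string& image,
                                          uint64_t offset, uint64_t length) {
    std::vector<ObjectRequest> reqs;
    while (length > 0) {
        uint64_t obj_idx = offset / kObjectSize;
        uint64_t obj_off = offset % kObjectSize;
        uint64_t chunk = std::min(length, kObjectSize - obj_off);
        reqs.push_back({image + "." + std::to_string(obj_idx), obj_off, chunk});
        offset += chunk;
        length -= chunk;
    }
    return reqs;
}

int main() {
    // Example: an (8 MiB + 4 KiB) write starting 1 MiB into the image.
    for (const auto& r : split_image_io("volume1", 1ull << 20, (8ull << 20) + 4096))
        std::printf("%s off=%llu len=%llu\n", r.object_name.c_str(),
                    (unsigned long long)r.offset_in_object,
                    (unsigned long long)r.length);
    return 0;
}
```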
Next, use the SPDK rpc.py script to send a request to the vhost app to create a vhost-blk controller (vhost_create_blk_controller) for the bdev. On receiving the request, the vhost app opens the bdev device and registers a vhost driver to process vhost messages (it creates a socket that a client such as QEMU can connect to and implements the connection service according to the vhost protocol; this functionality is already implemented in DPDK).

libfastblock converts the user's data operations on the image into operations on objects (the basic data unit handled by the OSD), encapsulates them as Data RPC messages sent to the leader OSD of the corresponding PG, and processes the responses returned by the leader OSD.

# Code structure and compilation

The fastblock code is mainly located in the src, monitor, and msg directories:

- The src directory mainly contains the Raft implementation, RDMA communication, the underlying storage engine, and the block-layer API encapsulation.
- The monitor directory includes cluster metadata storage and management, monitor election, PG allocation, clustermap distribution, and other functions.
- The msg directory contains the RDMA RPC implementation, as well as a simple demo.

When compiling for the first time, you need to obtain dependencies such as spdk and abseil-cpp:

```
./install-deps.sh
```

Then run the following commands to build the Release versions of the monitor and the OSD respectively:

```
./build.sh -t Release -c monitor
./build.sh -t Release -c osd
```

After compilation, the `fastblock-mon` binary is in the `mon/` directory, the `fastblock-osd` binary is in the `build/src/osd/` directory, and the `fastblock-vhost` and `fastblock-client` binaries are in the `build/src/bdev` directory. For subsequent code changes to the OSD or vhost, you only need to recompile by running `make` in the `build/` directory; for changes to the monitor, you only need to recompile in the `mon/` directory.

# Deployment and performance testing

Refer to the deployment and test report. In our test environment, with only one core per OSD, we achieved a latency of less than 100 μs for single-threaded 4K random writes and a concurrent performance of 410,000 IOPS.

# Future works

- Implement volume snapshots and snapshot groups
- Implement volume QoS
- Optimize the multi-core performance of the OSD and the client
- Make the local storage engine recoverable and optimize it
- Add a test system for unit testing, integration testing, and especially failure testing of the Raft layer and the local storage engine
- Connect to a CI system
- Implement a customizable PG-assignment plugin for the monitor
- Implement Raft membership changes and the overall joint debugging of PG allocation with the monitor
- Optimize RDMA connection management for OSD clients
- Offload vhost to the DPU
- Export monitoring data and display it while the cluster is running
- Develop deployment tools and simplify the system configuration files
- Support volume encryption and decryption
- Support volume sharing