# ffrecord

**Repository Path**: chen0qi/ffrecord

## Basic Information

- **Project Name**: ffrecord
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-02-14
- **Last Updated**: 2025-09-11

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# FFRecord

FFRecord is a simple file format, developed by HFAiLab, for storing a sequence of binary records. It supports random access and reads via Linux Asynchronous Input/Output (AIO).

## File Format

**Storage Layout:**

```
+-----------------------------------+---------------------------------------+
|             checksum              |                   N                   |
+-----------------------------------+---------------------------------------+
|             checksums             |                offsets                |
+---------------------+---------------------+--------+----------------------+
|      sample 1       |      sample 2       |  ....  |       sample N       |
+---------------------+---------------------+--------+----------------------+
```

**Fields:**

| field     | size (bytes)                  | description                     |
|-----------|-------------------------------|---------------------------------|
| checksum  | 4                             | CRC32 checksum of metadata      |
| N         | 8                             | number of samples               |
| checksums | 4 * N                         | CRC32 checksum of each sample   |
| offsets   | 8 * N                         | byte offset of each sample      |
| sample i  | offsets[i + 1] - offsets[i]   | data of the i-th sample         |

## Get Started

### Requirements

- OS: Linux
- Python >= 3.6
- PyTorch >= 1.6
- NumPy
- tqdm
- zlib: `sudo apt install zlib1g-dev`
- cmake: `pip install cmake`
- pybind11 >= 2.8

### Install

```
python3 setup.py install
```

## Usage

We provide `ffrecord.FileWriter` and `ffrecord.FileReader` for writing and reading, respectively.

### Write

To create a `FileWriter` object, specify a file name and the total number of samples. You can then call `FileWriter.write_one()` to write a sample to the FFRecord file.
It accepts `bytes` or `bytearray` as input and appends the data to the end of the opened file.

```python
import pickle

from ffrecord import FileWriter


def serialize(sample):
    """ Serialize a sample to bytes or bytearray

    You could use anything you like to serialize the sample.
    Here we simply use pickle.dumps().
    """
    return pickle.dumps(sample)


samples = [i for i in range(100)]  # anything you would like to store
fname = 'test.ffr'
n = len(samples)  # number of samples to be written
writer = FileWriter(fname, n)

for i in range(n):
    data = serialize(samples[i])  # data should be bytes or bytearray
    writer.write_one(data)

writer.close()
```

### Read

To create a `FileReader` object, you only need to specify the file name. You can then call `FileReader.read()` to read multiple samples from the FFRecord file. It accepts a list of indices as input and returns the data of the corresponding samples.

If `check_data=True`, the reader validates the checksum of each sample before returning its data.

```python
import pickle

from ffrecord import FileReader


def deserialize(data):
    """ Deserialize bytes data

    The deserialize method should be paired with the serialize method above.
    """
    return pickle.loads(data)


fname = 'test.ffr'
reader = FileReader(fname, check_data=True)
print(f'Number of samples: {reader.n}')

indices = [3, 6, 0, 10]  # indices of the samples to read
data = reader.read(indices)  # return a list of bytes-like data

# iterate over the samples that were actually read
for sample_bytes in data:
    sample = deserialize(sample_bytes)
    # do what you want

reader.close()
```

### Dataset and DataLoader for PyTorch

We also provide `ffrecord.torch.Dataset` and `ffrecord.torch.DataLoader` for PyTorch users to train models using FFRecord.

Unlike `torch.utils.data.Dataset`, which accepts a single index and returns one sample, `ffrecord.torch.Dataset` accepts a batch of indices and returns a batch of samples. One advantage of `ffrecord.torch.Dataset` is that it can read a whole batch of data at a time using Linux AIO.
Users need to inherit from `ffrecord.torch.Dataset` and define their own `__getitem__()` and `__len__()` methods.

For example:

```python
import pickle

import ffrecord.torch
from ffrecord import FileReader


class CustomDataset(ffrecord.torch.Dataset):

    def __init__(self, fname, check_data=True, transform=None):
        self.reader = FileReader(fname, check_data)
        self.transform = transform

    def __len__(self):
        return self.reader.n

    def __getitem__(self, indices):
        # we read a batch of samples at once
        assert isinstance(indices, list)
        data = self.reader.read(indices)

        # deserialize data
        samples = [pickle.loads(b) for b in data]

        # transform data
        if self.transform:
            samples = [self.transform(s) for s in samples]
        return samples


dataset = CustomDataset('train.ffr')
indices = [3, 4, 1, 0]
samples = dataset[indices]
```

`ffrecord.torch.DataLoader` is a drop-in replacement for PyTorch's standard dataloader. `ffrecord.torch.Dataset` can be combined with it just as in PyTorch.

`ffrecord.torch.DataLoader` also supports skipping steps when resuming training, via the `set_step()` method.

```python
dataset = CustomDataset('train.ffr')
loader = ffrecord.torch.DataLoader(dataset, batch_size=16, shuffle=True, num_workers=8)

epochs = 10  # total number of epochs (example value)
start_epoch = 5
start_step = 100  # resume from epoch 5, step 100
loader.set_step(start_step)

for epoch in range(start_epoch, epochs):
    for i, batch in enumerate(loader):
        pass  # train the model on this batch

    loader.set_step(0)  # remember to reset before the next epoch
```

### Pack a folder into ffrecord

FFRecord can also be used to pack a folder into a single file, which can then be accessed without unpacking.
For example, assume we have a folder named `just_a_folder`:

```shell
$ tree just_a_folder
just_a_folder/
├── 001.txt
├── 002.txt
├── 003.txt
├── just_a_figure.png
└── just_another_folder
    ├── 004.txt
    ├── jsonfile.json
    ├── npyfile.npy
    ├── npzfile.npz
    └── another_folder
        └── 005.txt
```

Now we pack this folder into a file named `packed.ffr`:

```python
from ffrecord import pack_folder

pack_folder("just_a_folder", "packed.ffr", verbose=True)
```

We can then access the packed folder through `PackedFolder`:

```python
>>> import io
>>> from ffrecord import PackedFolder
>>>
>>> folder = PackedFolder("packed.ffr")
>>> folder.list()
['001.txt', '002.txt', '003.txt', 'just_a_figure.png', 'just_another_folder']
>>> folder.list('just_another_folder')
['004.txt','jsonfile.json','npyfile.npy','npzfile.npz','another_folder']
>>> folder.is_file("just_another_folder")
False
>>> folder.is_dir("just_another_folder")
True
>>> folder.exists("just_another_folder/another_folder")
True
>>> fp = io.BytesIO(folder.read('001.txt'))
>>> data = fp.read()  # binary data
>>> list_of_data = folder.read(["001.txt", "002.txt"])  # read multiple files by Linux AIO
```

Here are some examples of reading commonly used file formats. Replace the corresponding blocks in your original code with the following.
#### Images:

```python
import io

import cv2
import numpy as np

order = "RGB"
path = "just_a_figure.png"

fp = io.BytesIO(folder.read_one(path))
# cv2.imdecode returns the image in BGR channel order
img = cv2.imdecode(np.frombuffer(fp.read(), np.uint8), cv2.IMREAD_COLOR)
if order == 'RGB':
    img = img[:, :, ::-1].copy()  # reverse the channel axis: BGR -> RGB

# note: cv2.imwrite itself expects BGR channel order
cv2.imwrite("test.png", img)
```

#### Texts:

```python
import io

fp = io.BytesIO(folder.read_one("just_another_folder/another_folder/005.txt"))
bytestring = fp.read()
result_str = bytestring.decode("utf-8")
print(result_str)
```

#### JSON:

```python
import io
import json

fp = io.BytesIO(folder.read_one("just_another_folder/jsonfile.json"))
bytestring = fp.read()
result_str = bytestring.decode("utf-8")
annot = json.loads(result_str)
print(annot)
```

#### Ndarrays saved in .npy file:

```python
import io

import numpy as np

fp = io.BytesIO(folder.read_one("just_another_folder/npyfile.npy"))
result = np.load(fp, allow_pickle=True)
print(result)
```

#### Ndarrays saved in .npz file:

```python
# .npz file is a zip file for ndarrays, generated by np.savez
import io
import zipfile

import numpy as np

fp = io.BytesIO(folder.read_one("just_another_folder/npzfile.npz"))
test = zipfile.ZipFile(fp, allowZip64=True)
print(test.namelist())

# arr_0 is a key in the namelist
with test.open('arr_0.npy', "r") as myfile:
    result = np.load(myfile, allow_pickle=True)
```

#### Write a file directly:

```python
import io

fp = io.BytesIO(folder.read_one("just_another_folder/npyfile.npy"))
with open("just_a_name", 'wb') as f:
    f.write(fp.read())
```
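
#### Sketch of the storage layout:

To make the field table in the File Format section more concrete, here is a minimal, hypothetical sketch that lays out records in memory in the same order (checksum, N, checksums, offsets, samples) and reads one back. It is not the library's implementation. The helper names `pack_records` and `read_record`, the little-endian byte order, the exact span covered by the metadata checksum, and the use of the end of the buffer as the boundary of the last sample are all assumptions made for illustration.

```python
import struct
import zlib


def pack_records(samples):
    """Lay out samples as: checksum | N | checksums | offsets | sample data."""
    n = len(samples)
    header_size = 4 + 8 + 4 * n + 8 * n  # sizes taken from the field table

    checksums = [zlib.crc32(s) & 0xFFFFFFFF for s in samples]

    # Byte offset of each sample, measured from the start of the buffer.
    offsets = []
    pos = header_size
    for s in samples:
        offsets.append(pos)
        pos += len(s)

    # Assumption: little-endian, and the metadata checksum covers
    # N, the per-sample checksums and the offsets.
    meta = struct.pack(f"<Q{n}I{n}Q", n, *checksums, *offsets)
    head = struct.pack("<I", zlib.crc32(meta) & 0xFFFFFFFF)
    return head + meta + b"".join(samples)


def read_record(buf, i, check_data=True):
    """Random access: locate sample i through the offsets table."""
    (n,) = struct.unpack_from("<Q", buf, 4)
    if not 0 <= i < n:
        raise IndexError(i)

    checksums = struct.unpack_from(f"<{n}I", buf, 12)
    offsets = struct.unpack_from(f"<{n}Q", buf, 12 + 4 * n)

    start = offsets[i]
    # Assumption: the last sample runs to the end of the buffer.
    end = offsets[i + 1] if i + 1 < n else len(buf)
    data = buf[start:end]

    if check_data and zlib.crc32(data) & 0xFFFFFFFF != checksums[i]:
        raise ValueError(f"checksum mismatch for sample {i}")
    return data


blob = pack_records([b"first", b"second", b"third"])
print(read_record(blob, 1))
```

Because every offset is stored up front, reading sample `i` needs only the header and one slice of the data region, which is what makes random access cheap in this kind of layout.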