# hakkero-dataloader

**Repository Path**: underdogs/hakkero-dataloader

## Basic Information

- **Project Name**: hakkero-dataloader
- **Description**: No description available
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: feat
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-01-11
- **Last Updated**: 2025-01-11

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

hakkero-dataloader
------------------

A general-purpose dataloader built on top of the PyTorch `DataLoader`.


## 1. How to use

### 1.1 Build Index

Install the package with `pip install hakkero-dataloader`, then use the `hakkero` command to build the index. Its help output lists the available options:

```shell
hakkero -h

usage: hakkero [-h] [--version] [--filename FILENAME] [--output OUTPUT] --dtype {legacy,message,preference} [--num_workers NUM_WORKERS] [--not_shuf]

build index for dataset

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --filename FILENAME   full filename of jsonl file
  --output OUTPUT       output path for saving data.jsonl and index.h5
  --dtype {legacy,message,preference}
                        data type
  --num_workers NUM_WORKERS
                        number of workers
  --not_shuf            not shuf data
```
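
As a rough illustration of the workflow, the sketch below writes a tiny JSONL file in the `legacy` layout (see section 2.2) and assembles a matching `hakkero` command from the flags listed in the help output. The file names, the sample records, and the helper `write_jsonl` are hypothetical; the command is only printed, not executed.

```python
import json
import shlex
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line (hypothetical helper, not part of hakkero)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Two made-up samples in the `legacy` layout: a `uid` plus a `data` dict.
records = [
    {"uid": "0", "data": {"title": "Greeting", "text": "Hello world."}},
    {"uid": "1", "data": {"question": "1 + 1 = ?", "label": "2"}},
]

workdir = Path(tempfile.mkdtemp())
source = workdir / "raw.jsonl"    # hypothetical input file
output = workdir / "my_dataset"   # hypothetical output directory
write_jsonl(records, source)

# Only flags that appear in `hakkero -h` are used here.
command = [
    "hakkero",
    "--filename", str(source),
    "--output", str(output),
    "--dtype", "legacy",
    "--num_workers", "2",
]
print(shlex.join(command))
```

The printed line can be pasted into a shell once the paths point at real data.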

### 1.2 Use In Training

```python
from hakkero.dataset import get_dataset

# pretrain or sft
from hakkero.dataset import PadLoader
from hakkero.dataset import UnpadLoader

# preference
from hakkero.dataset import PreferencePadLoader
from hakkero.dataset import PreferenceUnpadLoader

dp_world_size, dp_rank = 1, 0
tokenizer = ...
batch_size = 4
max_length = 4096
n_workers = 2

dataset = get_dataset(
    config="/path/to/dataset",
    tokenizer=tokenizer,
    num_epochs=-1,
    max_length=max_length,
    homogeneous=True,
    seed=9527,
    rank=dp_rank,
    world_size=dp_world_size,
    n_workers=n_workers,
    # segment and tokenize strategies; alternatively, set them in `config` and pass st_segment=None and st_tokenize=None
    st_segment="naive",
    st_tokenize="legacy",
    # add bos/eos tokens for the legacy tokenize strategy
    add_bos_token=True,
    add_eos_token=True,
    # normalize dataset weights by the number of target tokens
    norm_weight_with_n_targets=False,
)

dataloader = UnpadLoader(dataset, max_total_length=batch_size * max_length)
prefetcher = dataloader.prefetch(n_workers)

for step, batch in enumerate(prefetcher, start=0):
    print(batch)
```
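
In a distributed job, `rank` and `world_size` are usually not hard-coded. A minimal sketch, assuming a launcher such as `torchrun` that exports the `RANK` and `WORLD_SIZE` environment variables and a pure data-parallel setup (with tensor or pipeline parallelism the data-parallel rank differs from the global rank). `dp_info_from_env` is a hypothetical helper, not part of hakkero.

```python
import os


def dp_info_from_env(env=None):
    """Return (dp_world_size, dp_rank), defaulting to a single process."""
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    rank = int(env.get("RANK", "0"))
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is out of range for world size {world_size}")
    return world_size, rank


# Example with explicit values; in a real job, call dp_info_from_env() instead.
dp_world_size, dp_rank = dp_info_from_env({"WORLD_SIZE": "8", "RANK": "3"})
print(dp_world_size, dp_rank)
```

Called without arguments, the helper reads `os.environ`; the resulting values can then be passed as `world_size` and `rank` to `get_dataset`.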

Example of `config`:
```json
{
    "hermes25_1":
    {
        "group": "en",
        "name": "hermes25_1",
        "epoch": 1,
        "path": "hermes25",
        "strategy":
        {
            "st_segment": "integrous",
            "st_tokenize": "hg"
        },
        "weight": 0.5
    },
    "hermes25_2":
    {
        "group": "en",
        "name": "hermes25_2",
        "epoch": 1,
        "path": "hermes25",
        "strategy":
        {
            "st_segment": "integrous",
            "st_tokenize": "hg"
        },
        "weight": 0.5
    }
}
```
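
A config with this layout can also be generated programmatically. The sketch below builds entries with the same keys and writes them to disk; `make_entry` and the output file name are hypothetical, and the equal weights are only an example, since this README does not state whether weights must sum to 1.

```python
import json
import tempfile
from pathlib import Path


def make_entry(name, path, weight, group="en", epoch=1,
               st_segment="integrous", st_tokenize="hg"):
    """Build one dataset entry using the keys shown in the example config."""
    return {
        "group": group,
        "name": name,
        "epoch": epoch,
        "path": path,
        "strategy": {"st_segment": st_segment, "st_tokenize": st_tokenize},
        "weight": weight,
    }


names = ["hermes25_1", "hermes25_2"]
config = {name: make_entry(name, "hermes25", 1 / len(names)) for name in names}

config_path = Path(tempfile.mkdtemp()) / "dataset_config.json"  # hypothetical file name
config_path.write_text(json.dumps(config, indent=4), encoding="utf-8")
print(config_path.read_text(encoding="utf-8"))
```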

## 2. Supported Strategies

See [segmentation.py](./hakkero/dataset/strategy/segmentation.py) and [tokenization.py](./hakkero/dataset/strategy/tokenization.py) for more details.

### 2.1 Segmentation Strategies

- `integrous`: discard samples that exceed `max_length`.
- `concat`: split long samples, concatenate them with the previous segment, and shuffle all segments.
  - Does not support preference data.
- `naive`: split long samples into segments of random length and shuffle all segments.
  - Does not support preference data.
- `unbiased`: split samples that exceed `max_length` into segments of random length and shuffle all segments.
  - Does not support preference data.
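
To make the difference between discarding and splitting concrete, here is a simplified sketch on plain token-id lists. It is not the code in `segmentation.py`: `integrous_filter` and `split_random` are hypothetical helpers, the way random lengths are drawn is an assumption, and concatenation with previous segments (as in `concat`) is not modeled.

```python
import random


def integrous_filter(samples, max_length):
    """Keep only samples that fit in max_length; longer ones are discarded."""
    return [s for s in samples if len(s) <= max_length]


def split_random(sample, max_length, rng):
    """Split a sample that exceeds max_length into segments of random length.

    Assumption: each segment length is drawn uniformly from [1, max_length].
    """
    if len(sample) <= max_length:
        return [sample]
    segments, start = [], 0
    while start < len(sample):
        length = rng.randint(1, max_length)
        segments.append(sample[start:start + length])
        start += length
    return segments


rng = random.Random(9527)
samples = [list(range(5)), list(range(23))]
max_length = 8

kept = integrous_filter(samples, max_length)
segments = [seg for s in samples for seg in split_random(s, max_length, rng)]
in_order = list(segments)   # copy in original order, for inspection
rng.shuffle(segments)       # the splitting strategies shuffle all segments

print(len(kept), len(segments))
```

No token is lost by splitting, whereas the filter drops the 23-token sample entirely.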

### 2.2 Tokenization Strategies

- `legacy`: join the text fields with `\n\n` as the delimiter and encode the result with `tokenizer.encode`.
  - format of input data
    ```json
    {
      "uid": "xxx",
      "data":
      {
          "title": "xxx",
          "summary": "xxx",
          "abstract": "xxx",
          "text": "xxx",
          "question": "xxx",
          "answer": "xxx",
          "code": "xxx",
          "label": "xxx"
      }
    }
    ```

    - All fields except `label` are stripped and joined with `\n\n` to form the context.
    - `label` is the target to learn during finetuning (pretraining data should not have a `label` field).
    - See the `legacy` function in [tokenization.py](./hakkero/dataset/strategy/tokenization.py) for more details.
  - extra parameters: `add_bos_token`, `add_eos_token`

- `hg`: Hugging Face message data; uses `tokenizer.apply_chat_template` to encode the input.
  - format of input data
    ```json
    {
      "uid": "xx",
      "data": [
        {"role": "user", "content": "xxx"},
        {"role": "assistant", "content": "xxx"},
         ...
      ]
    }
    ```

    See func `huggingface_message` in [tokenization.py](./hakkero/dataset/strategy/tokenization.py) for more details.

- `chatml`: chat message data; uses the ChatML format to encode the input.
  - format of input data
    ```json
    {
      "uid": "xx",
      "data": [
        {"role": "user", "content": "xxx"},
        {"role": "assistant", "content": "xxx"},
         ...
      ]
    }
    ```

    See func `chatml_message` in [tokenization.py](./hakkero/dataset/strategy/tokenization.py) for more details.
- `chatml_qwen2_vl_message`: vision-language chat message data; uses the ChatML format to encode the input.
  - format of input data
    ```json
    {
      "uid": "xx",
      "data": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": "images/2.jpg"
                },
                {
                    "type": "text",
                    "text": "Who is he?"
                }
            ]
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": "He is Thomas Müller from Bayern Munich."
                }
            ]
        },
         ...
      ]
    }
    ```

    See func `chatml_qwen2_vl_message` in [tokenization.py](./hakkero/dataset/strategy/tokenization.py) for more details.
    Only the `integrous` segmentation strategy is supported.

- `hg_preference`: preference data; uses `tokenizer.apply_chat_template` to encode the input.
  - format of input data
    ```json
    {
      "uid": "xx",
      "data": {
        "context": [
          {"role": "user", "content": "xxx"},
          {"role": "assistant", "content": "xxx"},
          ...
          {"role": "user", "content": "xxx"}
        ],
        "chosen": "chosen response",
        "rejected": "rejected response"
      }
    }
    ```
    
    See func `huggingface_preference` in [tokenization.py](./hakkero/dataset/strategy/tokenization.py) for more details.

- `chatml_preference`: preference data; uses the ChatML format to encode the input.
  - format of input data
    ```json
    {
      "uid": "xx",
      "data": {
        "context": [
          {"role": "user", "content": "xxx"},
          {"role": "assistant", "content": "xxx"},
          ...
          {"role": "user", "content": "xxx"}
        ],
        "chosen": "chosen response",
        "rejected": "rejected response"
      }
    }
    ```
    
    See func `chatml_preference` in [tokenization.py](./hakkero/dataset/strategy/tokenization.py) for more details.
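
For the `legacy` strategy, the joining rule described above can be sketched as follows. This is an illustration, not the `legacy` function from `tokenization.py`: `build_legacy_text` is a hypothetical helper, skipping empty fields and keeping the field order of the input dict are assumptions, and tokenization itself (`tokenizer.encode`, BOS/EOS handling) is left out.

```python
def build_legacy_text(data):
    """Return (context, label) for the `data` dict of one `legacy` record.

    All string fields except `label` are stripped and joined with a blank
    line; `label` (if present) is returned separately as the target.
    """
    parts = [
        value.strip()
        for key, value in data.items()
        if key != "label" and isinstance(value, str) and value.strip()
    ]
    context = "\n\n".join(parts)
    label = data.get("label")
    return context, label


record = {
    "uid": "demo",
    "data": {"title": " A title ", "question": "What is 2 + 2?", "label": "4"},
}
context, label = build_legacy_text(record["data"])
print(repr(context), repr(label))
```

A pretraining record without a `label` field yields `None` as the second value, so the whole text is treated as context.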