# pls-loss

**Repository Path**: AI-Group/pls-loss

## Basic Information

- **Project Name**: pls-loss
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-01-14
- **Last Updated**: 2026-01-30

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# PLS-Loss: Partial Label Smoothing Loss for Large Language Models

![License](https://img.shields.io/badge/license-MIT-blue.svg) ![Python](https://img.shields.io/badge/python-3.10%2B-blue.svg) ![PyTorch](https://img.shields.io/badge/PyTorch-2.4.0-orange.svg)

Authors: [Xueming Hou]()\*

## 🔔 NEWS

- **[01/18/2026]** Our paper!

## Table of Contents

- [PLS-Loss: Partial Label Smoothing Loss for Large Language Models](#pls-loss-partial-label-smoothing-loss-for-large-language-models)
  - [Table of Contents](#table-of-contents)
  - [Features](#features)
  - [Hardware Requirements](#hardware-requirements)
  - [Installation](#installation)
  - [Data Preparation](#data-preparation)
    - [Fineweb-Edu-100B](#fineweb-edu-100b)
  - [Pretraining](#pretraining)
  - [Evaluation](#evaluation)
  - [Acknowledgements](#acknowledgements)
  - [Star History](#star-history)
  - [Citation](#citation)

## Features

- **PLS-Loss:** Implements the partial label smoothing loss.

## Hardware Requirements

A100 or H100 GPUs are recommended. At least 8×80 GB of VRAM is required.

## Installation

Ensure you have Python 3.10 or higher installed. It is recommended to use a virtual environment to manage dependencies.

1. **Clone the Repository**

   ```bash
   git clone https://github.com/tensorgi/TPA.git
   cd TPA
   ```

2. **Create and Activate a Virtual Environment**

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```
3. **Install Required Packages**

   ```bash
   pip install torch==2.4.0 numpy transformers datasets tiktoken wandb tqdm
   ```

## Data Preparation

Prepare the necessary datasets before pretraining the model. Supported dataset: [Fineweb-Edu-100B](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu/).

### Fineweb-Edu-100B

Fineweb-Edu-100B is a large-scale educational dataset hosted on Hugging Face.

1. **Navigate to the Data Directory**

   ```bash
   cd data/fineweb-edu
   ```

2. **Run the Data Preparation Script**

   ```bash
   python fineweb-edu.py
   ```

3. **Move the Prepared Data**

   ```bash
   mv fineweb-edu100B ..
   cd ../..
   ```

4. **Generate the n-gram Model**

   ```bash
   cp data/fineweb-edu/read_tokens_train.py /path/to/kenlm/build/
   python read_tokens_train.py | ./bin/lmplz -o 3 -S 50% -T /data/tmp > 100Bo3.arpa
   ```

5. **Generate the Cache Model**

   ```bash
   python arpa2binary_partial.py --arpa 100Bo3.arpa --binary 100Bo3.bin.pt --partial 100000000
   python query.py --lm_path 100Bo3.bin.pt --cache_path 100Bo3.cache --mode fast --shard_size 2000000 --max_cands 1000
   ```

   Note: even with `max_cands==100`, `100Bo2.bin.pt` surprisingly did not reduce the size of the stored results. Currently, memory is nearly exhausted once `100Bo3.bin.pt` has read the 17th shard, so the program stalls; the next step may be to save results in buckets, e.g., split them into 100 buckets by the last two digits of the key, compute and save the buckets incrementally every 10 shards, and finally merge all buckets together.

## Pretraining

Pretrain the model using the prepared datasets. The provided scripts support distributed training across multiple GPUs.

1. **Baseline**

   For more control or customization, use `torchrun` to initiate training. Replace `config/train_llama_small_adam_80g8.py` with your desired configuration file.

   ```bash
   torchrun --standalone --nproc_per_node=8 \
       train_fw.py \
       config/train_llama_small_adam_80g8.py
   ```

   - `--nproc_per_node=8` specifies the number of processes (typically matching the number of GPUs).

2. **PLS-Loss**

   Update `train_fw_pls.py` with the correct path to the `cache_model`.

   ```bash
   torchrun --standalone --nproc_per_node=8 \
       train_fw_pls.py \
       config/train_llama_small_adam_80g8_pls.py
   ```

## Evaluation

Evaluate the performance of the pretrained model using standardized benchmarks.
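For intuition about the training objective: standard label smoothing spreads the smoothing mass over the entire vocabulary, while partial label smoothing spreads it only over a candidate set, such as the n-gram continuations stored in the cache model. The sketch below is an illustrative assumption and not the repository's implementation; the function name, the `eps` parameter, and the boolean candidate mask are all invented for this example.

```python
import numpy as np

def partial_label_smoothing_loss(logits, target, cand_mask, eps=0.1):
    """Cross-entropy against a target distribution that puts 1-eps on the
    gold token and spreads eps uniformly over a candidate set, instead of
    over the full vocabulary as in ordinary label smoothing."""
    # log-softmax, shifted by the max for numerical stability
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())

    vocab = logits.shape[0]
    q = np.zeros(vocab)
    n_cand = cand_mask.sum()
    if n_cand > 0:
        q[cand_mask] = eps / n_cand   # smoothing mass on candidates only
    q[target] += 1.0 - eps            # remaining mass on the gold token
    # q sums to 1 whether or not the gold token is in the candidate set
    return -(q * log_probs).sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])
cands = np.array([True, True, False, False])
ce = -np.log(np.exp(logits[0]) / np.exp(logits).sum())
print(np.isclose(partial_label_smoothing_loss(logits, 0, cands, eps=0.0), ce))  # True
```

With `eps=0` the loss reduces to ordinary cross-entropy; increasing `eps` shifts probability mass from the gold token toward the candidate set.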
1. **Navigate to the Evaluation Harness Directory**

   ```bash
   cd lm-evaluation-harness
   ```

2. **Follow the Instructions Within This Directory**

   *Ensure your model is compatible with the evaluation harness requirements.*

## Acknowledgements

- [Karpathy's nanoGPT](https://github.com/karpathy/nanoGPT) provides the foundational codebase upon which this repo is built.
- [Hugging Face](https://huggingface.co/) for providing the [Fineweb-Edu-100B](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu/) dataset.
- [EleutherAI](https://www.eleuther.ai/) for the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).
- [tensorgi/TPA](https://github.com/tensorgi/TPA) provides the codebase this repo directly builds on.

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=bigcash/PLS-Loss&type=Date)](https://star-history.com/#bigcash/PLS-Loss&Date)

## Citation

If you use PLS-Loss in your research or application, please consider citing it!

```bibtex
@article{hou2026pls-loss,
  title={PLS-Loss: Partial Label Smoothing Loss for Large Language Models},
  author={Xueming Hou},
  journal={arXiv preprint arXiv:},
  year={2026},
}
```