# PT-ALIGN
**Repository Path**: MerrySunlight/pt-align
## Basic Information
- **Project Name**: PT-ALIGN
- **Description**: PT-ALIGN: an open-source dual safety self-alignment approach for LLMs that refines positive and toxic samples and uses a topic-guided red-teaming strategy, achieving safety alignment with only a small amount of human annotation.
🔔 Project ownership statement: this project was developed independently by 许晶鑫; @MerrySunlight only assists with hosting the open-source release on Gitee.
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 33
- **Forks**: 25
- **Created**: 2026-01-17
- **Last Updated**: 2026-02-14
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
Project ownership statement: this project was developed independently by 许晶鑫; @MerrySunlight only assists with hosting the open-source release on Gitee.
# Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions
## Introduction
___
PT-ALIGN is an innovative self-alignment training method that requires fewer than 50 human annotations. For comprehensive details and insights, we kindly direct you to our paper: **Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions**.
## Setup
___
To train your own self-aligned model with an open-source base language model (e.g., LLaMA), or to perform inference on a number of GPUs other than 1, 2, 4, or 8, you should install the [`llama_dromedary`](llama_dromedary) package (the code comes from [Dromedary](https://github.com/IBM/Dromedary)).
In a conda environment with PyTorch / CUDA available, run:
```bash
conda env create -f PN-ALIGN.yml
```
Otherwise, if you only want to perform inference on 1, 2, 4, 8, or 16 GPUs, you can reuse the original LLaMA repo:
```bash
git clone https://github.com/facebookresearch/llama.git
cd llama
pip install -r requirements.txt
pip install -e .
cd ..
```
## Starting Point
___
Prompts are available under the `prompts` directory.
Harmful seed data are available at `prompts/harmful_seed.json`.
Training procedures are available under the `training` directory, where you will find the training pipeline for PT-ALIGN.
Evaluations are available under the `evaluation` directory. We provide comprehensive evaluations for HHH, OpenBookQA, TruthfulQA, PKU-SafeRLHF, and PIQA.
## Training Experiences
We provide the full [training pipeline](training) of `PT-ALIGN` for reproduction.
The whole **PT-ALIGN** process involves three distinct stages.
### Prerequisites
#### Llama-Based Models
For efficiency, we utilize the [model parallel](https://github.com/facebookresearch/fairscale/tree/main/fairscale/nn/model_parallel) scheme from [llama](https://github.com/facebookresearch/llama) when generating synthetic instructions and self-aligned responses. To prepare the sharded model checkpoints of LLaMA and Dromedary on your own machine/cluster, please refer to our [model_sharding](./model_sharding.sh) script.
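For reference, model parallel inference in the original LLaMA codebase is typically initialized as in the sketch below (a minimal example in the style of facebookresearch/llama, not this repository's exact code); `torchrun` supplies the `LOCAL_RANK` and `WORLD_SIZE` environment variables.
```python
# Minimal model parallel setup in the style of facebookresearch/llama
# (illustrative sketch; not this repository's exact initialization code).
import os
import torch
import torch.distributed as dist
from fairscale.nn.model_parallel.initialize import initialize_model_parallel

def setup_model_parallel() -> tuple[int, int]:
    # torchrun sets these environment variables for every worker process.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))

    dist.init_process_group("nccl")        # one process per GPU
    initialize_model_parallel(world_size)  # shard the model across all GPUs
    torch.cuda.set_device(local_rank)
    return local_rank, world_size
```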
#### Other Open-Source Models
For other open-source models whose architectures differ from the Llama model, we use [vllm](https://github.com/vllm-project/vllm) to run the same process efficiently.
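As a rough sketch of what a vLLM-backed generation step looks like (the model name, prompt, and sampling settings below are placeholders, not the repository's actual configuration):
```python
# Batched generation with vLLM (illustrative sketch; the *_vllm.sh scripts
# referenced below wrap the actual pipeline with the project's own prompts).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder base model
sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

prompts = ["<topic brainstorming or instruction generation prompt goes here>"]

# Each output carries the prompt and its generated completions.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```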
### Stage 1: Safety-Driven Red-Teaming
The first stage, **Safety-Driven Red-Teaming**, employs the language model itself to generate synthetic safety-related instructions and enhances their diversity via a topic-guided red-teaming approach.
We use our own instruction prompts for [harmful topic brainstorming](../prompts/topic-gen/tgrt_self_instruct_topic_brainstorm_prompt.txt) and [topic-guided harmful instruction generation](../prompts/question-gen/tgrt_self_instruct_question_generation_prompt.txt). We also create our own [harmful seeds](../prompts/harmful_seed.json).
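Conceptually, the topic-guided red-teaming loop chains these two templates: the model first brainstorms harmful topics, then generates instructions conditioned on each topic. The sketch below only illustrates that flow; `complete()` is a stand-in for whichever backend you use, and the assumption that the question template takes a `{topic}` placeholder is ours for illustration.
```python
# Schematic of the topic-guided red-teaming flow (illustrative only).
from pathlib import Path

def complete(prompt: str) -> str:
    """Stand-in for a call to the unaligned base model (llama_dromedary or vLLM)."""
    raise NotImplementedError

topic_template = Path("prompts/topic-gen/tgrt_self_instruct_topic_brainstorm_prompt.txt").read_text()
question_template = Path("prompts/question-gen/tgrt_self_instruct_question_generation_prompt.txt").read_text()

# Step 1: brainstorm harmful topics, one per line, then deduplicate.
topics = {t.strip() for t in complete(topic_template).splitlines() if t.strip()}

# Step 2: generate safety-related instructions conditioned on each topic.
instructions = []
for topic in topics:
    generated = complete(question_template.format(topic=topic))
    instructions.extend(line.strip() for line in generated.splitlines() if line.strip())
```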
Running the code:
```bash
cd training/step1_safety_driven_red_teaming
# Harmful Topic Generation
bash scripts/safe_tgrt_topic_generate_llama.sh # For Llama Model
# bash scripts/safe_tgrt_topic_generate_vllm.sh # For Other Models
bash deduplicate_topics.sh
# Topic-guided instruction generation
bash scripts/safe_tgrt_question_generate_llama.sh # For Llama Model
# bash scripts/safe_tgrt_question_generate_vllm.sh # For Other Models
bash merge_insts.sh
# Self-Generating Positive and Negative principles and prompts
bash self-prin-gen.sh
```
### Stage 2: Self-Constraint-Driven Positive and Negative Annotation
The second stage, **Self-Constraint-Driven Positive and Negative Annotation**, aims to refine both benign and harmful responses for the safety-related instruction set. This stage strategically induces the LLM to formulate rules that guide it towards generating both harmless and toxic content. By leveraging the self-constraint capabilities of the unaligned LLM, this process significantly reduces the necessity for human supervision.
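Schematically, each red-teaming instruction is answered twice: once under self-generated positive (harmless) principles and once under negative (toxic) principles, yielding the paired samples used in Stage 3. The sketch below only illustrates that pairing; the prompt wording and function names are placeholders, not the repository's actual templates.
```python
# Illustrative pairing of positive and negative annotations (placeholder prompts).
def complete(prompt: str) -> str:
    """Stand-in for a call to the unaligned base model."""
    raise NotImplementedError

def annotate(instruction: str, positive_principles: str, negative_principles: str) -> dict:
    # Positive pass: constrain the model toward a harmless, helpful response.
    positive = complete(
        f"Follow these rules strictly:\n{positive_principles}\n\n"
        f"Instruction: {instruction}\nResponse:"
    )
    # Negative pass: elicit the toxic response that Stage 3 will learn to avoid.
    negative = complete(
        f"Follow these rules strictly:\n{negative_principles}\n\n"
        f"Instruction: {instruction}\nResponse:"
    )
    return {"instruction": instruction, "positive": positive, "negative": negative}
```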
Running the code:
```bash
cd training/step2_self-constraint-driven_positive_and_negative_annotation
# Self-Generating Positive and Negative principles and prompts
bash self-prin-gen.sh
# Generate both positive and negative samples
bash scripts/safe_self_align_generate_purple_llama.sh # Llama Model
# bash scripts/safe_self_align_generate_purple_other.sh # Other Model
bash random_select.sh # randomly select samples
```
### Stage 3: Dual Safety Self-Alignment
The third stage, **Dual Safety Self-Alignment**, leverages the generated datasets to finetune our model using both positive and negative samples.
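A common way to exploit both sample types, and consistent with the `unlikelihood_modeling_*` modules referenced in the script below, is to maximize the likelihood of positive responses while applying an unlikelihood penalty to the tokens of negative responses. The following is a minimal PyTorch sketch of such a combined objective, not the repository's exact loss.
```python
# Sketch of a dual objective: likelihood on positive samples, unlikelihood on
# negative samples (illustrative; not the repository's exact implementation).
import torch
import torch.nn.functional as F

def dual_safety_loss(logits, labels, is_negative, alpha=1.0, ignore_index=-100):
    """logits: (B, T, V); labels: (B, T); is_negative: (B,) bool per sample."""
    # Shift so position t predicts token t+1, as in standard causal LM training.
    logits = logits[:, :-1].contiguous()
    labels = labels[:, 1:].contiguous()
    log_probs = F.log_softmax(logits, dim=-1)

    mask = labels.ne(ignore_index)
    safe_labels = labels.masked_fill(~mask, 0)
    token_logp = log_probs.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)

    # Positive samples: standard negative log-likelihood on target tokens.
    pos_loss = -(token_logp * mask * ~is_negative.unsqueeze(1)).sum()

    # Negative samples: unlikelihood term penalizes probability mass on toxic tokens.
    neg_prob = token_logp.exp().clamp(max=1.0 - 1e-6)
    neg_loss = -(torch.log1p(-neg_prob) * mask * is_negative.unsqueeze(1)).sum()

    return (pos_loss + alpha * neg_loss) / mask.sum().clamp(min=1)
```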
Running the code:
```bash
cd training/step3_dual_safety_self_alignment
bash scripts/finetune_purple.sh # For Llama Model
# For other models such as ChatGLM3-6B, change line 38 of finetune_purple.py:
# from unlikelihood_modeling_pro import LlamaForCausalLM, PeftModelForCausalLM, MyDataCollator
# to
# from unlikelihood_modeling_auto import LlamaForCausalLM, PeftModelForCausalLM, MyDataCollator
# and copy modeling_chatglm.py into the ChatGLM3-6B model folder
```
## Evaluation
Running the code:
```bash
cd mc_evaluation
# Run Evaluations
bash evaluate_hhh.sh
bash evaluate_obqa.sh
bash evaluate_piqa.sh
bash evaluate_pku_safer.sh
bash evaluate_tqa.sh
```
## Acknowledgements
___
This project is built upon or inspired by the following repositories:
* **DeepAudit**: [https://github.com/lintsinghua/XCodeReviewer](https://github.com/lintsinghua/XCodeReviewer)