# PT-ALIGN
**Repository Path**: MerrySunlight/pt-align
## Basic Information
- **Project Name**: PT-ALIGN
- **Description**: PT-ALIGN: an open-source dual safety self-alignment approach for LLMs that refines positive and toxic samples and uses a topic-guided red-teaming strategy, achieving safety alignment with only a small amount of human annotation.
🔔 Project ownership statement: this project was developed independently by 许晶鑫; @MerrySunlight only assists with hosting the open-source release on Gitee.
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 33
- **Forks**: 25
- **Created**: 2026-01-17
- **Last Updated**: 2026-02-14
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
Project ownership statement: this project was developed independently by 许晶鑫; @MerrySunlight only assists with hosting the open-source release on Gitee.
# Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions
## Introduction
___
PT-ALIGN is an innovative self-alignment training method that requires fewer than 50 human annotations. For comprehensive details and insights, we kindly direct you to our paper: **Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions**.
## Setup
___
To train your own self-aligned model with an open-source base language model (e.g., LLaMA), or to perform inference on a number of GPUs other than 1, 2, 4, or 8, you should install the [`llama_dromedary`](llama_dromedary) package (the code comes from [Dromedary](https://github.com/IBM/Dromedary)).
In a conda environment with PyTorch / CUDA available, run:
```bash
conda env create -f PN-ALIGN.yml
```
Otherwise, if you only want to perform inference on 1, 2, 4, 8, or 16 GPUs, you can reuse the original LLaMA repo:
```bash
git clone https://github.com/facebookresearch/llama.git
cd llama
pip install -r requirements.txt
pip install -e .
cd ..
```
## Starting Point
___
Prompts are available under the `prompts` directory.
Harmful seed data are available at `prompts/harmful_seed.json`.
Training procedures are available under the `training` directory, where you will find the training pipeline for PT-ALIGN.
Evaluations are available under the `evaluation` directory. We provide comprehensive evaluations for HHH, OpenBookQA, TruthfulQA, PKU-SafeRLHF, and PIQA.
## Training Experiences
We provide the full [training pipeline](training) of `PT-ALIGN` for reproduction.
The whole **PT-ALIGN** process involves three distinct stages.
### Prerequisites
#### Llama-Based Models
For efficiency, we utilize the [model parallel](https://github.com/facebookresearch/fairscale/tree/main/fairscale/nn/model_parallel) scheme from [llama](https://github.com/facebookresearch/llama) when generating synthetic instructions and self-aligned responses. To prepare the sharded model checkpoints of LLaMA and Dromedary on your own machine/cluster, please refer to our [model_sharding](./model_sharding.sh) script.
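For reference, model parallel inference in the original LLaMA codebase is typically initialized as in the sketch below (a minimal example in the style of facebookresearch/llama, not this repository's exact code); `torchrun` supplies the `LOCAL_RANK` and `WORLD_SIZE` environment variables.
```python
# Minimal model parallel setup in the style of facebookresearch/llama
# (illustrative sketch; not this repository's exact initialization code).
import os
import torch
import torch.distributed as dist
from fairscale.nn.model_parallel.initialize import initialize_model_parallel

def setup_model_parallel() -> tuple[int, int]:
    # torchrun sets these environment variables for every worker process.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))

    dist.init_process_group("nccl")        # one process per GPU
    initialize_model_parallel(world_size)  # shard the model across all GPUs
    torch.cuda.set_device(local_rank)
    return local_rank, world_size
```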
#### Other Open-Source Models
For other open-source models whose architectures differ from the Llama model, we use [vllm](https://github.com/vllm-project/vllm) to run the same process efficiently.
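As a rough sketch of what a vLLM-backed generation step looks like (the model name, prompt, and sampling settings below are placeholders, not the repository's actual configuration):
```python
# Batched generation with vLLM (illustrative sketch; the *_vllm.sh scripts
# referenced below wrap the actual pipeline with the project's own prompts).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder base model
sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

prompts = ["<topic brainstorming or instruction generation prompt goes here>"]

# Each output carries the prompt and its generated completions.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```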
### Stage 1: Safety-Driven Red-Teaming
The first stage, **Safety-Driven Red-Teaming**, employs the language model itself to generate synthetic safety-related instructions and enhances their diversity via a topic-guided red-teaming approach.
We use our own instruction prompts for [harmful topic brainstorming](../prompts/topic-gen/tgrt_self_instruct_topic_brainstorm_prompt.txt) and [topic-guided harmful instruction generation](../prompts/question-gen/tgrt_self_instruct_question_generation_prompt.txt). We also create our own [harmful seeds](../prompts/harmful_seed.json).
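Conceptually, the topic-guided red-teaming loop chains these two templates: the model first brainstorms harmful topics, then generates instructions conditioned on each topic. The sketch below only illustrates that flow; `complete()` is a stand-in for whichever backend you use, and the assumption that the question template takes a `{topic}` placeholder is ours for illustration.
```python
# Schematic of the topic-guided red-teaming flow (illustrative only).
from pathlib import Path

def complete(prompt: str) -> str:
    """Stand-in for a call to the unaligned base model (llama_dromedary or vLLM)."""
    raise NotImplementedError

topic_template = Path("prompts/topic-gen/tgrt_self_instruct_topic_brainstorm_prompt.txt").read_text()
question_template = Path("prompts/question-gen/tgrt_self_instruct_question_generation_prompt.txt").read_text()

# Step 1: brainstorm harmful topics, one per line, then deduplicate.
topics = {t.strip() for t in complete(topic_template).splitlines() if t.strip()}

# Step 2: generate safety-related instructions conditioned on each topic.
instructions = []
for topic in topics:
    generated = complete(question_template.format(topic=topic))
    instructions.extend(line.strip() for line in generated.splitlines() if line.strip())
```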
Running the code:
```bash
cd training/step1_safety_driven_red_teaming
# Harmful Topic Generation
bash scripts/safe_tgrt_topic_generate_llama.sh # For Llama Model
# bash scripts/safe_tgrt_topic_generate_vllm.sh # For Other Models
bash deduplicate_topics.sh
# Topic-guided instruction generation
bash scripts/safe_tgrt_question_generate_llama.sh # For Llama Model
# bash scripts/safe_tgrt_question_generate_vllm.sh # For Other Models
bash merge_insts.sh
# Self-Generating Positive and Negative principles and prompts
bash self-prin-gen.sh
```
### Stage 2: Self-Constraint-Driven Positive and Negative Annotation
The second stage, **Self-Constraint-Driven Positive and Negative Annotation**, aims to refine both benign and harmful responses for the safety-related instruction set. This stage strategically induces the LLM to formulate rules that guide it towards generating both harmless and toxic content. By leveraging the self-constraint capabilities of the unaligned LLM, this process significantly reduces the necessity for human supervision.
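Schematically, each red-teaming instruction is answered twice: once under self-generated positive (harmless) principles and once under negative (toxic) principles, yielding the paired samples used in Stage 3. The sketch below only illustrates that pairing; the prompt wording and function names are placeholders, not the repository's actual templates.
```python
# Illustrative pairing of positive and negative annotations (placeholder prompts).
def complete(prompt: str) -> str:
    """Stand-in for a call to the unaligned base model."""
    raise NotImplementedError

def annotate(instruction: str, positive_principles: str, negative_principles: str) -> dict:
    # Positive pass: constrain the model toward a harmless, helpful response.
    positive = complete(
        f"Follow these rules strictly:\n{positive_principles}\n\n"
        f"Instruction: {instruction}\nResponse:"
    )
    # Negative pass: elicit the toxic response that Stage 3 will learn to avoid.
    negative = complete(
        f"Follow these rules strictly:\n{negative_principles}\n\n"
        f"Instruction: {instruction}\nResponse:"
    )
    return {"instruction": instruction, "positive": positive, "negative": negative}
```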
Running the code:
```bash
cd training/step2_self-constraint-driven_positive_and_negative_annotation
# Self-Generating Positive and Negative principles and prompts
bash self-prin-gen.sh
# Generate both positive and negative samples
bash scripts/safe_self_align_generate_purple_llama.sh # Llama Model
# bash scripts/safe_self_align_generate_purple_other.sh # Other Model
bash random_select.sh # randomly select samples
```
### Stage 3: Dual Safety Self-Alignment
The third stage, **Dual Safety Self-Alignment**, leverages the generated datasets to finetune our model using both positive and negative samples.
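A common way to exploit both sample types, and consistent with the `unlikelihood_modeling_*` modules referenced in the script below, is to maximize the likelihood of positive responses while applying an unlikelihood penalty to the tokens of negative responses. The following is a minimal PyTorch sketch of such a combined objective, not the repository's exact loss.
```python
# Sketch of a dual objective: likelihood on positive samples, unlikelihood on
# negative samples (illustrative; not the repository's exact implementation).
import torch
import torch.nn.functional as F

def dual_safety_loss(logits, labels, is_negative, alpha=1.0, ignore_index=-100):
    """logits: (B, T, V); labels: (B, T); is_negative: (B,) bool per sample."""
    # Shift so position t predicts token t+1, as in standard causal LM training.
    logits = logits[:, :-1].contiguous()
    labels = labels[:, 1:].contiguous()
    log_probs = F.log_softmax(logits, dim=-1)

    mask = labels.ne(ignore_index)
    safe_labels = labels.masked_fill(~mask, 0)
    token_logp = log_probs.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)

    # Positive samples: standard negative log-likelihood on target tokens.
    pos_loss = -(token_logp * mask * ~is_negative.unsqueeze(1)).sum()

    # Negative samples: unlikelihood term penalizes probability mass on toxic tokens.
    neg_prob = token_logp.exp().clamp(max=1.0 - 1e-6)
    neg_loss = -(torch.log1p(-neg_prob) * mask * is_negative.unsqueeze(1)).sum()

    return (pos_loss + alpha * neg_loss) / mask.sum().clamp(min=1)
```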
Running the code:
```bash
cd training/step3_dual_safety_self_alignment
bash scripts/finetune_purple.sh # For Llama Model
# For other models such as ChatGLM3-6B, change line 38 of finetune_purple.py:
# from unlikelihood_modeling_pro import LlamaForCausalLM, PeftModelForCausalLM, MyDataCollator
# to
# from unlikelihood_modeling_auto import LlamaForCausalLM, PeftModelForCausalLM, MyDataCollator
# and copy modeling_chatglm.py into the ChatGLM3-6B model folder
```
## Evaluation
Running the code:
```bash
cd mc_evaluation
# Run Evaluations
bash evaluate_hhh.sh
bash evaluate_obqa.sh
bash evaluate_piqa.sh
bash evaluate_pku_safer.sh
bash evaluate_tqa.sh
```
## Acknowledgements
___
This project is built upon or inspired by the following repositories:
* **DeepAudit**: [https://github.com/lintsinghua/XCodeReviewer](https://github.com/lintsinghua/XCodeReviewer)