# align

**Repository Path**: yang-wenxiao-111/align

## Basic Information

- **Project Name**: align
- **Description**: Reproduction of this work
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-07-24
- **Last Updated**: 2025-08-28

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README


* LoRD: Locality Reinforced Distillation

This repository contains the source code of our paper [[https://arxiv.org/abs/2409.02718]["Yes, My LoRD." Guiding Language Model Extraction with Locality Reinforced Distillation.]].


Feel free to give me any feedback via issues or email (=zi1415926.liang@connect.polyu.hk=) when you reproduce our work.

** Introduction of LoRD

Model extraction attacks (MEAs) on large language models (LLMs) have received increasing attention in recent research. However, existing attack methods typically adapt the extraction strategies originally developed for deep neural networks (DNNs). They neglect the underlying inconsistency between the training tasks of MEA and LLM alignment, leading to suboptimal attack performance. To tackle this issue, we propose Locality Reinforced
Distillation (LoRD), a novel model extraction algorithm
specifically designed for LLMs. In particular, LoRD employs a newly
defined policy-gradient-style training task that utilizes the
responses of the victim model as the signal to guide the crafting of
preference for the local model. Theoretical analyses demonstrate that
/i)/ the convergence procedure of
LoRD in model extraction is consistent with the alignment procedure of
LLMs, and /ii)/ LoRD can reduce
query complexity while mitigating watermark protection through
exploration-based stealing. Extensive experiments on domain-specific
extractions validate the superiority of our method in extracting
various state-of-the-art commercial LLMs.

#+ATTR_HTML: :align center
[[file:images/intro.png]]

This figure provides a comparison between vanilla MEAs on conventional DNNs (left) and MEAs on LLMs with alignments (right).

Consistent with the training procedure of conventional DNNs, the vanilla extraction procedure employs a supervised loss. However, when extra training tasks such as reinforcement learning are integrated and play an important role in the training of LLMs, this consistency no longer holds, which challenges the effectiveness of vanilla MEAs on LLMs. The question is: *is a supervised loss (e.g., MLE) suitable for extracting an RL-aligned LLM?*

In our paper, we show that the answer is yes. However, stealing LLMs this way suffers from two potential drawbacks:

+ Low query efficiency. Ideally, a supervised loss requires a query complexity on the order of $O(V^{N_{q}}\cdot V^{N_{r}})$ to learn from an LLM, where $V$ is the vocabulary size, and $N_{q}$ and $N_{r}$ denote the sequence lengths of the query and the response.
+ Vulnerability to text watermarks. Current MEAs learn a watermarked local model when stealing from a watermarked victim.


We aim to address these two drawbacks in our research.

** LoRD 
#+ATTR_HTML: :align center
[[file:images/lord.png]]


The core idea of LoRD is to let the local model explore the correct responses under the guidance of the victim model's gold response. The victim model acts as the "Lord". This design has three advantages:

#+ATTR_HTML: :align center
[[file:images/po.png]]

+ Query Efficiency. It can provide multiple responses under the same input query, reducing the complexity from $O(V^{N_{q}}\cdot V^{N_{r}})$ to $O(V^{N_{q}}\cdot C)$ with $C$ a constant.

+ Watermark resistance. It achieves a trade-off between the stealing performance and the watermark residue.

#+ATTR_HTML: :align center
[[file:images/cp.png]]

+ Stealing consistency. Its stealing procedure is consistent with the RL alignment procedure of LLMs.
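
To get a feel for the gap between the two complexity bounds quoted in this section, the following sketch compares them on a log scale. The values chosen for $V$, $N_{q}$, $N_{r}$ and $C$ are illustrative assumptions, not numbers from the paper, and both helper names are hypothetical.

#+begin_src python
import math


def log10_supervised_bound(vocab_size: int, n_query: int, n_response: int) -> float:
    """log10 of V^{N_q} * V^{N_r}, the bound quoted for a supervised loss."""
    return (n_query + n_response) * math.log10(vocab_size)


def log10_lord_bound(vocab_size: int, n_query: int, constant: float) -> float:
    """log10 of V^{N_q} * C, the bound quoted for LoRD."""
    return n_query * math.log10(vocab_size) + math.log10(constant)


if __name__ == "__main__":
    # Assumed values, chosen only to illustrate the orders of magnitude.
    V, N_q, N_r, C = 32_000, 16, 64, 8
    print(f"supervised bound: about 10^{log10_supervised_bound(V, N_q, N_r):.1f}")
    print(f"LoRD bound:       about 10^{log10_lord_bound(V, N_q, C):.1f}")
#+end_src

Working in log space avoids building astronomically large integers; the difference between the two exponents is what the response-length factor $V^{N_{r}}$ contributes.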

** Evaluation

*** Environments

You need a fresh environment with =python>3.8= and an NVIDIA GPU with CUDA support.

Clone this repository, =cd= into the cloned directory, and run =pip install -r re.txt= to install the dependencies.
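
After installation, a quick sanity check of the environment might look like the sketch below. It assumes that =python>3.8= is read as "Python 3.9 or newer" (relax =minimum= if 3.8.x is acceptable) and that PyTorch is among the installed dependencies; neither is confirmed by this README, and both function names are hypothetical.

#+begin_src python
import importlib.util
import sys


def python_is_new_enough(version_info=sys.version_info, minimum=(3, 9)) -> bool:
    """True if the interpreter's (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


def cuda_status() -> str:
    """Report CUDA availability without crashing when PyTorch is absent."""
    # Assumption: the project relies on PyTorch for GPU access.
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch

    return "cuda available" if torch.cuda.is_available() else "cuda not available"


if __name__ == "__main__":
    print("python ok:", python_is_new_enough())
    print("gpu:", cuda_status())
#+end_src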

*** Read the Source Code / Use LoRD

The most convenient way is to reuse =train_pod2.py=.

*** Explanations of the Source Code


+ *scripts*: all of the commands used to evaluate a method. Run one with =bash XXXX.sh=.
+ *.py*: core code.
  + with the =eval_= prefix: evaluation
  + with =draw_= or =plot_=: plotting
  + with =_process=: data processing
  + with =train=: different training methods
+ *watermark*: code for the watermark experiments.

All other directories are scratch space for storing checkpoints or results.
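
The naming conventions above can be turned into a small lookup when browsing the repository. This is only a sketch: =classify_script= is a hypothetical helper, and treating =eval_=, =draw_= and =plot_= as prefixes while matching =_process= and =train= anywhere in the name is an assumption about how the conventions are applied.

#+begin_src python
def classify_script(filename: str) -> str:
    """Map a file name to the role suggested by the README's naming conventions."""
    if not filename.endswith(".py"):
        return "other"
    stem = filename[: -len(".py")]
    if stem.startswith("eval_"):
        return "evaluation"
    if stem.startswith(("draw_", "plot_")):
        return "plotting"
    if "_process" in stem:
        return "data processing"
    if "train" in stem:
        return "training"
    return "other"


if __name__ == "__main__":
    # File names taken from elsewhere in this README.
    for name in ["lord_train.py", "train_pod2.py", "wmt_process.py", "plot_watermark_curve.py"]:
        print(f"{name}: {classify_script(name)}")
#+end_src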

*** Experiments

**** Effectiveness comparison

#+ATTR_HTML: :align center
[[file:images/mea-table.png]]

#+ATTR_HTML: :align center
[[file:images/mea-table2.png]]

#+ATTR_HTML: :align center
[[file:images/mea-table3.png]]

The above experiments can be reproduced by running =6.X.xxxxx.sh= in =./scripts=. Here is an example:

 
 #+BEGIN_SRC sh

   #!/bin/bash

   echo "HOME: ${HOME}"
   export python=${HOME}/anaconda3/envs/align/bin/python3
   export CUDA_VISIBLE_DEVICES="1"
   export TORCH_USE_CUDA_DSA="1"
   export root_dir="${HOME}/alignmentExtraction/"
   export POD_save_dir="${root_dir}/wmt16_ckpts/"
   export from_path="meta-llama/Meta-Llama-3-8B-Instruct"
   export TRAIN_NUMS=(16)
   export train_times=(1 2 3 4 5)
   export msl=256
   export task_ls=("cs-en" "de-en" "fi-en")
   export train_taskls=("LoRD-VI")

   export is_black_box=1
   export use_lora=1

   export epoch=2
   export period=1

   export sub_set_num=1
   export sub_stage_num=256
   export max_new_tokens=64
   export infer_batch_size=1
   export batch_size=1

   export beta=-1
   export temperature=-1

   export use_old_logits=1
   export use_vic_logits=1
   export use_kld=0
   export use_entropy=0

   # export tau1=0.85
   export tau1=0.80
   export tau2=0.85

   for train_num in ${TRAIN_NUMS[*]}
   do
       for train_time in ${train_times[*]}
       do
           for task in ${task_ls[*]}
           do
               for train_task in ${train_taskls[*]}
               do
                   echo "====================================================="
                   echo "+++++++train_num: ${train_num}+++++++"
                   echo "+++++++train_time: ${train_time}+++++++"
                   echo "+++++++task: ${task}+++++++"
                   echo "+++++++train_task: ${train_task}+++++++"
                   echo "====================================================="

                   export save_path="${POD_save_dir}WMTTT0519${task}${train_num}${train_time}${train_task}"

                   $python ${root_dir}lord_train.py\
                       --use_lora=$use_lora \
                       --from_path=$from_path \
                       --is_black_box=$is_black_box \
                       --sub_set_num=$sub_set_num \
                       --sub_stage_num=$sub_stage_num\
                       --infer_batch_size=$infer_batch_size\
                       --tau1=$tau1 \
                       --tau2=$tau2 \
                       --task=$train_task \
                       --device="cuda" \
                       --epoch=$epoch \
                       --period_num=$period \
                       --acc_step=1 \
                       --log_step=50 \
                       --train_num=$train_num \
                       --max_new_tokens=$max_new_tokens \
                       --LR="3e-5" \
                       --save_step=$sub_stage_num \
                       --beta=$beta \
                       --temperature=$temperature \
                       --batch_size=$batch_size \
                       --use_old_logits=$use_old_logits\
                       --use_vic_logits=$use_vic_logits\
                       --use_kld=$use_kld\
                       --max_length=$msl \
                       --dataset_task=$task \
                       --save_path=$save_path
                   echo "DONE FOR ONE TRAIN NUMBERS...."
               done
           done
       done
   done


   $python ${root_dir}wmt_process.py
 #+END_SRC

In the above script, you can switch to another dataset by changing the entries of =task_ls=. The dataset names listed in =./lord_train.py= are:


 #+begin_src python 

    tasks_glue = [
        "cola", "mnli",
        "mrpc",
        "qnli", "qqp", "rte", "sst2",
        "wnli",]

    tasks_wmt16 = [
        "cs-en",
        "de-en",
        "fi-en",
        "ro-en",
        "ru-en",
        "tr-en",
    ]

    tasks_wmt16_wrmk=[
        "cs-en@wrmk",
        "de-en@wrmk",
        "fi-en@wrmk",
        "ro-en@wrmk",
        ]

    tasks_qa = [
        "piqa",
        "truthful_qa",
        "allenai/ai2_arc",
    ]

    tasks_code = [
        "deepmind/code_contests",
        ]

    tasks_data2text = [
        "e2e_nlg",
        "allenai/common_gen",
    ]

    tasks_data2text_wrmk=[
        "e2e_nlg@wrmk",
        "allenai/common_gen@wrmk",
        ]

    tasks_sum = [
        "UCL-DARK/openai-tldr-filtered",
        "cnn_dailymail",
        "samsum",
    ]

    tasks_text2sql = [
        "wikisql",
        "spider",
    ]

    tasks_safety = [
        "PKU-Alignment/PKU-SafeRLHF",
        "thu-coai/diasafety",
        ]

    tasks_general = [
        "liangzid/claude3_chat3.3k",
        "liangzid/claude3_short256",
        "teknium/GPT4-LLM-Cleaned",
        "BAAI/Infinity-Instruct",
    ]

#+end_src
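
For orientation, the nested loops of the example shell script in this section launch one training run per combination of =TRAIN_NUMS=, =train_times=, =task_ls= and =train_taskls=, each with its own =save_path=. The sketch below enumerates the same grid in Python; =build_save_paths= is a hypothetical helper, and the relative =wmt16_ckpts/= directory stands in for =POD_save_dir=.

#+begin_src python
from itertools import product


def build_save_paths(save_dir, train_nums, train_times, tasks, train_tasks, tag="WMTTT0519"):
    """Return one save path per run, in the order the shell loops visit them."""
    paths = []
    # product() varies the rightmost argument fastest, like the innermost loop.
    for train_num, train_time, task, train_task in product(
        train_nums, train_times, tasks, train_tasks
    ):
        paths.append(f"{save_dir}{tag}{task}{train_num}{train_time}{train_task}")
    return paths


if __name__ == "__main__":
    paths = build_save_paths(
        "wmt16_ckpts/",  # stand-in for POD_save_dir
        train_nums=[16],
        train_times=[1, 2, 3, 4, 5],
        tasks=["cs-en", "de-en", "fi-en"],
        train_tasks=["LoRD-VI"],
    )
    print(len(paths), "runs")
    print(paths[0])
#+end_src

With the values from the script this gives 1 x 5 x 3 x 1 = 15 runs, which is a useful number to keep in mind when estimating GPU time before adding more entries to =task_ls=.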

 
The following figure shows a spectrum of the results.

#+ATTR_HTML: :align center
[[file:images/spectrum.png]]

**** Watermark Resistance Experiments

We implement our text watermarks with the green-list-based watermarking method of Kirchenbauer et al.

The original code comes from [[https://github.com/jwkirchenbauer/lm-watermarking][here]]. All rights belong to the authors of the original repository.
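
As background, the detection side of a green-list watermark can be sketched in a few lines. This is a toy illustration of the z-score test described by Kirchenbauer et al., not the code in =./watermark= or in the original repository; the seeding scheme, the function names, and the default =gamma= and =key= values are simplifying assumptions.

#+begin_src python
import math
import random


def green_list(prev_token: int, vocab_size: int, gamma: float = 0.5, key: int = 15485863) -> set:
    """Pseudo-randomly pick a gamma-fraction of the vocabulary, seeded by the previous token."""
    rng = random.Random(key * prev_token)  # toy seeding scheme (assumption)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])


def z_score(green_count: int, scored_tokens: int, gamma: float = 0.5) -> float:
    """z = (green_count - gamma * T) / sqrt(T * gamma * (1 - gamma))."""
    return (green_count - gamma * scored_tokens) / math.sqrt(
        scored_tokens * gamma * (1 - gamma)
    )


def detect_z_score(token_ids, vocab_size: int, gamma: float = 0.5, key: int = 15485863) -> float:
    """Count tokens that land in their predecessor's green list and return the z-score."""
    hits = sum(
        cur in green_list(prev, vocab_size, gamma, key)
        for prev, cur in zip(token_ids, token_ids[1:])
    )
    return z_score(hits, len(token_ids) - 1, gamma)
#+end_src

A z-score near zero means the text contains about as many green tokens as chance would predict, while a large positive value suggests that the watermark is present.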

#+ATTR_HTML: :align center
[[file:images/wm-ex.png]]

Our evaluation code is in =./watermark=.

=./watermark/llama3_watermark_gen.py= shows how to generate watermarked texts for llama3-70B.

Run =bash ./watermark/1.1.train_with_wtmk.sh= to run all of the watermark experiments.

Detection and visualization are done with the following commands (=$python= and =$root_dir= are set as in the training script above):
#+BEGIN_SRC sh

  $python ${root_dir}watermark/watermark_detect.py

  $python ${root_dir}plot_watermark_curve.py
#+END_SRC


**** Hyper-parameter Experiments


#+ATTR_HTML: :align center
[[file:images/querytime-ex.png]]


#+ATTR_HTML: :align center
[[file:images/model-ex.png]]


**** Fidelity

#+ATTR_HTML: :align center
[[file:images/fidelity.png]]

**** Distribution to Victim Models

#+ATTR_HTML: :align center
[[file:images/corre-dist.png]]


** Reference

#+begin_src bib
  @misc{liang2025yeslordguidinglanguage,
      title={"Yes, My LoRD." Guiding Language Model Extraction with Locality Reinforced Distillation}, 
      author={Zi Liang and Qingqing Ye and Yanyun Wang and Sen Zhang and Yaxin Xiao and Ronghua Li and Jianliang Xu and Haibo Hu},
      year={2025},
      eprint={2409.02718},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2409.02718}, 
}
#+end_src