# SafeSpeech

**Repository Path**: hu-hanqing/SafeSpeech

## Basic Information

- **Project Name**: SafeSpeech
- **License**: MIT
- **Default Branch**: master

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-03-06
- **Last Updated**: 2025-03-06

## README

This is the source code for our paper "SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis" in USENIX Security 2025. We propose SafeSpeech, a proactive framework that uses pivotal objective optimization and Speech PErturbative Concealment (SPEC) to protect publicly uploaded speech from unauthorized and malicious speech synthesis.


\[[Demo Page](https://wxzyd123.github.io/safespeech)\]


## Setup
We ran our experiments on Ubuntu 20.04. At least one GPU is required.

The required dependencies can be installed by running the following:

```bash
conda create --name safespeech python=3.8
conda activate safespeech
pip install -r requirements.txt
sudo apt install ffmpeg
```
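
Before moving on, it can help to confirm that the environment matches the setup above. The following is a minimal sketch and not part of the repository: the helper name `check_environment` is hypothetical, and it only checks the Python version and whether `ffmpeg` is on the `PATH`. It does not check for a GPU.

```python
import shutil
import sys


def check_environment(expected_python=(3, 8)):
    """Return a dict of simple environment checks (hypothetical helper)."""
    return {
        # True when the interpreter's major.minor matches the conda env above.
        "python_version_ok": sys.version_info[:2] == tuple(expected_python),
        # True when an ffmpeg executable can be found on PATH.
        "ffmpeg_found": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'OK' if ok else 'NOT OK'}")
```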


## Pre-trained Models

Before fine-tuning BERT-VITS2, download the pre-trained checkpoints. The instructions below assume that the checkpoint folder is `checkpoints`.

- BERT-VITS2: Download the checkpoints [here](https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/tree/main) to `checkpoints/base_models`.
- DeBERTa: Download the pre-trained model [here](https://huggingface.co/microsoft/deberta-v3-large) to `bert/deberta-v3-large`.
- WavLM: BERT-VITS2 uses the pre-trained WavLM to enhance timbre similarity. Download it [here](https://huggingface.co/microsoft/wavlm-base-plus) to `bert_vits2/slm/wavlm-base-plus`.
- ECAPA-TDNN: We use the ECAPA-TDNN encoder as the timbre extractor. Download it [here](https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb) to `encoders/spkrec-ecapa-voxceleb`.

Alternatively, you can download the models with this command:
```bash
python download_models.py
```
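
To confirm that the downloads landed where the later steps expect them, a small check like the one below may be useful. It is a sketch under the assumption that the script is run from the repository root; the helper name `missing_model_dirs` is hypothetical, and it only tests that each directory listed above exists and is non-empty, not that the files inside are complete.

```python
from pathlib import Path

# Directories named in the list above, relative to the repository root.
EXPECTED_MODEL_DIRS = {
    "BERT-VITS2": "checkpoints/base_models",
    "DeBERTa": "bert/deberta-v3-large",
    "WavLM": "bert_vits2/slm/wavlm-base-plus",
    "ECAPA-TDNN": "encoders/spkrec-ecapa-voxceleb",
}


def missing_model_dirs(root="."):
    """Return the names of models whose directory is absent or empty."""
    missing = []
    for name, rel_path in EXPECTED_MODEL_DIRS.items():
        path = Path(root) / rel_path
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_model_dirs()
    print("All model directories found." if not missing else f"Missing: {missing}")
```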


## 1. Datasets

In our paper, we conduct experiments on two datasets.

For [LibriTTS](http://www.openslr.org/60/), we download the `train-clean-100.tar.gz` subset and select speaker 5339. For [CMU ARCTIC](http://festvox.org/cmu_arctic/packed/), we select 100 sentences from each speaker. You can also protect your own voice data by following the steps below, which use LibriTTS as the example:

1. Move the dataset to `data/{dataset_name}`, using the structure `data/{dataset_name}/{speaker-id}/{name}.wav`.

2. The training dataset is indexed by a file list. Each line of the initial file list has the form `{path}|{speaker-id}|{language}|{text}`, as in the provided `filelists/libritts_train_text.txt`. Convert the file list to the form that BERT-VITS2 accepts with:
   ```bash
   python preprocess_text.py --file-path filelists/libritts_train_text.txt
   ```
   The processed and cleaned file list is written to `filelists/libritts_train_text.txt.cleaned`, which is then used to index the dataset.
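
For a custom dataset, the initial file list has to be written by hand or generated. The sketch below shows one way to read and write lines in the `{path}|{speaker-id}|{language}|{text}` form for audio stored as `data/{dataset_name}/{speaker-id}/{name}.wav`. It is an illustration, not code from the repository: the names `FileListEntry`, `parse_entry`, `format_entry`, and `build_file_list`, the `transcripts` mapping (file stem to text), and the `EN` language tag are all assumptions.

```python
from pathlib import Path
from typing import Dict, List, NamedTuple


class FileListEntry(NamedTuple):
    path: str
    speaker_id: str
    language: str
    text: str


def format_entry(entry: FileListEntry) -> str:
    """Serialize one entry as `{path}|{speaker-id}|{language}|{text}`."""
    return "|".join(entry)


def parse_entry(line: str) -> FileListEntry:
    """Parse one line; the text is the last field, so split at most 3 times."""
    path, speaker_id, language, text = line.rstrip("\n").split("|", 3)
    return FileListEntry(path, speaker_id, language, text)


def build_file_list(dataset_dir: str, transcripts: Dict[str, str],
                    language: str = "EN") -> List[str]:
    """Create file-list lines for every transcribed WAV under {speaker-id}/."""
    lines = []
    for wav in sorted(Path(dataset_dir).glob("*/*.wav")):
        text = transcripts.get(wav.stem)
        if text is None:
            continue  # skip audio that has no transcript
        entry = FileListEntry(wav.as_posix(), wav.parent.name, language, text)
        lines.append(format_entry(entry))
    return lines
```

Splitting at most three times keeps any `|` characters inside the transcript intact, since the text is the final field.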


**Remark**: We provide the LibriTTS data in `data/LibriTTS` and the corresponding file lists in `filelists`. You can use them directly without preprocessing.


## 2. Protect

After obtaining the dataset and confirming that the model runs, you can protect the dataset with SafeSpeech.

1. Get BERT files from DeBERTa-V3:
   ```bash
   python bert_gen.py --dataset LibriTTS --mode clean
   ```

2. Generate perturbation:
   ```bash
   python protect.py --dataset LibriTTS \
                     --model BERT_VITS2 \
                     --batch-size 27 \
                     --gpu 0 \
                     --mode SPEC \
                     --checkpoint-path checkpoints \
                     --epsilon 8 \
                     --perturbation-epochs 200
   ```

    Basic arguments:

   - `--dataset`: which dataset to protect. Default: LibriTTS
   - `--model`: the surrogate model. Default: BERT_VITS2
   - `--batch-size`: the batch size of training and perturbation generation. Default: 27
   - `--gpu`: which GPU to use. Default: 0
   - `--mode`: the protection mode of the SafeSpeech. Default: SPEC
   - `--checkpoint-path`: the directory where checkpoints are stored. Default: checkpoints
   - `--epsilon`: the perturbation radius boundary. Default: 8
   - `--perturbation-epochs`: the optimization iterations of perturbation. Default: 200

    For data protection, we provide two protective modes: `SPEC` and `SafeSpeech`. The `SPEC` mode implements the method proposed in Section 4.1 of the paper, while the `SafeSpeech` mode additionally combines it with the perceptual loss. For other protective methods, please refer to their open-source repositories: [AdvPoison](https://arxiv.org/abs/2106.10807), [SEP](https://github.com/Sizhe-Chen/SEP), [Unlearnable Examples/PTA](https://github.com/HanxunH/Unlearnable-Examples), [AttackVC](https://github.com/cyhuang-tw/attack-vc), and [AntiFake](https://github.com/WUSTL-CSPL/AntiFake).

    This step requires a large amount of GPU memory. We set the batch size to 27 on an A800 GPU with 80 GB of memory.

3. After generating the perturbation, you can save the generated audio by:

    ```bash
    python save_audio.py --mode clean --batch-size 27
    ```

    or

    ```bash
    python save_audio.py --mode SPEC --batch-size 27
    ```

    The saved dataset can be found at `data/{dataset}/protected/{mode}`.
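
The `--epsilon` argument above bounds the perturbation radius. As a rough illustration of what such a bound means, the sketch below clamps a perturbation to a fixed radius before adding it to a waveform. This is a generic sketch and not the code in `protect.py`: the function names, the per-sample (L-infinity) form of the bound, the `[-1, 1]` sample range, and the `epsilon / 255` scaling are all assumptions made for illustration.

```python
from typing import List, Sequence


def project_linf(delta: Sequence[float], radius: float) -> List[float]:
    """Clamp each perturbation sample into [-radius, radius]."""
    return [max(-radius, min(radius, d)) for d in delta]


def apply_perturbation(waveform: Sequence[float], delta: Sequence[float],
                       epsilon: float = 8.0, scale: float = 255.0) -> List[float]:
    """Add a radius-bounded perturbation to a waveform (illustrative only)."""
    radius = epsilon / scale  # assumed scaling, not confirmed by the source
    bounded = project_linf(delta, radius)
    # Keep the perturbed samples inside the assumed valid range [-1, 1].
    return [max(-1.0, min(1.0, x + d)) for x, d in zip(waveform, bounded)]
```

Under these assumptions, a larger `epsilon` allows a stronger but more audible perturbation, and a smaller one keeps the protected audio closer to the original.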


## 3. Fine-tuning

You can fine-tune the model on either the original dataset or the protected dataset.

1. Before training, generate the BERT files with:
    ```bash
    python bert_gen.py --dataset LibriTTS --mode SPEC
    ```

2. Fine-tune on the original dataset without perturbation:
    ```bash
    python train.py --mode clean --batch-size 64
    ```

3. Fine-tune on the dataset protected by SafeSpeech:
    ```bash
    python train.py --mode SPEC --batch-size 64
    ```

After fine-tuning, the checkpoint is saved to `checkpoints/{dataset}`.


## 4. Evaluation

You can evaluate the quality of the synthesized speech with this command:
```bash
python evaluate.py --mode SPEC
```
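
The commands in Sections 2 to 4 can be chained into a single run. The sketch below only assembles them in order and prints them; it is not part of the repository, and the helper name `pipeline_commands` is hypothetical. Every script name and flag is copied from the commands shown in this README, with the same default values.

```python
import shlex
from typing import List


def pipeline_commands(dataset: str = "LibriTTS", mode: str = "SPEC",
                      protect_batch_size: int = 27, train_batch_size: int = 64,
                      gpu: int = 0) -> List[List[str]]:
    """Return the README commands for one protection mode, in order."""
    return [
        ["python", "bert_gen.py", "--dataset", dataset, "--mode", "clean"],
        ["python", "protect.py", "--dataset", dataset, "--model", "BERT_VITS2",
         "--batch-size", str(protect_batch_size), "--gpu", str(gpu),
         "--mode", mode, "--checkpoint-path", "checkpoints",
         "--epsilon", "8", "--perturbation-epochs", "200"],
        ["python", "save_audio.py", "--mode", mode,
         "--batch-size", str(protect_batch_size)],
        ["python", "bert_gen.py", "--dataset", dataset, "--mode", mode],
        ["python", "train.py", "--mode", mode,
         "--batch-size", str(train_batch_size)],
        ["python", "evaluate.py", "--mode", mode],
    ]


if __name__ == "__main__":
    for cmd in pipeline_commands():
        # Replace print with subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

Printing first makes it easy to review the exact commands before committing GPU time to them.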


## Acknowledgment

- [BERT-VITS2](https://github.com/fishaudio/Bert-VITS2)
- [Unlearnable Examples](https://github.com/HanxunH/Unlearnable-Examples)


## Citation

If you find our repository helpful, please consider citing our work in your research or project.

```bibtex
@inproceedings{zhang2025safespeech,
  author = {Zhang, Zhisheng and Wang, Derui and Yang, Qianyi and Huang, Pengyang and Pu, Junhan and Cao, Yuxin and Ye, Kai and Hao, Jie and Yang, Yixian},
  title = {SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis},
  booktitle = {34th USENIX Security Symposium (USENIX Security 25)},
  year = {2025},
  address = {Seattle, WA, USA}
}
```


## Disclaimer
SafeSpeech is intended for protecting sensitive personal information. The publishers and designers of SafeSpeech bear no responsibility for any consequences that arise if users apply this tool to disrupt legitimate and beneficial speech synthesis.