# WhisperJAV **Repository Path**: jackielaw1990/WhisperJAV ## Basic Information - **Project Name**: WhisperJAV - **Description**: No description available - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 1 - **Created**: 2026-03-20 - **Last Updated**: 2026-03-20 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # WhisperJAV

A subtitle generator for Japanese Adult Videos. --- ### What is the idea: Transformer-based ASR architectures like Whisper suffer significant performance degradation when applied to the **spontaneous and noisy domain of JAV**. This degradation is driven by specific acoustic and temporal characteristics that defy the statistical distributions of standard training data. #### 1. The Acoustic Profile JAV audio is defined by "acoustic hell" and a low Signal-to-Noise Ratio (SNR), characterized by: * **Non-Verbal Vocalisations (NVVs):** A high density of physiological sounds (heavy breathing, gasps, sighs) and "obscene sounds" that lack clear harmonic structure. * **Spectral Mimicry:** These vocalizations often possess "curve-like spectrum features" that mimic the formants of fricative consonants or Japanese syllables (e.g., *fu*), acting as accidental adversarial examples that trick the model into recognizing words where none exist. * **Extreme Dynamics:** Volatile shifts in audio intensity, ranging from faint whispers (*sasayaki*) to high-decibel screams, which confuse standard gain control and attention mechanisms. * **Linguistic Variance:** The prevalence of theatrical onomatopoeia and *Role Language* (*Yakuwarigo*) containing exaggerated intonations and slang absent from standard corpora. #### 2. Temporal Drift and Hallucination While standard ASR models are typically trained on short, curated clips, JAV content comprises long-form media often exceeding 120 minutes. Research indicates that processing such extended inputs causes **contextual drift** and error accumulation. Specifically, extended periods of "ambiguous audio" (silence or rhythmic breathing) cause the Transformer's attention mechanism to collapse, triggering repetitive **hallucination loops** where the model generates unrelated text to fill the acoustic void. #### 3. The Pre-processing Paradox & Fine-Tuning Risks Standard audio engineering intuition—such as aggressive denoising or vocal separation—often fails in this domain. Because Whisper relies on specific **log-Mel spectrogram** features, generic normalization tools can inadvertently strip high-frequency transients essential for distinguishing consonants, resulting in "domain shift" and erroneous transcriptions. Consequently, audio processing requires a "surgical," multi-stage approach (like VAD clamping) rather than blanket filtering. Furthermore, while fine-tuning models on domain-specific data can be effective, it presents a high risk of **overfitting**. Due to the scarcity of high-quality, ethically sourced JAV datasets, fine-tuned models often become brittle, losing their generalization capabilities and leading to inconsistent "hit or miss" quality outputs. **WhisperJAV** is an attempt to address above failure points. The inference pipelines do: 1. **Acoustic Filtering:** Deploys **scene-based segmentation** and VAD clamping under the hypothesis that distinct scenes possess uniform acoustic characteristics, ensuring the model processes coherent audio environments rather than mixed streams [1-3]. 2. **Linguistic Adaptation:** Normalizes **domain-specific terminology** and preserves onomatopoeia, specifically correcting dialect-induced tokenization errors (e.g., in *Kansai-ben*) that standard BPE tokenizers fail to parse [4, 5]. 3. **Defensive Decoding:** Tunes **log-probability thresholding** and `no_speech_threshold` to systematically discard low-confidence outputs (hallucinations), while utilizing regex filters to clean non-lexical markers (e.g., `(moans)`) from the final subtitle track [6, 7]. --- ## Quick Start ### GUI (Recommended for most users) ```bash whisperjav-gui ``` A window opens. Add your files, pick a mode, click Start. ### Command Line ```bash # Basic usage whisperjav video.mp4 # Specify mode and sensitivity whisperjav audio.mp3 --mode balanced --sensitivity aggressive # Process a folder whisperjav /path/to/media_folder --output-dir ./subtitles ``` --- ## Features ### Processing Modes | Mode | Backend | Scene Detection | VAD | Best For | |------|---------|-----------------|-----|----------| | **faster** | stable-ts (turbo) | No | No | Speed priority, clean audio | | **fast** | stable-ts | Yes | No | General use, mixed quality | | **balanced** | faster-whisper | Yes | Yes | Default. Noisy audio, dialogue-heavy | | **fidelity** | OpenAI Whisper | Yes | Yes (Silero) | Maximum accuracy, slower | | **transformers** | HuggingFace | Optional | Internal | Japanese-optimized model, customizable | ### Sensitivity Settings - **Conservative**: Higher thresholds, fewer hallucinations. Good for noisy content. - **Balanced**: Default. Works for most content. - **Aggressive**: Lower thresholds, catches more dialogue. Good for whisper/ASMR content. ### Transformers Mode (New in v1.7) Uses HuggingFace's `kotoba-tech/kotoba-whisper-v2.2` model, which is optimized for Japanese conversational speech: ```bash whisperjav video.mp4 --mode transformers # Customize parameters whisperjav video.mp4 --mode transformers --hf-beam-size 5 --hf-chunk-length 20 ``` **Transformers-specific options:** - `--hf-model-id`: Model (default: `kotoba-tech/kotoba-whisper-v2.2`) - `--hf-chunk-length`: Seconds per chunk (default: 15) - `--hf-beam-size`: Beam search width (default: 5) - `--hf-temperature`: Sampling temperature (default: 0.0) - `--hf-scene`: Scene detection method (`none`, `auditok`, `silero`, `semantic`) ### Two-Pass Ensemble Mode (New in v1.7) Runs your video through two different pipelines and merges results. Different models catch different things. ```bash # Pass 1 with transformers, Pass 2 with balanced whisperjav video.mp4 --ensemble --pass1-pipeline transformers --pass2-pipeline balanced # Custom sensitivity per pass whisperjav video.mp4 --ensemble --pass1-pipeline balanced --pass1-sensitivity aggressive --pass2-pipeline fidelity ``` **Merge strategies:** - `smart_merge` (default): Intelligent overlap detection - `pass1_primary` / `pass2_primary`: Prioritize one pass, fill gaps from other - `full_merge`: Combine everything from both passes ### Speech Enhancement tools (New in v1.7.3) Pre-process audio scenes. When selected runs per-scene after scene detection. Note: Only use for surgical reasons. In general any audio processing that may alter mel-spectogram has the potential to introduce more artefacts and hallucination. ```bash # ClearVoice denoising (48kHz, best quality) whisperjav video.mp4 --mode balanced --pass1-speech-enhancer clearvoice # ClearVoice with specific 16kHz model whisperjav video.mp4 --mode balanced --pass1-speech-enhancer clearvoice:FRCRN_SE_16K # FFmpeg DSP filters (lightweight, always available) whisperjav video.mp4 --mode balanced --pass1-speech-enhancer ffmpeg-dsp:loudnorm,denoise # ZipEnhancer (lightweight SOTA) whisperjav video.mp4 --mode balanced --pass1-speech-enhancer zipenhancer # BS-RoFormer vocal isolation whisperjav video.mp4 --mode balanced --pass1-speech-enhancer bs-roformer # Ensemble with different enhancers per pass whisperjav video.mp4 --ensemble \ --pass1-pipeline balanced --pass1-speech-enhancer clearvoice \ --pass2-pipeline transformers --pass2-speech-enhancer none ``` **Available backends:** | Backend | Description | Models/Options | |---------|-------------|----------------| | `none` | No enhancement (default) | - | | `ffmpeg-dsp` | FFmpeg audio filters | `loudnorm`, `denoise`, `compress`, `highpass`, `lowpass`, `deess` | | `clearvoice` | ClearerVoice denoising | `MossFormer2_SE_48K` (default), `FRCRN_SE_16K` | | `zipenhancer` | ZipEnhancer 16kHz | `torch` (GPU), `onnx` (CPU) | | `bs-roformer` | Vocal isolation | `vocals`, `other` | **Syntax:** `--pass1-speech-enhancer ` or `--pass1-speech-enhancer :` ### GUI Parameter Customization The GUI has three tabs: 1. **Transcription Mode**: Select pipeline, sensitivity, language 2. **Advanced Options**: Model override, scene detection method, debug settings 3. **Two-Pass Ensemble**: Configure both passes with full parameter customization via JSON editor The Ensemble tab lets you customize beam size, temperature, VAD thresholds, and other ASR parameters without editing config files. ### AI Translation Generate subtitles and translate them in one step: ```bash # Generate and translate whisperjav video.mp4 --translate # Or translate existing subtitles whisperjav-translate -i subtitles.srt --provider deepseek ``` Supports DeepSeek (cheap), Gemini (free tier), Claude, GPT-4, OpenRouter, and local LLMs. #### Local LLM Translation (No API Key Required) Run translation entirely on your GPU - no cloud API needed: ```bash whisperjav-translate -i subtitles.srt --provider local ``` Available models: | Model | VRAM | Notes | |-------|------|-------| | `llama-8b` | 6GB+ | **Default** - Llama 3.1 8B | | `gemma-9b` | 8GB+ | Gemma 2 9B (alternative) | | `llama-3b` | 3GB+ | Llama 3.2 3B (low VRAM only) | | `auto` | varies | Auto-selects based on available VRAM | ```bash # Use specific model whisperjav-translate -i subtitles.srt --provider local --model gemma-9b ``` **Resume Support**: If translation is interrupted, just run the same command again. It automatically resumes from where it left off using the `.subtrans` project file. --- ## What Makes It Work for JAV ### Scene Detection Splits audio at natural breaks instead of forcing fixed-length chunks. This prevents cutting off sentences mid-word. Three methods are available: - **Auditok** (default): Energy-based detection, fast and reliable - **Silero**: Neural VAD-based detection, better for noisy audio - **Semantic** (new in v1.7.4): Texture-based clustering using MFCC features, groups acoustically similar segments together ### Voice Activity Detection (VAD) Identifies when someone is actually speaking vs. background noise or music. Reduces false transcriptions during quiet moments. ### Japanese Post-Processing - Handles sentence-ending particles (ね, よ, わ, の) - Preserves aizuchi (うん, はい, ええ) - Recognizes dialect patterns (Kansai-ben, feminine/masculine speech) - Filters out common Whisper hallucinations ### Hallucination Removal Whisper sometimes generates repeated text or phrases that weren't spoken. WhisperJAV detects and removes these patterns. --- ## Content-Specific Recommendations | Content Type | Mode | Sensitivity | Notes | |--------------|------|-------------|-------| | Drama / Dialogue Heavy | balanced | aggressive | Or try transformers mode | | Group Scenes | faster | conservative | Speed matters, less precision needed | | Amateur / Homemade | fast | conservative | Variable audio quality | | ASMR / VR / Whisper | fidelity | aggressive | Maximum accuracy for quiet speech | | Heavy Background Music | balanced | conservative | VAD helps filter music | | Maximum Accuracy | ensemble | varies | Two-pass with different pipelines | --- ## Installation ### Windows (Recommended) *Best for: Most users, beginners, and those who want a GUI.* 1. **Download the Installer:** **[Download WhisperJAV-1.7.5-Windows-x86_64.exe](https://github.com/meizhong986/WhisperJAV/releases/latest)** 2. **Run the File:** Double-click the downloaded `.exe`. 3. **Follow the Prompts:** The installer handles all dependencies (Python, FFmpeg, Git) automatically. 4. **Launch:** Open "WhisperJAV" from your Desktop shortcut. > **Note:** The first launch may take a few minutes as it initializes the engine. GPU is auto-detected; CPU-only mode is used if no compatible GPU is found. **Upgrading?** Just run the new installer. Your AI models (~3GB), settings, and cached downloads will be preserved. --- ### macOS (Apple Silicon & Intel) *Best for: M1/M2/M3/M4 users and Intel Mac users.* The install script auto-detects your Mac architecture and handles PyTorch dependencies automatically. **1. Install Prerequisites** ```bash # Install Xcode Command Line Tools (required for GUI) xcode-select --install # Install Homebrew (if not installed) /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)" # Install system tools brew install python@3.11 ffmpeg git ``` > **GUI Requirement:** The Xcode Command Line Tools are required to compile `pyobjc`, which enables the GUI. Without it, only CLI mode will work. **2. Install WhisperJAV** ```bash git clone https://github.com/meizhong986/whisperjav.git cd whisperjav chmod +x installer/install_linux.sh # Run the installer (auto-detects Mac architecture) ./installer/install_linux.sh ``` > **Intel Macs:** The script automatically uses CPU-only mode. Expect slower processing (5-10x) compared to Apple Silicon with MPS acceleration. --- ### Linux (Ubuntu/Debian/Fedora) *Best for: Servers, desktops with NVIDIA GPUs.* The install script auto-detects NVIDIA GPUs and installs the matching CUDA version. **1. Install System Dependencies** ```bash # Debian / Ubuntu sudo apt-get update && sudo apt-get install -y python3-dev python3-pip build-essential ffmpeg libsndfile1 git # Fedora / RHEL sudo dnf install python3-devel gcc ffmpeg libsndfile git ``` **2. Install WhisperJAV** ```bash git clone https://github.com/meizhong986/whisperjav.git cd whisperjav chmod +x installer/install_linux.sh # Standard Install (auto-detects GPU) ./installer/install_linux.sh # Or force CPU-only (for servers without GPU) ./installer/install_linux.sh --cpu-only ``` > **Performance:** A 2-hour video takes ~5-10 minutes on GPU vs ~30-60 minutes on CPU. --- ### Advanced / Developer *Best for: Contributors and Python experts.*

Manual pip install

> **Warning:** Manual `pip install` is risky due to dependency conflicts (NumPy 2.x vs SciPy). We strongly recommend using the scripts above. **1. Create Environment** ```bash python -m venv whisperjav-env source whisperjav-env/bin/activate # Linux/Mac # whisperjav-env\Scripts\activate # Windows ``` **2. Install PyTorch First (Critical)** You must install PyTorch *before* the main package to ensure hardware acceleration works. - **NVIDIA GPU:** `pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124` - **Apple Silicon:** `pip install torch torchaudio` - **CPU only:** `pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu` **3. Install WhisperJAV** ```bash pip install git+https://github.com/meizhong986/whisperjav.git@main ```

Editable / Dev install

Use this if you plan to modify the code. ```bash git clone https://github.com/meizhong986/whisperjav.git cd whisperjav # Windows installer\install_windows.bat --dev # Mac/Linux ./installer/install_linux.sh --dev # Or manual pip install -e ".[dev]" ```

Windows source install

```batch git clone https://github.com/meizhong986/whisperjav.git cd whisperjav installer\install_windows.bat # Auto-detects GPU installer\install_windows.bat --cpu-only # Force CPU only installer\install_windows.bat --cuda118 # Force CUDA 11.8 installer\install_windows.bat --cuda124 # Force CUDA 12.4 ```

### Prerequisites - **Python 3.10-3.12** (3.9 not supported due to pysubtrans dependency, 3.13+ incompatible with openai-whisper) - **FFmpeg** in your system PATH - **GPU recommended**: NVIDIA CUDA, Apple MPS, or AMD ROCm - **8GB+ disk space** for installation

Detailed Windows Prerequisites

#### NVIDIA GPU Setup 1. Install latest [NVIDIA drivers](https://www.nvidia.com/drivers) 2. Install [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) matching your driver version 3. Install [cuDNN](https://developer.nvidia.com/cudnn) matching your CUDA version #### FFmpeg 1. Download from [gyan.dev/ffmpeg/builds](https://www.gyan.dev/ffmpeg/builds) 2. Extract to `C:\ffmpeg` 3. Add `C:\ffmpeg\bin` to your PATH #### Python Download from [python.org](https://www.python.org/downloads/windows/). Check "Add Python to PATH" during installation.

--- ## CLI Reference ```bash # Basic usage whisperjav video.mp4 whisperjav video.mp4 --mode balanced --sensitivity aggressive # All modes: faster, fast, balanced, fidelity, transformers whisperjav video.mp4 --mode fidelity # Transformers mode with custom parameters whisperjav video.mp4 --mode transformers --hf-beam-size 5 --hf-chunk-length 20 # Two-pass ensemble whisperjav video.mp4 --ensemble --pass1-pipeline transformers --pass2-pipeline balanced whisperjav video.mp4 --ensemble --pass1-pipeline balanced --pass2-pipeline fidelity --merge-strategy smart_merge # Output options whisperjav video.mp4 --output-dir ./subtitles whisperjav video.mp4 --subs-language english-direct # Batch processing whisperjav /path/to/folder --output-dir ./subtitles whisperjav /path/to/folder --skip-existing # Resume interrupted batch (skip already processed) # Debugging whisperjav video.mp4 --debug --keep-temp # Translation whisperjav video.mp4 --translate --translate-provider deepseek whisperjav-translate -i subtitles.srt --provider gemini ``` Run `whisperjav --help` for all options. --- ## Troubleshooting **FFmpeg not found**: Install FFmpeg and add it to your PATH. **Slow processing / GPU warning**: Your PyTorch might be CPU-only. Reinstall with GPU support: ```bash pip uninstall torch torchvision torchaudio pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124 ``` **model.bin error in faster mode**: Enable Windows Developer Mode or run as Administrator, then delete the cached model folder: ```powershell Remove-Item -Recurse -Force "$env:USERPROFILE\.cache\huggingface\hub\models--Systran--faster-whisper-large-v2" ``` --- ## Performance Rough estimates for processing time per hour of video: | Platform | Time | |----------|------| | NVIDIA GPU (CUDA) | 5-10 minutes | | Apple Silicon (MPS) | 8-15 minutes | | AMD GPU (ROCm) | 10-20 minutes | | CPU only | 30-60 minutes | --- ## Contributing Contributions welcome. See `CONTRIBUTING.md` for guidelines. ```bash git clone https://github.com/meizhong986/whisperjav.git cd whisperjav pip install -e .[dev] python -m pytest tests/ ``` --- ## License MIT License. See [LICENSE](LICENSE) file. --- ## Citation and credits - Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio." (2025). arXiv:2501.11378. - Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down." (2025). arXiv:2505.12969. - PromptASR for Contextualized ASR with Controllable Style." (2024). arXiv:2309.07414. - In-Context Learning Boosts Speech Recognition." (2025). arXiv:2505.1 - Koenecke, A., et al. (2024). "Careless Whisper: Speech-to-Text Hallucination Harms." ACM FAccT 2024. - Bain, M., et al. (2023). "WhisperX: Time-Accurate Speech Transcription of Long-Form Audio." arXiv:2303.00747. ## Acknowledgments - [OpenAI Whisper](https://github.com/openai/whisper) - The underlying ASR model - [stable-ts](https://github.com/jianfch/stable-ts) - Timestamp refinement - [faster-whisper](https://github.com/guillaumekln/faster-whisper) - Optimized CTranslate2 inference - [HuggingFace Transformers](https://github.com/huggingface/transformers) - Transformers pipeline backend - [Kotoba-Whisper](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.2) - Japanese-optimized Whisper model - The testing community for feedback and bug reports --- ## Disclaimer This tool generates accessibility subtitles. Users are responsible for compliance with applicable laws regarding the content they process.