# CutClaw **Repository Path**: cellinlab/CutClaw ## Basic Information - **Project Name**: CutClaw - **Description**: CutClaw: Agentic Hours-Long Video Editing via Music Synchronization - **Primary Language**: Unknown - **License**: Not specified - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2026-04-09 - **Last Updated**: 2026-04-09 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README

## 🦞CutClaw: Agentic Hours-Long Video Editing via Music Synchronization **🎬 Your personal editor for turning hours of footage into cinematic montages.** [![arXiv](https://img.shields.io/badge/arXiv-paper-b31b1b.svg)](https://arxiv.org/abs/2603.29664) [![GitHub Stars](https://img.shields.io/github/stars/GVCLab/CutClaw?style=social)](https://github.com/GVCLab/CutClaw)

[Overview](#-overview) • [Features](#-key-features) • [Gallery](#️-gallery) • [Quick Start](#-quick-start) • [Troubleshooting](#️-troubleshooting) • [Citation](#-citation)

---

## 💡 Overview CutClaw is an end-to-end editing system for long-form footage + music. It first deconstructs raw video/audio into structured captions, then uses a multi-agent pipeline to plan shots (`shot_plan`), select clip timestamps (`shot_point`), and validate final quality before rendering. ![CutClaw Pipeline](asset/method.png) --- ## ✨ Key Features

### 🎬 **One-Click Deconstruction**

Effortlessly transforms hours-long raw video and audio into structured, searchable assets with a single click.

### 🎯 **Instruction Control**

Requires only one text instruction to steer the editing style—easily generating fast-paced character montages or slow-paced emotional narratives.

### 📱 **Smart Auto-Cropping**

Content-aware cropping automatically identifies core subjects and adjusts aspect ratios to fit various social platforms.

### 🎵 **Music-Aware Sync**

Extracts musical beats and energy signals to build rhythm-aware cuts that perfectly match the music's pacing.

--- ## 🖼️ Gallery（remember to turn on the audio）

---- ## 🚀 Quick Start ### 1. Install ```bash git clone https://github.com/GVCLab/CutClaw.git cd CutClaw conda create -n CutClaw python=3.12 conda activate CutClaw pip install -r requirements.txt ``` > We strongly recommend the GPU-accelerated Decord/NVDEC build for faster video decoding. Build from [source](https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source). ### 2. Add your files ``` resource/ ├── video/ ← put your .mp4 / .mkv here ├── audio/ ← put your .mp3 / .wav here └── subtitle/ ← optional .srt (skips ASR, saves time) ``` ### 3. Run **UI (recommended)** ```bash streamlit run app.py ``` Then open `http://localhost:8501` in your browser. (*If `http://localhost:8501` does not work well, try `http://127.0.0.1:8501`) ![CutClaw UI demo](asset/UI.png) > Place your footage in the paths above, then you can directly select those files in the UI. Model selection guidance: - **Video model** - **Role**: shot/scene understanding and visual captioning. - **Recommended**: Gemini-3, Qwen3.5, GPT-5.3 - **Audio model** - **Role**: ASR plus music-structure parsing (beat/downbeat, pitch, energy) for music-aware segmentation. - **Recommended**: Gemini-3 - **Agent model** - **Role**: drives the Screenwriter + Editor + Reviewer loop to generate `shot_plan` and `shot_point`. - **Recommended**: MiniMax-2.7, Kimi-2.5, Claude-4.5 We leverage `LiteLLM` as the api manager gateway, the typical Model name is e.g. 'openai/MiniMax-2.7' which means using openai protocol to call the given model, more information see [LiteLLM documents](https://github.com/BerriAI/litellm).

CLI (advanced)

```bash python local_run.py \ --Video_Path "resource/video/xxxx.mp4" \ --Audio_Path "resource/audio/xxxx.mp3" \ --Instruction "xxxx" ```

Common config overrides

Any `src/config.py` parameter can be overridden with `--config.PARAM_NAME VALUE`. | Parameter | Default | Effect | |---|---|---| | `VIDEO_PATH` | `"resource/video/The_Dark_Knight.mkv"` | Default input video path used by UI remembered inputs | | `AUDIO_PATH` | `"resource/audio/Way_Down_We_Go.mp3"` | Default input audio path used by UI remembered inputs | | `INSTRUCTION` | `"Joker's crazy that want to change the world."` | Default editing instruction prompt | | `ASR_BACKEND` | `"litellm"` | ASR engine (`litellm` cloud or `whisper_cpp` local) | | `VIDEO_FPS` | `2` | Sampling FPS for preprocessing | | `MAIN_CHARACTER_NAME` | `"Joker"` | Protagonist name for character-focused edits | | `AUDIO_MIN_SEGMENT_DURATION` | `3.0` | Minimum beat segment duration (seconds) | | `AUDIO_MAX_SEGMENT_DURATION` | `5.0` | Maximum beat segment duration (seconds) | | `AUDIO_DETECTION_METHODS` | `["downbeat", "pitch", "mel_energy"]` | Audio keypoint detection methods | | `PARALLEL_SHOT_MAX_WORKERS` | `4` | Parallel shot selection workers | Example: ```bash python local_run.py \ --Video_Path "resource/video/xxxx.mp4" \ --Audio_Path "resource/audio/xxxx.mp3" \ --Instruction "xxxx" \ --config.MAIN_CHARACTER_NAME "Batman" \ --config.VIDEO_FPS 2 \ --config.AUDIO_TOTAL_SHOTS 50 ```

Then render manually: ```bash python render/render_video.py \ --shot-plan "Output//shot_plan_*.json" \ --shot-json "Output//shot_point_*.json" \ --video "resource/video/xxxx.mp4" \ --audio "resource/audio/xxxx.mp3" \ --output "output/final.mp4" \ --crop-ratio "9:16" \ --no-labels --render-hook-dialogue ```

--- ## 🛠️ Troubleshooting **Very slow runtime** 1. **API latency** — the pipeline sends a large number of concurrent requests to vision/language APIs. Speed is heavily dependent on your API provider's response time and rate limits. 2. **First-run Footage Deconstruction** — the first time you process a video, shot detection, captioning, ASR, and scene analysis all run from scratch. This is a one-time cost per video; subsequent edits with the same footage reuse the cached results and are much faster. 3. **GPU acceleration** — a CUDA-capable GPU significantly speeds up video decoding and encoding. We recommend building Decord with NVDEC support (see Install section). 4. **Video codec compatibility** — if the pipeline appears to hang during video-related steps, the source video's encoding may be the cause. In our testing, videos encoded with `libx264` worked reliably. ## ⭐ Citation If you find CutClaw useful for your research, welcome to cite our work using the following BibTeX: ```bibtex @article{cutclaw, title={CutClaw: Agentic Hours-Long Video Editing via Music Synchronization}, author={Shifang Zhao, Yihan Hu, Ying Shan, Yunchao Wei, Xiaodong Cun}, journal={arXiv preprint arXiv:2603.29664}, year={2026} } ```