# 桌宠

**Repository Path**: yxhn05/desktop-pet

## Basic Information

- **Project Name**: 桌宠 (Desktop Pet)
- **Description**: A desktop application for waking a web page and sending messages to it (a companion add-on to the web page)
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2025-10-19
- **Last Updated**: 2026-05-10

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Desktop Pet Assistant

A smart desktop pet assistant with voice interaction, voiceprint recognition, and AI capabilities.

*Desktop Pet Speaking Animation*

## Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Demo](#demo)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Configuration](#configuration)
- [Usage](#usage)
- [Project Structure](#project-structure)
- [Dependencies](#dependencies)
- [Contributing](#contributing)
- [License](#license)
- [Acknowledgments](#acknowledgments)

## Overview

Desktop Pet Assistant is a Python-based application that creates an interactive desktop companion. It features voice interaction, voiceprint recognition for user authentication, and integration with large language models for intelligent conversations. The pet appears as a small window on your desktop and responds to voice commands.

## Features

### Core Functionality

- **Real-time Speech Recognition**: Uses the iFlytek WebSocket API for real-time speech-to-text conversion
- **Speaker Verification**: Integrated voice biometrics to identify and authenticate users
- **Interactive Pet Interface**: Visual pet with animated states and responses
- **Multi-user Support**: Supports multiple user profiles with automatic speaker identification
- **Session Management**: Handles user sessions with automatic inactivity detection and reminders

### Technical Implementation

- **Audio Queue System**: Asynchronous processing of audio segments with message queuing
- **Audio Concatenation**: Automatically concatenates audio segments into complete session audio
- **Wake Word Detection**: Extended wake word support with multiple variations
- **Real-time Feedback**: Visual and audio feedback for all voice interactions

### Additional Features

- **Interactive Desktop Pet**: A cute pet that lives on your desktop with four different states (idle, listening, thinking, speaking)
- **Voice Interaction**: Real-time speech-to-text and text-to-speech capabilities using the iFlytek API
- **Voiceprint Recognition**: User authentication through voice biometrics with local and cloud database support
- **AI Integration**: Powered by the GLM-4 large language model via Ollama for intelligent conversations
- **WebSocket Communication**: Real-time communication with web frontends
- **Tool Calling**: Browser control, message sending, and connection monitoring capabilities
- **Advanced Voice Activity Detection**: Uses Silero VAD for accurate voice detection, eliminating false triggers from background noise
- **Real-time User Switching**: Automatically performs voiceprint recognition during voice interactions to switch between users seamlessly
- **Multi-user Support**: Handles multiple registered users with different identities and permissions
- **Visual Feedback**: Rich visual states showing the pet's current activity (idle, listening, thinking, speaking)
- **Role-based Access Control**: Users can have different roles and permissions based on their job titles in two business domains
- **Database Management**: MySQL database for storing voiceprint data and user information

## Demo

### Desktop Pet States

| State | Appearance | Description |
|-------|------------|-------------|
| Idle | ![Idle](assets/icon.png) | Waiting for interaction |
| Listening | ![Listening](assets/icon.png) | Actively recording your voice |
| Thinking | ![Thinking](assets/icon.png) | Processing your request with AI |
| Speaking | ![Speaking](assets/speak_icon.webp) | Responding with synthesized voice |

### Role-based Access Control

The system supports two business domains with hierarchical role structures:

1. **Mechanical Arm Business**:
   - Data Entry Clerk (L1)
   - Data Analyst (L2)
   - Systems Engineer (L3)
   - Technical Expert (L4)
   - Smart Manufacturing Director (L5)
2. **Flower Cultivation Business**:
   - Data Collector (L1)
   - Data Analyst (L2)
   - Cultivation Expert (L3)
   - Systems Engineer (L4)
   - Smart Agriculture Director (L5)

## Prerequisites

- Python 3.7+
- Windows/Linux/macOS
- iFlytek API credentials (for ASR and voiceprint recognition)
- Ollama with the GLM-4 model (for AI capabilities)
- MySQL database for storing voiceprint data and user information

## Installation

1. Clone the repository:
   ```
   git clone
   cd pet
   ```

2. Install the required packages:
   ```
   pip install -r requirements.txt
   ```

3. Configure environment variables in `config.py`:
   - Set your iFlytek API credentials
   - Configure database settings
   - Adjust WebSocket and frontend settings
   - Set up voiceprint group settings

4. Set up the database:
   ```
   mysql -u username -p < database/user_title_schema.sql
   ```

5. Make sure Ollama is installed and running with the GLM-4 model:
   ```
   ollama pull glm-4
   ollama run glm-4
   ```

## Usage

### Starting the Application

```
python main.py
```

### Voice Interaction Flow

1. Say any wake word to activate the pet
2. The pet enters listening mode (with visual feedback)
3. Speak your command or query
4. The system processes your voice in real time
5. Speaker verification identifies you automatically
6. The pet responds based on your identity and the content

### Session Management

- **3-minute inactivity reminder**: The pet sends a reminder after 3 minutes of inactivity
- **10-minute session timeout**: The session ends automatically after 10 minutes of inactivity
- **Multi-user switching**: Seamless switching between identified users

### Voice Commands

- Speak naturally to the pet to interact with the AI assistant
- Use voiceprint verification to authenticate yourself
- Ask the pet to open the web browser or send messages to the frontend
- Request to add or delete voiceprints
- Ask to end the current task
- Request user switching

### User Authentication

Users can be authenticated through voiceprint recognition.
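As an illustration of how a role-based check could be layered on top of authentication, here is a minimal sketch. The `ROLE_LEVELS` table mirrors the role hierarchies listed under Demo, but the function name and the domain keys are assumptions for illustration, not the project's actual API.

```python
# Hypothetical sketch of a role-based permission check. The role table mirrors
# the README's two business domains; function and key names are illustrative.

ROLE_LEVELS = {
    "mechanical_arm": {
        "Data Entry Clerk": 1,
        "Data Analyst": 2,
        "Systems Engineer": 3,
        "Technical Expert": 4,
        "Smart Manufacturing Director": 5,
    },
    "flower_cultivation": {
        "Data Collector": 1,
        "Data Analyst": 2,
        "Cultivation Expert": 3,
        "Systems Engineer": 4,
        "Smart Agriculture Director": 5,
    },
}

def has_permission(domain: str, title: str, required_level: int) -> bool:
    """Return True if the title's level in the given domain meets the requirement."""
    level = ROLE_LEVELS.get(domain, {}).get(title, 0)
    return level >= required_level

print(has_permission("mechanical_arm", "Technical Expert", 3))    # True  (L4 >= L3)
print(has_permission("flower_cultivation", "Data Collector", 2))  # False (L1 < L2)
```

An unknown domain or title resolves to level 0, so unregistered users are denied by default.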
Once authenticated, users can access features based on their assigned roles in either the mechanical arm or flower cultivation business domain.

### Adding New Users

To add a new user with a voiceprint:

1. Say "添加声纹" (Add voiceprint) to the pet
2. Enter the admin password when prompted
3. Provide an English name and, optionally, a Chinese name
4. Select one or more job titles from the available options
5. Record a 4-second voice sample

### Removing Users

To remove a user's voiceprint:

1. Say "删除声纹" (Delete voiceprint) to the pet
2. Enter the admin password when prompted
3. Provide the English name of the user to remove

### States

- **Idle** (Green): Waiting for interaction
- **Listening** (Blue): Actively recording your voice
- **Thinking** (Yellow): Processing your request with AI
- **Speaking** (Animated): Responding with synthesized voice and an animated WebP image

The pet visually indicates its current state through color-coded icons, providing clear feedback on its activities.

## System Architecture

### Audio Processing Pipeline

1. **Voice Activity Detection** - Detects when the user starts speaking
2. **Real-time Transcription** - Streams audio to the iFlytek ASR API for real-time conversion
3. **Audio Segmentation** - Captures audio segments for processing
4. **Queue Management** - Adds segments to an asynchronous processing queue
5. **Audio Concatenation** - Combines segments into complete session audio
6. **Speaker Verification** - Performs voice biometric identification
7. **Response Generation** - Generates appropriate responses based on the user and content

### Key Components

#### Voice Worker (`voice_worker/`)

- **voice_worker_new.py**: Main voice worker class (refactored architecture)
- **audio_recorder.py**: Audio recording with voice activity detection
- **audio_processor.py**: Speech-to-text and text-to-speech processing
- **voice_verifier.py**: Voiceprint verification and user authentication
- **user_manager.py**: User state management and permissions
- **agent_controller.py**: Agent controller for AI interactions
- **websocket_client.py**: WebSocket client for frontend communication
- **interaction.py**: Voice interaction handling logic (legacy)
- **audio_utils.py**: Audio processing utilities and configuration

#### Audio System (New Architecture)

- **AudioRecorder**: Handles audio recording and voice activity detection
- **AudioProcessor**: Manages STT and TTS operations
- **VoiceVerifier**: Performs voice biometric identification
- **UserManager**: Manages user state and permissions

## Project Structure

```
pet/
├── main.py                       # Main application entry point
├── desktop_pet_window.py         # Pet UI window
├── websocket_server.py           # WebSocket server for frontend
├── config.py                     # Configuration settings
├── controller/
│   └── serviceController.py      # Service controller with LLM integration
├── Tools/
│   ├── websocket_tool.py         # WebSocket related tools
│   └── voice_management_tool.py  # Voice management tools
├── utils/
│   ├── audio_recorder.py         # Audio recording utilities
│   ├── tts_player.py             # Text-to-speech player
│   ├── silero_vad.py             # Silero VAD implementation for voice activity detection
│   ├── audio_player.py           # Audio player
│   └── webp_player.py            # WebP animation player
├── Mapper/
│   ├── wav_mapper.py             # WAV file data mapping
│   ├── voiceDB_mapper.py         # Voiceprint database mapping
│   └── user_title_mapper.py      # User title mapping
├── Impl/
│   ├── wav_impl.py               # WAV file business implementation
│   └── user_title_impl.py        # User title business implementation
├── voice_worker/                 # Voice processing modules (refactored)
│   ├── __init__.py               # Voice worker package init
│   ├── voice_worker_new.py       # Main voice worker class (new architecture)
│   ├── audio_recorder.py         # Audio recording with VAD
│   ├── audio_processor.py        # Audio processing (STT/TTS)
│   ├── voice_verifier.py         # Voiceprint verification
│   ├── user_manager.py           # User state management
│   ├── agent_controller.py       # Agent controller
│   ├── websocket_client.py       # WebSocket client for frontend communication
│   ├── interaction.py            # Interaction handling (legacy)
│   ├── audio_utils.py            # Audio utilities
│   ├── db_utils.py               # Database utilities
│   ├── xfyun_auth.py             # iFlytek API authentication
│   └── README_REFACTOR.md        # Refactoring documentation
├── assets/                       # Resource files
├── database/                     # Database schema files
├── models/                       # ML models and data
├── temp_audio/                   # Temporary audio files
└── requirements.txt              # Python dependencies
```

## Configuration

### Audio Parameters

```python
AUDIO_PARAMS = {
    "RATE": 16000,               # 16 kHz sample rate (iFlytek requirement)
    "CHANNELS": 1,               # Mono channel
    "FORMAT": pyaudio.paInt16,   # 16-bit PCM format
    "FRAME_SIZE": 640,           # 20 ms per frame
    "STREAM_FRAME_SIZE": 1280    # 40 ms per frame (iFlytek API requirement)
}
```

### Wake Words

The system supports multiple wake word variations:

- "小飞你好", "小v你好", "小v", "小V你好", "小V"
- "你好小飞", "你好小v", "你好小V"
- "小飞", "小飞在吗", "小飞醒醒", "小飞启动"
- "小v", "小V", "小v在吗", "小V在吗"
- "小v醒醒", "小V醒醒", "小v启动", "小V启动"

Key configuration options in `config.py`:

- **iFlytek API Settings**: For speech recognition and voiceprint services
- **Voiceprint Database Settings**: Group names and IDs for voiceprint management
- **WebSocket Settings**: Host and port configuration
- **Database Configuration**: MySQL connection settings for storing voice data
- **Frontend URL**: Web interface URL for extended functionality
- **Admin Password**: Password for administrative functions
- **State Images**: Paths to visual assets for different pet states
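A simple way to match the wake-word variations above is to normalize the ASR transcript and test for any listed variant. The sketch below is illustrative, not the project's actual implementation; the function name and normalization rules are assumptions.

```python
# Illustrative wake-word matching sketch. Lowercasing folds the "小V"/"小v"
# variants together, so only the lowercase forms need to be listed.
WAKE_WORDS = [
    "小飞你好", "小v你好", "你好小飞", "你好小v",
    "小飞", "小飞在吗", "小飞醒醒", "小飞启动",
    "小v", "小v在吗", "小v醒醒", "小v启动",
]

def contains_wake_word(transcript: str) -> bool:
    """Check whether an ASR transcript contains any wake word variant."""
    text = transcript.lower().replace(" ", "")  # fold case, drop ASR spacing
    return any(word in text for word in WAKE_WORDS)

print(contains_wake_word("你好 小V"))    # True
print(contains_wake_word("今天天气不错"))  # False
```

Substring matching keeps the check tolerant of surrounding speech ("小飞在吗，帮我开网页" still triggers), at the cost of occasional false positives when a wake word appears mid-sentence.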
## Dependencies

See `requirements.txt` for a complete list of dependencies:

- PyQt5: GUI framework
- PyAudio: Audio processing
- WebSocket: Real-time communication
- Langchain/Ollama: AI integration
- Requests: HTTP requests
- Pyttsx3: Text-to-speech
- Psutil: System utilities
- Torch/Torchaudio: For Silero VAD voice activity detection
- MySQL Connector: Database connectivity
- PyDub: Audio file manipulation
- PyGame: Multimedia support, including WebP animations
- Pillow: Image processing
- Scikit-learn: Machine learning utilities

Install all dependencies with:

```
pip install -r requirements.txt
```

## Troubleshooting

### Common Issues

1. **Audio Not Recognized**: Check microphone permissions and audio levels
2. **High Latency**: Adjust frame size and buffer parameters
3. **Speaker Verification Fails**: Ensure clear voice input and minimal background noise
4. **WebSocket Connection Issues**: Verify iFlytek API credentials and network connectivity

### Debug Information

The system provides detailed logging for:

- Audio capture and processing
- ASR results and confidence scores
- Speaker verification attempts and results
- Queue processing status

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- iFlytek for providing the speech and voiceprint APIs
- Ollama for local LLM deployment
- PyQt5 for the GUI framework
- Silero for the VAD model