# Multi-Modal_AI_Agents

**Repository Path**: tomwoo/Multi-Modal_AI_Agents

## Basic Information

- **Project Name**: Multi-Modal_AI_Agents
- **Description**: Multi-modal AI agent (LM_APP_002)
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-07-18
- **Last Updated**: 2025-08-02

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Multi-Modal_AI_Agents

This is a multi-modal AI agent project that processes and understands multiple types of data, such as text, images, and videos, by integrating several AI technologies.

## Project Overview

- "OCR Pipelines" tab: uses optical character recognition (OCR) tools to process PDF documents, extracting text, tables, and images.
- "Video Summarization and Q&A Agent" tab: provides a VSS (Video Search and Summarization) agent that handles video uploads, summary generation, and Q&A.

## Project Structure

- `main.py`: the main program.
- `ocr_pipelines.py`: implements the "OCR Pipelines" tab; extracts text, tables, and images from PDF documents.
- `vss_agent.py`: implements the "Video Summarization and Q&A Agent" tab; defines a `VssAgent` class for video uploads, summary generation, and Q&A tasks.
- `requirements.txt`: list of project dependencies.
- `LICENSE`: project license.
- `.gitignore`: Git ignore configuration.
- `images/`: folder for project-related image files.
- `pdf_docs/`: folder containing sample PDF documents.
- `prompt_examples/`: folder for prompt example files.
- `question_examples/`: folder containing example question files.
- `videos/`: folder containing sample video files.

## Installation

Ensure that you have a Python 3.10+ environment installed.
Then install the project dependencies with:

```bash
sudo apt install tesseract-ocr   # install Tesseract-OCR
pip install -r requirements.txt  # install third-party Python libraries (including Python Tesseract)
```

## Usage Instructions

1. Download the shared folder "Multi-Modal_AI_Agents/videos/" from Baidu Netdisk and save it to the project directory.
   Link: https://pan.baidu.com/s/10ok-FRWBgxevlw1fqiH4-A?pwd=wwhn
   Extraction code: wwhn
2. Run the following command to start the multi-modal AI agent:
   ```bash
   python main.py
   ```
3. Open the web page at http://localhost:7860/.

## Contributions

Code contributions and issue reports are welcome. Please submit Pull Requests and Issues on Gitee.

## License

This project is licensed under the MIT License. For details, see the `LICENSE` file.

## Notes

Create a `.env` file in the project directory with the following content:

```txt
HF_ENDPOINT="HuggingFace's (proxy) server URL"
VSS_HOST="Name or address of the VSS Agent"  # If unset, video uploads, summary generation, and video Q&A are unavailable
VSS_PORT="Port number of the VSS Agent"      # If unset, video uploads, summary generation, and video Q&A are unavailable
```
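As a minimal sketch of how these settings might be consumed, the snippet below reads `VSS_HOST` and `VSS_PORT` from the environment and disables the VSS features when either is missing. It assumes the `python-dotenv` package is available for loading the `.env` file; the `vss_config` helper is hypothetical and not part of the project's actual API.

```python
import os

# Hypothetical helper: load the .env file if python-dotenv is installed,
# otherwise fall back to whatever is already in the process environment.
try:
    from dotenv import load_dotenv
    load_dotenv()  # reads HF_ENDPOINT, VSS_HOST, VSS_PORT from .env
except ImportError:
    pass


def vss_config():
    """Return (host, port) for the VSS Agent, or None if either is unset.

    When None is returned, video uploads, summary generation, and
    video Q&A should be disabled, as described in the Notes above.
    """
    host = os.getenv("VSS_HOST")
    port = os.getenv("VSS_PORT")
    if not host or not port:
        return None
    return host, int(port)
```

This mirrors the behavior described in the Notes: both variables must be set for the video features to work, so a single check up front keeps the rest of the code from having to guard each call.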