# Multi-Modal_AI_Agents

**Repository Path**: tomwoo/Multi-Modal_AI_Agents

## Basic Information

- **Project Name**: Multi-Modal_AI_Agents
- **Description**: Multi-modal AI agent (LM_APP_002)
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-07-18
- **Last Updated**: 2025-08-02

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Multi-Modal_AI_Agents

This is a multi-modal AI agent project that processes and understands multiple types of data, such as text, images, and videos, by integrating several AI technologies.

## Project Overview

- "OCR Pipelines" tab: uses optical character recognition (OCR) tools to process PDF documents, extracting text, tables, and images.
- "Video Summarization and Q&A Agent" tab: provides a VSS (Video Search and Summarization) agent that handles video uploads, summary generation, and Q&A.

## Project Structure

- `main.py`: the main program.
- `ocr_pipelines.py`: implements the "OCR Pipelines" tab; extracts text, tables, and images from PDF documents.
- `vss_agent.py`: implements the "Video Summarization and Q&A Agent" tab; defines a `VssAgent` class for video uploads, summary generation, and Q&A tasks.
- `requirements.txt`: list of project dependencies.
- `LICENSE`: project license.
- `.gitignore`: Git ignore configuration.
- `images/`: folder for project-related image files.
- `pdf_docs/`: folder containing sample PDF documents.
- `prompt_examples/`: folder for prompt example files.
- `question_examples/`: folder containing example question files.
- `videos/`: folder containing sample video files.

## Installation

Ensure that you have a Python 3.10+ environment installed.
Then install the project dependencies with:

```bash
sudo apt install tesseract-ocr   # install Tesseract-OCR
pip install -r requirements.txt  # install third-party Python libraries (including Python Tesseract)
```

## Usage Instructions

1. Download the shared folder "Multi-Modal_AI_Agents/videos/" from Baidu Netdisk and save it to the project directory.
   Link: https://pan.baidu.com/s/10ok-FRWBgxevlw1fqiH4-A?pwd=wwhn
   Extraction code: wwhn
2. Run the following command to start the multi-modal AI agent:
   ```bash
   python main.py
   ```
3. Open the web page at http://localhost:7860/.

## Contributions

Code contributions and issue reports are welcome. Please submit Pull Requests and Issues on Gitee.

## License

This project is licensed under the MIT License. For details, see the `LICENSE` file.

## Notes

Create a `.env` file in the project directory with the following content:

```txt
HF_ENDPOINT="HuggingFace's (proxy) server URL"
VSS_HOST="Name or address of the VSS Agent"  # If unset, video uploads, summary generation, and video Q&A are unavailable
VSS_PORT="Port number of the VSS Agent"      # If unset, video uploads, summary generation, and video Q&A are unavailable
```
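As a minimal sketch of how these settings might be consumed, the snippet below reads `VSS_HOST` and `VSS_PORT` from the environment and disables the VSS features when either is missing. It assumes the `python-dotenv` package is available for loading the `.env` file; the `vss_config` helper is hypothetical and not part of the project's actual API.

```python
import os

# Hypothetical helper: load the .env file if python-dotenv is installed,
# otherwise fall back to whatever is already in the process environment.
try:
    from dotenv import load_dotenv
    load_dotenv()  # reads HF_ENDPOINT, VSS_HOST, VSS_PORT from .env
except ImportError:
    pass


def vss_config():
    """Return (host, port) for the VSS Agent, or None if either is unset.

    When None is returned, video uploads, summary generation, and
    video Q&A should be disabled, as described in the Notes above.
    """
    host = os.getenv("VSS_HOST")
    port = os.getenv("VSS_PORT")
    if not host or not port:
        return None
    return host, int(port)
```

This mirrors the behavior described in the Notes: both variables must be set for the video features to work, so a single check up front keeps the rest of the code from having to guard each call.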