# microsoft_visual-chatgpt
**Repository Path**: didi2050/microsoft_visual-chatgpt
## Basic Information
- **Project Name**: microsoft_visual-chatgpt
- **Description**: Source repository: https://github.com/microsoft/visual-chatgpt
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 2
- **Forks**: 0
- **Created**: 2023-04-08
- **Last Updated**: 2024-02-04
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Visual ChatGPT
**Visual ChatGPT** connects ChatGPT and a series of Visual Foundation Models to enable **sending** and **receiving** images during chatting.
See our paper: [Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models](https://arxiv.org/abs/2303.04671)
## Updates:
- Visual ChatGPT now supports Chinese! Thanks to **@Wang-Xiaodong1899** for his efforts.
- We propose the **template** idea in Visual ChatGPT!
- A template is a **pre-defined execution flow** that assists ChatGPT in assembling complex tasks involving multiple foundation models.
- A template contains the **experiential solution** to complex tasks as determined by humans.
- A template can **invoke multiple foundation models** or even **establish a new ChatGPT session**.
- To define a **template**, simply add a class with the attribute `template_model = True`.
- Thanks to **@ShengmingYin** and **@thebestannie** for providing a template example in the `InfinityOutPainting` class (see the following gif).
- First, run `python visual_chatgpt.py --load "ImageCaptioning_cuda:0,ImageEditing_cuda:1,VisualQuestionAnswering_cuda:2"`
- Then say `extend the image to 2048x1024` to Visual ChatGPT!
- By simply creating an `InfinityOutPainting` template, Visual ChatGPT can seamlessly extend images to any size through collaboration with existing `ImageCaptioning`, `ImageEditing`, and `VisualQuestionAnswering` foundation models, **without the need for additional training**.
- **Visual ChatGPT needs the effort of the community! We crave your contribution to add new and interesting features!**
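
The template idea above can be illustrated with a small, self-contained sketch. Only the `template_model = True` attribute comes from this README; the stub model classes, the `CaptionThenEdit` template, the `build_templates` helper, and the convention of naming constructor parameters after the models they need are all hypothetical and are not taken from `visual_chatgpt.py`.

```python
import inspect


class ImageCaptioning:
    """Hypothetical stub standing in for a loaded foundation model."""

    def inference(self, image_path):
        return f"a caption for {image_path}"


class ImageEditing:
    """Hypothetical stub standing in for a loaded foundation model."""

    def inference(self, image_path, prompt):
        return f"{image_path} edited with '{prompt}'"


class CaptionThenEdit:
    """Hypothetical template: a fixed execution flow chaining two models."""

    template_model = True  # the attribute this README says marks a template

    # Assumption: each parameter is named after the model the template needs.
    def __init__(self, ImageCaptioning, ImageEditing):
        self.captioner = ImageCaptioning
        self.editor = ImageEditing

    def inference(self, image_path):
        caption = self.captioner.inference(image_path)
        return self.editor.inference(image_path, caption)


def build_templates(loaded, candidates):
    """Instantiate every template whose required models are all loaded."""
    templates = {}
    for cls in candidates:
        if not getattr(cls, "template_model", False):
            continue  # ordinary model class, not a template
        required = [
            name
            for name in inspect.signature(cls.__init__).parameters
            if name != "self"
        ]
        if all(name in loaded for name in required):
            templates[cls.__name__] = cls(**{n: loaded[n] for n in required})
    return templates


loaded = {"ImageCaptioning": ImageCaptioning(), "ImageEditing": ImageEditing()}
templates = build_templates(loaded, [CaptionThenEdit, ImageCaptioning])
print(sorted(templates))
```

In this sketch a template needs no training of its own: it only reuses models that are already loaded, which mirrors how `InfinityOutPainting` is described above.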
## Insight & Goal:
On the one hand, **ChatGPT (or LLMs)** serves as a **general interface** that provides a broad and diverse understanding of a
wide range of topics. On the other hand, **Foundation Models** serve as **domain experts** by providing deep knowledge in specific domains.
By leveraging **both general and deep knowledge**, we aim to build an AI that is capable of handling a wide variety of tasks.
## Demo
## System Architecture

## Quick Start
```
# clone the repo
git clone https://github.com/microsoft/visual-chatgpt.git
# Go to directory
cd visual-chatgpt
# create a new environment
conda create -n visgpt python=3.8
# activate the new environment
conda activate visgpt
# prepare the basic environments
pip install -r requirements.txt
# prepare your private OpenAI key (for Linux)
export OPENAI_API_KEY={Your_Private_Openai_Key}
# prepare your private OpenAI key (for Windows)
set OPENAI_API_KEY={Your_Private_Openai_Key}
# Start Visual ChatGPT !
# Use "--load" to specify the GPU/CPU assignment: it selects which Visual
# Foundation Models to load and the device each one is loaded onto.
# Each model and its device are joined by an underscore '_'; entries are separated by commas ','.
# The available Visual Foundation Models are listed in the table in the next section.
# For example, to load ImageCaptioning on the CPU and Text2Image on cuda:0,
# use: "ImageCaptioning_cpu,Text2Image_cuda:0"
# Advice for CPU Users
python visual_chatgpt.py --load ImageCaptioning_cpu,Text2Image_cpu
# Advice for 1 Tesla T4 15GB (Google Colab)
python visual_chatgpt.py --load "ImageCaptioning_cuda:0,Text2Image_cuda:0"
# Advice for 4 Tesla V100 32GB
python visual_chatgpt.py --load "ImageCaptioning_cuda:0,ImageEditing_cuda:0,
Text2Image_cuda:1,Image2Canny_cpu,CannyText2Image_cuda:1,
Image2Depth_cpu,DepthText2Image_cuda:1,VisualQuestionAnswering_cuda:2,
InstructPix2Pix_cuda:2,Image2Scribble_cpu,ScribbleText2Image_cuda:2,
Image2Seg_cpu,SegText2Image_cuda:2,Image2Pose_cpu,PoseText2Image_cuda:2,
Image2Hed_cpu,HedText2Image_cuda:3,Image2Normal_cpu,
NormalText2Image_cuda:3,Image2Line_cpu,LineText2Image_cuda:3"
```
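
To make the `--load` format concrete, here is a minimal sketch of how such a string could be split into a model-to-device mapping. The `parse_load` function is a hypothetical helper written for illustration; it is not claimed to match the parsing code in `visual_chatgpt.py`.

```python
def parse_load(spec):
    """Split 'Model_device,Model_device' into a {model: device} dict."""
    mapping = {}
    for entry in spec.split(","):
        entry = entry.strip()  # tolerate spaces and line breaks between entries
        if not entry:
            continue
        model, sep, device = entry.partition("_")
        if not sep or not model or not device:
            raise ValueError(f"expected Model_device, got {entry!r}")
        mapping[model] = device
    return mapping


print(parse_load("ImageCaptioning_cpu,Text2Image_cuda:0"))
# → {'ImageCaptioning': 'cpu', 'Text2Image': 'cuda:0'}
```

Splitting on the first underscore is safe here because none of the model names listed in this README contain an underscore.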
## GPU memory usage
The table below lists the GPU memory usage of each Visual Foundation Model, so you can choose which ones to load:
| Foundation Model | GPU Memory (MB) |
|------------------------|-----------------|
| ImageEditing | 3981 |
| InstructPix2Pix | 2827 |
| Text2Image | 3385 |
| ImageCaptioning | 1209 |
| Image2Canny | 0 |
| CannyText2Image | 3531 |
| Image2Line | 0 |
| LineText2Image | 3529 |
| Image2Hed | 0 |
| HedText2Image | 3529 |
| Image2Scribble | 0 |
| ScribbleText2Image | 3531 |
| Image2Pose | 0 |
| PoseText2Image | 3529 |
| Image2Seg | 919 |
| SegText2Image | 3529 |
| Image2Depth | 0 |
| DepthText2Image | 3531 |
| Image2Normal | 0 |
| NormalText2Image | 3529 |
| VisualQuestionAnswering| 1495 |
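
As a rough planning aid, the table can be combined with a `--load` string to estimate how much memory each GPU would need. The sketch below copies the figures from the table above; the `memory_per_device` helper is hypothetical, and real usage may differ from these numbers depending on the driver, the library versions, and the inputs.

```python
from collections import defaultdict

# GPU memory in MB, copied from the table above.
GPU_MEMORY_MB = {
    "ImageEditing": 3981, "InstructPix2Pix": 2827, "Text2Image": 3385,
    "ImageCaptioning": 1209, "Image2Canny": 0, "CannyText2Image": 3531,
    "Image2Line": 0, "LineText2Image": 3529, "Image2Hed": 0,
    "HedText2Image": 3529, "Image2Scribble": 0, "ScribbleText2Image": 3531,
    "Image2Pose": 0, "PoseText2Image": 3529, "Image2Seg": 919,
    "SegText2Image": 3529, "Image2Depth": 0, "DepthText2Image": 3531,
    "Image2Normal": 0, "NormalText2Image": 3529,
    "VisualQuestionAnswering": 1495,
}


def memory_per_device(spec):
    """Sum the listed GPU memory (MB) for each cuda device in a --load string."""
    totals = defaultdict(int)
    for entry in spec.split(","):
        model, _, device = entry.strip().partition("_")
        if device.startswith("cuda"):  # models placed on the CPU use no GPU memory
            totals[device] += GPU_MEMORY_MB[model]
    return dict(totals)


print(memory_per_device("ImageCaptioning_cuda:0,Text2Image_cuda:0"))
# → {'cuda:0': 4594}
```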
## Acknowledgement
We are grateful to the following open-source projects:
- [Hugging Face](https://github.com/huggingface)
- [LangChain](https://github.com/hwchase17/langchain)
- [Stable Diffusion](https://github.com/CompVis/stable-diffusion)
- [ControlNet](https://github.com/lllyasviel/ControlNet)
- [InstructPix2Pix](https://github.com/timothybrooks/instruct-pix2pix)
- [CLIPSeg](https://github.com/timojl/clipseg)
- [BLIP](https://github.com/salesforce/BLIP)
## Contact Information
For help or issues using Visual ChatGPT, please submit a GitHub issue.
For other communications, please contact Chenfei WU (chewu@microsoft.com) or Nan DUAN (nanduan@microsoft.com).