# self-operating-computer

**Repository Path**: felixchina2024/self-operating-computer

## Basic Information

- **Project Name**: self-operating-computer
- **Description**: Self-Operating Computer is an open-source framework that lets multimodal models operate a computer by simulating human mouse clicks and keyboard input. It uses advanced models such as GPT-4V to perform autonomous actions, for example opening a page in a browser and writing content. Its core capability is estimating the correct coordinates for mouse clicks and producing the appropriate keyboard input to accomplish a given objective.
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: add-agent-1
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2024-08-10
- **Last Updated**: 2024-09-27

## Categories & Tags

**Categories**: Uncategorized

**Tags**: AI agent

## README

# Self-Operating Computer Framework

A framework to enable multimodal models to operate a computer.

Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective.
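The core of this loop is turning the model's estimated click location into an actual screen position. As a hypothetical sketch (the function name and the percentage-based coordinate convention are assumptions, not the framework's actual code), the conversion from model-estimated fractional coordinates to pixel coordinates could look like this; a library such as `pyautogui` would then perform the click:

```python
def percent_to_pixels(x_percent: float, y_percent: float,
                      screen_width: int, screen_height: int) -> tuple[int, int]:
    """Convert model-estimated fractional coordinates (0.0-1.0) to pixels.

    Hypothetical sketch: the real framework's parsing and click logic
    may differ.
    """
    x = int(round(x_percent * screen_width))
    y = int(round(y_percent * screen_height))
    # Clamp to the visible screen so a slightly-off estimate still lands on it.
    return (min(max(x, 0), screen_width - 1),
            min(max(y, 0), screen_height - 1))

print(percent_to_pixels(0.5, 0.25, 1920, 1080))  # (960, 270)
```

Clamping matters here because, as noted below, current models' coordinate estimates are often inaccurate.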

## Key Features

- **Compatibility**: Designed for various multimodal models.
- **Integration**: Currently integrated with **GPT-4V** as the default model, with extended support for Gemini Pro Vision.
- **Future Plans**: Support for additional models.

## Current Challenges

> **Note:** GPT-4V's error rate in estimating XY mouse click locations is currently quite high. This framework aims to track the progress of multimodal models over time, aspiring to achieve human-level performance in computer operation.

## Ongoing Development

At [HyperwriteAI](https://www.hyperwriteai.com/), we are developing Agent-1-Vision, a multimodal model with more accurate click location predictions.

## Agent-1-Vision Model API Access

We will soon be offering API access to our Agent-1-Vision model. If you're interested in gaining access to this API, sign up [here](https://othersideai.typeform.com/to/FszaJ1k8?typeform-source=www.hyperwriteai.com).

## Demo

https://github.com/OthersideAI/self-operating-computer/assets/42594239/9e8abc96-c76a-46fb-9b13-03678b3c67e0

## Run `Self-Operating Computer`

1. **Install the project**
   ```
   pip install self-operating-computer
   ```
2. **Run the project**
   ```
   operate
   ```
3. **Enter your OpenAI Key**: If you don't have one, you can obtain an OpenAI key [here](https://platform.openai.com/account/api-keys)
4. **Give the Terminal app the required permissions**: As a last step, macOS will prompt you to grant the Terminal app "Screen Recording" and "Accessibility" permissions in the "Security & Privacy" pane of "System Preferences".
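The `operate` command asks for the key interactively. A hypothetical helper illustrating how the key could be resolved, assuming the conventional `OPENAI_API_KEY` environment variable is checked first (an assumption to verify against the project's docs, not its confirmed behavior):

```python
import os


def get_openai_key(prompt=input) -> str:
    """Return the OpenAI API key from OPENAI_API_KEY, or ask for it.

    Hypothetical sketch: OPENAI_API_KEY is the conventional variable
    name, not necessarily the one this framework reads.
    """
    key = os.environ.get("OPENAI_API_KEY", "").strip()
    if not key:
        key = prompt("Enter your OpenAI API key: ").strip()
    return key
```

Reading the key from the environment avoids re-typing it on every run.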
### Alternative installation with `.sh`

1. **Clone the repo** to a directory on your computer:
   ```
   git clone https://github.com/OthersideAI/self-operating-computer.git
   ```
2. **Cd into directory**:
   ```
   cd self-operating-computer
   ```
3. **Run the installation script**:
   ```
   ./run.sh
   ```

## Using `operate` Modes

### Multimodal Models `-m`

An additional model is now compatible with the Self-Operating Computer Framework. Try Google's `gemini-pro-vision` by following the instructions below.

Start `operate` with the Gemini model:

```
operate -m gemini-pro-vision
```

**Enter your Google AI Studio API key when the terminal prompts you for it.** If you don't have one, you can obtain a key [here](https://makersuite.google.com/app/apikey) after setting up your Google AI Studio account. You may also need to [authorize credentials for a desktop application](https://ai.google.dev/palm_docs/oauth_quickstart). It took me a bit of time to get it working; if anyone knows a simpler way, please make a PR.

### Set-of-Mark Prompting `-m gpt-4-with-som`

The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting with the `gpt-4-with-som` command. This new visual prompting method enhances the visual grounding capabilities of large multimodal models. Learn more about SoM Prompting in the detailed arXiv paper [here](https://arxiv.org/abs/2310.11441).

For this initial version, a simple YOLOv8 model is trained for button detection, and the `best.pt` file is included under `model/weights/`. Users are encouraged to swap in their own `best.pt` file to evaluate performance improvements. If your model outperforms the existing one, please contribute by creating a pull request (PR).

Start `operate` with the SoM model:

```
operate -m gpt-4-with-som
```

### Voice Mode `--voice`

The framework supports voice inputs for the objective. Try voice by following the instructions below.
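Taken together, the mode flags above (`-m` for model selection, `--voice` for voice input) could be parsed with a sketch like the following. This is a hypothetical illustration of the CLI surface described in this README; the real `operate` entry point may be structured differently:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical sketch of the `operate` CLI flags described above."""
    parser = argparse.ArgumentParser(prog="operate")
    parser.add_argument("-m", "--model", default="gpt-4",
                        help="multimodal model to use, e.g. "
                             "gemini-pro-vision or gpt-4-with-som")
    parser.add_argument("--voice", action="store_true",
                        help="take the objective via voice input")
    return parser


args = build_parser().parse_args(["-m", "gemini-pro-vision", "--voice"])
print(args.model, args.voice)  # gemini-pro-vision True
```

Each mode then only changes which model client handles the screenshot-to-action step; the surrounding click-and-type loop stays the same.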
**Clone the repo** to a directory on your computer:

```
git clone https://github.com/OthersideAI/self-operating-computer.git
```

**Cd into directory**:

```
cd self-operating-computer
```

**Install the additional `requirements-audio.txt`**:

```
pip install -r requirements-audio.txt
```

**Install device requirements**

For Mac users:

```
brew install portaudio
```

For Linux users:

```
sudo apt install portaudio19-dev python3-pyaudio
```

Run with voice mode:

```
operate --voice
```

## Contributions are Welcomed!

If you want to contribute yourself, see [CONTRIBUTING.md](https://github.com/OthersideAI/self-operating-computer/blob/main/CONTRIBUTING.md).

## Feedback

For any input on improving this project, feel free to reach out to [Josh](https://twitter.com/josh_bickett) on Twitter.

## Join Our Discord Community

For real-time discussions and community support, join our Discord server.

- If you're already a member, join the discussion in [#self-operating-computer](https://discord.com/channels/877638638001877052/1181241785834541157).
- If you're new, first [join our Discord Server](https://discord.gg/YqaKtyBEzM) and then navigate to [#self-operating-computer](https://discord.com/channels/877638638001877052/1181241785834541157).

## Follow HyperWriteAI for More Updates

Stay updated with the latest developments:

- Follow HyperWriteAI on [Twitter](https://twitter.com/HyperWriteAI).
- Follow HyperWriteAI on [LinkedIn](https://www.linkedin.com/company/othersideai/).

## Compatibility

- This project is compatible with macOS, Windows, and Linux (with an X server installed).

## OpenAI Rate Limiting Note

The `gpt-4-vision-preview` model is required. To unlock access to this model, your account needs to have spent at least $5 in API credits. Pre-paying for these credits will unlock access if you haven't already spent the minimum $5. Learn more **[here](https://platform.openai.com/docs/guides/rate-limits?context=tier-one)**.