🌐 Website | 🐦 Twitter | ✉️ Google Group | 📃 Paper
👋 Join our Slack for Q & A or collaboration on the next version of AgentBench!
You are now browsing AgentBench v0.2. If you wish to use the older version, you can revert to v0.1.
Based on v0.1, we:
https://github.com/THUDM/AgentBench/assets/129033897/656eed6e-d9d9-4d07-b568-f43f5a451f04
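AgentBench is the first benchmark designed to evaluate LLM-as-Agent across a diverse spectrum of environments. It encompasses 8 distinct environments to provide a more comprehensive evaluation of LLMs' ability to operate as autonomous agents in various scenarios. These environments include 5 freshly created domains, namely Operating System (OS), Database (DB), Knowledge Graph (KG), Digital Card Game (DCG), and Lateral Thinking Puzzles (LTP), as well as 3 recompiled from published datasets: House-Holding (HH, ALFWorld), Web Shopping (WS, WebShop), and Web Browsing (WB, Mind2Web).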
We offer two splits for each dataset: Dev and Test. The multi-turn interaction requires an LLM to generate around 4k and 13k times for the Dev and Test splits, respectively.
Here are the scores on the standard test set of AgentBench.
While LLMs are beginning to show proficiency as agents, the gaps between models and the distance to practical usability remain significant.
This section will guide you on how to quickly use gpt-3.5-turbo-0613 as an agent to launch the dbbench-std and os-std tasks.
For the specific framework structure, please refer to Framework Introduction.
For more detailed configuration and launch methods, please check Configuration Guide and Program Entrance Guide.
Clone this repo and install the dependencies.
cd AgentBench
conda create -n agent-bench python=3.9
conda activate agent-bench
pip install -r requirements.txt
Ensure that Docker is properly installed.
docker ps
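The same check can be scripted before a longer setup run. The sketch below is only an illustration: the helper name `docker_ready` is hypothetical and not part of AgentBench. It assumes that a zero exit code from `docker ps` means the Docker CLI is installed and the daemon is reachable.

```python
import shutil
import subprocess


def docker_ready(which=shutil.which, run=subprocess.run):
    """Return True if the docker CLI is on PATH and `docker ps` exits with 0.

    `which` and `run` are injectable so the check can be exercised
    without a real Docker installation (hypothetical helper, not AgentBench API).
    """
    if which("docker") is None:
        return False  # docker CLI not found on PATH
    try:
        result = run(["docker", "ps"], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False  # CLI could not be executed, or the daemon did not answer in time
    return result.returncode == 0
```

Calling `docker_ready()` with no arguments runs the real `docker ps`. A `False` result usually means Docker is not installed, the daemon is not running, or the current user lacks permission to talk to it.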
Build required images for dbbench-std and os-std.
docker pull mysql
docker pull ubuntu
docker build -f data/os_interaction/res/dockerfiles/default data/os_interaction/res/dockerfiles --tag local-os/default
docker build -f data/os_interaction/res/dockerfiles/packages data/os_interaction/res/dockerfiles --tag local-os/packages
docker build -f data/os_interaction/res/dockerfiles/ubuntu data/os_interaction/res/dockerfiles --tag local-os/ubuntu
Fill in your OpenAI API key at the correct location in configs/agents/openai-chat.yaml. (e.g. gpt-3.5-turbo-0613)
You can try using python -m src.client.agent_test to check if your agent is configured correctly.
By default, gpt-3.5-turbo-0613 will be started. You can replace it with other agents by modifying the parameters:
python -m src.client.agent_test --config configs/agents/api_agents.yaml --agent gpt-3.5-turbo-0613
Each task worker is started for a specific task. Starting them manually can be cumbersome, so we provide an automated script.
The assumption for this step is that ports from 5000 to 5015 are available.
python -m src.start_task -a
This will launch five task_workers each for the dbbench-std and os-std tasks and automatically connect them to the controller on port 5000. After executing this command, please allow approximately 1 minute for the task setup to complete.
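The port assumption above (5000 to 5015 free) can be verified before launching the workers. The following is a minimal sketch: the helper `busy_ports` is hypothetical, and it assumes that failing to bind a TCP socket on `127.0.0.1` means the port is taken. The actual workers may bind to a different interface.

```python
import socket


def busy_ports(start=5000, end=5015, host="127.0.0.1"):
    """Return the ports in [start, end] that cannot be bound on `host`."""
    busy = []
    for port in range(start, end + 1):
        # Each probe socket is closed again as soon as the `with` block exits.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                busy.append(port)  # already in use, or binding is not permitted
    return busy
```

An empty list from `busy_ports()` suggests the range is free. Any port it reports should be released before running `python -m src.start_task -a`.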
This step is to actually start the tasks.
If everything is correctly configured so far, you can now initiate the task tests.
python -m src.assigner
If you wish to launch more tasks or use other models, you can refer to the content in Configuration Guide and Program Entrance Guide.
For the environment of the remaining five tasks, you will need to download the Docker images we provide.
longinyu/agentbench-ltp
longinyu/agentbench-webshop
longinyu/agentbench-mind2web
longinyu/agentbench-card_game
longinyu/agentbench-alfworld
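One way to fetch all five images in a single pass is sketched below. The image names are the ones listed above. The helpers `pull_commands` and `pull_all` are hypothetical conveniences, not part of AgentBench, and the sketch assumes the images are pulled with a plain `docker pull` of the default tag.

```python
import subprocess

# Image names as listed above.
IMAGES = [
    "longinyu/agentbench-ltp",
    "longinyu/agentbench-webshop",
    "longinyu/agentbench-mind2web",
    "longinyu/agentbench-card_game",
    "longinyu/agentbench-alfworld",
]


def pull_commands(images=IMAGES):
    """Build one `docker pull` argument list per image."""
    return [["docker", "pull", image] for image in images]


def pull_all(images=IMAGES, run=subprocess.run):
    """Pull each image in turn; raises CalledProcessError on the first failure."""
    for command in pull_commands(images):
        run(command, check=True)
```

Calling `pull_all()` downloads the images sequentially, which can take a while.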
The resource consumption of a single task_worker for the eight tasks is roughly as follows; consider this when launching:
Task Name | Start-up Speed | Memory Consumption |
---|---|---|
webshop | ~3min | ~15G |
mind2web | ~5min | ~1G |
db | ~20s | < 500M |
alfworld | ~10s | < 500M |
card_game | ~5s | < 500M |
ltp | ~5s | < 500M |
os | ~5s | < 500M |
kg | ~5s | < 500M |
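The table can be turned into a rough memory budget before deciding how many workers to start. The sketch below covers only a subset of the rows and treats "< 500M" as an upper bound of 0.5 GB. The function `estimate_memory_gb` is a hypothetical helper, and the result is a coarse estimate rather than a measurement.

```python
# Approximate upper-bound memory per task_worker in GB, taken from a subset
# of the table rows ("< 500M" is treated as 0.5 GB).
MEMORY_GB = {
    "webshop": 15.0,
    "mind2web": 1.0,
    "db": 0.5,
    "os": 0.5,
}


def estimate_memory_gb(workers, table=MEMORY_GB):
    """Sum the per-worker memory for a mapping of task name -> worker count."""
    return sum(table[task] * count for task, count in workers.items())


# Five db workers and five os workers, as in the quick start: 5 * 0.5 + 5 * 0.5
print(estimate_memory_gb({"db": 5, "os": 5}))  # 5.0
```

A single webshop worker (about 15 GB) needs three times as much memory as the ten quick-start workers combined, so it is worth budgeting for it separately.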
@article{liu2023agentbench,
title = {AgentBench: Evaluating LLMs as Agents},
author = {Xiao Liu and Hao Yu and Hanchen Zhang and Yifan Xu and Xuanyu Lei and Hanyu Lai and Yu Gu and Hangliang Ding and Kaiwen Men and Kejuan Yang and Shudan Zhang and Xiang Deng and Aohan Zeng and Zhengxiao Du and Chenhui Zhang and Sheng Shen and Tianjun Zhang and Yu Su and Huan Sun and Minlie Huang and Yuxiao Dong and Jie Tang},
year = {2023},
journal = {arXiv preprint arXiv: 2308.03688}
}