🌐 Website | 🐦 Twitter | ✉️ Google Group | 📃 Paper
👋 Join our Slack for Q & A or collaboration on AgentBench v2.0!
https://github.com/THUDM/AgentBench/assets/129033897/656eed6e-d9d9-4d07-b568-f43f5a451f04
AgentBench is the first benchmark designed to evaluate LLM-as-Agent across a diverse spectrum of environments. It encompasses 8 distinct environments to provide a comprehensive evaluation of LLMs' ability to operate as autonomous agents in various scenarios. These environments include 5 freshly created domains, namely
as well as 3 recompiled from published datasets:
We offer three splits for each dataset: Dev, Test, and Extend. Dev is fully public, while Test and Extend are private. In designing AgentBench, we balance evaluation thoroughness and efficiency. Though the number of problems in Dev and Test may seem small at 289 and 1,141, the multi-turn interaction requires an LLM to generate around 4k and 13k responses respectively (roughly a dozen model calls per problem on average), making testing time-consuming.
Here are the Test set (standard) results of AgentBench.
While LLMs are beginning to demonstrate proficiency as agents, the gaps between models and the distance to practical usability remain significant.
Xlsx-format leaderboard data is available here.
To quickly understand how the framework works, you can follow the instructions below to run a simple evaluation.
pip install --upgrade pip
pip install -r requirements.txt
python eval.py \
--task configs/tasks/example.yaml \
--agent configs/agents/do_nothing.yaml
HINT: The example assignment is composed of gpt-3.5-turbo and ExampleTask, defined in src/tasks/example_task.py.
You need to fill in your OpenAI API key in configs/assignments/example.yaml first:
Authorization: Bearer <%% PUT-YOUR-OPENAI-KEY-HERE %%>
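Before wiring the key into the config, you can optionally sanity-check that it is accepted by the OpenAI API. The snippet below is our own suggestion, not part of AgentBench; it assumes the key is exported as OPENAI_API_KEY and that the requests package is installed.

# Optional sanity check (not part of AgentBench): confirm the key you are
# about to paste into configs/assignments/example.yaml is accepted.
import os
import requests

openai_key = os.environ.get("OPENAI_API_KEY", "PUT-YOUR-OPENAI-KEY-HERE")

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {openai_key}",  # same header format as the config
    },
    json={
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
    },
    timeout=30,
)
print(resp.status_code)  # 200 means the key works; 401 means it was rejected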
Then run the following command:
python create_assignment.py \
--assignment configs/assignments/example.yaml
The output will show the path to the generated assignment bash script, like this:
[System] Run the following command to start evaluation:
bash .assignments/<TIMESTAMP>.sh
Finally, run the assignment bash script shown in the output to start the evaluation. After that, you can check your results in the outputs folder.
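If you are unsure what a run produced, a quick way to inspect it is to walk the outputs folder. The snippet below is only an illustration; the exact file layout depends on your assignment configuration.

# Illustrative only: list whatever the evaluation wrote under outputs/.
# The exact layout depends on your assignment configuration.
from pathlib import Path

for path in sorted(Path("outputs").rglob("*")):
    if path.is_file():
        print(path, f"({path.stat().st_size} bytes)")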
For more detailed instructions and advanced usage, please refer to our tutorial.
@article{liu2023agentbench,
title = {AgentBench: Evaluating LLMs as Agents},
author = {Xiao Liu and Hao Yu and Hanchen Zhang and Yifan Xu and Xuanyu Lei and Hanyu Lai and Yu Gu and Hangliang Ding and Kaiwen Men and Kejuan Yang and Shudan Zhang and Xiang Deng and Aohan Zeng and Zhengxiao Du and Chenhui Zhang and Sheng Shen and Tianjun Zhang and Yu Su and Huan Sun and Minlie Huang and Yuxiao Dong and Jie Tang},
year = {2023},
journal = {arXiv preprint arXiv:2308.03688}
}