🌐 Website | 🐦 Twitter | ✉️ Google Group | 📃 Paper

👋 Join our Slack for Q & A or collaboration on AgentBench v2.0!

AgentBench: Evaluating LLMs as Agents

https://github.com/THUDM/AgentBench/assets/129033897/656eed6e-d9d9-4d07-b568-f43f5a451f04

AgentBench is the first benchmark designed to evaluate LLM-as-Agent across a diverse spectrum of environments. It encompasses 8 distinct environments to provide a more comprehensive evaluation of LLMs' ability to operate as autonomous agents in various scenarios. These environments include 5 freshly created domains, namely

  • Operating System (OS)
  • Database (DB)
  • Knowledge Graph (KG)
  • Digital Card Game (DCG)
  • Lateral Thinking Puzzles (LTP)

as well as 3 recompiled from published datasets:

  • House-Holding (HH, based on ALFWorld)
  • Web Shopping (WS, based on WebShop)
  • Web Browsing (WB, based on Mind2Web)

Dataset Summary

We offer three splits for each dataset: Dev, Test, and Extend. Dev is fully public, while Test and Extend are kept private. In designing AgentBench, we balance evaluation thoroughness and efficiency. Although the Dev and Test sets may seem small at 289 and 1,141 problems respectively, the multi-turn interactions require an LLM to generate around 4k and 13k responses in total (roughly 11–14 model calls per problem on average), which makes testing time-consuming.

Leaderboard

Here are the scores on the test set (standard) of AgentBench.

While LLMs have begun to demonstrate their proficiency as agents, the gaps between models and their distance from practical usability remain significant.

Xlsx-format leaderboard data is available here.

Quick Start

To quickly understand how the framework works, you can follow the instructions below to run a simple evaluation.

Step 1. Clone this repo and run the following command to install the requirements:

pip install --upgrade pip
pip install -r requirements.txt
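
If you have not cloned the repository yet, a typical sequence might look like the following, assuming the upstream repository at https://github.com/THUDM/AgentBench (substitute a mirror URL if you use one), followed by the pip commands above:

git clone https://github.com/THUDM/AgentBench.git
cd AgentBench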

Step 2. Verify that you have successfully installed the requirements by running the following command:

python eval.py \
    --task configs/tasks/example.yaml \
    --agent configs/agents/do_nothing.yaml

Step 3. Run Example Assignment

HINT: The Example Assignment is composed of gpt-3.5-turbo and the ExampleTask defined in src/tasks/example_task.py.

You need to fill in your OpenAI API key in configs/assignments/example.yaml first:

Authorization: Bearer <%% PUT-YOUR-OPENAI-KEY-HERE %%>
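
For example, after filling in a key, that line might read as follows (the key below is a made-up placeholder, not a real credential):

Authorization: Bearer sk-xxxxxxxxxxxxxxxxxxxxxxxx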

Then run the following command:

python create_assignment.py \
    --assignment configs/assignments/example.yaml

The output will point you to the generated assignment bash script, for example:

[System] Run the following command to start evaluation:
    bash .assignments/<TIMESTAMP>.sh

Finally, run the assignment bash script shown in the output to start the evaluation. Afterwards, you can check your results in the outputs folder.
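
Putting the quick-start steps together, an end-to-end run might look like the sketch below; the timestamped script name is illustrative, so use the path actually printed by create_assignment.py:

# 1. Install the requirements
pip install --upgrade pip
pip install -r requirements.txt

# 2. Verify the installation with the do-nothing agent
python eval.py \
    --task configs/tasks/example.yaml \
    --agent configs/agents/do_nothing.yaml

# 3. Fill in your OpenAI key in configs/assignments/example.yaml, then generate and run the example assignment
python create_assignment.py \
    --assignment configs/assignments/example.yaml
bash .assignments/<TIMESTAMP>.sh   # use the script path printed by create_assignment.py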

Tutorial

For more detailed instructions and advanced usage, please refer to our tutorial.

Citation

@article{liu2023agentbench,
  title   = {AgentBench: Evaluating LLMs as Agents},
  author  = {Xiao Liu and Hao Yu and Hanchen Zhang and Yifan Xu and Xuanyu Lei and Hanyu Lai and Yu Gu and Hangliang Ding and Kaiwen Men and Kejuan Yang and Shudan Zhang and Xiang Deng and Aohan Zeng and Zhengxiao Du and Chenhui Zhang and Sheng Shen and Tianjun Zhang and Yu Su and Huan Sun and Minlie Huang and Yuxiao Dong and Jie Tang},
  year    = {2023},
  journal = {arXiv preprint arXiv:2308.03688}
}
