This repository contains scripts for running evaluations and autograding on model outputs.
To evaluate and autograde DataFrame outputs:

```bash
python -m evals.autograde_dataframe --csv_path <path_to_csv> --output_path <path_to_output_csv>
```
Example:

```bash
python evals/autograde_df.py output/fireworks_ai__accounts__fireworks__models__qwq-32b/codeact/simple_qa_test_set/fireworks_ai__accounts__fireworks__models__qwq-32b__codeact__simple_qa_test_set__trial1.jsonl
```
This command loads the specified JSONL file of trial outputs and runs automated grading over the resulting DataFrame.
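For intuition, here is a minimal sketch of what such a grading pass could look like. The field names (`prediction`, `true_answer`) and the exact-match grading rule are illustrative assumptions, not the repository's actual schema or grader:

```python
import json

import pandas as pd


def autograde_jsonl(jsonl_path: str) -> pd.DataFrame:
    """Load trial records from a JSONL file and attach a naive correctness grade.

    The field names below are assumptions for illustration; check the actual
    record schema in your output files before relying on this.
    """
    records = []
    with open(jsonl_path) as f:
        for line in f:
            records.append(json.loads(line))
    df = pd.DataFrame(records)
    # Naive exact-match grading; a real grader would likely use fuzzier
    # matching or an LLM judge.
    df["is_correct"] = (
        df["prediction"].str.strip().str.lower()
        == df["true_answer"].str.strip().str.lower()
    )
    return df
```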
To run evaluations on a dataset with parallel processing:

```bash
python ./evals/eval_tasks.py --parallel-workers=8 --num-trials=1 --eval-tasks=./evals/datasets/frames_test_set.csv ./evals/datasets/simple_qa_test_set.csv
```
Parameters:

- `--date`: Optional date for the evaluation
- `--eval-tasks`: List of paths to CSV files containing evaluation tasks (default: `["./evals/datasets/frames_test_set.csv", "./evals/datasets/simple_qa_test_set.csv"]`)
- `--search-model-id`: Model ID for the search tool (default: `"fireworks_ai/accounts/fireworks/models/llama-v3p3-70b-instruct"`)
- `--model-type`: Type of model to use, either `"LiteLLMModel"` or `"HfApiModel"` (default: `"LiteLLMModel"`)
- `--model-id`: ID of the model to use (default: `"fireworks_ai/accounts/fireworks/models/qwq-32b"`)
- `--agent-action-type`: Type of agent action: `"codeact"`, `"tool-calling"`, or `"vanilla"` (default: `"codeact"`)
- `--parallel-workers`: Number of parallel workers to use (default: 8)
- `--num-trials`: Number of evaluation trials to run (default: 1)

The results will be saved as a DataFrame in the `evals` directory.
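As a rough sketch of how `--parallel-workers` and `--num-trials` could interact, the loop below fans each task out across a worker pool. The `run_single_task` helper and the `question` column name are hypothetical stand-ins, not the repository's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def run_single_task(question: str, model_id: str) -> dict:
    # Hypothetical placeholder: the real script would invoke the chosen
    # agent (codeact / tool-calling / vanilla) here and return its answer.
    return {"question": question, "answer": "..."}


def run_eval(
    task_csv: str,
    model_id: str,
    parallel_workers: int = 8,
    num_trials: int = 1,
) -> pd.DataFrame:
    # Assumes the task CSV has a "question" column; adjust to the real schema.
    tasks = pd.read_csv(task_csv)
    # Repeat every task once per trial, then dispatch across the pool.
    jobs = [q for _ in range(num_trials) for q in tasks["question"]]
    with ThreadPoolExecutor(max_workers=parallel_workers) as pool:
        results = list(pool.map(lambda q: run_single_task(q, model_id), jobs))
    return pd.DataFrame(results)
```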
Evaluation results are stored in the `evals/` directory.
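Since each trial is written as a JSONL file, the saved results can be inspected with pandas. The path below is copied from the example above; adjust it for your own model, agent, dataset, and trial:

```python
import pandas as pd

# Path taken from the autograding example earlier in this README.
path = (
    "output/fireworks_ai__accounts__fireworks__models__qwq-32b/codeact/"
    "simple_qa_test_set/fireworks_ai__accounts__fireworks__models__qwq-32b"
    "__codeact__simple_qa_test_set__trial1.jsonl"
)
df = pd.read_json(path, lines=True)  # each line is one trial record
print(df.head())
```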