# olmes
**Repository Path**: mirrors_allenai/olmes
## Basic Information
- **Project Name**: olmes
- **Description**: Reproducible, flexible LLM evaluations
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-11-21
- **Last Updated**: 2026-04-25
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
## Open Language Model Evaluation System (OLMES)
The OLMES (Open Language Model Evaluation System) repository is used within [Ai2](https://allenai.org)'s Open
Language Model efforts to evaluate base and
instruction-tuned LLMs on a range of tasks.
more details
The repository includes code to faithfully reproduce the evaluation results:
* **OLMo 3:** TBD Title ([TBD Citation](...))
* **OLMo 2:** 2 OLMo 2 Furious ([Team OLMo et al, 2024](https://arxiv.org/abs/2501.00656))
* **TÜLU 3:** Pushing Frontiers in Open Language Model Post-Training ([Lambert et al, 2024](https://www.semanticscholar.org/paper/T/%22ULU-3%3A-Pushing-Frontiers-in-Open-Language-Model-Lambert-Morrison/5ca8f14a7e47e887a60e7473f9666e1f7fc52de7))
* **OLMES:** A Standard for Language Model Evaluations ([Gu et al, 2024](https://www.semanticscholar.org/paper/c689c37c5367abe4790bff402c1d54944ae73b2a))
* **OLMo:** Accelerating the Science of Language Models ([Groeneveld et al, 2024](https://www.semanticscholar.org/paper/ac45bbf9940512d9d686cf8cd3a95969bc313570))
The code base uses helpful features from the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
by Eleuther AI, with a number of modifications and enhancements, including:
* Support deep configurations for variants of tasks
* Record more detailed data about instance-level predictions (logprobs, etc)
* Custom metrics and metric aggregations
* Integration with external storage options for results
### Setup
```sh
git clone https://github.com/allenai/olmes.git
cd olmes
# To install with uv:
uv sync
uv sync --group gpu # for vLLM support
# To install with pip:
pip install -e .
pip install -e ".[gpu]" # for vLLM support
```
## Usage
To evaluate a model and task (or task suite):
```bash
olmes \
--model allenai/OLMo-2-0425-1B \
--task arc_challenge::olmes \
--output-dir workspace
```
This will launch the standard [OLMES](https://www.semanticscholar.org/paper/c689c37c5367abe4790bff402c1d54944ae73b2a)
version of [ARC Challenge](https://www.semanticscholar.org/paper/88bb0a28bb58d847183ec505dda89b63771bb495)
(which uses a curated 5-shot example, trying both multiple-choice and cloze formulations, and reporting
the max) with [OLMo 2 1B](https://huggingface.co/allenai/OLMo-2-0425-1B), storing the output in `workspace`.
### More options
**Multiple tasks** can be specified after `--task`:
```bash
olmes \
--model allenai/OLMo-2-0425-1B \
--task arc_challenge::olmes hellaswag::olmes \
--output-dir workspace
```
**Inspect tasks.** You can sanity check using `--inspect`, which shows a sample prompt (and does 5-instance eval with a small `pythia`):
```bash
olmes --task arc_challenge:mc::olmes --inspect
```
**Dry run.** You can inspect the launch command with `--dry-run`:
```bash
olmes \
--model allenai/OLMo-2-0425-1B \
--task mmlu::olmes \
--output-dir workspace \
--dry-run
```
For a full list of arguments run `olmes --help`.
## Running Eval Suites
We include full suites for our releases in [`task_suites.py`](oe_eval/configs/task_suites.py):
### Olmo 3 eval suite
To run all tasks specified in the [OLMo 3](.) technical report:
**Base Model Evaluation**
```bash
# Run the base easy evaluation (for evaluating small-scale experiments)
olmes \
--model allenai/Olmo-3-1025-7B \
--task \
olmo3:base_easy:code_bpb \
olmo3:base_easy:math_bpb \
olmo3:base_easy:qa_rc \
olmo3:base_easy:qa_bpb \
--output-dir workspace
# Run the base main evaluation
olmes \
--model allenai/Olmo-3-1025-7B \
--task \
olmo3:base:stem_qa_mc \
olmo3:base:nonstem_qa_mc \
olmo3:base:gen \
olmo3:base:math \
olmo3:base:code \
olmo3:base:code_fim \
--output-dir workspace
# Run the base held-out evaluation
olmes \
--model allenai/Olmo-3-1025-7B \
--task \
olmo3:heldout \
--output-dir workspace
```
**Instruct Model Evaluation**
```bash
# Run the instruct model evaluation
export OPENAI_API_KEY="..." # OpenAI models used for SimpleQA, AlpacaEval LLM Judge
olmes \
--model allenai/Olmo-3-1025-7B \
--task \
olmo3:adapt \
--output-dir workspace
```
Run safety evaluation
```bash
export OPEN_API_KEY="..." # An OpenAI API key
export HF_TOKEN="..." # Your huggingface token
oe-eval \
--model hf-safety-eval \
--task safety::olmo3 \
--task-args '{ "generation_kwargs": { "max_gen_toks": 2048, "truncate_context": false } }' \
--model-args '{"model_path":"allenai/Olmo-3-1025-7B", "max_length": 2048, "trust_remote_code": "true", "process_output": "r1_style"}'
# Note: For reasoning models, we use a max_gen_toks=32768, max_length=32768
```
### OLMo 2 eval suite
To run all tasks specified in the [OLMo 2](https://www.semanticscholar.org/paper/2-OLMo-2-Furious-OLMo-Walsh/685418141037ca44bb60b0b09d40fe521ec9e734) technical report:
```bash
olmes \
--model allenai/OLMo-2-1124-7B \
--task \
core_9mcqa::olmes \
mmlu:mc::olmes \
olmo_2_generative::olmes \
olmo_2_heldout::olmes \
--output-dir workspace
```
### TÜLU 3 eval suite
To run all tasks used in the [TÜLU 3](https://www.semanticscholar.org/paper/T/%22ULU-3%3A-Pushing-Frontiers-in-Open-Language-Model-Lambert-Morrison/5ca8f14a7e47e887a60e7473f9666e1f7fc52de7)
paper:
```bash
olmes \
--model allenai/OLMo-2-1124-7B \
--task \
tulu_3_dev \
tulu_3_unseen \
--output-dir workspace
```
### OLMES: Standard 10 MCQA tasks
To run all 10 multiple-choice tasks from the [OLMES](https://www.semanticscholar.org/paper/c689c37c5367abe4790bff402c1d54944ae73b2a) paper:
```bash
olmes \
--model allenai/OLMo-2-1124-7B \
--task \
core_9mcqa::olmes \
mmlu::olmes \
--output-dir workspace
```
### OLMo eval suite
To reproduce numbers in the [OLMo](https://www.semanticscholar.org/paper/ac45bbf9940512d9d686cf8cd3a95969bc313570) paper:
```bash
olmes \
--model allenai/OLMo-7B-hf \
--task \
main_suite::olmo1 \
mmlu::olmo1 \
--output-dir workspace
```
## New Models / Tasks
Configure new models
### Model configuration
Models can be directly referenced by their Huggingface model path, e.g., `--model allenai/OLMoE-1B-7B-0924`,
or by their key in the [model library](oe_eval/configs/models.py), e.g., `--model olmoe-1b-7b-0924` which
can include additional configuration options (such as `max_length` for max context size and `model_path` for
local path to model).
The default model type uses the Huggingface model implementations, but you can also use the `--model-type vllm` flag to use
the vLLM implementations for models that support it, as well as `--model-type litellm` to run API-based models.
You can specify arbitrary JSON-parse-able model arguments directly in the command line as well, e.g.
```bash
olmes --model google/gemma-2b --model-args '{"trust_remote_code": true, "add_bos_token": true}' ...
```
To see a list of available models, run `oe-eval --list-models`, for a list of models containing a certain phrase,
you can follow this with a substring (any regular expression), e.g., `oe-eval --list-models llama`.
Configure new tasks
### Task configuration
To specify a task, use the [task library](oe_eval/configs/tasks.py) which have
entries like
```json
"arc_challenge:rc::olmes": {
"task_name": "arc_challenge",
"split": "test",
"primary_metric": "acc_uncond",
"num_shots": 5,
"fewshot_source": "OLMES:ARC-Challenge",
"metadata": {
"regimes": ["OLMES-v0.1"],
},
},
```
Each task can also have custom entries for `context_kwargs` (controlling details of the prompt),
`generation_kwargs` (controlling details of the generation), and `metric_kwargs` (controlling details of the metrics). The `primary_metric` indicates which metric field will be reported as the "primary score" for the task.
The task configuration parameters can be overridden on the command line, these will generally apply to all tasks, e.g.,
```bash
olmes --task arc_challenge:rc::olmes hellaswag::rc::olmes --split dev ...
```
but using a json format for each task, can be on per-task (but it's generally better to use the task
library for this), e.g.,
```bash
olmes --task '{"task_name": "arc_challenge:rc::olmes", "num_shots": 2}' '{"task_name": "hellasag:rc::olmes", "num_shots": 4}' ...
```
For complicated commands like this, using `--dry-run` can be helpful to see the full command before running it.
To see a list of available tasks, run `oe-eval --list-tasks`, for a list of tasks containing a certain phrase,
you can follow this with a substring (any regular expression), e.g., `oe-eval --list-tasks arc`.
### Task suite configurations
To define a suite of tasks to run together, use the [task suite library](oe_eval/configs/task_suites.py),
with entries like:
```python
TASK_SUITE_CONFIGS["mmlu:mc::olmes"] = {
"tasks": [f"mmlu_{sub}:mc::olmes" for sub in MMLU_SUBJECTS],
"primary_metric": "macro",
}
```
specifying the list of tasks as well as how the metrics should be aggregated across the tasks.
## Evaluation output
Results are stored in the `--output-dir ...` directory. See [`OUTPUT_FORMATS.md`](OUTPUT_FORMATS.md) for more details. Other formats include:
- **Google Sheets** -- Save outputs to a Google Sheet by `--gsheet` (with authentication
stored in env var `GDRIVE_SERVICE_ACCOUNT_JSON`).
- **HF** -- Store in a Huggingface dataset directory by specifying `--hf-save-dir`
- **S3** -- Store in a remote directory (like `s3://...`) by specifying `--remote-output-dir`
- **Wandb** -- Store in a W&B project by specifying `--wandb-run-path`