# Judges-Verdict

**Repository Path**: mirrors_NVIDIA/Judges-Verdict

## Basic Information

- **Project Name**: Judges-Verdict
- **Description**: Judge's Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-10-10
- **Last Updated**: 2026-03-29

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Judge's Verdict

[![Python Version](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/license-Apache%202.0-green.svg)](LICENSE)
[![Code Style](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Dataset](https://img.shields.io/badge/dataset-HuggingFace-yellow.svg)](https://huggingface.co/datasets/nvidia/judges-verdict)

A comprehensive framework for evaluating Large Language Models (LLMs) as judges for assessing the quality of AI-generated responses.

## Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Installation](#installation)
- [Project Structure](#project-structure)
- [Quick Start](#quick-start)
- [Configuration](#configuration)
- [Command Line Options](#command-line-options)
- [Troubleshooting](#troubleshooting)
- [Extending the Framework](#extending-the-framework)
- [Development](#development)
- [Contributing](#contributing)

## Overview

This repository contains the evaluation pipeline for running Judge's Verdict on the `nvidia/judges-verdict` HuggingFace dataset. The framework uses LiteLLM as the officially supported approach for interfacing with various LLM providers through a unified API.

**Supported Providers via LiteLLM:**

- OpenAI (GPT-4, GPT-4o, etc.)
- Anthropic (Claude models)
- NVIDIA NIM (Llama, Nemotron, Gemma, etc.)
- Local models (via vLLM or other compatible servers)
- Many other providers (see [LiteLLM docs](https://docs.litellm.ai/docs/providers))

## Features

- **Flexible Judge Configuration**: YAML-based configuration system for managing multiple judge models
- **Multiple Serving Frameworks**: Support for various LLM serving backends
- **Annotation Preparation**: Scripts for preparing and merging human annotations
- **Parallel Processing**: Efficient evaluation with configurable workers
- **Extensible Architecture**: Easy to add new judge models and metrics

## Installation

```bash
# Clone the repository
git clone https://github.com/NVIDIA/Judges-Verdict.git
cd Judges-Verdict

# Install using Poetry (recommended)
poetry install

# Or install with pip
pip install -e .
```

## Project Structure

```
Judges-Verdict/
├── llm_judge_benchmark/          # Core package
│   ├── scoring/                  # Judge scoring implementation
│   ├── utils/                    # Utility functions
│   ├── judge_config.py           # Judge configuration classes
│   └── judge_config_manager.py   # YAML config management
├── scripts/                      # Standalone scripts
│   └── data_prep.py              # Data preparation script (fetches from HuggingFace)
├── config/                       # Configuration files
│   ├── judge_config_litellm.yaml
│   └── judge_config_litellm_example.yaml
├── data/                         # Data directory
│   ├── annotations_sample.json   # Sample annotations
│   └── evaluation.json           # Merged evaluation dataset (created by data_prep.py)
├── results/                      # Evaluation results
└── docs/                         # Documentation
```

## Quick Start

### 1. Install dependencies

```bash
# Using Poetry (recommended)
poetry install

# Or using pip
pip install -e .
```

### 2. Set up API keys

The project uses the LiteLLM framework as the officially supported approach for judge scoring, which provides a unified interface for multiple LLM providers.

```bash
# For OpenAI models (gpt-4o, gpt-4, etc.)
export OPENAI_API_KEY="your-openai-api-key"

# For Anthropic models (claude-sonnet-4, etc.)
export ANTHROPIC_API_KEY="your-anthropic-api-key"

# For NVIDIA NIM models (llama, nemotron, gemma, etc.)
export NVIDIA_NIM_API_KEY="your-nvidia-nim-api-key"
export NVIDIA_NIM_API_BASE="https://integrate.api.nvidia.com/v1"  # or your custom endpoint

# For other providers supported by LiteLLM, see:
# https://docs.litellm.ai/docs/providers
```

Note: The LiteLLM framework automatically handles provider-specific authentication and API formatting.

### 3. Prepare the data

```bash
# Run the data preparation script
python scripts/data_prep.py
```

This will:

- Download the `nvidia/judges-verdict` dataset from HuggingFace
- Process the train split (200 techqa entries, annotations removed)
- Process the test split (1,794 entries across 6 datasets):
  - **coral** (318 entries): Joins with the CORAL dataset from HuggingFace
  - **dc767** (347 entries): Joins with the DC767 CSV from GitHub
  - **enterprise_rag_benchmark** (346 entries): Used as-is
  - **hotpotqa** (342 entries): Used as-is
  - **squad** (346 entries): Used as-is
  - **techqa** (95 entries): Used as-is
- Merge all datasets into `data/evaluation.json` (1,994 total entries)

### 4. Run Judge Scoring

```bash
# Using the installed script
poetry run llm-judge-score \
    --evaluation-file data/evaluation.json \
    --output-dir results/

# Or run directly
python -m llm_judge_benchmark.scoring.llm_judge_scoring \
    --evaluation-file data/evaluation.json \
    --config config/judge_config_litellm.yaml \
    --output-dir results/
```

## Configuration

### Judge Configuration

Judges are configured using YAML files with LiteLLM as the primary framework.
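To illustrate how a configured judge maps to results on disk, here is a minimal sketch in Python. The `JudgeConfig` dataclass and `results_dir` helper are illustrative, not the repository's actual API, and the default values for `max_tokens` and `timeout` are assumptions:

```python
# Sketch: how a judge entry maps to a results folder.
# JudgeConfig and results_dir are illustrative, not the repo's API;
# the max_tokens and timeout defaults below are assumptions.
from dataclasses import dataclass
from pathlib import Path


@dataclass
class JudgeConfig:
    identifier: str            # also used as the results folder name
    framework: str             # always "litellm" in this repo
    model: str                 # LiteLLM-format model name
    temperature: float = 0.0   # 0.0 for consistency
    max_tokens: int = 1024     # assumed default
    num_workers: int = 16      # matches --default-workers default
    timeout: int = 120         # assumed default, in seconds


def results_dir(judge: JudgeConfig, output_dir: str = "results") -> Path:
    # The identifier doubles as the per-judge results folder.
    return Path(output_dir) / judge.identifier


judge = JudgeConfig(identifier="gpt-4o", framework="litellm", model="openai/gpt-4o")
print(results_dir(judge).as_posix())  # results/gpt-4o
```

The fields mirror the keys listed below; an entry named `gpt-4o` writes its outputs under `results/gpt-4o/`.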
Each judge requires:

- `identifier`: Unique identifier for the judge (also used as the folder name for that judge's evaluation results)
- `framework`: Always use `litellm` for the officially supported approach
- `model`: Model name in LiteLLM format (e.g., `openai/gpt-4o`, `anthropic/claude-sonnet-4-20250514`, `nvidia_nim/meta/llama-3.1-70b-instruct`)
- `temperature`: Temperature setting (usually 0.0 for consistency)
- `max_tokens`: Maximum tokens for the response
- `num_workers`: Number of parallel workers
- `timeout`: Request timeout in seconds

The model identifiers in the configuration YAML become the folder names where results are stored. For example:

```yaml
models:
  gpt-4o:  # This creates results in results/gpt-4o/
    framework: litellm
    model: openai/gpt-4o
  meta_llama-3.1-70b-instruct:  # This creates results in results/meta_llama-3.1-70b-instruct/
    framework: litellm
    model: nvidia_nim/meta/llama-3.1-70b-instruct
```

### Local Model Configuration

Local models are also supported by the `litellm` framework. When configuring a local model, you need to specify:

- `framework: litellm`
- `model`: The model name in LiteLLM format (e.g., `hosted_vllm/...`)
- `base_url`: The URL of your local model server
- `api_key`: Usually set to `"EMPTY"` for local models

Example configuration for a local Qwen model:

```yaml
local/qwen-0.5b:
  framework: litellm
  model: hosted_vllm/Qwen/Qwen2-0.5B-Instruct
  base_url: http://localhost:8000/v1
  api_key: EMPTY
  num_workers: 1
```

This assumes you have a vLLM server running locally on port 8000 serving the Qwen2-0.5B-Instruct model.

The repository includes configuration files for the LiteLLM framework:

- `config/judge_config_litellm.yaml`: Main configuration file with all supported judge models using LiteLLM
- `config/judge_config_litellm_example.yaml`: Minimal example configuration with verified working models

### Environment Variables

The framework requires API keys to be set as environment variables.
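As a quick sanity check before a long scoring run, a small script can verify that the variables for a judge's provider are actually set. The variable names below follow the Quick Start section; the prefix-to-variable mapping and the `missing_env_vars` helper are illustrative, not part of the repository:

```python
# Sketch: check that a judge model's provider credentials are set.
# Variable names follow the Quick Start section; this helper is
# illustrative and not part of the repo.
import os

PROVIDER_ENV_VARS = {
    "openai/": ["OPENAI_API_KEY"],
    "anthropic/": ["ANTHROPIC_API_KEY"],
    "nvidia_nim/": ["NVIDIA_NIM_API_KEY", "NVIDIA_NIM_API_BASE"],
}


def missing_env_vars(model: str, env=os.environ) -> list[str]:
    """Return any unset variables required by the model's provider prefix."""
    for prefix, names in PROVIDER_ENV_VARS.items():
        if model.startswith(prefix):
            return [name for name in names if not env.get(name)]
    return []  # unknown or local provider: nothing to check here


print(missing_env_vars("openai/gpt-4o", env={}))  # ['OPENAI_API_KEY']
```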
See the Quick Start section above for details on setting up API keys for different providers.

### LiteLLM Model Name Format

When configuring models in the YAML files, use the following format for model names:

- **OpenAI**: `openai/model-name` (e.g., `openai/gpt-4o`, `openai/gpt-4`)
- **Anthropic**: `anthropic/model-name` (e.g., `anthropic/claude-sonnet-4-20250514`)
- **NVIDIA NIM**: `nvidia_nim/provider/model-name` (e.g., `nvidia_nim/meta/llama-3.1-70b-instruct`)
- **Local Models**: `hosted_vllm/model-path` (e.g., `hosted_vllm/Qwen/Qwen2-0.5B-Instruct`)

For a complete list of supported providers and their formats, see the [LiteLLM Provider Documentation](https://docs.litellm.ai/docs/providers).

## Evaluation Data Format

The framework uses evaluation data from the HuggingFace dataset `nvidia/judges-verdict`. After processing by `data_prep.py`, the data is in the following format:

```json
[
  {
    "item_name": "unique_identifier",
    "dataset_name": "dataset_source",
    "question": "The question text",
    "gt_answer": "The ground truth/reference answer",
    "gen_answer": "The generated answer"
  }
]
```

The `data_prep.py` script:

- Downloads the `nvidia/judges-verdict` dataset from HuggingFace
- Processes train split data (removes the annotations field)
- Processes test split data from 6 different datasets
- For the coral and dc767 datasets, joins with external data sources to get complete question/answer pairs
- Merges all data into a single `data/evaluation.json` file

## Command Line Options

The main scoring script supports the following options:

```bash
llm-judge-score \
    --evaluation-file PATH      # Path to evaluation JSON file (default: ./data/evaluation.json)
    --output-dir PATH           # Output directory for results (default: ./results/)
    --judges NAME [NAME ...]    # Specific judges to run (space-separated list, optional)
    --max-trials N              # Maximum trials per judge (default: 3)
    --default-workers N         # Default workers for judges not in config (default: 16)
    --config-file PATH          # Path to judge config YAML file (optional)
```

## Troubleshooting

- **Import errors**: Make sure to install the package with `poetry install` or `pip install -e .`
- **Missing API keys**: Set the appropriate environment variables for your judge models
- **Missing data files**: Run `python scripts/data_prep.py` to download and prepare the evaluation data
- **File not found errors**: Ensure the data preparation script has been run to create `evaluation.json`
- **HuggingFace dataset errors**: Make sure you have the `datasets` library installed (installed automatically with Poetry)

## Extending the Framework

### Adding a New Judge Model

1. Add the model configuration to your YAML config file using the LiteLLM framework
2. Ensure you use the correct LiteLLM model name format (e.g., `provider/model-name`)
3. Set up the required API keys for your model provider

### Adding New Metrics

1. Create a new metric class following the RAGAS metric interface
2. Update the scoring script to use the new metric
3. Add configuration options as needed

## Development

### Code Style and Formatting

This project uses several tools to maintain consistent code quality:

- **Black**: Code formatter with 120-character line limit
- **isort**: Import statement organizer
- **flake8**: Style guide enforcement

#### Running Code Formatters

```bash
# Format all Python files with black
poetry run black llm_judge_benchmark/ scripts/

# Sort imports with isort
poetry run isort llm_judge_benchmark/ scripts/

# Or run both together
poetry run black llm_judge_benchmark/ scripts/ && poetry run isort llm_judge_benchmark/ scripts/
```

#### Running Linter Checks

```bash
# Check code style with flake8
poetry run flake8 llm_judge_benchmark/ scripts/

# Show statistics
poetry run flake8 llm_judge_benchmark/ scripts/ --count --statistics
```

#### Pre-commit Checks (Recommended)

Before committing code, run all formatters and linters:

```bash
# Format code
poetry run black llm_judge_benchmark/ scripts/
poetry run isort llm_judge_benchmark/ scripts/

# Check for any remaining issues
poetry run flake8 llm_judge_benchmark/ scripts/
```

#### Configuration

The project's formatting tools are configured as follows:

- **Black** (`pyproject.toml`):
  - Line length: 120 characters
  - Target Python version: 3.12
- **isort** (`pyproject.toml`):
  - Line length: 120 characters
  - Multi-line output style: 3 (vertical hanging indent)
  - Trailing commas in multi-line imports
- **flake8** (`.flake8`):
  - Line length: 120 characters
  - Ignored rules: W503, E203 (conflict with Black)
  - Complexity threshold: 10

## Contributing

Contributions are welcome! Please ensure your code follows the project's style guidelines by running the formatters and linters before submitting a Pull Request.