# LiveCodeBench Pro - LLM Benchmarking Toolkit

This repository contains a benchmarking toolkit for evaluating Large Language Models (LLMs) on competitive programming tasks. The toolkit provides a standardized way to test your LLM's code-generation capabilities across a diverse set of problems.

## Overview

LiveCodeBench Pro evaluates LLMs on their ability to generate solutions for programming problems. The benchmark includes problems of varying difficulty from several competitive programming platforms.

## Getting Started

### Prerequisites

- Ubuntu 20.04 or later (or another distribution with kernel version >= 3.10 and cgroup support; refer to [go-judge](https://github.com/criyle/go-judge) for details)
- Python 3.12 or later
- pip package manager
- Docker (for running the judge server), with permission for your user to run Docker commands

### Installation

1. Install the required dependencies:

   ```bash
   pip install -r requirements.txt
   ```

   Or install directly using `uv`:

   ```bash
   uv sync
   ```

2. Ensure Docker is installed and running:

   ```bash
   docker --version
   ```

   Make sure your user has permission to run Docker commands. On Linux, you may need to add your user to the `docker` group:

   ```bash
   sudo usermod -aG docker $USER
   ```

   Then log out and back in for the change to take effect.

## How to Use

### Step 1: Implement Your LLM Interface

Create your own LLM class by extending the abstract `LLMInterface` class in `api_interface.py`. Your implementation needs to override the `call_llm` method.

Example:

```python
from api_interface import LLMInterface

class YourLLM(LLMInterface):
    def __init__(self):
        super().__init__()
        # Initialize your LLM client or resources here

    def call_llm(self, user_prompt: str):
        # Call your LLM with user_prompt and return a tuple of
        # (response_text, metadata). Example:
        response = your_llm_client.generate(user_prompt)
        return response.text, response.metadata
```

You can use the `ExampleLLM` class as a reference, which shows how to integrate with OpenAI's API.
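Before wiring in a real model, it can help to smoke-test the pipeline with a stub. The sketch below is illustrative and not part of the toolkit: `MockLLM` returns a canned C++ answer for every prompt, and the `try`/`except` fallback base class exists only so the snippet runs outside the repository.

```python
try:
    from api_interface import LLMInterface  # provided by the repo
except ImportError:
    class LLMInterface:  # minimal stand-in so this sketch runs standalone
        def call_llm(self, user_prompt: str):
            raise NotImplementedError


class MockLLM(LLMInterface):
    """Returns a fixed C++ solution for every prompt (for pipeline testing)."""

    CANNED = (
        "Here is my solution:\n"
        "```cpp\n"
        "#include <iostream>\n"
        "int main() { return 0; }\n"
        "```"
    )

    def call_llm(self, user_prompt: str):
        # A real implementation would call an API here; the metadata dict
        # mimics the kind of usage info a provider might return.
        return self.CANNED, {"model": "mock", "prompt_chars": len(user_prompt)}
```

Running the benchmark with a stub like this confirms that code extraction and judging work end to end before you spend tokens on a real model.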
### Step 2: Configure the Benchmark

Edit the `benchmark.py` file to use your LLM implementation:

```python
from your_module import YourLLM

# Replace this line:
llm_instance = YourLLM()  # Update with your LLM class
```

Also adjust the number of judge workers (recommended: no more than the number of physical CPU cores).

### Step 3: Run the Benchmark

Execute the benchmark script:

```bash
python benchmark.py
```

The script will:

1. Load the LiveCodeBench-Pro dataset from Hugging Face
2. Process each problem with your LLM
3. Extract C++ code from LLM responses automatically
4. Submit solutions to the integrated judge system for evaluation
5. Collect judge results and generate comprehensive statistics
6. Save the results to `benchmark_result.json`

### (Optional) Step 4: Submit Your Results

Email your `benchmark_result.json` file to zz4242@nyu.edu to have it displayed on the leaderboard. Please include the following information in your submission:

- LLM name and version
- Any other relevant details
- Contact information

## Understanding the Codebase

### api_interface.py

This file defines the abstract interface for LLM integration:

- `LLMInterface`: Abstract base class with methods for LLM interaction
- `ExampleLLM`: Example implementation using OpenAI's GPT-4o

### benchmark.py

The main benchmarking script that:

- Loads the dataset
- Processes each problem through your LLM
- Extracts C++ code from responses
- Submits solutions to the judge system
- Collects results and generates statistics
- Saves comprehensive results with judge verdicts

### judge.py

Contains the judge system integration:

- `Judge`: Abstract base class for judge implementations
- `LightCPVerifierJudge`: LightCPVerifier integration for local solution evaluation
- Automatic problem data downloading from Hugging Face

### util.py

Utility functions for code processing:

- `extract_longest_cpp_code()`: Intelligent C++ code extraction from LLM responses

### Dataset

The benchmark uses the
[QAQAQAQAQ/LiveCodeBench-Pro](https://huggingface.co/datasets/QAQAQAQAQ/LiveCodeBench-Pro) and [QAQAQAQAQ/LiveCodeBench-Pro-Testcase](https://huggingface.co/datasets/QAQAQAQAQ/LiveCodeBench-Pro-Testcase) datasets from Hugging Face, which contain competitive programming problems of varying difficulty.

## Contact

For questions or support, please contact us at zz4242@nyu.edu.
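As a closing reference for the code-extraction step mentioned above: the exact heuristics live in `util.extract_longest_cpp_code()`, but the core idea can be sketched as follows. This is a simplified illustration under the assumption that solutions arrive in fenced ```cpp blocks, not the toolkit's actual implementation.

```python
import re
from typing import Optional


def longest_cpp_block(response: str) -> Optional[str]:
    """Return the longest ```cpp / ```c++ fenced block, or None if absent.

    Simplified sketch: the toolkit's extract_longest_cpp_code() may handle
    more cases (unlabelled fences, stray backticks, etc.).
    """
    # DOTALL lets '.' cross newlines; non-greedy '.*?' stops at each
    # closing fence, so multiple blocks are captured separately.
    blocks = re.findall(r"```(?:cpp|c\+\+)\s*\n(.*?)```", response, re.DOTALL)
    return max(blocks, key=len) if blocks else None
```

Picking the longest block is a pragmatic heuristic: models often emit short fragments alongside the full solution, and the complete program is usually the largest fenced block.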