# LogSage

**Repository Path**: mirrors_NVIDIA/LogSage

## Basic Information

- **Project Name**: LogSage
- **Description**: LogSage is an LLM-based system that analyzes AI training logs to extract errors, find root causes, and suggest recovery actions, reducing downtime and manual intervention in large GPU clusters.
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-11-17
- **Last Updated**: 2026-03-21

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

LogSage is an LLM-powered toolkit that analyzes AI training job logs, attributes root causes, and recommends an auto-resume policy to reduce wasted GPU time. It provides:

- Error extraction and de-duplication from SLURM job logs
- Root-cause attribution (e.g., hardware, configuration, memory, communication)
- Action recommendations (restart immediately vs. stop) with justification
- Optional node-isolation hints for hardware-related failures

## Table of Contents

- [Description](#description)
- [Quickstart](#quickstart)
- [Components](#components)
- [Running the API](#running-the-api)
- [Configuration](#configuration)
- [Testing](#testing)
- [Roadmap and Project Status](#roadmap-and-project-status)
- [Additional Resources](#additional-resources)
- [Contributing](#contributing)

## Description

- **Problem**: Training jobs fail for many reasons (hardware, networking, configuration, data). Manual root-cause analysis is slow and wastes GPU hours.
- **Solution**: LogSage uses log parsing plus NVIDIA NIM to extract error patterns, attribute likely causes, and recommend whether to restart or stop, with an explanation.
- **How it works (high level)**:
  1. Receive job logs (client-provided)
  2. Extract and cluster error lines; remove noise
  3. Attribute errors with an LLM using structured prompts and heuristics
  4. Recommend restart/stop and, when applicable, suggest temporal isolation of suspect nodes
- **Benefits**: Faster triage, increased data-center availability, reduced GPU downtime, and improved error-extraction coverage by leveraging LLMs.

## Quickstart

Prerequisites:

- Python >= 3.11
- Poetry (recommended) or pip

Setup (using Poetry):

```bash
make install
# or
poetry install --with dev --all-extras
```

Install via pip:

```bash
python3 -m venv myenv
source myenv/bin/activate
pip install -U logsage --index-url=https://urm.nvidia.com/artifactory/api/pypi/nv-shared-pypi-local/simple --extra-index-url https://pypi.org/simple
```

Run the API locally:

```bash
python -m logsage.auto_resume_policy.run_server
# or
uvicorn logsage.auto_resume_policy.server:app --host 0.0.0.0 --port 8000 --reload
```

Open the docs:

- Swagger UI: `http://localhost:8000/docs`
- ReDoc: `http://localhost:8000/redoc`

## Components

- `logsage/auto_resume_policy/`
  - FastAPI server exposing endpoints to create an attribution ID, ingest logs, and retrieve analysis
  - CLI utilities to analyze local log files
  - Recommendations engine that suggests job auto-resume policies: `STOP`, `RESTART`, `TEMPORAL-ISOLATION+RESTART`
  - Detailed API and usage docs: see [`logsage/auto_resume_policy/README.md`](logsage/auto_resume_policy/README.md)

## Running the API

The FastAPI service is defined in `logsage/auto_resume_policy/server.py` and can be run locally during development:

```bash
python -m logsage.auto_resume_policy.run_server
```

Key endpoints (full specs in the module README and Swagger):

- `GET /healthz`: liveness check
- `GET /version`: version info
- `POST /errors/attribution_id`: create an attribution ID for a job
- `POST /errors/logs`: submit logs under the attribution ID
- `POST /errors/attribution`: run attribution and get a recommendation

For request/response schemas and cURL examples, see [`logsage/auto_resume_policy/README.md`](logsage/auto_resume_policy/README.md).
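To make the three auto-resume policies concrete, here is a hypothetical sketch of how an attributed root cause could map to a recommendation. The category names, the `suspect_nodes` parameter, and the retry heuristics are illustrative assumptions, not LogSage's actual recommendations engine:

```python
# Hypothetical sketch: map an attributed root-cause category to one of the
# three auto-resume policies (STOP, RESTART, TEMPORAL-ISOLATION+RESTART).
# Category names and heuristics are assumptions for illustration only.

def recommend_policy(category: str, suspect_nodes: list[str]) -> dict:
    """Return an auto-resume recommendation for an attributed failure."""
    transient = {"communication", "memory"}  # often succeed on a plain retry
    if category == "hardware" and suspect_nodes:
        # Restart the job, but temporarily exclude the nodes blamed for the fault.
        return {"policy": "TEMPORAL-ISOLATION+RESTART",
                "isolate": suspect_nodes,
                "reason": "hardware fault localized to specific nodes"}
    if category in transient:
        return {"policy": "RESTART",
                "reason": f"{category} errors are often transient"}
    # Configuration/data errors will fail again without human intervention.
    return {"policy": "STOP",
            "reason": f"{category} errors require manual fixes"}

print(recommend_policy("hardware", ["node-042"])["policy"])
print(recommend_policy("configuration", [])["policy"])
```

The intent of the design is that only hardware failures justify touching cluster state (node isolation); everything else is either retried or escalated to a human.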
## Configuration

Configuration is managed via `logsage/auto_resume_policy/config.py` (Pydantic settings). Key variables:

- `NVIDIA_API_KEY` (required in production): API key for NVIDIA AI Endpoints (required for LLM calls)
- `FAST_API_ROOT_PATH` (optional): root path when running behind a proxy
- `DEBUG` (optional): `true`/`false` (defaults to `true` locally; set to `false` in production)

## Testing

Run the test suite:

```bash
make test
# or
poetry run pytest
```

Coverage reports are configured in `pyproject.toml`.

## Roadmap and Project Status

- Add API authentication and rate limiting
- Add streamed/async log-ingestion paths
- Integrate with log collectors such as Loki
- Expand attribution categories and guardrails
- Improve test coverage and add end-to-end examples

## Additional Resources

- Auto-resume-policy API & details: [`logsage/auto_resume_policy/README.md`](logsage/auto_resume_policy/README.md)
- Fetcher & deployment details: [`logsage/fetcher/README.md`](logsage/fetcher/README.md)
- Project configuration and developer tooling: `pyproject.toml`, `Makefile`

Internal dashboards (if applicable):

- Grafana (LogSage): `https://grafana.nvidia.com/d/aeutclepcu41sf/logsage?orgId=290`
- Kibana/ES example index (sandbox): `https://gpuwa.nvidia.com/elasticsearch/df-sandbox-ohazai-logsage-test2-202508/_search`

## Contributing

For setup, development guidelines, and versioning information, see the [Contributing Guide](CONTRIBUTING.md).
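For intuition on the "extract and cluster error lines; remove noise" step described earlier, here is a toy de-duplication sketch. The error patterns and the number/hex normalization are assumptions for illustration; LogSage's real pipeline is LLM-assisted and more sophisticated:

```python
import re
from collections import Counter

# Toy sketch of error extraction + de-duplication from SLURM-style log lines.
# Volatile tokens (numbers, hex ids) are normalized so repeated errors that
# differ only in device ids or addresses collapse into one template.

ERROR_PATTERN = re.compile(r"(ERROR|FATAL|CUDA error|NCCL WARN)", re.IGNORECASE)

def extract_errors(log_lines: list[str]) -> list[tuple[str, int]]:
    """Return de-duplicated error templates with occurrence counts."""
    counts: Counter = Counter()
    for line in log_lines:
        if not ERROR_PATTERN.search(line):
            continue  # drop non-error noise (progress lines, metrics, etc.)
        template = re.sub(r"0x[0-9a-fA-F]+|\d+", "<N>", line.strip())
        counts[template] += 1
    return counts.most_common()

log = [
    "epoch 12 step 400 loss 2.13",
    "ERROR: CUDA error: out of memory on device 3",
    "ERROR: CUDA error: out of memory on device 7",
    "NCCL WARN Net: connection closed by peer 10.0.0.5",
]
for template, n in extract_errors(log):
    print(n, template)
```

Collapsing near-duplicate lines this way is what keeps the downstream LLM prompt small: the model sees a handful of templates with counts rather than thousands of raw log lines.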