# LogSage

**Repository Path**: mirrors_NVIDIA/LogSage

## Basic Information

- **Project Name**: LogSage
- **Description**: LogSage is an LLM-based system that analyzes AI training logs to extract errors, find root causes, and suggest recovery actions, reducing downtime and manual intervention in large GPU clusters.
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-11-17
- **Last Updated**: 2026-03-21

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

LogSage is an LLM-powered toolkit that analyzes AI training job logs, attributes root causes, and recommends an auto-resume policy to reduce wasted GPU time. It provides:

- Error extraction and de-duplication from SLURM job logs
- Root-cause attribution (e.g., hardware, configuration, memory, communication)
- Action recommendations (restart immediately vs. stop) with justification
- Optional node-isolation hints for hardware-related failures

## Table of Contents

- [Description](#description)
- [Quickstart](#quickstart)
- [Components](#components)
- [Running the API](#running-the-api)
- [Configuration](#configuration)
- [Testing](#testing)
- [Roadmap and Project Status](#roadmap-and-project-status)
- [Additional Resources](#additional-resources)
- [Contributing](#contributing)

## Description

- **Problem**: Training jobs fail for many reasons (hardware, networking, configuration, data). Manual root-cause analysis is slow and wastes GPU hours.
- **Solution**: LogSage uses log parsing plus NVIDIA NIM to extract error patterns, attribute likely causes, and recommend whether to restart or stop, with an explanation.
- **How it works (high level)**:
  1. Receive job logs (client-provided)
  2. Extract and cluster error lines; remove noise
  3. Attribute errors with an LLM using structured prompts and heuristics
  4. Recommend restart/stop and, when applicable, suggest temporal isolation of suspect nodes
- **Benefits**: Faster triage, increased data-center availability, reduced GPU downtime, and improved error-extraction coverage by leveraging LLMs.

## Quickstart

Prerequisites:

- Python >= 3.11
- Poetry (recommended) or pip

Setup (using Poetry):

```bash
make install
# or
poetry install --with dev --all-extras
```

Install via pip:

```bash
python3 -m venv myenv
source myenv/bin/activate
pip install -U logsage --index-url=https://urm.nvidia.com/artifactory/api/pypi/nv-shared-pypi-local/simple --extra-index-url https://pypi.org/simple
```

Run the API locally:

```bash
python -m logsage.auto_resume_policy.run_server
# or
uvicorn logsage.auto_resume_policy.server:app --host 0.0.0.0 --port 8000 --reload
```

Open the docs:

- Swagger UI: `http://localhost:8000/docs`
- ReDoc: `http://localhost:8000/redoc`

## Components

- `logsage/auto_resume_policy/`
  - FastAPI server exposing endpoints to create an attribution ID, ingest logs, and retrieve analysis
  - CLI utilities to analyze local log files
  - Recommendations engine that suggests job auto-resume policies: `STOP`, `RESTART`, `TEMPORAL-ISOLATION+RESTART`
  - Detailed API and usage docs: see [`logsage/auto_resume_policy/README.md`](logsage/auto_resume_policy/README.md)

## Running the API

The FastAPI service is defined in `logsage/auto_resume_policy/server.py` and can be run locally during development:

```bash
python -m logsage.auto_resume_policy.run_server
```

Key endpoints (full specs in the module README and Swagger):

- `GET /healthz`: liveness check
- `GET /version`: version info
- `POST /errors/attribution_id`: create an attribution ID for a job
- `POST /errors/logs`: submit logs under the attribution ID
- `POST /errors/attribution`: run attribution and get a recommendation

For request/response schemas and cURL examples, see [`logsage/auto_resume_policy/README.md`](logsage/auto_resume_policy/README.md).
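To make the three auto-resume policies concrete, here is a hypothetical sketch of how an attributed root cause could map to a recommendation. The category names, the `suspect_nodes` parameter, and the retry heuristics are illustrative assumptions, not LogSage's actual recommendations engine:

```python
# Hypothetical sketch: map an attributed root-cause category to one of the
# three auto-resume policies (STOP, RESTART, TEMPORAL-ISOLATION+RESTART).
# Category names and heuristics are assumptions for illustration only.

def recommend_policy(category: str, suspect_nodes: list[str]) -> dict:
    """Return an auto-resume recommendation for an attributed failure."""
    transient = {"communication", "memory"}  # often succeed on a plain retry
    if category == "hardware" and suspect_nodes:
        # Restart the job, but temporarily exclude the nodes blamed for the fault.
        return {"policy": "TEMPORAL-ISOLATION+RESTART",
                "isolate": suspect_nodes,
                "reason": "hardware fault localized to specific nodes"}
    if category in transient:
        return {"policy": "RESTART",
                "reason": f"{category} errors are often transient"}
    # Configuration/data errors will fail again without human intervention.
    return {"policy": "STOP",
            "reason": f"{category} errors require manual fixes"}

print(recommend_policy("hardware", ["node-042"])["policy"])
print(recommend_policy("configuration", [])["policy"])
```

The intent of the design is that only hardware failures justify touching cluster state (node isolation); everything else is either retried or escalated to a human.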
## Configuration

Configuration is managed via `logsage/auto_resume_policy/config.py` (Pydantic settings). Key variables:

- `NVIDIA_API_KEY` (required in production): API key for NVIDIA AI Endpoints (required for LLM calls)
- `FAST_API_ROOT_PATH` (optional): root path when running behind a proxy
- `DEBUG` (optional): `true`/`false` (defaults to `true` locally; set to `false` in production)

## Testing

Run the test suite:

```bash
make test
# or
poetry run pytest
```

Coverage reports are configured in `pyproject.toml`.

## Roadmap and Project Status

- Add API authentication and rate limiting
- Add streamed/async log-ingestion paths
- Integrate with log collectors such as Loki
- Expand attribution categories and guardrails
- Improve test coverage and add end-to-end examples

## Additional Resources

- Auto-resume-policy API & details: [`logsage/auto_resume_policy/README.md`](logsage/auto_resume_policy/README.md)
- Fetcher & deployment details: [`logsage/fetcher/README.md`](logsage/fetcher/README.md)
- Project configuration and developer tooling: `pyproject.toml`, `Makefile`

Internal dashboards (if applicable):

- Grafana (LogSage): `https://grafana.nvidia.com/d/aeutclepcu41sf/logsage?orgId=290`
- Kibana/ES example index (sandbox): `https://gpuwa.nvidia.com/elasticsearch/df-sandbox-ohazai-logsage-test2-202508/_search`

## Contributing

For setup, development guidelines, and versioning information, see the [Contributing Guide](CONTRIBUTING.md).
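For intuition on the "extract and cluster error lines; remove noise" step described earlier, here is a toy de-duplication sketch. The error patterns and the number/hex normalization are assumptions for illustration; LogSage's real pipeline is LLM-assisted and more sophisticated:

```python
import re
from collections import Counter

# Toy sketch of error extraction + de-duplication from SLURM-style log lines.
# Volatile tokens (numbers, hex ids) are normalized so repeated errors that
# differ only in device ids or addresses collapse into one template.

ERROR_PATTERN = re.compile(r"(ERROR|FATAL|CUDA error|NCCL WARN)", re.IGNORECASE)

def extract_errors(log_lines: list[str]) -> list[tuple[str, int]]:
    """Return de-duplicated error templates with occurrence counts."""
    counts: Counter = Counter()
    for line in log_lines:
        if not ERROR_PATTERN.search(line):
            continue  # drop non-error noise (progress lines, metrics, etc.)
        template = re.sub(r"0x[0-9a-fA-F]+|\d+", "<N>", line.strip())
        counts[template] += 1
    return counts.most_common()

log = [
    "epoch 12 step 400 loss 2.13",
    "ERROR: CUDA error: out of memory on device 3",
    "ERROR: CUDA error: out of memory on device 7",
    "NCCL WARN Net: connection closed by peer 10.0.0.5",
]
for template, n in extract_errors(log):
    print(n, template)
```

Collapsing near-duplicate lines this way is what keeps the downstream LLM prompt small: the model sees a handful of templates with counts rather than thousands of raw log lines.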