
# AIOpsLab

[πŸ€–Overview](#πŸ€–overview) | [πŸš€Quick Start](#πŸš€quickstart) | [πŸ“¦Installation](#πŸ“¦installation) | [βš™οΈUsage](#βš™οΈusage) | [πŸ“‚Project Structure](#πŸ“‚project-structure) | [πŸ“„How to Cite](#πŸ“„how-to-cite) [![ArXiv Link](https://img.shields.io/badge/arXiv-2501.06706-red?logo=arxiv)](https://arxiv.org/pdf/2501.06706) [![ArXiv Link](https://img.shields.io/badge/arXiv-2407.12165-red?logo=arxiv)](https://arxiv.org/pdf/2407.12165)

## πŸ€– Overview

![AIOpsLab architecture](./assets/images/aiopslab-arch-open-source.png)

AIOpsLab is a holistic framework for designing, developing, and evaluating autonomous AIOps agents, and for building reproducible, standardized, interoperable, and scalable benchmarks. AIOpsLab can deploy microservice cloud environments, inject faults, generate workloads, and export telemetry data, while orchestrating these components and providing interfaces for interacting with and evaluating agents. Moreover, AIOpsLab ships with a built-in benchmark suite of problems for evaluating AIOps agents in an interactive environment; the suite can easily be extended to meet user-specific needs. See the problem list [here](/aiopslab/orchestrator/problems/registry.py#L15).
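For a first taste of the workflow described under Usage below, here is a minimal sketch (assuming AIOpsLab is installed and `config.yml` points at a running cluster, as set up in Quick Start) that lists the built-in problems through the orchestrator:

```python
# Minimal sketch: enumerate the built-in benchmark problems via the
# orchestrator's registry. Assumes AIOpsLab is installed and configured.
from aiopslab.orchestrator import Orchestrator

orch = Orchestrator()
problem_ids = orch.probs.get_problem_ids()  # IDs defined in problems/registry.py
print(f"{len(problem_ids)} problems available, for example:")
for pid in list(problem_ids)[:5]:
    print(" -", pid)
```

The same `Orchestrator` object is used to register agents and start problems, as shown in the Usage section.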

## πŸ“¦ Installation

### Requirements

- Python >= 3.11
- [Helm](https://helm.sh/)
- Additional requirements depend on the deployment option selected, which is explained in the next section.

Recommended installation:

```bash
sudo apt install python3.11 python3.11-venv python3.11-dev python3-pip # poetry requires python >= 3.11
```

We recommend [Poetry](https://python-poetry.org/docs/) for managing dependencies. You can also use a standard `pip install -e .` to install the dependencies.

```bash
git clone --recurse-submodules
cd AIOpsLab
poetry env use python3.11
export PATH="$HOME/.local/bin:$PATH" # export poetry to PATH if needed
poetry install # -vvv for verbose output
poetry self add poetry-plugin-shell # installs poetry shell plugin
poetry shell
```
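As an optional sanity check (this only verifies that the package resolves inside the Poetry environment, not cluster connectivity), you can run a snippet like the following with `poetry run python`:

```python
# Sanity check: confirm the aiopslab package is importable from the Poetry
# environment (e.g., save as check_install.py and run `poetry run python check_install.py`).
import importlib.util

if importlib.util.find_spec("aiopslab") is None:
    raise SystemExit("aiopslab is not importable; re-run `poetry install`")
print("aiopslab is importable")
```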

## πŸš€ Quick Start

Choose either a) or b) to set up your cluster, then proceed to the next steps.

### a) Local simulated cluster

AIOpsLab can be run on a local simulated cluster using [kind](https://kind.sigs.k8s.io/) on your local machine. See this [README](kind/README.md#prerequisites) for the list of prerequisites.

```bash
# For x86 machines
kind create cluster --config kind/kind-config-x86.yaml

# For ARM machines
kind create cluster --config kind/kind-config-arm.yaml
```

If you run into issues, consider building a Docker image for your machine by following this [README](kind/README.md#deployment-steps), and please also open an issue.

### [Tips]

If you are running AIOpsLab behind a proxy, beware of exporting the HTTP proxy as `172.17.0.1`. When the kind cluster is created, all nodes in the cluster inherit the proxy settings from the host environment and the Docker container; the `172.17.0.1` address is used to communicate with the host machine. For more details, refer to the official guide: [Configure Kind to Use a Proxy](https://kind.sigs.k8s.io/docs/user/quick-start/#configure-kind-to-use-a-proxy).

Additionally, Docker does not support SOCKS5 proxies directly. If you proxy over the SOCKS5 protocol, you may need [Privoxy](https://www.privoxy.org) to forward SOCKS5 to HTTP. If you run vLLM and the LLM agent locally, Privoxy will proxy `localhost` by default, which causes errors. To avoid this issue, set the following environment variable:

```bash
export no_proxy=localhost
```

After the cluster is created, proceed to the "Update `config.yml`" step below.

### b) Remote cluster

AIOpsLab supports any remote Kubernetes cluster that your `kubectl` context points to, whether it is a cluster from a cloud provider or one you built yourself. We provide Ansible playbooks to set up clusters on providers such as [CloudLab](https://www.cloudlab.us/) and on our own machines. Follow this [README](./scripts/ansible/README.md) to set up your own cluster, then proceed to the "Update `config.yml`" step.

### Update `config.yml`

```bash
cd aiopslab
cp config.yml.example config.yml
```

Update `config.yml` so that `k8s_host` is the hostname of your cluster's control plane node and `k8s_user` is your username on that node. If you are using a kind cluster, set `k8s_host` to `kind`. If you are running AIOpsLab on the cluster itself, set `k8s_host` to `localhost`.

### Running agents locally

Human as the agent:

```bash
python3 cli.py
(aiopslab) $ start misconfig_app_hotel_res-detection-1 # or choose any problem you want to solve
# ... wait for the setup ...
(aiopslab) $ submit("Yes") # submit solution
```

Run the GPT-4 baseline agent:

```bash
# Create a .env file in the project root (if it does not exist)
echo "OPENAI_API_KEY=<your-openai-api-key>" > .env

# Add more API keys as needed:
# echo "QWEN_API_KEY=<your-qwen-api-key>" >> .env
# echo "DEEPSEEK_API_KEY=<your-deepseek-api-key>" >> .env

python3 clients/gpt.py # you can also change the problem to solve in the main() function
```

Our repository comes with a variety of pre-integrated agents, including agents that support **secure authentication with Azure OpenAI endpoints using identity-based access**. See [Clients](/clients) for a comprehensive list of all implemented clients. The clients automatically load API keys from your `.env` file (see the loading sketch at the end of this section).

You can conveniently check the running status of the cluster with [k9s](https://k9scli.io/) or other cluster monitoring tools.

To browse your logged `session_id` values in the W&B app as a table:

1. Make sure you have W&B installed and configured.
2. Set the `USE_WANDB` environment variable:

   ```bash
   # Add to your .env file
   echo "USE_WANDB=true" >> .env
   ```

3. In the W&B web UI, open any run and click Tables β†’ Add Query Panel.
4. In the key field, type `runs.summary` and click `Run`; the results are displayed as a table.
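For reference, the sketch below shows one way an agent client can pick up those keys from `.env` (it assumes `python-dotenv` is available; the bundled clients may load keys differently):

```python
# Sketch: load API keys from the project-root .env file before constructing
# an agent client. Assumes python-dotenv; the bundled clients may differ.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
openai_key = os.getenv("OPENAI_API_KEY")
if not openai_key:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
```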

βš™οΈ Usage

AIOpsLab can be used in the following ways:

- [Onboard your agent to AIOpsLab](#how-to-onboard-your-agent-to-aiopslab)
- [Add new applications to AIOpsLab](#how-to-add-new-applications-to-aiopslab)
- [Add new problems to AIOpsLab](#how-to-add-new-problems-to-aiopslab)

### Running agents remotely

You can run AIOpsLab on a remote machine with larger computational resources. This section guides you through setting up and using AIOpsLab remotely.

1. **On the remote machine, start the AIOpsLab service**:

   ```bash
   SERVICE_HOST=<host> SERVICE_PORT=<port> SERVICE_WORKERS=<num_workers> python service.py
   ```

2. **Test the connection from your local machine**:

   From your local machine, you can test the connection to the remote AIOpsLab service using `curl`:

   ```bash
   # Check if the service is running
   curl http://<host>:<port>/health

   # List available problems
   curl http://<host>:<port>/problems

   # List available agents
   curl http://<host>:<port>/agents
   ```

3. **Run vLLM on the remote machine (if using the vLLM agent)**:

   If you're using the vLLM agent, make sure to launch the vLLM server on the remote machine:

   ```bash
   # On the remote machine
   chmod +x ./clients/launch_vllm.sh
   ./clients/launch_vllm.sh
   ```

   You can customize the model by editing `launch_vllm.sh` before running it.

4. **Run the agent**:

   From your local machine, run the agent with the following command:

   ```bash
   curl -X POST http://<host>:<port>/simulate \
     -H "Content-Type: application/json" \
     -d '{
       "problem_id": "misconfig_app_hotel_res-mitigation-1",
       "agent_name": "vllm",
       "max_steps": 10,
       "temperature": 0.7,
       "top_p": 0.9
     }'
   ```

### How to onboard your agent to AIOpsLab?

AIOpsLab makes it extremely easy to develop and evaluate your agents. You can onboard your agent to AIOpsLab in 3 simple steps (a complete end-to-end sketch follows the steps):

1. **Create your agent**: You are free to develop agents using any framework of your choice. The only requirements are:

   - Wrap your agent in a Python class, say `Agent`
   - Add an async method `get_action` to the class:

     ```python
     # given the current state, returns the agent's action
     async def get_action(self, state: str) -> str:
         # <your agent's logic here>
         ...
     ```

2. **Register your agent with AIOpsLab**: You can now register the agent with AIOpsLab's orchestrator. The orchestrator manages the interaction between your agent and the environment:

   ```python
   from aiopslab.orchestrator import Orchestrator

   agent = Agent()            # create an instance of your agent
   orch = Orchestrator()      # get AIOpsLab's orchestrator
   orch.register_agent(agent) # register your agent with AIOpsLab
   ```

3. **Evaluate your agent on a problem**:

   1. **Initialize a problem**: AIOpsLab provides a list of problems that you can evaluate your agent on. Find the available problems [here](/aiopslab/orchestrator/problems/registry.py) or via `orch.probs.get_problem_ids()`. Then initialize a problem by its ID:

      ```python
      problem_desc, instructs, apis = orch.init_problem("k8s_target_port-misconfig-mitigation-1")
      ```

   2. **Set agent context**: Use the problem description, instructions, and available APIs to set context for your agent. (*This step depends on your agent's design and is left to the user.*)

   3. **Start the problem**: Start the problem by calling the `start_problem` method. You can also specify the maximum number of steps:

      ```python
      import asyncio

      asyncio.run(orch.start_problem(max_steps=30))
      ```

This process creates a [`Session`](/aiopslab/session.py) with the orchestrator, in which the agent solves the problem. The orchestrator evaluates your agent's solution and provides results (stored under `data/results/`), which you can use to improve your agent.
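Putting the three steps together, here is a minimal end-to-end sketch. The `Agent` body is a placeholder (a real agent would call an LLM or other policy in `get_action`), and the `set_context` helper is just one way to handle step 3.2:

```python
# End-to-end sketch of onboarding an agent, following the three steps above.
# The trivial get_action body and the set_context helper are placeholders.
import asyncio

from aiopslab.orchestrator import Orchestrator


class Agent:
    def __init__(self):
        self.context = ""

    def set_context(self, problem_desc, instructs, apis):
        # Step 3.2: stash the problem context however your agent needs it.
        self.context = f"{problem_desc}\n{instructs}\n{apis}"

    async def get_action(self, state: str) -> str:
        # Given the current state, return the agent's next action.
        # A real agent would reason over self.context and state here.
        return 'submit("Yes")'  # placeholder action


agent = Agent()             # create an instance of your agent
orch = Orchestrator()       # get AIOpsLab's orchestrator
orch.register_agent(agent)  # register your agent with AIOpsLab

problem_desc, instructs, apis = orch.init_problem(
    "k8s_target_port-misconfig-mitigation-1"
)
agent.set_context(problem_desc, instructs, apis)

asyncio.run(orch.start_problem(max_steps=30))
```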
### How to add new applications to AIOpsLab?

AIOpsLab provides a default [list of applications](/aiopslab/service/apps/) to evaluate agents on operations tasks. However, as a developer you can add new applications to AIOpsLab and design problems around them.

> *Note*: for auto-deployment of some apps with K8S, we integrate Helm charts (you can also use `kubectl` to install, as in the [HotelRes application](/aiopslab/service/apps/hotelres.py)). More on Helm [here](https://helm.sh).

To add a new application to AIOpsLab with Helm, you need to:

1. **Add application metadata**

   - Application metadata is a JSON object that describes the application.
   - Include *any* field such as the app's name, description, namespace, etc.
   - We recommend also including a special `Helm Config` field, as follows:

     ```json
     "Helm Config": {
         "release_name": "<name of the helm release>",
         "chart_path": "<path to the helm chart>",
         "namespace": "<namespace to deploy the app>"
     }
     ```

   > *Note*: The `Helm Config` is used by the orchestrator to auto-deploy your app when a problem associated with it is started.

   > *Note*: The orchestrator auto-provides *all other* fields as context to the agent for any problem associated with this app.

   Create a JSON file with this metadata and save it in the [`metadata`](/aiopslab/service/metadata) directory. For example, see the `social-network` app: [social-network.json](/aiopslab/service/metadata/social-network.json)

2. **Add application class**

   Extend the base class in a new Python file in the [`apps`](/aiopslab/service/apps) directory:

   ```python
   from aiopslab.service.apps.base import Application

   class MyApp(Application):
       def __init__(self):
           super().__init__("<path to your metadata JSON file>")
   ```

   The `Application` class provides a base implementation for the application. You can override methods as needed and add new ones to suit your application's requirements, but the base class should suffice for most applications.

### How to add new problems to AIOpsLab?

Similar to applications, AIOpsLab provides a default [list of problems](/aiopslab/orchestrator/problems/registry.py) to evaluate agents. However, as a developer you can add new problems to AIOpsLab and design them around your applications.

Each problem in AIOpsLab has 5 components:

1. *Application*: The application on which the problem is based.
2. *Task*: The AIOps task that the agent needs to perform. Currently we support: [Detection](/aiopslab/orchestrator/tasks/detection.py), [Localization](/aiopslab/orchestrator/tasks/localization.py), [Analysis](/aiopslab/orchestrator/tasks/analysis.py), and [Mitigation](/aiopslab/orchestrator/tasks/mitigation.py).
3. *Fault*: The fault being introduced in the application.
4. *Workload*: The workload that is generated for the application.
5. *Evaluator*: The evaluator that checks the agent's performance.

To add a new problem to AIOpsLab, create a new Python file in the [`problems`](/aiopslab/orchestrator/problems) directory, as follows:

1. **Setup**. Import your chosen application (say `MyApp`) and task (say `LocalizationTask`):

   ```python
   from aiopslab.service.apps.myapp import MyApp
   from aiopslab.orchestrator.tasks.localization import LocalizationTask
   ```

2. **Define**. To define a problem, create a class that inherits from your chosen `Task` and defines 3 methods: `start_workload`, `inject_fault`, and `eval`:

   ```python
   class MyProblem(LocalizationTask):
       def __init__(self):
           self.app = MyApp()

       def start_workload(self):
           # <start the workload here>
           ...

       def inject_fault(self):
           # <inject the fault here>
           ...

       def eval(self, soln, trace, duration):
           # <evaluate the agent's solution here>
           ...
   ```

3. **Register**. Finally, add your problem to the orchestrator's registry [here](/aiopslab/orchestrator/problems/registry.py).
See a full example of a problem [here](/aiopslab/orchestrator/problems/k8s_target_port_misconfig/target_port.py).
The three methods in detail (a complete problem sketch follows this list):

- **`start_workload`**: Initiates the application's workload. Use your own generator or AIOpsLab's default, which is based on [wrk2](https://github.com/giltene/wrk2):

  ```python
  from aiopslab.generators.workload.wrk import Wrk

  wrk = Wrk(rate=100, duration=10)
  wrk.start_workload(payload="<wrk payload script>", url="<app URL>")
  ```

  > Relevant Code: [aiopslab/generators/workload/wrk.py](/aiopslab/generators/workload/wrk.py)

- **`inject_fault`**: Introduces a fault into the application. Use your own injector or AIOpsLab's built-in one, which you can also extend. E.g., a misconfig in the K8S layer:

  ```python
  from aiopslab.generators.fault.inject_virtual import *

  inj = VirtualizationFaultInjector(testbed="<namespace>")
  inj.inject_fault(microservices=["<service name>"], fault_type="misconfig")
  ```

  > Relevant Code: [aiopslab/generators/fault](/aiopslab/generators/fault)

- **`eval`**: Evaluates the agent's solution using 3 params: (1) *soln*: the agent's submitted solution, if any; (2) *trace*: the agent's action trace; and (3) *duration*: the time taken by the agent. Here you can use the built-in default evaluators for each task and/or add custom evaluations. The results are stored in `self.results`:

  ```python
  def eval(self, soln, trace, duration) -> dict:
      super().eval(soln, trace, duration)          # default evaluation
      self.add_result("myMetric", my_metric(...))  # add a custom metric
      return self.results
  ```

  > *Note*: When an agent starts a problem, the orchestrator creates a [`Session`](/aiopslab/session.py) object that stores the agent's interactions. The `trace` parameter is this session's recorded trace.

  > Relevant Code: [aiopslab/orchestrator/evaluators/](/aiopslab/orchestrator/evaluators/)
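For illustration, here is a hedged sketch of a complete localization problem that wires these pieces together. The namespace, service name, and wrk payload/URL are placeholders, and the default evaluation from `LocalizationTask` is reused unchanged:

```python
# Hypothetical localization problem combining the generators shown above.
# Placeholders in <...> must be replaced with values for your application.
from aiopslab.generators.fault.inject_virtual import VirtualizationFaultInjector
from aiopslab.generators.workload.wrk import Wrk
from aiopslab.orchestrator.tasks.localization import LocalizationTask
from aiopslab.service.apps.myapp import MyApp  # your app from the previous section


class MyAppMisconfigLocalization(LocalizationTask):
    def __init__(self):
        self.app = MyApp()

    def start_workload(self):
        # Drive load against the app with AIOpsLab's wrk2-based generator.
        wrk = Wrk(rate=100, duration=10)
        wrk.start_workload(payload="<wrk-payload-script>", url="<app-frontend-url>")

    def inject_fault(self):
        # Misconfigure a target microservice at the virtualization (K8S) layer.
        inj = VirtualizationFaultInjector(testbed="<namespace>")
        inj.inject_fault(microservices=["<service-name>"], fault_type="misconfig")

    def eval(self, soln, trace, duration) -> dict:
        # Rely on LocalizationTask's default evaluation; add custom metrics
        # with self.add_result(...) if needed.
        super().eval(soln, trace, duration)
        return self.results
```

Remember to register the new class in [registry.py](/aiopslab/orchestrator/problems/registry.py) so the orchestrator can find it.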

## πŸ“‚ Project Structure

The `aiopslab` package is organized as follows:

**Generators**

```
generators - the problem generators for aiopslab
β”œβ”€β”€ fault - the fault generator organized by fault injection level
β”‚   β”œβ”€β”€ base.py
β”‚   β”œβ”€β”€ inject_app.py
β”‚   ...
β”‚   └── inject_virtual.py
└── workload - the workload generator organized by workload type
    └── wrk.py - wrk tool interface
```

**Orchestrator**

```
orchestrator
β”œβ”€β”€ orchestrator.py - the main orchestration engine
β”œβ”€β”€ parser.py - parser for agent responses
β”œβ”€β”€ evaluators - evaluation metrics in the system
β”‚   β”œβ”€β”€ prompts.py - prompts for LLM-as-a-Judge
β”‚   β”œβ”€β”€ qualitative.py - qualitative metrics
β”‚   └── quantitative.py - quantitative metrics
β”œβ”€β”€ problems - problem definitions in aiopslab
β”‚   β”œβ”€β”€ k8s_target_port_misconfig - e.g., a K8S TargetPort misconfig problem
β”‚   ...
β”‚   └── registry.py
β”œβ”€β”€ actions - actions that agents can perform, organized by AIOps task type
β”‚   β”œβ”€β”€ base.py
β”‚   β”œβ”€β”€ detection.py
β”‚   β”œβ”€β”€ localization.py
β”‚   β”œβ”€β”€ analysis.py
β”‚   └── mitigation.py
└── tasks - individual AIOps task definitions that agents need to solve
    β”œβ”€β”€ base.py
    β”œβ”€β”€ detection.py
    β”œβ”€β”€ localization.py
    β”œβ”€β”€ analysis.py
    └── mitigation.py
```

**Service**

```
service
β”œβ”€β”€ apps - interfaces/implementations of each app
β”œβ”€β”€ helm.py - Helm interface to interact with the cluster
β”œβ”€β”€ kubectl.py - kubectl interface to interact with the cluster
β”œβ”€β”€ shell.py - shell interface to interact with the cluster
β”œβ”€β”€ metadata - metadata and configs for each app
└── telemetry - observability tools besides the observer, e.g., in-memory log telemetry for the agent
```

**Observer**

```
observer
β”œβ”€β”€ filebeat - Filebeat installation
β”œβ”€β”€ logstash - Logstash installation
β”œβ”€β”€ prometheus - Prometheus installation
β”œβ”€β”€ log_api.py - API to store the log data on disk
β”œβ”€β”€ metric_api.py - API to store the metrics data on disk
└── trace_api.py - API to store the traces data on disk
```

**Utils**

```
β”œβ”€β”€ config.yml - aiopslab configs
β”œβ”€β”€ config.py - config parser
β”œβ”€β”€ paths.py - paths and constants
β”œβ”€β”€ session.py - aiopslab session manager
└── utils
    β”œβ”€β”€ actions.py - helpers for actions that agents can perform
    β”œβ”€β”€ cache.py - cache manager
    └── status.py - aiopslab status, errors, and warnings
```

`cli.py`: a command-line interface to interact with AIOpsLab, e.g., used by human operators.

## πŸ“„ How to Cite

```bibtex
@inproceedings{chen2025aiopslab,
  title     = {{AIO}psLab: A Holistic Framework to Evaluate {AI} Agents for Enabling Autonomous Clouds},
  author    = {Yinfang Chen and Manish Shetty and Gagan Somashekar and Minghua Ma and Yogesh Simmhan and Jonathan Mace and Chetan Bansal and Rujia Wang and Saravan Rajmohan},
  booktitle = {Eighth Conference on Machine Learning and Systems},
  year      = {2025},
  url       = {https://openreview.net/forum?id=3EXBLwGxtq}
}

@inproceedings{shetty2024building,
  title     = {Building AI Agents for Autonomous Clouds: Challenges and Design Principles},
  author    = {Shetty, Manish and Chen, Yinfang and Somashekar, Gagan and Ma, Minghua and Simmhan, Yogesh and Zhang, Xuchao and Mace, Jonathan and Vandevoorde, Dax and Las-Casas, Pedro and Gupta, Shachee Mishra and Nath, Suman and Bansal, Chetan and Rajmohan, Saravan},
  year      = {2024},
  booktitle = {Proceedings of the 15th ACM Symposium on Cloud Computing}
}
```

## Code of Conduct

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

## License

Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the [MIT](LICENSE.txt) license.

### Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft’s Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.