# dynolog
**Repository Path**: mirrors_facebookincubator/dynolog
## Basic Information
- **Project Name**: dynolog
- **Description**: Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system, such as the Linux kernel, CPU, disks, Intel PT, GPUs, etc. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-11-03
- **Last Updated**: 2025-09-13
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Dynolog: a performance monitoring daemon for heterogeneous CPU-GPU systems
[License](https://github.com/facebookincubator/dynolog/blob/main/LICENSE)
[CI](https://github.com/facebookincubator/dynolog/actions)
[Releases](https://github.com/facebookincubator/dynolog/releases)
[Issues](https://github.com/facebookincubator/dynolog/issues)
[PRs Welcome](https://makeapullrequest.com)
## Introduction
Dynolog is a lightweight monitoring daemon for heterogeneous CPU-GPU systems. It supports both **always-on performance monitoring**, as well as **deep-dive profiling** modes. The latter can be activated by making a remote procedure call to the daemon.
Below are some of the key features, which we explore in more detail later in this README.
* Dynolog integrates with the [PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html) and provides **[on-demand remote tracing features](https://pytorch.org/blog/performance-debugging-of-production-pytorch-models-at-meta/).** One can use a single command line tool (dyno CLI) to **simultaneously trace hundreds of GPUs** and examine the collected traces (available from PyTorch v1.13.0 onwards).
* It incorporates **[GPU performance monitoring](#gpu-monitoring)** for NVIDIA GPUs using [DCGM](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/index.html#).
* Dynolog manages counters for **micro-architecture specific performance events** related to CPU caches, TLBs, etc. on **Intel** and **AMD** CPUs. Additionally, it instruments telemetry from the Linux kernel, including **CPU, network and IO** resource usage.
* We are actively implementing new features, including support for **[Intel Processor Trace](https://engineering.fb.com/2021/04/27/developer-tools/reverse-debugging/)** as well as **memory latency and bandwidth monitoring**.
We focus on Linux platforms, as Linux is leveraged heavily in cloud environments.
### Motivation
Large-scale AI models use **distributed AI training** across multiple compute nodes. They also leverage hardware accelerators like **GPUs** to boost performance. Developers have to carefully optimize their AI applications to make the most of the underlying hardware while avoiding performance bottlenecks. This is where good performance monitoring and profiling tools become indispensable.
While there are existing solutions for monitoring ([1](https://www.intel.com/content/www/us/en/cloud-computing/telemetry.html), [2](https://cloud.google.com/learn/what-is-opentelemetry)) and for profiling CPUs ([Intel’s VTune](https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html)) and GPUs ([NSight](https://developer.nvidia.com/nsight-compute)), it is challenging to assemble them into a holistic view of the system. For example, we need to understand whether an inefficiency in one resource, like the communication fabric, is slowing down the overall computation. Additionally, these solutions need to work in a production environment without causing performance degradation.
Dynolog leverages the underlying monitoring and profiling interfaces from the Linux kernel, CPU Performance Monitoring Units (PMUs) and GPUs. It also interacts with the PyTorch Profiler within the application to support on-demand profiling. In this way, it helps identify bottlenecks at various points in the system.
### Supported Metrics
Dynolog’s always-on or continuous monitoring supports the following class of metrics:
1. System/kernel metrics.
2. CPU Performance Monitoring Unit (PMU) metrics using Linux perf_event.
3. NVIDIA GPU metrics from DCGM if enabled.
A detailed list of covered metrics is provided in [docs/Metrics.md](docs/Metrics.md).
## Getting Started
### Installation
Dynolog can be installed using the package manager of your choice, with either an RPM package for CentOS-like distros or a Debian package for Ubuntu-like distros. We do not support non-Linux platforms.
There are no required dependencies, except for DCGM if you need to monitor GPUs; please see the section on [GPU monitoring](#gpu-monitoring) below.
Obtain the latest dynolog release or pick one of the releases [here](https://github.com/facebookincubator/dynolog/releases):
```bash
# for CentOS
wget https://github.com/facebookincubator/dynolog/releases/download/v0.2.1/dynolog-0.2.1-1.el8.x86_64.rpm
sudo rpm -i dynolog-0.2.1-1.el8.x86_64.rpm
# for Ubuntu or similar Debian based linux distros
wget https://github.com/facebookincubator/dynolog/releases/download/v0.2.1/dynolog_0.2.1-0-amd64.deb
sudo dpkg -i dynolog_0.2.1-0-amd64.deb
```
#### No sudo access?
Dynolog can run in userspace mode with most features functional. There are a few options to run dynolog in userspace.
One way is to simply decompress the RPM or debian packages as shown below.
```bash
mkdir -p dynolog_pkg; cd dynolog_pkg
wget https://github.com/facebookincubator/dynolog/releases/download/v0.2.1/dynolog_0.2.1-0-amd64.deb
ar x dynolog_0.2.1-0-amd64.deb; tar xvf data.tar.xz
# binaries should now be available in ./usr/local/bin, you can add this directory to your $PATH.
```
Alternatively, you can [build dynolog from source](#building-from-source). The binaries should be present in the `build/bin` directory.
The packages provide systemd support to run the server as a daemon. You can, however, still run the dynolog server directly in a separate terminal.
### Running dynolog
Start the dynolog service using systemd:
```bash
sudo systemctl start dynolog
```
Note:
* The dynolog service picks up runtime flags from `/etc/dynolog.gflags` if the file is present.
* Output logs will be written to `/var/log/dynolog.log` and logs are automatically rotated.
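For example, to persist a non-default reporting interval across restarts, append a gflags-style line (one flag per line) to that file. The interval value below is illustrative:

```shell
# /etc/dynolog.gflags is read by the daemon at startup; each line
# holds one flag. Writing to /etc requires root.
echo "--reporting_interval_s=30" | sudo tee -a /etc/dynolog.gflags
sudo systemctl restart dynolog
```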
One can check the values of the metrics emitted in the output log file.
```bash
$> tail /var/log/dynolog.log
I20220721 23:42:34.141083 3632432 Logger.cpp:37] Logging : 12 values
I20220721 23:42:34.141104 3632432 Logger.cpp:38] time = 2022-07-21T23:42:34.141Z data = {"cpu_i":"71.342" …
```
The `dyno` command line tool communicates with the dynolog daemon on the local or remote system.
For example, we can verify if the daemon is running using the `status` subcommand.
```bash
$> dyno status
response length = 12
response = {"status":1}
$> dyno --hostname some_remote_host.com status
response length = 12
response = {"status":1}
```
Run `dyno --help` for help on other subcommands.
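Because `dyno` accepts `--hostname`, a short shell loop can check the daemon across several machines at once. The host names below are placeholders:

```shell
# Placeholder host names; replace with your own machines.
for host in node1.example.com node2.example.com; do
  echo "== $host =="
  dyno --hostname "$host" status
done
```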
**Server Command Line options**
Lastly, the dynolog server provides various flags; we list the key ones here. Run `dynolog --help` for more info.
* `--port` (default = 1778) - the port used to setup a service for remote queries.
* `--reporting_interval_s` (default=60) - the reporting interval for metrics. Please see the [Logging](#logging) section for more details.
* `--enable_ipc_monitor` - sets up an inter-process communication endpoint. This can be used to talk to applications like PyTorch trainers.
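Putting these flags together, a typical foreground invocation when not using systemd might look like the following (flag values are illustrative):

```shell
# Serve remote queries on port 1778, report metrics every 30 seconds,
# and open the IPC endpoint used by the PyTorch integration.
dynolog --port 1778 --reporting_interval_s 30 --enable_ipc_monitor
```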
### Collecting pytorch/GPU traces
To enable PyTorch profiling, add the flag `--enable_ipc_monitor`. This enables the server to communicate with the PyTorch process.
If you are running the server with systemd, do the following:
```bash
echo "--enable_ipc_monitor" | sudo tee -a /etc/dynolog.gflags
sudo systemctl restart dynolog
```
You also need to use a compatible version of PyTorch (**v1.13.0** or later) and set the environment variable `KINETO_USE_DAEMON=1` before running the PyTorch program.
See [docs/pytorch_profiler.md](docs/pytorch_profiler.md) for details.
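On the application side, the only change is the environment variable; `train.py` below is a placeholder for your own PyTorch program:

```shell
# KINETO_USE_DAEMON=1 tells the PyTorch profiler (Kineto) to register
# with the local dynolog daemon so traces can be triggered on demand.
export KINETO_USE_DAEMON=1
python train.py   # train.py stands in for your training script
```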
Traces can now be captured using the `gputrace` subcommand:
```bash
dyno gputrace --pids --log_file