# mlperf-common

**Repository Path**: mirrors_NVIDIA/mlperf-common

## Basic Information

- **Project Name**: mlperf-common
- **Description**: NVIDIA's launch, startup, and logging scripts used by our MLPerf Training and HPC submissions
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-09-01
- **Last Updated**: 2026-03-29

## README

# MLPerf Common - a collection of common MLPerf tools


## MLPerf Logging

To install MLPerf Common with `pip`, add the following line to your `requirements.txt` file:
```
git+https://github.com/NVIDIA/mlperf-common.git
```

### Integration using torch.distributed (pytorch)

In a `mlperf_logger.py` module, define:
```
from mlperf_common.logging import MLLoggerWrapper
from mlperf_common.frameworks.pyt import PyTCommunicationHandler

mllogger = MLLoggerWrapper(PyTCommunicationHandler(), value=None)
```
Then use `mllogger` by importing `from mlperf_logger import mllogger` in other modules.

### Integration using MPI (horovod/hugectr/mxnet/tensorflow)

In a global `mlperf_logger.py` module, define:
```
from mlperf_common.logging import MLLoggerWrapper
from mlperf_common.frameworks.mxnet import MPICommunicationHandler

mllogger = MLLoggerWrapper(MPICommunicationHandler(), value=None)
```
Then use `mllogger` by importing `from mlperf_logger import mllogger` in other modules.

Optionally, you can pass an MPI communicator when initializing `MPICommunicationHandler`:
```
from mpi4py import MPI  # assumption: the communicator comes from mpi4py

comm = MPI.COMM_WORLD
mllogger = MLLoggerWrapper(MPICommunicationHandler(comm), value=None)
```
By default, `MPICommunicationHandler()` creates a global communicator.

### Logging additional metrics
The MLPerf logger can also be used to track additional, non-required metrics, for example `throughput`. The recommended way is to add a line such as:
```
mllogger.event(key='tracked_stats', metadata={'step': epoch}, value={'throughput': throughput, 'metric_a': metric_a, 'metric_b': metric_b})
```
Here `throughput` should preferably be reported in samples per second and logged every epoch, or as often as is reasonable for the given benchmark. The additional metrics, `metric_a` and `metric_b`, can be any numerical values that need logging. The key `tracked_stats` and an increasing value for `step` are required.
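
As a minimal sketch of how this call might be wrapped for reuse, the helper below computes throughput in samples per second and logs it under `tracked_stats`. The names `log_tracked_stats` and `_RecordingLogger` are hypothetical and exist only for illustration; `_RecordingLogger` stands in for the real `mllogger` so that the snippet runs without `mlperf_common` installed.

```python
class _RecordingLogger:
    """Hypothetical stand-in for `mllogger`; it only records event() calls."""

    def __init__(self):
        self.events = []

    def event(self, key, value=None, metadata=None):
        self.events.append({"key": key, "value": value, "metadata": metadata})


def log_tracked_stats(mllogger, step, num_samples, elapsed_seconds, **extra_metrics):
    """Log throughput (samples per second) plus any extra numeric metrics."""
    throughput = num_samples / elapsed_seconds
    mllogger.event(
        key="tracked_stats",        # required key
        metadata={"step": step},    # must increase from call to call
        value={"throughput": throughput, **extra_metrics},
    )
    return throughput


# Example with made-up numbers: two epochs of 1024 samples each.
logger = _RecordingLogger()
log_tracked_stats(logger, step=1, num_samples=1024, elapsed_seconds=2.0, metric_a=0.5)
log_tracked_stats(logger, step=2, num_samples=1024, elapsed_seconds=4.0, metric_a=0.4)
```

In a real benchmark, the object imported with `from mlperf_logger import mllogger` would be passed in place of the stand-in.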

## Scaleout Bridge

#### init_bridge
Instead of the previous `sbridge = init_bridge(rank)`, initialize `sbridge` as follows:
```
from mlperf_common.frameworks.pyt import PyTNVTXHandler, PyTCommunicationHandler

sbridge = init_bridge(PyTNVTXHandler(), PyTCommunicationHandler(), mllogger)
```
or, for `horovod/tf/mxnet`:
```
from mlperf_common.frameworks.mxnet import MXNetNVTXHandler, MPICommunicationHandler

sbridge = init_bridge(MXNetNVTXHandler(), MPICommunicationHandler(), mllogger)
```
Then start and stop profiling as usual:
```
sbridge.start_prof()
sbridge.stop_prof()
```

#### EmptyObject
The current `ScaleoutBridgeBase` class replaces the previous `EmptyObject` class,
so simply replace `EmptyObject()` with `ScaleoutBridgeBase()`.

## Mount check
### Get mount info
`get-mount-info.sh` prints a description of the given directories. It takes one argument,
`paths_to_verify`, which contains the paths separated by commas.  

Example of use:
```
get-mount-info.sh "/data,/checkpoints"
``` 

Example output:
```
declare -a directory_sizes
declare -a number_of_paths_in_dir
# ----------
directory_sizes+=(
",3743665220,2"
"coco2014,1445240,3"
"laion-400m,3742219976,2"
"coco2014/val2014_512x512_30k,1412012,30000"
"laion-400m/webdataset-moments-filtered,876410532,833"
"laion-400m/webdataset-moments-filtered-encoded,2865809440,832"
)
number_of_paths_in_dir+=(6)
# ----------
directory_sizes+=(
",52403072,5"
"clip,7704556,3"
"inception,93392,1"
"sd,20888768,2"
)
number_of_paths_in_dir+=(4)
```
For each specified path, `number_of_paths_in_dir` contains the number of subdirectories in it. In the example above, this equals the number of entries listed for that path in `directory_sizes`.  
Each entry in `directory_sizes` contains three comma-separated values:  
the relative path to the directory, its size in KB, and the number of directories and files inside it.
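
To illustrate the entry format, here is a small Python sketch that splits one `directory_sizes` entry into its three fields. `DirEntry` and `parse_entry` are hypothetical helpers and not part of the repository. Reading the empty path in the first entry as the verified directory itself is an assumption based on the example output.

```python
from typing import NamedTuple


class DirEntry(NamedTuple):
    path: str       # relative path; "" is assumed to mean the directory itself
    size_kb: int    # size in KB
    num_items: int  # number of directories and files inside


def parse_entry(entry: str) -> DirEntry:
    # Split from the right so a comma inside the path does not break parsing.
    path, size_kb, num_items = entry.rsplit(",", 2)
    return DirEntry(path, int(size_kb), int(num_items))


# Entries copied from the second block of the example output.
entries = [
    ",52403072,5",
    "clip,7704556,3",
    "inception,93392,1",
    "sd,20888768,2",
]
parsed = [parse_entry(e) for e in entries]
```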


### Verify mounts
`verify-mounts.sh` checks whether the given directories are consistent with a description generated by `get-mount-info.sh`. It takes one argument,
`paths_to_verify`, which contains the paths separated by commas.  

Example of use:
```
verify-mounts.sh "/data,/checkpoints"
```
The directory containing `verify-mounts.sh` should also contain the `cont-mount-info.sh` file generated earlier by `get-mount-info.sh`.