# ai-dial-log-parser

**Repository Path**: mirrors_epam/ai-dial-log-parser

## Basic Information

- **Project Name**: ai-dial-log-parser
- **Description**: A tool that parses ai-dial-core logs into structured parquet datasets organized by deployment name and date, enabling fine-grained access control and easy analysis. Supports S3 and other storage backends.
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: development
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-05-23
- **Last Updated**: 2025-12-27

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Dial log parser

## Overview

Dial log parser is a tool that parses DIAL log files and repacks them into a parquet dataset.

Example:

```
docker run ai-dial-log-parser:development --input s3://bucket-with-dial-core-logs/ --output s3://bucket-with-parsed-logs/parsed_logs
```

The command above will read files like `s3://bucket-with-dial-core-logs/date=2023-11-061699285645-11111111-2222-3333-4444-555555555555.log.gz` for yesterday's date, split the logs by deployment name, and repack them into parquet tables.

Example list of output parquet files:

```
s3://bucket-with-dial-core-logs/parsed_logs/some-assistant/2023-11-06/part-0.parquet
...
s3://bucket-with-dial-core-logs/parsed_logs/gpt-35-turbo/2023-11-06/part-0.parquet
s3://bucket-with-dial-core-logs/parsed_logs/gpt-35-turbo/2023-11-06/part-1.parquet
...
s3://bucket-with-dial-core-logs/parsed_logs/some-application/2023-11-06/part-0.parquet
```

You can then configure access control by prefix, e.g. `s3://bucket-with-dial-core-logs/parsed_logs/some-application/`, to let the developers of that application access their prompt logs.

This directory structure can be read by tools like pyarrow as a single dataset:

```python
import pyarrow.dataset as ds

data = ds.dataset(
    "s3://bucket-with-dial-core-logs/parsed_logs/",
    partitioning=ds.partitioning(field_names=["deployment_name", "date"]),
    exclude_invalid_files=True)

data.head(
    10,
    filter=ds.field("deployment_name") == "some-application"
).to_pandas()
```

## Configuration

The configuration can be set using environment variables or command-line arguments.

### Environment variables

The following environment variables can be used for configuration:

|Variable|Required|Description|
|---|---|---|
|`DIAL_LOG_PARSER_INPUT`| required | Path to the input log directory |
|`DIAL_LOG_PARSER_OUTPUT`| required | Path to the output log directory |
|`DIAL_LOG_PARSER_DATE`| optional | Date to process logs for (default: yesterday) |
|`DIAL_LOG_PARSER_DEBUG`| optional | Enables debug logging |
|`DIAL_LOG_PARSER_FILENAME_REGEX`| optional | Overrides the regex used to match log file names (default: `date=(\d{4}-\d{2}-\d{2})(\d+)-(\w{8}-\w{4}-\w{4}-\w{4}-\w{12}).log(.gz)?`) |

### Storage specific environment variables

Specific storage implementations may require additional environment variables to be set. For example, for S3, `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` may be required. See https://s3fs.readthedocs.io/en/latest/#credentials

Fsspec-compatible implementations should be supported (installing extra packages into the Docker image may be required). Check the lists of [Built-in Fsspec Implementations](https://filesystem-spec.readthedocs.io/en/latest/api.html#implementations) and [Other Known Fsspec Implementations](https://filesystem-spec.readthedocs.io/en/latest/api.html#external-implementations) for more details.
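For example, before running the parser you can sanity-check that the input location is readable and that the file names match the default `DIAL_LOG_PARSER_FILENAME_REGEX`. A minimal sketch, assuming `s3fs` is installed and S3 credentials are available in the environment; the bucket name is taken from the examples above:

```python
import re

import fsspec

# Default pattern from DIAL_LOG_PARSER_FILENAME_REGEX.
FILENAME_REGEX = re.compile(
    r"date=(\d{4}-\d{2}-\d{2})(\d+)-(\w{8}-\w{4}-\w{4}-\w{4}-\w{12}).log(.gz)?"
)

# Credentials are resolved by s3fs as described in its documentation.
fs = fsspec.filesystem("s3")
for path in fs.ls("bucket-with-dial-core-logs"):
    match = FILENAME_REGEX.search(path)
    if match:
        # The first capture group is the log date, e.g. "2023-11-06".
        print(path, match.group(1))
```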
### Command-line arguments

```
Usage: python -m aidial_log_parser.parse_logs [OPTIONS]

  Parse dial log files and repack it to parquet dataset.

Options:
  -i, --input TEXT       Path to input log directory  [env var:
                         DIAL_LOG_PARSER_INPUT; required]
  -o, --output TEXT      Path to output log directory  [env var:
                         DIAL_LOG_PARSER_OUTPUT; required]
  -d, --date [%Y-%m-%d]  Date to process logs for  [env var:
                         DIAL_LOG_PARSER_DATE; default: 2024-06-09]
  --debug                Enable debug logging  [env var: DIAL_LOG_PARSER_DEBUG]
  --filename-regex TEXT  Regex to match log file names  [env var:
                         DIAL_LOG_PARSER_FILENAME_REGEX; default:
                         date=(\d{4}-\d{2}-\d{2})(\d+)-(\w{8}-\w{4}-\w{4}-\w{4}-\w{12}).log(.gz)?]
  --help                 Show this message and exit.
```

## Output format

The output format tries to preserve all the data from the raw logs, adding a few columns that make the most useful data easy to access.

Fields encoded in the path:

* **deployment_name** - name of the deployment (e.g. `gpt-35-turbo`, `some-assistant`, `some-application`)
* **date** - date of the log file (e.g. `2023-11-06`)

Fields in the parquet file:

* **request** - structure with the request data. It has the following fields:
  * **uri** - URI of the request
  * **time** - timestamp of the request
  * **body** - string with the body of the request. See the [Dial API documentation](https://dialx.ai/dial_api#operation/sendChatCompletionRequest) for the format of the request body.
* **response** - structure with the response data. It has the following fields:
  * **status** - status code of the response
  * **body** - string with the body of the response. See the [Dial API documentation](https://dialx.ai/dial_api#operation/sendChatCompletionRequest) for the format of the response body.
* **token_usage** - structure with the token usage data. It has the following fields:
  * **prompt_tokens** - number of tokens in the prompt
  * **completion_tokens** - number of tokens in the completion
  * **total_tokens** - total number of tokens in the request
  * **deployment_price** - the cost of this specific request, excluding the cost of any requests it directly or indirectly initiated
  * **price** - the total cost of the request, including the cost of this request and all related requests it directly or indirectly triggered
* **assembled_response** - JSON with the assembled response for chat/completion requests. If the request was made with `streaming=true`, the field contains the assembled streaming response.
* **question** - the last user message in the message history for chat/completion requests.
* **answer** - string with the application/model response for chat/completion requests.

The **question** and **answer** fields are not present in the raw logs but are added to the parquet file for convenience. They can simplify log analysis for simple applications that do not rely on message history or multiple answer choices. An example of reading these columns is shown below.
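For example, the convenience columns can be combined with the token usage data in pandas. A minimal sketch, assuming the output location and deployment name from the examples above; the **question** and **answer** columns are only meaningful for chat/completion requests:

```python
import pyarrow.dataset as ds

data = ds.dataset(
    "s3://bucket-with-dial-core-logs/parsed_logs/",
    partitioning=ds.partitioning(field_names=["deployment_name", "date"]),
    exclude_invalid_files=True)

df = data.to_table(
    columns=["date", "question", "answer", "token_usage"],
    filter=ds.field("deployment_name") == "gpt-35-turbo",
).to_pandas()

# token_usage is a struct column; after to_pandas() each cell is a dict.
df["total_tokens"] = df["token_usage"].map(
    lambda usage: usage["total_tokens"] if usage else None)
print(df[["date", "question", "total_tokens"]].head())
```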
## Developer environment

This project uses [Python>=3.12](https://www.python.org/downloads/) and [Poetry>=1.8.5](https://python-poetry.org/) as a dependency manager. Check out Poetry's [documentation on how to install it](https://python-poetry.org/docs/#installation) on your system before proceeding.

To install the requirements:

```sh
poetry install
```

This will install all requirements for running the package, linting, formatting and tests.

### IDE configuration

The recommended IDE is [VSCode](https://code.visualstudio.com/). Open the project in VSCode and install the recommended extensions.

VSCode is configured to use the PEP-8 compatible formatter [Black](https://black.readthedocs.io/en/stable/index.html).

Alternatively, you can use [PyCharm](https://www.jetbrains.com/pycharm/). Set up the Black formatter for PyCharm [manually](https://black.readthedocs.io/en/stable/integrations/editors.html#pycharm-intellij-idea) or install PyCharm>=2023.2 with [built-in Black support](https://blog.jetbrains.com/pycharm/2023/07/2023-2/#black).

### Make on Windows

As of now, Windows distributions do not include the make tool. To run make commands, the tool can be installed using the following command (since [Windows 10](https://learn.microsoft.com/en-us/windows/package-manager/winget/)):

```sh
winget install GnuWin32.Make
```

For convenience, the tool folder can be added to the PATH environment variable as `C:\Program Files (x86)\GnuWin32\bin`.

The command definitions inside the Makefile should be cross-platform to keep the development environment setup simple.

## Lint

Run the linting before committing:

```sh
make lint
```

To auto-fix formatting issues run:

```sh
make format
```

## Test

Run unit tests locally:

```sh
make test
```

## Clean

To remove the virtual environment and build artifacts:

```sh
make clean
```