# drain-java

**Repository Path**: haoxm/drain-java

## Basic Information

- **Project Name**: drain-java
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MPL-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-11-15
- **Last Updated**: 2023-11-15

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

= drain java

image:https://github.com/bric3/drain-java/actions/workflows/gradle.yml/badge.svg[Java CI with Gradle,link=https://github.com/bric3/drain-java/actions/workflows/gradle.yml]

== Introduction

drain-java is a continuous _log template miner_, for each log message it extracts
tokens and group them into _clusters of tokens_. As new log messages are added,
drain-java will identify similar token and update the cluster with the new template,
or simply create a new token cluster. Each time a cluster is matched a counter is
incremented.

These clusters are stored in prefix tree, which is somewhat similar to a trie, but
here the tree as a fixed depth in order to avoid long tree traversal.
In avoiding deep trees this also helps to keep it balance.

== Usage

First, https://foojay.io/almanac/jdk-11/[Java 11] is required to run drain-java.

=== As a dependency

You can consume drain-java as a dependency in your project `io.github.bric3.drain:drain-java-core`,
currently only https://s01.oss.sonatype.org/content/repositories/snapshots/io/github/bric3/drain/[snapshots]
are available by adding this repository.

[source, kotlin]
----
repositories {
    maven {
        url("https://oss.sonatype.org/content/repositories/snapshots/")
    }
}
----

=== From command line

Since this tool is not yet released the tool needs to be built locally.
Also, the built jar is not yet super user-friendly. Since it's not a finished
product, anything could change.

.Example usage
[source, shell]
----
$ ./gradlew build
$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar -h

tail - drain
Usage: tail [-dfhV] [--verbose] [-n=NUM]
            [--parse-after-str=FIXED_STRING_SEPARATOR]
            [--parser-after-col=COLUMN] FILE
...
      FILE          log file
  -d, --drain       use DRAIN to extract log patterns
  -f, --follow      output appended data as the file grows
  -h, --help        Show this help message and exit.
  -n, --lines=NUM   output the last NUM lines, instead of the last 10; or use
                      -n 0 to output starting from beginning
      --parse-after-str=FIXED_STRING_SEPARATOR
                    when using DRAIN remove the left part of a log line up to
                      after the FIXED_STRING_SEPARATOR
      --parser-after-col=COLUMN
                    when using DRAIN remove the left part of a log line up to
                      COLUMN
  -V, --version     Print version information and exit.
      --verbose     Verbose output, mostly for DRAIN or errors
$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar --version
Versioned Command 1.0
Picocli 4.6.3
JVM: 19 (Amazon.com Inc. OpenJDK 64-Bit Server VM 19+36-FR)
OS: Mac OS X 12.6 x86_64
----

By default, the tool act similarly to `tail`, and it will output the file to the stdout.
The tool can _follow_ a file if the `--follow` option is passed.
However, when run with the `--drain` this tool will classify log lines using DRAIN, and will
output identified clusters.
Note that this tool doesn't handle multiline log messages (like logs that contains a stacktrace).

On the SSH log data set we can use it this way.

[source, shell]
----
$ java -jar build/libs/drain-java-1.0-SNAPSHOT-all.jar \
  -d \ <1>
  -n 0 \ <2>
  --parse-after-str "]: " <3>
  build/resources/test/SSH.log <4>
----
<1> Identify patterns in the log
<2> Starts from the beginning of the file (otherwise it starts from the last 10 lines)
<3> Remove the left part of log line (`Dec 10 06:55:46 LabSZ sshd[24200]: `), ie effectively
ignoring some variable elements like the time.
<4> The log file

.log pattern clusters and their occurences
[source]
--------
---- Done processing file. Total of 655147 lines, done in 1.588 s, 51 clusters <1>
0010 (size 140768): Failed password for <*> from <*> port <*> ssh2 <2>
0009 (size 140701): pam unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= <*> <*>
0007 (size 68958): Connection closed by <*> [preauth]
0008 (size 46642): Received disconnect from <*> 11: <*> <*> <*>
0014 (size 37963): PAM service(sshd) ignoring max retries; <*> > 3
0012 (size 37298): Disconnecting: Too many authentication failures for <*> [preauth]
0013 (size 37029): PAM <*> more authentication <*> logname= uid=0 euid=0 tty=ssh ruser= <*> <*>
0011 (size 36967): message repeated <*> times: [ Failed password for <*> from <*> port <*> ssh2]
0006 (size 20241): Failed <*> for invalid user <*> from <*> port <*> ssh2
0004 (size 19852): pam unix(sshd:auth): check pass; user unknown
0001 (size 18909): reverse mapping checking getaddrinfo for <*> <*> failed - POSSIBLE BREAK-IN ATTEMPT!
0002 (size 14551): Invalid user <*> from <*>
0003 (size 14551): input userauth request: invalid user <*> [preauth]
0005 (size 14356): pam unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= <*>
0018 (size 1289): PAM <*> more authentication <*> logname= uid=0 euid=0 tty=ssh ruser= <*>
0024 (size 952): fatal: Read from socket failed: Connection reset by peer [preauth]
...
--------
<1> 51 _types_ of logs were identified from 655147 lines in 1.588s
<2> There was `140768` similar log messages with this pattern, with `3` positions
where the token is identified as parameter `<*>`.

On the same dataset, the java implementation performed roughly around 10 times faster.
As my implementation does not yet have masking, mask configuration was removed in the
Drain3 implementation.

=== From Java

This tool is not yet intended to be used as a library, but for the curious
the DRAIN algorythm can be used this way:

.Minimal DRAIN example
[source, java]
----
var drain = Drain.drainBuilder()
                 .additionalDelimiters("_")
                 .depth(4)
                 .build()
Files.lines(Paths.get("build/resources/test/SSH.log"),
            StandardCharsets.UTF_8)
     .forEach(drain::parseLogMessage);

// do something with clusters
drain.clusters();
----


== Status

Pieces of puzzle are coming in no particular order, I first bootstrapped the code from a simple Java
file. Then I wrote in Java an implementation of Drain. Now here's what I would like to do.

.Todo
- [ ] More unit tests
- [x] Wire things together
- [ ] More documentation
- [x] Implement _tail follow_ mode (currently in drain mode the whole file is read and stops once finished)
- [ ] In follow drain mode dump clusters on forced exit (e.g. for example when hitting `ctrl`+`c`)
- [x] Start reading from the last x lines (like `tail -n 30`)
- [ ] Implement log masking (e.g. log contain an email, or an IP address which may be considered as private data)

.For later
- [ ] Json message field extraction
- [ ] How to handle prefixes : Dates, log level, etc. ; possibly using masking
- [ ] Investigate marker with specific behavior, e.g. log level severity
- [ ] Investigate log with stacktraces (likely multiline)
- [ ] Improve handling of very long lines
- [ ] Logback appender with micrometer counter

== Motivation

I was inspired by a https://sayr.us/log-pattern-recognition/logmine/[blog article from one of my colleague on LogMine],
-- many thanks to him for doing the initial research and explaining concepts --, we were both impressed by the log
pattern extraction of https://docs.datadoghq.com/logs/explorer/patterns/[Datadog's Log explorer], his blog post
sparked my interest.

After some discussion together, we saw that Drain was a bit superior to LogMine.
Googling Drain in Java didn't yield any result, although I certainly didn't search exhaustively,
but regardless this triggered the idea to implement this algorithm in Java.

== References

The Drain port is mostly a port of https://github.com/IBM/Drain3[Drain3]
done by IBM folks (_David Ohana_, _Moshik Hershcovitch_). IBM's Drain3 is a fork of the
https://github.com/logpai/logparser[original work] done by the LogPai team based on the paper of
_Pinjia He_, _Jieming Zhu_, _Zibin Zheng_, and _Michael R. Lyu_.

_I didn't follow up on other contributors of these projects, reach out if you think you have been omitted._


For reference here's the linked I looked at:

* https://logparser.readthedocs.io/
* https://github.com/logpai/logparser
* https://github.com/IBM/Drain3
* https://jiemingzhu.github.io/pub/pjhe_icws2017.pdf
(a copy of this publication accessible link:doc/pjhe_icws2017.pdf[there])