# DitDetector

**Repository Path**: yjasper/dit-detector

## Basic Information

- **Project Name**: DitDetector
- **Description**: DitDetector leverages bimodal learning based on deceptive images and text for macro malware detection.
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2022-09-15
- **Last Updated**: 2023-06-07

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# DitDetector

## Background 
DitDetector leverages bimodal learning based on deceptive images and text for macro malware detection.
Specifically, we extract preview images of documents based on an image export SDK of Oracle and 
extract textual information from preview images based on an open-source OCR engine. 
And bimodal model of DitDetector contains a visual encoder, a textual encoder, and a forward neural 
network, which learns based on the joint representation of the two encoders' outputs.

DitDetector can detect the malicious Microsoft Office documents with deceptive information, but cannot
detect the malicious MS documents that only contains VBA codes.

![image text](media/workflow.png)


## Install

**Oracle Outside in Technology (OIT)**  -  a tool statically export the preview images from the MS Office 
documents, download and related information in the Oracle 
[blog](https://blogs.oracle.com/fusionmiddlewaresupport/post/oracle-outside-in-technology-855-has-been-released).
We have offered OIT at version of 8.5.4 in directory `./src/oit_linux_8.5.4`.
Before next step, please make demo:
```commandline
cd ./src/oit_linux.8.5.4 && bash makedemo.sh  && sudo chmod 777 sdk/demo/exsimple
```

**[Tesseract](https://tesseract-ocr.github.io/tessdoc/Installation.html)** - an open source text recognition 
(OCR) engine, we install it on CentOS

```commandline
yum install epel-release -y
yum install tesseract -y
```


**Create the environment through Anaconda**
```commandline
conda env create -f environment.yaml
```
You should change the prefix of conda environment in the yaml file.


**unzip the test data**: The test data contains malicious samples, so we have compressed the test data by zip 
encryption. Before running DitDetector, please use the *password* `ditdetector` to unzip the `ms_files.zip` archive.
```commandline
cd ./data/test_data && unzip -P ditdetector -x ms_files.zip -d ./
```


## Code Structure
The proposed detector code is in `./src`, some pre-trained models and test data are in `./data`.

Modules:
- Textual encoder `./src/TextCNN`
- Visual encoder `./src/MobileNetV3`
- Bimodal detector `./src/BiModal`

End-to-end tool:
- DitDetector `./src/DitDetector`


## Usage
This is an end-to-end detector, which take input as the Microsoft (MS) Office documents and output 
a label if the sample is malicious or benign (i.e., '1' for malice and '0' for benign).
```commandline
cd ./src/DitDetector
python ditDetector.py  -m ur_ms_files_dir  -e ur_export_img_dir
```

The `ur_ms_files_dir` is the directory to store the MS documents to detect, the default value is 
`./data/test_data/ms_files/`.
The `ur_export_img_dir` is the directory to cache the exported images by Oracle Outside in 
Technology (OIT), and the default value is `./data/test_data/export_img/`.


*Note: If you encounter the error message shown below when parsing the MS Office files with the OIT, use the 
`export GDFONTPATH=your_font_folder_path` to specify to the font folder.*
```commandline
EXRunExport() failed: No valid fonts found (0x0B03)
```


The workflow of DitDetector is as follows.
- Step 1: export the preview images of MS Office documents in `ur_ms_files_dir` by `exsimple`
in Oracle OIT, and cache the exported images into the directory `ur_export_img_dir`.
- Step 2: extract the textual information from the preview images via `Tesseract`, an open-source 
Optical Character Recognition (OCR) tool, clean the textual information and save them into a 
temp file `./data/test_data/docText.csv`.
- Step 3: generate the visual and textual feature representations via encoders (i.e., `TextCNN` 
and `MobileNetV3`).
- Step 4: concatenate these two modalities of feature representations as the bi-modal representations 
and feed them into the `biModal` to detect.
- Step 5: output the final decision whether the input document is malicious (i.e., `1`) or benign (i.e., `0`).

## Test Environment
- Intel(R) Xeon(R) CPU E5-2630 v4 (2.20GHz and 40 cores)
- NVIDIA Tesla P100 (12GB)
- CentOS 7.9.2009
- Python 3.6.13
- conda 4.14.0

## Execution Time Estimate 
DitDetector execution time is flexible. In fact, the most time-consuming part is the preprocessing module, including OIT exporting preview images and OCR extracting text. Especially the OCR module, the more text in the document preview images, the more time OCR takes.
In addition, we give time estimates containing the loading time of each model, and the actual sample inference time should be shorter.
We detect 20 MS office documents with an end-to-end execution time of about 130s, and if we exclude both OCR and OIT modules, the remaining process takes only 13.71s.