DitDetector leverages bimodal learning based on deceptive images and text for macro malware detection. Specifically, we extract preview images of documents using Oracle's image export SDK and extract textual information from the preview images using an open-source OCR engine. The bimodal model of DitDetector contains a visual encoder, a textual encoder, and a forward neural network that learns from the joint representation of the two encoders' outputs.
DitDetector can detect malicious Microsoft Office documents that contain deceptive information, but it cannot detect malicious MS Office documents that contain only VBA code.
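To make the idea of a joint representation concrete, here is a minimal pure-Python sketch of late fusion. It is not the DitDetector implementation: the function names, the use of concatenation as the fusion step, the single hidden layer, and the 0.5 decision threshold are all assumptions made for illustration, and the weights are untrained placeholders.

```python
import math
import random


def linear(weights, bias, x):
    """Compute y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fuse_and_classify(img_feat, txt_feat, params):
    """Concatenate both encoder outputs (assumed fusion) and score them."""
    joint = list(img_feat) + list(txt_feat)  # joint representation
    hidden = relu(linear(params["w1"], params["b1"], joint))
    logit = linear(params["w2"], params["b2"], hidden)[0]
    prob = sigmoid(logit)
    # Label convention from this README: 1 = malicious, 0 = benign.
    return (1 if prob >= 0.5 else 0), prob


def random_params(in_dim, hidden_dim, seed=0):
    """Untrained placeholder weights; a real model would load trained ones."""
    rng = random.Random(seed)
    return {
        "w1": [[rng.uniform(-1, 1) for _ in range(in_dim)]
               for _ in range(hidden_dim)],
        "b1": [0.0] * hidden_dim,
        "w2": [[rng.uniform(-1, 1) for _ in range(hidden_dim)]],
        "b2": [0.0],
    }


if __name__ == "__main__":
    # Hypothetical 3-dim visual and 2-dim textual feature vectors.
    params = random_params(in_dim=5, hidden_dim=4)
    print(fuse_and_classify([0.1, 0.2, 0.3], [0.4, 0.5], params))
```

In the actual tool, the two feature vectors would come from the visual and textual encoders rather than being written by hand.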
Oracle Outside in Technology (OIT) - a tool that statically exports preview images from MS Office documents. The download and related information are available on the Oracle blog.
We provide OIT version 8.5.4 in the directory ./src/oit_linux_8.5.4.
Before the next step, build the demo:
cd ./src/oit_linux.8.5.4 && bash makedemo.sh && sudo chmod 777 sdk/demo/exsimple
Tesseract - an open-source text recognition (OCR) engine. We install it on CentOS:
yum install epel-release -y
yum install tesseract -y
Create the environment through Anaconda
conda env create -f environment.yaml
Before creating the environment, change the prefix of the conda environment in the yaml file to match your own system.
Unzip the test data: the test data contains malicious samples, so we have compressed it into a password-protected zip archive. Before running DitDetector, use the password ditdetector to unzip the ms_files.zip archive.
cd ./data/test_data && unzip -P ditdetector -x ms_files.zip -d ./
The proposed detector code is in ./src, and some pre-trained models and test data are in ./data.
Modules:
./src/TextCNN
./src/MobileNetV3
./src/BiModal
End-to-end tool:
./src/DitDetector
This is an end-to-end detector that takes Microsoft (MS) Office documents as input and outputs a label indicating whether each sample is malicious or benign (i.e., '1' for malicious and '0' for benign).
cd ./src/DitDetector
python ditDetector.py -m ur_ms_files_dir -e ur_export_img_dir
The ur_ms_files_dir is the directory that stores the MS documents to detect; the default value is ./data/test_data/ms_files/.
The ur_export_img_dir is the directory that caches the images exported by Oracle Outside in Technology (OIT); the default value is ./data/test_data/export_img/.
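As an illustration of the two options and their defaults, the following sketch parses them with argparse. This is an assumption about how such a command line could be handled, not a copy of ditDetector.py: the attribute names ms_files_dir and export_img_dir and the help strings are hypothetical.

```python
import argparse


def build_parser():
    """Parser for the -m and -e options described in this README."""
    parser = argparse.ArgumentParser(
        description="Detect macro malware in MS Office documents.")
    parser.add_argument(
        "-m", dest="ms_files_dir",  # hypothetical attribute name
        default="./data/test_data/ms_files/",
        help="directory containing the MS documents to detect")
    parser.add_argument(
        "-e", dest="export_img_dir",  # hypothetical attribute name
        default="./data/test_data/export_img/",
        help="directory used to cache the images exported by OIT")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.ms_files_dir, args.export_img_dir)
```

Running the script with no arguments falls back to the two default directories listed above.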
Note: If OIT fails to parse the MS Office files with the error message
EXRunExport() failed: No valid fonts found (0x0B03)
set GDFONTPATH to point to your font folder:
export GDFONTPATH=your_font_folder_path
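If the export tool is launched from a Python script rather than from an interactive shell, the same variable can be set for the child process only. The helper below is a small sketch; the function name env_with_font_path is hypothetical, and the font directory you pass in depends on your system.

```python
import os


def env_with_font_path(font_dir, base_env=None):
    """Return a copy of the environment with GDFONTPATH set to font_dir."""
    env = dict(os.environ if base_env is None else base_env)
    env["GDFONTPATH"] = os.path.abspath(font_dir)
    return env


if __name__ == "__main__":
    # Example value only; use the folder that actually holds your fonts.
    env = env_with_font_path("your_font_folder_path")
    print(env["GDFONTPATH"])
```

The returned dictionary can be passed as the env argument of subprocess.run, so the setting applies to that command without changing the parent shell.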
The workflow of DitDetector is as follows.
1. Export preview images of the MS documents in ur_ms_files_dir with exsimple from Oracle OIT, and cache the exported images in the directory ur_export_img_dir.
2. Extract the text from the preview images with Tesseract, an open-source Optical Character Recognition (OCR) tool, clean the textual information, and save it to a temporary file, ./data/test_data/docText.csv.
3. Encode the text and the images with the two encoders (TextCNN and MobileNetV3).
4. Pass the encoders' outputs to biModal to detect.
5. Output a label for each document: malicious (i.e., 1) or benign (i.e., 0).

DitDetector's execution time varies. The most time-consuming part is the preprocessing module, which includes OIT exporting the preview images and OCR extracting the text. The OCR module in particular takes longer the more text the document preview images contain. In addition, the time estimates we give include the loading time of each model, so the actual sample inference time should be shorter. We detect 20 MS Office documents with an end-to-end execution time of about 130 s; if we exclude both the OCR and OIT modules, the remaining process takes only 13.71 s.
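The OCR stage of the workflow, which the timing discussion identifies as a major cost, can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: the cleaning rule (collapse whitespace, lowercase), the CSV column names file and text, and the function names are hypothetical. The only external call is the standard Tesseract command line, which prints recognized text when the output base is stdout. The OIT export step is left out because its command line is not described here.

```python
import csv
import subprocess
import time
from pathlib import Path


def ocr_image(image_path):
    """Run the Tesseract CLI on one image and return the recognized text."""
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout"],
        capture_output=True, text=True, check=True)
    return result.stdout


def clean_text(raw):
    """Assumed cleaning rule: collapse whitespace and lowercase."""
    return " ".join(raw.split()).lower()


def extract_texts(image_dir, csv_path, ocr=ocr_image):
    """OCR every file in image_dir, save the cleaned text, and time the stage."""
    start = time.perf_counter()
    rows = []
    for image_path in sorted(Path(image_dir).iterdir()):
        if image_path.is_file():
            rows.append((image_path.name, clean_text(ocr(image_path))))
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "text"])  # hypothetical column names
        writer.writerows(rows)
    return rows, time.perf_counter() - start


if __name__ == "__main__":
    rows, seconds = extract_texts("./data/test_data/export_img/",
                                  "./data/test_data/docText.csv")
    print(f"OCR processed {len(rows)} images in {seconds:.2f}s")
```

Because the OCR function is passed in as a parameter, the stage can be timed or tested with a stub in place of Tesseract.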