# Benchmarking-Malware-Family-Classification

**Repository Path**: frontxiang/Benchmarking-Malware-Family-Classification

## Basic Information

- **Project Name**: Benchmarking-Malware-Family-Classification
- **Description**: No description available
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-06-12
- **Last Updated**: 2024-06-12

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# A Comprehensive Study on Learning-Based PE Malware Family Classification Methods

## Datasets
For copyright reasons, the MalwareBazaar and MalwareDrift datasets contain only the SHA-256 hash of each malware sample and its related information, all of which can be found in the `Datasets` folder. You can download the raw malware samples from the malware-sharing websites listed below after applying for an API key, and then use a disassembly tool to convert the samples into binary and disassembly files.  
* **The MalwareBazaar dataset**: you can download the samples from [MalwareBazaar](https://bazaar.abuse.ch/).  
* **The MalwareDrift dataset**: you can download the samples from [VirusShare](https://virusshare.com/).
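
Because the `Datasets` folder lists only SHA-256 hashes, it can be useful to check each downloaded sample against its listed hash before disassembling it. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and are not part of this repository.

```python
import hashlib
import re

# A hex-encoded SHA-256 digest is exactly 64 hexadecimal characters.
_SHA256_RE = re.compile(r"[0-9a-fA-F]{64}")


def is_sha256(value: str) -> bool:
    """Return True if value looks like a hex-encoded SHA-256 digest."""
    return _SHA256_RE.fullmatch(value.strip()) is not None


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large samples are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hash: str) -> bool:
    """Compare a downloaded sample against the hash listed in the dataset."""
    return sha256_of_file(path) == expected_hash.strip().lower()
```

A sample whose computed hash does not match its listed hash was probably corrupted or mislabeled during download and is best re-fetched.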

## Experimental Settings
|     Model    | Training Strategy | Optimizer | Learning Rate | Batch Size |                      Input Format                     |
|:------------:|:-----------------:|:---------:|:-------------:|:----------:|:-----------------------------------------------------:|
|   ResNet-50  |    From Scratch   |    Adam   |      1e-3     |     64     |                  224*224 color image                  |
|   ResNet-50  |      Transfer     |    Adam   |      1e-3     |  All data* |                  224*224 color image                  |
|    VGG-16    |    From Scratch   |    SGD    |     5e-6**    |     64     |                  224*224 color image                  |
|    VGG-16    |      Transfer     |    SGD    |      5e-6     |     64     |                  224*224 color image                  |
| Inception-V3 |    From Scratch   |    Adam   |      1e-3     |     64     |                  224*224 color image                  |
| Inception-V3 |      Transfer     |    Adam   |      1e-3     |  All data  |                  224*224 color image                  |
|     IMCFN    |    From Scratch   |    SGD    |    5e-6***    |     32     |                  224*224 color image                  |
|     IMCFN    |      Transfer     |    SGD    |    5e-6***    |     32     |                  224*224 color image                  |
|   CBOW+MLP   |         -         |    SGD    |      1e-3     |     128    |       CBOW: byte sequences; MLP: 256*256 matrix       |
|    MalConv   |         -         |    SGD    |      1e-3     |     32     |                  2MB raw byte values                  |
|     MAGIC    |         -         |    Adam   |      1e-4     |     10     |                          ACFG                         |
| Word2Vec+KNN |         -         |     -     |       -       |      -     | Word2Vec: Opcode sequences; KNN distance measure: WMD |
|     MCSC     |         -         |    SGD    |      5e-3     |     64     |                    Opcode sequences                   |
  
\* The batch size is set to `128` for the MalwareBazaar dataset  
\** The learning rate is set to `5e-5` for the Malimg dataset and `1e-5` for the MalwareBazaar dataset  
\*** The learning rate is set to `1e-5` for the MalwareBazaar dataset  
CBOW uses the default parameters of the Word2Vec implementation in the Gensim library for Python.  
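
The footnotes override individual table entries for specific datasets. The sketch below shows one way to resolve the effective settings for a given model, training strategy, and dataset. It covers only the rows that carry a footnote marker; the dictionary layout and the `resolve_settings` helper are illustrative assumptions, not code from this repository.

```python
# Base settings copied from the footnoted rows of the table above.
BASE_SETTINGS = {
    ("ResNet-50", "Transfer"): {
        "optimizer": "Adam", "learning_rate": 1e-3, "batch_size": "All data"},
    ("VGG-16", "From Scratch"): {
        "optimizer": "SGD", "learning_rate": 5e-6, "batch_size": 64},
    ("IMCFN", "From Scratch"): {
        "optimizer": "SGD", "learning_rate": 5e-6, "batch_size": 32},
    ("IMCFN", "Transfer"): {
        "optimizer": "SGD", "learning_rate": 5e-6, "batch_size": 32},
}

# Dataset-specific overrides taken from the footnotes (*, **, ***).
OVERRIDES = {
    ("ResNet-50", "Transfer", "MalwareBazaar"): {"batch_size": 128},
    ("VGG-16", "From Scratch", "Malimg"): {"learning_rate": 5e-5},
    ("VGG-16", "From Scratch", "MalwareBazaar"): {"learning_rate": 1e-5},
    ("IMCFN", "From Scratch", "MalwareBazaar"): {"learning_rate": 1e-5},
    ("IMCFN", "Transfer", "MalwareBazaar"): {"learning_rate": 1e-5},
}


def resolve_settings(model: str, strategy: str, dataset: str) -> dict:
    """Return the base settings with any footnote override for the dataset applied."""
    settings = dict(BASE_SETTINGS[(model, strategy)])
    settings.update(OVERRIDES.get((model, strategy, dataset), {}))
    return settings
```

For example, `resolve_settings("VGG-16", "From Scratch", "Malimg")` keeps the SGD optimizer and batch size of 64 from the table but replaces the learning rate with the footnoted `5e-5`.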


## Graphical Analysis of Table 4 and Table 5  
The figures below present the results in **Table 4** and **Table 5** of the paper graphically, to make the raw numbers easier to digest.

### **Table 4**

- **The classification performance (F1-Score) of each approach on three datasets**
![classification performance](https://github.com/MHunt-er/Benchmarking-Malware-Family-Classification/blob/main/Graphically%20Analysis%20of%20Table%204%20and%20Table%205/table_4_1.png)

   The figure shows the classification performance (F1-Score) of each method on the three datasets. Note that the Malimg dataset contains only malware images, so it can only be used to evaluate the four image-based methods.

- **The average classification performance (F1-Score) of each approach for three datasets**
![average classification performance](https://github.com/MHunt-er/Benchmarking-Malware-Family-Classification/blob/main/Graphically%20Analysis%20of%20Table%204%20and%20Table%205/table_4_2.png)

   The figure shows the average classification performance (F1-Score) of each method across the three datasets. The value for each model is obtained by averaging that model's F1-Scores on the three datasets.

- **The training time and resource overhead of each method on three datasets**  
![resource consumption](https://github.com/MHunt-er/Benchmarking-Malware-Family-Classification/blob/main/Graphically%20Analysis%20of%20Table%204%20and%20Table%205/table_4_resource.png)

   The figure shows the training time (left subgraph) and resource overhead (right subgraph) required by each method on the three datasets. The bar immediately to the right of a model's training-time bar shows that model's memory overhead. As before, only the four image-based models are shown for the Malimg dataset.
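
The per-model averaging described for the average-performance figure above can be sketched as follows. The helper name and the scores are placeholders for illustration only; they are not results from the paper. The README does not state how methods without Malimg results are averaged, so this sketch assumes the mean is taken over whichever dataset scores are supplied.

```python
from statistics import mean


def average_f1(scores_by_dataset: dict) -> float:
    """Average one model's F1-Scores over the datasets it was evaluated on."""
    if not scores_by_dataset:
        raise ValueError("at least one dataset score is required")
    return mean(scores_by_dataset.values())


# Placeholder values for illustration only; NOT results from the paper.
example_scores = {"BIG-15": 0.90, "Malimg": 0.80, "MalwareBazaar": 0.70}
example_average = average_f1(example_scores)
```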

### **Table 5**
- **The classification performance (F1-Score) of transfer learning for image-based approaches on three datasets**
![transfer learning](https://github.com/MHunt-er/Benchmarking-Malware-Family-Classification/blob/main/Graphically%20Analysis%20of%20Table%204%20and%20Table%205/table_5_performance.png)

   This figure shows the F1-Score obtained by each image-based model when trained from scratch and with 10%, 50%, 80%, and 100% transfer learning. The subgraphs correspond to the BIG-15, Malimg, and MalwareBazaar datasets, respectively.

- **The training time and resource overhead of transfer learning for image-based approaches on three datasets**  
![resource consumption](https://github.com/MHunt-er/Benchmarking-Malware-Family-Classification/blob/main/Graphically%20Analysis%20of%20Table%204%20and%20Table%205/table_5_resource.png)

   The rows correspond to the BIG-15, Malimg, and MalwareBazaar datasets, respectively. Each row covers four models (ResNet-50, VGG-16, Inception-V3, and IMCFN), and each model has eight bars: the left four show the training time under 10%, 50%, 80%, and 100% transfer learning, and the right four show the memory overhead under the same four settings.