# CADE: Contrastive Autoencoder for Drifting detection and Explanation

The repository contains the code for detecting and explaining a specific type of concept drift (i.e., previously unseen families) in security applications like malware attribution and network intrusion classification.

Further details can be found in the paper "*CADE: Detecting and Explaining Concept Drift Samples for Security Applications*" by Limin Yang, Wenbo Guo, Qingying Hao, Arridhana Ciptadi, Ali Ahmadzadeh, Xinyu Xing, Gang Wang (USENIX Security 2021). Due to the page limit, we also include supplemental materials in the repo (`USENIX_21_drifting_Supplementary_Materials.pdf`).

Check out http://liminyang.web.illinois.edu for up-to-date information on the project.

If you end up building on this research or code as part of a project or publication, please include a reference to the USENIX Security paper:

```
@inproceedings{yang2021cade,
  title = {CADE: Detecting and Explaining Concept Drift Samples for Security Applications},
  author = {Yang, Limin and Guo, Wenbo and Hao, Qingying and Ciptadi, Arridhana and Ahmadzadeh, Ali and Xing, Xinyu and Wang, Gang},
  booktitle = {Proc. of USENIX Security},
  year = {2021}
}
```

## 1. Installation

Before getting started, we recommend setting up a Python 3.6.5 or 3.6.8 virtual environment (other versions of Python 3.6 or above might also work, but we have not tested them).
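Before installing, it may help to confirm that the active interpreter belongs to the recommended 3.6 series. The snippet below is a small sketch for that purpose; `matches_recommended` is a hypothetical helper and not part of the repository.

```python
import sys

# The README recommends Python 3.6.5 or 3.6.8, so we compare major.minor only.
RECOMMENDED = (3, 6)


def matches_recommended(version_info=sys.version_info):
    """Return True when the interpreter is in the recommended 3.6 series."""
    return tuple(version_info[:2]) == RECOMMENDED


if __name__ == "__main__":
    current = "%d.%d.%d" % tuple(sys.version_info[:3])
    if matches_recommended():
        print("Python %s is in the recommended 3.6 series." % current)
    else:
        print("Python %s is untested with this code base." % current)
```

Run it inside the virtual environment you plan to use; a mismatch does not necessarily mean the code will fail, only that the combination is untested.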
* If you are using CPU-based tensorflow, install all required packages:

  ```bash
  pip install -r requirements-tensorflow-cpu.txt
  python setup.py install
  ```

* If you are using GPU-based tensorflow, try the following steps to set up the environment:

  ```bash
  module load cuda-toolkit/9.0  # other versions might also work, but we have not tested them
  # you may also use pyenv and virtualenv to create the virtual environment; here we use Anaconda
  conda create -n cade-gpu python=3.6.8
  conda activate cade-gpu
  pip install scipy==1.3.3
  pip install numpy==1.16.1
  pip install --ignore-installed tensorflow-gpu==1.12.0
  pip install keras==2.2.5
  pip install scikit-learn==0.23.2
  pip install matplotlib==3.1.2
  pip install seaborn==0.11.0
  pip install tqdm==4.49.0
  python setup.py install
  ```

## 2. Configuration

The preprocessed Drebin and IDS2018 datasets can be found under the `data` folder. If you prefer to modify the preprocessing step, you may download the original datasets from https://www.sec.cs.tu-bs.de/~danarp/drebin/index.html and https://www.unb.ca/cic/datasets/ids-2018.html, then fill out the configuration in `cade/config.py`.

## 3. Usage

The program accepts a number of command line arguments:

```bash
$ python main.py -h
usage: main.py [-h] [--data DATA] [-c {mlp,rf}] [--stage {detect,explanation}]
               [--pure-ae {0,1}] [--quiet {0,1}] [--cae-hidden CAE_HIDDEN]
               [--cae-batch-size CAE_BATCH_SIZE] [--cae-lr CAE_LR]
               [--cae-epochs CAE_EPOCHS] [--cae-lambda-1 CAE_LAMBDA_1]
               [--similar-ratio SIMILAR_RATIO] [--margin MARGIN]
               [--display-interval DISPLAY_INTERVAL]
               [--mad-threshold MAD_THRESHOLD]
               [--exp-method {distance_mm1,approximation_loose}]
               [--exp-lambda-1 EXP_LAMBDA_1] [--mlp-retrain {0,1}]
               [--mlp-hidden MLP_HIDDEN] [--mlp-batch-size MLP_BATCH_SIZE]
               [--mlp-lr MLP_LR] [--mlp-epochs MLP_EPOCHS]
               [--mlp-dropout MLP_DROPOUT] [--newfamily-label NEWFAMILY_LABEL]
               [--tree TREE] [--rf-retrain {0,1}]
```

See `cade/utils.py` or run `python main.py -h` for detailed help.
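The `--mad-threshold` flag listed above points to a median absolute deviation (MAD) test for deciding when a sample is far enough from every known family to count as drifting. The following is only a minimal sketch of a generic MAD-based outlier rule, not the repository's implementation: the function names, the toy distances, the 1.4826 scaling constant, and the 3.5 cutoff are all assumptions made for illustration.

```python
from statistics import median

# Assumption: the usual consistency constant that makes MAD comparable
# to a standard deviation for normally distributed data.
MAD_SCALE = 1.4826


def mad_score(distance, reference_distances):
    """Return |distance - median| / MAD relative to one family's distances."""
    center = median(reference_distances)
    mad = MAD_SCALE * median(abs(d - center) for d in reference_distances)
    if mad == 0:
        raise ValueError("MAD is zero; reference distances are constant")
    return abs(distance - center) / mad


def is_drifting(distance_per_family, reference, threshold=3.5):
    """Flag a sample whose smallest per-family MAD score exceeds the threshold.

    distance_per_family: {family: distance from the sample to that family}
    reference:           {family: distances of known samples of that family}
    threshold:           assumed cutoff, analogous to --mad-threshold
    """
    scores = [mad_score(distance_per_family[f], reference[f]) for f in reference]
    return min(scores) > threshold


if __name__ == "__main__":
    # Toy distances, invented purely for illustration.
    reference = {
        "family_a": [1.0, 1.2, 0.8, 1.1, 0.9],
        "family_b": [2.0, 2.2, 1.8, 2.1, 1.9],
    }
    print(is_drifting({"family_a": 1.05, "family_b": 6.0}, reference))
    print(is_drifting({"family_a": 5.0, "family_b": 6.0}, reference))
```

Taking the minimum score means a sample is flagged only when it is an outlier with respect to every known family; a sample that sits comfortably inside any one family is treated as in-distribution.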
You may also check `run_drebin_cade.sh` for several examples.

## 4. Examples

### 4.1 Drift detection

1. To get the detection performance of CADE on the Drebin dataset (iteratively choose one family from 8 families as the unseen family):

    ```bash
    ./run_drebin_cade.sh
    # After the shell script finishes running
    python -u average_all_detection_results.py drebin 0  # 0 means using CADE, while 1 means using Vanilla AE
    ```

2. To get the detection performance of CADE on the IDS2018 dataset (iteratively choose one family from 3 families as the unseen family):

    ```bash
    ./run_ids_cade.sh
    # After the shell script finishes running
    python -u average_all_detection_results.py IDS 0
    ```

3. To get the detection performance of Vanilla Autoencoder on the Drebin dataset:

    ```bash
    ./run_drebin_pure_ae.sh
    # After the shell script finishes running
    python -u average_all_detection_results.py drebin 1
    ```

4. To get the detection performance of Vanilla Autoencoder on the IDS2018 dataset:

    ```bash
    ./run_ids_pure_ae.sh
    # After the shell script finishes running
    python -u average_all_detection_results.py IDS 1
    ```

### 4.2 Drift explanation

1. CADE explaining drift samples on the Drebin-FakeDoc setting (i.e., drebin_new_7):

    ```bash
    ./run_cade_exp_drebin_fakedoc.sh
    # It will generate reports/drebin_new_7/mask_distance_mm1_0.001.npz,
    # which is already provided.
    # This step is time-consuming and non-deterministic,
    # so we include the explanation output to save reproduction time and ease comparison.
    ```

2. CADE explaining drift samples on the IDS2018-Infiltration setting:

    ```bash
    ./run_cade_exp_ids_infiltration.sh
    # It will generate reports/IDS_new_Infilteration/mask_distance_mm1_0.001.npz,
    # which is already provided.
    ```

3. Boundary-based explanation on the Drebin-FakeDoc setting:

    ```bash
    ./run_boundary_exp_drebin_fakedoc.sh
    # It will generate reports/drebin_new_7/mask_approximation_loose_0.001.npz,
    # which is already provided.
    ```

4. Boundary-based explanation on the IDS2018-Infiltration setting:

    ```bash
    ./run_boundary_exp_ids_infiltration.sh
    # It will generate reports/IDS_new_Infilteration/mask_approximation_loose_0.001.npz,
    # which is already provided.
    ```

5. Compare CADE with boundary-based explanation and random explanation (using distance as the evaluation metric):

    1. Drebin-FakeDoc

        ```bash
        # 1. To get original distance and CADE distance
        python -u evaluate_explanation_by_distance.py drebin_new_7 distance_mm1 0.001 1 0.1

        # 2. To get random explanation distance
        python -u evaluate_explanation_by_distance.py drebin_new_7 random 0.001 0 0.1
        # since the random explanation is run 100 times, the output may differ slightly between runs.

        # 3. To get boundary-based explanation distance
        python -u evaluate_explanation_by_distance.py drebin_new_7 approximation_loose 0.001 0 0.1

        # 4. To get gradient-based explanation distance
        nohup python -u evaluate_explanation_by_distance.py drebin_new_7 gradient 0.001 0 0.1 \
            > logs/nohup-drebin_new_7-gradient-exp.log &
        ```

    2. IDS2018-Infiltration

        ```bash
        # 1. To get original distance and CADE distance
        nohup python -u evaluate_explanation_by_distance.py IDS_new_Infilteration distance_mm1 \
            0.001 1 0.1 > logs/nohup-IDS-distance-mm1-exp.log &

        # 2. To get random explanation distance
        nohup python -u evaluate_explanation_by_distance.py IDS_new_Infilteration random \
            0.001 0 0.1 > logs/nohup-IDS-random-exp.log &
        # since the random explanation is run 100 times, the output may differ slightly between runs.

        # 3. To get boundary-based explanation distance
        nohup python -u evaluate_explanation_by_distance.py IDS_new_Infilteration \
            approximation_loose 0.001 0 0.1 > logs/nohup-IDS-boundary-exp.log &

        # 4. To get gradient-based explanation distance
        nohup python -u evaluate_explanation_by_distance.py IDS_new_Infilteration gradient \
            0.001 0 0.1 > logs/nohup-IDS-gradient-exp.log &
        ```

## 5. Contact

If you have any questions, please contact Limin (liminy2@illinois.edu).

## 6. Licensing

For ethical considerations, the code and data are covered by a modified BSD 3-Clause License, which restricts the use of the code to academic purposes and specifically prohibits commercial applications.

> Any redistribution or use of this software must be limited to the purposes of non-commercial scientific research or non-commercial education. Any other use, in particular any use for commercial purposes, is prohibited. This includes, without limitation, incorporation in a commercial product, use in a commercial service, or production of other artefacts for commercial purposes.