# VDPython

**Repository Path**: shuangyinren/VDPython

## Basic Information

- **Project Name**: VDPython
- **Description**: VulDeePecker algorithm implemented in Python
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-01-09
- **Last Updated**: 2020-12-19

## README

# VDPython
VulDeePecker algorithm implemented in Python  

## VulDeePecker
* Detects exploitable code in C/C++
* Uses N-grams and deep learning with LSTMs to train a detection model
* Introduces the idea of code gadgets: groups of semantically related lines of code
  * Code gadgets are vectorized for input to the neural network
  * The training/testing set for this project includes existing code gadgets and their vulnerability classifications
* Trained on two vulnerability types (in the paper: buffer errors, CWE-119, and resource management errors, CWE-399)
* [Paper](https://arxiv.org/pdf/1801.01681)
* [GitHub](https://github.com/CGCL-codes/VulDeePecker)

## Running project
* To run the program, use this command: `python vuldeepecker.py [gadget_file]`, where `gadget_file` is one of the text files containing a gadget set
* The program has 3 parts:
  * Clean gadgets
    * Remove comments and the contents of string/character literals
    * Replace all user-defined variables and functions with VAR# and FUN#, respectively
      * The # is an integer identifying the user-defined variable/function within the gadget
      * Note: this identifier only applies within the scope of the gadget
  * Vectorize gadgets
    * Gadgets are parsed, tokenized, and transformed into vectors of embeddings
    * Vectors are normalized to a constant length through either truncation or padding
  * Train and test the neural model
    * Data is split into a training set and a testing set
    * Gadget vectors are used as input to train the neural model
    * The neural model is trained and tested, and its accuracy is reported
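
The cleaning step can be illustrated with a minimal sketch. This is not the code in `clean_gadget.py`; it is a simplified, regex-based approximation written for illustration. The keyword and library-function sets are deliberately small assumptions, numbering is assumed to start at 1, comments are assumed not to span lines, and an identifier directly followed by `(` is assumed to be a function.

```python
import re

# Assumed, deliberately incomplete sets; a real cleaner needs the full C/C++
# keyword list and a much larger list of library functions.
KEYWORDS = {"int", "char", "void", "if", "else", "for", "while", "return",
            "sizeof", "struct", "unsigned", "const", "static", "NULL"}
KNOWN_FUNCTIONS = {"main", "strcpy", "strncpy", "memcpy", "memset",
                   "malloc", "free", "printf", "strlen"}

# Literals are matched in the same pass as comments so that a "//" inside a
# string is not mistaken for the start of a comment.
LITERAL_OR_COMMENT_RE = re.compile(
    r'"(?:\\.|[^"\\\n])*"'    # string literal
    r"|'(?:\\.|[^'\\\n])*'"   # character literal
    r"|//[^\n]*"              # line comment
    r"|/\*.*?\*/"             # block comment on a single line
)
IDENT_RE = re.compile(r"\b[A-Za-z_]\w*\b")


def _strip_match(match):
    """Empty out literals (keeping the quotes) and drop comments."""
    text = match.group(0)
    if text.startswith('"'):
        return '""'
    if text.startswith("'"):
        return "''"
    return " "


def _rename_identifiers(line, var_ids, fun_ids):
    """Replace user-defined names in one line with VAR#/FUN# placeholders."""
    def rename(match):
        name = match.group(0)
        if name in KEYWORDS or name in KNOWN_FUNCTIONS:
            return name
        # Heuristic: a name followed by "(" is treated as a function.
        is_function = line[match.end():].lstrip().startswith("(")
        table, prefix = (fun_ids, "FUN") if is_function else (var_ids, "VAR")
        if name not in table:
            table[name] = f"{prefix}{len(table) + 1}"
        return table[name]
    return IDENT_RE.sub(rename, line)


def clean_gadget(lines):
    """Clean one gadget; the numbering tables are local to this gadget."""
    var_ids, fun_ids = {}, {}
    cleaned = []
    for line in lines:
        line = LITERAL_OR_COMMENT_RE.sub(_strip_match, line)
        cleaned.append(_rename_identifiers(line, var_ids, fun_ids).rstrip())
    return cleaned


example = [
    "void copy_input(char *src) { // entry point",
    "    char buf[16];",
    "    strcpy(buf, src); /* unsafe */",
    '    log_msg("copied // done");',
    "}",
]
cleaned = clean_gadget(example)
```

For the example gadget, `cleaned` holds `void FUN1(char *VAR1) {`, `    char VAR2[16];`, `    strcpy(VAR2, VAR1);`, `    FUN2("");` and `}`. Because the numbering tables are created inside `clean_gadget`, the same source name can map to a different placeholder in another gadget, which matches the note above about identifier scope.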

## Code Files
* vuldeepecker.py
  * Interface to project, uses functionality from other code files
  * Fetches each gadget, cleans it, buffers it, trains the Word2Vec model, vectorizes the gadgets, and passes the vectors to the neural net
* clean_gadget.py
  * For each gadget, replaces all user variables with "VAR#" and user functions with "FUN#"
  * Removes content from string and character literals
* vectorize_gadget.py
  * Converts gadgets into vectors
  * Tokenizes gadget (converts to symbols, operators, keywords)
  * Uses Word2Vec to convert tokens to embeddings
  * Combines token embeddings in a gadget to create 2D gadget vector
* blstm.py
  * Defines a Bidirectional Long Short-Term Memory (BLSTM) neural network for training on and predicting vulnerabilities
  * Gets gadget vectors as input
  * Implements functions for both training and testing the model
  * Uses the parameters defined in the VulDeePecker paper
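
The vectorization described for `vectorize_gadget.py` (tokenize, embed, combine into a fixed-size 2D vector) can be sketched as follows. This is an illustrative approximation, not the project's code. To stay self-contained, a hash-based `toy_embedding` function stands in for the trained Word2Vec lookup. That function, the names `tokenize` and `vectorize`, the `max_tokens` and `dim` parameters, and the choice to truncate from the end and pad with zero rows are all assumptions.

```python
import hashlib
import re

# Multi-character operators come before the single-symbol fallback so that
# "->" or "+=" is kept as one token.
TOKEN_RE = re.compile(
    r"[A-Za-z_]\w*"                                   # identifiers, keywords
    r"|\d+(?:\.\d+)?"                                 # simple numeric literals
    r"|<<=|>>=|->|\+\+|--|<<|>>|<=|>=|==|!=|&&|\|\|"  # multi-char operators
    r"|[-+*/%&|^]="                                   # compound assignment
    r"|\S"                                            # any other symbol
)


def tokenize(line):
    """Split one cleaned gadget line into symbols, operators, and keywords."""
    return TOKEN_RE.findall(line)


def toy_embedding(token, dim):
    """Hypothetical stand-in for a Word2Vec lookup (dim must be <= 32)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def vectorize(gadget_lines, max_tokens, dim=8):
    """Return a max_tokens x dim list of lists for one gadget."""
    tokens = [tok for line in gadget_lines for tok in tokenize(line)]
    # Truncation: keep at most max_tokens token embeddings.
    rows = [toy_embedding(tok, dim) for tok in tokens[:max_tokens]]
    # Padding: append zero rows until the gadget has max_tokens rows.
    rows += [[0.0] * dim for _ in range(max_tokens - len(rows))]
    return rows


gadget = ["char VAR2[16];", "strcpy(VAR2, VAR1);"]
padded = vectorize(gadget, max_tokens=20)
truncated = vectorize(gadget, max_tokens=5)
```

Both `padded` and `truncated` have exactly `max_tokens` rows of `dim` values, so every gadget reaches the network with the same shape regardless of how many tokens it contains.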