# LibDB
**Repository Path**: mirrors_promeG/LibDB
## Basic Information
- **Project Name**: LibDB
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-06-16
- **Last Updated**: 2026-01-24
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# binary_tpl_detection
Dataset URL: https://figshare.com/s/4a007e78f29243531b8c
## Feature Extractor
- The extractor extracts features from all binary files under a given directory and saves them to a JSON file.
- Input: directory
- Output: two files, stored in a given target directory.
- Information such as running time is stored in the `status` file.
- Extracted features are stored in the features file, e.g. `9760608.json`. This JSON is a list of BinaryFile entities.
- It is recommended to put your task code under the `consumer` directory (`featureExtractor/bcat_client/src/main/java/thusca/bcat/client/consumer`). See the example in `consumer/BinFileFeatureExtractTest.java`.
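Assuming the features file is plain JSON, it can be inspected with a short script like the one below. The field names (`fileName`, `functions`) are illustrative placeholders, not the actual BinaryFile schema:

```python
import json

# Hypothetical sketch: summarize an extracted features file such as 9760608.json.
# Field names below are assumptions, not the real BinaryFile schema.
def summarize_features(path):
    with open(path) as f:
        binary_files = json.load(f)  # the file holds a list of BinaryFile entities
    return [(b.get("fileName", "?"), len(b.get("functions", []))) for b in binary_files]

# Build a tiny stand-in file so the snippet runs without the extractor.
sample = [{"fileName": "libfoo.so", "functions": [{"name": "f1"}, {"name": "f2"}]}]
with open("sample_features.json", "w") as f:
    json.dump(sample, f)

print(summarize_features("sample_features.json"))  # [('libfoo.so', 2)]
```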
### Pre-requisites
Basic knowledge of Java development, Spring Boot, and annotation-based development.
For example, if you use an IDE such as VS Code or IntelliJ IDEA, a basic Java development environment needs to be installed (e.g. the `Java Extension Pack` and `Maven for Java` extensions). Note that the code uses Lombok annotations and Spring Boot, so the `Lombok Annotations Support` and `Spring Boot Tools` IDE extensions may be needed to debug or run it.
### Build Artifact
Env:
- Java 11.
- IntelliJ IDEA. (We have found that the extractor artifact builds reliably only under IntelliJ IDEA; tested successfully with IntelliJ IDEA 2021.2 on Windows.)
Steps:
1. Ghidra 9.1.2: the file `ghidra.jar` is stored under `/user/lib/ghidra.jar`; copy it to `/featureExtractor/bcat_client/lib` first.
2. Open IDEA and open the project `binary_lib_detection-main\featureExtractor`. Wait until indexing finishes; if an error occurs, try reopening/cleaning the project.
3. File -> Project Structure -> Project SDK: select Java SDK 11.
4. File -> Project Structure -> Artifacts -> "+" -> JAR -> From modules with dependencies -> Module (`bcat_client`) -> Main Class (`ClientApplication`) -> JAR files from libraries: select `copy to the output directory and link via manifest`.
5. The JARs are generated under `featureExtractor\out\artifacts\bcat_client_jar`, with `bcat_client.jar` inside.
### Task
Methods for all tasks are stored under the directory `/consumer`.
Building the database: Code: `Task2ExtractCoreFedora.java`; Data: `FedoraLib_Dataset`. Set the save path and extract all features to build the TPL feature database. We use the directory `../data/CoreFedoraFeatureJson0505` to represent the save path.
### Run
Zip the bcat_client_jar folder and upload to a Linux server, unzip, and run:
```shell
java -jar bcat_client.jar
```
Note: Java 11 required.
## Function Similarity Model
This model is used to determine if two functions are similar based on [Gemini](https://github.com/xiaojunxu/dnn-binary-code-similarity) Network.
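Gemini-style models embed each function's attributed control-flow graph into a vector and compare vectors by cosine similarity. The following is a minimal numpy sketch of that idea with fixed random weights; it is an illustration of structure2vec-style propagation, not the trained model in this repository:

```python
import numpy as np

# Minimal sketch of a Gemini-style graph embedding, assuming each function is an
# attributed CFG: node feature matrix X (n x d) and adjacency matrix A (n x n).
# Weights are fixed random matrices for illustration only.
def embed(X, A, W1, W2, iterations=3):
    mu = np.zeros_like(X @ W1)              # per-node embeddings, shape (n, p)
    for _ in range(iterations):
        mu = np.tanh(X @ W1 + A @ mu @ W2)  # aggregate neighbor embeddings
    return mu.sum(axis=0)                   # graph embedding = sum of node embeddings

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 8))
X = rng.normal(size=(3, 4))                 # 3 basic blocks, 4 features each
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)

g1 = embed(X, A, W1, W2)
g2 = embed(X + 0.01, A, W1, W2)             # a slightly perturbed "same" function
print(cosine(g1, g2))                       # near 1 for near-identical CFGs
```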
### Preparation and Data
Data is stored in `../data/vector_deduplicate_gemini_format_less_compilation_cases`, or in `Cross-5C_Dataset.7z` on figshare.
By default, we use the path `../data` under `main/torch` to store the data; please copy the files there.
### Environment Setup
The network is written in Python 3.8 using Torch 1.8; the Torch installation targets CUDA 11.
```shell
conda create -n tpldetection python=3.8 ipykernel
bash
conda activate tpldetection
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
pip install -r requirements.txt
```
Milvus v1.1.1 (a vector search engine) is required for function retrieval. It requires Docker 19.03 or higher.
Ref: https://milvus.io/docs/v1.1.1/milvus_docker-gpu.md
```shell
sudo docker pull milvusdb/milvus:1.1.1-gpu-d061621-330cc6
mkdir -p /home/$USER/milvus/conf
cd /home/$USER/milvus/conf
wget https://raw.githubusercontent.com/milvus-io/milvus/v1.1.1/core/conf/demo/server_config.yaml
sudo docker run -d --name milvus_gpu_1.1.1 --gpus all \
-p 19530:19530 \
-p 19121:19121 \
-v /home/$USER/milvus/db:/var/lib/milvus/db \
-v /home/$USER/milvus/conf:/var/lib/milvus/conf \
-v /home/$USER/milvus/logs:/var/lib/milvus/logs \
-v /home/$USER/milvus/wal:/var/lib/milvus/wal \
milvusdb/milvus:1.1.1-gpu-d061621-330cc6
```
### Run
Run the following command to train the model:
```shell
# train/validation dataset: /data/func_comparison/vector_deduplicate_our_format_less_compilation_cases/train_test
# test dataset: /data/func_comparison/vector_deduplicate_our_format_less_compilation_cases/valid
cd main/torch
bash run.sh
```
A trained model is saved under `../data/7fea_contra_torch_b128/saved_model/`
## Library detection
### Database
#### Embedding
Raw feature database: `../data/CoreFedoraFeatureJson0505`.
Embeddings: set the path `../data/CoreFedoraFeatureJson0505` as `args.fedora_js`.
You can use multiprocessing to speed this up; the code in `core_fedora_embeddings.py` is as follows:
```python
from multiprocessing import Pool

with Pool(10) as p:
    p.starmap(core_fedora_embedding, [(i, True) for i in range(10)])
```
All embeddings are saved under `args.save_path`. We use the path `../data/7fea_contra_torch_b128/core_funcs` to represent it.
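Conceptually, function retrieval over these embeddings is nearest-neighbor search; Milvus performs at scale what the following brute-force numpy sketch does (the names, shapes, and data here are illustrative, not the repository's actual formats):

```python
import numpy as np

# Brute-force top-k cosine retrieval over stored function embeddings.
# Milvus does the same search at scale; shapes here are illustrative.
def top_k(query, db, k=3):
    db_n = db / np.linalg.norm(db, axis=1, keepdims=True)
    q_n = query / np.linalg.norm(query)
    sims = db_n @ q_n                        # cosine similarity to every DB vector
    idx = np.argsort(-sims)[:k]
    return list(idx), sims[idx]

rng = np.random.default_rng(1)
db = rng.normal(size=(100, 64))              # 100 stored function embeddings
query = db[42] + 0.01 * rng.normal(size=64)  # near-duplicate of entry 42
idx, sims = top_k(query, db)
print(idx[0])                                # entry 42 should rank first
```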
#### Indexing and Building Milvus dataset
Run `build_milvus_database.py` to build the function vector database using Milvus.
The function `get_bin_fcg` generates an index file mapping each binary to its functions, to accelerate lookup.
`get_bin2func_num` generates an index mapping each binary to the number of functions in it.
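The two index files can be thought of as simple mappings; a sketch under assumed data shapes (the real formats produced by `get_bin_fcg` and `get_bin2func_num` may differ):

```python
import json

# Sketch of the two index files, assuming the raw data maps each binary to its
# function list. Binary and function names below are made-up examples.
features = {
    "libpng.so": ["png_read", "png_write"],
    "zlib.so": ["inflate", "deflate", "crc32"],
}

bin2funcs = {b: funcs for b, funcs in features.items()}          # binary -> functions
bin2func_num = {b: len(funcs) for b, funcs in features.items()}  # binary -> count

with open("bin2func_num.json", "w") as f:
    json.dump(bin2func_num, f)

print(bin2func_num["zlib.so"])  # 3
```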
#### Detection
Data: `detection_targets`. First, extract features from the APKs; see the method `localExtractOSSPoliceApks` in `TaskProcessTargets.java` under the `consumer` directory. We use the directory `../data/detection_targets/feature_json` to save all extracted features.
See the function `detect_v2` in `function_vector_channel`.
Other methods + FCG Filter can be seen in files `xxx_afcg.py`.
Baselines are under the directory `/related_work`.
We combine the basic feature channel (B2SFinder basic features + FCG Filter) and the function vector channel to report the final results.
All files named `analyze_results.py` are used to compute precision and recall.
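The precision/recall computation amounts to comparing detected libraries against the ground truth; a minimal sketch, assuming both are sets of TPL names (the library names below are made up for illustration):

```python
# Sketch of an analyze_results.py-style evaluation, assuming detections and
# ground truth are sets of TPL names. Example names are hypothetical.
def precision_recall(detected, ground_truth):
    tp = len(detected & ground_truth)                     # true positives
    precision = tp / len(detected) if detected else 0.0   # correct / reported
    recall = tp / len(ground_truth) if ground_truth else 0.0  # correct / actual
    return precision, recall

detected = {"zlib", "libpng", "openssl"}
truth = {"zlib", "libpng", "libjpeg", "curl"}
p, r = precision_recall(detected, truth)
print(p, r)  # 2 of 3 detections are correct; 2 of 4 true libraries are found
```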