# NCKG

**Repository Path**: ljyyyy1/NCKG

## Basic Information

- **Project Name**: NCKG
- **Description**: Replication package of NCKG
- **License**: Not specified
- **Default Branch**: master
- **Created**: 2026-01-19
- **Last Updated**: 2026-02-05

## README

## **Repairing DNN Numerical Anomalies with Semantic-Driven Knowledge Graph Retrieval**

Replication package for "**Repairing DNN Numerical Anomalies with Semantic-Driven Knowledge Graph Retrieval**".

## 1 Quick Start

### 1.1 Data Preprocessing

deepstability dataset:

- Filter: keep only the rows whose "Old Solution" and "New Solution" columns are not NaN, using ```parse_deepstability.filter_csv_advanced```.
- Parsing: convert the .csv file to a .jsonl file with the ```build``` method in ```search_bm25_pure.py```.
- Dataset split: ```train_test_split_by_patch_type.py```.

---

### 1.2 The Proposed Method

#### 1.2.1 NCKG

An example of how to build the NCKG retrieval knowledge base is given in ```example.py``` (NCKG folder). The graph index is built with

```python
knowledge_base.build_graph_index()
```

Afterwards, retrieval can be performed by symptom, by component, or by graph.

```python
nan_pairs = knowledge_base.search_by_symptom("NaN")  # by symptom
linear_pairs = knowledge_base.search_by_component("Linear")  # by component
results = knowledge_base.graph_based_search(query_graph, subgraph_type='subgraph2', top_k=5)  # by graph
```

#### 1.2.2 Overall Pipeline

The overall retrieval and generation pipeline is implemented in ```repair_new_original.py```.

```markdown
key hyperparameters
"--model_name": name of generation model (gpt-neo-125m/gpt-neo-1.3B/gpt-neo-2.7B/phi-2/CodeLlama-7b-hf/deepseek-coder-6.7b-instruct)
"--retrieval_way": retrieval name (NCKG/BM25/DPR)
"--folder": output folder
```

Remember to update the ```beir``` folder in BM25 and the default values in ```dpr_main``` when changing the dataset.

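The flags listed above can be pictured with a small `argparse` sketch. This is not the actual parser in ```repair_new_original.py```, which is not shown in this README; `build_parser`, the `choices` restrictions, and the `--dataset` default are assumptions made for illustration (the `--dataset` flag itself appears in the BM25 command in section 1.3.1).

```python
import argparse

# Values copied from the hyperparameter list above; the real script may accept others.
MODEL_NAMES = [
    "gpt-neo-125m", "gpt-neo-1.3B", "gpt-neo-2.7B",
    "phi-2", "CodeLlama-7b-hf", "deepseek-coder-6.7b-instruct",
]
RETRIEVAL_WAYS = ["NCKG", "BM25", "DPR"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Retrieval + generation pipeline (sketch)")
    parser.add_argument("--model_name", choices=MODEL_NAMES, required=True,
                        help="name of the generation model")
    parser.add_argument("--retrieval_way", choices=RETRIEVAL_WAYS, required=True,
                        help="retrieval method")
    parser.add_argument("--folder", required=True, help="output folder")
    # Assumption: the default value is a guess, not taken from the repository.
    parser.add_argument("--dataset", default="deepstability", help="dataset name")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--model_name", "phi-2", "--retrieval_way", "NCKG", "--folder", "Results/NCKG/demo"]
)
print(args.model_name, args.retrieval_way, args.folder, args.dataset)
```

Restricting `choices` makes a misspelled model or retrieval name fail at parse time rather than later in the pipeline; whether the real script does this is not stated here.
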

---

### 1.3 Baseline Models

#### 1.3.1 BM25 Algorithm Implementation

**Build corpus**: the corpus is built by ```search_bm25_pure.py```. Then run ```search_bm25_pure.py``` to execute the BM25 algorithm:

```commandline
python search_bm25_pure.py
```

**Retrieval + generation**:

```commandline
python repair_new_original.py --model_name gpt-neo-125m --dataset deepstability --retrieval_way BM25 --folder Results/BM25/PATH
```

Supported values for ```model_name```: gpt-neo-125m, gpt-neo-1.3B, gpt-neo-2.7B, phi-2, CodeLlama-7b-hf, deepseek-coder-6.7b-instruct

#### 1.3.2 DPR Algorithm Implementation

First, train a dense retriever using the config file ```train_dpr_retriever.yaml```:

```commandline
python Train_dpr_retriever.py
```

Then use the trained dense retriever to embed the fixed_function of each bug in the corpus into vectors, forming a vector retrieval database:

```commandline
python fix2embedding.py --fix_path [retrieval corpus path] --pretrained_model_path [dense retriever] --output_dir [embedding corpus path]
```

For example, execute the following line to generate the **embedding corpus**:

```commandline
python fix2embedding.py --fix_path ../Dataset/deepstability/train_set.csv --pretrained_model_path ./wandb/run-20251215_185823-z5ha1otr/files/step-108/fix_encoder/ --output_dir ./corpus/
```

Finally, run ```search_dpr.py``` to execute the DPR algorithm:

```commandline
python search_dpr.py --retrieval_fix_path [retrieval corpus path] --bug_str [string to be retrieved] --top_k [top_k] --embedding_dir [embedding corpus path] --pretrained_model_path [dense retriever]
```

---

### 1.4 Evaluation

#### 1.4.1 Evaluation of the Retrieval Results

Run ```evaluation_retrieval.py```:

```commandline
python evaluation_retrieval.py
```

Remember to update ```retrieval_way``` and ```retrieval_file``` in ```evaluation_retrieval.py```.

#### 1.4.2 Evaluation of the Generation Results

First, run ```evaluation_generation_new.py```:
```commandline
python evaluation_generation_new.py
```

Remember to update ```retrieval_list```, ```model_list``` and ```random_seed```.

The generation result is in JSON format and is produced by an LLM-based tool, so it must be parsed before the metrics can be calculated:

```commandline
python parse_evaluation_result.py
```

The experimental results can be plotted with ```generation_result_plot.py``` (for RQ2) and ```ablation_result_plot.py``` (for RQ3). Note that the final results are calculated outside these scripts and must be pasted back into them to generate the plots.

---

## Project Structure

```markdown
NCKG
├── BM25                                  # retrieval way: BM25
│   ├── beir                              # corpus of total pipeline (--temp_path in bm25_main)
│   └── search_bm25_pure.py               # BM25
│
├── Dataset
│   └── deepstability                     # deepstability dataset
│       └──                               # example
│
├── DPR
│   ├── corpus                            # save path for running 'fix2embedding.py'
│   ├── utils
│   ├── wandb                             # save path for running 'Train_dpr_retriever.py'
│   ├── fix2embedding.py                  # run 'fix2embedding.py' to get embedding
│   ├── search_dpr.py                     # DPR
│   ├── Train_dpr_retriever.py            # run 'Train_dpr_retriever.py' to get pretrained model
│   └── train_dpr_retriever.yaml          # config file for 'Train_dpr_retriever.py'
│
├── Evaluation
│   ├── ablation_result_plot.py
│   ├── evaluation_generation_new.py      # generate result for generation task
│   ├── evaluation_retrieval.py           # generate result for retrieval task
│   ├── generation_result_plot.py
│   └── parse_evaluation_result.py
│
├── NCKG
│   ├── build_graph.py                    # build knowledge base and graph index for the corpus
│   ├── example.py
│   ├── fuzzy_matcher.py
│   ├── search_NCKG.py                    # NCKG
│   ├── semantic_concept.py               # definition of concept semantics
│   └── utils.py
│
├── Results
│   ├── (deepstability)
│   ├── BM25                              # result of BM25
│   │   ├── test-original_retrieval       # retrieval result (rs:42/24/123456)
│   │   ├── test-original_xx
│   │
│   ├── DPR                               # result of DPR
│   │   ├── test-original_retrieval       # retrieval result (rs:42/24/123456)
│   │   ├── test-original_xx
│   │
│   └── NCKG                              # result of NCKG
│       ├── test-original_retrieval       # retrieval result (rs:42/24/123456) 'hybrid'
│       ├── test-original_retrieval_graph # retrieval result 'graph-only'
│       ├── test-original_retrieval_vector # retrieval result 'vector-only'
│       ├── test-original_xx
│
├── utils
│   ├── LLM_utils.py
│   ├── model_original.py                 # RepairModel
│   ├── parse_deepstability.py
│   ├── prompt_utils.py                   # prompt template
│   └── utils.py                          # set_seed, get_unified_diff
│
├── repair_new_original.py                # using model_original.py
└── README.md
```
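
The tree lists a `get_unified_diff` helper in ```utils/utils.py``` without showing it. As a rough illustration of what such a helper might do, the sketch below builds a unified diff between a buggy and a repaired snippet with the standard library's `difflib`. The function name `get_unified_diff_sketch`, its signature, the file labels, and the sample snippets are all hypothetical; the repository's own implementation may differ.

```python
import difflib


def get_unified_diff_sketch(old_code: str, new_code: str, n_context: int = 3) -> str:
    """Return a unified diff between two code strings (illustrative only)."""
    diff_lines = difflib.unified_diff(
        old_code.splitlines(keepends=True),
        new_code.splitlines(keepends=True),
        fromfile="old_solution",  # hypothetical labels, loosely echoing the dataset columns
        tofile="new_solution",
        n=n_context,              # number of unchanged context lines around each change
    )
    return "".join(diff_lines)


# Made-up example: guarding a logarithm against zero input.
buggy = "y = torch.log(x)\nreturn y\n"
fixed = "y = torch.log(x + 1e-8)\nreturn y\n"
diff = get_unified_diff_sketch(buggy, fixed)
print(diff)
```

The snippets are only compared as text, so the sketch runs without PyTorch installed.
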