# NCKG

**Repository Path**: ljyyyy1/NCKG

## Basic Information

- **Project Name**: NCKG
- **Description**: Replication package of NCKG
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-01-19
- **Last Updated**: 2026-02-05

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

## **Repairing DNN Numerical Anomalies with Semantic-Driven Knowledge Graph Retrieval**

Replication package for "**Repairing DNN Numerical Anomalies with Semantic-Driven Knowledge Graph Retrieval**".

## 1 Quick Start

### 1.1 Data Preprocessing
deepstability dataset:  
Keep only rows whose "Old Solution" and "New Solution" columns are not NaN: ```parse_deepstability.filter_csv_advanced```  
Parse the .csv file into a .jsonl file: the ```build``` method in ```search_bm25_pure.py```  
Split the dataset: ```train_test_split_by_patch_type.py```
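
The filtering and conversion steps can be sketched with the standard library alone. This is a minimal illustration, not the repository's `filter_csv_advanced` or `build` implementation: the helper names `is_missing`, `filter_rows` and `rows_to_jsonl`, the sample rows, and the choice to treat blank strings and the literal `nan` as missing are all assumptions.

```python
import csv
import io
import json

REQUIRED_COLUMNS = ("Old Solution", "New Solution")  # column names from the step above


def is_missing(value):
    """Treat None, blank strings, and the literal 'nan' as missing (assumption)."""
    return value is None or value.strip() == "" or value.strip().lower() == "nan"


def filter_rows(csv_text, required=REQUIRED_COLUMNS):
    """Keep only rows where every required column has a value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if not any(is_missing(row.get(col)) for col in required)]


def rows_to_jsonl(rows):
    """Serialize each row as one JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


# Made-up sample data: only the first row has both solutions filled in.
sample = (
    "Old Solution,New Solution\n"
    "x = log(p),x = log(p + 1e-8)\n"
    "y = exp(z),\n"
    "nan,y = z.clamp(max=50).exp()\n"
)
kept = filter_rows(sample)
jsonl = rows_to_jsonl(kept)
print(jsonl)
```

Reading from a string keeps the sketch self-contained; for a real file, pass the file's text (or adapt `filter_rows` to take an open file handle).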

---  
### 1.2 The Proposed Method
#### 1.2.1 NCKG
```example.py``` (in the NCKG folder) shows how to build the NCKG retrieval knowledge base.
The graph index is built using:
```python
knowledge_base.build_graph_index()
```
Afterwards, retrieval can be performed by symptom, component, or graph:
```python
nan_pairs = knowledge_base.search_by_symptom("NaN") # by symptom
linear_pairs = knowledge_base.search_by_component("Linear") # by component
results = knowledge_base.graph_based_search(query_graph, subgraph_type='subgraph2', top_k=5) # by graph
```
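
To make the symptom and component lookups concrete, here is a toy stand-in built on two inverted indexes. `ToyKnowledgeBase`, `add_pair`, and the sample pairs are hypothetical and are not the repository's knowledge base; only the method names `search_by_symptom` and `search_by_component` mirror the calls above, and graph-based search is deliberately left out.

```python
from collections import defaultdict


class ToyKnowledgeBase:
    """Hypothetical stand-in; not the repository's knowledge base class."""

    def __init__(self):
        self.pairs = []                        # list of (buggy, fixed) code pairs
        self.by_symptom = defaultdict(list)    # symptom -> indices into self.pairs
        self.by_component = defaultdict(list)  # component -> indices into self.pairs

    def add_pair(self, buggy, fixed, symptoms, components):
        """Store a pair and register it under each symptom and component."""
        idx = len(self.pairs)
        self.pairs.append((buggy, fixed))
        for symptom in symptoms:
            self.by_symptom[symptom.lower()].append(idx)
        for component in components:
            self.by_component[component.lower()].append(idx)

    def search_by_symptom(self, symptom):
        return [self.pairs[i] for i in self.by_symptom.get(symptom.lower(), [])]

    def search_by_component(self, component):
        return [self.pairs[i] for i in self.by_component.get(component.lower(), [])]


kb = ToyKnowledgeBase()
kb.add_pair("torch.log(p)", "torch.log(p + 1e-8)", ["NaN"], ["log"])
kb.add_pair(
    "nn.Linear(d, d)(x).exp()",
    "nn.Linear(d, d)(x).clamp(max=50).exp()",
    ["Inf"],
    ["Linear", "exp"],
)
nan_pairs = kb.search_by_symptom("NaN")
linear_pairs = kb.search_by_component("Linear")
```

Keys are lower-cased so that `"NaN"` and `"nan"` resolve to the same entries; whether the real knowledge base normalizes case is not stated in this README.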

#### 1.2.2 Overall Pipeline
The overall retrieval and generation pipeline is implemented in ```repair_new_original.py```
```markdown
Key hyperparameters:
"--model_name": name of generation model  (gpt-neo-125m/gpt-neo-1.3B/gpt-neo-2.7B/phi-2/CodeLlama-7b-hf/deepseek-coder-6.7b-instruct)
"--retrieval_way": retrieval name  (NCKG/BM25/DPR)
"--folder": output folder
```
Remember to update the ```beir``` folder in BM25 and the default values in ```dpr_main``` when changing the dataset.
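
As an illustration of how these options fit together, the sketch below declares them with `argparse`. It is not taken from `repair_new_original.py`: the defaults, the use of `choices`, and the `build_parser` helper are assumptions, and `--dataset` is included only because it appears in the BM25 command later in this README.

```python
import argparse

# Model and retrieval names as listed in this README.
MODEL_CHOICES = [
    "gpt-neo-125m", "gpt-neo-1.3B", "gpt-neo-2.7B",
    "phi-2", "CodeLlama-7b-hf", "deepseek-coder-6.7b-instruct",
]
RETRIEVAL_CHOICES = ["NCKG", "BM25", "DPR"]


def build_parser():
    """Hypothetical parser; defaults below are assumptions, not the script's."""
    parser = argparse.ArgumentParser(description="Retrieval + generation pipeline (sketch)")
    parser.add_argument("--model_name", choices=MODEL_CHOICES, default="gpt-neo-125m",
                        help="name of the generation model")
    parser.add_argument("--retrieval_way", choices=RETRIEVAL_CHOICES, default="NCKG",
                        help="retrieval method")
    parser.add_argument("--dataset", default="deepstability", help="dataset name")
    parser.add_argument("--folder", required=True, help="output folder")
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(
    ["--model_name", "phi-2", "--retrieval_way", "BM25", "--folder", "Results/BM25/demo"]
)
print(args.model_name, args.retrieval_way, args.dataset, args.folder)
```

Restricting `--model_name` and `--retrieval_way` with `choices` makes a mistyped name fail at parse time instead of partway through a run.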


---
### 1.3 Baseline Models
#### 1.3.1 BM25 Algorithm Implementation
**Build Corpus**: ```search_bm25_pure.py```.  
Then run ```search_bm25_pure.py``` to execute the BM25 algorithm:
```commandline
python search_bm25_pure.py
```
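
For readers unfamiliar with BM25, the following is a compact sketch of Okapi BM25 scoring in pure Python. It is not the code in `search_bm25_pure.py`: the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the three-sentence corpus are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case whitespace tokenizer (assumption; real code may tokenize differently)."""
    return text.lower().split()


class BM25:
    def __init__(self, documents, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in documents]
        self.term_freqs = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        # Document frequency: number of documents containing each term.
        doc_freq = Counter(t for d in self.docs for t in set(d))
        n = len(self.docs)
        # Smoothed IDF that stays non-negative for very common terms.
        self.idf = {t: math.log((n - df + 0.5) / (df + 0.5) + 1.0) for t, df in doc_freq.items()}

    def score(self, query, index):
        """BM25 score of document `index` for `query`."""
        tf, dl = self.term_freqs[index], len(self.docs[index])
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue  # term absent from this document contributes nothing
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=5):
        """Indices of the k highest-scoring documents."""
        ranked = sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]


corpus = [
    "add epsilon inside log to avoid nan",
    "clamp logits before exp to avoid overflow",
    "replace softmax followed by log with log_softmax",
]
bm25 = BM25(corpus)
best = bm25.top_k("nan in log", k=1)
```
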
**Retrieval + generation**:
```commandline
python repair_new_original.py --model_name gpt-neo-125m --dataset deepstability --retrieval_way BM25 --folder Results/BM25/PATH
```
```model_name```: gpt-neo-125m, gpt-neo-1.3B, gpt-neo-2.7B, phi-2, CodeLlama-7b-hf, deepseek-coder-6.7b-instruct

 
#### 1.3.2 DPR Algorithm Implementation
First, train a dense retriever using the config file ```train_dpr_retriever.yaml```:
```commandline
python Train_dpr_retriever.py
```
Then use the trained dense retriever to embed the fixed_function of each bug in the corpus into vectors, forming a vector retrieval database:
```commandline
python fix2embedding.py --fix_path [retrieval corpus path] --pretrained_model_path [dense retriever] --output_dir [embedding corpus path]
```
For example, execute the following line to generate the **embedding corpus**:
```commandline
python fix2embedding.py --fix_path ../Dataset/deepstability/train_set.csv --pretrained_model_path ./wandb/run-20251215_185823-z5ha1otr/files/step-108/fix_encoder/ --output_dir ./corpus/
```
Finally, run ```search_dpr.py``` to execute the DPR algorithm:
```commandline
python search_dpr.py --retrieval_fix_path [retrieval corpus path] --bug_str [string to be retrieved] --top_k [top_k] --embedding_dir [embedding corpus path] --pretrained_model_path [dense retriever]
```
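
Conceptually, the search step compares the embedded query against the embedding corpus and returns the closest entries. The sketch below ranks by inner product, which is the usual scoring choice for DPR-style retrievers; whether `search_dpr.py` does exactly this is an assumption, and the helper names and toy 3-dimensional vectors are made up for illustration.

```python
import heapq


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_by_inner_product(query_vec, corpus_vecs, k=5):
    """Return (index, score) pairs for the k highest-scoring corpus vectors."""
    scored = ((i, dot(query_vec, vec)) for i, vec in enumerate(corpus_vecs))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy embeddings standing in for encoder outputs (made-up values).
corpus_vecs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
]
query_vec = [1.0, 0.0, 0.0]
hits = top_k_by_inner_product(query_vec, corpus_vecs, k=2)
```

With a large corpus, the same ranking is typically delegated to a vector index instead of a linear scan.
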
---
### 1.4 Evaluation
#### 1.4.1 Evaluation of the Retrieval Results
Run ```evaluation_retrieval.py```:  
```commandline
python evaluation_retrieval.py
```
Remember to update the ```retrieval_way``` and ```retrieval_file``` in ```evaluation_retrieval.py```.  

#### 1.4.2 Evaluation of the Generation Results
First, run ```evaluation_generation_new.py```:
```commandline
python evaluation_generation_new.py
```
Remember to update the ```retrieval_list```, ```model_list``` and ```random_seed```.  
The generation results are in JSON format, produced by an LLM-based tool, so they need to be parsed before the subsequent calculations:
```commandline
python parse_evaluation_result.py
```
The experimental results can be plotted using ```generation_result_plot.py``` (for RQ2) and ```ablation_result_plot.py``` (for RQ3). Note that the final results are calculated outside these scripts and must be pasted back into them to generate the plots.


---
## Project Structure
```markdown
NCKG
├── BM25                            # retrieval way: BM25
│   ├── beir                        # corpus of total pipeline (--temp_path in bm25_main)
│   └──search_bm25_pure.py          # BM25
│
├── Dataset
│   └── deepstability               # deepstability dataset
│       └──                         # example
│
├── DPR
│   ├── corpus                      # save path for running 'fix2embedding.py' 
│   ├── utils 
│   ├── wandb                       # save path for running 'Train_dpr_retriever.py'
│   ├──fix2embedding.py             # run 'fix2embedding.py' to get embedding
│   ├──search_dpr.py                # DPR
│   ├──Train_dpr_retriever.py       # run 'Train_dpr_retriever.py' to get pretrained model
│   └──train_dpr_retriever.yaml     # config file for 'Train_dpr_retriever.py'
│
├── Evaluation
│   ├──ablation_result_plot.py      
│   ├──evaluation_generation_new.py # generate result for generation task
│   ├──evaluation_retrieval.py      # generate result for retrieval task
│   ├──generation_result_plot.py              
│   └──parse_evaluation_result.py     
│
├── NCKG
│   ├──build_graph.py               # build knowledge base and graph index for the corpus 
│   ├──example.py                   
│   ├──fuzzy_matcher.py             
│   ├──search_NCKG.py               # NCKG
│   ├──semantic_concept.py          # definition of concept semantics
│   └──utils.py                     
│
├── Results
│   ├── (deepstability)
│   ├── BM25                        # result of BM25
│   │   ├── test-original_retrieval # retrieval result (rs:42/24/123456)
│   │   ├── test-original_xx
│   │   
│   ├── DPR                         # result of DPR
│   │   ├── test-original_retrieval # retrieval result (rs:42/24/123456)
│   │   ├── test-original_xx
│   │   
│   └── NCKG                        # result of NCKG
│       ├── test-original_retrieval # retrieval result (rs:42/24/123456) 'hybrid'
│       ├── test-original_retrieval_graph  # retrieval result  'graph-only'
│       ├── test-original_retrieval_vector # retrieval result  'vector-only'
│       ├── test-original_xx
│
├── utils
│   ├──LLM_utils.py
│   ├──model_original.py            # RepairModel
│   ├──parse_deepstability.py       
│   ├──prompt_utils.py              # prompt template
│   └──utils.py                     # set_seed, get_unified_diff
│
├──repair_new_original.py           # using model_original.py
└──README.md

```
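
The tree lists `set_seed` and `get_unified_diff` as helpers in `utils/utils.py`. A minimal sketch of what such helpers could look like with the standard library is shown below; the signatures, the `buggy`/`fixed` labels, and the decision to seed only Python's `random` module are assumptions, not the repository's code.

```python
import difflib
import random


def set_seed(seed):
    """Seed Python's RNG; a full pipeline would likely also seed NumPy and PyTorch."""
    random.seed(seed)


def get_unified_diff(old_code, new_code, old_name="buggy", new_name="fixed"):
    """Return a unified diff between two code strings."""
    old_lines = old_code.splitlines(keepends=True)
    new_lines = new_code.splitlines(keepends=True)
    return "".join(difflib.unified_diff(old_lines, new_lines, fromfile=old_name, tofile=new_name))


set_seed(42)  # one of the seeds noted in the Results tree (42/24/123456)
diff = get_unified_diff("y = torch.log(p)\n", "y = torch.log(p + 1e-8)\n")
print(diff)
```

A unified diff keeps the comparison between a buggy function and its repair compact, since unchanged lines appear only as context.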