# LF-CFG

- **Repository Path**: sammmmy/lf-cfg
- **Default Branch**: master
- **Created**: 2024-09-03
- **Last Updated**: 2024-09-03

LF-CFG is a tool that transforms labelled FindBugs reports into IR alert dataset samples.

# How to use LF-CFG

The steps below use the Java project "commons" as an example.

## Function 1: get IR alert sample

1. Get the target revisions for the training and testing data from [SA_retrospective](https://github.com/soarsmu/SA_retrospective). The training revision is the revision at 2013.01 and the testing revision is the revision at 2014.01. Download the source by visiting [c4ecd7](https://github.com/apache/commons-lang/tree/c4ecd75ecd8b78c66cc51b49dd32989a3f1cde2e). Then build the jar package and classes by compiling with `ant`.
2. Get the FindBugs XML reports from [findbugs-xml-reports](https://github.com/soarsmu/SA_retrospective/tree/main/findbugs-xml-reports). `commons-B.xml` is the report for the training revision and `commons-C.xml` is the report for the testing revision.
3. Get the labelled alerts from [data](https://github.com/soarsmu/SA_retrospective/tree/main/data); we use the data in the dedup folders. `commons_B_features.csv` holds the features and ground truth for the training revision and `commons_C_features.csv` holds them for the testing revision. The last column, "category", gives each alert's actionability: open means non-actionable and close means actionable. This column is used as the ground truth for the dataset.
4. Run `xml2warning_info_csv.py` to parse the XML report into `warning_info.csv`. LF-CFG only handles pairs that contain one method, so messages for more complex alerts may have information gaps and errors.
5. Remove duplicate and unknown alerts from `warning_info.csv` manually.
   `warning_info.csv` contains more alerts than `commons_B/C_features.csv` because it also includes duplicate and unknown alerts that are not useful for this project. Delete them by manually comparing information such as the alert serial number, error method, and error type.
6. Run `get_soot_cfg.py` to parse each alert's error file and get a Jimple representation, in dot form, of all the methods in that file. To make the files easier to manipulate in cmd, replace the spaces in the dot filenames with underscores, and make sure every filename is modified.
7. Run `soot_find_dot_by_alarm_method.py` to find the Jimple representation of each alert's error method. Once the matching dot file is found, the alert's serial number is added to the beginning of the filename. For alerts that don't match the error method, add a blank dot file.
8. Run `cfg_dot2json.py` to generate a JSON sample of each alert's features and labels. The features are labels_1 and graph_1, which represent the node and edge information of the alert error method's CFG respectively. LF-CFG ignores blank dot files here.

The JSON samples form the IR dataset, which can be fed to the alert prediction model [CFG-GNN](https://github.com/lyxx2535/CFG-GNN) to obtain a score for each alert.

## Function 2: get cfg png

Run `soot_dot2png.py` to convert dot to png, which visualizes the CFGs.

## Function 3: get source code in ast/cfg/dfg form

1. Run `get_source_code_by_alarm.py` to get each alert's error content, using the start and end lines of the error content that `warning_info.csv` records for each alert.
2. Run `java2ast_cfg_dfg.py` to get the ast/cfg/dfg form of the error content by using [COMEX](https://github.com/IBM/tree-sitter-codeviews). We can also run `soot_dot2png.py` to convert dot to png.
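
To make step 8 of Function 1 more concrete, here is a minimal sketch of how a CFG dot file could be turned into one JSON sample. It is not the code of `cfg_dot2json.py`. The function name `dot_to_sample`, the `target` key, the 0/1 encoding of the "category" column, and the assumed dot syntax (nodes written as `"0" [label="..."];`, edges as `"0"->"1";`) are all assumptions made for illustration. Only the feature names labels_1 and graph_1 come from the description above.

```python
import json
import re

# Assumed dot syntax: a node is `"0" [label="stmt"];`, an edge is `"0"->"1";`.
NODE_RE = re.compile(r'^\s*"?(\w+)"?\s*\[.*?label="((?:[^"\\]|\\.)*)"')
EDGE_RE = re.compile(r'^\s*"?(\w+)"?\s*->\s*"?(\w+)"?')


def dot_to_sample(dot_text, category):
    """Build one sample dict from the text of a CFG dot file.

    Returns None for a blank dot file, mirroring the note that blank
    files are ignored. `category` is the ground-truth string: "close"
    (actionable) or "open" (non-actionable).
    """
    node_labels = {}  # node id -> statement text, in order of appearance
    edges = []        # (source id, target id) pairs
    for line in dot_text.splitlines():
        edge = EDGE_RE.match(line)
        if edge:
            edges.append((edge.group(1), edge.group(2)))
            continue
        node = NODE_RE.match(line)
        if node:
            node_labels[node.group(1)] = node.group(2)
    if not node_labels:
        return None
    # Re-index node ids as 0..n-1 so edges refer to positions in labels_1.
    index = {node_id: i for i, node_id in enumerate(node_labels)}
    return {
        "labels_1": list(node_labels.values()),
        "graph_1": [[index[s], index[t]] for s, t in edges
                    if s in index and t in index],
        # Hypothetical label encoding: 1 = actionable, 0 = non-actionable.
        "target": 1 if category == "close" else 0,
    }


if __name__ == "__main__":
    demo = '''digraph "demo" {
        "0" [label="r0 := @this: Foo",];
        "1" [label="return",];
        "0"->"1";
    }'''
    print(json.dumps(dot_to_sample(demo, "close"), indent=2))
```

Re-indexing the nodes keeps each edge in graph_1 aligned with the position of its statements in labels_1, which is a common input layout for graph models. The layout that CFG-GNN expects may differ.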