# Nemotron-CORTEXA: Enhancing LLM Agents for Software Engineering Tasks via Improved Localization and Solution Diversity

This is the official codebase for the ICML 2025 paper: [Nemotron-CORTEXA: Enhancing LLM Agents for Software Engineering Tasks via Improved Localization and Solution Diversity](https://openreview.net/forum?id=k6p8UKRdH7). Please see this [blogpost](https://research.nvidia.com/labs/adlr/cortexa/) for a high-level overview.

## Requirements

- Python 3.11 or higher

## Installation

Download the repository and install it with

```
pip install -e .
```

## Localization Stage

Code localization proceeds in two steps: file localization and entity localization.

### File Localization

We have developed NV-EmbedCode, a code embedding model that specializes in mapping bug descriptions to faulty code. The model is available on [HuggingFace](https://huggingface.co/nvidia/NV-EmbedCode-7b-v1) and as a [NIM](https://build.nvidia.com/nvidia/nv-embedcode-7b-v1).
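At its core, embedding-based file localization ranks repository files by the similarity between an embedding of the issue description and an embedding of each file. The following toy sketch illustrates only that ranking step; the vectors, file paths, and helper names are made up, and a real run would obtain embeddings from NV-EmbedCode rather than hard-coded lists:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def rank_files(query_vec, file_vecs):
    """Return file paths sorted by descending similarity to the query."""
    scored = [(cosine(query_vec, v), path) for path, v in file_vecs.items()]
    return [path for _, path in sorted(scored, reverse=True)]

# Toy embeddings standing in for NV-EmbedCode outputs.
query = [0.9, 0.1, 0.2]
files = {
    "pkg/module_a.py": [0.8, 0.2, 0.1],
    "pkg/module_b.py": [0.1, 0.9, 0.3],
}
print(rank_files(query, files))  # most similar file first
```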
The following command runs file localization using [NV-EmbedCode's NIM](https://build.nvidia.com/nvidia/nv-embedcode-7b-v1):

```bash
python -m cortexa.retrieval.embed_retrieve \
    --model_name nvidia/nv-embedcode-7b-v1 \
    --base_url https://integrate.api.nvidia.com/v1 \
    --log_dir ./logs \
    --repo_playground ./repos \
    --batch_size 16 \
    --max_length 450 \
    --dataset_name_or_path princeton-nlp/SWE-bench_Verified \
    --query_type llmsummary \
    --instance_id astropy__astropy-12907
```

The `--instance_id` argument accepts a comma-separated list of instances to run. To run the entire benchmark, omit the argument entirely.

You can measure retrieval accuracy by running:

```bash
python -m cortexa.retrieval.file_retrieval_eval \
    --log_dir ./logs \
    --dataset_name_or_path princeton-nlp/SWE-bench_Verified \
    --query_type llmsummary
```

For the SWE-bench Lite and Verified sets, we generated file localization results and made them available at `src/cortexa/retrieval/files/cortexa_all_llmsummary_ordered_files.pickle`. The pickle file contains a dictionary with results for 707 instances (the union of the Lite and Verified sets). You can access the result for each instance using its instance ID. Each result is a tuple with two elements: the first is the ranked list of files predicted by our NV-EmbedCode model, and the second is the list of files that were modified in the golden patch from the dataset.

### Entity Localization

To obtain more granular localization results, you can use our localization agent. It uses the file ranking results from the previous step and returns a list of relevant entities, such as functions and classes. To generate entity localization results:

```bash
python -m cortexa.localize.entity_localization \
    --log_dir ./logs \
    --repo_playground ./repos \
    --num_turns 5 \
    --num_top_files 6 \
    --embed_results cortexa_verified_llmsummary_ordered_files.pickle \
    --instance_id astropy__astropy-12907
```

`embed_results` should be the name of the file generated in the previous step.
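Given the tuple layout described above for the file localization pickle, you can compute your own metrics over it. A minimal sketch with a synthetic dictionary in place of the real pickle (the `recall_at_k` helper is illustrative, not part of the codebase):

```python
import pickle  # needed to load the real pickle file

def recall_at_k(predicted, gold, k):
    """Fraction of gold-patch files that appear in the top-k predictions."""
    top_k = set(predicted[:k])
    return sum(1 for f in gold if f in top_k) / len(gold)

# Synthetic stand-in mirroring the layout of
# cortexa_all_llmsummary_ordered_files.pickle:
# instance_id -> (ranked_predicted_files, gold_patch_files)
results = {
    "example__instance-1": (
        ["pkg/a.py", "pkg/b.py", "pkg/c.py"],
        ["pkg/a.py"],
    ),
}
# To use the real results instead:
# with open("src/cortexa/retrieval/files/cortexa_all_llmsummary_ordered_files.pickle", "rb") as f:
#     results = pickle.load(f)

avg = sum(recall_at_k(pred, gold, k=1) for pred, gold in results.values()) / len(results)
print(avg)
```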
By default, this agent uses the model config file at `model_config.yml`. For each model, you need to specify a URL for API access, an `api_key_name`, and a `model_name`. For security reasons, the code populates the actual API key value at runtime and assumes it is available via `os.environ[api_key_name]`, so make sure the API key is accessible as an environment variable. Empirically, we found that running entity localization with different models and temperatures, then merging their results, increases recall. You can follow the example in `src/cortexa/localize/configs.py` to add more model configurations.

To evaluate entity localization results:

```bash
python src/cortexa/localize/entity_localization_eval.py --loc_f=LOC_RES_F --target_set=verified
```

`LOC_RES_F` needs to be a `jsonl` file following the format in `src/cortexa/retrieval/files/cortexa_all_llmsummary_LA_DP_entity.jsonl`. The previous script produces three `jsonl` files in this format: for the direct prompt, the localization agent, and the merged results.

Modify `get_default_loc_log_file_map` in `src/cortexa/repair/config.py` with the results of the previous two stages. By default, it reads our pre-processed retrieved files and entities. This file also defines the candidates for how the patches and reproduction tests are generated. If you want to try other candidates, modify their respective functions.

## Repair Stage

In the repair stage, we generate patches and reproduction tests. We then run the patches through the generated tests and the final unit tests. The final unit-test results are only used for reporting the resolution rate. Finally, we filter the patches based on the results of the reproduction tests and output a single patch for each instance. The agent uses the OpenAI API to run the models and reads model details from the `model_config.yml` file. You can configure the patch and test generation candidates in `src/cortexa/repair/configs.py`.
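The runtime key lookup described above can be sketched as follows. The config entry, its field values, and the `resolve_api_key` helper are all hypothetical illustrations; consult `model_config.yml` for the actual schema:

```python
import os

def resolve_api_key(model_entry):
    """Look up the API key named in a config entry from the environment."""
    key_name = model_entry["api_key_name"]
    try:
        return os.environ[key_name]
    except KeyError:
        raise RuntimeError(f"Set the {key_name} environment variable before running.")

# A config entry of the kind described above (values are placeholders).
entry = {
    "model_name": "nvidia/nv-embedcode-7b-v1",
    "base_url": "https://integrate.api.nvidia.com/v1",
    "api_key_name": "NVIDIA_API_KEY",
}
os.environ["NVIDIA_API_KEY"] = "dummy-key-for-demo"
print(resolve_api_key(entry))  # prints the key pulled from the environment
```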
To run patch generation, test generation, and evaluation, use the following command. It optionally accepts `--instance_id` with a comma-separated list of instance IDs if you wish to run a subset. To run the entire benchmark, omit the `--instance_id` argument entirely.

```bash
python -m cortexa.evaluate.run_evaluation \
    --repro_test_dir ./repro_tests/merged \
    --repo_playground ./repos \
    --log_dir ./logs \
    --run_patch_generation \
    --run_test_generation \
    --num_workers_patch_gen 2 \
    --num_workers_test_gen 2 \
    --num_workers_eval 4 \
    --modes eval,repro \
    --dataset_name_or_path princeton-nlp/SWE-bench_Verified \
    --instance_id astropy__astropy-12907
```

Once the previous step is done, run the following command to filter the generated patches:

```bash
python -m cortexa.evaluate.run_filtering \
    --log_dir ./logs \
    --repo_playground ./repos \
    --run_normalization \
    --vote_mode llm_judge \
    --model_name deepseek-v3-0324 \
    --dataset_name_or_path princeton-nlp/SWE-bench_Verified
```

It produces `{log_dir}/final_patches.json` with the final selected patches in diff format and `{log_dir}/results.json` with a summary of the resolved and unresolved instances.

Alternatively, if you want to run reproduction test generation, patch generation, and evaluation separately, run the following.

### Reproduction Test Generation

We use reproduction tests to filter patch candidates.
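The filtering idea, keeping patch candidates that pass their reproduction tests and then settling on one, can be sketched as a simple majority vote over normalized patches. This is only an illustration of the concept, not the `llm_judge` voting mode used above, and all names and patch strings are made up:

```python
from collections import Counter

def select_patch(candidates):
    """candidates: list of (patch_text, passed_repro_test) tuples.

    Keep patches whose reproduction test passed, then return the most
    common surviving patch (a simple majority vote). If no candidate
    passes, fall back to voting over all candidates.
    """
    survivors = [p for p, passed in candidates if passed]
    if not survivors:
        survivors = [p for p, _ in candidates]
    return Counter(survivors).most_common(1)[0][0]

candidates = [
    ("diff --git a/f.py b/f.py\n+fix A", True),
    ("diff --git a/f.py b/f.py\n+fix A", True),
    ("diff --git a/f.py b/f.py\n+fix B", False),
]
print(select_patch(candidates))  # the patch that passes and has the most votes
```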
Run

```bash
python -m cortexa.repair.repro_test_gen \
    --log_dir ./logs \
    --out_inference_file summary_test_gen.jsonl \
    --repro_test_dir ./reproduction_tests \
    --max_round 3 \
    --num_workers 2 \
    --repo_playground ./repos \
    --dataset_name_or_path princeton-nlp/SWE-bench_Verified \
    --instance_id astropy__astropy-12907
```

### Repair Generation

Run

```bash
python -m cortexa.repair.repair_gen \
    --log_dir ./logs \
    --out_inference_file summary_inference.jsonl \
    --num_workers 2 \
    --repo_playground ./repos \
    --dataset_name_or_path princeton-nlp/SWE-bench_Verified \
    --instance_id astropy__astropy-12907
```

### Evaluation

You can evaluate the generated patches with the generated reproduction tests and the final SWE-bench unit tests. Run

```bash
python -m cortexa.evaluate.run_evaluation \
    --repro_test_dir ./repro_tests/merged \
    --modes eval,repro \
    --log_dir ./logs \
    --out_inference_file summary_inference.jsonl \
    --num_workers_eval 4 \
    --dataset_name_or_path princeton-nlp/SWE-bench_Verified \
    --instance_id astropy__astropy-12907
```

If you wish to run the final evaluation for only one patch per instance, irrespective of the previous patch and test generations, run

```bash
python -m cortexa.evaluate.run_evaluation \
    --modes eval \
    --log_dir ./logs \
    --out_inference_file summary_inference.jsonl \
    --num_workers_eval 4 \
    --dataset_name_or_path princeton-nlp/SWE-bench_Verified \
    --instance_id astropy__astropy-12907 \
    --single_patch_eval
```

`out_inference_file` is a `jsonl` file with each line showing the results for one instance.
It must have at least the following attributes for each instance:

- `instance_id`
- `model_patch`: the patch for the instance in git diff format

### Citation

If you find our work useful, please cite our ICML 2025 paper:

```
@inproceedings{
sohrabizadeh2025nemotroncortexa,
title={Nemotron-{CORTEXA}: Enhancing {LLM} Agents for Software Engineering Tasks via Improved Localization and Solution Diversity},
author={Atefeh Sohrabizadeh and Jialin Song and Mingjie Liu and Rajarshi Roy and Chankyu Lee and Jonathan Raiman and Bryan Catanzaro},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://openreview.net/forum?id=k6p8UKRdH7}
}
```