# ForensicsGenome

**Repository Path**: CholeMa/ForensicsGenome

## Basic Information

- **Project Name**: ForensicsGenome
- **Description**: A benchmark to evaluate MLLM's capability in the context of academic fraud
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2025-09-03
- **Last Updated**: 2025-09-21

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

A benchmark for evaluating MLLM capabilities in the context of academic fraud.

----

## Task

The benchmark covers four categories of research-misconduct detection tasks, at both within-paper (`within`) and cross-paper (`across`) scope.

| Scope         | Primary Forensics      | Level              | Secondary Forensics    | issue_id |
|---------------|------------------------|--------------------|------------------------|----------|
| within        | image forgery          | panel_level_issue  | splicing               | FG_001   |
|               |                        |                    | copy-move              | FG_002   |
|               |                        |                    | aigc-full              | FG_003   |
|               |                        |                    | aigc-partial           | FG_004   |
|               |                        |                    | scale bar manipulation | FG_005   |
|               | text-image consistency | figure_level_issue | numerical              | TC_001   |
|               |                        |                    | trend                  | TC_002   |
|               |                        |                    | semantic               | TC_003   |
|               |                        |                    | contextual             | TC_004   |
| within/across | image duplication      | panel_level_issue  | direct                 | DP_001   |
|               |                        |                    | scaling                | DP_002   |
|               |                        |                    | rotated                | DP_003   |
|               |                        |                    | flipped                | DP_004   |
|               |                        |                    | parameter-modified     | DP_005   |
| across        | inappropriate citation | paper_level_issue  | irrelevant             | IC_001   |
|               |                        |                    | improper               | IC_002   |

- **Image Forgery (within; panel level, i.e. sub-figure)**
  - *Splicing (FG_001)*: image splicing
  - *Copy-move (FG_002)*: copy-and-move of a region
  - *AIGC-full (FG_003)*: the whole image is AI-generated
  - *AIGC-partial (FG_004)*: part of the image is AI-generated
  - *Scale bar manipulation (FG_005)*: the scale bar has been tampered with
- **Text–Image Consistency (within; figure level, i.e. whole figure)**
  - *Numerical (TC_001)*: numerical mismatch
  - *Trend (TC_002)*: trend mismatch
  - *Semantic (TC_003)*: semantic mismatch
  - *Contextual (TC_004)*: contextual mismatch
- **Image Duplication (within/across; panel level, i.e. sub-figure)**
  - *Direct (DP_001)*: direct reuse
  - *Scaling (DP_002)*: reuse after scaling
  - *Rotated (DP_003)*: reuse after rotation
  - *Flipped (DP_004)*: reuse after flipping
  - *Parameter-modified (DP_005)*: reuse after adjusting parameters (such as brightness, contrast, or color)
- **Inappropriate Citation (across; paper level, i.e. document)**
  - *Irrelevant (IC_001)*: irrelevant citation (unrelated to the context within the citing paper)
  - *Improper (IC_002)*: improper citation (the cross-paper reference is wrong)

----

## Scripts

- Note: you need to write your own config file, which holds path information and similar settings.
- Patch scripts (required for forgery annotation):
  - `code/patch/check_patch.ipynb`: cell 1 interactively visualizes the generated patch results and lets you page through them; cell 2 self-checks the patch numbering against the black regions and corrects it.
  - `code/patch/patch_generate_prompt.txt`: the prompt template that states the patch rules.
  - `code/patch/patch_generate.py`: calls the model in batch to generate patch annotation results.
- Regular PDF processing:
  - Main entry point `code/base_scripts/main.py`: batch-processes the PDFs under a given root directory, extracts figures, splits panels, renames them in order, and generates `dataset/real/real.json` together with visualization results (see the code comments for details).
- AIGC image insertion:
  - Main entry point `code/aigc/insert_panels.py`: batch-processes AIGC images and decides which real panels to replace by comparing similarity.
  - Manual quality-review script (later stage) `code/aigc/check_aigc_pdf.ipynb`: uses keyboard shortcuts to record the metadata of problematic PDFs and to delete them.

----

## Dataset

### 1. Dataset structure

```
dataset/
├─ real/
│  ├─ real.json
│  └─ real_pdf/
│     └─ {paper_id}/
│        ├─ {paper_id}.pdf
│        ├─ figure/
│        │  └─ {paper_id}_{page_index}_{figure_id}.png
│        └─ panel/
│           └─ {paper_id}_{page_index}_{figure_id}_{panel_id}.png
├─ forgery/
│  ├─ forgery.json
│  └─ forgery_pdf/
│     └─ {paper_id_SUFFIX}/
│        ├─ {paper_id_SUFFIX}.pdf
│        ├─ figure/
│        │  └─ {paper_id_SUFFIX}_{page_index}_{figure_id}.png
│        ├─ panel/
│        │  └─ {paper_id_SUFFIX}_{page_index}_{figure_id}/
│        │     └─ {paper_id_SUFFIX}_{page_index}_{figure_id}_{panel_id}.png
│        └─ mask/
│           └─ {paper_id_SUFFIX}_{page_index}_{figure_id}/
│              └─ {paper_id_SUFFIX}_{page_index}_{figure_id}_{panel_id}.png
├─ consistency/
│  ├─ consistency.json
│  └─ consistency_pdf/
│     └─ {paper_id}/
│        ├─ {paper_id}.pdf
│        └─ figure/{paper_id}_{page_index}_{figure_id}.png
├─ duplication/
│  ├─ duplication.json
│  ├─ reuse_pdf/
│  │  └─ {paper_id}/
│  │     ├─ {paper_id}.pdf
│  │     ├─ figure/{paper_id}_{page_index}_{figure_id}.png
│  │     └─ panel/{paper_id}_{page_index}_{figure_id}_{panel_id}.png
│  └─ reused_pdf/
│     └─ {paper_id}/
│        ├─ {paper_id}.pdf
│        ├─ figure/{paper_id}_{page_index}_{figure_id}.png
│        └─ panel/{paper_id}_{page_index}_{figure_id}_{panel_id}.png
├─ citation/
│  ├─ citation.json
│  ├─ citation_pdf/{paper_id}/{paper_id}.pdf
│  └─ reference_pdf/{reference_id}/{reference_id}.pdf
└─ comprehensive/
   ├─ comprehensive.json
   ├─ comprehensive_pdf_label.json
   └─ comprehensive_pdf/
      └─ {paper_id}/
         ├─ {paper_id}.pdf
         ├─ figure/{paper_id}_{page_index}_{figure_id}.png
         └─ panel/{paper_id}_{page_index}_{figure_id}_{panel_id}.png
```

### 2. Data description

- `_SUFFIX`: the suffix for the secondary forensics type; for example, aigc-partial is `FG_004`
  - e.g. `2010072670_FG_004`
- `real/`: real academic paper data
  - `real.json`: metadata and annotations for the real papers
  - `real_pdf/`: PDF files organized by paper ID, together with the images extracted from them
- `forgery/`: academic paper data with forged panels; a `panel_level_issue`
  - `forgery.json`: metadata and annotations for the forged papers
  - `forgery_pdf/`: PDF files organized by paper ID, together with the images extracted from them
    - `figure/`: whole figures extracted from the PDF pages
    - `panel/`: panels split out of each figure, organized in per-figure directories `{paper_id_SUFFIX}_{page_index}_{figure_id}/`
    - `mask/`: mask images corresponding to `panel/`; each file name keeps the same `{panel_id}` suffix as its panel
- `consistency/`: text-image consistency issue data; a `figure_level_issue`
  - `consistency.json`: annotations and metadata
  - `consistency_pdf/`: paper PDFs and image resources (`figure` level only)
- `duplication/`: panel reuse issue data (reuse within one paper or across papers); a `panel_level_issue`
  - `duplication.json`: annotations and metadata
  - `reuse_pdf/`: PDFs and images of the papers that perform the reuse (these contain the issue)
  - `reused_pdf/`: PDFs and images of the papers being reused (these are normal)
- `citation/`: citation consistency and evidence-tracing data; a `paper_level_issue`
  - `citation.json`: the mapping between citations and evidence
  - `citation_pdf/`: paper PDFs
  - `reference_pdf/`: reference PDFs
- `comprehensive/`: comprehensive task data (multi-issue annotations)
  - `comprehensive.json`: multi-issue annotations
  - `comprehensive_pdf_label.json`: labels the issue types that each PDF contains
  - `comprehensive_pdf/`: the corresponding PDFs and image resources (with `figure` and `panel`)

Each paper is organized as follows:

- the original PDF file
- the figures extracted from the PDF (`figure` directory)
- the panels split out of the figures (`panel` directory)

### 3. JSON format examples

#### 3.1 real.json complete example

Example entry from `dataset/real/real.json` (the complete unit for a single paper, with 1 page, 1 figure, and 1 panel):

```json
{
  "papers": [
    {
      "paper_id": 2010000000,
      "path": "dataset/real/real_pdf/2010000000/2010000000.pdf",
      "paper_level_issues": {
        "has_issue": false
      },
      "pages": [
        {
          "page_index": 1,
          "page_size_pt": {
            "width": 595.2760009765625,
            "height": 793.7009887695312
          },
          "figures": [
            {
              "figure_id": "2010000000_1_1",
              "bbox_page_pixel": [17.82, 74.43, 140.73, 197.2],
              "path": "dataset/real/real_pdf/2010000000/figure/2010000000_1_1.png",
              "figure_level_issues": {
                "has_issue": false
              },
              "panels": [
                {
                  "panel_id": "2010000000_1_1_1",
                  "bbox_page_pixel": [17.82, 74.43, 140.73, 197.2],
                  "path": "dataset/real/real_pdf/2010000000/panel/2010000000_1_1/2010000000_1_1_1.png",
                  "panel_level_issues": {
                    "has_issue": false
                  }
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
```

#### 3.2 forgery.json example

##### a) `splicing` example

```json
{
  "papers": [
    {
      "paper_id": "2010000000_FG_001",
      "path": "dataset/forgery/forgery_pdf/2010000000_FG_001/2010000000_FG_001.pdf",
      "paper_level_issues": {
        "has_issue": false
      },
      "pages": [
        {
          "page_index": 1,
          "page_size_pt": {
            "width": 595.2760009765625,
            "height": 793.7009887695312
          },
          "figures": [
            {
              "figure_id": "2010000000_FG_001_1_1",
              "bbox_page_pixel": [17.82, 74.43, 140.73, 197.2],
              "path": "dataset/forgery/forgery_pdf/2010000000_FG_001/figure/2010000000_FG_001_1_1.png",
              "figure_level_issues": {
                "has_issue": false
              },
              "panels": [
                {
                  "panel_id": "2010000000_FG_001_1_1_1",
                  "bbox_page_pixel": [17.82, 74.43, 140.73, 197.2],
                  "path": "dataset/forgery/forgery_pdf/2010000000_FG_001/panel/2010000000_FG_001_1_1/2010000000_FG_001_1_1_1.png",
                  "panel_level_issues": {
                    "has_issue": true,
                    "issues": [
                      {
                        "issue_id": "FG_001",
                        "scope": "within",
                        "issue_type": "forgery",
                        "issue_subtype": "splicing",
                        "evidence": {
                          "mask_path": "dataset/forgery/forgery_pdf/2010000000_FG_001/mask/2010000000_FG_001_1_1/2010000000_FG_001_1_1_1.png"
                        }
                      }
                    ]
                  }
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
```

##### b) `copy-move` example

```json
{
  ...
  "panels": [
    {
      "panel_id": "2010000000_FG_002_1_1_1",
      "bbox_page_pixel": [17.82, 74.43, 140.73, 197.2],
      "path": "dataset/forgery/forgery_pdf/2010000000_FG_002/panel/2010000000_FG_002_1_1/2010000000_FG_002_1_1_1.png",
      "panel_level_issues": {
        "has_issue": true,
        "issues": [
          {
            "issue_id": "FG_002",
            "scope": "within",
            "issue_type": "forgery",
            "issue_subtype": "copy-move",
            "evidence": {
              "mask_path": "dataset/forgery/forgery_pdf/2010000000_FG_002/mask/2010000000_FG_002_1_1/2010000000_FG_002_1_1_1.png"
            }
          }
        ]
      }
    }
  ]
}
...
```

#### 3.3 consistency.json example

```json
{
  "papers": [
    {
      "paper_id": 2010000000,
      "path": "dataset/consistency/consistency_pdf/2010000000/2010000000.pdf",
      "paper_level_issues": {
        "has_issue": false
      },
      "pages": [
        {
          "page_index": 1,
          "page_size_pt": {
            "width": 595.2760009765625,
            "height": 793.7009887695312
          },
          "figures": [
            {
              "figure_id": "2010000000_1_1",
              "bbox_page_pixel": [17.82, 74.43, 140.73, 197.2],
              "path": "dataset/consistency/consistency_pdf/2010000000/figure/2010000000_1_1.png",
              "figure_level_issues": {
                "has_issue": true,
                "issues": [
                  {
                    "issue_id": "TC_001",
                    "scope": "within",
                    "issue_type": "consistency",
                    "issue_subtype": "numerical",
                    "evidence": {
                      "caption": "",
                      "related_text": ""
                    }
                  }
                ]
              }
            }
          ]
        }
      ]
    }
  ]
}
```

#### 3.4 duplication.json example

```json
{
  ...
  "panels": [
    {
      "panel_id": "2010000000_1_1_1_left",
      "bbox_page_pixel": [17.82, 74.43, 140.73, 197.2],
      "path": "dataset/duplication/duplication_pdf/2010000000/panel/2010000000_1_1/2010000000_1_1_1.png",
      "panel_level_issues": {
        "has_issue": true,
        "issues": [
          {
            "issue_id": "DP_004",
            "scope": "within",
            "issue_type": "duplication",
            "issue_subtype": ["scaling", "rotated"],  # multiple values allowed
            "evidence": {
              "duplicated_panel_id_list": ["2010000000_1_1_1_right"],
              "group": 1,  # number of reused places
              "duplication_level": "global",  # global: whole panel, local: partial region
              "duplication_ratio": 0.5  # 0-1: area of the duplicated region as a fraction of the whole reused image
            }
          }
        ]
      }
    },
    {
      "panel_id": "2010000000_1_1_1_right",
      "bbox_page_pixel": [140.73, 74.43, 273.64, 197.2],
      "path": "dataset/duplication/duplication_pdf/2010000000/panel/2010000000_1_1/2010000000_1_1_1_right.png"
    }
  ]
  ...
}
```

#### 3.5 citation.json example

```json
{
  "papers": [
    {
      "paper_id": "2010000000_IC_001",
      "path": "dataset/citation/citation_pdf/2010000000_IC_001/2010000000_IC_001.pdf",
      "paper_level_issues": {
        "has_issue": true,
        "issues": [
          {
            "issue_id": "IC_001",
            "scope": "across",
            "issue_type": "citation",
            "issue_subtype": "irrelevant",
            "evidence": {
              "citation_sentence": "",
              "citation_for": "preceding context",
              "citation_post": "following context",
              "reference_paper_id": xxx,
              "reference_paper_path": "",
              "reference_title": "title of the cited paper",
              "reference_abstract": "abstract of the cited paper"
            }
          }
        ]
      }
    }
  ]
}
```

#### 3.6 comprehensive.json example

```json
{
  ...
  "panels": [
    {
      "panel_id": "2010000000_1_1_1_left",
      "bbox_page_pixel": [17.82, 74.43, 140.73, 197.2],
      "path": "dataset/comprehensive/comprehensive_pdf/2010000000/panel/2010000000_1_1/2010000000_1_1_1.png",
      "panel_level_issues": {
        "has_issue": true,
        "issues": [
          {
            "issue_id": "DP_004",
            "scope": "across",
            "issue_type": "duplication",
            "issue_subtype": ["scaling", "rotated"],  # multiple values allowed
            "evidence": {
              "duplicated_panel_id_list": ["2010000001_1_1_2"],
              "group": 2  # number of reused places
            }
          },
          {
            "issue_id": "FG_002",
            "scope": "within",
            "issue_type": "forgery",
            "issue_subtype": "copy-move",
            "evidence": {
              "group": 2  # number of copy-move places
            }
          }
        ]
      }
    }
  ]
}
```

---