# reddit_database

**Repository Path**: subtion/reddit_database

## Basic Information

- **Project Name**: reddit_database
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-01-07
- **Last Updated**: 2026-01-10

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

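Based on the dataset's Mendeley Data page and an analysis of the files in this project, the three folders of the dataset have the following meaning and data distribution: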

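### 1. All articles (original full dataset)
This folder holds **all of the raw data** as collected, with no pruning or balancing applied.
- **Total size**: 226,010 articles (about 226 thousand)
- **Characteristics**: The class distribution is highly **imbalanced**. Conservative articles outnumber Liberal articles by roughly nine to one.
- **File details**:
  - `Conservative.json`: 203,456 articles
  - `Liberal.json`: 22,554 articles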

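### 2. Class balanced articles (class-balanced dataset)
This folder holds a **subset** created to address the class imbalance.
- **Total size**: 45,108 articles (about 45 thousand)
- **Characteristics**: **Both classes have the same number of articles.** The researchers kept every Liberal article and randomly down-sampled the Conservative articles to the same count, so that a classifier trained on this data is not biased toward the majority class.
- **File details**:
  - `Conservative.json`: 22,554 articles
  - `Liberal.json`: 22,554 articles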
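
The down-sampling described above can be sketched as follows. This is a minimal illustration, not the researchers' actual procedure: the function name `downsample_to_balance`, the fixed seed, and the toy records are all assumptions made for the example.

```python
import random


def downsample_to_balance(majority, minority, seed=0):
    """Randomly sample the majority class down to the size of the minority class.

    Returns (sampled_majority, minority). The seed is an assumption added here
    so that the sampled subset is reproducible.
    """
    if len(majority) < len(minority):
        raise ValueError("majority must be at least as large as minority")
    rng = random.Random(seed)
    # random.Random.sample draws without replacement, so no article is duplicated.
    return rng.sample(majority, k=len(minority)), list(minority)


# Toy stand-ins for the real records (the real files hold 203,456 and 22,554 articles).
conservative = [{"id": i} for i in range(1000)]
liberal = [{"id": i} for i in range(100)]

balanced_conservative, balanced_liberal = downsample_to_balance(conservative, liberal)
print(len(balanced_conservative), len(balanced_liberal))  # 100 100
```

Sampling without replacement keeps every retained Conservative article unique, which matches the idea of a subset rather than a resampled copy.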

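### 3. Annotated articles (annotated dataset)
This folder holds a small **validation set** with manual or otherwise high-quality annotations.
- **Total size**: 4,000 articles
- **Characteristics**: Small and **strictly balanced**. Sets like this are typically used to check model accuracy or to serve as the "gold standard" (ground truth) for testing.
- **File details**:
  - `Conservative.json`: 2,000 articles
  - `Liberal.json`: 2,000 articles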

### Summary table

| Folder | Purpose | Conservative (articles) | Liberal (articles) | Total (articles) |
| :--- | :--- | :--- | :--- | :--- |
| **All articles** | Original full data | 203,456 | 22,554 | 226,010 |
| **Class balanced articles** | Balanced data for training | 22,554 | 22,554 | 45,108 |
| **Annotated articles** | Test/validation data | 2,000 | 2,000 | 4,000 |

To verify these figures, I wrote a helper script, [analyze_structure.py](analyze_structure.py), to count the records; the results match the description on Mendeley exactly.
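
For readers who want to reproduce the counts independently, the sketch below shows one way to tally records per file. It is not the contents of `analyze_structure.py`. It assumes each file is either a single top-level JSON array or JSON Lines (one object per line), which this README does not confirm, and the helper names `count_records` and `summarize` are hypothetical.

```python
import json
from pathlib import Path


def count_records(path):
    """Count records in a file that is either one JSON array or JSON Lines.

    Assumption: the on-disk layout is one of these two; adjust if it is not.
    The whole file is read into memory, which is simple but not frugal.
    """
    text = Path(path).read_text(encoding="utf-8")
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to JSON Lines: parse every non-empty line as one record.
        count = 0
        for line in text.splitlines():
            if line.strip():
                json.loads(line)  # raises if the line is not valid JSON
                count += 1
        return count
    # A top-level array has one record per element; anything else is one record.
    return len(data) if isinstance(data, list) else 1


def summarize(root):
    """Return {folder name: {file stem: record count}} for every */*.json under root."""
    summary = {}
    for path in sorted(Path(root).glob("*/*.json")):
        summary.setdefault(path.parent.name, {})[path.stem] = count_records(path)
    return summary
```

Calling `summarize(".")` from the dataset root should then yield one entry per folder, which can be compared against the summary table above.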

---

## JSON data formats

This project contains three JSON record formats, one for each of the three folders.

### 1. All articles format
Applies to the files in the `All articles` folder.

**Example:**
```json
{
    "urls": "http://www.americamagazine.org/...",
    "articles": "The news of Justice David Souter’s retirement...",
    "created_utc": 1241294002,
    "num_upvotes": 3,
    "num_comments": 0,
    "flair": [],
    "url_domain": "americamagazine",
    "Frequency": 2,
    "totalwords": 619
}
```

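**Field reference:**

| Field | Type | Description |
| :--- | :--- | :--- |
| `urls` | String | Original source URL of the article |
| `articles` | String | Body text of the article |
| `created_utc` | Number | Unix timestamp, in seconds (UTC), of when the article was posted |
| `num_upvotes` | Number | Number of upvotes on Reddit |
| `num_comments` | Number | Number of comments on Reddit |
| `flair` | Array | Flair tags of the Reddit post |
| `url_domain` | String | Domain part of the URL |
| `Frequency` | Number | (Statistical field) occurrence frequency or a related count |
| `totalwords` | Number | Total word count of the article |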
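
A quick way to check that a record matches the field table above is a small type check. The sketch below is illustrative only: `EXPECTED_TYPES` and `schema_problems` are hypothetical names, and "Number" is interpreted here as a Python `int` or `float`.

```python
NUMBER = (int, float)

# Field names and types taken from the field table; the mapping itself is an assumption.
EXPECTED_TYPES = {
    "urls": str,
    "articles": str,
    "created_utc": NUMBER,
    "num_upvotes": NUMBER,
    "num_comments": NUMBER,
    "flair": list,
    "url_domain": str,
    "Frequency": NUMBER,
    "totalwords": NUMBER,
}


def schema_problems(record):
    """Return a list of human-readable problems; an empty list means the record fits."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{field}: unexpected type {type(value).__name__}")
    return problems


# The example record shown above, with its elided strings left as they are.
sample = {
    "urls": "http://www.americamagazine.org/...",
    "articles": "The news of Justice David Souter’s retirement...",
    "created_utc": 1241294002,
    "num_upvotes": 3,
    "num_comments": 0,
    "flair": [],
    "url_domain": "americamagazine",
    "Frequency": 2,
    "totalwords": 619,
}
print(schema_problems(sample))  # []
```

Extra fields such as `date` are ignored by this check, so the same function can be run on records from the class-balanced folder.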

### 2. Class balanced articles format
Applies to the files in the `Class balanced articles` folder.
Most fields are the same as in `All articles`, but some files (such as `Conservative.json`) may contain an additional `date` field.

**Example:**
```json
{
    "urls": "http://beaglescout.wordpress.com/...",
    "articles": "Today, prompted by an anecdote...",
    "created_utc": 1242230642,
    "num_upvotes": 7,
    "num_comments": 4,
    "flair": [],
    "url_domain": "wordpress",
    "Frequency": 1317,
    "totalwords": 2741,
    "date": 1242172800000
}
```

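**Field reference:**

| Field | Type | Description |
| :--- | :--- | :--- |
| `urls` | String | Original source URL of the article |
| `articles` | String | Body text of the article |
| `created_utc` | Number | Unix timestamp, in seconds (UTC), of when the article was posted |
| `num_upvotes` | Number | Number of upvotes on Reddit |
| `num_comments` | Number | Number of comments on Reddit |
| `flair` | Array | Flair tags of the Reddit post |
| `url_domain` | String | Domain part of the URL |
| `Frequency` | Number | (Statistical field) occurrence frequency or a related count |
| `totalwords` | Number | Total word count of the article |
| `date` | Number | (Present in some files only) timestamp in milliseconds, possibly intended for time-based analysis |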
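
Because `created_utc` is in seconds and `date` is in milliseconds, the two need different conversions. The sketch below converts the values from the example record; the helper names `from_seconds` and `from_millis` are hypothetical. In this one example, `date` turns out to be midnight UTC of the day given by `created_utc`; whether that holds for every record has not been verified here.

```python
from datetime import datetime, timezone


def from_seconds(ts):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


def from_millis(ts_ms):
    """Convert a Unix timestamp in milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)


# Values copied from the example record above.
created = from_seconds(1242230642)
date = from_millis(1242172800000)

print(created.isoformat())  # 2009-05-13T16:04:02+00:00
print(date.isoformat())     # 2009-05-13T00:00:00+00:00
print(created.date() == date.date())  # True
```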

### 3. Annotated articles format
Applies to the files in the `Annotated articles` folder.

**Example:**
```json
{
    "id": 13792,
    "text": "Every product is independently selected...",
    "new_label": "Yes"
}
```

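**Field reference:**

| Field | Type | Description |
| :--- | :--- | :--- |
| `id` | Number | Unique identifier of the article |
| `text` | String | Body text of the article |
| `new_label` | String | Human-annotated label (for example, "Yes" indicates that the article matches a given ideological leaning) |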
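
A first step when working with the annotated records is to see which `new_label` values actually occur. The sketch below tallies and filters labels on toy data. Only "Yes" appears in the example above, so the "No" value, the toy records, and the helper names `label_counts` and `select_by_label` are assumptions made for illustration.

```python
from collections import Counter


def label_counts(records):
    """Count how many records carry each `new_label` value."""
    return Counter(record.get("new_label", "<missing>") for record in records)


def select_by_label(records, label):
    """Return only the records whose `new_label` equals the given label."""
    return [record for record in records if record.get("new_label") == label]


# Toy records in the annotated format; "No" is an assumed label value.
records = [
    {"id": 1, "text": "first toy article", "new_label": "Yes"},
    {"id": 2, "text": "second toy article", "new_label": "No"},
    {"id": 3, "text": "third toy article", "new_label": "Yes"},
]

print(label_counts(records)["Yes"])  # 2
print([r["id"] for r in select_by_label(records, "Yes")])  # [1, 3]
```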