# ClassicalModernCorpus

**Repository Path**: hellohistory/ClassicalModernCorpus

## Basic Information

- **Project Name**: ClassicalModernCorpus
- **Description**: This project aims to collect and build a parallel corpus dataset of Classical Chinese and Modern Chinese
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-01-20
- **Last Updated**: 2024-01-20

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

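# A dataset project that lightly processes existing raw data into a form suitable for machine-learning translation between Classical and Modern Chinese

# Project Structure

## 1. Directory Structure

Each folder is named after the date on which it was created.

## 2. Data Structure

Taking Date0525(1) as an example: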

    {
        "name": "魏书_列传_卷七",
        "with_punctuation": "景穆皇帝十四男。",
        "translation": "景穆皇帝有十四个儿子。"
    },

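### Processing principles:

Each record pairs a Classical Chinese text with its Modern Chinese rendering: the `with_punctuation` field holds the original text, the `translation` field holds the translated text, and the `name` field gives the source of the original text.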
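As an illustration of how these three fields might be consumed, here is a minimal Python sketch that turns records into (source, Classical, Modern) triples for training. It assumes each JSON file holds an array of such records, which the sample suggests but the README does not state. The helper name `records_to_pairs` and the second, empty record in the sample are hypothetical and are not part of the dataset or of PrepPro.

```python
import json
from typing import Iterable, Iterator, Tuple


def records_to_pairs(records: Iterable[dict]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, classical, modern) triples, skipping incomplete records."""
    for rec in records:
        # `or ""` guards against missing keys and explicit null values.
        classical = (rec.get("with_punctuation") or "").strip()
        modern = (rec.get("translation") or "").strip()
        if classical and modern:
            yield rec.get("name") or "", classical, modern


# Assumption: a file is a JSON array of records shaped like the sample above.
# The second record is a made-up empty entry that shows the filtering.
sample = """[
  {"name": "魏书_列传_卷七",
   "with_punctuation": "景穆皇帝十四男。",
   "translation": "景穆皇帝有十四个儿子。"},
  {"name": "魏书_列传_卷七", "with_punctuation": "", "translation": ""}
]"""

pairs = list(records_to_pairs(json.loads(sample)))
for name, classical, modern in pairs:
    print(f"{name}\t{classical}\t{modern}")
```

For a real file, `json.load(open(path, encoding="utf-8"))` would replace `json.loads(sample)`, provided the array layout assumed above holds.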

### Processing method:
See the processing scripts at https://github.com/Hellohistory/PrepPro


# 3. Data Sources
https://github.com/BangBOOM/Classical-Chinese

https://github.com/NiuTrans/Classical-Modern

# 4. Update History
## 2023-05-30 Project created; two datasets uploaded

Date0524 contains 26 JSON files, with a total size of 242 MB

Date0525 contains 4670 JSON files, with a total size of 99.4 MB
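To check a download against the figures above, a short sketch like the following could count the JSON files in a folder and add up their sizes. The local folder paths and the helper name `summarize_json_files` are assumptions for illustration. The README also does not say whether its sizes use 10^6 or 2^20 bytes per MB, so small differences from the printed value are to be expected.

```python
from pathlib import Path
from typing import Tuple


def summarize_json_files(folder: Path) -> Tuple[int, int]:
    """Return (number of .json files, total size in bytes) under `folder`, recursively."""
    files = [p for p in folder.rglob("*.json") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


# Assumed local folder names; adjust to wherever the archives were extracted.
for name in ("Date0524", "Date0525"):
    folder = Path(name)
    if folder.is_dir():
        count, size = summarize_json_files(folder)
        # Decimal megabytes are used here; this is a choice, not a documented fact.
        print(f"{name}: {count} JSON files, {size / 1_000_000:.1f} MB")
```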

# 5. Download Links

Baidu Netdisk: https://pan.baidu.com/s/1gpabFt_DrfZfWunQ0RIKiw?pwd=40kc

Google Drive: https://drive.google.com/drive/folders/1okDzEdWuK9pGydHKik6wrB_TpCi2S60g?usp=sharing