# c4-chinese-zhtw

**Repository Path**: hf-datasets/c4-chinese-zhtw

## Basic Information

- **Project Name**: c4-chinese-zhtw
- **Description**: Mirror of https://huggingface.co/datasets/erhwenkuo/c4-chinese-zhtw
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-10-30
- **Last Updated**: 2024-06-09

## README

---
language:
- zh
size_categories:
- 1M<n<10M
task_categories:
- text-generation
- fill-mask
dataset_info:
  features:
  - name: url
    dtype: string
  - name: timestamp
    dtype: string
  - name: content_language
    dtype: string
  - name: content_type
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 12480603148
    num_examples: 2967556
  download_size: 8659425404
  dataset_size: 12480603148
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Dataset Card for "c4-chinese-zhtw"

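## Content

Common Crawl is a non-profit organization that crawls the web and provides its archives and datasets to the public free of charge. Common Crawl's web archive contains petabytes of data collected since 2008. It generally completes one crawl per month.

Common Crawl's crawler respects nofollow and robots.txt policies. Open-source code for processing the Common Crawl dataset is publicly available.

This Traditional Chinese dataset was downloaded from the [Common Crawl](https://commoncrawl.org/overview) **2023-14** data archive and then cleaned.

This is the version prepared by [jed351](https://huggingface.co/jed351), hosted at: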

- https://huggingface.co/datasets/jed351/Traditional-Chinese-Common-Crawl-Filtered

## Supported Tasks

C4 is mainly used for pretraining language models.
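
For pretraining experiments it is often convenient to stream a few records instead of downloading the whole train split. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the data is reachable under the upstream repository id `erhwenkuo/c4-chinese-zhtw` (taken from the mirror description). The helper name `iter_samples` is hypothetical.

```python
from itertools import islice
from typing import Iterator

# Upstream repository id named in the mirror description (assumed to be reachable).
DATASET_ID = "erhwenkuo/c4-chinese-zhtw"


def iter_samples(limit: int = 3) -> Iterator[dict]:
    """Yield the first `limit` records of the train split without a full download."""
    # Imported lazily so that defining this helper does not require the library.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that fetches data on demand.
    stream = load_dataset(DATASET_ID, split="train", streaming=True)
    yield from islice(stream, limit)
```

A call such as `for row in iter_samples(2): print(row["url"])` needs network access; nothing is downloaded until the generator is iterated.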

## Example

An example of a single sample:

```
{
  'url': 'http://www.bilingtong.com/cpzx/96.html',
  'timestamp': '2023-03-21 02:12:48',
  'content_language': 'zho',
  'content_type': 'text/plain',
  'text': '新風系統是通過系統設計送風和排風使室內空氣存在一空氣 。無需開窗全天持續不斷有組.....'
}
```
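
The metadata of such a record can be turned into typed values with the standard library. This is a small sketch; the timestamp format `%Y-%m-%d %H:%M:%S` is inferred from the single sample shown here and is an assumption about the rest of the data, and the sample does not state a time zone, so the result is a naive `datetime`.

```python
from datetime import datetime
from urllib.parse import urlparse

# Metadata copied from the sample record (the text field is omitted here).
sample = {
    "url": "http://www.bilingtong.com/cpzx/96.html",
    "timestamp": "2023-03-21 02:12:48",
    "content_language": "zho",
    "content_type": "text/plain",
}


def parse_timestamp(value: str) -> datetime:
    """Parse a timestamp string in the format seen in the sample (assumed)."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


crawled_at = parse_timestamp(sample["timestamp"])
host = urlparse(sample["url"]).netloc  # host part of the source URL

print(crawled_at.isoformat(), host)
```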

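## Data Fields

Each record has the following fields:

- `url`: the source URL
- `timestamp`: the timestamp
- `content_language`: the language(s) identified in the content
- `content_type`: the content type, also known as the MIME type or media type, as declared in the web server's response headers
- `text`: the cleaned text content of the web page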
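
When consuming the data in a pipeline, a quick schema check can catch malformed records early. The sketch below relies only on the five field names and the `string` dtype declared in the card's metadata; the helper name `find_schema_problems` is hypothetical.

```python
# Field names as declared in the dataset card; every field has dtype string.
EXPECTED_FIELDS = ("url", "timestamp", "content_language", "content_type", "text")


def find_schema_problems(record: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record is fine."""
    problems = []
    for name in EXPECTED_FIELDS:
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], str):
            problems.append(f"field {name} is not a string")
    return problems
```

For example, a record containing only `url` and a non-string `text` would be reported with three missing fields and one type problem.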


## Data Cleaning

See the GitHub project [c4-dataset-script](https://github.com/jedcheng/c4-dataset-script) for the logic and code used to download and clean the data.

The main steps are:

1. Download the WET crawl archive index file
2. Run download and Chinese screening script on Spark
3. Filter out non-sentence lines and toxic documents
4. Remove duplicated text
5. Remove documents that are excessively self-repeating (the repetition removal used for DeepMind's MassiveText)
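
To illustrate the idea behind steps 4 and 5, here is a simplified sketch. It is not the code from c4-dataset-script: exact-match deduplication by SHA-256 hash and a single duplicate-line metric stand in for the real logic, and the `0.3` threshold, like all function names, is an illustrative assumption.

```python
import hashlib
from collections import Counter


def drop_exact_duplicates(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing by content hash."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def duplicate_line_fraction(text: str) -> float:
    """Fraction of non-empty lines that repeat an earlier line in the same document."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    repeated = len(lines) - len(Counter(lines))  # total lines minus distinct lines
    return repeated / len(lines)


def clean(docs: list[str], max_duplicate_line_fraction: float = 0.3) -> list[str]:
    """Deduplicate, then drop documents that repeat themselves too much."""
    return [
        doc
        for doc in drop_exact_duplicates(docs)
        if duplicate_line_fraction(doc) <= max_duplicate_line_fraction
    ]
```

At the scale of this dataset the same logic would run as a distributed job (the listed steps mention Spark); the in-memory version above only shows the filtering criteria.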

## License Information

Please follow the Common Crawl terms of use.

- https://commoncrawl.org/terms-of-use