  • annotations_creators: no-annotation
  • language_creators: found
  • language: de, en
  • license: unknown
  • multilinguality: translation
  • size_categories: 10M
  • source_datasets: extended|europarl_bilingual, extended|news_commentary, extended|opus_paracrawl, extended|un_multi
  • task_categories: translation
  • task_ids: (none listed)
  • pretty_name: WMT T2T
  • paperswithcode_id: (none listed)
  • dataset_info (config_name: de-en):
      features: translation (dtype: translation; languages: de, en)
      splits:
        name num_bytes num_examples
        train 1385110179 4592289
        validation 736415 3000
        test 777334 3003
      download_size: 1728762345
      dataset_size: 1386623928

Dataset Card for "wmt_t2t"

Dataset Description

Dataset Summary

The WMT EnDe Translate dataset used by the Tensor2Tensor library.

Translation dataset based on the data from statmt.org.

Versions exist for different years, each using a combination of data sources. The base wmt allows you to create a custom dataset by choosing your own data sources and language pair. This can be done as follows:

import datasets  # needed for datasets.Split below
from datasets import inspect_dataset, load_dataset_builder

# Copy the dataset's loading scripts to a local directory.
inspect_dataset("wmt_t2t", "path/to/scripts")

# Configure a custom language pair and choose the subsets for each split.
builder = load_dataset_builder(
    "path/to/scripts/wmt_utils.py",
    language_pair=("fr", "de"),
    subsets={
        datasets.Split.TRAIN: ["commoncrawl_frde"],
        datasets.Split.VALIDATION: ["euelections_dev2019"],
    },
)

# Standard version
builder.download_and_prepare()
ds = builder.as_dataset()

# Streamable version
ds = builder.as_streaming_dataset()

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

de-en

  • Size of downloaded dataset files: 1.73 GB
  • Size of the generated dataset: 1.39 GB
  • Total amount of disk used: 3.11 GB
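
The rounded sizes above correspond to the byte counts given in the card's metadata (download_size 1728762345 and dataset_size 1386623928). The following is a minimal sketch of that conversion, assuming decimal gigabytes (1 GB = 10^9 bytes). The helper name `to_gb` is hypothetical and not part of any library.

```python
# Byte counts as listed in the dataset card metadata.
DOWNLOAD_SIZE_BYTES = 1_728_762_345
DATASET_SIZE_BYTES = 1_386_623_928


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (assumption: 1 GB = 10**9 bytes)."""
    return num_bytes / 10**9


# Disk usage is the downloaded files plus the generated dataset.
total_bytes = DOWNLOAD_SIZE_BYTES + DATASET_SIZE_BYTES

print(f"download:  {to_gb(DOWNLOAD_SIZE_BYTES):.2f} GB")
print(f"generated: {to_gb(DATASET_SIZE_BYTES):.2f} GB")
print(f"total:     {to_gb(total_bytes):.3f} GB")
```

Under this assumption the total comes to roughly 3.115 GB, which the card lists as 3.11 GB.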

An example from the 'validation' split looks as follows.

{
    "translation": {
        "de": "Just a test sentence.",
        "en": "Just a test sentence."
    }
}
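
Each record nests the two sentences under a single "translation" key, indexed by language code. The sketch below shows one way to unpack such a record into a (source, target) pair. It uses a plain dictionary copied from the example above, and the helper `to_pair` is a hypothetical name introduced only for illustration.

```python
from typing import Dict, Tuple

# The example record shown above, as a plain Python dictionary.
record: Dict[str, Dict[str, str]] = {
    "translation": {
        "de": "Just a test sentence.",
        "en": "Just a test sentence.",
    }
}


def to_pair(
    example: Dict[str, Dict[str, str]],
    source_lang: str = "de",
    target_lang: str = "en",
) -> Tuple[str, str]:
    """Return (source, target) strings from one translation record."""
    translation = example["translation"]
    return translation[source_lang], translation[target_lang]


source_text, target_text = to_pair(record)
print(source_text, "->", target_text)
```

Swapping the `source_lang` and `target_lang` arguments gives the reverse translation direction from the same record.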

Data Fields

The data fields are the same among all splits.

de-en

  • translation: a multilingual string variable, with possible languages including de, en.

Data Splits

name train validation test
de-en 4592289 3000 3003
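
To put the split sizes in proportion, the short sketch below computes each split's share of the total number of examples. The counts are copied from the table above, and the variable names are illustrative only.

```python
# Example counts copied from the Data Splits table (de-en config).
SPLIT_SIZES = {"train": 4_592_289, "validation": 3_000, "test": 3_003}

total_examples = sum(SPLIT_SIZES.values())

for name, count in SPLIT_SIZES.items():
    share = 100 * count / total_examples  # percentage of all examples
    print(f"{name:<10} {count:>9,d} {share:8.4f}%")
```

The validation and test splits together hold 6,003 examples, about 0.13% of the total, so nearly all of the data is in the training split.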

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information


@InProceedings{bojar-EtAl:2014:W14-33,
  author    = {Bojar, Ondrej  and  Buck, Christian  and  Federmann, Christian  and  Haddow, Barry  and  Koehn, Philipp  and  Leveling, Johannes  and  Monz, Christof  and  Pecina, Pavel  and  Post, Matt  and  Saint-Amand, Herve  and  Soricut, Radu  and  Specia, Lucia  and  Tamchyna, Ale\v{s}},
  title     = {Findings of the 2014 Workshop on Statistical Machine Translation},
  booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
  month     = {June},
  year      = {2014},
  address   = {Baltimore, Maryland, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {12--58},
  url       = {http://www.aclweb.org/anthology/W/W14/W14-3302}
}

Contributions

Thanks to @thomwolf, @patrickvonplaten for adding this dataset.

Mirror of https://huggingface.co/datasets/wmt_t2t