Dataset Card for "wmt_t2t"

annotations_creators: no-annotation
language_creators: found
language: de, en
license: unknown
multilinguality: translation
size_categories: 10M
source_datasets: extended|europarl_bilingual, extended|news_commentary, extended|opus_paracrawl, extended|un_multi
task_categories: translation
task_ids: (none listed)
pretty_name: WMT T2T
paperswithcode_id: (none listed)

dataset_info

config_name: de-en
features: translation (dtype: translation; languages: de, en)
splits:

name	num_bytes	num_examples
train	1385110179	4592289
validation	736415	3000
test	777334	3003

download_size: 1728762345
dataset_size: 1386623928

Dataset Description

Homepage: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/translate_ende.py
Repository: More Information Needed
Paper: More Information Needed
Point of Contact: More Information Needed
Size of downloaded dataset files: 1.73 GB
Size of the generated dataset: 1.39 GB
Total amount of disk used: 3.11 GB

Dataset Summary

The WMT EnDe Translate dataset used by the Tensor2Tensor library.

Translation dataset based on the data from statmt.org.

Versions exist for different years, each built from a combination of data sources. The base wmt builder lets you create a custom dataset by choosing your own data sources and language pair, as follows:

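import datasets  # needed for datasets.Split below
from datasets import inspect_dataset, load_dataset_builder

# Copy the dataset scripts to a local directory so they can be inspected or modified
inspect_dataset("wmt_t2t", "path/to/scripts")

# Choose the language pair and the data sources used for each split
builder = load_dataset_builder(
    "path/to/scripts/wmt_utils.py",
    language_pair=("fr", "de"),
    subsets={
        datasets.Split.TRAIN: ["commoncrawl_frde"],
        datasets.Split.VALIDATION: ["euelections_dev2019"],
    },
)

# Standard version: download and prepare the data, then load it
builder.download_and_prepare()
ds = builder.as_dataset()

# Streamable version: iterate over the data without preparing it first
ds = builder.as_streaming_dataset()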

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

de-en

Size of downloaded dataset files: 1.73 GB
Size of the generated dataset: 1.39 GB
Total amount of disk used: 3.11 GB

An example of 'validation' looks as follows.

{
    "translation": {
        "de": "Just a test sentence.",
        "en": "Just a test sentence."
    }
}

Data Fields

The data fields are the same among all splits.

de-en

translation: a multilingual string variable, with possible languages including de, en.
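
As a minimal sketch of how this field can be consumed, the snippet below flattens records into (source, target) sentence pairs. It assumes each record is a plain dictionary shaped like the sample under "Data Instances"; the helper name to_pairs and its src/tgt parameters are hypothetical and are not part of the datasets library.

```python
from typing import Dict, Iterable, List, Tuple

# One record: {"translation": {"de": "...", "en": "..."}}
Record = Dict[str, Dict[str, str]]


def to_pairs(
    records: Iterable[Record], src: str = "de", tgt: str = "en"
) -> List[Tuple[str, str]]:
    """Flatten translation records into (source, target) sentence pairs.

    Hypothetical helper for illustration only.
    """
    return [(r["translation"][src], r["translation"][tgt]) for r in records]


# The sample record shown in this card under "Data Instances".
sample = {
    "translation": {
        "de": "Just a test sentence.",
        "en": "Just a test sentence.",
    }
}

pairs = to_pairs([sample])
print(pairs)
```

Swapping src and tgt gives the en-to-de direction from the same records.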

Data Splits

name	train	validation	test
de-en	4592289	3000	3003
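
To put these counts in proportion, the sketch below computes each split's share of all examples. The numbers are copied from the table above; the variable names are illustrative only.

```python
# Example counts for the de-en configuration, taken from the table above.
split_sizes = {"train": 4592289, "validation": 3000, "test": 3003}

total = sum(split_sizes.values())

# Fraction of all examples that falls in each split.
shares = {name: count / total for name, count in split_sizes.items()}

for name, share in shares.items():
    print(f"{name}: {split_sizes[name]} examples ({share:.4%})")
```

The validation and test splits are fixed at about 3,000 examples each, so nearly all of the data is in the training split.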

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information


@InProceedings{bojar-EtAl:2014:W14-33,
  author    = {Bojar, Ondrej  and  Buck, Christian  and  Federmann, Christian  and  Haddow, Barry  and  Koehn, Philipp  and  Leveling, Johannes  and  Monz, Christof  and  Pecina, Pavel  and  Post, Matt  and  Saint-Amand, Herve  and  Soricut, Radu  and  Specia, Lucia  and  Tamchyna, Ale\v{s}},
  title     = {Findings of the 2014 Workshop on Statistical Machine Translation},
  booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
  month     = {June},
  year      = {2014},
  address   = {Baltimore, Maryland, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {12--58},
  url       = {http://www.aclweb.org/anthology/W/W14/W14-3302}
}

Contributions

Thanks to @thomwolf, @patrickvonplaten for adding this dataset.
