annotations_creators:
language_creators:
language:
license:
multilinguality:
size_categories:
source_datasets: extended|europarl_bilingual, extended|news_commentary, extended|opus_paracrawl, extended|un_multi
task_categories:
task_ids:
pretty_name: WMT T2T
paperswithcode_id:
dataset_info:
  config_name: de-en
  features:
  splits:
    name | num_bytes | num_examples |
    train | 1385110179 | 4592289 |
    validation | 736415 | 3000 |
    test | 777334 | 3003 |
  download_size: 1728762345
  dataset_size: 1386623928
Dataset Card for "wmt_t2t"
Dataset Description
Dataset Summary
The WMT EnDe Translate dataset used by the Tensor2Tensor library.
It is a translation dataset based on the data from statmt.org.
Versions exist for different years, each using a combination of data sources. The base wmt builder lets you create a custom dataset by choosing your own data sources and language pair. This can be done as follows:
import datasets  # needed for datasets.Split below
from datasets import inspect_dataset, load_dataset_builder

# Copy the dataset's loading scripts to a local directory.
inspect_dataset("wmt_t2t", "path/to/scripts")

# Build a custom language pair from the local wmt_utils.py script.
builder = load_dataset_builder(
    "path/to/scripts/wmt_utils.py",
    language_pair=("fr", "de"),
    subsets={
        datasets.Split.TRAIN: ["commoncrawl_frde"],
        datasets.Split.VALIDATION: ["euelections_dev2019"],
    },
)

# Standard version
builder.download_and_prepare()
ds = builder.as_dataset()

# Streamable version
ds = builder.as_streaming_dataset()
Supported Tasks and Leaderboards
More Information Needed
Languages
More Information Needed
Dataset Structure
Data Instances
de-en
- Size of downloaded dataset files: 1.73 GB
- Size of the generated dataset: 1.39 GB
- Total amount of disk used: 3.11 GB
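The rounded sizes above can be cross-checked against the raw byte counts recorded in the card's metadata (1728762345 bytes downloaded, 1386623928 bytes generated). The following is a minimal sketch, assuming decimal gigabytes (1 GB = 10**9 bytes); the helper name to_gb is hypothetical and not part of any library.

```python
# Byte counts taken from the card's metadata for the de-en config.
DOWNLOAD_SIZE_BYTES = 1_728_762_345
DATASET_SIZE_BYTES = 1_386_623_928


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (assumption: 1 GB = 10**9 bytes)."""
    return num_bytes / 10**9


# Total disk use is the downloaded files plus the generated dataset.
total_bytes = DOWNLOAD_SIZE_BYTES + DATASET_SIZE_BYTES

print(f"downloaded: {to_gb(DOWNLOAD_SIZE_BYTES):.2f} GB")
print(f"generated:  {to_gb(DATASET_SIZE_BYTES):.2f} GB")
print(f"total:      {to_gb(total_bytes):.3f} GB")
```

Under this assumption the sum comes to roughly 3.115 GB, which agrees with the listed total up to rounding.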
An example of 'validation' looks as follows.
{
    "translation": {
        "de": "Just a test sentence.",
        "en": "Just a test sentence."
    }
}
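As an illustration of how a record with this shape might be consumed, the sketch below pulls a (source, target) pair out of the nested translation dictionary. It works on a plain Python dict that mirrors the example, so it does not depend on the datasets library; the helper name to_pair and its default language arguments are assumptions made for this example.

```python
from typing import Dict, Tuple

# A record shaped like the validation example in this card.
record: Dict[str, Dict[str, str]] = {
    "translation": {
        "de": "Just a test sentence.",
        "en": "Just a test sentence.",
    }
}


def to_pair(
    example: Dict[str, Dict[str, str]],
    source_lang: str = "de",
    target_lang: str = "en",
) -> Tuple[str, str]:
    """Return (source, target) text from a record's 'translation' field."""
    translation = example["translation"]
    return translation[source_lang], translation[target_lang]


source, target = to_pair(record)
print(f"{source} -> {target}")
```

Swapping the two language arguments gives the en-to-de direction from the same record.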
Data Fields
The data fields are the same among all splits.
de-en
- translation: a multilingual string variable, with possible languages including de, en.
Data Splits
name | train | validation | test |
de-en | 4592289 | 3000 | 3003 |
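To put the split sizes in proportion, the short sketch below computes each split's share of all examples. The counts are copied from the table above; the variable names are chosen for this example only.

```python
# Example counts for the de-en config, copied from the splits table.
SPLIT_SIZES = {"train": 4_592_289, "validation": 3_000, "test": 3_003}

total_examples = sum(SPLIT_SIZES.values())

# Share of all examples held by each split.
shares = {name: count / total_examples for name, count in SPLIT_SIZES.items()}

for name, count in SPLIT_SIZES.items():
    print(f"{name:<10} {count:>9,} {shares[name]:.4%}")
```

The training split holds well over 99% of the examples, so the validation and test splits are small held-out sets by comparison.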
Dataset Creation
Curation Rationale
More Information Needed
Source Data
Initial Data Collection and Normalization
More Information Needed
Who are the source language producers?
More Information Needed
Annotations
Annotation process
More Information Needed
Who are the annotators?
More Information Needed
Personal and Sensitive Information
More Information Needed
Considerations for Using the Data
Social Impact of Dataset
More Information Needed
Discussion of Biases
More Information Needed
Other Known Limitations
More Information Needed
Additional Information
Dataset Curators
More Information Needed
Licensing Information
More Information Needed
Citation Information
@InProceedings{bojar-EtAl:2014:W14-33,
  author    = {Bojar, Ondrej and Buck, Christian and Federmann, Christian and Haddow, Barry and Koehn, Philipp and Leveling, Johannes and Monz, Christof and Pecina, Pavel and Post, Matt and Saint-Amand, Herve and Soricut, Radu and Specia, Lucia and Tamchyna, Ale\v{s}},
  title     = {Findings of the 2014 Workshop on Statistical Machine Translation},
  booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
  month     = {June},
  year      = {2014},
  address   = {Baltimore, Maryland, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {12--58},
  url       = {http://www.aclweb.org/anthology/W/W14/W14-3302}
}
Contributions
Thanks to @thomwolf, @patrickvonplaten for adding this dataset.