---
annotations_creators:
- expert-generated
language_creators:
- found
language:
- ar
- de
- el
- en
- es
- hi
- ru
- th
- tr
- vi
- zh
license:
- cc-by-sa-4.0
multilinguality:
- multilingual
size_categories:
- 1K<n<10K
source_datasets:
- extended|squad
- extended|xquad
task_categories:
- question-answering
task_ids:
- extractive-qa
paperswithcode_id: xquad-r
pretty_name: LAReQA
dataset_info:
- config_name: ar
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: validation
    num_bytes: 1722799
    num_examples: 1190
  download_size: 17863417
  dataset_size: 1722799
- config_name: de
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: validation
    num_bytes: 1283301
    num_examples: 1190
  download_size: 17863417
  dataset_size: 1283301
- config_name: zh
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: validation
    num_bytes: 984241
    num_examples: 1190
  download_size: 17863417
  dataset_size: 984241
- config_name: vi
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: validation
    num_bytes: 1477239
    num_examples: 1190
  download_size: 17863417
  dataset_size: 1477239
- config_name: en
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: validation
    num_bytes: 1116123
    num_examples: 1190
  download_size: 17863417
  dataset_size: 1116123
- config_name: es
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: validation
    num_bytes: 1273499
    num_examples: 1190
  download_size: 17863417
  dataset_size: 1273499
- config_name: hi
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: validation
    num_bytes: 2682975
    num_examples: 1190
  download_size: 17863417
  dataset_size: 2682975
- config_name: el
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: validation
    num_bytes: 2206690
    num_examples: 1190
  download_size: 17863417
  dataset_size: 2206690
- config_name: th
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: validation
    num_bytes: 2854959
    num_examples: 1190
  download_size: 17863417
  dataset_size: 2854959
- config_name: tr
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: validation
    num_bytes: 1210763
    num_examples: 1190
  download_size: 17863417
  dataset_size: 1210763
- config_name: ru
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: validation
    num_bytes: 2136990
    num_examples: 1190
  download_size: 17863417
  dataset_size: 2136990
config_names:
- ar
- de
- el
- en
- es
- hi
- ru
- th
- tr
- vi
- zh
---

# Dataset Card for XQuAD-R

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [LAReQA](https://github.com/google-research-datasets/lareqa)
- **Repository:** [XQuAD-R](https://github.com/google-research-datasets/lareqa)
- **Paper:** [LAReQA: Language-agnostic answer retrieval from a multilingual pool](https://arxiv.org/pdf/2004.05484.pdf)
- **Point of Contact:** [Noah Constant](mailto:nconstant@google.com)

### Dataset Summary

XQuAD-R is a retrieval version of the XQuAD dataset (a cross-lingual extractive QA dataset). Like XQuAD, XQuAD-R is an 11-way parallel dataset, where each question appears in 11 different languages and has 11 parallel correct answers across the languages.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

The dataset is available in the following languages:

* Arabic: `xquad-r/ar.json`
* German: `xquad-r/de.json`
* Greek: `xquad-r/el.json`
* English: `xquad-r/en.json`
* Spanish: `xquad-r/es.json`
* Hindi: `xquad-r/hi.json`
* Russian: `xquad-r/ru.json`
* Thai: `xquad-r/th.json`
* Turkish: `xquad-r/tr.json`
* Vietnamese: `xquad-r/vi.json`
* Chinese: `xquad-r/zh.json`
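
Each language file above follows a single naming pattern, and each language is exposed as a separate config. As a minimal sketch, the helper below builds the per-language file paths; the `load_dataset` call (assuming the Hugging Face `datasets` library is installed) is left commented out because it downloads the data:

```python
# The 11 language configs listed above.
LANGUAGES = ["ar", "de", "el", "en", "es", "hi", "ru", "th", "tr", "vi", "zh"]

def json_path(lang: str) -> str:
    """Return the repository path of the JSON file for one language."""
    if lang not in LANGUAGES:
        raise ValueError(f"unknown language code: {lang!r}")
    return f"xquad-r/{lang}.json"

# Loading one config via the Hugging Face `datasets` library
# (commented out here since it downloads ~17 MB):
# from datasets import load_dataset
# ds = load_dataset("xquad_r", "en", split="validation")
```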

## Dataset Structure

[More Information Needed]

### Data Instances

An example from the `en` config:

```
{'id': '56beb4343aeaaa14008c925b',
 'context': "The Panthers defense gave up just 308 points, ranking sixth in the league, while also leading the NFL in interceptions with 24 and boasting four Pro Bowl selections. Pro Bowl defensive tackle Kawann Short led the team in sacks with 11, while also forcing three fumbles and recovering two. Fellow lineman Mario Addison added 6½ sacks. The Panthers line also featured veteran defensive end Jared Allen, a 5-time pro bowler who was the NFL's active career sack leader with 136, along with defensive end Kony Ealy, who had 5 sacks in just 9 starts. Behind them, two of the Panthers three starting linebackers were also selected to play in the Pro Bowl: Thomas Davis and Luke Kuechly. Davis compiled 5½ sacks, four forced fumbles, and four interceptions, while Kuechly led the team in tackles (118) forced two fumbles, and intercepted four passes of his own. Carolina's secondary featured Pro Bowl safety Kurt Coleman, who led the team with a career high seven interceptions, while also racking up 88 tackles and Pro Bowl cornerback Josh Norman, who developed into a shutdown corner during the season and had four interceptions, two of which were returned for touchdowns.",
 'question': 'How many points did the Panthers defense surrender?',
 'answers': {'text': ['308'], 'answer_start': [34]}}
```
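
The `answer_start` offsets index directly into `context`. A minimal sketch, using the record above (with the context truncated to its relevant prefix for brevity), of how each answer string can be recovered by slicing:

```python
# The `en` example from above, with `context` truncated for brevity.
example = {
    "id": "56beb4343aeaaa14008c925b",
    "context": "The Panthers defense gave up just 308 points, ranking sixth in the league,",
    "question": "How many points did the Panthers defense surrender?",
    "answers": {"text": ["308"], "answer_start": [34]},
}

def answer_spans(record):
    """Recover each answer by slicing the context at its answer_start offset."""
    ans = record["answers"]
    return [
        record["context"][start:start + len(text)]
        for text, start in zip(ans["text"], ans["answer_start"])
    ]
```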

### Data Fields

- `id` (`str`): Unique ID for the context-question pair.
- `context` (`str`): Context for the question.
- `question` (`str`): Question.
- `answers` (`dict`): Answers with the following keys:
  - `text` (`list` of `str`): Texts of the answers.
  - `answer_start` (`list` of `int`): Start positions for every answer text.
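
As a hypothetical helper (not part of the dataset loader), a record can be checked against this field layout like so:

```python
def validate_record(record: dict) -> bool:
    """Check that a record matches the field layout described above."""
    # id, context, and question must all be strings.
    if not all(isinstance(record.get(k), str) for k in ("id", "context", "question")):
        return False
    # answers must be a dict with parallel text / answer_start lists.
    answers = record.get("answers")
    if not isinstance(answers, dict):
        return False
    texts = answers.get("text")
    starts = answers.get("answer_start")
    return (
        isinstance(texts, list)
        and isinstance(starts, list)
        and len(texts) == len(starts)
        and all(isinstance(t, str) for t in texts)
        and all(isinstance(s, int) for s in starts)
    )
```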

### Data Splits

The number of questions and candidate sentences for each language in XQuAD-R is shown in the table below:

| language | questions | candidates |
|----------|-----------|------------|
| ar       | 1190      | 1222       |
| de       | 1190      | 1276       |
| el       | 1190      | 1234       |
| en       | 1190      | 1180       |
| es       | 1190      | 1215       |
| hi       | 1190      | 1244       |
| ru       | 1190      | 1219       |
| th       | 1190      | 852        |
| tr       | 1190      | 1167       |
| vi       | 1190      | 1209       |
| zh       | 1190      | 1196       |

## Dataset Creation

[More Information Needed]

### Curation Rationale

[More Information Needed]

### Source Data

[More Information Needed]

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

[More Information Needed]

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

[More Information Needed]

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

[More Information Needed]

### Dataset Curators

The dataset was initially created by Uma Roy, Noah Constant, Rami Al-Rfou, Aditya Barua, Aaron Phillips, and Yinfei Yang during work done at Google Research.

### Licensing Information

XQuAD-R is distributed under the [CC BY-SA 4.0 license](https://creativecommons.org/licenses/by-sa/4.0/legalcode).

### Citation Information

```
@article{roy2020lareqa,
  title={LAReQA: Language-agnostic answer retrieval from a multilingual pool},
  author={Roy, Uma and Constant, Noah and Al-Rfou, Rami and Barua, Aditya and Phillips, Aaron and Yang, Yinfei},
  journal={arXiv preprint arXiv:2004.05484},
  year={2020}
}
```

### Contributions

Thanks to [@manandey](https://github.com/manandey) for adding this dataset.


Mirror of https://huggingface.co/datasets/xquad_r