Hugging Face Datasets Mirror/WebInstructSub

README
language: en
license: apache-2.0
size_categories: 1M<n<10M
task_categories: question-answering
pretty_name: WebInstruct
dataset_info:
  features:
    - name: orig_question, dtype: string
    - name: orig_answer, dtype: string
    - name: question, dtype: string
    - name: answer, dtype: string
    - name: source, dtype: string
    - name: index, dtype: int64
  splits:
    - name: train, num_bytes: 6215888891, num_examples: 2335220
  download_size: 3509803840
  dataset_size: 6215888891
tags: language model
configs:
  - config_name: default
    data_files:
      - split: train, path: data/train-*
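The metadata above can be turned into a quick schema check before training. The following is a minimal sketch, not an official loader: it assumes the third-party `datasets` package is installed and that the repository id from the mirror URL on this page (`TIGER-Lab/WebInstructSub`) is still valid. The helper names `missing_columns` and `load_train_split` are hypothetical.

```python
from typing import Iterable, Set

# Column names as listed under dataset_info.features in the metadata.
EXPECTED_COLUMNS = (
    "orig_question",
    "orig_answer",
    "question",
    "answer",
    "source",
    "index",
)

# Repository id taken from the mirror URL on this page (assumed still valid).
REPO_ID = "TIGER-Lab/WebInstructSub"


def missing_columns(column_names: Iterable[str]) -> Set[str]:
    """Return the expected columns that are absent from column_names."""
    return set(EXPECTED_COLUMNS) - set(column_names)


def load_train_split(streaming: bool = True):
    """Load the single "train" split named in the metadata.

    Needs the `datasets` package and network access. Streaming avoids
    downloading the whole split before the first record is read.
    """
    # Imported lazily so missing_columns() works without the package.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train", streaming=streaming)
```

A typical use would be to call `load_train_split()`, read the first record, and pass its keys to `missing_columns` to confirm that the schema matches the card.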

🦣 MAmmoTH2: Scaling Instructions from the Web

Project Page: https://tiger-ai-lab.github.io/MAmmoTH2/

Paper: https://arxiv.org/pdf/2405.03548

Code: https://github.com/TIGER-AI-Lab/MAmmoTH2

WebInstruct (Subset)

This repo contains part of the dataset used in "MAmmoTH2: Scaling Instructions from the Web". The subset comes mostly from forums such as Stack Exchange. It is intended as high-quality data for boosting LLM performance through instruction tuning.

License

  • For the data from "mathstackexchange" and "stackexchange", we use the Apache-2.0 license. You are free to share and adapt it for any purpose.
  • For the data from "socratic", we use CC BY-NC 4.0 license according to https://socratic.org/terms. You are free to share and adapt, but only for non-commercial purposes.
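Because the license depends on the source, a commercial project may want to drop the non-commercial rows first. The sketch below shows one way to do that. It assumes the `source` column holds exactly the lowercase strings quoted in the list above. The function name and the sample records are made up for illustration.

```python
from typing import Dict, Iterable, List

# License per source, as stated in the list above. Assumption: the `source`
# column uses exactly these lowercase strings.
LICENSE_BY_SOURCE: Dict[str, str] = {
    "mathstackexchange": "Apache-2.0",
    "stackexchange": "Apache-2.0",
    "socratic": "CC BY-NC 4.0",
}


def commercially_usable(records: Iterable[dict]) -> List[dict]:
    """Keep only records whose source is known to be Apache-2.0 licensed.

    Records with an unknown or missing source are dropped, which is the
    conservative choice when the license cannot be determined.
    """
    return [
        r for r in records
        if LICENSE_BY_SOURCE.get(r.get("source")) == "Apache-2.0"
    ]


# Hypothetical records, not taken from the dataset.
sample = [
    {"index": 0, "source": "mathstackexchange"},
    {"index": 1, "source": "socratic"},
    {"index": 2, "source": "stackexchange"},
    {"index": 3, "source": "unknown-forum"},
]
kept = commercially_usable(sample)
```

Dropping unknown sources rather than keeping them is a deliberate design choice here. Flip the condition if a different policy fits better.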

Fields in our dataset

The fields "orig_question" and "orig_answer" hold the question-answer pairs extracted from the recalled documents. The fields "question" and "answer" hold the refined versions of those extracted pairs.
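For instruction tuning, the refined pair is usually the one to train on, with the original pair kept for comparison. The following sketch picks one or the other from a record. The function name and the example record are hypothetical, and the record is invented rather than copied from the dataset.

```python
from typing import Dict, Tuple


def to_instruction_pair(
    record: Dict[str, object], prefer_refined: bool = True
) -> Tuple[str, str]:
    """Return (instruction, response) from one dataset record.

    prefer_refined=True reads "question"/"answer"; otherwise the
    extracted "orig_question"/"orig_answer" fields are used.
    """
    if prefer_refined:
        q_key, a_key = "question", "answer"
    else:
        q_key, a_key = "orig_question", "orig_answer"
    return str(record[q_key]).strip(), str(record[a_key]).strip()


# Invented example record with the six fields listed in the metadata.
example = {
    "orig_question": "whats the derivative of x^2 ??",
    "orig_answer": "2x",
    "question": "What is the derivative of x^2 with respect to x?",
    "answer": "By the power rule, the derivative is 2x.",
    "source": "mathstackexchange",
    "index": 0,
}

refined_pair = to_instruction_pair(example)
original_pair = to_instruction_pair(example, prefer_refined=False)
```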

Regarding the data source:

  1. mathstackexchange: https://math.stackexchange.com/.
  2. stackexchange: including https://physics.stackexchange.com/, https://biology.stackexchange.com/, https://chemistry.stackexchange.com/, https://cs.stackexchange.com/.
  3. Socratic: the data is originally from https://socratic.org/.

Size of different sources

Domain                 Size      Subjects
MathStackExchange      1484630   Mathematics
ScienceStackExchange   317209    Physics, Biology, Chemistry, Computer Science
Socratic               533384    Mathematics, Science, Humanities
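The per-source sizes can be summed to see the overall mix. This is a small sketch using only the numbers from the table above. The total comes to 2335223, which is 3 more than the num_examples value of 2335220 given in the card metadata; the source does not explain the difference.

```python
# Row counts copied from the table above.
SOURCE_SIZES = {
    "MathStackExchange": 1484630,
    "ScienceStackExchange": 317209,
    "Socratic": 533384,
}

total = sum(SOURCE_SIZES.values())

# Fraction of the subset contributed by each source.
shares = {name: size / total for name, size in SOURCE_SIZES.items()}

for name, size in SOURCE_SIZES.items():
    print(f"{name}: {size} rows ({shares[name]:.1%})")
print(f"total: {total} rows")
```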

Dataset Construction

We propose discovering instruction data from the web. We argue that vast amounts of high-quality instruction data exist in the web corpus, spanning domains such as math and science. Our three-step pipeline recalls documents from Common Crawl, extracts Q-A pairs from them, and refines the pairs for quality. This approach yields 10 million instruction-response pairs, offering a scalable alternative to existing datasets. We name the curated dataset WebInstruct.
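The control flow of the three steps (recall, extract, refine) can be sketched as a small function that chains three callables. This shows only the shape of such a pipeline, not the authors' implementation. The `toy_*` functions are trivial string-based stand-ins invented for this example, whereas the real steps are described in the paper.

```python
from typing import Callable, Iterable, List, Optional, Tuple

QAPair = Tuple[str, str]


def run_pipeline(
    documents: Iterable[str],
    recall: Callable[[str], bool],
    extract: Callable[[str], List[QAPair]],
    refine: Callable[[QAPair], Optional[QAPair]],
) -> List[QAPair]:
    """Chain the three steps: keep recalled docs, extract pairs, refine them."""
    pairs: List[QAPair] = []
    for doc in documents:
        if not recall(doc):
            continue  # step 1: document not recalled
        for pair in extract(doc):  # step 2: candidate Q-A pairs
            refined = refine(pair)  # step 3: clean up or reject
            if refined is not None:
                pairs.append(refined)
    return pairs


# Toy stand-ins (hypothetical): real steps would be far more involved.
def toy_recall(doc: str) -> bool:
    return "Q:" in doc and "A:" in doc


def toy_extract(doc: str) -> List[QAPair]:
    found: List[QAPair] = []
    for block in doc.split("Q:")[1:]:
        if "A:" in block:
            question, answer = block.split("A:", 1)
            found.append((question, answer))
    return found


def toy_refine(pair: QAPair) -> Optional[QAPair]:
    # Collapse whitespace; reject pairs with an empty side.
    question, answer = (" ".join(part.split()) for part in pair)
    return (question, answer) if question and answer else None


docs = ["Q: What is 2+2?\nA:  4 ", "an unrelated page", "Q: Empty? A:"]
result = run_pipeline(docs, toy_recall, toy_extract, toy_refine)
```

Passing the steps in as callables keeps the loop independent of how each step is implemented, so a stand-in can be swapped for a real component without touching `run_pipeline`.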

Project Framework

Citation

@article{yue2024mammoth2,
  title={MAmmoTH2: Scaling Instructions from the Web},
  author={Yue, Xiang and Zheng, Tuney and Zhang, Ge and Chen, Wenhu},
  journal={arXiv preprint arXiv:2405.03548},
  year={2024}
}

About

Mirror of https://huggingface.co/datasets/TIGER-Lab/WebInstructSub