🦣 MAmmoTH2: Scaling Instructions from the Web

Project Page: https://tiger-ai-lab.github.io/MAmmoTH2/

Paper: https://arxiv.org/pdf/2405.03548

Code: https://github.com/TIGER-AI-Lab/MAmmoTH2

WebInstruct (Subset)

This repo contains the partial dataset used in "MAmmoTH2: Scaling Instructions from the Web". The data in this subset comes mostly from forums such as StackExchange. It is a high-quality subset intended to boost LLM performance through instruction tuning.

License

For the data from "mathstackexchange" and "stackexchange", we use the Apache-2.0 license. You are free to share and adapt it for any purpose.
For the data from "socratic", we use the CC BY-NC 4.0 license, in accordance with https://socratic.org/terms. You are free to share and adapt it, but only for non-commercial purposes.
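
The licensing split above can be sketched as a small lookup that separates rows usable commercially from those that are not. This is only an illustration: the source labels and license names are taken from this section, while the "source" column name, the helper function, and the sample rows are assumptions made for the example.

```python
# Source labels and license names are taken from the License section above.
LICENSE_BY_SOURCE = {
    "mathstackexchange": "Apache-2.0",
    "stackexchange": "Apache-2.0",
    "socratic": "CC BY-NC 4.0",
}


def allows_commercial_use(source: str) -> bool:
    """Return True if the license attached to this source permits commercial use."""
    license_name = LICENSE_BY_SOURCE[source.lower()]  # KeyError for unknown sources
    return license_name != "CC BY-NC 4.0"


# Hypothetical rows; the "source" key is an assumed column name.
rows = [
    {"source": "mathstackexchange", "question": "..."},
    {"source": "socratic", "question": "..."},
    {"source": "stackexchange", "question": "..."},
]

# Keep only rows whose license allows commercial use.
commercial_rows = [row for row in rows if allows_commercial_use(row["source"])]
```

Raising a KeyError for an unknown source is a deliberate choice in this sketch: a row with an unrecognized origin should not silently pass a license filter.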

Fields in our dataset

The fields "orig_question" and "orig_answer" are the question-answer pairs extracted from the recalled documents. The fields "question" and "answer" are the refined versions of the extracted pairs.
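
As a rough illustration of how these four fields relate, the sketch below models a single record and turns it into a prompt/response pair for instruction tuning. Only the four field names come from the description above. The WebInstructRecord class, the to_training_example helper, and the sample text are hypothetical and are not part of any released tooling.

```python
from dataclasses import dataclass


@dataclass
class WebInstructRecord:
    """One row, using the four field names described above."""
    orig_question: str  # question as extracted from the recalled document
    orig_answer: str    # answer as extracted from the recalled document
    question: str       # refined question
    answer: str         # refined answer


def to_training_example(record: WebInstructRecord, use_refined: bool = True) -> dict:
    """Build a prompt/response pair from a record (hypothetical helper)."""
    if use_refined:
        return {"prompt": record.question.strip(), "response": record.answer.strip()}
    return {"prompt": record.orig_question.strip(), "response": record.orig_answer.strip()}


# Made-up sample row, for illustration only.
sample = WebInstructRecord(
    orig_question="whats derivative of x^2 ??",
    orig_answer="2x",
    question="What is the derivative of x^2 with respect to x?",
    answer="By the power rule, d/dx x^2 = 2x.",
)
example = to_training_example(sample)
```

Keeping both the original and refined pairs in one record makes it easy to compare them or to fall back to the extracted text by passing use_refined=False.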

Regarding the data source:

mathstackexchange: https://math.stackexchange.com/
stackexchange: includes https://physics.stackexchange.com/, https://biology.stackexchange.com/, https://chemistry.stackexchange.com/, and https://cs.stackexchange.com/
socratic: the data is originally from https://socratic.org/

Size of different sources

Domain	Size	Subjects
MathStackExchange	1484630	Mathematics
ScienceStackExchange	317209	Physics, Biology, Chemistry, Computer Science
Socratic	533384	Mathematics, Science, Humanities
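
To put the table in proportion, the short sketch below sums the three sizes and prints each source's share of the subset. The numbers are copied from the table; the variable names are just for illustration.

```python
# Sizes copied from the table above.
SIZES = {
    "MathStackExchange": 1_484_630,
    "ScienceStackExchange": 317_209,
    "Socratic": 533_384,
}

total = sum(SIZES.values())
shares = {domain: size / total for domain, size in SIZES.items()}

for domain, share in shares.items():
    print(f"{domain}: {SIZES[domain]:,} rows ({share:.1%})")
print(f"Total: {total:,} rows")
```

With the figures in the table, the three sources add up to 2,335,223 question-answer pairs, and MathStackExchange is the largest contributor.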

Dataset Construction

We propose discovering instruction data from the web. We argue that vast amounts of high-quality instruction data exist in the web corpus, spanning domains such as math and science. Our three-step pipeline recalls documents from Common Crawl, extracts Q-A pairs from them, and refines the pairs for quality. This approach yields 10 million instruction-response pairs, offering a scalable alternative to existing datasets. We name our curated dataset WebInstruct.
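
The recall, extract, refine flow described above can be sketched as three composed functions. This is a toy illustration only: this card does not describe the models or rules used at each step, so a keyword filter, a regular expression, and whitespace normalization stand in for them here. All function names and the sample corpus are hypothetical.

```python
import re
from typing import Optional


def recall(documents: list, keywords: tuple = ("question", "answer")) -> list:
    """Step 1 stand-in: keep documents that look like they contain Q-A content."""
    return [doc for doc in documents if all(k in doc.lower() for k in keywords)]


# Toy pattern; real web documents would need far more robust extraction.
QA_PATTERN = re.compile(r"Question:\s*(.+?)\s*Answer:\s*(.+)", re.IGNORECASE | re.DOTALL)


def extract(document: str) -> Optional[dict]:
    """Step 2 stand-in: pull one Q-A pair out of a recalled document."""
    match = QA_PATTERN.search(document)
    if match is None:
        return None
    return {"orig_question": match.group(1), "orig_answer": match.group(2)}


def refine(pair: dict) -> dict:
    """Step 3 stand-in: normalize whitespace to produce the refined fields."""
    def clean(text: str) -> str:
        return " ".join(text.split())

    return {
        **pair,
        "question": clean(pair["orig_question"]),
        "answer": clean(pair["orig_answer"]),
    }


def run_pipeline(documents: list) -> list:
    """Chain the three steps: recall, then extract, then refine."""
    extracted = [extract(doc) for doc in recall(documents)]
    return [refine(pair) for pair in extracted if pair is not None]


# Made-up two-document corpus: one Q-A page and one unrelated page.
corpus = [
    "Question:  What is 2 + 2?\nAnswer:  It is   4.",
    "Buy cheap shoes online today!",
]
results = run_pipeline(corpus)
```

Each output row carries both the extracted pair and its refined version, mirroring the orig_question/orig_answer and question/answer fields of the dataset.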

[Figure: Project Framework]

Citation

@article{yue2024mammoth2,
  title={MAmmoTH2: Scaling Instructions from the Web},
  author={Yue, Xiang and Zheng, Tuney and Zhang, Ge and Chen, Wenhu},
  journal={arXiv preprint arXiv:2405.03548},
  year={2024}
}
