# Nemotron-VLM-Dataset-v2
**Repository Path**: hf-datasets/Nemotron-VLM-Dataset-v2
## Basic Information
- **Project Name**: Nemotron-VLM-Dataset-v2
- **Description**: Mirror of https://huggingface.co/datasets/nvidia/Nemotron-VLM-Dataset-v2
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-11-06
- **Last Updated**: 2025-11-06
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
---
license: cc-by-4.0
task_categories:
- visual-question-answering
- image-text-to-text
- video-text-to-text
pretty_name: Nemotron-VLM-Dataset v2
size_categories:
- n>1T
dataset_info:
  features:
  - name: id
    dtype: string
  - name: messages
    sequence:
      struct:
      - name: role
        dtype: string
      - name: content
        sequence:
          struct:
          - name: type
            dtype: string
          - name: text
            dtype: string
          - name: image
            dtype: string
          - name: video
            dtype: string
          - name: audio
            dtype: string
          - name: metadata
            struct:
            - name: pdf
              dtype: string
            - name: page_number
              dtype: int32
            - name: url
              dtype: string
            - name: yt
              dtype: string
            - name: width
              dtype: int32
            - name: height
              dtype: int32
            - name: format
              dtype: string
            - name: mode
              dtype: string
            - name: video_duration
              dtype: float32
            - name: video_num_frames
              dtype: int32
            - name: video_fps
              dtype: float32
            - name: video_width
              dtype: int32
            - name: video_height
              dtype: int32
            # - name: audio_duration
            #   dtype: float32
            # - name: audio_channels
            #   dtype: int32
            # - name: audio_sample_rate
            #   dtype: int32
configs:
- config_name: wiki_de
  data_files:
  - split: train
    path: wiki_de/wiki_de.jsonl
- config_name: wiki_en
  data_files:
  - split: train
    path: wiki_en/wiki_en.jsonl
- config_name: wiki_es
  data_files:
  - split: train
    path: wiki_es/wiki_es.jsonl
- config_name: wiki_fr
  data_files:
  - split: train
    path: wiki_fr/wiki_fr.jsonl
- config_name: wiki_it
  data_files:
  - split: train
    path: wiki_it/wiki_it.jsonl
- config_name: wiki_ja
  data_files:
  - split: train
    path: wiki_ja/wiki_ja.jsonl
- config_name: wiki_ko
  data_files:
  - split: train
    path: wiki_ko/wiki_ko.jsonl
- config_name: wiki_nl
  data_files:
  - split: train
    path: wiki_nl/wiki_nl.jsonl
- config_name: wiki_pt
  data_files:
  - split: train
    path: wiki_pt/wiki_pt.jsonl
- config_name: wiki_zh
  data_files:
  - split: train
    path: wiki_zh/wiki_zh.jsonl
- config_name: oi_bbox_3
  data_files:
  - split: train
    path: oi_bbox_3/oi_bbox_3.jsonl
- config_name: tabmwp_cot
  data_files:
  - split: train
    path: tabmwp_cot/tabmwp_cot.jsonl
- config_name: sparsetables
  data_files:
  - split: train
    path: sparsetables/sparsetables.jsonl
- config_name: mulberry_cot_1
  data_files:
  - split: train
    path: mulberry_cot_1/mulberry_cot_1.jsonl
- config_name: mulberry_cot_2
  data_files:
  - split: train
    path: mulberry_cot_2/mulberry_cot_2.jsonl
- config_name: llava_cot_100k
  data_files:
  - split: train
    path: llava_cot_100k/llava_cot_100k.jsonl
- config_name: geomverse_cot
  data_files:
  - split: train
    path: geomverse_cot/geomverse_cot.jsonl
- config_name: mapqa_cot
  data_files:
  - split: train
    path: mapqa_cot/mapqa_cot.jsonl
- config_name: plotqa_cot
  data_files:
  - split: train
    path: plotqa_cot/plotqa_cot.jsonl
- config_name: visual7w_telling_cot
  data_files:
  - split: train
    path: visual7w_telling_cot/visual7w_telling_cot.jsonl
- config_name: visual_web_instruct_cot
  data_files:
  - split: train
    path: visual_web_instruct_cot/visual_web_instruct_cot.jsonl
- config_name: docvqa_cot
  data_files:
  - split: train
    path: docvqa_cot/docvqa_cot.jsonl
- config_name: chartqa_cot
  data_files:
  - split: train
    path: chartqa_cot/chartqa_cot.jsonl
- config_name: infographicsvqa_cot
  data_files:
  - split: train
    path: infographicsvqa_cot/infographicsvqa_cot.jsonl
- config_name: unigeo_cot
  data_files:
  - split: train
    path: unigeo_cot/unigeo_cot.jsonl
- config_name: nights_cot
  data_files:
  - split: train
    path: nights_cot/nights_cot.jsonl
- config_name: mantis_instruct_cot
  data_files:
  - split: train
    path: mantis_instruct_cot/mantis_instruct_cot.jsonl
- config_name: fintabnet_cot
  data_files:
  - split: train
    path: fintabnet_cot/fintabnet_cot.jsonl
- config_name: hiertext
  data_files:
  - split: train
    path: hiertext/hiertext.jsonl
- config_name: nextqa
  data_files:
  - split: train
    path: nextqa/nextqa.jsonl
- config_name: clevrer
  data_files:
  - split: train
    path: clevrer/clevrer.jsonl
- config_name: ego_exo_learn
  data_files:
  - split: train
    path: ego_exo_learn/ego_exo_learn.jsonl
- config_name: kinetics_k710
  data_files:
  - split: train
    path: kinetics_k710/kinetics_k710.jsonl
- config_name: perception_test_1
  data_files:
  - split: train
    path: perception_test_1/perception_test_1.jsonl
- config_name: perception_test_2
  data_files:
  - split: train
    path: perception_test_2/perception_test_2.jsonl
- config_name: activity_net_1
  data_files:
  - split: train
    path: activity_net_1/activity_net_1.jsonl
- config_name: hacs
  data_files:
  - split: train
    path: hacs/hacs.jsonl
- config_name: hirest_1
  data_files:
  - split: train
    path: hirest_1/hirest_1.jsonl
- config_name: activity_net_2
  data_files:
  - split: train
    path: activity_net_2/activity_net_2.jsonl
- config_name: hirest_2
  data_files:
  - split: train
    path: hirest_2/hirest_2.jsonl
- config_name: youcook2_1
  data_files:
  - split: train
    path: youcook2_1/youcook2_1.jsonl
- config_name: youcook2_2
  data_files:
  - split: train
    path: youcook2_2/youcook2_2.jsonl
- config_name: breakfast_actions
  data_files:
  - split: train
    path: breakfast_actions/breakfast_actions.jsonl
- config_name: ccpdf_multipage_1
  data_files:
  - split: train
    path: ccpdf_multipage_1/ccpdf_multipage_1.jsonl
- config_name: ccpdf_multipage_2
  data_files:
  - split: train
    path: ccpdf_multipage_2/ccpdf_multipage_2.jsonl
- config_name: perception_test_cot
  data_files:
  - split: train
    path: perception_test_cot/perception_test_cot.jsonl
---
# Nemotron-VLM-Dataset v2
## Versions
| Date | Commit | Changes |
|-------------|--------------|----------|
| **2025-11-05** | [head](https://huggingface.co/datasets/nvidia/Nemotron-VLM-Dataset-v2/tree/main) | Fix `nights_cot` dataset. Fix/filter broken `<think>` entries. Update fintabnet instructions. Update indexes. |
| **2025-10-28** | [214051e](https://huggingface.co/datasets/nvidia/Nemotron-VLM-Dataset-v2/tree/214051e30f0f5ef2d6cd7eb54027b38d229c8822) | Initial Release |
## Dataset Description
Following up on the Llama Nemotron VLM Dataset V1 with 3 million samples, we are releasing the Nemotron VLM Dataset V2 with almost three times as many high-quality samples.
This time, our focus was on three main areas: adding new data modalities like video, expanding our chain-of-thought reasoning data, and providing the community with a toolchain to generate OCR training data.
We discovered that to enhance performance further, our models needed to learn not only the correct answer but also the reasoning process behind it. Adding more targeted chain-of-thought datasets proved to be the key to breaking the plateau for various benchmarks.
With this release, we are broadening the dataset scope to allow for training more capable models. We added
- New Modalities and Domains: We have added a substantial amount of new data covering UI understanding, complex charts, and diagrams. For the first time, we are also including video understanding tasks.
- Focus on Reasoning: We have been able to break benchmark plateaus by adding more chain-of-thought data, some of which we generated by auto-labeling thinking traces for existing samples. We found that providing those traces helped especially with samples that the previous model struggled with.
- Improved OCR: We further improved on the highly competitive OCR capabilities of our first VL model by adding an even larger variety of training samples, including multilingual data for six languages. Unfortunately, we cannot redistribute a large part of those samples, but we are releasing the data generation pipeline that we used, so you can generate all of that OCR data with ground truth yourself! Check it out [here](https://github.com/NVIDIA-NeMo/Curator/tree/experimental/experimental/nvpdftex).
The table below lists all the subdatasets that we are publishing, with their sizes, properties, and links to subdataset cards with more details.
For each subdataset, we are publishing the annotations/labels that we generated using various strategies (see the "Source & Processing" column).
The actual media data (images and videos) can only be redistributed for some of the datasets according to their licenses.
For the remaining ones, we provide instructions on how to obtain the data in each of the subdataset cards.
All of the data is prepared to be used with our multi-modal data loader [Megatron Energon](https://github.com/NVIDIA/Megatron-Energon). For more details, see [this section](#loading-the-data-with-megatron-energon) below.
This dataset is ready for commercial use.
---
## Dataset Owner
NVIDIA Corporation
---
## Dataset Creation Date
10/27/2025
---
## License/Terms of Use
**Governing Terms**: This collection of datasets is governed by the [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/deed.en) (CC-BY-4.0), except for the following datasets, which are governed by the [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) (CC BY-SA 4.0): dewiki_v5_0828, enwiki_v5_0828, eswiki_v5_0828, frwiki_v5_0828, itwiki_v5_0828, jawiki_v5_0828, kowiki_v5_0828, nlwiki_v5_0828, ptwiki_v5_0828, and zhwiki_v5_0828.
---
## Intended Usage
The Nemotron VLM Dataset is intended to be used by the community to continue to improve open models. The data may be freely used to train and evaluate models.
---
## Dataset Composition
| Dataset Name | Samples | Size (GB) | Data & Task Type | Source & Processing | Media incl. | Governing Terms |
|------------|-----------:|-----------:|------------|------------|------------|------------|
| [wiki_de](./wiki_de/README.md) | 200,000 | 37.13 | text image | public | ☑ | CC BY-SA 4.0 |
| [wiki_en](./wiki_en/README.md) | 200,000 | 33.38 | text image | public | ☑ | CC BY-SA 4.0 |
| [wiki_es](./wiki_es/README.md) | 200,000 | 32.85 | text image | public | ☑ | CC BY-SA 4.0 |
| [wiki_fr](./wiki_fr/README.md) | 200,000 | 31.15 | text image | public | ☑ | CC BY-SA 4.0 |
| [wiki_it](./wiki_it/README.md) | 200,000 | 30.30 | text image | public | ☑ | CC BY-SA 4.0 |
| [wiki_ja](./wiki_ja/README.md) | 200,000 | 38.40 | text image | public | ☑ | CC BY-SA 4.0 |
| [wiki_ko](./wiki_ko/README.md) | 200,000 | 27.09 | text image | public | ☑ | CC BY-SA 4.0 |
| [wiki_nl](./wiki_nl/README.md) | 200,000 | 29.52 | text image | public | ☑ | CC BY-SA 4.0 |
| [wiki_pt](./wiki_pt/README.md) | 200,000 | 30.49 | text image | public | ☑ | CC BY-SA 4.0 |
| [wiki_zh](./wiki_zh/README.md) | 200,000 | 30.14 | text image | public | ☑ | CC BY-SA 4.0 |
| [oi_bbox_1](./oi_bbox_1/README.md) | 1,664,533 | 490.37 | text image qa | public | | CC BY 4.0 |
| [oi_bbox_2](./oi_bbox_2/README.md) | 1,664,533 | 488.17 | text image qa | public | | CC BY 4.0 |
| [oi_bbox_3](./oi_bbox_3/README.md) | 1,128,326 | 324.46 | text image qa | public | | CC BY 4.0 |
| [tabmwp_cot](./tabmwp_cot/README.md) | 20,305 | 0.28 | text image reasoning | public qwen-labels | | CC BY 4.0 |
| [sparsetables](./sparsetables/README.md) | 100,000 | 14.36 | text image | synthetic | ☑ | CC BY 4.0 |
| [mulberry_cot_1](./mulberry_cot_1/README.md) | 191,332 | 30.80 | text image reasoning | public glm-labels | | CC BY 4.0 |
| [llava_cot_100k](./llava_cot_100k/README.md) | 63,013 | 8.18 | text image reasoning | public glm-labels | | CC BY 4.0 |
| [geomverse_cot](./geomverse_cot/README.md) | 9,298 | 0.90 | text image reasoning | public glm-labels | | CC BY 4.0 |
| [mapqa_cot](./mapqa_cot/README.md) | 16,832 | 1.77 | text image reasoning | public glm-labels | | CC BY 4.0 |
| [plotqa_cot](./plotqa_cot/README.md) | 16,256 | 0.76 | text image reasoning | public glm-labels | ☑ | CC BY 4.0 |
| [visual7w_telling_cot](./visual7w_telling_cot/README.md) | 62,592 | 3.21 | text image reasoning | public glm-labels | | CC BY 4.0 |
| [visual_web_instruct_cot](./visual_web_instruct_cot/README.md) | 48,929 | 4.37 | text image reasoning | public glm-labels | | CC BY 4.0 |
| [docvqa_cot](./docvqa_cot/README.md) | 36,333 | 24.32 | text image reasoning | public qwen-labels | | CC BY 4.0 |
| [chartqa_cot](./chartqa_cot/README.md) | 45,710 | 2.10 | text image reasoning | public qwen-labels | | CC BY 4.0 |
| [infographicsvqa_cot](./infographicsvqa_cot/README.md) | 19,548 | 6.70 | text image reasoning | public qwen-labels | | CC BY 4.0 |
| [mulberry_cot_2](./mulberry_cot_2/README.md) | 103,763 | 18.45 | text image reasoning | public qwen-labels | | CC BY 4.0 |
| [unigeo_cot](./unigeo_cot/README.md) | 9,728 | 0.05 | text image reasoning | public glm-labels | | CC BY 4.0 |
| [nights_cot](./nights_cot/README.md) | 12,906 | 37.01 | text image reasoning | public glm-labels | ☑ | CC BY 4.0 |
| [mantis_instruct_cot](./mantis_instruct_cot/README.md) | 67,723 | 13.86 | text image reasoning | public glm-labels | | CC BY 4.0 |
| [fintabnet_cot](./fintabnet_cot/README.md) | 8,356 | 3.17 | text image reasoning | public qwen-labels | | CC BY 4.0 |
| [hiertext](./hiertext/README.md) | 514 | 0.07 | text image reasoning | public qwen-labels | | CC BY 4.0 |
| [nextqa](./nextqa/README.md) | 34,132 | 150.86 | text video qa | public rule-based | | CC BY 4.0 |
| [clevrer](./clevrer/README.md) | 40,000 | 46.03 | text video qa | public rule-based | | CC BY 4.0 |
| [ego_exo_learn](./ego_exo_learn/README.md) | 36,373 | 8558.27 | text video qa | public rule-based | ☑ | CC BY 4.0 |
| [kinetics_k710](./kinetics_k710/README.md) | 647,883 | 890.56 | text video qa | public rule-based | | CC BY 4.0 |
| [perception_test_1](./perception_test_1/README.md) | 7,392 | 94.95 | text video qa | public rule-based | ☑ | CC BY 4.0 |
| [activity_net_1](./activity_net_1/README.md) | 10,021 | 191.49 | text video qa | public rule-based | | CC BY 4.0 |
| [hacs](./hacs/README.md) | 31,223 | 829.25 | text video qa | public rule-based | | CC BY 4.0 |
| [hirest_1](./hirest_1/README.md) | 822 | 42.50 | text video qa | public rule-based | | CC BY 4.0 |
| [perception_test_2](./perception_test_2/README.md) | 2,135 | 25.98 | text video qa | public rule-based | ☑ | CC BY 4.0 |
| [activity_net_2](./activity_net_2/README.md) | 9,064 | 181.24 | text video qa | public rule-based | | CC BY 4.0 |
| [hirest_2](./hirest_2/README.md) | 525 | 27.54 | text video qa | public rule-based | | CC BY 4.0 |
| [youcook2_1](./youcook2_1/README.md) | 1,180 | 77.65 | text video qa | public rule-based | | CC BY 4.0 |
| [youcook2_2](./youcook2_2/README.md) | 2,270 | 158.77 | text video qa | public rule-based | | CC BY 4.0 |
| [breakfast_actions](./breakfast_actions/README.md) | 1,204 | 3.45 | text video qa | public rule-based | ☑ | CC BY 4.0 |
| [ccpdf_multipage_1](./ccpdf_multipage_1/README.md) | 7,262 | 48.19 | text image qa | public qwen-labels | | CC BY 4.0 |
| [ccpdf_multipage_2](./ccpdf_multipage_2/README.md) | 455 | 31.88 | text image qa | public qwen-labels | | CC BY 4.0 |
| [perception_test_cot](./perception_test_cot/README.md) | 4,977 | 64.55 | text video qa | public glm-labels | ☑ | CC BY 4.0 |
| [ccpdf_nv_notables](./ccpdf_nv_notables/README.md) | 14,234 | 8.48 | text image | public human-labels | | CC BY 4.0 |
| [ccpdf_nv_qa](./ccpdf_nv_qa/README.md) | 1,668 | 0.55 | text image qa | public qwen-labels | | CC BY 4.0 |
| [ccpdf_nv_tables](./ccpdf_nv_tables/README.md) | 4,249 | 1.83 | text image | public human-labels | | CC BY 4.0 |
| **Total** (51) | 8,147,599 | 13,227.83 | | | | |
## Tag Legend
* text: Contains text data
* image: Contains image data
* video: Contains video data
* qa: Contains question answering data
* reasoning: Contains chain of thought reasoning data
* public: Origin of the data is another public dataset
* synthetic: The data was synthetically generated
* qwen-labels: Labels generated by Qwen
* glm-labels: Labels generated by GLM
* human-labels: Labels generated by humans
* rule-based: Labels generated/transformed by simple rules
---
## Dataset Quantification
- **Total Number of Datasets**: 51
- **Total Number of Samples**: 8,147,599
- **Total Size**: 13,227.83 GB
---
## Dataset Characterization
### **Data Collection Method**
Hybrid: Synthetic, Automated, Human
### **Labeling Method**
Hybrid: Synthetic, Automated, Human
---
## Dataset Format
Each subdataset includes either:
- Text annotations (.jsonl format), referencing images or videos from source datasets, or
- Text annotations (.jsonl format) together with images or videos (in tar'ed shards).
For details on the format, check [Data Format](data_format.md).
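For a quick look at the raw annotations without setting up a data loader, the sketch below reads a few records from one of the .jsonl files. It is a minimal illustration assuming the feature schema from the YAML header above (an `id` plus a list of `messages`, each with a `role` and a `content` list of typed parts); the file path is just an example, and data_format.md remains the authoritative reference.
```python
import json

# Example path: any per-subdataset annotation file from this repository.
jsonl_path = "wiki_en/wiki_en.jsonl"

with open(jsonl_path, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        sample = json.loads(line)
        # Top-level fields per the schema above: id and messages (role + content parts).
        print("id:", sample.get("id"))
        for message in sample.get("messages", []):
            for part in message.get("content", []):
                # Each part carries a type plus one of text/image/video/audio.
                value = part.get("text") or part.get("image") or part.get("video") or part.get("audio")
                print(f"  {message.get('role')}: [{part.get('type')}] {str(value)[:80]}")
        if i >= 2:  # peek at the first three samples only
            break
```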
---
## Loading the Data with Megatron Energon
This data has been prepared to be used with [Megatron Energon](https://github.com/NVIDIA/Megatron-Energon).
You can just go ahead and try it out like this:
```sh
# Install energon if you haven't already
pip install "megatron-energon[av_decode]" dacite
# Download this dataset (OPTION 1, slower)
git lfs install
git clone git@hf.co:datasets/nvidia/Nemotron-VLM-Dataset-v2 Nemotron-VLM-Dataset-v2
# Download this dataset (OPTION 2, modern faster way)
pip install --upgrade huggingface_hub
hf download nvidia/Nemotron-VLM-Dataset-v2 --repo-type dataset --local-dir Nemotron-VLM-Dataset-v2
# Try out the example to print a few dataset samples
cd Nemotron-VLM-Dataset-v2
python example_loader.py
```
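Beyond the bundled `example_loader.py`, iterating over the samples with Energon generally follows the library's quickstart pattern. The sketch below is only an outline under assumptions: the dataset path points at your local download, and the batch size and shuffle settings are placeholders rather than values prescribed by this repository.
```python
# Minimal sketch based on Megatron Energon's generic quickstart; the path and
# loader arguments are assumptions -- see example_loader.py in this repository
# for the reference way to load these subdatasets.
from megatron.energon import WorkerConfig, get_loader, get_train_dataset

train_ds = get_train_dataset(
    "./Nemotron-VLM-Dataset-v2",  # local clone or `hf download` target directory
    batch_size=1,
    shuffle_buffer_size=100,
    max_samples_per_sequence=None,
    worker_config=WorkerConfig.default_worker_config(),
)

for batch in get_loader(train_ds):
    print(batch)  # inspect one batch, then stop
    break
```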
---
## Ethical Considerations
NVIDIA believes **Trustworthy AI** is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.
When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report model quality, risk, security vulnerabilities or **NVIDIA AI Concerns** [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).