# Nemotron-VLM-Dataset-v2

**Repository Path**: hf-datasets/Nemotron-VLM-Dataset-v2

## Basic Information

- **Project Name**: Nemotron-VLM-Dataset-v2
- **Description**: Mirror of https://huggingface.co/datasets/nvidia/Nemotron-VLM-Dataset-v2
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-11-06
- **Last Updated**: 2025-11-06

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

---
license: cc-by-4.0
task_categories:
- visual-question-answering
- image-text-to-text
- video-text-to-text
pretty_name: Nemotron-VLM-Dataset v2
size_categories:
- n>1T
dataset_info:
  features:
  - name: id
    dtype: string
  - name: messages
    sequence:
      struct:
      - name: role
        dtype: string
      - name: content
        sequence:
          struct:
          - name: type
            dtype: string
          - name: text
            dtype: string
          - name: image
            dtype: string
          - name: video
            dtype: string
          - name: audio
            dtype: string
          - name: metadata
            struct:
            - name: pdf
              dtype: string
            - name: page_number
              dtype: int32
            - name: url
              dtype: string
            - name: yt
              dtype: string
            - name: width
              dtype: int32
            - name: height
              dtype: int32
            - name: format
              dtype: string
            - name: mode
              dtype: string
            - name: video_duration
              dtype: float32
            - name: video_num_frames
              dtype: int32
            - name: video_fps
              dtype: float32
            - name: video_width
              dtype: int32
            - name: video_height
              dtype: int32
            # - name: audio_duration
            #   dtype: float32
            # - name: audio_channels
            #   dtype: int32
            # - name: audio_sample_rate
            #   dtype: int32
configs:
- config_name: wiki_de
  data_files:
  - split: train
    path: wiki_de/wiki_de.jsonl
- config_name: wiki_en
  data_files:
  - split: train
    path: wiki_en/wiki_en.jsonl
- config_name: wiki_es
  data_files:
  - split: train
    path: wiki_es/wiki_es.jsonl
- config_name: wiki_fr
  data_files:
  - split: train
    path: wiki_fr/wiki_fr.jsonl
- config_name: wiki_it
  data_files:
  - split: train
    path: wiki_it/wiki_it.jsonl
- config_name: wiki_ja
  data_files:
  - split: train
    path: wiki_ja/wiki_ja.jsonl
- config_name: wiki_ko
  data_files:
  - split: train
    path: wiki_ko/wiki_ko.jsonl
- config_name: wiki_nl
  data_files:
  - split: train
    path: wiki_nl/wiki_nl.jsonl
- config_name: wiki_pt
  data_files:
  - split: train
    path: wiki_pt/wiki_pt.jsonl
- config_name: wiki_zh
  data_files:
  - split: train
    path: wiki_zh/wiki_zh.jsonl
- config_name: oi_bbox_3
  data_files:
  - split: train
    path: oi_bbox_3/oi_bbox_3.jsonl
- config_name: tabmwp_cot
  data_files:
  - split: train
    path: tabmwp_cot/tabmwp_cot.jsonl
- config_name: sparsetables
  data_files:
  - split: train
    path: sparsetables/sparsetables.jsonl
- config_name: mulberry_cot_1
  data_files:
  - split: train
    path: mulberry_cot_1/mulberry_cot_1.jsonl
- config_name: mulberry_cot_2
  data_files:
  - split: train
    path: mulberry_cot_2/mulberry_cot_2.jsonl
- config_name: llava_cot_100k
  data_files:
  - split: train
    path: llava_cot_100k/llava_cot_100k.jsonl
- config_name: geomverse_cot
  data_files:
  - split: train
    path: geomverse_cot/geomverse_cot.jsonl
- config_name: mapqa_cot
  data_files:
  - split: train
    path: mapqa_cot/mapqa_cot.jsonl
- config_name: plotqa_cot
  data_files:
  - split: train
    path: plotqa_cot/plotqa_cot.jsonl
- config_name: visual7w_telling_cot
  data_files:
  - split: train
    path: visual7w_telling_cot/visual7w_telling_cot.jsonl
- config_name: visual_web_instruct_cot
  data_files:
  - split: train
    path: visual_web_instruct_cot/visual_web_instruct_cot.jsonl
- config_name: docvqa_cot
  data_files:
  - split: train
    path: docvqa_cot/docvqa_cot.jsonl
- config_name: chartqa_cot
  data_files:
  - split: train
    path: chartqa_cot/chartqa_cot.jsonl
- config_name: infographicsvqa_cot
  data_files:
  - split: train
    path: infographicsvqa_cot/infographicsvqa_cot.jsonl
- config_name: unigeo_cot
  data_files:
  - split: train
    path: unigeo_cot/unigeo_cot.jsonl
- config_name: nights_cot
  data_files:
  - split: train
    path: nights_cot/nights_cot.jsonl
- config_name: mantis_instruct_cot
  data_files:
  - split: train
    path: mantis_instruct_cot/mantis_instruct_cot.jsonl
- config_name: fintabnet_cot
  data_files:
  - split: train
    path: fintabnet_cot/fintabnet_cot.jsonl
- config_name: hiertext
  data_files:
  - split: train
    path: hiertext/hiertext.jsonl
- config_name: nextqa
  data_files:
  - split: train
    path: nextqa/nextqa.jsonl
- config_name: clevrer
  data_files:
  - split: train
    path: clevrer/clevrer.jsonl
- config_name: ego_exo_learn
  data_files:
  - split: train
    path: ego_exo_learn/ego_exo_learn.jsonl
- config_name: kinetics_k710
  data_files:
  - split: train
    path: kinetics_k710/kinetics_k710.jsonl
- config_name: perception_test_1
  data_files:
  - split: train
    path: perception_test_1/perception_test_1.jsonl
- config_name: perception_test_2
  data_files:
  - split: train
    path: perception_test_2/perception_test_2.jsonl
- config_name: activity_net_1
  data_files:
  - split: train
    path: activity_net_1/activity_net_1.jsonl
- config_name: hacs
  data_files:
  - split: train
    path: hacs/hacs.jsonl
- config_name: hirest_1
  data_files:
  - split: train
    path: hirest_1/hirest_1.jsonl
- config_name: activity_net_2
  data_files:
  - split: train
    path: activity_net_2/activity_net_2.jsonl
- config_name: hirest_2
  data_files:
  - split: train
    path: hirest_2/hirest_2.jsonl
- config_name: youcook2_1
  data_files:
  - split: train
    path: youcook2_1/youcook2_1.jsonl
- config_name: youcook2_2
  data_files:
  - split: train
    path: youcook2_2/youcook2_2.jsonl
- config_name: breakfast_actions
  data_files:
  - split: train
    path: breakfast_actions/breakfast_actions.jsonl
- config_name: ccpdf_multipage_1
  data_files:
  - split: train
    path: ccpdf_multipage_1/ccpdf_multipage_1.jsonl
- config_name: ccpdf_multipage_2
  data_files:
  - split: train
    path: ccpdf_multipage_2/ccpdf_multipage_2.jsonl
- config_name: perception_test_cot
  data_files:
  - split: train
    path: perception_test_cot/perception_test_cot.jsonl
---

# Nemotron-VLM-Dataset v2

## Versions

| Date | Commit | Changes |
|-------------|--------------|----------|
| **2025-11-05** | [head](https://huggingface.co/datasets/nvidia/Nemotron-VLM-Dataset-v2/tree/main) | Fix `nights_cot` dataset. Fix/filter broken `<think>` entries. Update fintabnet instructions. Update indexes. |
| **2025-10-28** | [214051e](https://huggingface.co/datasets/nvidia/Nemotron-VLM-Dataset-v2/tree/214051e30f0f5ef2d6cd7eb54027b38d229c8822) | Initial Release |

## Dataset Description

Following up on the Llama Nemotron VLM Dataset V1 with 3 million samples, we are releasing the Nemotron VLM Dataset V2 with almost three times as many high-quality samples. This time, our focus was on three main areas: adding new data modalities such as video, expanding our chain-of-thought reasoning data, and providing the community with a toolchain to generate OCR training data.

We discovered that to enhance performance further, our models needed to learn not only the correct answer but also the reasoning process behind it. Adding more targeted chain-of-thought datasets proved to be the key to breaking the plateau on various benchmarks. With this release, we are broadening the dataset scope to allow for training more capable models.
We added:

- New Modalities and Domains: We have added a substantial amount of new data covering UI understanding, complex charts, and diagrams. For the first time, we are also including video understanding tasks.
- Focus on Reasoning: We have been able to break benchmark plateaus by adding more chain-of-thought data, some of which we generated by auto-labeling thinking traces for existing samples. We found that providing those traces helped especially for samples that the previous model struggled with.
- Improved OCR: We further improved on the highly competitive OCR capabilities of our first VL model by adding an even larger variety of training samples, including multilingual data for six languages. Unfortunately, we cannot redistribute a large part of those samples, but we are releasing the data generation pipeline that we used, so you can generate all that OCR data with ground truth yourself! Check it out [here](https://github.com/NVIDIA-NeMo/Curator/tree/experimental/experimental/nvpdftex).

The table below lists all the subdatasets we are publishing, with their sizes, properties, and a link to a subdataset card with more details. For each subdataset, we publish the annotations/labels, which we generated using various strategies; see the "Source & Processing" column. The actual media data (images and videos) can only be redistributed for some of the datasets, according to their licenses. For the remaining ones, we provide instructions on how to obtain the data in each of the subdataset cards.

All of the data is prepared to be used with our multi-modal data loader [Megatron Energon](https://github.com/NVIDIA/Megatron-Energon). For more details, see [this section](#loading-the-data-with-megatron-energon) below.

This dataset is ready for commercial use.

---

## Dataset Owner

NVIDIA Corporation

---

## Dataset Creation Date

10/27/2025

---

## License/Terms of Use

**Governing Terms**: This collection of datasets is governed by the [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/deed.en) (CC BY 4.0), except for the following datasets, which are governed by the [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) (CC BY-SA 4.0): dewiki_v5_0828, enwiki_v5_0828, eswiki_v5_0828, frwiki_v5_0828, itwiki_v5_0828, jawiki_v5_0828, kowiki_v5_0828, nlwiki_v5_0828, ptwiki_v5_0828, and zhwiki_v5_0828.

---

## Intended Usage

The Nemotron VLM Dataset is intended to be used by the community to continue to improve open models. The data may be freely used to train and evaluate.
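The per-subdataset configs in the YAML header above point at the `.jsonl` annotation files, so the annotations can presumably also be pulled directly with the Hugging Face `datasets` library (the images and videos referenced by the annotations are not fetched this way). A minimal sketch, assuming the `wiki_en` config from the header and streaming mode to avoid a full download:

```python
# Hedged sketch (not part of the official card): load only the JSONL
# annotations of one subdataset config, "wiki_en" from the YAML header above.
from datasets import load_dataset

# Stream so records are read on the fly instead of downloading the whole file set.
annotations = load_dataset(
    "nvidia/Nemotron-VLM-Dataset-v2",
    "wiki_en",
    split="train",
    streaming=True,
)

first = next(iter(annotations))
print(first["id"])
print(first["messages"])  # conversation turns, as declared in dataset_info above
```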
---

## Dataset Composition

| Dataset Name | Samples | Size (GB) | Data & Task Type | Source & Processing | Media incl. | Governing Terms |
|------------|-----------:|-----------:|------------|------------|------------|------------|
| [wiki_de](./wiki_de/README.md) | 200,000 | 37.13 | text image | public | ☑ | CC BY-SA 4.0 |
| [wiki_en](./wiki_en/README.md) | 200,000 | 33.38 | text image | public | ☑ | CC BY-SA 4.0 |
| [wiki_es](./wiki_es/README.md) | 200,000 | 32.85 | text image | public | ☑ | CC BY-SA 4.0 |
| [wiki_fr](./wiki_fr/README.md) | 200,000 | 31.15 | text image | public | ☑ | CC BY-SA 4.0 |
| [wiki_it](./wiki_it/README.md) | 200,000 | 30.30 | text image | public | ☑ | CC BY-SA 4.0 |
| [wiki_ja](./wiki_ja/README.md) | 200,000 | 38.40 | text image | public | ☑ | CC BY-SA 4.0 |
| [wiki_ko](./wiki_ko/README.md) | 200,000 | 27.09 | text image | public | ☑ | CC BY-SA 4.0 |
| [wiki_nl](./wiki_nl/README.md) | 200,000 | 29.52 | text image | public | ☑ | CC BY-SA 4.0 |
| [wiki_pt](./wiki_pt/README.md) | 200,000 | 30.49 | text image | public | ☑ | CC BY-SA 4.0 |
| [wiki_zh](./wiki_zh/README.md) | 200,000 | 30.14 | text image | public | ☑ | CC BY-SA 4.0 |
| [oi_bbox_1](./oi_bbox_1/README.md) | 1,664,533 | 490.37 | text image qa | public | | CC BY 4.0 |
| [oi_bbox_2](./oi_bbox_2/README.md) | 1,664,533 | 488.17 | text image qa | public | | CC BY 4.0 |
| [oi_bbox_3](./oi_bbox_3/README.md) | 1,128,326 | 324.46 | text image qa | public | | CC BY 4.0 |
| [tabmwp_cot](./tabmwp_cot/README.md) | 20,305 | 0.28 | text image reasoning | public qwen-labels | | CC BY 4.0 |
| [sparsetables](./sparsetables/README.md) | 100,000 | 14.36 | text image | synthetic | ☑ | CC BY 4.0 |
| [mulberry_cot_1](./mulberry_cot_1/README.md) | 191,332 | 30.80 | text image reasoning | public glm-labels | | CC BY 4.0 |
| [llava_cot_100k](./llava_cot_100k/README.md) | 63,013 | 8.18 | text image reasoning | public glm-labels | | CC BY 4.0 |
| [geomverse_cot](./geomverse_cot/README.md) | 9,298 | 0.90 | text image reasoning | public glm-labels | | CC BY 4.0 |
| [mapqa_cot](./mapqa_cot/README.md) | 16,832 | 1.77 | text image reasoning | public glm-labels | | CC BY 4.0 |
| [plotqa_cot](./plotqa_cot/README.md) | 16,256 | 0.76 | text image reasoning | public glm-labels | ☑ | CC BY 4.0 |
| [visual7w_telling_cot](./visual7w_telling_cot/README.md) | 62,592 | 3.21 | text image reasoning | public glm-labels | | CC BY 4.0 |
| [visual_web_instruct_cot](./visual_web_instruct_cot/README.md) | 48,929 | 4.37 | text image reasoning | public glm-labels | | CC BY 4.0 |
| [docvqa_cot](./docvqa_cot/README.md) | 36,333 | 24.32 | text image reasoning | public qwen-labels | | CC BY 4.0 |
| [chartqa_cot](./chartqa_cot/README.md) | 45,710 | 2.10 | text image reasoning | public qwen-labels | | CC BY 4.0 |
| [infographicsvqa_cot](./infographicsvqa_cot/README.md) | 19,548 | 6.70 | text image reasoning | public qwen-labels | | CC BY 4.0 |
| [mulberry_cot_2](./mulberry_cot_2/README.md) | 103,763 | 18.45 | text image reasoning | public qwen-labels | | CC BY 4.0 |
| [unigeo_cot](./unigeo_cot/README.md) | 9,728 | 0.05 | text image reasoning | public glm-labels | | CC BY 4.0 |
| [nights_cot](./nights_cot/README.md) | 12,906 | 37.01 | text image reasoning | public glm-labels | ☑ | CC BY 4.0 |
| [mantis_instruct_cot](./mantis_instruct_cot/README.md) | 67,723 | 13.86 | text image reasoning | public glm-labels | | CC BY 4.0 |
| [fintabnet_cot](./fintabnet_cot/README.md) | 8,356 | 3.17 | text image reasoning | public qwen-labels | | CC BY 4.0 |
| [hiertext](./hiertext/README.md) | 514 | 0.07 | text image reasoning | public qwen-labels | | CC BY 4.0 |
| [nextqa](./nextqa/README.md) | 34,132 | 150.86 | text video qa | public rule-based | | CC BY 4.0 |
| [clevrer](./clevrer/README.md) | 40,000 | 46.03 | text video qa | public rule-based | | CC BY 4.0 |
| [ego_exo_learn](./ego_exo_learn/README.md) | 36,373 | 8558.27 | text video qa | public rule-based | ☑ | CC BY 4.0 |
| [kinetics_k710](./kinetics_k710/README.md) | 647,883 | 890.56 | text video qa | public rule-based | | CC BY 4.0 |
| [perception_test_1](./perception_test_1/README.md) | 7,392 | 94.95 | text video qa | public rule-based | ☑ | CC BY 4.0 |
| [activity_net_1](./activity_net_1/README.md) | 10,021 | 191.49 | text video qa | public rule-based | | CC BY 4.0 |
| [hacs](./hacs/README.md) | 31,223 | 829.25 | text video qa | public rule-based | | CC BY 4.0 |
| [hirest_1](./hirest_1/README.md) | 822 | 42.50 | text video qa | public rule-based | | CC BY 4.0 |
| [perception_test_2](./perception_test_2/README.md) | 2,135 | 25.98 | text video qa | public rule-based | ☑ | CC BY 4.0 |
| [activity_net_2](./activity_net_2/README.md) | 9,064 | 181.24 | text video qa | public rule-based | | CC BY 4.0 |
| [hirest_2](./hirest_2/README.md) | 525 | 27.54 | text video qa | public rule-based | | CC BY 4.0 |
| [youcook2_1](./youcook2_1/README.md) | 1,180 | 77.65 | text video qa | public rule-based | | CC BY 4.0 |
| [youcook2_2](./youcook2_2/README.md) | 2,270 | 158.77 | text video qa | public rule-based | | CC BY 4.0 |
| [breakfast_actions](./breakfast_actions/README.md) | 1,204 | 3.45 | text video qa | public rule-based | ☑ | CC BY 4.0 |
| [ccpdf_multipage_1](./ccpdf_multipage_1/README.md) | 7,262 | 48.19 | text image qa | public qwen-labels | | CC BY 4.0 |
| [ccpdf_multipage_2](./ccpdf_multipage_2/README.md) | 455 | 31.88 | text image qa | public qwen-labels | | CC BY 4.0 |
| [perception_test_cot](./perception_test_cot/README.md) | 4,977 | 64.55 | text video qa | public glm-labels | ☑ | CC BY 4.0 |
| [ccpdf_nv_notables](./ccpdf_nv_notables/README.md) | 14,234 | 8.48 | text image | public human-labels | | CC BY 4.0 |
| [ccpdf_nv_qa](./ccpdf_nv_qa/README.md) | 1,668 | 0.55 | text image qa | public qwen-labels | | CC BY 4.0 |
| [ccpdf_nv_tables](./ccpdf_nv_tables/README.md) | 4,249 | 1.83 | text image | public human-labels | | CC BY 4.0 |
| **Total** (51) | 8,147,599 | 13,227.83 | | | | |

## Tag Legend

* text: Contains text data
* image: Contains image data
* video: Contains video data
* qa: Contains question answering data
* reasoning: Contains chain-of-thought reasoning data
* public: Origin of the data is another public dataset
* synthetic: The data was synthetically generated
* qwen-labels: Labels generated by Qwen
* glm-labels: Labels generated by GLM
* human-labels: Labels generated by humans
* rule-based: Labels generated/transformed by simple rules

---

## Dataset Quantification

- **Total Number of Datasets**: 51
- **Total Number of Samples**: 8,147,599
- **Total Size**: 13,227.83 GB

---

## Dataset Characterization

### **Data Collection Method**

Hybrid: Synthetic, Automated, Human

### **Labeling Method**

Hybrid: Synthetic, Automated, Human

---

## Dataset Format

Each given dataset includes either:

- Text annotations (.jsonl format), referencing images or videos from source datasets, or
- Text annotations (.jsonl format) together with images or videos (in tar'ed shards).

For details on the format, check [Data Format](data_format.md).
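Because the annotation files are plain JSON Lines, a single record can be inspected without any special tooling. A minimal sketch, assuming a local copy of this repository (see the download options in the next section) and using the `wiki_en` annotation file declared in the YAML configs; the field names follow the `dataset_info` schema in the YAML header above:

```python
# Hedged sketch: inspect one annotation record from a locally downloaded copy.
# The path below assumes the repository was cloned/downloaded to
# "Nemotron-VLM-Dataset-v2"; adjust it to your local layout.
import json
from pathlib import Path

annotation_file = Path("Nemotron-VLM-Dataset-v2/wiki_en/wiki_en.jsonl")

with annotation_file.open(encoding="utf-8") as f:
    record = json.loads(next(f))  # each line is one sample

print(record["id"])
for message in record.get("messages", []):
    # Each content part carries a "type" plus one of text/image/video/audio
    # (and optional media metadata), as declared in the dataset_info features.
    kinds = [part.get("type") for part in message.get("content", [])]
    print(message.get("role"), kinds)
```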
---

## Loading the Data with Megatron Energon

This data has been prepared to be used with [Megatron Energon](https://github.com/NVIDIA/Megatron-Energon). You can just go ahead and try it out like this:

```sh
# Install energon if you haven't already
pip install megatron-energon[av_decode] dacite

# Download this dataset (OPTION 1, slower)
git lfs install
git clone git@hf.co:datasets/nvidia/Nemotron-VLM-Dataset-v2 Nemotron-VLM-Dataset-v2

# Download this dataset (OPTION 2, modern faster way)
pip install --upgrade huggingface_hub
hf download nvidia/Nemotron-VLM-Dataset-v2 --repo-type dataset --local-dir Nemotron-VLM-Dataset-v2

# Try out the example to print a few dataset samples
cd Nemotron-VLM-Dataset-v2
python example_loader.py
```

---

## Ethical Considerations

NVIDIA believes **Trustworthy AI** is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities or **NVIDIA AI Concerns** [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).