Dataset Card for DataComp-12M

license: other
license_name: apple-ascl
license_link: https://github.com/apple/ml-mobileclip/blob/main/LICENSE_weights_data
task_categories: text-to-image, image-to-text
language: en

This dataset contains the UIDs of DataComp-12M, a 12M-sample subset of DataComp-1B-BestPool. Image-text models trained on DataComp-12M perform significantly better than those trained on CC-12M or YFCC-15M, as well as on DataComp-Small or DataComp-Medium. For details on this dataset and the improved DataCompDR-12M, see our MobileCLIP paper. The dataset with the original captions is now available at mlfoundations/DataComp-12M. The UIDs per shard match between mlfoundations/DataComp-12M and apple/DataCompDR-12M.

Dataset Details

Dataset Description

DataCompDR is an image-text dataset and an enhancement of the DataComp dataset. We reinforce the DataComp dataset using our multi-modal dataset reinforcement strategy. In particular, we create DataCompDR-1B and DataCompDR-12M by reinforcing DataComp-1B (BestPool filtering) and a uniform subset of it with 12.8M samples, DataComp-12M, respectively. Generation is a one-time process whose cost is amortized over multiple architectures and extensive ablations. For each image, we generate 5 synthetic captions using the coca_ViT-L-14 model in OpenCLIP, and apply strong random image augmentations (10 for DataCompDR-1B and 30 for DataCompDR-12M). We compute embeddings from an ensemble of two strong teachers (ViT-L-14 with pretrained weights datacomp_xl_s13b_b90k and openai in OpenCLIP) on the augmented images as well as on the real and synthetic captions. Each embedding is a 1536-D concatenation of the two teachers' 768-D vectors. One seen sample for DataCompDR is a triplet of one randomly augmented image, one ground-truth caption, and one randomly picked synthetic caption.
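To make the sample structure described above concrete, here is a minimal Python sketch of drawing one triplet and forming a 1536-D embedding from two 768-D teacher vectors. The record layout (`caption`, `synthetic_captions`, `augmentations`) and the helper names are hypothetical and used only for illustration; the released files may be organized differently, and the placeholder vectors stand in for real teacher outputs.

```python
import random

NUM_SYNTHETIC_CAPTIONS = 5   # synthetic captions per image, per the description
NUM_AUGMENTATIONS_12M = 30   # 30 for DataCompDR-12M (10 for DataCompDR-1B)
TEACHER_DIM = 768            # each teacher contributes a 768-D vector


def concat_teacher_embeddings(emb_a, emb_b):
    """Concatenate two 768-D teacher embeddings into one 1536-D vector."""
    assert len(emb_a) == TEACHER_DIM and len(emb_b) == TEACHER_DIM
    return list(emb_a) + list(emb_b)


def sample_triplet(record, rng):
    """Pick one augmentation, the ground-truth caption, and one synthetic caption.

    `record` is a hypothetical dict, not the dataset's actual storage format.
    """
    aug_idx = rng.randrange(len(record["augmentations"]))
    syn_idx = rng.randrange(len(record["synthetic_captions"]))
    return {
        "augmentation": record["augmentations"][aug_idx],
        "caption": record["caption"],
        "synthetic_caption": record["synthetic_captions"][syn_idx],
    }


# Toy record with placeholder values only.
record = {
    "caption": "a photo of a dog",
    "synthetic_captions": [f"synthetic caption {i}" for i in range(NUM_SYNTHETIC_CAPTIONS)],
    "augmentations": [f"augmentation params {i}" for i in range(NUM_AUGMENTATIONS_12M)],
}

triplet = sample_triplet(record, random.Random(0))
embedding = concat_teacher_embeddings([0.0] * TEACHER_DIM, [1.0] * TEACHER_DIM)
print(len(embedding))  # 1536
```

Because every image carries several stored augmentations and synthetic captions, repeated passes over the same image can yield different triplets without regenerating anything.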

Curated by: Original data by DataComp and metadata by Apple.
License: We distribute our metadata under our license. The original image url-text samples and metadata were released by DataComp under the Creative Commons CC-BY-4.0 license. The individual images are under their own copyrights.
Repository: ml-mobileclip GitHub
Paper: MobileCLIP paper
Demo: Coming Soon

Uses

Training with DataCompDR shows a significant improvement in learning efficiency compared to standard CLIP training. For example, with a single node of 8×A100 GPUs, we achieve 61.7% zero-shot classification accuracy on ImageNet-val in approximately one day when training a ViT-B/16 based CLIP from scratch on DataCompDR-12M. Training with DataCompDR-1B sets new state-of-the-art performance on several metrics (Fig. 2) while still using a fraction of the training compute budget of previous works. Using DataCompDR, we demonstrate 10x-1000x higher learning efficiency than with DataComp.

Dataset Structure

- uids.txt: List of 12779520 (65536*195) UIDs, one UID per line.
- uids.npy: List of 12779520 (65536*195) UIDs as a NumPy array of type `numpy.dtype("u8,u8")`.
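The sketch below shows one way the `"u8,u8"` layout could be handled with NumPy. It assumes each UID in uids.txt is a 32-character hexadecimal string whose first 16 characters map to the first 64-bit field and whose last 16 map to the second. That split is an assumption, not something this card states, and the helpers `uid_to_pair` and `pair_to_uid` are hypothetical names. The toy UIDs are placeholders, and the in-memory buffer stands in for the real file.

```python
import io

import numpy as np

UID_DTYPE = np.dtype("u8,u8")  # two unsigned 64-bit fields, as stated above
EXPECTED_COUNT = 65536 * 195   # 12779520 UIDs


def uid_to_pair(uid_hex):
    """Split a 32-char hex UID into two 64-bit integers (assumed layout)."""
    return int(uid_hex[:16], 16), int(uid_hex[16:32], 16)


def pair_to_uid(pair):
    """Rebuild the 32-char hex UID from a two-field record."""
    return f"{int(pair[0]):016x}{int(pair[1]):016x}"


# Toy UIDs standing in for lines of uids.txt (not real dataset entries).
toy_uids = [
    "0123456789abcdef" + "fedcba9876543210",
    "00000000000000ff" + "0000000000000001",
]
packed = np.array([uid_to_pair(u) for u in toy_uids], dtype=UID_DTYPE)

# Round-trip through the .npy format in memory; with the real file this
# step would be np.load("uids.npy").
buffer = io.BytesIO()
np.save(buffer, packed)
buffer.seek(0)
loaded = np.load(buffer)

restored = [pair_to_uid(row) for row in loaded]
print(restored == toy_uids)
```

With the real file, comparing `len(np.load("uids.npy"))` against `EXPECTED_COUNT` is a quick sanity check that the download is complete.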

Citation

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training. (CVPR 2024) Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel.

@InProceedings{mobileclip2024,
  author = {Pavan Kumar Anasosalu Vasu and Hadi Pouransari and Fartash Faghri and Raviteja Vemulapalli and Oncel Tuzel},
  title = {MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2024},
}
