| license | license_name | license_link | task_categories | language |
|---|---|---|---|---|
| other | apple-ascl | https://github.com/apple/ml-mobileclip/blob/main/LICENSE_weights_data | | |

This dataset contains the UIDs of DataComp-12M, a 12M-sample subset of DataComp-1B-BestPool. Image-text models trained on DataComp-12M are significantly better than those trained on CC-12M/YFCC-15M or on DataComp-Small/Medium. For details on this dataset and the improved DataCompDR-12M, please see our MobileCLIP paper. The dataset with the original captions is now available at mlfoundations/DataComp-12M. The UIDs per shard match between mlfoundations/DataComp-12M and apple/DataCompDR-12M.
DataCompDR is an image-text dataset and an enhancement to the DataComp dataset.
We reinforce the DataComp dataset using our multi-modal dataset reinforcement strategy.
In particular, we create DataCompDR-1B and DataCompDR-12M by reinforcing DataComp-1B (BestPool filtering) and a uniform subset of 12.8M samples, DataComp-12M, respectively.
We have a one-time generation process, the cost of which is amortized over multiple architectures and extensive ablations.
We generate 5 synthetic captions per image using the `coca_ViT-L-14` model in OpenCLIP, and strong random image augmentations (10 for DataCompDR-1B and 30 for DataCompDR-12M).
We compute embeddings of an ensemble of two strong teachers (`ViT-L-14` with pretrained weights `datacomp_xl_s13b_b90k` and `openai` in OpenCLIP) on augmented images as well as real and synthetic captions.
Embeddings are 1536-D concatenations of 2x768-D vectors.
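
As a minimal sketch of the embedding layout just described, the snippet below concatenates two 768-D vectors into one 1536-D vector and splits it back. The random vectors are stand-ins for real teacher outputs, and the order of the two teachers inside the stored embedding is an assumption, not something stated here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for the two teachers' 768-D embeddings of the same input.
# Real values would come from the two ViT-L-14 teachers; these are illustrative only.
emb_teacher_a = rng.standard_normal(768).astype(np.float32)
emb_teacher_b = rng.standard_normal(768).astype(np.float32)

# Ensemble embedding: two 768-D vectors concatenated into one 1536-D vector.
# Which teacher comes first is an assumption made for this sketch.
ensemble_emb = np.concatenate([emb_teacher_a, emb_teacher_b])
print(ensemble_emb.shape)  # (1536,)

# Recover the per-teacher halves from the concatenated vector.
half_a, half_b = np.split(ensemble_emb, 2)
```

Splitting at the midpoint is enough to work with each teacher separately, for example to distill from only one of them.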
One seen sample for DataCompDR is a triplet of one randomly augmented image, one ground-truth caption, and one randomly picked synthetic caption.
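
The triplet construction above can be sketched as follows. The function name `draw_seen_sample`, the record layout, and the placeholder strings are all hypothetical; uniform sampling is also an assumption, since the text only says the image and synthetic caption are picked at random.

```python
import random

def draw_seen_sample(augmented_images, ground_truth_caption, synthetic_captions, rng):
    """Return one (augmented image, ground-truth caption, synthetic caption) triplet."""
    image = rng.choice(augmented_images)        # one random augmentation
    synthetic = rng.choice(synthetic_captions)  # one random synthetic caption
    return image, ground_truth_caption, synthetic

# Hypothetical record for one DataCompDR-12M sample: 30 augmentations and
# 5 synthetic captions, represented here by placeholder strings.
record = {
    "augmented_images": [f"aug_{i}" for i in range(30)],
    "caption": "a photo of a dog",
    "synthetic_captions": [f"syn_{i}" for i in range(5)],
}

sample = draw_seen_sample(
    record["augmented_images"],
    record["caption"],
    record["synthetic_captions"],
    random.Random(0),  # seeded generator so the draw is reproducible
)
print(sample)
```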
Training with DataCompDR shows a significant improvement in learning efficiency compared to standard CLIP training. For example, with a single node of 8×A100 GPUs, we achieve 61.7% zero-shot classification accuracy on ImageNet-val in approximately one day when training a ViT-B/16 based CLIP from scratch on DataCompDR-12M. Training with DataCompDR-1B sets new state-of-the-art performance on several metrics (Fig. 2) while still using a fraction of the training compute budget compared to previous works. Using DataCompDR, we demonstrate 10x-1000x better learning efficiency than DataComp.
- uids.txt: List of 12779520 (65536*195) UIDs, one UID per line.
- uids.npy: List of 12779520 (65536*195) UIDs as a NumPy array of type `numpy.dtype("u8,u8")`.
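
A minimal sketch of working with the `"u8,u8"` layout, assuming each UID is a 32-character hexadecimal string (128 bits) and that the two `u8` fields hold its upper and lower 64 bits, in that order. The helpers `uid_to_pair` and `pair_to_uid` and the example UIDs are hypothetical; verify the field order against `uids.txt` before relying on it.

```python
import numpy as np

UID_DTYPE = np.dtype("u8,u8")   # two unsigned 64-bit fields per UID
expected_count = 65536 * 195    # 12779520 UIDs, as listed above

def uid_to_pair(uid_hex):
    """Split a 128-bit hex UID into (upper 64 bits, lower 64 bits)."""
    value = int(uid_hex, 16)
    return value >> 64, value & (2**64 - 1)

def pair_to_uid(pair):
    """Rebuild the 32-character hex UID from one structured array element."""
    upper, lower = int(pair[0]), int(pair[1])
    return f"{upper:016x}{lower:016x}"

# Made-up UIDs for illustration; the real array would come from
# np.load("uids.npy") once the file has been downloaded.
example_uids = [
    "000000000000000000000000000000ff",
    "0123456789abcdef0123456789abcdef",
]
uids = np.array([uid_to_pair(u) for u in example_uids], dtype=UID_DTYPE)

# A set of tuples gives fast membership checks when filtering other data.
uid_set = set(uids.tolist())
print(uid_to_pair(example_uids[1]) in uid_set)  # True
print(pair_to_uid(uids[0]) == example_uids[0])  # True
```

Storing each UID as two integers keeps the array compact (16 bytes per UID) compared with a list of Python strings.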
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training. (CVPR 2024) Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel.
@InProceedings{mobileclip2024,
author = {Pavan Kumar Anasosalu Vasu and Hadi Pouransari and Fartash Faghri and Raviteja Vemulapalli and Oncel Tuzel},
title = {MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
}