# clip-vit-large-patch14-336

**Repository Path**: hf-models/clip-vit-large-patch14-336

## Basic Information

- **Project Name**: clip-vit-large-patch14-336
- **Description**: A large variant of the CLIP (Contrastive Language-Image Pre-training) model that uses a Vision Transformer (ViT) as its image encoder. The model is pre-trained to associate images with text, closely linking image content to the text labels that describe it.
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 3
- **Forks**: 1
- **Created**: 2023-10-23
- **Last Updated**: 2025-11-28

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

---
tags:
- generated_from_keras_callback
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png
  candidate_labels: playing music, playing sports
  example_title: Cat & Dog
model-index:
- name: clip-vit-large-patch14-336
  results: []
---

# clip-vit-large-patch14-336

This model was trained from scratch on an unknown dataset. No evaluation results are reported.

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- optimizer: None
- training_precision: float32

### Training results

More information needed

### Framework versions

- Transformers 4.21.3
- TensorFlow 2.8.2
- Tokenizers 0.12.1
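The card above gives no usage example, so here is a rough sketch of the matching mechanism the description refers to: CLIP L2-normalizes the image and text embeddings, takes their scaled dot products as logits, and applies a softmax over the candidate labels. The embeddings, the `clip_match` helper, and the `logit_scale` value below are all illustrative stand-ins (real CLIP-ViT-L/14 produces 768-dimensional embeddings and learns its temperature during training); for actual inference you would load the pretrained checkpoint, e.g. via the `transformers` library.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length, as CLIP does before comparing embeddings."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def clip_match(image_emb, text_embs, logit_scale=100.0):
    """Score one image embedding against candidate text embeddings.

    Returns one probability per candidate label, CLIP-style:
    cosine similarity scaled by a temperature, then softmax.
    (Hypothetical helper; not part of any CLIP library API.)
    """
    img = l2_normalize(image_emb)
    logits = []
    for t in text_embs:
        t = l2_normalize(t)
        sim = sum(a * b for a, b in zip(img, t))  # cosine similarity
        logits.append(logit_scale * sim)
    return softmax(logits)

# Toy 3-d embeddings standing in for encoder outputs
image = [0.9, 0.1, 0.2]
texts = [
    [0.8, 0.2, 0.1],   # points the same way as the image -> high probability
    [-0.5, 0.9, 0.3],  # points away from the image -> low probability
]
probs = clip_match(image, texts)
```

With the widget example above, the two text embeddings would come from encoding the candidate labels "playing music" and "playing sports", and the image embedding from the cat-and-dog picture; the label whose embedding is most aligned with the image wins.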