# ijepa

**Repository Path**: biboqingfengxia/ijepa

## Basic Information

- **Project Name**: ijepa
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-05-17
- **Last Updated**: 2025-05-17

## Categories & Tags

- **Categories**: Uncategorized
- **Tags**: None

## README

# I-JEPA

Official PyTorch codebase for I-JEPA (the **Image-based Joint-Embedding Predictive Architecture**) published at CVPR 2023.
[\[arXiv\]](https://arxiv.org/pdf/2301.08243.pdf) [\[JEPAs\]](https://ai.facebook.com/blog/yann-lecun-advances-in-ai-research/) [\[blogpost\]](https://ai.facebook.com/blog/yann-lecun-ai-model-i-jepa/)

## Method

I-JEPA is a method for self-supervised learning. At a high level, I-JEPA predicts the representations of part of an image from the representations of other parts of the same image. Notably, this approach learns semantic image features:

1. without relying on pre-specified invariances to hand-crafted data transformations, which tend to be biased for particular downstream tasks,
2. and without having the model fill in pixel-level details, which tends to result in learning less semantically meaningful representations.

## Visualizations

As opposed to generative methods that have a pixel decoder, I-JEPA has a predictor that makes predictions in latent space. The predictor in I-JEPA can be seen as a primitive (and restricted) world model that is able to model spatial uncertainty in a static image from a partially observable context. This world model is semantic in the sense that it predicts high-level information about unseen regions in the image, rather than pixel-level details.

We trained a stochastic decoder that maps the I-JEPA predicted representations back to pixel space as sketches.
The model correctly captures positional uncertainty and produces high-level object parts with the correct pose (e.g., dog’s head, wolf’s front legs).

Caption: Illustrating how the predictor learns to model the semantics of the world. For each image, the portion outside of the blue box is encoded and given to the predictor as context. The predictor outputs a representation for what it expects to be in the region within the blue box. To visualize the prediction, we train a generative model that produces a sketch of the contents represented by the predictor output, and we show a sample output within the blue box. The predictor recognizes the semantics of what parts should be filled in (the top of the dog’s head, the bird’s leg, the wolf’s legs, the other side of the building).

## Evaluations

I-JEPA pretraining is also computationally efficient. It does not involve any overhead associated with applying more computationally intensive data augmentations to produce multiple views. Only one view of the image needs to be processed by the target encoder, and only the context blocks need to be processed by the context encoder. Empirically, I-JEPA learns strong off-the-shelf semantic representations without the use of hand-crafted view augmentations.

## Pretrained models

| arch. | patch size | resolution | epochs | data | checkpoint | logs | configs |
|---|---|---|---|---|---|---|---|
| ViT-H | 14x14 | 224x224 | 300 | ImageNet-1K | full checkpoint | logs | configs |
| ViT-H | 16x16 | 448x448 | 300 | ImageNet-1K | full checkpoint | logs | configs |
| ViT-H | 14x14 | 224x224 | 66 | ImageNet-22K | full checkpoint | logs | configs |
| ViT-g | 16x16 | 224x224 | 44 | ImageNet-22K | full checkpoint | logs | configs |
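
The patch size and resolution columns together determine how many patch tokens each encoder sees per image. The following is a minimal sketch of that arithmetic, not code from this repository: the `PretrainedConfig` class and the `patch_grid` helper are hypothetical names introduced for illustration, and the count assumes square, non-overlapping patches and covers patch tokens only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainedConfig:
    """Hypothetical container for one row of the pretrained-models table."""
    arch: str
    patch_size: int   # side length of a square patch, in pixels
    resolution: int   # side length of the square input image, in pixels
    data: str


def patch_grid(cfg: PretrainedConfig) -> tuple[int, int]:
    """Return (patches per side, total patch tokens) for a square image.

    Assumes non-overlapping patches, so the resolution must be an exact
    multiple of the patch size.
    """
    if cfg.resolution % cfg.patch_size != 0:
        raise ValueError(
            f"resolution {cfg.resolution} is not a multiple of "
            f"patch size {cfg.patch_size}"
        )
    side = cfg.resolution // cfg.patch_size
    return side, side * side


# Rows transcribed from the table above.
CONFIGS = [
    PretrainedConfig("ViT-H", 14, 224, "ImageNet-1K"),
    PretrainedConfig("ViT-H", 16, 448, "ImageNet-1K"),
    PretrainedConfig("ViT-H", 14, 224, "ImageNet-22K"),
    PretrainedConfig("ViT-g", 16, 224, "ImageNet-22K"),
]

if __name__ == "__main__":
    for cfg in CONFIGS:
        side, tokens = patch_grid(cfg)
        print(
            f"{cfg.arch}/{cfg.patch_size} @ {cfg.resolution}px ({cfg.data}): "
            f"{side}x{side} grid, {tokens} patch tokens"
        )
```

Under these assumptions, the 448x448 ViT-H model works on a 28x28 grid (784 patch tokens), roughly three times the 16x16 grid (256 patch tokens) of the 224x224 ViT-H/14 models, which is worth keeping in mind when budgeting memory for fine-tuning or feature extraction.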