# t2i-MPS
**Repository Path**: py-service/t2i-mps
## Basic Information
- **Project Name**: t2i-MPS
- **Description**: https://github.com/Kwai-Kolors/MPS.git
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-03-04
- **Last Updated**: 2026-03-04
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Learning Multi-dimensional Human Preference for Text-to-Image Generation (CVPR 2024)
This repository contains the code and model for the paper [Learning Multi-dimensional Human Preference for Text-to-Image Generation](https://openaccess.thecvf.com/content/CVPR2024/papers/Zhang_Learning_Multi-Dimensional_Human_Preference_for_Text-to-Image_Generation_CVPR_2024_paper.pdf).
## Installation
Create a virtual environment and install PyTorch:
```bash
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
```
Install the requirements:
```bash
pip install -r requirements.txt
pip install -e .
```
## Inference with MPS
Below is an example of running inference with MPS:
```python
# imports
from io import BytesIO

import torch
from PIL import Image
from transformers import AutoTokenizer, CLIPImageProcessor

# load model
device = "cuda"
processor_name_or_path = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
image_processor = CLIPImageProcessor.from_pretrained(processor_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(processor_name_or_path, trust_remote_code=True)
model_ckpt_path = "outputs/MPS_overall_checkpoint.pth"
model = torch.load(model_ckpt_path)
model.eval().to(device)


def infer_example(images, prompt, condition, clip_model, clip_processor, tokenizer, device):
    def _process_image(image):
        # Accept a dict with raw bytes, raw bytes, or a file path.
        if isinstance(image, dict):
            image = image["bytes"]
        if isinstance(image, bytes):
            image = Image.open(BytesIO(image))
        if isinstance(image, str):
            image = Image.open(image)
        image = image.convert("RGB")
        pixel_values = clip_processor(image, return_tensors="pt")["pixel_values"]
        return pixel_values

    def _tokenize(caption):
        input_ids = tokenizer(
            caption,
            max_length=tokenizer.model_max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        ).input_ids
        return input_ids

    image_inputs = torch.concatenate(
        [_process_image(images[0]).to(device), _process_image(images[1]).to(device)]
    )
    text_inputs = _tokenize(prompt).to(device)
    condition_inputs = _tokenize(condition).to(device)

    with torch.no_grad():
        text_features, image_0_features, image_1_features = clip_model(
            text_inputs, image_inputs, condition_inputs
        )
        # Normalize features, then score each image against the text.
        image_0_features = image_0_features / image_0_features.norm(dim=-1, keepdim=True)
        image_1_features = image_1_features / image_1_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        image_0_scores = clip_model.logit_scale.exp() * torch.diag(
            torch.einsum("bd,cd->bc", text_features, image_0_features)
        )
        image_1_scores = clip_model.logit_scale.exp() * torch.diag(
            torch.einsum("bd,cd->bc", text_features, image_1_features)
        )
        scores = torch.stack([image_0_scores, image_1_scores], dim=-1)
        probs = torch.softmax(scores, dim=-1)[0]
    return probs.cpu().tolist()


img_0, img_1 = "image1.jpg", "image2.jpg"
# infer which image better matches the caption
prompt = "the caption of image"
# condition for the "overall" preference dimension
condition = "light, color, clarity, tone, style, ambiance, artistry, shape, face, hair, hands, limbs, structure, instance, texture, quantity, attributes, position, number, location, word, things."
print(infer_example([img_0, img_1], prompt, condition, model, image_processor, tokenizer, device))
```
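The final step of `infer_example` reduces the comparison to a two-way softmax over the two text-image similarity scores. A minimal pure-Python sketch of that scoring math (the score values below are illustrative dummies, not real model outputs):

```python
import math

def preference_probs(score_0, score_1):
    # Two-way softmax: convert a pair of similarity scores into
    # preference probabilities that sum to 1.
    m = max(score_0, score_1)  # subtract the max for numerical stability
    e0 = math.exp(score_0 - m)
    e1 = math.exp(score_1 - m)
    total = e0 + e1
    return [e0 / total, e1 / total]

# Image 0 scores higher, so it receives the larger preference probability.
probs = preference_probs(21.3, 19.8)
```

Note that only the gap between the two scores matters: adding a constant to both (e.g. a change in `logit_scale`) shifts how sharp the softmax is, not which image wins.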
## Download the MPS checkpoint
The middle columns indicate which preference dimensions are included in the training data:

| ID | Overall | Aesthetics | Alignment | Detail | MPS Model |
|----|---------|------------|-----------|--------|-----------|
| 1  | ✓       | -          | -         | -      | Model Link |
| 2  | ✓       | ✓          | ✓         | ✓      | -          |