ComfyUI IPAdapter plus

ComfyUI reference implementation for IPAdapter models.

IPAdapter implementation that follows the ComfyUI way of doing things. The code is memory efficient, fast, and shouldn't break with Comfy updates.

Important updates

2024/02/02: Added experimental tiled IPAdapter. It lets you easily handle reference images that are not square. Can be useful for upscaling.

2024/01/19: Support for FaceID Portrait models.

2024/01/16: Notably increased quality of FaceID Plus/v2 models. Check the comparison of all face models.

2023/12/30: Added support for FaceID Plus v2 models. Important: this update again breaks the previous implementation. This time I had to make a new node just for FaceID. The base IPAdapter Apply node will work with all previous models; for all FaceID models you'll find an IPAdapter Apply FaceID node. When using v2 remember to check the v2 options otherwise it won't work as expected! As always the examples directory is full of workflows for you to play with.

2023/12/28: Added support for FaceID Plus models. Important: this update breaks the previous implementation of FaceID. Check the updated workflows in the example directory! Remember to refresh the browser ComfyUI page to clear up the local cache.

2023/12/22: Added support for FaceID models. Read the documentation for details.

2023/12/05: Added batch embeds node. This lets you encode images in batches and merge them together into an IPAdapter Apply Encoded node. Useful mostly for animations because the clip vision encoder takes a lot of VRAM. My suggestion is to split the animation into batches of about 120 frames.

2023/11/29: Added unfold_batch option to send the reference images sequentially to a latent batch. Useful for animations.

(previous updates removed for better readability)

What is it?

IPAdapter models are very powerful tools for image-to-image conditioning. Given a reference image you can generate variations augmented by text prompts, controlnets and masks. Think of it as a 1-image lora.

Example workflow

IPAdapter Example workflow

The example directory has many workflows that cover all IPAdapter functionalities.

Video Tutorials

Basic usage video

Advanced features video

Attention Masking video

Animation Features video

Installation

Download or git clone this repository inside the ComfyUI/custom_nodes/ directory, or use the Manager. Beware that the Manager's automatic update sometimes doesn't work and you may need to upgrade manually.

The pre-trained models are available on Hugging Face. Download them and place them in the ComfyUI/models/ipadapter directory (create it if not present). You can also use any custom location by setting an ipadapter entry in the extra_model_paths.yaml file.

IPAdapter also needs the image encoders: CLIP-ViT-H-14-laion2B-s32B-b79K and CLIP-ViT-bigG-14-laion2B-39B-b160k. You may already have them. If you don't, download them, but be careful because both files have the same name! Rename them to something easy to remember and place them in the ComfyUI/models/clip_vision/ directory.

The following table shows the combination of checkpoint and image encoder to use for each IPAdapter model. Any tensor size mismatch you get is likely caused by a wrong combination.

SD v.	IPadapter	Img encoder	Notes
v1.5	ip-adapter_sd15	ViT-H	Basic model, average strength
v1.5	ip-adapter_sd15_light	ViT-H	Light model, very light impact
v1.5	ip-adapter-plus_sd15	ViT-H	Plus model, very strong
v1.5	ip-adapter-plus-face_sd15	ViT-H	Face model, use only for faces
v1.5	ip-adapter-full-face_sd15	ViT-H	Stronger face model, not necessarily better
v1.5	ip-adapter_sd15_vit-G	ViT-bigG	Base model trained with a bigG encoder
SDXL	ip-adapter_sdxl	ViT-bigG	Base SDXL model, mostly deprecated
SDXL	ip-adapter_sdxl_vit-h	ViT-H	New base SDXL model
SDXL	ip-adapter-plus_sdxl_vit-h	ViT-H	SDXL plus model, stronger
SDXL	ip-adapter-plus-face_sdxl_vit-h	ViT-H	SDXL face model
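The table above can also be kept as a small lookup to catch a wrong pairing before it shows up as a tensor size mismatch. This is only a sketch: the dictionary and the `check_combination` helper are illustrative names and are not part of this extension.

```python
# Checkpoint family and image encoder for each IPAdapter model,
# transcribed from the table above. Names below are illustrative only.
IPADAPTER_ENCODERS = {
    "ip-adapter_sd15": ("v1.5", "ViT-H"),
    "ip-adapter_sd15_light": ("v1.5", "ViT-H"),
    "ip-adapter-plus_sd15": ("v1.5", "ViT-H"),
    "ip-adapter-plus-face_sd15": ("v1.5", "ViT-H"),
    "ip-adapter-full-face_sd15": ("v1.5", "ViT-H"),
    "ip-adapter_sd15_vit-G": ("v1.5", "ViT-bigG"),
    "ip-adapter_sdxl": ("SDXL", "ViT-bigG"),
    "ip-adapter_sdxl_vit-h": ("SDXL", "ViT-H"),
    "ip-adapter-plus_sdxl_vit-h": ("SDXL", "ViT-H"),
    "ip-adapter-plus-face_sdxl_vit-h": ("SDXL", "ViT-H"),
}


def check_combination(ipadapter, sd_version, encoder):
    """Return a list of problems; an empty list means the combination matches the table."""
    if ipadapter not in IPADAPTER_ENCODERS:
        return [f"unknown IPAdapter model: {ipadapter}"]
    expected_sd, expected_encoder = IPADAPTER_ENCODERS[ipadapter]
    problems = []
    if sd_version != expected_sd:
        problems.append(f"{ipadapter} expects a {expected_sd} checkpoint, got {sd_version}")
    if encoder != expected_encoder:
        problems.append(f"{ipadapter} expects the {expected_encoder} encoder, got {encoder}")
    return problems


if __name__ == "__main__":
    # The old SDXL base model needs ViT-bigG, so pairing it with ViT-H is reported.
    print(check_combination("ip-adapter_sdxl", "SDXL", "ViT-H"))
```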

FaceID requires insightface; you need to install it in your ComfyUI environment. Check this issue for help.

When the dependencies are satisfied you need:

SD v.	IPadapter	Img encoder	Lora
v1.5	FaceID	(not used¹)	FaceID Lora
v1.5	FaceID Plus	ViT-H	FaceID Plus Lora
v1.5	FaceID Plus v2	ViT-H	FaceID Plus v2 Lora
v1.5	FaceID Portrait	(not used¹)	not needed
SDXL	FaceID	(not used¹)	FaceID SDXL Lora
SDXL	FaceID Plus v2	ViT-H	FaceID SDXL Lora

¹ The base FaceID model doesn't make use of a CLIP vision encoder. Remember to pair any FaceID model together with any other Face model to make it more effective.

The loras need to be placed in the ComfyUI/models/loras/ directory.
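If something isn't found at load time, a quick check of the directories mentioned above can help. The following is a minimal sketch; `missing_dirs` is a hypothetical helper, and it assumes the default layout (if you use a custom location through extra_model_paths.yaml, a missing models/ipadapter directory is expected).

```python
from pathlib import Path

# Directories named in the installation notes, relative to the ComfyUI root.
EXPECTED_DIRS = [
    "custom_nodes",
    "models/ipadapter",
    "models/clip_vision",
    "models/loras",
]


def missing_dirs(comfy_root):
    """Return the expected directories that do not exist under comfy_root."""
    root = Path(comfy_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # "ComfyUI" is a placeholder; point it at your own installation.
    for name in missing_dirs("ComfyUI"):
        print(f"missing: {name}")
```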

How to

There's a basic workflow included in this repo and a few examples in the examples directory. Usually it's a good idea to lower the weight to 0.8 or less.

The noise parameter is an experimental exploitation of the IPAdapter models. You can set it as low as 0.01 for an arguably better result.

More info about the noise option

Basically the IPAdapter sends two pictures for the conditioning: one is the reference, the other (that you don't see) is an empty image that could be considered a kind of negative conditioning.

What I'm doing is sending a very noisy image instead of an empty one. The noise parameter determines the amount of noise that is added. A value of 0.01 adds a lot of noise (more noise == less impact because the model doesn't get it); a value of 1.0 removes most of the noise so the generated image gets conditioned more.
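To make the inverse relation concrete, here is a toy sketch of the idea. It is not the code used by the node: the function names, the linear mapping and the constants `max_std` and `floor` are all assumptions picked only to show that a lower noise value means a noisier negative image.

```python
import random


def noise_scale(noise, max_std=0.25, floor=0.05):
    """Standard deviation of the added noise (hypothetical mapping and constants).

    A low `noise` value gives a large deviation, a value of 1.0 leaves only `floor`.
    """
    if not 0.0 < noise <= 1.0:
        raise ValueError("noise must be in (0, 1]")
    return max_std * (1.0 - noise) + floor


def noisy_negative(pixels, noise, seed=0):
    """Add Gaussian noise to a flat list of values standing in for the negative image."""
    rng = random.Random(seed)
    std = noise_scale(noise)
    return [p + rng.gauss(0.0, std) for p in pixels]


if __name__ == "__main__":
    empty = [0.0] * 4  # stand-in for the empty image
    print(noisy_negative(empty, noise=0.01))  # heavily perturbed
    print(noisy_negative(empty, noise=1.0))   # only slightly perturbed
```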

Preparing the reference image

The reference image needs to be encoded by the CLIP vision model. The encoder resizes the image to 224×224 and crops it to the center! It's not an IPAdapter thing, it's how the clip vision works. This means that if you use a portrait or landscape image and the main subject (e.g. the face of a character) is not in the middle, you'll likely get undesired results. Use square pictures as reference for more predictable results.

I've added a PrepImageForClipVision node that does all the required operations for you. You just have to select the crop position (top/left/center/etc...) and a sharpening amount if you want.
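The effect of the crop position is easy to see with a little arithmetic. The sketch below only computes the square region that would be kept before the resize to 224×224; `square_crop_box` is a hypothetical helper, not the node's code, and the position names other than top, left and center are assumptions.

```python
def square_crop_box(width, height, position="center"):
    """Return (left, top, right, bottom) of the largest square crop at the given position."""
    side = min(width, height)
    extra_x = width - side   # horizontal pixels that must be dropped
    extra_y = height - side  # vertical pixels that must be dropped
    if position == "center":
        left, top = extra_x // 2, extra_y // 2
    elif position == "top":
        left, top = extra_x // 2, 0
    elif position == "bottom":
        left, top = extra_x // 2, extra_y
    elif position == "left":
        left, top = 0, extra_y // 2
    elif position == "right":
        left, top = extra_x, extra_y // 2
    else:
        raise ValueError(f"unknown position: {position}")
    return (left, top, left + side, top + side)


if __name__ == "__main__":
    # A 512x768 portrait: the center crop drops the top and bottom 128 rows,
    # while the "top" crop keeps the head of a standing character.
    print(square_crop_box(512, 768, "center"))
    print(square_crop_box(512, 768, "top"))
```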

In the image below you can see the difference between prepped and not prepped images.

KSampler configuration suggestions

The IPAdapter generally requires a few more steps than usual. If the result is underwhelming, try adding 10+ steps. The model tends to burn the images a little; if needed, lower the CFG scale.

The noise option generally grants better results, experiment with it.

IPAdapter + ControlNet

The model is very effective when paired with a ControlNet. In the example below I experimented with Canny. The workflow is in the examples directory.

IPAdapter Face

IPAdapter offers an interesting model for a kind of "face swap" effect. The workflow is provided. Set a close up face as reference image and then input your text prompt as always. The generated character should have the face of the reference. It also works with img2img given a high denoise.

Note: there's a new full-face model available that's arguably better.

Masking (Inpainting)

The most effective way to apply the IPAdapter to a region is with an inpainting workflow. Remember to use a specific checkpoint for inpainting, otherwise it won't work. Even if you are inpainting a face, I find that the IPAdapter-Plus (not the face one) works best.

Image Batches

It is possible to pass multiple images for the conditioning with the Batch Images node. An example workflow is provided; in the picture below you can see the result of one and two images conditioning.

It seems to be effective with 2-3 images, beyond that it tends to blur the information too much.

Image Weighting

When sending multiple images you can increase/decrease the weight of each image by using the IPAdapterEncoder node. The workflow (included in the examples) looks like this:

The node accepts 4 images, but remember that you can send batches of images to each slot.

Weight types

You can choose how the IPAdapter weight is applied to the image embeds. Options are:

original: The weight is applied to the aggregated tensors. The weight works predictably for values greater and lower than 1.
linear: The weight is applied to the individual tensors before aggregating them. Compared to original the influence is weaker when weight is <1 and stronger when >1. Note: at weight 1 the two methods are equivalent.
channel penalty: This method is a modified version of Lvmin Zhang's (Fooocus). Results are sometimes sharper. It works very well also when weight is >1. Still experimental, may change in the future.
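The difference between original and linear can be pictured with a toy example. This is a sketch under a big assumption: the "aggregation" is modeled here as a tiny softmax attention over scalar keys and values, which is my simplification and not necessarily what the node does. It only shows where the weight enters in each case and that the two agree at weight 1.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Toy aggregation: softmax(query * key) weighted sum of the values."""
    probs = softmax([query * k for k in keys])
    return sum(p * v for p, v in zip(probs, values))


def weight_original(query, keys, values, weight):
    """Weight applied to the aggregated result."""
    return weight * attend(query, keys, values)


def weight_linear(query, keys, values, weight):
    """Weight applied to each tensor before aggregating."""
    return attend(query, [weight * k for k in keys], [weight * v for v in values])


if __name__ == "__main__":
    q, ks, vs = 1.0, [0.0, 1.0], [1.0, 3.0]
    for w in (0.5, 1.0, 1.5):
        print(w, weight_original(q, ks, vs, w), weight_linear(q, ks, vs, w))
```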

The image below shows the difference (zoom in).

In the examples directory you can find a workflow that lets you easily compare the three methods.

Attention masking

It's possible to add a mask to define the area where the IPAdapter will be applied. Everything outside the mask will ignore the reference images and will only listen to the text prompt.

It is suggested to use a mask of the same size as the final generated image.
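As a rough illustration of why the mask size matters, the sketch below resamples a mask to a target grid with nearest-neighbor lookup and then lets the mask gate a contribution element by element. Both helpers are hypothetical and work on plain nested lists; the node uses its own resizing and the real blending happens inside the model, so treat this only as a picture of the idea.

```python
def resize_mask_nearest(mask, out_h, out_w):
    """Resample a 2D mask (list of rows) to out_h x out_w with nearest-neighbor lookup."""
    in_h, in_w = len(mask), len(mask[0])
    return [
        [mask[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def apply_masked(base, adapter, mask):
    """Element-wise base + mask * adapter: where the mask is 0 only the base remains."""
    return [
        [b + m * a for b, a, m in zip(base_row, adapter_row, mask_row)]
        for base_row, adapter_row, mask_row in zip(base, adapter, mask)
    ]


if __name__ == "__main__":
    mask = [[1, 0], [1, 0]]  # left half on, right half off
    print(resize_mask_nearest(mask, 4, 4))
    print(apply_masked([[1, 1]], [[5, 5]], [[1, 0]]))
```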

In the picture below I use two reference images masked one on the left and the other on the right. The image is generated only with IPAdapter and one ksampler (without in/outpainting or area conditioning).

It is also possible to send a batch of masks that will be applied to a batch of latents, one per frame. The size should be the same but if needed some normalization will be performed to avoid errors. This feature also supports (experimentally) AnimateDiff including context sliding.

In the examples directory you'll find a couple of masking workflows: simple and two masks.

Timestepping

In the Apply IPAdapter node you can set a start and an end point. The IPAdapter will be applied exclusively in that timeframe of the generation. This is a very powerful tool to modulate the intensity of IPAdapter models.
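One way to think about the start and end points is as fractions of the sampling process. The helper below is a hypothetical sketch that maps them to step indices; the parameter names and the step-based mapping are assumptions made for illustration, and the node may map the range differently internally.

```python
def active_steps(total_steps, start=0.0, end=1.0):
    """Return the step indices whose progress (index / total_steps) lies in [start, end)."""
    if total_steps < 1:
        raise ValueError("total_steps must be at least 1")
    if not 0.0 <= start <= end <= 1.0:
        raise ValueError("expected 0 <= start <= end <= 1")
    return [i for i in range(total_steps) if start <= i / total_steps < end]


if __name__ == "__main__":
    # With 20 steps, applying the IPAdapter only in the first half
    # leaves the last 10 steps to the text prompt alone.
    print(active_steps(20, start=0.0, end=0.5))
```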

Tiled IPAdapter

This is an experimental node that automatically splits a reference image in quadrants. It can be especially useful when the reference image is not in 1:1 ratio as the Clip Vision encoder only works with 224x224 square images.

The short_side_tiles parameter defines the number of tiles to use for the shorter side of the reference image; the number of tiles for the other side is calculated automatically. If the image is in landscape or portrait mode, that generally means that only 2 tiles are created. If the image is a square, the value must be at least 2 for the node to have any meaningful effect.
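The tile count can be estimated with a small calculation. This is a sketch, not the node's code: `tile_grid` is a hypothetical helper and the rounding rule (round the long side up) is an assumption, chosen because it gives 2 tiles for common landscape and portrait ratios when short_side_tiles is 1.

```python
def tile_grid(width, height, short_side_tiles=1):
    """Return (columns, rows) of square-ish tiles covering the reference image.

    The long side gets ceil(long / short * short_side_tiles) tiles (assumed rule).
    """
    if width < 1 or height < 1 or short_side_tiles < 1:
        raise ValueError("width, height and short_side_tiles must be positive")
    short, long = min(width, height), max(width, height)
    long_side_tiles = -(-long * short_side_tiles // short)  # integer ceiling division
    if width >= height:
        return (long_side_tiles, short_side_tiles)
    return (short_side_tiles, long_side_tiles)


if __name__ == "__main__":
    print(tile_grid(768, 512, 1))  # landscape: two tiles side by side
    print(tile_grid(512, 512, 1))  # square: a single tile, so no real tiling
    print(tile_grid(512, 512, 2))  # square: four tiles
```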

If the aspect ratios of the reference image and the latent are very different, the image will almost certainly be stretched or squished.

tile_weight is very important any time the short_side_tiles value is greater than 1. The default value of 0.6 should be good for most scenarios but you may need to lower it a little (0.5) for SDXL. This parameter has no effect when short_side_tiles is 1.

The main IPAdapter weight should also be lowered, 0.7 is a good starting point.

Important: With a high number of tiles we are going to reiterate on the same concept multiple times, so I suggest lowering the CFG or, better, using the CFG Rescale node. Please check the workflow in the examples directory for reference.

The node is experimental and will likely change in the future.

FaceID

FaceID is a new IPAdapter model that takes the embeddings from InsightFace. As such you need to install insightface in your ComfyUI python environment. You may also need onnxruntime and onnxruntime-gpu. Note that your CUDA version might not be compatible with onnxruntime; in that case you can select the "CPU" provider from the Load InsightFace model node.
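To see what is actually installed, you can run a short check with the same Python interpreter that ComfyUI uses. This is a minimal sketch with hypothetical helper names; it only relies on `importlib` and on `onnxruntime.get_available_providers()`. If the provider list does not contain a CUDA provider, the "CPU" option is the safe choice.

```python
import importlib.util


def faceid_dependency_report():
    """Return {module_name: True/False} for the modules FaceID relies on."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in ("insightface", "onnxruntime")
    }


def available_onnx_providers():
    """Return the execution providers onnxruntime reports, or [] if it is not installed."""
    if importlib.util.find_spec("onnxruntime") is None:
        return []
    import onnxruntime  # imported lazily so the check works without it

    return list(onnxruntime.get_available_providers())


if __name__ == "__main__":
    print(faceid_dependency_report())
    print(available_onnx_providers())
```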

The first time you use InsightFace the model will be downloaded automatically; check the console to see the progress. If you get an error, you need to download the buffalo_l model manually into the ComfyUI/models/insightface/models directory. Also, every time you run the workflow for the first time InsightFace will take quite a few seconds to load.

The FaceID model is used in conjunction with its Lora! Check the installation instructions for the links to all models.

The reference image needs to be prepared differently compared to the other IPAdapter face models. While standard face models expect the face to take basically the whole frame, FaceID prefers the subject to be a little further away. Don't cut the face too close and leave hair, beard, ears, neck in the picture.

InsightFace will often fail to detect the face and it will throw an error. Try with a different picture possibly cut to half-bust. FaceID generally works with drawings/illustrations too and the result is often very nice.

I just implemented the FaceID code so I don't have best practices yet and more testing is needed. It's important to understand that FaceID can (and should) be used as a first pass for an additional IPAdapter Face model.

In the examples directory you'll find a few workflows to get you started with FaceID.

The following would be a basic workflow that includes FaceID enhanced by a Plus Face model.

Troubleshooting

Please check the troubleshooting before posting a new issue.

Diffusers version