As the native implementation exists and supports most features by now, I will mostly stop working on these nodes for anything but their main purpose: early access and testing of potential new features that are difficult (at least for me) to implement natively.
Flowedit and enhance-a-video can be found in these nodes: https://github.com/logtd/ComfyUI-HunyuanLoom
A TeaCache-equivalent FirstBlockCache, as well as torch.compile with LoRA support: https://github.com/chengzeyi/Comfy-WaveSpeed
Sageattention can be enabled with the `--use-sage-attention` startup argument for ComfyUI, or with a patcher node found in KJNodes as well as some other node packs.
Leapfusion I2V can also be used with my patcher node found in KJNodes; example workflow: https://github.com/kijai/ComfyUI-KJNodes/blob/main/example_workflows/leapfusion_hunyuuanvideo_i2v_native_testing.json
What currently remains missing from the native implementation:
It's been a hectic couple of weeks with this model; I've lost track of everything that has happened since the start, but I'll try to present some of the more important updates:
Even though this file is a .pt, it's completely safe as it is loaded with weights_only; the scale map is included with the nodes. To use this model you have to use the `fp8_scaled` quantization option in the model loader.
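If you want to verify the safety claim yourself, the checkpoint can be inspected with `weights_only` loading, which refuses to deserialize anything but plain tensors and basic containers. A minimal sketch; the filename below is just a placeholder for the downloaded file:

```python
import torch

# weights_only=True restricts unpickling to tensors and simple containers,
# so no arbitrary code embedded in the .pt file can execute during loading.
state_dict = torch.load(
    "hunyuan_video_fp8_scaled.pt",  # placeholder path, use your downloaded file
    map_location="cpu",
    weights_only=True,
)
print(f"{len(state_dict)} tensors loaded")
```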
The quality of these weights is much closer to the original bf16; the downside is that they do not currently support fp8 fast mode or LoRAs.
This has a very slight hit on inference speed and zero hit on memory use; initial tests indicate it's absolutely worth using.
https://github.com/user-attachments/assets/e19b30e1-5f67-4e75-9c73-716d4569c319
https://github.com/user-attachments/assets/083353a2-e9aa-43e9-a916-ff3af1d581c1
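For context, per-tensor fp8 scaling roughly works as sketched below: each weight tensor is stored in fp8 together with one scale value (the "scale map") and dequantized on the fly, which costs a little compute at inference time but keeps no full-precision copy in memory. This is an illustration of the idea, not the nodes' exact implementation:

```python
import torch

def quantize_fp8_scaled(w: torch.Tensor):
    # One scale per tensor, chosen so the largest weight maps to the fp8 max.
    scale = w.abs().max().float() / torch.finfo(torch.float8_e4m3fn).max
    w_fp8 = (w.float() / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale

def dequantize_fp8_scaled(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Done per layer at inference time: a small speed cost, but no persistent
    # full-precision copy of the weights kept in memory.
    return (w_fp8.to(torch.float32) * scale).to(torch.bfloat16)

w = torch.randn(256, 256, dtype=torch.bfloat16)
w_fp8, scale = quantize_fp8_scaled(w)
err = (dequantize_fp8_scaled(w_fp8, scale) - w).abs().mean()
print(f"mean abs error: {err.item():.5f}")
```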
Now you can feed an image to the VLM as a condition for generation! This is different from image2video, where the image becomes the first frame of the video. IP2V uses the image as part of the prompt, to extract the concept and style of the image. So, very much like IPAdapter, but the VLM does the heavy lifting for you!
This is a tuning-free approach, but with further task-specific tuning we can expand the use scenarios.
Use the original `xtuner/llava-llama-3-8b-v1_1-transformers` model for image-text tasks, as it includes the vision tower. You have two options: place the model in the `models/LLM` folder, or let it autodownload. Note: it's recommended to offload the text encoder, since the vision tower requires additional VRAM.

Feed the image(s) to the `Hunyuan TextImageEncode` node and reference them in the prompt with `<image>` tags; the number of `<image>` tags should match the number of images provided to the sampler. For example: `Describe this <image> in great detail.`
You can also choose to give CLIP a separate prompt that does not reference the image.
The `image_token_selection_expression` input is for advanced users and serves as a boolean mask to select which part of the image hidden state will be used for conditioning. Some details and recommendations (a sketch of how these expressions select tokens follows the list):

- The default is `::4`, meaning one of every four tokens goes into conditioning, interleaved, resulting in 144 tokens per image.
- `::2` - one of every two tokens, interleaved (288 tokens per image).
- `:128` - first 128 tokens.
- `-128:` - last 128 tokens.
- `:128, -128:` - first 128 tokens and last 128 tokens.
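The sketch below shows how such comma-separated slice expressions can be turned into a token mask over the image hidden state. The token count (576) and hidden size are illustrative assumptions, and this is not the node's exact parser:

```python
import torch

def select_image_tokens(hidden: torch.Tensor, expression: str) -> torch.Tensor:
    # hidden: (num_image_tokens, dim). Each comma-separated part of the
    # expression is a Python-style slice; the union of all parts is kept.
    num_tokens = hidden.shape[0]
    mask = torch.zeros(num_tokens, dtype=torch.bool)
    for part in expression.split(","):
        start, _, rest = part.partition(":")
        stop, _, step = rest.partition(":")
        mask[slice(
            int(start) if start.strip() else None,
            int(stop) if stop.strip() else None,
            int(step) if step.strip() else None,
        )] = True
    return hidden[mask]

hidden = torch.randn(576, 4096)                          # e.g. 24x24 image tokens
print(select_image_tokens(hidden, "::4").shape)          # -> (144, 4096)
print(select_image_tokens(hidden, ":128, -128:").shape)  # -> (256, 4096)
```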
Scaled dot product attention (sdpa) should now be working (only tested on Windows, torch 2.5.1+cu124 on a 4090). Sageattention is still recommended for speed, but should no longer be necessary, making installation much easier.
Vid2vid test: source video
https://github.com/user-attachments/assets/12940721-4168-4e2b-8a71-31b4b0432314
text2vid (old test):
https://github.com/user-attachments/assets/3750da65-9753-4bd2-aae2-a688d2b86115
Transformer and VAE (single files, no autodownload):
https://huggingface.co/Kijai/HunyuanVideo_comfy/tree/main
The files go to the usual ComfyUI folders (`diffusion_models` and `vae`).
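If you prefer to script the download, a sketch using `huggingface_hub` is below; the filenames are only examples, pick the actual variants you want from the repo listing:

```python
from huggingface_hub import hf_hub_download

# Example filenames; check the repo listing for the exact variant you want.
hf_hub_download(
    repo_id="Kijai/HunyuanVideo_comfy",
    filename="hunyuan_video_720_fp8_e4m3fn.safetensors",  # example transformer file
    local_dir="ComfyUI/models/diffusion_models",
)
hf_hub_download(
    repo_id="Kijai/HunyuanVideo_comfy",
    filename="hunyuan_video_vae_bf16.safetensors",  # example VAE file
    local_dir="ComfyUI/models/vae",
)
```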
LLM text encoder (has autodownload):
https://huggingface.co/Kijai/llava-llama-3-8b-text-encoder-tokenizer
Files go to ComfyUI/models/LLM/llava-llama-3-8b-text-encoder-tokenizer
Clip text encoder (has autodownload):
Either use any Clip_L model supported by ComfyUI by disabling the clip_model in the text encoder loader and plugging a ClipLoader into the text encoder node, or allow the autodownloader to fetch the original CLIP model from:
https://huggingface.co/openai/clip-vit-large-patch14 (only the .safetensors weights file and the config files are needed) to:
ComfyUI/models/clip/clip-vit-large-patch14
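If you'd rather pre-fetch the CLIP model manually instead of relying on the autodownloader, a sketch using `huggingface_hub` that grabs only the safetensors weights and the config/tokenizer files into the expected folder:

```python
from huggingface_hub import snapshot_download

# Fetch only the weights and config/tokenizer files into the folder
# the text encoder loader expects.
snapshot_download(
    repo_id="openai/clip-vit-large-patch14",
    local_dir="ComfyUI/models/clip/clip-vit-large-patch14",
    allow_patterns=["*.safetensors", "*.json", "*.txt"],
)
```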
Memory use is entirely dependent on resolution and frame count; don't expect to be able to go very high even on 24GB.
The good news is that the model can produce functional videos even at really low resolutions.
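As a rough rule of thumb, the latent the sampler works on scales linearly with frame count and with pixel area, which is why both matter so much. A small sketch, assuming the usual HunyuanVideo VAE compression factors (4x temporal, 8x8 spatial, 16 latent channels; treat these numbers as an assumption):

```python
# Rough estimate of latent size vs. resolution and frame count, assuming
# 4x temporal and 8x8 spatial VAE compression with 16 latent channels.
def latent_shape(width: int, height: int, num_frames: int, channels: int = 16):
    t = (num_frames - 1) // 4 + 1  # frame counts of the form 4k+1
    return (channels, t, height // 8, width // 8)

for w, h, f in [(512, 320, 85), (960, 544, 97), (1280, 720, 129)]:
    c, t, lh, lw = latent_shape(w, h, f)
    print(f"{w}x{h}, {f} frames -> latent {c}x{t}x{lh}x{lw} = {c*t*lh*lw:,} elements")
```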