Running Flux on a budget GPU is now possible thanks to GGUF quantization, a compression technique that shrinks model files with little visible loss in quality. Instead of needing 24GB of VRAM for the full Flux Dev model, a properly quantized GGUF version fits comfortably in 7–10GB. This guide explains how GGUF works, how to install it in ComfyUI, and which quantization level suits your hardware.
The real advantage here: GGUF workflows in ComfyUI preserve nearly all visual fidelity while freeing up memory for larger images and faster iteration. Whether you’re working with an 8GB GPU or pushing a 16GB card, knowing the difference between GGUF and safetensors will help you pick the right format for your needs.
At a Glance: GGUF vs. Safetensors
| Aspect | Safetensors | GGUF |
|---|---|---|
| File Size | 24GB (Flux Dev) | 6–13GB (depends on quantization) |
| VRAM Usage | ~24GB | ~7–13GB |
| Image Quality | Maximum (uncompressed) | Very good to excellent (imperceptible loss at Q4+) |
| GPU Compatibility | Yes | Yes |
| CPU Compatibility | Limited | Full support |
| Installation | Native to ComfyUI | Requires ComfyUI-GGUF custom node |
| Speed (GPU) | Baseline | Comparable (occasionally slightly slower) |
| Speed (CPU) | Impractical | Viable but slow (minutes per image rather than seconds) |
What Is GGUF and Where Did It Come From?
GGUF started as a file format in the llama.cpp ecosystem, built to store quantized large language models efficiently on consumer hardware. The format has since evolved to support diffusion models—the kind that generate images—and it’s become a core part of ComfyUI workflows.
At its core, GGUF is a container for quantized model weights. Quantization is the process of reducing the numerical precision of a neural network’s parameters. Think of it like converting a RAW photograph (32-bit precision, massive file size) to a high-quality JPEG (looks nearly identical, way smaller). Instead of storing weights as 32-bit or 16-bit floating-point numbers, GGUF compresses them to 8, 6, 5, 4, 3, or even 2 bits per weight.
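To make the idea of reduced precision concrete, here is a minimal sketch of block quantization in plain Python. It is a toy scheme chosen for illustration (one shared scale per block, signed 4-bit codes), not the actual layout GGUF's K-quants use, and the function names are hypothetical.

```python
"""Toy 4-bit block quantization (illustrative only, not GGUF's real K-quant layout)."""


def quantize_block(weights, bits=4):
    """Map floats to signed integer codes that share one scale factor."""
    qmax = 2 ** (bits - 1) - 1            # 7 for signed 4-bit codes
    peak = max(abs(w) for w in weights)
    scale = peak / qmax if peak else 1.0  # avoid dividing by zero on all-zero blocks
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the integer codes."""
    return [scale * c for c in codes]


if __name__ == "__main__":
    block = [0.12, -0.53, 0.98, -0.07, 0.33, -0.91, 0.45, 0.02]
    scale, codes = quantize_block(block)
    restored = dequantize_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print("codes:", codes)
    print("worst rounding error:", round(worst, 4))
```

Each weight is stored as a small integer plus a share of one scale value, which is where the size savings come from. The rounding error per weight is at most half the scale, which is why moderate bit widths stay close to the original.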
The practical result: a Flux Dev model that normally weighs 24GB in safetensors format compresses down to 7GB in GGUF Q4_K_M format. That’s a 70% reduction in file size and memory footprint—the difference between “impossible to run” and “runs smoothly on a mid-range GPU.”
💡 Why this matters: GGUF is a quantization format that shrinks model weights by roughly 50–75% while maintaining near-identical visual quality, making large models like Flux accessible on 8GB GPUs.
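The savings can be checked with quick arithmetic. The sketch below uses only the approximate file sizes quoted in this guide (24GB for the full model, and the example GGUF sizes listed in the download step); the helper name is hypothetical.

```python
BASELINE_GB = 24  # full Flux Dev size quoted in this guide

# Approximate GGUF file sizes quoted in this guide, in GB.
GGUF_SIZES_GB = {"Q8_0": 13, "Q5_K_M": 10, "Q4_K_M": 7, "Q3_K_M": 6}


def percent_saved(size_gb, baseline_gb=BASELINE_GB):
    """Return how much smaller a file is than the baseline, as a whole percent."""
    return round(100 * (1 - size_gb / baseline_gb))


if __name__ == "__main__":
    for name, size in GGUF_SIZES_GB.items():
        print(f"{name}: {size}GB, about {percent_saved(size)}% smaller than {BASELINE_GB}GB")
```

With these rounded sizes, Q4_K_M comes out at about 71% smaller, in line with the 70% figure above.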
GGUF vs. Safetensors: Understanding the Trade-offs
Both GGUF and safetensors are model formats you’ll encounter in ComfyUI. Understanding their differences helps you choose the right one for your setup.
Safetensors is ComfyUI’s standard format. It stores weights uncompressed or minimally compressed, preserving maximum quality but consuming significant VRAM. You get the highest fidelity, but a Flux Dev model requires the full 24GB.
GGUF trades a small amount of quality for dramatic memory savings. At quantization levels like Q4_K_M and Q5_K_M, the quality loss is imperceptible to the human eye. For most users running on 8GB VRAM, this trade is absolutely worth it.
Choosing between GGUF and safetensors in ComfyUI comes down to your hardware constraints:
- Limited VRAM (6–12GB): Use GGUF. You won’t run Flux otherwise.
- Ample VRAM (16GB+): Safetensors if quality is paramount; GGUF if you want faster loading and disk space savings.
💡 Rule of thumb: GGUF saves 50–70% of VRAM with minimal quality loss; safetensors offers maximum quality but requires significantly more memory.
Quantization Levels Explained: Choosing the Right Compression
GGUF supports multiple quantization levels, each representing a different balance between file size and quality. Here’s the hierarchy, from most to least aggressive compression:
Q2_K (~20% of original size)
Extreme compression for severe VRAM constraints. Quality degrades noticeably. Only worth considering when no larger file fits in your VRAM, which is rare in modern workflows.
Q3_K_S / Q3_K_M / Q3_K_L (~25% of original size)
Strong compression with visible artifacts in fine details. Useful for 4–6GB GPUs but not ideal for production work.
Q4_K_S / Q4_K_M (~30% of original size)
The sweet spot for most users. Minimal quality degradation. Q4_K_M is slightly better than Q4_K_S. This is the most popular choice for 6–8GB GPUs running Flux.
Q5_K_S / Q5_K_M (~40% of original size)
Barely perceptible difference from the original. Excellent if you have the VRAM to spare. Recommended for 10–12GB GPUs.
Q6_K (~50% of original size)
Light compression. Quality very close to safetensors. Good for 16GB+ GPUs when you want compression without sacrificing fidelity.
Q8_0 (~55% of original size)
Minimal compression. Closest to original quality with some VRAM savings. Rarely used because Q5_K_M offers better compression at similar quality.
F16 (100% of original, no quantization)
Not quantization at all, just the model in float16 format. Equivalent to safetensors in quality and VRAM usage. Useful as a reference point.
Quantization Recommendations by GPU
| GPU VRAM | Recommended Quantization | File Size | Use Case |
|---|---|---|---|
| 6–8GB | Q4_K_M | ~7GB | Budget GPUs; Flux on 8GB VRAM |
| 10–12GB | Q5_K_M | ~10GB | Mid-range GPUs; nearly imperceptible quality loss |
| 16GB+ | Q5_K_M or Q6_K | ~10–12GB | High-end GPUs; maximum quality with compression |
These figures are approximate. Actual VRAM usage depends on image resolution, sampler choice, and whether other nodes run simultaneously. A 512×512 generation uses less memory than a 1024×1024 one.
💡 What I’d pick: Q4_K_M is the best choice for 8GB VRAM; Q5_K_M is nearly indistinguishable from the original and works well on 10–12GB GPUs.
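The recommendation table can be expressed as a small lookup. This is a sketch of the guide's own tiers, not a measurement: the function name is hypothetical, and the handling of sizes the table does not list (below 6GB, or between tiers) is an assumption that rounds down to the safer option.

```python
def recommend_quant(vram_gb):
    """Return the quantization level this guide suggests for a given VRAM size."""
    if vram_gb < 6:
        return "Q3_K_M"  # below the table's range: expect visible artifacts
    if vram_gb < 10:
        return "Q4_K_M"  # 6-8GB tier
    if vram_gb < 16:
        return "Q5_K_M"  # 10-12GB tier
    return "Q6_K"        # 16GB+ tier (the table also lists Q5_K_M here)


if __name__ == "__main__":
    for vram in (4, 8, 12, 24):
        print(f"{vram}GB -> {recommend_quant(vram)}")
```

Treat the result as a starting point. Higher resolutions or extra nodes in the workflow may call for stepping down one level.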
ComfyUI GGUF Install: Step-by-Step Guide
Getting GGUF models running in ComfyUI requires a custom node. The process is straightforward and takes about 5 minutes.
Step 1: Install the ComfyUI-GGUF Custom Node
- Open ComfyUI Manager (the button in the top menu bar, next to “Queue Prompt”).
- Search for “GGUF” in the search bar.
- Find ComfyUI-GGUF by city96 (the most reliable and widely used version).
- Click “Install”.
- Restart ComfyUI completely.
The custom node adds new nodes to your workflow: UnetLoaderGGUF, GGUFModelLoader, and related utilities. You’ll see these appear in the node menu after restart.
Step 2: Download a GGUF Model
GGUF models are primarily hosted on HuggingFace. The most reliable repositories are:
- city96 (GitHub/HuggingFace): Maintains Flux GGUF versions and other optimized models. Highly trusted in the ComfyUI community.
- Comfy-Org: Official quantized model repositories with verified quality.
Example Flux Dev GGUF file sizes:
- flux1-dev-Q8_0.gguf (~13GB)
- flux1-dev-Q5_K_M.gguf (~10GB)
- flux1-dev-Q4_K_M.gguf (~7GB)
- flux1-dev-Q3_K_M.gguf (~6GB)
Download the version that fits your VRAM. For 8GB GPUs, Q4_K_M is the standard choice.
Step 3: Place the File in the Models Directory
Move the .gguf file to either:
- ComfyUI/models/unet/
- ComfyUI/models/diffusion_models/
Both paths work. Create the folder if it doesn’t exist. The file should be accessible from ComfyUI’s node dropdown after placement.
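If you prefer to script this step, the following sketch moves a downloaded file into place and creates the folder when it is missing. The function name and the example paths are hypothetical; adjust them to wherever your ComfyUI installation and download actually live.

```python
from pathlib import Path
import shutil


def install_gguf(gguf_file, comfyui_root, subfolder="unet"):
    """Move a downloaded .gguf file into <comfyui_root>/models/<subfolder>/."""
    source = Path(gguf_file)
    if source.suffix.lower() != ".gguf":
        raise ValueError(f"{source.name} is not a .gguf file")
    target_dir = Path(comfyui_root) / "models" / subfolder
    target_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    target = target_dir / source.name
    shutil.move(str(source), str(target))
    return target


# Example call (paths are placeholders):
# install_gguf(Path.home() / "Downloads" / "flux1-dev-Q4_K_M.gguf", Path.home() / "ComfyUI")
```

Pass `subfolder="diffusion_models"` to use the second location instead.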
Step 4: Update Your Workflow
In your ComfyUI workflow:
- Replace the standard UNETLoader node with UnetLoaderGGUF (or GGUFModelLoader, depending on your custom node version).
- Select your GGUF file from the dropdown.
- Leave everything else unchanged: VAE, CLIP encoders, samplers, and all downstream nodes stay the same.
The GGUF loader is a drop-in replacement. No rewiring needed.
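For workflows exported in ComfyUI's API format (a JSON object mapping node ids to `class_type` and `inputs`), the same swap can be sketched in code. The class names `UNETLoader` and `UnetLoaderGGUF` and the `unet_name` input key are assumptions here; check them against the nodes your installed version actually exposes. The helper name is hypothetical.

```python
import copy


def swap_to_gguf_loader(workflow, gguf_filename):
    """Return a copy of an API-format workflow with UNETLoader nodes replaced."""
    patched = copy.deepcopy(workflow)  # leave the caller's workflow untouched
    for node in patched.values():
        if node.get("class_type") == "UNETLoader":
            node["class_type"] = "UnetLoaderGGUF"
            # Assumption: the GGUF loader only needs `unet_name`, so other inputs are dropped.
            node["inputs"] = {"unet_name": gguf_filename}
    return patched


if __name__ == "__main__":
    workflow = {
        "12": {"class_type": "UNETLoader",
               "inputs": {"unet_name": "flux1-dev.safetensors", "weight_dtype": "default"}},
        "13": {"class_type": "KSampler", "inputs": {"model": ["12", 0]}},
    }
    print(swap_to_gguf_loader(workflow, "flux1-dev-Q4_K_M.gguf")["12"])
```

Because the node id stays the same, downstream links such as `["12", 0]` keep pointing at the loader, which is what makes the replacement drop-in.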
Step 5: Generate
Run the workflow as normal. ComfyUI loads the GGUF model into VRAM and generation begins. Memory usage will be noticeably lower than with safetensors.
Troubleshooting: If the first run fails, verify the GGUF filename matches exactly what the node expects. Special characters in filenames sometimes cause issues. Rename if necessary.
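Another thing worth ruling out is a truncated or mislabeled download. A GGUF file begins with the four bytes `GGUF` followed by a version number, so a quick header check can tell a real model file from an HTML error page saved under a `.gguf` name. The sketch below assumes the version 2 or 3 header layout (32-bit version, then 64-bit tensor and metadata counts); the function name is hypothetical.

```python
import struct


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) or raise ValueError."""
    with open(path, "rb") as handle:
        header = handle.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata entry count
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return version, tensor_count, kv_count


# Example call (path is a placeholder):
# print(read_gguf_header("ComfyUI/models/unet/flux1-dev-Q4_K_M.gguf"))
```

A `ValueError` here points to a bad download, not a ComfyUI problem, so re-downloading the file is the next step.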
💡 Quick start: Installing ComfyUI-GGUF takes 5 minutes; just add the custom node, download a GGUF file, and swap your UNETLoader node.
Component Compatibility and Mixed Quantization
A common question: Can I use GGUF for the entire model, or just parts of it?
Only the UNet (the diffusion model itself) gets quantized to GGUF in typical workflows. The VAE and text encoders (CLIP, T5) typically load in their original format. This is by design—the UNet uses the most VRAM and benefits most from quantization.
You can reduce text encoder memory further if needed:
- T5-XXL FP8 is a non-GGUF quantization that shrinks the T5 encoder from 9GB to 5GB. It’s compatible with Flux and works alongside GGUF UNet quantization.
LoRA Compatibility: LoRAs work normally regardless of whether the base model is GGUF or safetensors. Place the LoraLoader directly after the GGUF loader, feeding it the loader’s MODEL output and your text encoder’s CLIP output, then route its outputs to the sampler and CLIPTextEncode as usual. The format of the base model doesn’t affect LoRA loading.
Which Models Support GGUF?
GGUF versions are available for:
- Flux Dev and Flux Schnell
- SDXL checkpoints (many community versions)
- Wan 2.2
- HunyuanVideo
- Other emerging models (support is growing)
If no GGUF version exists for a model you want to use, you can create one with llama.cpp, but that requires more technical knowledge and is beyond the scope of this guide.
⚠️ Important: Not every model has a GGUF version available yet. Check HuggingFace or the model’s official repository before assuming one exists.
Performance: GPU vs. CPU Execution
On GPU, GGUF speed is comparable to safetensors. Sometimes slightly slower depending on quantization level, but the difference is imperceptible in most workflows. You get the memory savings without a meaningful speed penalty.
On CPU, GGUF is fully functional but significantly slower. A 10-second GPU generation might take 2–3 minutes on CPU. Still viable if you don’t have a GPU, but not practical for iterative work or large batches.
FAQ
Q: How much quality is lost with GGUF Q4_K_M compared to the original model?
A: At 1024x1024, the difference between Q4_K_M and FP16 is hard to spot with the naked eye. In very fine detail (text, complex textures) there can be a slight reduction. Q5_K_M is practically indistinguishable from the original. Q3 and Q2 do show visible loss.
Q: Can only the UNet be GGUF, or also the VAE and CLIP?
A: The UNet/diffusion model uses the most VRAM and is the one with GGUF versions. The VAE and text encoders (CLIP, T5) generally aren’t GGUF-quantized in ComfyUI—they load in their original formats. T5-XXL FP8, a different quantization, does shrink that encoder from 9GB to 5GB.
Q: Can I use regular safetensors LoRAs with a GGUF model?
A: Yes. LoRAs apply to the model after it loads and work regardless of the base model’s format. Place the LoraLoader directly after the GGUF loader and route its outputs to the sampler and CLIPTextEncode as usual.
Q: Does GGUF work for every model or only Flux?
A: It works for any model with a GGUF version available. That includes FLUX Dev/Schnell, many SDXL checkpoints, Wan 2.2, HunyuanVideo and others. Without an official GGUF version, one can be created with llama.cpp, which requires more technical knowledge.
Keep Reading
GGUF is one of several VRAM-saving techniques — see our complete guide to reducing VRAM usage in ComfyUI for the full picture, including offloading and resolution tricks. If you’re still deciding what to buy, our best GPU for ComfyUI guide breaks down which cards handle which workloads.
🏆 Our Recommendation
If you have 6–8GB VRAM and want to run Flux: Use GGUF Q4_K_M. It’s the only practical option for this hardware tier, and quality loss is imperceptible. Install ComfyUI-GGUF, download a Q4_K_M model, and swap your UNETLoader node.
If you have 10–12GB VRAM: Use GGUF Q5_K_M. The quality is nearly indistinguishable from the original, and you still save significant VRAM and disk space compared to safetensors.
If you have 16GB+ VRAM and prioritize maximum quality: Safetensors is still the best choice. But if you value faster loading times and smaller file sizes, Q5_K_M or Q6_K GGUF offers excellent quality with meaningful compression.
If you don’t have a GPU: GGUF still works on CPU, but generation is slow. It’s viable for batch processing or when you don’t need real-time results.