Hi there, this is Dora. Two weeks ago I watched a 5-second clip I’d generated at 480p look genuinely cinematic after running it through the LTX 2.3 multi-stage latent upscaling pipeline. Same prompt. Same motion. Completely different level of sharpness and edge detail. I actually said “wait, what?” out loud to my empty room.
LTX 2.3 dropped recently, and the ComfyUI-LTXVideo repository shipped reference workflows for multi-stage latent upscaling on day one — but the documentation assumes you already know what latent upscaling is and why it matters. Most creators don’t. This guide fixes that.
What Multi-Stage Latent Upscaling Is (Concept in Plain Language)
Standard video generation works like this: you give the model a prompt, it generates a video at whatever resolution you asked for, done. Single pass. One resolution. What you see is what you get.
Multi-stage latent upscaling is a different approach. Instead of generating at full resolution in one shot, you:
- Generate at a lower resolution in latent space — getting motion structure, scene coherence, and temporal consistency right first
- Upscale within the latent space before decoding — adding spatial detail without regenerating the whole clip
- Run a second denoising pass on the upscaled latent to lock in fine texture, edge sharpness, and lighting detail
- Optionally, apply a final pixel-space upscale for maximum output resolution

The key word is latent. You’re not upscaling the decoded video frames (that’s pixel-space upscaling, like Topaz Video AI does). You’re operating directly on the compressed latent representation — the model’s internal “understanding” of the video — before it ever gets turned into pixels. This preserves temporal coherence across frames in a way that pixel-space upscaling cannot.
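The pipeline above can be sketched structurally. This is a minimal Python sketch with stand-in stub functions — none of these are real ComfyUI or LTX APIs, and the 32× latent compression factor is purely illustrative. It only tracks tensor shapes, to show where the 2× upscale happens relative to decoding:

```python
# Shapes are (frames, channels, height, width).
# All functions are illustrative stubs, not real ComfyUI/LTX APIs.

def generate_base_latent(prompt, w, h, frames=121):
    # Stage 1: sample a low-res latent; motion and coherence are fixed here.
    return {"prompt": prompt, "shape": (frames, 128, h // 32, w // 32)}

def latent_upscale_2x(latent):
    # Stage 2a: 2x spatial upscale *in latent space*, before any decode.
    f, c, h, w = latent["shape"]
    return {**latent, "shape": (f, c, h * 2, w * 2)}

def refine(latent, denoise=0.45):
    # Stage 2b: short second denoising pass; the shape is unchanged.
    assert 0.2 <= denoise <= 0.7, "outside the useful refinement range"
    return latent

def decode(latent):
    # Only now do latents become pixels (in the real workflow: VAEDecodeTiled).
    f, c, h, w = latent["shape"]
    return (f, 3, h * 32, w * 32)

base = generate_base_latent("rainy Tokyo street", 960, 544)
video = decode(refine(latent_upscale_2x(base)))
print(video)  # (121, 3, 1088, 1920) -> roughly 1080p from a 960x544 base
```

The point of the sketch: the upscale multiplies latent dimensions, so the expensive full-resolution work happens only in the short refinement pass, never in the Stage 1 sampling.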
The result: sharper detail, preserved motion consistency, and significantly better edge accuracy — especially on fine textures like hair, fabric, and text — compared to generating at the target resolution in a single pass.
Why Use Multi-Stage vs. Single-Pass Generation
The honest answer: single-pass generation at high resolution is computationally expensive and temporally inconsistent. Here’s why the two-stage approach wins:
| Factor | Single-Pass (High Res) | Multi-Stage Latent Upscale |
| --- | --- | --- |
| Motion coherence | Can drift on complex motion | ✅ Established at lower res in Stage 1 |
| Fine detail | Present but can over-constrain motion | ✅ Added in Stage 2 upscale pass |
| VRAM at generation | High (full resolution throughout) | Lower (base res in Stage 1) |
| Temporal consistency | Risk of frame-to-frame drift | ✅ Preserved across upscale |
| Generation speed | Slower for equivalent detail | Faster Stage 1 + targeted Stage 2 |
The community consensus on X and Reddit, backed by Lightricks’ own blog documentation, confirms that the two-stage pipeline consistently outperforms single-pass at equivalent compute. The gap is most visible on shots with complex texture — skin, fabric, foliage — and on clips longer than 4 seconds where single-pass generation starts to drift.
One important constraint worth knowing upfront: width and height settings must be divisible by 32 at every stage. This trips people up when setting custom resolutions.
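A small helper makes that constraint easy to satisfy. This is my own convention, not part of the workflow: halve the target resolution for Stage 1, then round each dimension up to the next multiple of 32 (rounding up matches the reference resolutions, e.g. 540 → 544):

```python
import math

def snap32(x: int) -> int:
    """Round up to the nearest multiple of 32."""
    return math.ceil(x / 32) * 32

def stage1_resolution(final_w: int, final_h: int) -> tuple[int, int]:
    """Halve the final resolution, then snap both dims up to /32."""
    return snap32(final_w // 2), snap32(final_h // 2)

print(stage1_resolution(1280, 720))   # (640, 384)
print(stage1_resolution(1920, 1080))  # (960, 544)
print(stage1_resolution(2560, 1440))  # (1280, 736) -- 720 itself is not /32
```

The same `snap32` check applies to any custom resolution you type into a node: if either dimension fails it, snap before generating.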
The Official Multi-Stage Workflow Walkthrough
Before touching any node settings, your file structure needs to be correct. Here’s the required directory layout for the two-stage pipeline:
ComfyUI/
├── models/
│ ├── checkpoints/
│ │ └── ltx-2.3-22b-dev-fp8.safetensors # or BF16 if VRAM allows
│ ├── latent_upscale_models/
│ │ └── ltx-2.3-spatial-upscaler-x2-1.0.safetensors # required for Stage 2
│ ├── loras/
│ │ └── ltx-2.3-22b-distilled-lora-384.safetensors # required for pipeline
│ └── text_encoders/
│ └── gemma_3_12B_it_fp4_mixed.safetensors
The Spatial Upscaler and Distilled LoRA are both required for current two-stage pipeline implementations in the ComfyUI-LTXVideo repository. If either file is missing, the workflow will fail silently or produce artifacts at the upscale stage.
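Because a missing file fails silently, a quick preflight check is worth running before loading the workflow. A hedged sketch — the paths mirror the layout above; adjust `COMFY_ROOT` to your actual install location:

```python
from pathlib import Path

COMFY_ROOT = Path("ComfyUI")  # adjust to your install location

REQUIRED = [
    "models/checkpoints/ltx-2.3-22b-dev-fp8.safetensors",
    "models/latent_upscale_models/ltx-2.3-spatial-upscaler-x2-1.0.safetensors",
    "models/loras/ltx-2.3-22b-distilled-lora-384.safetensors",
    "models/text_encoders/gemma_3_12B_it_fp4_mixed.safetensors",
]

def preflight(root: Path) -> list[str]:
    """Return the required model files that are missing under root."""
    return [rel for rel in REQUIRED if not (root / rel).exists()]

missing = preflight(COMFY_ROOT)
if missing:
    print("Missing files — the upscale stage will fail or artifact:")
    for rel in missing:
        print(f"  {rel}")
else:
    print("All required models present.")
```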

Stage 1: Base Generation Settings
Stage 1 focuses entirely on motion structure and scene coherence — not on detail. You’re generating at roughly half your target resolution.
Target resolution for Stage 1:
| Final Output | Stage 1 Resolution | Notes |
| --- | --- | --- |
| 1280×720 (720p) | 640×384 | Divisible by 32 ✅ |
| 1920×1080 (1080p) | 960×544 | Divisible by 32 ✅ |
| 2560×1440 (1440p) | 1280×736 | 720 isn’t divisible by 32; snap up to 736 ✅ |
Key nodes in Stage 1:
- `RandomNoise` — Set a fixed seed if you want reproducible results. `-1` gives variation; any specific number locks the generation.
- `KSamplerSelect` — Use `euler` for most content; `dpmpp_2m` if you need stronger prompt adherence.
- `LTXVScheduler` — The LTX-specific scheduler that balances temporal stability with prompt adherence. Don’t swap this for a generic scheduler.
- `MultiModalGuider` — Separates text guidance from cross-modal alignment. You can dial up motion fluidity without overfitting to the prompt — that’s the difference between creepy over-constrained motion and natural, believable movement.
- `CFGGuider` — Keep CFG between 3.0–4.5 for LTX 2.3. Higher values cause over-constrained, jittery motion.
Stage 1 prompt tip: The Gemma 3 12B text encoder powering LTX 2.3 handles complex, multi-sentence prompts accurately. Don’t keyword-stuff — write descriptively. “A woman walks through a rainy Tokyo street, neon reflections on wet pavement, handheld camera, cinematic” outperforms a list of tags.
Stage 2: Latent Upscaling Node
This is where the magic happens. The LTXVLatentUpsampler node performs a 2× spatial upscale directly in latent space using the loaded spatial upscaler model.
# Stage 2 node chain:
LTXVLatentUpsampler (#130)
├── Input: AV latent from Stage 1 KSampler
├── LatentUpscaleModelLoader (#114) → ltx-2.3-spatial-upscaler-x2-1.0.safetensors
└── Output: 2× upscaled AV latent → feeds into Stage 2 KSampler
Stage 2 KSampler configuration:
├── RandomNoise (#127) — use SAME seed as Stage 1 for consistency
├── KSamplerSelect (#145)
├── ManualSigmas (#113) — controls the refinement noise schedule
└── LoraLoaderModelOnly (#143) — apply distilled LoRA here for texture polish
The second pass refines the upscaled latent using a ManualSigmas schedule. This stage is where micro-detail and edge sharpness are finalized — it works best when the LoRA is active and the prompt is specific about textures and lighting.
Critical setting: Keep the Stage 2 denoising strength between 0.35–0.55. Too high (above 0.7) and Stage 2 will override the motion structure from Stage 1 — you’ll get sharp frames that don’t flow correctly. Too low (below 0.2) and Stage 2 adds almost nothing.
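The denoising-strength knob maps to how much of the noise schedule Stage 2 actually traverses. A rough Python sketch of the idea — a simplified linear schedule, not the actual sigmas LTX or `ManualSigmas` uses: with denoise `d`, refinement starts roughly `d` of the way up the schedule instead of from pure noise, which is why high values regenerate rather than refine.

```python
def refinement_sigmas(n_steps: int, denoise: float, sigma_max: float = 1.0):
    """Truncated linear noise schedule for a partial (refinement) pass.

    denoise=1.0 would regenerate from scratch; denoise=0.45 starts ~45%
    of the way up the schedule, preserving Stage 1 structure.
    Simplified illustration -- real samplers use non-linear sigmas.
    """
    start = denoise * sigma_max
    return [start * (1 - i / n_steps) for i in range(n_steps + 1)]

sigmas = refinement_sigmas(15, denoise=0.45)
print(sigmas[0], sigmas[-1])  # 0.45 0.0
```

The shape of the list is the takeaway: Stage 2 only ever sees partially-noised versions of the Stage 1 latent, so the motion structure survives as long as the starting sigma stays moderate.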
Stage 3: Final Spatial Upscale (Optional)
After Stage 2 decodes to pixel space via VAEDecodeTiled, you can optionally apply a final pixel-space upscale for maximum output resolution.
NVIDIA RTX Video Super Resolution is now available as a ComfyUI node — a real-time 4K upscaler that runs on RTX GPU Tensor Cores, delivering 4K upscaling 30× faster than alternative local upscalers at a fraction of the VRAM cost. For RTX users, this is the cleanest Stage 3 path. For everyone else, RealESRGAN (via the ComfyUI-RealESRGAN node) remains the strongest free alternative.

VRAM Requirements at Each Stage
This is the section everyone actually needs before starting. LTX 2.3 is a 22B-parameter model, and the hardware requirements are real.
| Configuration | VRAM Required | Practical GPU | Notes |
| --- | --- | --- | --- |
| BF16 full precision | ~44GB | A100 / dual GPU | Best quality ceiling |
| FP8 quantized (recommended) | ~23–30GB | RTX 4090 (24GB) | Sweet spot for quality vs. memory |
| FP16 quantized | ~22GB | RTX 3090 / 4090 | Strong middle ground |
| GGUF Q4_K_M | ~10–12GB | RTX 3080 (10GB) | Community format; more setup complexity |
| GGUF Q4_K_S | ~8GB | RTX 3070 Ti | Noticeable softening vs. BF16 |
In practice, 720p runs on 12–24GB with FP8 quantization, and 1080p on 24–32GB. The official minimum is 32GB VRAM, but the community has pushed this significantly lower with quantized variants.
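The table’s headline numbers follow from simple arithmetic on weight size. A back-of-the-envelope Python sketch — this counts weights only; activations, the VAE, and the Gemma text encoder add real overhead on top, which is why the FP8 row reads ~23–30GB rather than a flat 22GB (the 4.5 bits/param figure is my rough approximation of Q4_K_M’s effective bitrate):

```python
def weight_vram_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate VRAM for model weights alone, in decimal GB."""
    return params_billion * 1e9 * (bits_per_param / 8) / 1e9

for name, bits in [("BF16", 16), ("FP8", 8), ("Q4 GGUF (approx.)", 4.5)]:
    print(f"{name}: ~{weight_vram_gb(22, bits):.0f} GB")
# BF16: ~44 GB, FP8: ~22 GB, Q4 GGUF: ~12 GB
```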
For VRAM-constrained setups, add this ComfyUI launch flag to reserve headroom:
python main.py --reserve-vram 4 --fp8_e4m3fn-unet
--reserve-vram 4 keeps 4GB free for the OS and other processes. --fp8_e4m3fn-unet runs the diffusion model in FP8 (e4m3fn format, optimized for inference) while keeping VAE at higher precision.
As of earlier this year, ComfyUI has Dynamic VRAM enabled by default, which massively reduces RAM usage and prevents VRAM OOMs. Make sure you’re on v0.16.1+ before troubleshooting memory issues — older versions don’t have this.
One important note on GGUF: GGUF is the “make it fit” option, not always the “cleanest install” option. It’s attractive for low-VRAM users, but it’s also where people are seeing more size mismatch errors and workflow confusion. If you’re on 16GB or above, stick with the official FP8 safetensors checkpoint.
Quality Settings and Trade-offs
| Setting | Conservative | Balanced | Maximum Quality |
| --- | --- | --- | --- |
| Stage 1 steps | 20 | 30 | 40–50 |
| Stage 2 steps | 10 | 15 | 20 |
| CFG scale | 3.0 | 3.5–4.0 | 4.5 |
| Stage 2 denoise | 0.35 | 0.45 | 0.55 |
| FPS | 24 | 30 | 50 |
| Generation time (RTX 4090, 1080p, 10s) | ~10 min | ~20 min | ~30 min |
The setting with the biggest quality-per-minute return: Stage 2 denoising strength. Going from 0.35 to 0.48 typically adds more visible detail than doubling your Stage 1 step count.
Common Errors and Fixes
**`RuntimeError: CUDA out of memory`** — The most common error by far. Fix in order of impact: (1) enable the `--fp8_e4m3fn-unet` launch flag, (2) add `--reserve-vram 4`, (3) switch to the FP8 checkpoint, (4) reduce resolution (keeping it divisible by 32), (5) drop to GGUF if all else fails.

**Red nodes / missing node errors on workflow load** — Your ComfyUI isn’t on v0.16.1+. If nodes are missing when loading a workflow, update ComfyUI via the Manager — the desktop version’s update can lag slightly behind the nightly build. Run Update All in ComfyUI Manager.

**Stage 2 overrides Stage 1 motion (video looks “regenerated”)** — Your Stage 2 denoising strength is too high. Drop it below 0.5. The Stage 2 pass should refine, not replace.

**Size mismatch error with GGUF models** — GGUF models load via the UnetLoader node, not CheckpointLoaderSimple. Check that your GGUF file is in models/unet/, not models/checkpoints/.

**Output video has playback drift (audio and video out of sync)** — Keep your FPS setting consistent with the value used during conditioning; inconsistency here causes playback drift. Set FPS once at the conditioning stage and don’t change it between Stage 1 and decode.

When to Use This vs. Simpler Workflows
The multi-stage pipeline adds setup complexity and generation time. It’s not always the right call.
Use multi-stage latent upscaling when:
- Final delivery is 1080p or higher
- Your shot contains fine detail (hair, fabric, skin, text)
- Clip length is 4+ seconds (longer clips benefit most from Stage 1 coherence)
- You’re doing image-to-video and need the reference frame to hold across the full clip
Stick with single-pass when:
- You’re iterating quickly on prompt ideas and don’t need final quality yet
- Output is 720p or below for social/draft use
- You’re on under 12GB VRAM and need results without complex pipeline setup
- Your scene is simple (abstract motion, minimal texture, solid backgrounds)
The distilled variant (ltx-2.3-22b-distilled) completes in as few as 8 denoising steps — dramatically faster than the full dev model. For most creators, distilled is the better starting point before committing to the multi-stage pipeline. Run the distilled single-pass first. If the detail ceiling frustrates you, that’s when you graduate to multi-stage.
FAQ
Q: Do I need the spatial upscaler model separately, or is it included in the checkpoint?
It’s a separate file. Download ltx-2.3-spatial-upscaler-x2-1.0.safetensors from the Lightricks/LTX-2.3 HuggingFace page and place it in ComfyUI/models/latent_upscale_models/. The main checkpoint doesn’t include it. Same for the Distilled LoRA — separate download, separate folder (models/loras/).
Q: Does changing the seed between Stage 1 and Stage 2 break consistency?
Yes. Always use the same seed in both stages. Stage 2 uses the Stage 1 latent as input — a different seed adds noise in a mismatched direction, causing artifacts rather than refinement.
Q: My Stage 2 output looks blurry instead of sharper. What’s wrong?
Most likely cause: the Distilled LoRA isn’t loaded, or your Stage 2 denoising strength is below 0.3. Check that LoraLoaderModelOnly is connected before the Stage 2 KSampler and set denoising to at least 0.4.