LTX 2.3 Multi-Stage Latent Upscaling Workflow in ComfyUI

Hi there, this is Dora. Two weeks ago I watched a 5-second clip I’d generated at 480p look genuinely cinematic after running it through the LTX 2.3 multi-stage latent upscaling pipeline. Same prompt. Same motion. Completely different level of sharpness and edge detail. I actually said “wait, what?” out loud to my empty room.

LTX 2.3 dropped recently, and the ComfyUI-LTXVideo repository shipped reference workflows for multi-stage latent upscaling on day one. But the documentation assumes you already know what latent upscaling is and why it matters. Most creators don’t. This guide fixes that.

What Multi-Stage Latent Upscaling Is (Concept in Plain Language)

Standard video generation works like this: you give the model a prompt, it generates a video at whatever resolution you asked for, done. Single pass. One resolution. What you see is what you get.

Multi-stage latent upscaling is a different approach. Instead of generating at full resolution in one shot, you:

  1. Generate at a lower resolution in latent space — getting motion structure, scene coherence, and temporal consistency right first
  2. Upscale within the latent space before decoding — adding spatial detail without regenerating the whole clip
  3. Run a second denoising pass on the upscaled latent to lock in fine texture, edge sharpness, and lighting detail
  4. Optionally, apply a final pixel-space upscale for maximum output resolution
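If the four steps feel abstract, here is a toy sketch of the data flow using nothing but latent tensor shapes. The channel count and the 8× spatial compression factor are illustrative assumptions, not LTX internals; the point is that Stage 2 grows only the spatial dimensions while the frame count (the temporal axis) stays untouched.

```python
# Toy model of the multi-stage flow, tracking latent SHAPES only.
# The (channels, frames, h, w) layout, 8x spatial compression, and
# 128 channels are illustrative assumptions, not LTX's actual internals.

def stage1_generate(width: int, height: int, frames: int) -> tuple:
    """Stage 1: base generation produces a low-resolution latent."""
    return (128, frames, height // 8, width // 8)

def stage2_latent_upscale(latent: tuple, factor: int = 2) -> tuple:
    """Stage 2: spatial upscale in latent space. Only H and W grow; the
    frame axis is untouched, which is why temporal coherence survives."""
    c, f, h, w = latent
    return (c, f, h * factor, w * factor)

base = stage1_generate(640, 384, 97)     # (128, 97, 48, 80)
upscaled = stage2_latent_upscale(base)   # (128, 97, 96, 160)
```

Notice that `frames` never changes between the two calls. Pixel-space upscalers operate frame-by-frame after decoding, which is exactly where flicker creeps in.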

The key word is latent. You’re not upscaling the decoded video frames (that’s pixel-space upscaling, like Topaz Video AI does). You’re operating directly on the compressed latent representation — the model’s internal “understanding” of the video — before it ever gets turned into pixels. This preserves temporal coherence across frames in a way that pixel-space upscaling cannot.

The result: sharper detail, preserved motion consistency, and significantly better edge accuracy — especially on fine textures like hair, fabric, and text — compared to generating at the target resolution in a single pass.

Why Use Multi-Stage vs. Single-Pass Generation

The honest answer: single-pass generation at high resolution is computationally expensive and temporally inconsistent. Here’s why the two-stage approach wins:

| Factor | Single-Pass (High Res) | Multi-Stage Latent Upscale |
|---|---|---|
| Motion coherence | Can drift on complex motion | ✅ Established at lower res in Stage 1 |
| Fine detail | Present but can over-constrain motion | ✅ Added in Stage 2 upscale pass |
| VRAM at generation | High (full resolution throughout) | Lower (base res in Stage 1) |
| Temporal consistency | Risk of frame-to-frame drift | ✅ Preserved across upscale |
| Generation speed | Slower for equivalent detail | Faster Stage 1 + targeted Stage 2 |

The community consensus on X and Reddit, backed by Lightricks’ own blog documentation, confirms that the two-stage pipeline consistently outperforms single-pass at equivalent compute. The gap is most visible on shots with complex texture — skin, fabric, foliage — and on clips longer than 4 seconds where single-pass generation starts to drift.

One important constraint worth knowing upfront: width and height settings must be divisible by 32 at every stage. This trips people up when setting custom resolutions.
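A two-line check before queueing a workflow saves a failed run. This is a generic helper of my own, not a ComfyUI node:

```python
def check_res(width: int, height: int) -> None:
    """Raise if a resolution would violate the divisible-by-32 rule."""
    for name, v in (("width", width), ("height", height)):
        if v % 32 != 0:
            lo, hi = (v // 32) * 32, (v // 32 + 1) * 32
            raise ValueError(
                f"{name}={v} is not divisible by 32 (nearest valid: {lo} or {hi})"
            )

check_res(640, 384)   # passes silently
# check_res(1280, 1080) would raise: 1080 sits between 1056 and 1088
```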

The Official Multi-Stage Workflow Walkthrough

Before touching any node settings, your file structure needs to be correct. Here’s the required directory layout for the two-stage pipeline:

ComfyUI/
├── models/
│   ├── checkpoints/
│   │   └── ltx-2.3-22b-dev-fp8.safetensors      # or BF16 if VRAM allows
│   ├── latent_upscale_models/
│   │   └── ltx-2.3-spatial-upscaler-x2-1.0.safetensors   # required for Stage 2
│   ├── loras/
│   │   └── ltx-2.3-22b-distilled-lora-384.safetensors    # required for pipeline
│   └── text_encoders/
│       └── gemma_3_12B_it_fp4_mixed.safetensors

The Spatial Upscaler and Distilled LoRA are both required for current two-stage pipeline implementations in the ComfyUI-LTXVideo repository. If either file is missing, the workflow will fail silently or produce artifacts at the upscale stage.
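Because a missing file fails silently, it is worth running a preflight check before loading the workflow. The paths mirror the layout above; the helper itself is just a sketch, not part of ComfyUI:

```python
from pathlib import Path

# Relative paths taken from the directory layout above.
REQUIRED_FILES = {
    "checkpoint": "models/checkpoints/ltx-2.3-22b-dev-fp8.safetensors",
    "spatial upscaler": "models/latent_upscale_models/ltx-2.3-spatial-upscaler-x2-1.0.safetensors",
    "distilled LoRA": "models/loras/ltx-2.3-22b-distilled-lora-384.safetensors",
    "text encoder": "models/text_encoders/gemma_3_12B_it_fp4_mixed.safetensors",
}

def missing_models(comfyui_root: str) -> list:
    """Return the names of any required model files that are absent."""
    root = Path(comfyui_root)
    return [name for name, rel in REQUIRED_FILES.items() if not (root / rel).exists()]
```

Call `missing_models("path/to/ComfyUI")` and download anything it reports before opening the workflow.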

Stage 1: Base Generation Settings

Stage 1 focuses entirely on motion structure and scene coherence — not on detail. You’re generating at roughly half your target resolution.

Target resolution for Stage 1:

| Final Output | Stage 1 Resolution | Notes |
|---|---|---|
| 1280×720 (720p) | 640×384 | Divisible by 32 ✅ |
| 1920×1080 (1080p) | 960×544 | Divisible by 32 ✅ |
| 2560×1440 (1440p) | 1280×736 | Half of 1440 is 720, which isn’t divisible by 32; round up to 736 ✅ |
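The pattern behind the Stage 1 column: halve the target, then round up to the next multiple of 32 (half of 720 is 360, which rounds up to 384). That rule as a helper for custom targets, inferred from the table rather than taken from any official formula:

```python
def stage1_resolution(final_w: int, final_h: int) -> tuple:
    """Half the final resolution, each axis rounded UP to a multiple of 32."""
    def half_snap(v: int) -> int:
        return ((v // 2 + 31) // 32) * 32   # ceil(v/2 / 32) * 32
    return (half_snap(final_w), half_snap(final_h))

stage1_resolution(1920, 1080)   # (960, 544)
```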

Key nodes in Stage 1:

  • RandomNoise — Set a fixed seed if you want reproducible results. -1 gives variation; any specific number locks the generation.
  • KSamplerSelect — Use euler for most content; dpmpp_2m if you need stronger prompt adherence.
  • LTXVScheduler — The LTX-specific scheduler that balances temporal stability with prompt adherence. Don’t swap this for a generic scheduler.
  • MultiModalGuider — Separates text guidance from cross-modal alignment. You can dial up motion fluidity without overfitting to the prompt — that’s the difference between creepy over-constrained motion and natural, believable movement.
  • CFGGuider — Keep CFG between 3.0–4.5 for LTX 2.3. Higher values cause over-constrained, jittery motion.

Stage 1 prompt tip: The Gemma 3 12B text encoder powering LTX 2.3 handles complex, multi-sentence prompts accurately. Don’t keyword-stuff — write descriptively. “A woman walks through a rainy Tokyo street, neon reflections on wet pavement, handheld camera, cinematic” outperforms a list of tags.

Stage 2: Latent Upscaling Node

This is where the magic happens. The LTXVLatentUpsampler node performs a 2× spatial upscale directly in latent space using the loaded spatial upscaler model.

# Stage 2 node chain:
LTXVLatentUpsampler (#130)
  ├── Input: AV latent from Stage 1 KSampler
  ├── LatentUpscaleModelLoader (#114) → ltx-2.3-spatial-upscaler-x2-1.0.safetensors
  └── Output: 2× upscaled AV latent → feeds into Stage 2 KSampler

Stage 2 KSampler configuration:
  ├── RandomNoise (#127) — use SAME seed as Stage 1 for consistency
  ├── KSamplerSelect (#145)
  ├── ManualSigmas (#113) — controls the refinement noise schedule
  └── LoraLoaderModelOnly (#143) — apply distilled LoRA here for texture polish

The second pass refines the upscaled latent using a ManualSigmas schedule. This stage is where micro-detail and edge sharpness are finalized — it works best when the LoRA is active and the prompt is specific about textures and lighting.

Critical setting: Keep the Stage 2 denoising strength between 0.35–0.55. Too high (above 0.7) and Stage 2 will override the motion structure from Stage 1 — you’ll get sharp frames that don’t flow correctly. Too low (below 0.2) and Stage 2 adds nothing.
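A useful mental model for how denoising strength maps onto a ManualSigmas schedule: the second pass runs only the tail of the noise schedule, and the denoise value picks how much of that tail. This is an illustration of the concept, not ComfyUI's exact implementation:

```python
def stage2_sigmas(full_sigmas: list, denoise: float) -> list:
    """Keep the last `denoise` fraction of a sigma schedule.
    denoise=0.45 -> Stage 2 starts from moderate noise and refines;
    denoise=1.0  -> Stage 2 regenerates from scratch (what you DON'T want)."""
    if not 0.0 < denoise <= 1.0:
        raise ValueError("denoise must be in (0, 1]")
    n = max(1, round(len(full_sigmas) * denoise))
    return full_sigmas[-n:]

schedule = [14.6, 9.0, 5.0, 2.5, 1.0, 0.5, 0.2, 0.0]  # example values only
stage2_sigmas(schedule, 0.5)   # [1.0, 0.5, 0.2, 0.0]
```

At 0.5 the refinement pass never revisits the high-noise steps where motion structure lives, which is exactly why it refines instead of replacing.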

Stage 3: Final Spatial Upscale (Optional)

After Stage 2 decodes to pixel space via VAEDecodeTiled, you can optionally apply a final pixel-space upscale for maximum output resolution.

NVIDIA RTX Video Super Resolution is now available as a ComfyUI node — a real-time 4K upscaler that runs on RTX GPU Tensor Cores, delivering 4K upscaling 30× faster than alternative local upscalers at a fraction of the VRAM cost. For RTX users, this is the cleanest Stage 3 path. For everyone else, RealESRGAN (via the ComfyUI-RealESRGAN node) remains the strongest free alternative.

VRAM Requirements at Each Stage

This is the section everyone actually needs before starting. LTX 2.3 is a 22B-parameter model; the hardware requirements are real.

| Configuration | VRAM Required | Practical GPU | Notes |
|---|---|---|---|
| BF16 full precision | ~44GB | A100 / dual GPU | Best quality ceiling |
| FP8 quantized (recommended) | ~23–30GB | RTX 4090 (24GB) | Sweet spot for quality vs. memory |
| FP16 quantized | ~22GB | RTX 3090 / 4090 | Strong middle ground |
| GGUF Q4_K_M | ~10–12GB | RTX 3080 (10GB) | Community format; more setup complexity |
| GGUF Q4_K_S | ~8GB | RTX 3070 Ti | Noticeable softening vs. BF16 |

In practice, 720p runs on 12–24GB with FP8 quantization, and 1080p on 24–32GB. The official minimum is 32GB VRAM, but the community has pushed this significantly lower with quantized variants.

For VRAM-constrained setups, add this ComfyUI launch flag to reserve headroom:

python main.py --reserve-vram 4 --fp8_e4m3fn-unet

--reserve-vram 4 keeps 4GB free for the OS and other processes. --fp8_e4m3fn-unet runs the diffusion model in FP8 (e4m3fn format, optimized for inference) while keeping VAE at higher precision.

As of earlier this year, ComfyUI has Dynamic VRAM enabled by default, which massively reduces RAM usage and prevents VRAM OOMs. Make sure you’re on v0.16.1+ before troubleshooting memory issues — older versions don’t have this.

One important note on GGUF: GGUF is the “make it fit” option, not always the “cleanest install” option. It’s attractive for low-VRAM users, but it’s also where people are seeing more size mismatch errors and workflow confusion. If you’re on 16GB or above, stick with the official FP8 safetensors checkpoint.

Quality Settings and Trade-offs

| Setting | Conservative | Balanced | Maximum Quality |
|---|---|---|---|
| Stage 1 steps | 20 | 30 | 40–50 |
| Stage 2 steps | 10 | 15 | 20 |
| CFG scale | 3.0 | 3.5–4.0 | 4.5 |
| Stage 2 denoise | 0.35 | 0.45 | 0.55 |
| FPS | 24 | 30 | 50 |
| Generation time (RTX 4090, 1080p, 10s) | ~10 min | ~20 min | ~30 min |
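If you want these presets in a form you can paste into notes or a wrapper script, here they are as plain data. Where the table gives a range, I've collapsed it to a single value of my own choosing (midpoint or thereabouts):

```python
# Presets transcribed from the table above; ranges collapsed to one value.
PRESETS = {
    "conservative": {"stage1_steps": 20, "stage2_steps": 10, "cfg": 3.0,
                     "stage2_denoise": 0.35, "fps": 24},
    "balanced":     {"stage1_steps": 30, "stage2_steps": 15, "cfg": 3.75,
                     "stage2_denoise": 0.45, "fps": 30},
    "max_quality":  {"stage1_steps": 45, "stage2_steps": 20, "cfg": 4.5,
                     "stage2_denoise": 0.55, "fps": 50},
}
```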

The setting with the biggest quality-per-minute return: Stage 2 denoising strength. Going from 0.35 to 0.48 typically adds more visible detail than doubling your Stage 1 step count.

Common Errors and Fixes

RuntimeError: CUDA out of memory The most common error by far. Fix in order of impact: (1) enable --fp8_e4m3fn-unet launch flag, (2) add --reserve-vram 4, (3) switch to the FP8 checkpoint, (4) reduce resolution (must be divisible by 32), (5) drop to GGUF if all else fails.

Red nodes / missing node errors on workflow load Your ComfyUI isn’t on v0.16.1+. If nodes are missing when loading a workflow, update ComfyUI via the Manager and run Update All; note that the desktop version’s updates can lag slightly behind the nightly build.

Stage 2 overrides Stage 1 motion (video looks “regenerated”) Your Stage 2 denoising strength is too high. Drop it below 0.5. The Stage 2 pass should refine, not replace.

Size mismatch error with GGUF models GGUF models load via the UnetLoader node, not CheckpointLoaderSimple. Check that your GGUF file is in models/unet/, not models/checkpoints/.

Output video has playback drift (audio and video out of sync) Keep your FPS setting consistent with the value used during conditioning — inconsistency here causes playback drift. Set FPS once at the conditioning stage and don’t change it between Stage 1 and decode.

When to Use This vs. Simpler Workflows

The multi-stage pipeline adds setup complexity and generation time. It’s not always the right call.

Use multi-stage latent upscaling when:

  • Final delivery is 1080p or higher
  • Your shot contains fine detail (hair, fabric, skin, text)
  • Clip length is 4+ seconds (longer clips benefit most from Stage 1 coherence)
  • You’re doing image-to-video and need the reference frame to hold across the full clip

Stick with single-pass when:

  • You’re iterating quickly on prompt ideas and don’t need final quality yet
  • Output is 720p or below for social/draft use
  • You’re on under 12GB VRAM and need results without complex pipeline setup
  • Your scene is simple (abstract motion, minimal texture, solid backgrounds)

The distilled variant (ltx-2.3-22b-distilled) completes in as few as 8 denoising steps — dramatically faster than the full dev model. For most creators, distilled is the better starting point before committing to the multi-stage pipeline. Run the distilled single-pass first. If the detail ceiling frustrates you, that’s when you graduate to multi-stage.

FAQ

Q: Do I need the spatial upscaler model separately, or is it included in the checkpoint?

It’s a separate file. Download ltx-2.3-spatial-upscaler-x2-1.0.safetensors from the Lightricks/LTX-2.3 HuggingFace page and place it in ComfyUI/models/latent_upscale_models/. The main checkpoint doesn’t include it. Same for the Distilled LoRA — separate download, separate folder (models/loras/).

Q: Does changing the seed between Stage 1 and Stage 2 break consistency?

Yes. Always use the same seed in both stages. Stage 2 uses the Stage 1 latent as input — a different seed adds noise in a mismatched direction, causing artifacts rather than refinement.

Q: My Stage 2 output looks blurry instead of sharper. What’s wrong?

Most likely cause: the Distilled LoRA isn’t loaded, or your Stage 2 denoising strength is below 0.3. Check that LoraLoaderModelOnly is connected before the Stage 2 KSampler and set denoising to at least 0.4.

