LTX 2.3 Multi-Stage Latent Upscaling Workflow in ComfyUI

Hi there, this is Dora. Two weeks ago I watched a 5-second clip I’d generated at 480p look genuinely cinematic after running it through the LTX 2.3 multi-stage latent upscaling pipeline. Same prompt. Same motion. Completely different level of sharpness and edge detail. I actually said “wait, what?” out loud to my empty room.

LTX 2.3 dropped recently, and the ComfyUI-LTXVideo repository shipped reference workflows for multi-stage latent upscaling on day one. But the documentation assumes you already know what latent upscaling is and why it matters. Most creators don’t. This guide fixes that.

What Multi-Stage Latent Upscaling Is (Concept in Plain Language)

Standard video generation works like this: you give the model a prompt, it generates a video at whatever resolution you asked for, done. Single pass. One resolution. What you see is what you get.

Multi-stage latent upscaling is a different approach. Instead of generating at full resolution in one shot, you:

  1. Generate at a lower resolution in latent space — getting motion structure, scene coherence, and temporal consistency right first
  2. Upscale within the latent space before decoding — adding spatial detail without regenerating the whole clip
  3. Run a second denoising pass on the upscaled latent to lock in fine texture, edge sharpness, and lighting detail
  4. Optionally, apply a final pixel-space upscale for maximum output resolution
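If the four steps feel abstract, here is a toy sketch of the data flow using nothing but latent tensor shapes. The channel count and the 8× spatial compression factor are illustrative assumptions, not LTX internals; the point is that Stage 2 grows only the spatial dimensions while the frame count (the temporal axis) stays untouched.

```python
# Toy model of the multi-stage flow, tracking latent SHAPES only.
# The (channels, frames, h, w) layout, 8x spatial compression, and
# 128 channels are illustrative assumptions, not LTX's actual internals.

def stage1_generate(width: int, height: int, frames: int) -> tuple:
    """Stage 1: base generation produces a low-resolution latent."""
    return (128, frames, height // 8, width // 8)

def stage2_latent_upscale(latent: tuple, factor: int = 2) -> tuple:
    """Stage 2: spatial upscale in latent space. Only H and W grow; the
    frame axis is untouched, which is why temporal coherence survives."""
    c, f, h, w = latent
    return (c, f, h * factor, w * factor)

base = stage1_generate(640, 384, 97)     # (128, 97, 48, 80)
upscaled = stage2_latent_upscale(base)   # (128, 97, 96, 160)
```

Notice that `frames` never changes between the two calls. Pixel-space upscalers operate frame-by-frame after decoding, which is exactly where flicker creeps in.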

The key word is latent. You’re not upscaling the decoded video frames (that’s pixel-space upscaling, like Topaz Video AI does). You’re operating directly on the compressed latent representation — the model’s internal “understanding” of the video — before it ever gets turned into pixels. This preserves temporal coherence across frames in a way that pixel-space upscaling cannot.

The result: sharper detail, preserved motion consistency, and significantly better edge accuracy — especially on fine textures like hair, fabric, and text — compared to generating at the target resolution in a single pass.

Why Use Multi-Stage vs. Single-Pass Generation

The honest answer: single-pass generation at high resolution is computationally expensive and temporally inconsistent. Here’s why the two-stage approach wins:

| Factor | Single-Pass (High Res) | Multi-Stage Latent Upscale |
|---|---|---|
| Motion coherence | Can drift on complex motion | ✅ Established at lower res in Stage 1 |
| Fine detail | Present but can over-constrain motion | ✅ Added in Stage 2 upscale pass |
| VRAM at generation | High (full resolution throughout) | Lower (base res in Stage 1) |
| Temporal consistency | Risk of frame-to-frame drift | ✅ Preserved across upscale |
| Generation speed | Slower for equivalent detail | Faster Stage 1 + targeted Stage 2 |

The community consensus on X and Reddit, backed by Lightricks’ own blog documentation, confirms that the two-stage pipeline consistently outperforms single-pass at equivalent compute. The gap is most visible on shots with complex texture — skin, fabric, foliage — and on clips longer than 4 seconds where single-pass generation starts to drift.

One important constraint worth knowing upfront: width and height settings must be divisible by 32 at every stage. This trips people up when setting custom resolutions.
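A two-line check before queueing a workflow saves a failed run. This is a generic helper of my own, not a ComfyUI node:

```python
def check_res(width: int, height: int) -> None:
    """Raise if a resolution would violate the divisible-by-32 rule."""
    for name, v in (("width", width), ("height", height)):
        if v % 32 != 0:
            lo, hi = (v // 32) * 32, (v // 32 + 1) * 32
            raise ValueError(
                f"{name}={v} is not divisible by 32 (nearest valid: {lo} or {hi})"
            )

check_res(640, 384)   # passes silently
# check_res(1280, 1080) would raise: 1080 sits between 1056 and 1088
```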

The Official Multi-Stage Workflow Walkthrough

Before touching any node settings, your file structure needs to be correct. Here’s the required directory layout for the two-stage pipeline:

ComfyUI/
├── models/
│   ├── checkpoints/
│   │   └── ltx-2.3-22b-dev-fp8.safetensors      # or BF16 if VRAM allows
│   ├── latent_upscale_models/
│   │   └── ltx-2.3-spatial-upscaler-x2-1.0.safetensors   # required for Stage 2
│   ├── loras/
│   │   └── ltx-2.3-22b-distilled-lora-384.safetensors    # required for pipeline
│   └── text_encoders/
│       └── gemma_3_12B_it_fp4_mixed.safetensors

The Spatial Upscaler and Distilled LoRA are both required for current two-stage pipeline implementations in the ComfyUI-LTXVideo repository. If either file is missing, the workflow will fail silently or produce artifacts at the upscale stage.
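Because a missing file fails silently, it is worth running a preflight check before loading the workflow. The paths mirror the layout above; the helper itself is just a sketch, not part of ComfyUI:

```python
from pathlib import Path

# Relative paths taken from the directory layout above.
REQUIRED_FILES = {
    "checkpoint": "models/checkpoints/ltx-2.3-22b-dev-fp8.safetensors",
    "spatial upscaler": "models/latent_upscale_models/ltx-2.3-spatial-upscaler-x2-1.0.safetensors",
    "distilled LoRA": "models/loras/ltx-2.3-22b-distilled-lora-384.safetensors",
    "text encoder": "models/text_encoders/gemma_3_12B_it_fp4_mixed.safetensors",
}

def missing_models(comfyui_root: str) -> list:
    """Return the names of any required model files that are absent."""
    root = Path(comfyui_root)
    return [name for name, rel in REQUIRED_FILES.items() if not (root / rel).exists()]
```

Call `missing_models("path/to/ComfyUI")` and download anything it reports before opening the workflow.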

Stage 1: Base Generation Settings

Stage 1 focuses entirely on motion structure and scene coherence — not on detail. You’re generating at roughly half your target resolution.

Target resolution for Stage 1:

| Final Output | Stage 1 Resolution | Notes |
|---|---|---|
| 1280×720 (720p) | 640×384 | Divisible by 32 ✅ |
| 1920×1080 (1080p) | 960×544 | Divisible by 32 ✅ |
| 2560×1440 (1440p) | 1280×736 | Half of 1440 is 720, which isn’t divisible by 32; round up to 736 ✅ |
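The pattern behind the Stage 1 column: halve the target, then round up to the next multiple of 32 (half of 720 is 360, which rounds up to 384). That rule as a helper for custom targets, inferred from the table rather than taken from any official formula:

```python
def stage1_resolution(final_w: int, final_h: int) -> tuple:
    """Half the final resolution, each axis rounded UP to a multiple of 32."""
    def half_snap(v: int) -> int:
        return ((v // 2 + 31) // 32) * 32   # ceil(v/2 / 32) * 32
    return (half_snap(final_w), half_snap(final_h))

stage1_resolution(1920, 1080)   # (960, 544)
```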

Key nodes in Stage 1:

  • RandomNoise — Set a fixed seed if you want reproducible results. -1 gives variation; any specific number locks the generation.
  • KSamplerSelect — Use euler for most content; dpmpp_2m if you need stronger prompt adherence.
  • LTXVScheduler — The LTX-specific scheduler that balances temporal stability with prompt adherence. Don’t swap this for a generic scheduler.
  • MultiModalGuider — Separates text guidance from cross-modal alignment. You can dial up motion fluidity without overfitting to the prompt — that’s the difference between creepy over-constrained motion and natural, believable movement.
  • CFGGuider — Keep CFG between 3.0–4.5 for LTX 2.3. Higher values cause over-constrained, jittery motion.

Stage 1 prompt tip: The Gemma 3 12B text encoder powering LTX 2.3 handles complex, multi-sentence prompts accurately. Don’t keyword-stuff — write descriptively. “A woman walks through a rainy Tokyo street, neon reflections on wet pavement, handheld camera, cinematic” outperforms a list of tags.

Stage 2: Latent Upscaling Node

This is where the magic happens. The LTXVLatentUpsampler node performs a 2× spatial upscale directly in latent space using the loaded spatial upscaler model.

# Stage 2 node chain:
LTXVLatentUpsampler (#130)
  ├── Input: AV latent from Stage 1 KSampler
  ├── LatentUpscaleModelLoader (#114) → ltx-2.3-spatial-upscaler-x2-1.0.safetensors
  └── Output: 2× upscaled AV latent → feeds into Stage 2 KSampler

Stage 2 KSampler configuration:
  ├── RandomNoise (#127) — use SAME seed as Stage 1 for consistency
  ├── KSamplerSelect (#145)
  ├── ManualSigmas (#113) — controls the refinement noise schedule
  └── LoraLoaderModelOnly (#143) — apply distilled LoRA here for texture polish

The second pass refines the upscaled latent using a ManualSigmas schedule. This stage is where micro-detail and edge sharpness are finalized — it works best when the LoRA is active and the prompt is specific about textures and lighting.

Critical setting: Keep the Stage 2 denoising strength between 0.35–0.55. Too high (above 0.7) and Stage 2 will override the motion structure from Stage 1 — you’ll get sharp frames that don’t flow correctly. Too low (below 0.2) and Stage 2 adds nothing.
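A useful mental model for how denoising strength maps onto a ManualSigmas schedule: the second pass runs only the tail of the noise schedule, and the denoise value picks how much of that tail. This is an illustration of the concept, not ComfyUI's exact implementation:

```python
def stage2_sigmas(full_sigmas: list, denoise: float) -> list:
    """Keep the last `denoise` fraction of a sigma schedule.
    denoise=0.45 -> Stage 2 starts from moderate noise and refines;
    denoise=1.0  -> Stage 2 regenerates from scratch (what you DON'T want)."""
    if not 0.0 < denoise <= 1.0:
        raise ValueError("denoise must be in (0, 1]")
    n = max(1, round(len(full_sigmas) * denoise))
    return full_sigmas[-n:]

schedule = [14.6, 9.0, 5.0, 2.5, 1.0, 0.5, 0.2, 0.0]  # example values only
stage2_sigmas(schedule, 0.5)   # [1.0, 0.5, 0.2, 0.0]
```

At 0.5 the refinement pass never revisits the high-noise steps where motion structure lives, which is exactly why it refines instead of replacing.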

Stage 3: Final Spatial Upscale (Optional)

After Stage 2 decodes to pixel space via VAEDecodeTiled, you can optionally apply a final pixel-space upscale for maximum output resolution.

NVIDIA RTX Video Super Resolution is now available as a ComfyUI node — a real-time 4K upscaler that runs on RTX GPU Tensor Cores, delivering 4K upscaling 30× faster than alternative local upscalers at a fraction of the VRAM cost. For RTX users, this is the cleanest Stage 3 path. For everyone else, RealESRGAN (via the ComfyUI-RealESRGAN node) remains the strongest free alternative.

VRAM Requirements at Each Stage

This is the section everyone actually needs before starting. LTX 2.3 is a 22B-parameter model; the hardware requirements are real.

| Configuration | VRAM Required | Practical GPU | Notes |
|---|---|---|---|
| BF16 full precision | ~44GB | A100 / dual GPU | Best quality ceiling |
| FP8 quantized (recommended) | ~23–30GB | RTX 4090 (24GB) | Sweet spot for quality vs. memory |
| FP16 quantized | ~22GB | RTX 3090 / 4090 | Strong middle ground |
| GGUF Q4_K_M | ~10–12GB | RTX 3080 (10GB) | Community format; more setup complexity |
| GGUF Q4_K_S | ~8GB | RTX 3070 Ti | Noticeable softening vs. BF16 |

In practice, 720p runs on 12–24GB with FP8 quantization, and 1080p on 24–32GB. The official minimum is 32GB VRAM, but the community has pushed this significantly lower with quantized variants.

For VRAM-constrained setups, add this ComfyUI launch flag to reserve headroom:

python main.py --reserve-vram 4 --fp8_e4m3fn-unet

--reserve-vram 4 keeps 4GB free for the OS and other processes. --fp8_e4m3fn-unet runs the diffusion model in FP8 (e4m3fn format, optimized for inference) while keeping VAE at higher precision.

As of earlier this year, ComfyUI has Dynamic VRAM enabled by default, which massively reduces RAM usage and prevents VRAM OOMs. Make sure you’re on v0.16.1+ before troubleshooting memory issues — older versions don’t have this.

One important note on GGUF: GGUF is the “make it fit” option, not always the “cleanest install” option. It’s attractive for low-VRAM users, but it’s also where people are seeing more size mismatch errors and workflow confusion. If you’re on 16GB or above, stick with the official FP8 safetensors checkpoint.

Quality Settings and Trade-offs

| Setting | Conservative | Balanced | Maximum Quality |
|---|---|---|---|
| Stage 1 steps | 20 | 30 | 40–50 |
| Stage 2 steps | 10 | 15 | 20 |
| CFG scale | 3.0 | 3.5–4.0 | 4.5 |
| Stage 2 denoise | 0.35 | 0.45 | 0.55 |
| FPS | 24 | 30 | 50 |
| Generation time (RTX 4090, 1080p, 10s) | ~10 min | ~20 min | ~30 min |
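If you want these presets in a form you can paste into notes or a wrapper script, here they are as plain data. Where the table gives a range, I've collapsed it to a single value of my own choosing (midpoint or thereabouts):

```python
# Presets transcribed from the table above; ranges collapsed to one value.
PRESETS = {
    "conservative": {"stage1_steps": 20, "stage2_steps": 10, "cfg": 3.0,
                     "stage2_denoise": 0.35, "fps": 24},
    "balanced":     {"stage1_steps": 30, "stage2_steps": 15, "cfg": 3.75,
                     "stage2_denoise": 0.45, "fps": 30},
    "max_quality":  {"stage1_steps": 45, "stage2_steps": 20, "cfg": 4.5,
                     "stage2_denoise": 0.55, "fps": 50},
}
```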

The setting with the biggest quality-per-minute return: Stage 2 denoising strength. Going from 0.35 to 0.48 typically adds more visible detail than doubling your Stage 1 step count.

Common Errors and Fixes

RuntimeError: CUDA out of memory The most common error by far. Fix in order of impact: (1) enable --fp8_e4m3fn-unet launch flag, (2) add --reserve-vram 4, (3) switch to the FP8 checkpoint, (4) reduce resolution (must be divisible by 32), (5) drop to GGUF if all else fails.

Red nodes / missing node errors on workflow load Your ComfyUI isn’t on v0.16.1+. If nodes are missing when loading a workflow, update ComfyUI via the Manager and run Update All; note that the desktop version’s updates can lag slightly behind the nightly build.

Stage 2 overrides Stage 1 motion (video looks “regenerated”) Your Stage 2 denoising strength is too high. Drop it below 0.5. The Stage 2 pass should refine, not replace.

Size mismatch error with GGUF models GGUF models load via the UnetLoader node, not CheckpointLoaderSimple. Check that your GGUF file is in models/unet/, not models/checkpoints/.

Output video has playback drift (audio and video out of sync) Keep your FPS setting consistent with the value used during conditioning — inconsistency here causes playback drift. Set FPS once at the conditioning stage and don’t change it between Stage 1 and decode.

When to Use This vs. Simpler Workflows

The multi-stage pipeline adds setup complexity and generation time. It’s not always the right call.

Use multi-stage latent upscaling when:

  • Final delivery is 1080p or higher
  • Your shot contains fine detail (hair, fabric, skin, text)
  • Clip length is 4+ seconds (longer clips benefit most from Stage 1 coherence)
  • You’re doing image-to-video and need the reference frame to hold across the full clip

Stick with single-pass when:

  • You’re iterating quickly on prompt ideas and don’t need final quality yet
  • Output is 720p or below for social/draft use
  • You’re on under 12GB VRAM and need results without complex pipeline setup
  • Your scene is simple (abstract motion, minimal texture, solid backgrounds)

The distilled variant (ltx-2.3-22b-distilled) completes in as few as 8 denoising steps — dramatically faster than the full dev model. For most creators, distilled is the better starting point before committing to the multi-stage pipeline. Run the distilled single-pass first. If the detail ceiling frustrates you, that’s when you graduate to multi-stage.

FAQ

Q: Do I need the spatial upscaler model separately, or is it included in the checkpoint?

It’s a separate file. Download ltx-2.3-spatial-upscaler-x2-1.0.safetensors from the Lightricks/LTX-2.3 HuggingFace page and place it in ComfyUI/models/latent_upscale_models/. The main checkpoint doesn’t include it. Same for the Distilled LoRA — separate download, separate folder (models/loras/).

Q: Does changing the seed between Stage 1 and Stage 2 break consistency?

Yes. Always use the same seed in both stages. Stage 2 uses the Stage 1 latent as input — a different seed adds noise in a mismatched direction, causing artifacts rather than refinement.

Q: My Stage 2 output looks blurry instead of sharper. What’s wrong?

Most likely cause: the Distilled LoRA isn’t loaded, or your Stage 2 denoising strength is below 0.3. Check that LoraLoaderModelOnly is connected before the Stage 2 KSampler and set denoising to at least 0.4.

