Flux + Mochi

Infinite worlds, summoned by voice


1) OSS end-to-end real-time video gen (Flux + Mochi)

I’m building an immersive, locally deployed pipeline that turns a novel (or a spoken narrative) into film-like video. The goal is privacy-first production and full control over model provenance.

  • Privacy: everything can run on your workstation.
  • Reproducibility: models + configs are explicit and versioned.
  • Quality: Flux for high-quality keyframes, Mochi for motion.

2) Core model: Mochi 1

Mochi 1 is a widely used open-source DiT video model from Genmo AI. It’s known for strong physical motion: fluids, lighting changes, and particle motion feel consistent and “real.”

  • Industrial scale: ~10B parameters (not a toy model).
  • English-first training: prompts work best in English, and it handles complex sci-fi descriptions well.
  • Motion realism: coherent dynamics across frames.

In practice, a dual RTX 4090 (48GB total VRAM) workstation is a great comfort zone for Mochi + decoding, especially with offload + tiling.

3) Architecture overview

The pipeline is five layers, with a clean data flow:

Text / Audio input
      ↓
[Perception] Whisper Large-v3 (ASR)
      ↓
[Understanding] Llama-3.1-70B (scene split + prompt generation)
      ↓
[Static] Flux.1 Dev (2048×1152 keyframes)
      ↓
[Dynamic] Mochi 1 (physics-realistic video)
      ↓
[Post] Upscale + frame interpolation + encoding → 4K60 output

4) Hardware configuration

Local workstation (dual RTX 4090)

Part  | Spec             | Budget (USD)
GPU   | 2× RTX 4090 24GB | Owned
CPU   | Ryzen 9 7950X    | ~$550
RAM   | 128GB DDR5       | ~$400
NVMe  | 4TB Gen4         | ~$300
PSU   | 1600W (Platinum) | ~$350
Total |                  | ~$1,600

VRAM allocation strategy

  • GPU 0: Mochi inference + VAE decode.
  • GPU 1: Flux inference + Llama Int4.
  • CPU RAM: offload buffer for large models.
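
In code, the split can be sketched by pinning each diffusers pipeline's offload device. This is only a sketch: the model IDs and the bf16 variant are assumptions, and the Llama-3.1-70B Int4 worker runs in its own process (e.g. via llama.cpp) alongside Flux on GPU 1.

import torch
from diffusers import FluxPipeline, MochiPipeline

# GPU 0: Mochi inference + VAE decode, with staged CPU offload to fit in 24GB.
mochi = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview", variant="bf16", torch_dtype=torch.bfloat16
)
mochi.enable_model_cpu_offload(gpu_id=0)
mochi.enable_vae_tiling()

# GPU 1: Flux keyframes; the Llama worker shares this card from a separate process.
flux = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
flux.enable_model_cpu_offload(gpu_id=1)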

Cloud fallback (batch production)

Platform | Config  | $/hour
Vast.ai  | 2× 4090 | $1.20
RunPod   | 2× A100 | $3.50
Lambda   | 1× H100 | $2.50

Simple strategy: develop locally, scale out in the cloud.

5) Model inventory

Model                | Size  | Source            | License
Whisper Large-v3     | ~3GB  | OpenAI            | MIT
Llama-3.1-70B (Int4) | ~35GB | Meta              | Llama 3.1 Community
Flux.1 Dev           | ~24GB | Black Forest Labs | FLUX.1 [dev] Non-Commercial
Mochi 1 Preview      | ~40GB | Genmo             | Apache 2.0
Real-ESRGAN          | ~64MB | xinntao           | BSD-3-Clause
RIFE                 | ~50MB | hzwer             | MIT

Total: ~102GB. I keep ~200GB SSD free for caches and intermediate outputs.
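
Most of these weights live on Hugging Face and can be fetched with huggingface_hub. The repo IDs below are my best guesses; Llama 3.1 and FLUX.1 [dev] are gated and require accepting their licenses first, and the Real-ESRGAN / RIFE weights come from their GitHub releases instead.

from huggingface_hub import snapshot_download

# Repo IDs are assumptions; gated repos need an HF token with the license accepted.
for repo_id in [
    "openai/whisper-large-v3",
    "meta-llama/Llama-3.1-70B-Instruct",   # gated
    "black-forest-labs/FLUX.1-dev",        # gated
    "genmo/mochi-1-preview",
]:
    snapshot_download(repo_id)             # downloads into the local HF cache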

6) Workflow (what the system actually does)

Stage 1 — Text preprocessing

A large LLM is used to produce a Scene Manifest JSON: scene boundaries, extracted visual elements, Flux prompts (composition + texture), and Mochi motion prompts (camera + physics).
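
The exact schema is up to the LLM prompt; a hypothetical entry for a single scene might look like this (all field names and values are illustrative, not the actual manifest format):

# Hypothetical shape of one Scene Manifest entry
scene = {
    "id": 12,
    "summary": "The shuttle breaches the storm layer above the ice moon",
    "flux_prompt": "wide establishing shot, shuttle piercing violet storm clouds, "
                   "volumetric lighting, 35mm film grain",
    "mochi_prompt": "slow push-in as the shuttle descends; turbulent clouds swirl, "
                    "ice particles streak past the hull",
    "duration_s": 3.5,
}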

Stage 2 — Keyframe generation (Flux.1 Dev)

Generate a 2048×1152 first frame per scene (optionally with ControlNet / IP-Adapter for consistency).

  • guidance_scale: 3.5
  • inference_steps: 50
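
A minimal keyframe call with diffusers' FluxPipeline might look like the sketch below; the model ID, prompt, and seed are assumptions, everything else follows the parameters above.

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()   # keep peak VRAM on GPU 1 manageable

image = pipe(
    prompt="wide establishing shot, shuttle piercing violet storm clouds, "
           "volumetric lighting, 35mm film grain",   # from the Scene Manifest
    height=1152,
    width=2048,
    guidance_scale=3.5,
    num_inference_steps=50,
    generator=torch.Generator("cpu").manual_seed(42),   # reproducible keyframes
).images[0]
image.save("scene_012_keyframe.png")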

Stage 3 — Video generation (Mochi 1)

Mochi takes the Flux keyframe and runs image-to-video inference: 84 frames (~3.5s at 24fps) at 480×848, with guidance around 4.5.

VRAM optimization is critical:

import torch   # needed for the manual cache flush below

# Memory optimizations (key for dual 4090); `pipe` is the loaded Mochi pipeline
pipe.enable_model_cpu_offload()   # staged offload of submodules to CPU RAM
pipe.enable_vae_tiling()          # decode the video latents in tiles
torch.cuda.empty_cache()          # release cached VRAM after each scene
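
For reference, a full text-to-video call via diffusers looks roughly like the sketch below. Note that the stock MochiPipeline in diffusers is text-to-video; conditioning on the Flux keyframe is a separate image-to-video path in my pipeline and is not shown here. The model ID, bf16 variant, and prompt are assumptions.

import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview", variant="bf16", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()   # memory optimizations from the snippet above
pipe.enable_vae_tiling()

frames = pipe(
    prompt="slow push-in as the shuttle descends; turbulent clouds swirl around the hull",
    height=480,
    width=848,
    num_frames=84,                # ~3.5 s at 24 fps
    guidance_scale=4.5,
).frames[0]
export_to_video(frames, "scene_012_raw.mp4", fps=24)
torch.cuda.empty_cache()          # free VRAM before the next scene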

Stage 4 — Enhancement

  1. Real-ESRGAN 4×: 480p → 1920p
  2. RIFE interpolation: 24fps → 60fps
  3. FFmpeg (H.265): CRF 18 for high quality
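
Real-ESRGAN and RIFE are driven through their own CLIs; for the final encode, a minimal FFmpeg call from Python might look like this (paths and preset are placeholders):

import subprocess

# Encode the upscaled, interpolated frame sequence with H.265 at CRF 18.
subprocess.run(
    [
        "ffmpeg", "-y",
        "-framerate", "60",                  # after RIFE 24 -> 60 fps interpolation
        "-i", "scene_012_frames/%06d.png",   # upscaled frames from Real-ESRGAN
        "-c:v", "libx265",
        "-crf", "18",                        # visually near-lossless
        "-preset", "slow",
        "-pix_fmt", "yuv420p",
        "scene_012_final.mp4",
    ],
    check=True,
)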

Stage 5 — Assembly

Stitch scenes by manifest order, add transitions, align audio, and export a final 4K60 deliverable.
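
One simple way to stitch the clips in manifest order is FFmpeg's concat demuxer; a sketch without transitions or the audio pass (file names are placeholders):

import subprocess

scene_files = ["scene_001_final.mp4", "scene_002_final.mp4"]   # ordered by the manifest

# Write a concat list and join the clips without re-encoding.
with open("concat.txt", "w") as f:
    for path in scene_files:
        f.write(f"file '{path}'\n")

subprocess.run(
    ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
     "-i", "concat.txt", "-c", "copy", "final_4k60.mp4"],
    check=True,
)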

7) Performance & cost

Per-scene timing (dual RTX 4090)

Step            | Time
Llama analysis  | ~10s
Flux keyframe   | ~45s
Mochi video     | ~180s
Post-processing | ~60s
Total per scene | ~5 min

Output per scene: ~3.5 seconds of 4K60 video.

Project estimate (10-minute final video)

Assuming ~170 scenes:

  • Local: ~14 hours of compute, roughly $5 of electricity.
  • Cloud: ~8 hours, roughly $10–$30 in GPU cost.
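
The estimate follows directly from the per-scene numbers:

# Back-of-envelope for a 10-minute final cut
scenes = 10 * 60 / 3.5           # ≈ 171 scenes at ~3.5 s of footage each
local_hours = scenes * 5 / 60    # ≈ 14.3 hours at ~5 min of compute per scene
print(f"{scenes:.0f} scenes, {local_hours:.1f} h local compute")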

8) Future: NVIDIA Cosmos

Longer term, I want to explore world models such as NVIDIA Cosmos to push coherence, controllability, and longer-horizon storytelling.

9) Daily Log

2026-01-05
Generalize the pipeline: have the user ask the agent questions explicitly, rather than letting the agent guess intent from signals.
2026-01-04
Set up XGIMI projector and Google Play developer account. Tested Gemini Nano Banana with voice input.