Infinite Worlds, Summoned by Voice
1) OSS end-to-end real-time video gen (Flux + Mochi)
I’m building an immersive, locally deployed pipeline that turns a novel (or a spoken narrative) into film-like video. The goal is privacy-first production and full control over model provenance.
- Privacy: everything can run on your workstation.
- Reproducibility: models + configs are explicit and versioned.
- Quality: Flux for high-quality keyframes, Mochi for motion.
2) Core model: Mochi 1
Mochi 1 is a widely used open-source DiT video model from Genmo AI. It’s known for strong physical motion: fluids, lighting changes, and particle motion feel consistent and “real.”
- Industrial scale: ~10B parameters (not a toy model).
- English-first training: strong at complex sci-fi descriptions.
- Motion realism: coherent dynamics across frames.
In practice, a dual RTX 4090 (48GB total VRAM) workstation is a great comfort zone for Mochi + decoding, especially with offload + tiling.
3) Architecture overview
The pipeline is five layers, with a clean data flow:
Text / Audio input
↓
[Perception] Whisper Large-v3 (ASR)
↓
[Understanding] Llama-3.1-70B (scene split + prompt generation)
↓
[Static] Flux.1 Dev (4K keyframes)
↓
[Dynamic] Mochi 1 (physics-realistic video)
↓
[Post] Upscale + frame interpolation + encoding → 4K60 output
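The five layers above can be sketched as a simple orchestration skeleton. The `Scene` container and stage names are illustrative choices of mine, not the project's actual code:

```python
from dataclasses import dataclass

@dataclass
class Scene:
    """Per-scene state threaded through the pipeline (illustrative)."""
    index: int
    flux_prompt: str = ""
    motion_prompt: str = ""
    keyframe_path: str = ""
    clip_path: str = ""

# One entry per layer in the diagram above, in execution order.
PIPELINE_STAGES = [
    "perception",     # Whisper Large-v3 (ASR)
    "understanding",  # Llama-3.1-70B (scene split + prompt generation)
    "static",         # Flux.1 Dev (4K keyframes)
    "dynamic",        # Mochi 1 (physics-realistic video)
    "post",           # upscale + frame interpolation + encoding
]
```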
4) Hardware configuration
Local workstation (dual RTX 4090)
| Part | Spec | Budget (USD) |
|---|---|---|
| GPU | 2× RTX 4090 24GB | Owned |
| CPU | Ryzen 9 7950X | ~$550 |
| RAM | 128GB DDR5 | ~$400 |
| NVMe | 4TB Gen4 | ~$300 |
| PSU | 1600W (Platinum) | ~$350 |
| Total (excl. GPUs) | | ~$1,600 |
VRAM allocation strategy
- GPU 0: Mochi inference + VAE decode.
- GPU 1: Flux inference + Llama Int4.
- CPU RAM: offload buffer for large models.
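The allocation above can be captured as a small placement map; the model keys are hypothetical names I chose, and the real objects would be diffusers / llama.cpp pipelines:

```python
# Hypothetical device map mirroring the VRAM allocation strategy above.
DEVICE_MAP = {
    "mochi": "cuda:0",  # Mochi inference + VAE decode
    "flux": "cuda:1",   # Flux inference
    "llama": "cuda:1",  # Int4 Llama shares GPU 1 with Flux
}

def device_for(model: str) -> str:
    """Return the assigned GPU; anything unmapped stays on CPU,
    which doubles as the offload buffer for large models."""
    return DEVICE_MAP.get(model, "cpu")
```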
Cloud fallback (batch production)
| Platform | Config | $/hour |
|---|---|---|
| Vast.ai | 2× 4090 | $1.20 |
| RunPod | 2× A100 | $3.50 |
| Lambda | 1× H100 | $2.50 |
Simple strategy: develop locally, scale out in the cloud.
5) Model inventory
| Model | Size | Source | License |
|---|---|---|---|
| Whisper Large-v3 | ~3GB | OpenAI | MIT |
| Llama-3.1-70B (Int4) | ~35GB | Meta | Llama 3.1 |
| Flux.1 Dev | ~24GB | Black Forest Labs | Apache 2.0 |
| Mochi 1 Preview | ~40GB | Genmo | Apache 2.0 |
| Real-ESRGAN | ~64MB | xinntao | BSD |
| RIFE | ~50MB | hzwer | MIT |
Total: ~102GB. I keep ~200GB SSD free for caches and intermediate outputs.
6) Workflow (what the system actually does)
Stage 1 — Text preprocessing
A large LLM is used to produce a Scene Manifest JSON: scene boundaries, extracted visual elements, Flux prompts (composition + texture), and Mochi motion prompts (camera + physics).
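For concreteness, one Scene Manifest entry might look like the dict below; every field name here is an assumption of mine, not a fixed schema:

```python
import json

# One illustrative manifest entry (field names are assumptions).
manifest = {
    "scenes": [
        {
            "id": 1,
            "summary": "The narrator enters the derelict station.",
            "visual_elements": ["derelict station", "red emergency lighting"],
            "flux_prompt": "wide establishing shot of a derelict space station "
                           "interior, red emergency lighting, volumetric haze",
            "mochi_prompt": "slow dolly forward, dust motes drifting through "
                            "shafts of red light",
        }
    ]
}

REQUIRED = {"id", "flux_prompt", "mochi_prompt"}

def validate(m: dict) -> bool:
    """Every scene must carry the prompts the downstream stages consume."""
    return all(REQUIRED <= set(scene) for scene in m["scenes"])
```

Keeping the manifest as plain JSON makes it easy to hand-edit between the LLM pass and generation.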
Stage 2 — Keyframe generation (Flux.1 Dev)
Generate a 2048×1152 first frame per scene (optionally with ControlNet / IP-Adapter for consistency).
- guidance_scale: 3.5
- inference_steps: 50
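As a sketch, the keyframe call could use diffusers' `FluxPipeline`; the model ID and bfloat16 dtype are the usual choices for Flux.1 Dev, and `flux_settings` is a helper name I made up:

```python
def flux_settings(prompt: str) -> dict:
    """Generation kwargs matching the settings listed above."""
    return {
        "prompt": prompt,
        "height": 1152,
        "width": 2048,
        "guidance_scale": 3.5,
        "num_inference_steps": 50,
    }

if __name__ == "__main__":  # requires the ~24GB weights and a large GPU
    import torch
    from diffusers import FluxPipeline

    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
    ).to("cuda:1")  # GPU 1 per the allocation strategy
    image = pipe(**flux_settings("derelict space station at dawn")).images[0]
    image.save("scene_001_keyframe.png")
```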
Stage 3 — Video generation (Mochi 1)
Mochi takes the Flux keyframe and runs image-to-video inference: 84 frames (~3.5s at 24fps) at 480×848, with guidance around 4.5.
VRAM optimization is critical:

```python
import torch

# Memory optimizations (key for dual 4090); `pipe` is the loaded Mochi pipeline.
pipe.enable_model_cpu_offload()  # staged offload: park idle submodules in CPU RAM
pipe.enable_vae_tiling()         # decode latents in tiles to cap peak VRAM
torch.cuda.empty_cache()         # release cached allocations after each scene
```
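Note that diffusers currently ships Mochi 1 as a text-to-video `MochiPipeline`; conditioning it on the Flux keyframe needs a custom image-to-video path, so treat this as a text-to-video sketch with the settings above (`mochi_settings` is a helper name of mine):

```python
def mochi_settings(prompt: str) -> dict:
    """Generation kwargs from the text: 84 frames at 480x848,
    which is roughly 3.5 s of footage at 24 fps."""
    return {
        "prompt": prompt,
        "num_frames": 84,
        "height": 480,
        "width": 848,
        "guidance_scale": 4.5,
    }

if __name__ == "__main__":  # requires the ~40GB weights
    import torch
    from diffusers import MochiPipeline
    from diffusers.utils import export_to_video

    pipe = MochiPipeline.from_pretrained(
        "genmo/mochi-1-preview", torch_dtype=torch.bfloat16
    )
    pipe.enable_model_cpu_offload()
    pipe.enable_vae_tiling()
    frames = pipe(**mochi_settings("slow dolly through drifting dust")).frames[0]
    export_to_video(frames, "scene_001.mp4", fps=24)
```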
Stage 4 — Enhancement
- Real-ESRGAN 4× upscale: 480×848 → 1920×3392
- RIFE interpolation: 24fps → 60fps
- FFmpeg (H.265): CRF 18 for high quality
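The final encode step can be driven from Python by building the FFmpeg command; the CRF matches the text, while `-preset` and the pixel format are my assumed defaults:

```python
def encode_cmd(src: str, dst: str) -> list[str]:
    """Build the FFmpeg H.265 encode command (CRF 18, as above)."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-c:v", "libx265",
        "-crf", "18",
        "-preset", "slow",      # assumption: favor quality over encode speed
        "-pix_fmt", "yuv420p",  # assumption: broad player compatibility
        dst,
    ]

# e.g. subprocess.run(encode_cmd("scene_001_60fps.mp4",
#                                "scene_001_final.mp4"), check=True)
```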
Stage 5 — Assembly
Stitch scenes by manifest order, add transitions, align audio, and export a final 4K60 deliverable.
7) Performance & cost
Per-scene timing (dual RTX 4090)
| Step | Time |
|---|---|
| Llama analysis | ~10s |
| Flux keyframe | ~45s |
| Mochi video | ~180s |
| Post-processing | ~60s |
| Total per scene | ~5 min |
Output per scene: ~3.5 seconds of 4K60 video.
Project estimate (10-minute final video)
Assuming ~170 scenes:
- Local: ~14 hours, roughly ~$5 electricity.
- Cloud: ~8 hours, roughly ~$10–$30 GPU cost.
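These figures follow directly from the per-scene numbers; a quick sanity check:

```python
# Back-of-envelope check of the project estimate above.
SCENES = 170
OUTPUT_SECONDS_PER_SCENE = 3.5   # final footage produced per scene
COMPUTE_MINUTES_PER_SCENE = 5    # from the per-scene timing table

final_video_minutes = SCENES * OUTPUT_SECONDS_PER_SCENE / 60
local_compute_hours = SCENES * COMPUTE_MINUTES_PER_SCENE / 60

print(f"{final_video_minutes:.1f} min of video, "
      f"{local_compute_hours:.1f} h local compute")
# → 9.9 min of video, 14.2 h local compute
```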
8) Future: NVIDIA Cosmos
In the future, I also want to explore world models to push coherence, controllability, and longer-horizon storytelling.
9) Gallery
View generated samples from the Flux + Mochi pipeline.