Infinite Worlds, Summoned by Voice
1) OSS end-to-end real-time video gen (Flux + Mochi)
I’m building an immersive, locally deployed pipeline that turns a novel (or a spoken narrative) into film-like video. The goal is privacy-first production and full control over model provenance.
- Privacy: everything can run on your workstation.
- Reproducibility: models + configs are explicit and versioned.
- Quality: Flux for high-quality keyframes, Mochi for motion.
2) Core model: Mochi 1
Mochi 1 is a widely used open-source DiT video model from Genmo AI. It’s known for strong physical motion: fluids, lighting changes, and particle motion feel consistent and “real.”
- Industrial scale: ~10B parameters (not a toy model).
- English-first training: strong at complex sci-fi descriptions.
- Motion realism: coherent dynamics across frames.
In practice, a dual RTX 4090 (48GB total VRAM) workstation is a great comfort zone for Mochi + decoding, especially with offload + tiling.
3) Architecture overview
The pipeline is five layers, with a clean data flow:
Text / Audio input
↓
[Perception] Whisper Large-v3 (ASR)
↓
[Understanding] Llama-3.1-70B (scene split + prompt generation)
↓
[Static] Flux.1 Dev (4K keyframes)
↓
[Dynamic] Mochi 1 (physics-realistic video)
↓
[Post] Upscale + frame interpolation + encoding → 4K60 output
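The five layers above can be sketched as a simple orchestration skeleton. The `Scene` container and stage names are illustrative choices of mine, not the project's actual code:

```python
from dataclasses import dataclass

@dataclass
class Scene:
    """Per-scene state threaded through the pipeline (illustrative)."""
    index: int
    flux_prompt: str = ""
    motion_prompt: str = ""
    keyframe_path: str = ""
    clip_path: str = ""

# One entry per layer in the diagram above, in execution order.
PIPELINE_STAGES = [
    "perception",     # Whisper Large-v3 (ASR)
    "understanding",  # Llama-3.1-70B (scene split + prompt generation)
    "static",         # Flux.1 Dev (4K keyframes)
    "dynamic",        # Mochi 1 (physics-realistic video)
    "post",           # upscale + frame interpolation + encoding
]
```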
4) Hardware configuration
Local workstation (dual RTX 4090)
| Part | Spec | Budget (USD) |
|---|---|---|
| GPU | 2× RTX 4090 24GB | Owned |
| CPU | Ryzen 9 7950X | ~$550 |
| RAM | 128GB DDR5 | ~$400 |
| NVMe | 4TB Gen4 | ~$300 |
| PSU | 1600W (Platinum) | ~$350 |
| Total (excl. GPUs) | | ~$1,600 |
VRAM allocation strategy
- GPU 0: Mochi inference + VAE decode.
- GPU 1: Flux inference + Llama Int4.
- CPU RAM: offload buffer for large models.
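The allocation above can be captured as a small placement map; the model keys are hypothetical names I chose, and the real objects would be diffusers / llama.cpp pipelines:

```python
# Hypothetical device map mirroring the VRAM allocation strategy above.
DEVICE_MAP = {
    "mochi": "cuda:0",  # Mochi inference + VAE decode
    "flux": "cuda:1",   # Flux inference
    "llama": "cuda:1",  # Int4 Llama shares GPU 1 with Flux
}

def device_for(model: str) -> str:
    """Return the assigned GPU; anything unmapped stays on CPU,
    which doubles as the offload buffer for large models."""
    return DEVICE_MAP.get(model, "cpu")
```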
Cloud fallback (batch production)
| Platform | Config | $/hour |
|---|---|---|
| Vast.ai | 2× 4090 | $1.20 |
| RunPod | 2× A100 | $3.50 |
| Lambda | 1× H100 | $2.50 |
Simple strategy: develop locally, scale out in the cloud.
5) Model inventory
| Model | Size | Source | License |
|---|---|---|---|
| Whisper Large-v3 | ~3GB | OpenAI | MIT |
| Llama-3.1-70B (Int4) | ~35GB | Meta | Llama 3.1 |
| Flux.1 Dev | ~24GB | Black Forest Labs | Apache 2.0 |
| Mochi 1 Preview | ~40GB | Genmo | Apache 2.0 |
| Real-ESRGAN | ~64MB | xinntao | BSD |
| RIFE | ~50MB | hzwer | MIT |
Total: ~102GB. I keep ~200GB SSD free for caches and intermediate outputs.
6) Workflow (what the system actually does)
Stage 1 — Text preprocessing
A large LLM is used to produce a Scene Manifest JSON: scene boundaries, extracted visual elements, Flux prompts (composition + texture), and Mochi motion prompts (camera + physics).
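For concreteness, one Scene Manifest entry might look like the dict below; every field name here is an assumption of mine, not a fixed schema:

```python
import json

# One illustrative manifest entry (field names are assumptions).
manifest = {
    "scenes": [
        {
            "id": 1,
            "summary": "The narrator enters the derelict station.",
            "visual_elements": ["derelict station", "red emergency lighting"],
            "flux_prompt": "wide establishing shot of a derelict space station "
                           "interior, red emergency lighting, volumetric haze",
            "mochi_prompt": "slow dolly forward, dust motes drifting through "
                            "shafts of red light",
        }
    ]
}

REQUIRED = {"id", "flux_prompt", "mochi_prompt"}

def validate(m: dict) -> bool:
    """Every scene must carry the prompts the downstream stages consume."""
    return all(REQUIRED <= set(scene) for scene in m["scenes"])
```

Keeping the manifest as plain JSON makes it easy to hand-edit between the LLM pass and generation.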
Stage 2 — Keyframe generation (Flux.1 Dev)
Generate a 2048×1152 first frame per scene (optionally with ControlNet / IP-Adapter for consistency).
- guidance_scale: 3.5
- inference_steps: 50
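As a sketch, the keyframe call could use diffusers' `FluxPipeline`; the model ID and bfloat16 dtype are the usual choices for Flux.1 Dev, and `flux_settings` is a helper name I made up:

```python
def flux_settings(prompt: str) -> dict:
    """Generation kwargs matching the settings listed above."""
    return {
        "prompt": prompt,
        "height": 1152,
        "width": 2048,
        "guidance_scale": 3.5,
        "num_inference_steps": 50,
    }

if __name__ == "__main__":  # requires the ~24GB weights and a large GPU
    import torch
    from diffusers import FluxPipeline

    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
    ).to("cuda:1")  # GPU 1 per the allocation strategy
    image = pipe(**flux_settings("derelict space station at dawn")).images[0]
    image.save("scene_001_keyframe.png")
```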
Stage 3 — Video generation (Mochi 1)
Mochi takes the Flux keyframe and runs image-to-video inference: 84 frames (~3.5s at 24fps) at 480×848, with guidance around 4.5.
VRAM optimization is critical:

```python
import torch

# Memory optimizations (key for dual 4090); `pipe` is the loaded Mochi pipeline.
pipe.enable_model_cpu_offload()  # staged offload: park idle submodules in CPU RAM
pipe.enable_vae_tiling()         # decode latents in tiles to cap peak VRAM
torch.cuda.empty_cache()         # release cached allocations after each scene
```
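Note that diffusers currently ships Mochi 1 as a text-to-video `MochiPipeline`; conditioning it on the Flux keyframe needs a custom image-to-video path, so treat this as a text-to-video sketch with the settings above (`mochi_settings` is a helper name of mine):

```python
def mochi_settings(prompt: str) -> dict:
    """Generation kwargs from the text: 84 frames at 480x848,
    which is roughly 3.5 s of footage at 24 fps."""
    return {
        "prompt": prompt,
        "num_frames": 84,
        "height": 480,
        "width": 848,
        "guidance_scale": 4.5,
    }

if __name__ == "__main__":  # requires the ~40GB weights
    import torch
    from diffusers import MochiPipeline
    from diffusers.utils import export_to_video

    pipe = MochiPipeline.from_pretrained(
        "genmo/mochi-1-preview", torch_dtype=torch.bfloat16
    )
    pipe.enable_model_cpu_offload()
    pipe.enable_vae_tiling()
    frames = pipe(**mochi_settings("slow dolly through drifting dust")).frames[0]
    export_to_video(frames, "scene_001.mp4", fps=24)
```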
Stage 4 — Enhancement
- Real-ESRGAN 4× upscale: 480×848 → 1920×3392
- RIFE interpolation: 24fps → 60fps
- FFmpeg (H.265): CRF 18 for high quality
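The final encode step can be driven from Python by building the FFmpeg command; the CRF matches the text, while `-preset` and the pixel format are my assumed defaults:

```python
def encode_cmd(src: str, dst: str) -> list[str]:
    """Build the FFmpeg H.265 encode command (CRF 18, as above)."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-c:v", "libx265",
        "-crf", "18",
        "-preset", "slow",      # assumption: favor quality over encode speed
        "-pix_fmt", "yuv420p",  # assumption: broad player compatibility
        dst,
    ]

# e.g. subprocess.run(encode_cmd("scene_001_60fps.mp4",
#                                "scene_001_final.mp4"), check=True)
```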
Stage 5 — Assembly
Stitch scenes by manifest order, add transitions, align audio, and export a final 4K60 deliverable.
7) Performance & cost
Per-scene timing (dual RTX 4090)
| Step | Time |
|---|---|
| Llama analysis | ~10s |
| Flux keyframe | ~45s |
| Mochi video | ~180s |
| Post-processing | ~60s |
| Total per scene | ~5 min |
Output per scene: ~3.5 seconds of 4K60 video.
Project estimate (10-minute final video)
Assuming ~170 scenes:
- Local: ~14 hours, roughly ~$5 electricity.
- Cloud: ~8 hours, roughly ~$10–$30 GPU cost.
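These figures follow directly from the per-scene numbers; a quick sanity check:

```python
# Back-of-envelope check of the project estimate above.
SCENES = 170
OUTPUT_SECONDS_PER_SCENE = 3.5   # final footage produced per scene
COMPUTE_MINUTES_PER_SCENE = 5    # from the per-scene timing table

final_video_minutes = SCENES * OUTPUT_SECONDS_PER_SCENE / 60
local_compute_hours = SCENES * COMPUTE_MINUTES_PER_SCENE / 60

print(f"{final_video_minutes:.1f} min of video, "
      f"{local_compute_hours:.1f} h local compute")
# → 9.9 min of video, 14.2 h local compute
```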
8) Future: NVIDIA Cosmos
In the future, I also want to explore world models to push coherence, controllability, and longer-horizon storytelling.
9) Gallery
View generated samples from the Flux + Mochi pipeline.