PaintScene4D: Consistent 4D Scene Generation from Text Prompts

¹Indian Institute of Technology Madras  ²University of Illinois Urbana-Champaign
Teaser Figure

4D Text-to-Scene Generation: Unlike prior methods that restrict text-to-4D generation to object-level reconstruction, or text-to-video models that lack explicit camera control, our approach reconstructs a full 4D scene from a single text prompt. This yields realistic 4D scenes that can be rendered from arbitrary camera trajectories.

Abstract

Recent advances in diffusion models have revolutionized 2D and 3D content creation, yet generating photorealistic dynamic 4D scenes remains a significant challenge. Existing 4D generation methods typically rely on distilling knowledge from pre-trained 3D generative models, often fine-tuned on synthetic object datasets. Consequently, the resulting scenes tend to be object-centric and lack photorealism. While text-to-video models can generate more realistic scenes with motion, they often struggle with spatial understanding and provide limited control over camera viewpoints during rendering. To address these limitations, we present PaintScene4D, a novel text-to-4D scene generation framework that departs from conventional multi-view generative models in favor of a streamlined architecture harnessing video generative models trained on diverse real-world datasets. Our method first generates a reference video using a video generation model, then employs a strategic camera array selection for rendering. We apply a progressive warping and inpainting technique to ensure both spatial and temporal consistency across multiple viewpoints. Finally, we optimize the multi-view images with a dynamic renderer, enabling flexible camera control based on user preferences. PaintScene4D produces realistic 4D scenes that can be viewed from arbitrary trajectories.
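The progressive warp-and-inpaint stage described above can be sketched as a loop over the camera array, where each new view is warped from the previously completed view so that inpainted content propagates consistently. The sketch below is a minimal toy illustration, not the paper's implementation: a horizontal pixel shift stands in for depth-based reprojection, and copying from the previous view stands in for a diffusion-based inpainter. All function names are hypothetical.

```python
import numpy as np

def warp_to_view(frame, shift):
    """Toy stand-in for depth-based reprojection to a neighboring view:
    shift the image horizontally and mark disoccluded pixels as holes (NaN)."""
    h, w = frame.shape
    warped = np.full((h, w), np.nan)
    if shift >= 0:
        warped[:, shift:] = frame[:, :w - shift]
    else:
        warped[:, :w + shift] = frame[:, -shift:]
    return warped

def inpaint_holes(warped, reference):
    """Toy stand-in for the inpainting model: fill hole pixels from the
    previous completed view (a diffusion inpainter is used in practice)."""
    holes = np.isnan(warped)
    filled = warped.copy()
    filled[holes] = reference[holes]
    return filled

def progressive_warp_inpaint(reference_frame, view_shifts):
    """Walk the camera array progressively: warp the last completed view to
    the next viewpoint, then inpaint its disocclusions, so inpainted content
    stays consistent across all views."""
    views = [reference_frame]
    for shift in view_shifts:
        warped = warp_to_view(views[-1], shift)
        views.append(inpaint_holes(warped, views[-1]))
    return views

if __name__ == "__main__":
    ref = np.arange(16.0).reshape(4, 4)  # a tiny "reference frame"
    views = progressive_warp_inpaint(ref, [1, 1])
    print(len(views), np.isnan(views[-1]).any())  # 3 False
```

The key design point the sketch captures is the chaining: view *k+1* is warped from the inpainted view *k*, not from the reference, which is what keeps newly hallucinated content consistent across the whole camera array.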

We highlight our advantages as: (1) Training-free and Efficient, (2) Scene-level Generation, and (3) Camera Trajectory Control.

Video

Gallery of Results

results gallery

We present qualitative results of our text-to-4D generation framework, showcasing superior visual fidelity, consistent multi-view reconstructions, plausible scene compositions, and realistic dynamic motions. The horizontal axis represents time and the vertical axis represents different viewpoints. A comprehensive collection of video demonstrations is provided in the supplementary materials.

Explicit Camera Control

results gallery

Camera Control: PaintScene4D demonstrates strong explicit camera control. Input text prompt: "A rabbit flying a drone in the middle of mountains." For comparison, we guide the T2V model by appending camera-motion directives such as "The camera tilts to the right" or "The camera moves upwards" to the text prompt. Because T2V models handle camera motion only implicitly, they often fail to generate precise or controllable camera movements. Once trained, our approach supports flexible camera trajectories within the bounds of the input training cameras, achieving precise and repeatable control over camera movement for the same scene.
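"Flexible trajectories within the bounds of the training cameras" can be illustrated by resampling a novel path through the trained camera positions. The snippet below is an illustrative sketch with hypothetical names, not the paper's renderer interface (which would consume full poses, rotation and translation, rather than positions alone):

```python
import numpy as np

def interpolate_trajectory(train_positions, n_frames):
    """Linearly interpolate a rendering trajectory through the training
    camera positions, so every sampled pose stays within the bounds of
    the trained camera array."""
    train_positions = np.asarray(train_positions, dtype=float)
    # Parameterize the training cameras on [0, 1], then resample uniformly.
    t_train = np.linspace(0.0, 1.0, len(train_positions))
    t_new = np.linspace(0.0, 1.0, n_frames)
    return np.stack(
        [np.interp(t_new, t_train, train_positions[:, d])
         for d in range(train_positions.shape[1])],
        axis=1,
    )

if __name__ == "__main__":
    # Three training camera positions (x, y, z); render a 5-frame sweep.
    cams = [[0, 0, 0], [1, 0, 0], [1, 1, 0]]
    traj = interpolate_trajectory(cams, 5)
    print(traj.shape)  # (5, 3)
```

Because the trajectory is an interpolation of the training cameras, replaying it on the optimized 4D representation gives the same camera motion every time, which is the repeatability that implicit prompt-based control in T2V models cannot offer.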

BibTeX

@article{gupta2024paintscene4d,
  title={PaintScene4D: Consistent 4D Scene Generation from Text Prompts},
  author={Gupta, Vinayak and Man, Yunze and Wang, Yuxiong},
  journal={arXiv preprint arXiv:2412.04471},
  year={2024}
}