This paper addresses the challenge of learning semantically and functionally meaningful 3D motion priors from
real-world videos, in order to enable prediction of future
3D scene motion from a single input image. We propose
a novel pixel-aligned Motion Map (MoMap) representation
for 3D scene motion, which can be generated with existing generative image models to facilitate efficient and
effective motion prediction. To learn meaningful distributions
over motion, we create a large-scale database of MoMaps
from over 50,000 real videos and train a diffusion model on
these representations. Our motion generation not only synthesizes trajectories in 3D but also suggests a new pipeline
for 2D video synthesis: first generate a MoMap, then warp an image accordingly, and complete the warped
point-based renderings. Experimental results demonstrate that our approach generates plausible and
semantically consistent 3D scene motion.
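
To make the representation concrete, the sketch below shows one plausible reading of the pixel-aligned MoMap and the warping step described above: pixels are lifted to 3D with depth, displaced along their per-pixel MoMap trajectories, and reprojected as a point-based rendering whose holes would then be completed. All tensor shapes, function names, the pinhole camera model, and the use of a given depth map are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): a pixel-aligned MoMap is read here as a
# per-pixel set of future 3D displacements; a single image is warped by unprojecting
# pixels with depth, moving them along the MoMap, and reprojecting them as points.
import numpy as np

H, W, T = 240, 320, 8            # image size and number of future time steps (assumed)
K = np.array([[300.0, 0.0, W / 2],
              [0.0, 300.0, H / 2],
              [0.0, 0.0, 1.0]])  # assumed pinhole intrinsics

image = np.random.rand(H, W, 3)                 # input RGB frame (placeholder data)
depth = np.random.uniform(1.0, 5.0, (H, W))     # per-pixel depth (assumed given)
momap = np.random.randn(H, W, T, 3) * 0.01      # MoMap: per-pixel 3D motion over T steps

def unproject(depth, K):
    """Lift every pixel to a 3D point in camera coordinates."""
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # (H, W, 3)
    rays = pix @ np.linalg.inv(K).T
    return rays * depth[..., None]

def warp_with_momap(image, depth, momap, K, t):
    """Scatter image pixels to time step t along the MoMap (crude point-based rendering)."""
    pts = unproject(depth, K) + momap[:, :, t]          # displaced 3D points
    proj = pts @ K.T
    u = np.round(proj[..., 0] / proj[..., 2]).astype(int)
    v = np.round(proj[..., 1] / proj[..., 2]).astype(int)
    out = np.zeros_like(image)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (pts[..., 2] > 0)
    out[v[valid], u[valid]] = image[valid]              # rendering with holes
    return out                                          # holes would then be completed

warped = warp_with_momap(image, depth, momap, K, t=T - 1)
```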