We present an implicit video representation that disentangles occlusions, appearance, and motion in monocular videos, which we call Video SPatiotemporal Splines (VideoSPatS). Unlike previous methods that map time and coordinates to deformation and canonical colors, our VideoSPatS maps input coordinates into Spatial and Color Spline deformation fields, D_s and D_c, which disentangle motion and appearance in videos. With spline-based
parametrization, our method naturally generates temporally consistent flow and
guarantees long-term temporal consistency, which is crucial for convincing
video editing. Using multiple prediction branches, our VideoSPatS model also
performs layer separation between the latent video and the selected occluder.
By disentangling occlusions, appearance, and motion, our method enables better
spatiotemporal modeling and editing of diverse videos, including in-the-wild
talking head videos with challenging occlusions, shadows, and specularities,
while maintaining an appropriate canonical space for editing. We also present
general video modeling results on the DAVIS and CoDeF datasets, as well as our
own talking head video dataset collected from open-source web videos. Extensive
ablations show that the combination of D_s and D_c under neural splines can
overcome motion and appearance ambiguities, paving the way for more advanced
video editing models.
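
To make the spline-based parametrization concrete, below is a minimal, hypothetical sketch (not the paper's implementation) of a deformation field in the spirit of D_s: an MLP predicts per-coordinate displacements at a small set of temporal control points, and a uniform cubic B-spline basis blends them over time, which is what yields temporally smooth, consistent flow. All names, dimensions, and design choices (e.g., SplineDeformationField, the number of control points) are assumptions for illustration only.

```python
# Hypothetical sketch of a spline-parametrized deformation field.
# Architectural details are assumptions, not the paper's code.
import torch
import torch.nn as nn


def cubic_bspline_basis(u: torch.Tensor) -> torch.Tensor:
    """Uniform cubic B-spline basis weights for local parameter u in [0, 1).

    Returns a (..., 4) tensor of weights for the 4 surrounding control points.
    """
    u2, u3 = u * u, u * u * u
    b0 = (1 - u) ** 3 / 6
    b1 = (3 * u3 - 6 * u2 + 4) / 6
    b2 = (-3 * u3 + 3 * u2 + 3 * u + 1) / 6
    b3 = u3 / 6
    return torch.stack([b0, b1, b2, b3], dim=-1)


class SplineDeformationField(nn.Module):
    """Maps (x, y, t) to a 2D displacement by blending per-coordinate
    control-point displacements with a cubic B-spline basis over time,
    so temporal smoothness is built into the parametrization."""

    def __init__(self, num_ctrl: int = 16, hidden: int = 128, out_dim: int = 2):
        super().__init__()
        self.num_ctrl = num_ctrl
        self.out_dim = out_dim
        # The MLP predicts displacements for all temporal control points at once.
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_ctrl * out_dim),
        )

    def forward(self, xy: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # xy: (N, 2) spatial coords in [0, 1]; t: (N,) time in [0, 1].
        n = xy.shape[0]
        ctrl = self.mlp(xy).view(n, self.num_ctrl, self.out_dim)      # (N, K, 2)
        # Locate the spline segment and local parameter for each query time.
        pos = t.clamp(0, 1 - 1e-6) * (self.num_ctrl - 3)              # (N,)
        seg = pos.floor().long()                                       # segment index
        u = pos - seg.float()                                          # local coordinate
        w = cubic_bspline_basis(u)                                     # (N, 4)
        # Gather the 4 control displacements supporting this segment and blend.
        idx = seg.unsqueeze(-1) + torch.arange(4, device=xy.device)    # (N, 4)
        idx = idx.unsqueeze(-1).expand(-1, -1, self.out_dim)           # (N, 4, 2)
        support = torch.gather(ctrl, 1, idx)                           # (N, 4, 2)
        return (w.unsqueeze(-1) * support).sum(dim=1)                  # (N, 2)


if __name__ == "__main__":
    field = SplineDeformationField()
    xy = torch.rand(8, 2)
    t = torch.rand(8)
    print(field(xy, t).shape)  # torch.Size([8, 2])
```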