1 The Chinese University of Hong Kong, 2 Adobe Research, 3 The Hong Kong University of Science and Technology, 4 SmartMore
*Internship Project Co-corresponding authors

Object & Effect Removal

Object Insertion

Object Replacement

Effect Editing

Background Replacement

Emergent Ability: Outpainting

Abstract

Large-scale video generation models have the inherent ability to realistically model natural scenes. In this paper, we demonstrate that through a careful design of a generative video propagation framework, various video tasks can be addressed in a unified way by leveraging the generative power of such models.

Specifically, our framework, GenProp, encodes the original video with a selective content encoder and propagates the changes made to the first frame using an image-to-video generation model. We propose a data generation scheme that covers multiple video tasks and is built on instance-level video segmentation datasets. Our model is trained with an additional mask prediction decoder head and a region-aware loss, which help the encoder preserve the original content while the generation model propagates the modified region.
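To make the idea of a region-aware loss concrete, here is a minimal sketch in plain Python. It is not the formulation used by GenProp, which this page does not spell out. It only assumes that the loss treats the edited region and the preserved region separately. The function name `region_aware_loss`, the weights `w_edit` and `w_keep`, and the use of nested lists in place of tensors are all illustrative assumptions.

```python
from typing import List

Frame = List[List[float]]  # H x W grayscale frame (stand-in for a tensor)
Video = List[Frame]        # T frames


def region_aware_loss(pred: Video, target: Video, mask: Video,
                      w_edit: float = 1.0, w_keep: float = 1.0) -> float:
    """Weighted sum of mean squared errors inside and outside the edit.

    mask[t][y][x] is 1.0 where the edit should be propagated and 0.0
    where the original content should be preserved. The weights are
    hypothetical knobs, not values taken from the paper.
    """
    edit_err = keep_err = 0.0
    edit_n = keep_n = 0.0
    for p_frame, t_frame, m_frame in zip(pred, target, mask):
        for p_row, t_row, m_row in zip(p_frame, t_frame, m_frame):
            for p, t, m in zip(p_row, t_row, m_row):
                sq = (p - t) ** 2
                edit_err += m * sq
                keep_err += (1.0 - m) * sq
                edit_n += m
                keep_n += 1.0 - m
    # Normalise each region by its own pixel count so that a small
    # edited region is not drowned out by a large background.
    edit_loss = edit_err / edit_n if edit_n else 0.0
    keep_loss = keep_err / keep_n if keep_n else 0.0
    return w_edit * edit_loss + w_keep * keep_loss
```

Normalising the two regions separately is one possible design choice: with a single global mean, an error confined to a small edited object would contribute very little to the total.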

This novel design opens up new possibilities: in editing scenarios, GenProp allows substantial changes to an object’s shape; for insertion, the inserted objects can exhibit independent motion; for removal, GenProp effectively removes effects like shadows and reflections from the whole video; for tracking, GenProp is capable of tracking objects and their associated effects together. Experimental results demonstrate the leading performance of our model in various video tasks, and we further provide in-depth analyses of the proposed framework.

Video

Method

Training Framework of GenProp. Our framework integrates a Selective Content Encoder and a Mask Prediction Decoder on top of the I2V generation model, which pushes the model to propagate the edited region while preserving the original video content in all other regions.

Tracking Comparison

GenProp can also track object instances together with their effects when the target is marked with a solid color fill in the first frame.
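As an illustration of how such a first frame could be prepared, the sketch below paints an instance mask with a solid color and, in the other direction, reads a mask back from a frame by color matching. Both helpers (`fill_instance`, `mask_from_color`) and the tolerance parameter `tol` are hypothetical. How GenProp's outputs are actually converted to tracking masks is not described here, so the read-back step is only an assumption.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # RGB values in 0..255
Image = List[List[Pixel]]      # H x W image as nested lists
Mask = List[List[bool]]        # True where the instance is


def fill_instance(frame: Image, mask: Mask, color: Pixel) -> Image:
    """Return a copy of `frame` with masked pixels set to `color`."""
    return [
        [color if m else px for px, m in zip(row, mask_row)]
        for row, mask_row in zip(frame, mask)
    ]


def mask_from_color(frame: Image, color: Pixel, tol: int = 0) -> Mask:
    """Mark pixels whose channels are all within `tol` of `color`.

    Assumption: the tracked region keeps (approximately) the fill color
    in later frames, so a per-channel threshold recovers it.
    """
    return [
        [all(abs(c - k) <= tol for c, k in zip(px, color)) for px in row]
        for row in frame
    ]
```

For example, calling `fill_instance(first_frame, instance_mask, (255, 0, 0))` would yield a red-filled first frame, and `mask_from_color(frame, (255, 0, 0), tol=10)` would give a per-frame mask estimate under the stated assumption.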

Multiple Edits

Multiple edits (outpainting, insertion, removal) in a single inference run.

Applying GenProp to MovieGen’s Cases

BibTeX

@article{liu2024generativevideopropagation,
  title={Generative Video Propagation},
  author={Shaoteng Liu and Tianyu Wang and Jui-Hsien Wang and Qing Liu and Zhifei Zhang and Joon-Young Lee and Yijun Li and Bei Yu and Zhe Lin and Soo Ye Kim and Jiaya Jia},
  journal={arXiv preprint arXiv:2412.19761},
  year={2024}
}