VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control

Yuanpeng Tu1,2   Hao Luo2,3   Xi Chen1   Sihui Ji1   Xiang Bai4   Hengshuang Zhao1  
1HKU    2DAMO Academy, Alibaba Group    3Hupan Lab    4HUST




Input video
Edited Video


Our method can insert or replace diverse objects in videos with precise alignment to the given
motion trajectories, while accurately preserving the appearance details of the reference object.

Our VideoAnydoor enables a wide range of applications for users, including video virtual try-on,
video face swapping, logo insertion, and multi-region editing. (Scroll to view more videos)

Video Virtual Try-on




Video Face Swapping




Logo Insertion




Multi-region editing





Abstract

In this paper, we propose VideoAnydoor, a zero-shot video object insertion framework with high-fidelity detail preservation and precise motion control. Starting from a text-to-video model, we utilize an ID extractor to inject the global identity and leverage a box sequence to control the overall motion. To preserve the detailed appearance while supporting fine-grained motion control, we design a pixel warper. It takes as inputs the reference image with arbitrary key-points and the corresponding key-point trajectories. It warps the pixel details according to the trajectories and fuses the warped features with the diffusion U-Net, thus improving detail preservation and allowing users to manipulate the motion trajectories. In addition, we propose a training strategy involving both videos and static images with a reweighted reconstruction loss to enhance insertion quality. VideoAnydoor demonstrates significant superiority over existing methods and naturally supports various downstream applications (e.g., video face swapping, video virtual try-on, multi-region editing) without task-specific fine-tuning.
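The reweighted reconstruction loss mentioned above is not spelled out on this page; a minimal sketch of one plausible form is shown below, assuming the weighting simply up-weights pixels inside the insertion region (the function name, mask convention, and weight values are hypothetical, not the paper's exact formulation):

```python
import torch

def reweighted_recon_loss(pred, target, region_mask, w_inside=2.0, w_outside=1.0):
    """Hypothetical reweighted reconstruction loss: pixels inside the
    insertion region (mask == 1) are up-weighted so the model focuses
    on reproducing the inserted object's details.

    pred, target: (B, C, H, W) tensors; region_mask: (B, 1, H, W) in {0, 1}.
    """
    # per-pixel weights: w_outside everywhere, w_inside where the mask is 1
    weights = w_outside + (w_inside - w_outside) * region_mask.float()
    return (weights * (pred - target) ** 2).mean()
```

The exact weighting scheme in the paper may differ; the point is that reconstruction error inside the edited region contributes more to the loss than the unchanged background.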





Overall Pipeline

VideoAnydoor builds on a text-to-video diffusion model. An ID extractor injects the global identity of the reference object, while a box sequence controls its overall motion. To preserve detailed appearance and support fine-grained motion control, the pixel warper takes the reference image with arbitrary key-points and the corresponding key-point trajectories, warps the pixel details according to those trajectories, and fuses the warped features into the diffusion U-Net. This improves detail preservation and lets users manipulate the motion trajectories directly. Training combines videos and static images under a reweighted reconstruction loss to further enhance insertion quality.
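The core warping step of the pixel warper can be illustrated as follows. This is a minimal sketch, assuming the sparse key-point trajectories have already been interpolated into a dense flow field in normalized coordinates (the function name and flow convention are assumptions, not the paper's implementation); the warped features would then be fused into the diffusion U-Net:

```python
import torch
import torch.nn.functional as F

def warp_features(ref_feat, flow):
    """Warp reference features with a dense flow field via bilinear sampling.

    ref_feat: (B, C, H, W) reference-image features.
    flow:     (B, H, W, 2) per-pixel offsets in normalized [-1, 1] coordinates,
              assumed to be interpolated from the key-point trajectories.
    """
    b, _, h, w = ref_feat.shape
    # base sampling grid covering the feature map in normalized coordinates
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    # shift each sampling location by the trajectory-derived flow
    grid = base + flow
    return F.grid_sample(ref_feat, grid, align_corners=True)
```

With zero flow this reduces to an identity mapping; non-zero flow moves the reference details along the user-specified trajectories before fusion.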

Video Introduction














BibTeX

@article{tu2025videoanydoorhighfidelityvideoobject,
      title={VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control},
      author={Yuanpeng Tu and Hao Luo and Xi Chen and Sihui Ji and Xiang Bai and Hengshuang Zhao},
      journal={arXiv preprint arXiv:2501.01427},
      year={2025},
}