TokenFlow: Consistent Diffusion Features for Consistent Video Editing

https://diffusion-tokenflow.github.io/
TokenFlow is a framework for text-driven video editing that uses a pre-trained text-to-image diffusion model. Given an input video and a target text prompt, it generates a high-quality edited video that adheres to the prompt while preserving the spatial layout and motion of the original. The key observation is that temporal consistency in the edited video can be achieved by enforcing consistency in the diffusion feature space: edited features are propagated across frames according to inter-frame correspondences found in the original video. The framework requires no training or fine-tuning and can be combined with any off-the-shelf text-to-image editing method. Experiments demonstrate state-of-the-art editing results on real-world videos.

Concretely, the method inverts the input video frames, extracts their diffusion feature tokens, and uses nearest-neighbor search to establish feature correspondences between frames. During denoising, generated tokens are replaced with tokens propagated along these correspondences from the original video.
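To make the propagation step concrete, here is a minimal numpy sketch of the core idea: find, for each token of a target frame, its nearest-neighbor token in a source (keyframe) feature map by cosine similarity, then reuse those correspondences to copy the *edited* source tokens into the target frame. This is an illustrative simplification, not the authors' implementation; the function names, the use of plain cosine similarity, and the 2-D toy features are assumptions for the example.

```python
import numpy as np

def nn_correspondences(src_tokens: np.ndarray, tgt_tokens: np.ndarray) -> np.ndarray:
    """For each target token, return the index of its nearest source token
    under cosine similarity. Shapes: src (n_src, d), tgt (n_tgt, d)."""
    src = src_tokens / np.linalg.norm(src_tokens, axis=1, keepdims=True)
    tgt = tgt_tokens / np.linalg.norm(tgt_tokens, axis=1, keepdims=True)
    sim = tgt @ src.T                 # (n_tgt, n_src) cosine similarities
    return sim.argmax(axis=1)         # nearest source index per target token

def propagate_tokens(edited_src_tokens: np.ndarray, corr: np.ndarray) -> np.ndarray:
    """Replace each target token with the edited source token it corresponds to."""
    return edited_src_tokens[corr]

# Toy example: target tokens are rescaled copies of source tokens,
# so cosine similarity recovers the correct correspondence.
src = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
tgt = np.array([[0.0, 2.0], [2.0, 0.0]])      # matches src[1], src[0]
corr = nn_correspondences(src, tgt)            # -> array([1, 0])

edited_src = np.array([[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]])
edited_tgt = propagate_tokens(edited_src, corr)
print(edited_tgt)                              # [[20. 20.] [10. 10.]]
```

In the actual method the tokens are diffusion features (e.g. self-attention outputs) and correspondences are computed between neighboring frames of the inverted original video, so that the same propagation applied to edited keyframe tokens yields a temporally consistent edit.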