First, we conduct a comprehensive analysis of the attention mechanism in the MM-DiT architecture. Our investigation reveals that MM-DiT's attention exhibits remarkable similarities to the cross/self-attention blocks found in UNet-like diffusion models. Specifically, as illustrated in Fig. 1, the attention matrix in MM-DiT can be decomposed into four distinct regions: text-to-text, video-to-video, text-to-video, and video-to-text attention. Taking the prompt "a cat watch a black mouse" as an example, we observe that each text token exhibits distinct activation patterns in the attention values averaged over the text-to-video and video-to-text regions. Furthermore, as shown in Fig. 2, our visualization of the text-to-video and video-to-text attention patterns reveals that MM-DiT not only provides functionality similar to UNet's self-attention but also demonstrates enhanced capabilities in temporal modeling. This indicates that the attention mechanism of MM-DiT can be readily extended to the multi-prompt video generation task, enabling precise mask-guided semantic control across different prompts via KV-sharing. It can likewise be extended to video editing tasks such as word swap and attention reweighting.
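Below is a minimal sketch of this decomposition (not the released DiTCtrl code; the tensor layout with text tokens first, the `n_text` split, and the per-token averaging are our illustrative assumptions):

```python
import torch

def decompose_mmdit_attention(q, k, n_text):
    """Split the joint MM-DiT attention map into its four regions.

    q, k: (batch, heads, n_text + n_video, head_dim) -- concatenated
          text and video tokens, with text tokens placed first.
    n_text: number of text tokens in the joint sequence.
    """
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)

    regions = {
        "text2text":   attn[..., :n_text, :n_text],
        "text2video":  attn[..., :n_text, n_text:],
        "video2text":  attn[..., n_text:, :n_text],
        "video2video": attn[..., n_text:, n_text:],
    }
    # Per-token response: average the text-to-video and video-to-text maps
    # over the video axis to obtain one activation value per text token.
    token_response = 0.5 * (regions["text2video"].mean(dim=-1)
                            + regions["video2text"].mean(dim=-2))
    return regions, token_response
```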
Our method synthesizes content-consistent and motion-consistent videos from multiple prompts. The first video is synthesized with the source text prompt \(P_{i-1}\). During the denoising process for the next video, we replace full attention with a mask-guided KV-sharing strategy that queries video content from the source video \(V_{i-1}\), so that we can synthesize a content-consistent video under the modified target prompt \(P_i\). In this illustration, the initial latent is assumed to contain five frames: the first three frames generate the content of \(P_{i-1}\), and the last three frames generate the content of \(P_i\). The pink latent denotes the overlapping frame, while the blue and green latents distinguish the two prompt segments.
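The following is a minimal sketch of the mask-guided KV-sharing step inside one attention layer, under our own assumptions (the function name, the precomputed foreground mask `mask_fg`, and the use of PyTorch's `scaled_dot_product_attention` are illustrative choices, not the released implementation):

```python
import torch
import torch.nn.functional as F

def masked_kv_sharing_attn(q_tgt, k_tgt, v_tgt, k_src, v_src, mask_fg):
    """Attend the target branch against both its own and the source KV.

    q_tgt, k_tgt, v_tgt: (batch, heads, n_tokens, head_dim) of the target
                         prompt branch (P_i).
    k_src, v_src:        same shapes, cached from the source branch (P_{i-1}).
    mask_fg:             (n_tokens,) bool, True where a video token belongs to
                         the foreground subject that should stay consistent.
    """
    out_src = F.scaled_dot_product_attention(q_tgt, k_src, v_src)  # shared KV
    out_tgt = F.scaled_dot_product_attention(q_tgt, k_tgt, v_tgt)  # own KV
    # Foreground tokens reuse source content; background follows the new prompt.
    m = mask_fg.view(1, 1, -1, 1).to(out_src.dtype)
    return m * out_src + (1.0 - m) * out_tgt
```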
Although our design targets the multi-prompt video generation task, our method naturally handles single-prompt longer video generation by setting all sequential prompts to the same text. This shows that our method can also enhance the consistency of single-prompt long video generation.
A white SUV drives on a steep dirt road... (665 frames)
fish in ocean (241 frames)
hot air balloon (241 frames)
By removing the latent blending strategy from our approach, DiTCtrl can achieve word-swap video editing [1][2]. Specifically, we only use the mask-guided KV-sharing strategy to share keys and values from the source prompt \(P_{source}\) branch, so that the synthesized video preserves the original composition while reflecting the content of the new prompt \(P_{target}\).
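A minimal word-swap sketch is given below (the function name and the `share_layers` selection are hypothetical; only the idea of borrowing the source branch's keys/values is taken from the description above):

```python
import torch
import torch.nn.functional as F

def word_swap_step(q_tgt, k_tgt, v_tgt, k_src, v_src, share_layers, layer_idx):
    """On selected layers the target branch (swapped word) attends to the
    source branch's cached keys/values, inheriting the original composition;
    on the remaining layers it uses its own KV so the new word takes effect."""
    if layer_idx in share_layers:
        return F.scaled_dot_product_attention(q_tgt, k_src, v_src)
    return F.scaled_dot_product_attention(q_tgt, k_tgt, v_tgt)
```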
Similar to Prompt-to-Prompt [1], by reweighting the specific columns and rows corresponding to a specified token (e.g., "pink" or "snowy") in MM-DiT's text-to-video and video-to-text attention regions, we can also achieve reweighting-based video editing.
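A minimal reweighting sketch follows, assuming the same joint attention layout as above (text tokens first); the function name and re-normalization detail are our own illustrative choices:

```python
import torch

def reweight_token_attention(attn, token_idx, n_text, scale):
    """attn: (batch, heads, n_text + n_video, n_text + n_video) joint map.
    token_idx: index of the text token to amplify or suppress.
    scale: > 1 strengthens the concept, < 1 weakens it."""
    attn = attn.clone()
    # Text-to-video region: the token's row over all video columns.
    attn[..., token_idx, n_text:] *= scale
    # Video-to-text region: all video rows at the token's column.
    attn[..., n_text:, token_idx] *= scale
    # Re-normalize so each query row still sums to one.
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)
    return attn
```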
@article{cai2024ditctrl,
title = {DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation},
author = {Cai, Minghong and Cun, Xiaodong and Li, Xiaoyu and Liu, Wenze and Zhang, Zhaoyang and Zhang, Yong and Shan, Ying and Yue, Xiangyu},
journal = {arXiv:2412.18597},
year = {2024},
}