DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

1MMLab, The Chinese University of Hong Kong, 2GVC Lab, Great Bay University, 3ARC Lab, Tencent PCG, 4Tencent AI Lab

†Intern at ARC Lab, Tencent PCG, *Corresponding Authors


Our method DiTCtrl can generate multi-prompt videos with good temporal consistency and strong prompt-following capability. We showcase various types of transitions, including background, motion, identity, and camera-view transitions, demonstrating the versatility of our approach. Note that the first case is inspired by the scene description from the concurrent training-based work Presto [3]. Although our method is training-free, DiTCtrl also produces cinematic-style transitions in depicting the boy's riding sequence.

Abstract

Sora-like video generation models have achieved remarkable progress with the Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current video generation models predominantly focus on single prompts, struggling to generate coherent scenes from multiple sequential prompts that better reflect real-world dynamic scenarios. While some pioneering works have explored multi-prompt video generation, they face significant challenges, including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, the first training-free multi-prompt video generation method under MM-DiT architectures. Our key idea is to treat multi-prompt video generation as temporal video editing with smooth transitions. To achieve this goal, we first analyze MM-DiT's attention mechanism and find that its 3D full attention behaves similarly to the cross-/self-attention blocks in UNet-like diffusion models, enabling precise, mask-guided semantic control across different prompts via attention sharing. Based on this careful design, videos generated by DiTCtrl achieve smooth transitions and consistent object motion given multiple sequential prompts, without additional training. In addition, we present MPVBench, a new benchmark specifically designed to evaluate multi-prompt video generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance without additional training.

Key Observations


Figure 1. MM-DiT Attention Analysis


Figure 2. MM-DiT Text-to-Text and Video-to-Video Attention Visualization

First, we conduct a comprehensive analysis of the attention mechanism in the MM-DiT architecture. Our investigation reveals that MM-DiT's attention exhibits remarkable similarities to the cross-/self-attention blocks found in UNet-like diffusion models. Specifically, as illustrated in Fig. 1, the attention matrix in MM-DiT can be decomposed into four distinct regions: text-to-text, video-to-video, text-to-video, and video-to-text attention. Taking the prompt "a cat watch a black mouse" as an example, we observe that each text token exhibits distinct activation patterns when analyzing the attention values averaged over the text-to-video and video-to-text regions. Furthermore, as shown in Fig. 2, our visualization of text-to-video and video-to-text attention patterns reveals that MM-DiT not only offers functionality similar to UNet's self-attention but also demonstrates enhanced temporal modeling. This indicates that MM-DiT's attention mechanism can be readily extended to multi-prompt video generation, enabling precise, mask-guided semantic control across different prompts through KV-sharing. It can likewise be extended to video editing tasks such as word swap and reweighting.
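To make the decomposition concrete, the sketch below (a minimal illustration, not the released implementation) shows how a full MM-DiT attention map can be split into the four regions above, assuming the joint token sequence is the concatenation [text tokens; video tokens]. All tensor shapes and helper names here are illustrative assumptions.

```python
# Minimal sketch: splitting an MM-DiT full-attention map into its four regions.
# Assumes the joint sequence is [text tokens; video tokens] of length L = T_text + T_video.
import torch

def split_mmdit_attention(attn: torch.Tensor, num_text_tokens: int):
    """attn: (heads, L, L) softmax-normalized attention over the joint sequence."""
    t = num_text_tokens
    return {
        "text_to_text":   attn[:, :t, :t],   # prompt tokens attending to each other
        "text_to_video":  attn[:, :t, t:],   # prompt tokens attending to video latents
        "video_to_text":  attn[:, t:, :t],   # video latents attending to prompt tokens
        "video_to_video": attn[:, t:, t:],   # spatio-temporal, self-attention-like part
    }

# Example: averaging the video-to-text region over heads gives, for each video token,
# a per-word response map -- the kind of signal used for token-level analysis and masks.
heads, L_text, L_video = 8, 16, 1024
attn = torch.softmax(torch.randn(heads, L_text + L_video, L_text + L_video), dim=-1)
word_response = split_mmdit_attention(attn, L_text)["video_to_text"].mean(dim=0)  # (L_video, L_text)
```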

Multi-Prompt Video Generation Pipeline


Our method synthesizes content-consistent and motion-consistent videos from multiple prompts. The first video segment is synthesized with the source text prompt \(P_{i-1}\). During the denoising process, we replace the full attention with a mask-guided KV-sharing strategy that queries video content from the source segment \(V_{i-1}\), so that we can synthesize a content-consistent video under the modified target prompt \(P_i\). Note that, for illustration, the initial latents are assumed to contain 5 frames: the first three frames generate the content of \(P_{i-1}\), and the last three frames generate the content of \(P_i\). The pink latent denotes the overlapping frame, while the blue and green latents distinguish the two prompt segments.
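The following sketch illustrates the mask-guided KV-sharing idea under stated assumptions; the function and tensor layout are hypothetical, not the authors' code. The target-prompt branch reuses keys and values cached from the source-prompt branch, and a binary mask (e.g., derived from video-to-text attention of an object token) decides, per video token, whether the query reads from the shared source KV or from its own KV.

```python
# Minimal sketch of mask-guided KV-sharing attention (hypothetical helper, assumed layout).
import torch
import torch.nn.functional as F

def kv_sharing_attention(q_tgt, k_tgt, v_tgt, k_src, v_src, fg_mask):
    """
    q_tgt, k_tgt, v_tgt: (B, H, N, D) query/key/value of the target-prompt branch.
    k_src, v_src:        (B, H, N, D) key/value cached from the source-prompt branch.
    fg_mask:             (B, N) boolean mask, True where content should be queried from the source.
    """
    out_src = F.scaled_dot_product_attention(q_tgt, k_src, v_src)  # query the source content
    out_tgt = F.scaled_dot_product_attention(q_tgt, k_tgt, v_tgt)  # regular target-branch attention
    m = fg_mask[:, None, :, None].to(out_src.dtype)                # broadcast over heads and channels
    return m * out_src + (1.0 - m) * out_tgt                       # blend per video token
```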

More Applications

1. Single-prompt longer video generation

Although our design targets the multi-prompt video generation task, our method naturally extends to single-prompt longer video generation by setting all sequential prompts to the same prompt. This shows that our method can also enhance consistency in single-prompt long video generation.

2. Video Editing -- Word Swap

By removing the latent blending strategy from DiTCtrl, we can achieve Word Swap video editing [1][2]. Specifically, we simply use the mask-guided KV-sharing strategy to share keys and values from the source prompt \(P_{source}\) branch, so that the synthesized video preserves the original composition while reflecting the content of the new prompt \(P_{target}\).

3. Video Editing -- Reweight

Similar to Prompt-to-Prompt [1], by reweighting the columns and rows corresponding to a specified token (e.g., "pink" or "snowy") in MM-DiT's text-video and video-text attention, we can also achieve reweight-style video editing; a small sketch follows.
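The sketch below is a minimal illustration of the reweighting operation under an assumed tensor layout (it is not the authors' code): the attention entries connecting video tokens with the chosen word are scaled by a user-specified weight, amplifying or suppressing that word's effect before the map is applied to the values.

```python
# Minimal sketch: reweighting the attention of a single text token (assumed layout).
import torch

def reweight_token(attn: torch.Tensor, num_text_tokens: int, token_idx: int, weight: float):
    """attn: (heads, L, L) attention map over the joint [text; video] sequence; token_idx < num_text_tokens."""
    t = num_text_tokens
    attn = attn.clone()
    attn[:, t:, token_idx] *= weight                 # video-to-text column of the target word
    attn[:, token_idx, t:] *= weight                 # text-to-video row of the target word
    return attn / attn.sum(dim=-1, keepdim=True)     # renormalize each query row
```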

Method Comparison

References

  1. Prompt-to-Prompt Image Editing with Cross Attention Control. Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. https://arxiv.org/abs/2208.01626
  2. MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing. Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, Yinqiang Zheng. https://arxiv.org/abs/2304.08465
  3. Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation. Xin Yan, Yuxuan Cai, Qiuyue Wang, Yuan Zhou, Wenhao Huang, Huan Yang. https://arxiv.org/abs/2412.01316

BibTeX

@article{cai2024ditctrl,
      title     = {DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation},
      author    = {Cai, Minghong and Cun, Xiaodong and Li, Xiaoyu and Liu, Wenze and Zhang, Zhaoyang and Zhang, Yong and Shan, Ying and Yue, Xiangyu},
      journal   = {arXiv:2412.18597},
      year      = {2024},
    }