First, we conduct a comprehensive analysis of the attention mechanism in the MM-DiT architecture. Our investigation reveals that MM-DiT's attention exhibits remarkable similarities to the cross/self-attention blocks found in UNet-like diffusion models. Specifically, as illustrated in Fig. 1, the attention matrix in MM-DiT can be decomposed into four distinct regions: text-to-text, video-to-video, text-to-video, and video-to-text attention. Taking the prompt "a cat watch a black mouse" as an example, we observe that each text token exhibits distinct activation patterns in the attention values averaged over the text-to-video and video-to-text regions. Furthermore, as shown in Fig. 2, our visualization of the text-to-video and video-to-text attention patterns reveals that MM-DiT not only provides functionality similar to UNet's self-attention but also demonstrates enhanced capabilities in temporal modeling. This indicates that the attention mechanism of MM-DiT can be readily extended to the multi-prompt video generation task, enabling precise mask-guided semantic control across different prompts via KV-sharing. It can likewise be extended to video editing tasks such as word swap and attention reweighting.
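Below is a minimal sketch of this decomposition (not the released DiTCtrl code; the tensor layout with text tokens first, the `n_text` split, and the per-token averaging are our illustrative assumptions):

```python
import torch

def decompose_mmdit_attention(q, k, n_text):
    """Split the joint MM-DiT attention map into its four regions.

    q, k: (batch, heads, n_text + n_video, head_dim) -- concatenated
          text and video tokens, with text tokens placed first.
    n_text: number of text tokens in the joint sequence.
    """
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)

    regions = {
        "text2text":   attn[..., :n_text, :n_text],
        "text2video":  attn[..., :n_text, n_text:],
        "video2text":  attn[..., n_text:, :n_text],
        "video2video": attn[..., n_text:, n_text:],
    }
    # Per-token response: average the text-to-video and video-to-text maps
    # over the video axis to obtain one activation value per text token.
    token_response = 0.5 * (regions["text2video"].mean(dim=-1)
                            + regions["video2text"].mean(dim=-2))
    return regions, token_response
```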
Our method synthesizes content-consistent and motion-consistent videos from multiple prompts. The first video is synthesized with the source text prompt \(P_{i-1}\). During the denoising process for the next video, we replace full attention with a mask-guided KV-sharing strategy that queries video content from the source video \(V_{i-1}\), so that we can synthesize a content-consistent video under the modified target prompt \(P_i\). In this illustration, the initial latent is assumed to contain five frames: the first three frames generate the content of \(P_{i-1}\), and the last three frames generate the content of \(P_i\). The pink latent denotes the overlapping frame, while the blue and green latents distinguish the two prompt segments.
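The following is a minimal sketch of the mask-guided KV-sharing step inside one attention layer, under our own assumptions (the function name, the precomputed foreground mask `mask_fg`, and the use of PyTorch's `scaled_dot_product_attention` are illustrative choices, not the released implementation):

```python
import torch
import torch.nn.functional as F

def masked_kv_sharing_attn(q_tgt, k_tgt, v_tgt, k_src, v_src, mask_fg):
    """Attend the target branch against both its own and the source KV.

    q_tgt, k_tgt, v_tgt: (batch, heads, n_tokens, head_dim) of the target
                         prompt branch (P_i).
    k_src, v_src:        same shapes, cached from the source branch (P_{i-1}).
    mask_fg:             (n_tokens,) bool, True where a video token belongs to
                         the foreground subject that should stay consistent.
    """
    out_src = F.scaled_dot_product_attention(q_tgt, k_src, v_src)  # shared KV
    out_tgt = F.scaled_dot_product_attention(q_tgt, k_tgt, v_tgt)  # own KV
    # Foreground tokens reuse source content; background follows the new prompt.
    m = mask_fg.view(1, 1, -1, 1).to(out_src.dtype)
    return m * out_src + (1.0 - m) * out_tgt
```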
Although our design targets the multi-prompt video generation task, our method naturally handles single-prompt longer video generation by setting all sequential prompts to the same text. This shows that our method can also enhance the consistency of single-prompt long video generation.
A white SUV drives on a steep dirt road... (665 frames)
fish in ocean (241 frames)
hot air balloon (241 frames)
By removing the latent blending strategy from our approach, DiTCtrl can achieve word-swap video editing [1][2]. Specifically, we only use the mask-guided KV-sharing strategy to share keys and values from the source prompt \(P_{source}\) branch, so that the synthesized video preserves the original composition while reflecting the content of the new prompt \(P_{target}\).
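A minimal word-swap sketch is given below (the function name and the `share_layers` selection are hypothetical; only the idea of borrowing the source branch's keys/values is taken from the description above):

```python
import torch
import torch.nn.functional as F

def word_swap_step(q_tgt, k_tgt, v_tgt, k_src, v_src, share_layers, layer_idx):
    """On selected layers the target branch (swapped word) attends to the
    source branch's cached keys/values, inheriting the original composition;
    on the remaining layers it uses its own KV so the new word takes effect."""
    if layer_idx in share_layers:
        return F.scaled_dot_product_attention(q_tgt, k_src, v_src)
    return F.scaled_dot_product_attention(q_tgt, k_tgt, v_tgt)
```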
Similar to Prompt-to-Prompt [1], by reweighting the specific columns and rows corresponding to a specified token (e.g., "pink" or "snowy") in MM-DiT's text-to-video and video-to-text attention regions, we can also achieve reweighting-based video editing.
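A minimal reweighting sketch follows, assuming the same joint attention layout as above (text tokens first); the function name and re-normalization detail are our own illustrative choices:

```python
import torch

def reweight_token_attention(attn, token_idx, n_text, scale):
    """attn: (batch, heads, n_text + n_video, n_text + n_video) joint map.
    token_idx: index of the text token to amplify or suppress.
    scale: > 1 strengthens the concept, < 1 weakens it."""
    attn = attn.clone()
    # Text-to-video region: the token's row over all video columns.
    attn[..., token_idx, n_text:] *= scale
    # Video-to-text region: all video rows at the token's column.
    attn[..., n_text:, token_idx] *= scale
    # Re-normalize so each query row still sums to one.
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)
    return attn
```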
@article{cai2024ditctrl,
title = {DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation},
author = {Cai, Minghong and Cun, Xiaodong and Li, Xiaoyu and Liu, Wenze and Zhang, Zhaoyang and Zhang, Yong and Shan, Ying and Yue, Xiangyu},
journal = {arXiv:2412.18597},
year = {2024},
}