CustomTTT: Motion and Appearance Customized Video Generation via Test-Time Training
Authors: Xiuli Bi, Jian Lu, Bo Liu, Xiaodong Cun, Yong Zhang, Weisheng Li, Bin Xiao
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct detailed experiments to verify the effectiveness of the proposed methods. Our method outperforms several state-of-the-art works in both qualitative and quantitative evaluations. We conduct the experiments over various motions and the subject reference. Compared with state-of-the-art methods, our method shows much better visual quality and text-video alignment under the multi-customization settings. The experiments show the advantage of the proposed method over current methods. In Sec. 3.2, we have analyzed the effectiveness of the prompt replacement. Here, we show that training LoRAs on the produced layers improves the performance. We also conduct user studies to show the effectiveness of the proposed methods. |
| Researcher Affiliation | Collaboration | 1 School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China; 2 GVC Lab, Great Bay University, Dongguan, China; 3 Meituan, China; 4 Jinan Inspur Data Technology Co., Ltd., Jinan, China |
| Pseudocode | No | The paper describes the methodology in detail but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the described methodology, nor does it provide a link to a code repository. The paper mentions other open-source projects but not its own implementation. |
| Open Datasets | Yes | Following previous methods (Kumari et al. 2023; Wei et al. 2024), for appearance customization, we collect a total of 13 objects from DreamBooth (Ruiz et al. 2023) and Custom Diffusion (Kumari et al. 2023), including pets, vehicles, toys, etc. For motion customization, we source 18 sets of videos from UCF101 (Soomro, Zamir, and Shah 2012), the UCF Sports Action dataset (Soomro and Zamir 2015), and the DAVIS dataset (Pont-Tuset et al. 2017). |
| Dataset Splits | No | The paper mentions the datasets used and the number of objects/videos collected for customization but does not specify how these datasets were split into training, validation, or test sets for the experiments. |
| Hardware Specification | Yes | All experiments are performed on a single A6000 GPU. |
| Software Dependencies | No | Our approach uses AnimateDiff (Guo et al. 2023) as the base T2V model, trained with the LION optimizer (Chen et al. 2024b). During inference, we employ DDIM (Song, Meng, and Ermon 2021) sampling with 25 steps and a classifier-free guidance (Ho and Salimans 2022) scale of 9. The paper mentions software components but does not provide specific version numbers for them (e.g., Python, PyTorch, CUDA, or AnimateDiff's version). |
| Experiment Setup | Yes | The learning rate for the spatial LoRA is set at 1e-5, while the temporal LoRA is trained with a learning rate of 5e-5. Both types of LoRA are trained for 500 steps with a rank set to 32. In the test-time training phase, we set the learning rate to 1e-6 and train for 30 steps. During inference, we employ DDIM (Song, Meng, and Ermon 2021) sampling with 25 steps and a classifier-free guidance (Ho and Salimans 2022) scale of 9. We generate 16-frame videos at a resolution of 256×256 and 8 fps. |
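The hyperparameters reported in the experiment-setup row can be collected into a small configuration sketch. Since no official code is released, this is purely illustrative: every name below (`SPATIAL_LORA`, `INFERENCE`, `video_duration_seconds`, etc.) is a hypothetical label for a value stated in the paper, not part of the authors' implementation.

```python
# Hypothetical configuration sketch of CustomTTT's reported hyperparameters.
# All identifiers are illustrative; only the numeric values come from the paper.

SPATIAL_LORA = {"learning_rate": 1e-5, "train_steps": 500, "rank": 32}
TEMPORAL_LORA = {"learning_rate": 5e-5, "train_steps": 500, "rank": 32}

# Test-time training phase: a short, low-learning-rate adaptation pass.
TEST_TIME_TRAINING = {"learning_rate": 1e-6, "train_steps": 30}

INFERENCE = {
    "sampler": "DDIM",
    "sampling_steps": 25,
    "guidance_scale": 9.0,     # classifier-free guidance
    "num_frames": 16,
    "resolution": (256, 256),  # height, width
    "fps": 8,
}

def video_duration_seconds(cfg: dict) -> float:
    """Clip length implied by the frame count and frame rate."""
    return cfg["num_frames"] / cfg["fps"]

print(video_duration_seconds(INFERENCE))  # 16 frames at 8 fps -> 2.0
```

At these settings, each generated clip is two seconds long; the asymmetric learning rates (1e-5 spatial vs. 5e-5 temporal) reflect that the temporal LoRA must fit the reference motion while the spatial LoRA only lightly adapts appearance.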