CustomTTT: Motion and Appearance Customized Video Generation via Test-Time Training
Authors: Xiuli Bi, Jian Lu, Bo Liu, Xiaodong Cun, Yong Zhang, Weisheng Li, Bin Xiao
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct detailed experiments to verify the effectiveness of the proposed methods. Our method outperforms several state-of-the-art works in both qualitative and quantitative evaluations. We conduct the experiments over various motions and the subject reference. Compared with state-of-the-art methods, our method shows much better visual quality and text-video alignment under the multi-customization settings. The experiments show the advantage of the proposed method over current methods. In Sec. 3.2, we have analyzed the effectiveness of the prompt replacement. Here, we show that training LoRAs on the produced layers improves the performance. We also conduct user studies to show the effectiveness of the proposed methods. |
| Researcher Affiliation | Collaboration | 1 School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China; 2 GVC Lab, Great Bay University, Dongguan, China; 3 Meituan, China; 4 Jinan Inspur Data Technology Co., Ltd., Jinan, China |
| Pseudocode | No | The paper describes the methodology in detail but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the described methodology, nor does it provide a link to a code repository. The paper mentions other open-source projects but not its own implementation. |
| Open Datasets | Yes | Following previous methods (Kumari et al. 2023; Wei et al. 2024), for appearance customization, we collect a total of 13 objects from DreamBooth (Ruiz et al. 2023) and Custom Diffusion (Kumari et al. 2023), including pets, vehicles, toys, etc. For motion customization, we source 18 sets of videos from UCF101 (Soomro, Zamir, and Shah 2012), the UCF Sports Action dataset (Soomro and Zamir 2015), and the DAVIS dataset (Pont-Tuset et al. 2017). |
| Dataset Splits | No | The paper mentions the datasets used and the number of objects/videos collected for customization but does not specify how these datasets were split into training, validation, or test sets for the experiments. |
| Hardware Specification | Yes | All experiments are performed on a single A6000 GPU. |
| Software Dependencies | No | Our approach uses AnimateDiff (Guo et al. 2023) as the base T2V model, trained with the LION optimizer (Chen et al. 2024b). During inference, we employ DDIM (Song, Meng, and Ermon 2021) sampling with 25 steps and a classifier-free guidance (Ho and Salimans 2022) scale of 9. The paper mentions software components but does not provide specific version numbers for them (e.g., Python, PyTorch, CUDA, or AnimateDiff's version). |
| Experiment Setup | Yes | The learning rate for the spatial LoRA is set at 1e-5, while the temporal LoRA is trained with a learning rate of 5e-5. Both types of LoRA are trained for 500 steps with a rank set to 32. In the test-time training phase, we set the learning rate to 1e-6 and train for 30 steps. During inference, we employ DDIM (Song, Meng, and Ermon 2021) sampling with 25 steps and a classifier-free guidance (Ho and Salimans 2022) scale of 9. We generate 16-frame videos at a resolution of 256×256 and 8 fps. |
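The hyperparameters reported in the experiment-setup row can be collected into a small configuration sketch. Since no official code is released, this is purely illustrative: every name below (`SPATIAL_LORA`, `INFERENCE`, `video_duration_seconds`, etc.) is a hypothetical label for a value stated in the paper, not part of the authors' implementation.

```python
# Hypothetical configuration sketch of CustomTTT's reported hyperparameters.
# All identifiers are illustrative; only the numeric values come from the paper.

SPATIAL_LORA = {"learning_rate": 1e-5, "train_steps": 500, "rank": 32}
TEMPORAL_LORA = {"learning_rate": 5e-5, "train_steps": 500, "rank": 32}

# Test-time training phase: a short, low-learning-rate adaptation pass.
TEST_TIME_TRAINING = {"learning_rate": 1e-6, "train_steps": 30}

INFERENCE = {
    "sampler": "DDIM",
    "sampling_steps": 25,
    "guidance_scale": 9.0,     # classifier-free guidance
    "num_frames": 16,
    "resolution": (256, 256),  # height, width
    "fps": 8,
}

def video_duration_seconds(cfg: dict) -> float:
    """Clip length implied by the frame count and frame rate."""
    return cfg["num_frames"] / cfg["fps"]

print(video_duration_seconds(INFERENCE))  # 16 frames at 8 fps -> 2.0
```

At these settings, each generated clip is two seconds long; the asymmetric learning rates (1e-5 spatial vs. 5e-5 temporal) reflect that the temporal LoRA must fit the reference motion while the spatial LoRA only lightly adapts appearance.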