Think Then React: Towards Unconstrained Action-to-Reaction Motion Generation
Authors: Wenhui Tan, Boyuan Li, Chuhao Jin, Wenbing Huang, Xiting Wang, Ruihua Song
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that TTR outperforms existing baselines, achieving significant improvements in evaluation metrics, such as reducing FID from 3.988 to 1.942. ...We evaluate our proposed method with strong baselines and further analyze contributions of different components, and the impact of key parameters. ...We conduct an experiment to change the downsampling parameter frame rate and calculate the difference between taking ground-truth action and random action as the input of M, in terms of summed ranking scores (Top-1, Top-2, Top-3 and Acc.). |
| Researcher Affiliation | Academia | Wenhui Tan, Boyuan Li, Chuhao Jin, Wenbing Huang, Xiting Wang & Ruihua Song Gaoling School of Artificial Intelligence Renmin University of China Beijing, China EMAIL |
| Pseudocode | No | The paper describes methods and processes in paragraph form and through diagrams (Figure 1 and Figure 2), but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project page: https://Think-Then-React.github.io/. |
| Open Datasets | Yes | Dataset. We evaluate all the methods on the Inter-X dataset, which consists of about 9K training samples and 1,708 test samples. Each sample is an action-reaction sequence with three corresponding textual descriptions. As supplementation, we mix our pre-training data with the single-person motion-text dataset HumanML3D (Guo et al., 2022a), which consists of more than 23K annotated motion sequences. |
| Dataset Splits | Yes | Dataset. We evaluate all the methods on the Inter-X dataset, which consists of about 9K training samples and 1,708 test samples. ...We evaluate each method 20 times with different seeds to calculate the final results at a 95% confidence interval. |
| Hardware Specification | Yes | Both the pre-training and fine-tuning phases are trained on a single machine with 8 Tesla V100 GPUs. ...The motion VQ-VAE is trained for 150K steps with batch size set to 256 and learning rate fixed at 1e-4 on a single Tesla V100 GPU. |
| Software Dependencies | No | For the LLM, we adopt Flan-T5-base (Chung et al., 2024; Raffel et al., 2020) as our base model, with extended vocabulary. ...We use the text embedding layer from clip-vit-large-patch14 (Radford et al., 2021), which is frozen during training. |
| Experiment Setup | Yes | We warm up the learning rate for 1,000 steps, peaking at 1e-4 for the pre-training phase, and use the same learning rate for fine-tuning. Both the pre-training and fine-tuning phases are trained on a single machine with 8 Tesla V100 GPUs. The training batch size is set to 32 for the LLM, and we monitor the validation loss and reaction generation metrics for early stopping, resulting in about 100K pre-training steps and 40K fine-tuning steps. We set the re-thinking interval Nr to 4 tokens and divide each space signal into Nb = 10 bins. ...The motion VQ-VAE is trained for 150K steps with batch size set to 256 and learning rate fixed at 1e-4 on a single Tesla V100 GPU. ...We train the model on both the Inter-X and HumanML3D datasets for 200,000 steps, with batch size set to 256 and learning rate set to 1e-4. We apply an L1 loss on both pose-feature and velocity reconstruction, and a commitment loss for the embedding process. The weight for the velocity loss is 0.5 and for the commitment loss 0.02. |
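The reported setup warms the learning rate up over 1,000 steps to a peak of 1e-4. A minimal sketch of such a schedule, assuming linear warm-up and (since the paper does not specify a decay) a constant rate afterwards; the function name is illustrative, not from the authors' code:

```python
def warmup_lr(step: int, peak_lr: float = 1e-4, warmup_steps: int = 1000) -> float:
    """Linearly ramp the learning rate to peak_lr over warmup_steps, then hold it."""
    if step < warmup_steps:
        # Fraction of warm-up completed; step is 0-indexed.
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr
```

In practice this could be wrapped in a framework scheduler (e.g. a per-step LR lambda); the sketch only fixes the shape implied by the quoted hyperparameters.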
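The VQ-VAE objective quoted above combines L1 reconstruction losses on pose features and velocities with a commitment loss, weighted 0.5 and 0.02 respectively. A minimal sketch of that combination, assuming the weights apply as stated; all names here are hypothetical, not the authors' implementation:

```python
def l1(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def vqvae_loss(pred_pose, gt_pose, pred_vel, gt_vel, commit_loss,
               w_vel=0.5, w_commit=0.02):
    """Total loss = L1(pose) + 0.5 * L1(velocity) + 0.02 * commitment."""
    return (l1(pred_pose, gt_pose)
            + w_vel * l1(pred_vel, gt_vel)
            + w_commit * commit_loss)
```

The commitment term is taken as precomputed (as in a standard VQ-VAE, it would be the distance between encoder outputs and their assigned codebook entries).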