Auto-Regressive Diffusion for Generating 3D Human-Object Interactions
Authors: Zichen Geng, Zeeshan Hayder, Wei Liu, Ajmal Saeed Mian
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the OMOMO and BEHAVE datasets demonstrate that the method outperforms current state-of-the-art techniques in both accuracy and inference speed, making ARDHOI a robust and efficient solution for text-driven HOI tasks. |
| Researcher Affiliation | Academia | 1) The University of Western Australia, 35 Stirling Highway, Perth, WA 6009, Australia; 2) Commonwealth Scientific and Industrial Research Organisation (CSIRO), Synergy Building, Black Mountain, Canberra, ACT 2601, Australia. EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the model architecture and training process in prose and uses diagrams (Figure 2) to illustrate the components, but no structured pseudocode or algorithm blocks are present. |
| Open Source Code | Yes | Code https://github.com/gengzichen/ARDHOI |
| Open Datasets | Yes | Our experiments are conducted on the OMOMO (Li, Wu, and Liu 2023), and BEHAVE (Bhatnagar et al. 2022) datasets. |
| Dataset Splits | Yes | For the OMOMO dataset, we trim the sequence to a minimum length of 60 and a maximum length of 240 frames. For the BEHAVE dataset, we follow the annotation and sequence splitting by (Peng et al. 2023). |
| Hardware Specification | No | The paper discusses inference speed in terms of FLOPs and AITS, but does not specify the exact hardware (e.g., GPU model, CPU type) used for these measurements or for training. |
| Software Dependencies | No | The paper does not explicitly mention any specific software dependencies with their version numbers (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | In cVAE, the encoder is a three-block MLP, each consisting of one fully connected layer, a SiLU activation layer, a fully connected layer, and a layer norm... The channel size of the input is 1024, and the encoded token size is 512. The ARDM consists of 27 Mamba2 layers with a hidden dimension of 512. We use an expansion factor of 2 and the state number is 32. The MLP denoiser has the same setting as the cVAE encoder. |
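The encoder layout quoted in the Experiment Setup row can be sketched in PyTorch as below. This is a minimal illustration, not the authors' implementation: the per-block layer order (Linear, SiLU, Linear, LayerNorm) and the 1024-to-512 dimensions come from the quoted text, while the exact block wiring (e.g. applying the 1024-to-512 projection in the first block) is an assumption.

```python
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    """One encoder block: fully connected -> SiLU -> fully connected -> LayerNorm."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.SiLU(),
            nn.Linear(out_dim, out_dim),
            nn.LayerNorm(out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class CVAEEncoder(nn.Module):
    """Three-block MLP encoder mapping 1024-d inputs to 512-d tokens
    (dimensions per the paper; stacking order is a hypothetical choice)."""

    def __init__(self, in_dim: int = 1024, token_dim: int = 512):
        super().__init__()
        self.blocks = nn.Sequential(
            EncoderBlock(in_dim, token_dim),
            EncoderBlock(token_dim, token_dim),
            EncoderBlock(token_dim, token_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.blocks(x)


# Sanity check: a batch of 2 sequences, 60 frames, 1024 channels
enc = CVAEEncoder()
out = enc(torch.randn(2, 60, 1024))
print(out.shape)  # torch.Size([2, 60, 512])
```

The same block structure would also serve for the MLP denoiser, which the paper says shares the cVAE encoder's settings; the 27-layer Mamba2 ARDM is omitted here since it depends on an external state-space-model implementation.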