Auto-Regressive Diffusion for Generating 3D Human-Object Interactions

Authors: Zichen Geng, Zeeshan Hayder, Wei Liu, Ajmal Saeed Mian

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our model has been evaluated on the OMOMO and BEHAVE datasets, where it outperforms existing state-of-the-art methods in terms of both performance and inference speed. This makes ARDHOI a robust and efficient solution for text-driven HOI tasks."
Researcher Affiliation | Academia | "1The University of Western Australia, 35 Stirling Highway, Perth, WA 6009, Australia; 2Commonwealth Scientific and Industrial Research Organisation, Synergy Building, Black Mountain, Canberra, ACT 2601, Australia. EMAIL, EMAIL, EMAIL, EMAIL"
Pseudocode | No | The paper describes the model architecture and training process in prose and uses diagrams (Figure 2) to illustrate the components, but no structured pseudocode or algorithm blocks are present.
Open Source Code | Yes | Code: https://github.com/gengzichen/ARDHOI
Open Datasets | Yes | "Our experiments are conducted on the OMOMO (Li, Wu, and Liu 2023) and BEHAVE (Bhatnagar et al. 2022) datasets."
Dataset Splits | Yes | "For the OMOMO dataset, we trim the sequence to a minimum length of 60 and a maximum length of 240 frames. For the BEHAVE dataset, we follow the annotation and sequence splitting by (Peng et al. 2023)."
Hardware Specification | No | The paper discusses inference speed in terms of FLOPs and AITS, but does not specify the exact hardware (e.g., GPU model, CPU type) used for these measurements or for training.
Software Dependencies | No | The paper does not explicitly mention any specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | "In cVAE, the encoder is a three-block MLP, each consisting of one fully connected layer, a SiLU activation layer, a fully connected layer, and a layer norm... The channel size of the input is 1024, and the encoded token size is 512. The ARDM consists of 27 Mamba2 layers with a hidden dimension of 512. We use an expansion factor of 2 and the state number is 32. The MLP denoiser has the same setting as the cVAE encoder."
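The sequence trimming described in the Dataset Splits row (minimum 60 frames, maximum 240) could be implemented along these lines. This is a hypothetical sketch: the paper states only the length bounds, so whether over-length sequences are cropped, windowed, or subsampled is an assumption, and the function name is ours.

```python
def trim_sequences(sequences, min_len=60, max_len=240):
    """Drop sequences shorter than min_len; crop longer ones to max_len.

    Hypothetical helper: the paper gives only the length bounds, not the
    exact trimming strategy (cropping is assumed here).
    """
    out = []
    for seq in sequences:
        if len(seq) < min_len:
            continue  # too short to keep
        out.append(seq[:max_len])  # crop to the maximum length
    return out

# Example: motion sequences of 30, 100, and 300 frames
seqs = [list(range(n)) for n in (30, 100, 300)]
trimmed = trim_sequences(seqs)
print([len(s) for s in trimmed])  # [100, 240]
```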
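The cVAE encoder quoted in the Experiment Setup row (three blocks of Linear → SiLU → Linear → LayerNorm, mapping 1024-channel inputs to 512-dimensional tokens) can be sketched as follows. This is a NumPy shape-level sketch, not the authors' implementation: weight initialization, the placement of the 1024→512 projection in the first block, and the absence of residual connections are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def silu(x):
    # SiLU (swish) activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-5):
    # Normalize over the last (feature) dimension
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def make_block(d_in, d_out):
    # One encoder block: Linear -> SiLU -> Linear -> LayerNorm
    w1 = rng.normal(0.0, 0.02, (d_in, d_out))
    b1 = np.zeros(d_out)
    w2 = rng.normal(0.0, 0.02, (d_out, d_out))
    b2 = np.zeros(d_out)
    def block(x):
        h = silu(x @ w1 + b1)
        return layer_norm(h @ w2 + b2)
    return block

# Three-block MLP encoder: 1024-channel input -> 512-dim encoded token
# (assumed: the first block performs the down-projection)
blocks = [make_block(1024, 512), make_block(512, 512), make_block(512, 512)]

x = rng.normal(size=(4, 1024))  # batch of 4 input tokens
h = x
for blk in blocks:
    h = blk(h)
print(h.shape)  # (4, 512)
```

The LayerNorm at the end of each block keeps activations on a stable scale, which is why the output features have near-zero mean per token regardless of the random weights used here.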