Joint Diffusion for Universal Hand-Object Grasp Generation

Authors: Jinkun Cao, Jingyuan Liu, Kris Kitani, Yi Zhou

TMLR 2025

Reproducibility Variable: Research Type
Result: Experimental
LLM Response: According to both qualitative and quantitative experiments, both conditional and unconditional generation of hand grasps achieve good visual plausibility and diversity.

Reproducibility Variable: Researcher Affiliation
Result: Collaboration
LLM Response: Jinkun Cao* (Carnegie Mellon University), Jingyuan Liu (Adobe), Kris Kitani (Carnegie Mellon University), Yi Zhou* (Roblox)

Reproducibility Variable: Pseudocode
Result: No
LLM Response: The paper describes the methodology using mathematical equations and prose, but does not contain any explicitly labeled pseudocode or algorithm blocks.

Reproducibility Variable: Open Source Code
Result: No
LLM Response: The text discusses using Adobe Firefly as an existing tool for image generation and mentions other third-party tools and models, but does not provide access information (e.g., a repository link or an explicit release statement) for the authors' own method described in the paper.

Reproducibility Variable: Open Datasets
Result: Yes
LLM Response: Datasets. We combine the data from multiple resources to train the model. GRAB (Taheri et al., 2020) contains human full-body poses together with 3D objects. For OakInk (Yang et al., 2022), we use the official training split for training. We also use the contact-adapted synthetic grasps from the OakInk-Shape dataset for training. Besides the hand-object interaction data, we also leverage the rich resources of 3D object data to help train the object part in our model. [...] LION learns from a much larger basis, i.e., more than 50,000 objects in ShapeNet (Chang et al., 2015). [...] We hold the objects from the OakInk-Shape test set and the ARCTIC (Fan et al., 2023) dataset for quantitative evaluations.

Reproducibility Variable: Dataset Splits
Result: Yes
LLM Response: We use the official training split for training. We hold the objects from the OakInk-Shape test set and the ARCTIC (Fan et al., 2023) dataset for quantitative evaluations. We train the models on the GRAB and OakInk-Shape train splits. The results are shown in Table 1.

Reproducibility Variable: Hardware Specification
Result: No
LLM Response: The paper mentions running experiments and training models but does not specify any hardware details such as GPU models, CPU types, or memory used.

Reproducibility Variable: Software Dependencies
Result: No
LLM Response: The paper mentions various software components and models (e.g., MANO, LION, MDM, Adobe Firefly) but does not provide specific version numbers for any of them. For example, it says "we follow MDM (Tevet et al., 2022) to use a transformer encoder-only backbone-based diffusion network", but not "MDM vX.Y".

Reproducibility Variable: Experiment Setup
Result: Yes
LLM Response: For the encoder network that transforms modality features into latent codes, we always use 2-layer MLP networks with a hidden dimension of 1024 and an output dimension of 512. For the denoiser, we follow MDM (Tevet et al., 2022) to use a transformer encoder-only backbone-based diffusion network. In practice, we combine these two training objectives in a 1:1 ratio for a single draw of a training data batch.
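The encoder setup quoted above (a 2-layer MLP with a 1024-d hidden layer and a 512-d output, plus two training objectives combined in a 1:1 ratio per batch) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' code: the input dimension, the ReLU activation, the bias terms, and the loss names are all assumptions not stated in the quoted text.

```python
import numpy as np


class MLPEncoder:
    """2-layer MLP mapping modality features to a latent code.

    Dimensions follow the paper's description (hidden 1024, output 512);
    the input dimension and ReLU activation are assumptions.
    """

    def __init__(self, d_in, d_hidden=1024, d_out=512, seed=0):
        rng = np.random.default_rng(seed)
        # Simple scaled-Gaussian initialization (an assumption).
        self.w1 = rng.standard_normal((d_in, d_hidden)) * d_in ** -0.5
        self.b1 = np.zeros(d_hidden)
        self.w2 = rng.standard_normal((d_hidden, d_out)) * d_hidden ** -0.5
        self.b2 = np.zeros(d_out)

    def __call__(self, x):
        h = np.maximum(x @ self.w1 + self.b1, 0.0)  # hidden layer + ReLU
        return h @ self.w2 + self.b2                # 512-d latent code


def combined_loss(loss_a, loss_b):
    """Combine the two training objectives in a 1:1 ratio, as described."""
    return 1.0 * loss_a + 1.0 * loss_b


# Example: encode a batch of 4 feature vectors. The feature size
# (778 * 3, i.e. flattened MANO-like vertices) is purely illustrative.
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 778 * 3))
encoder = MLPEncoder(d_in=778 * 3)
z = encoder(x)
print(z.shape)  # (4, 512)
```

The point of the sketch is only to make the stated hyperparameters concrete; in the paper these latent codes feed a transformer encoder-only diffusion denoiser following MDM (Tevet et al., 2022), which is not reproduced here.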