Single-view Image to Novel-view Generation for Hand-Object Interactions

Authors: Zhongqun Zhang, Yihua Cheng, Eduardo Pérez-Pellitero, Yiren Zhou, Jiankang Deng, Hyung Jin Chang, Jifei Song

AAAI 2025

Reproducibility
Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on the HO3D and DexYCB datasets demonstrate that our method significantly outperforms state-of-the-art novel-view synthesis for hand-object interactions. Datasets: We conduct experiments on two real-world datasets: HO3D (Hampali et al. 2020) and DexYCB (Chao et al. 2021). Evaluation Metrics: To evaluate novel-view synthesis quality, we report three metrics computed against the ground-truth images: PSNR, SSIM (Wang et al. 2004), and LPIPS (Zhang et al. 2018).
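Of the three reported metrics, PSNR is simple enough to illustrate directly. The sketch below (not the authors' evaluation code; SSIM and LPIPS require dedicated libraries) computes PSNR between two images whose pixel values lie in [0, max_val]:

```python
import numpy as np

def psnr(img_a: np.ndarray, img_b: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio between two images, in decibels."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# A uniform error of 0.1 gives MSE = 0.01, hence PSNR = 20 dB.
a = np.full((8, 8), 0.5)
b = a + 0.1
print(round(psnr(a, b), 2))  # 20.0
```

Higher PSNR indicates a rendered view closer to the ground-truth image.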
Researcher Affiliation | Collaboration | 1 University of Birmingham, UK; 2 Huawei Noah's Ark Lab, UK
Pseudocode | No | The paper describes its methods using mathematical formulations and descriptive text, but it does not include any clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code or provide a link to a code repository.
Open Datasets | Yes | We conduct experiments on two real-world datasets: HO3D (Hampali et al. 2020) and DexYCB (Chao et al. 2021). Both datasets capture dynamic hand-object interactions from multiple views and provide comprehensive annotations.
Dataset Splits | Yes | We follow iHOI (Ye, Gupta, and Tulsiani 2022) to split the training and testing sets. DexYCB is one of the largest real-world hand-object video datasets; we focus on right-hand samples using the official s0 split. Following gSDF (Chen et al. 2023), we use 29,656 training samples and 5,928 testing samples.
Hardware Specification | Yes | Training the diffusion model takes about 10 hours on 8 NVIDIA A100 GPUs for 30k steps. We use the same offline systems (Rong, Shiratori, and Joo 2020) as (Ye, Gupta, and Tulsiani 2022) to estimate the hand and camera poses. During testing, our method takes 2 seconds on a single A100 GPU to generate one image.
Software Dependencies | No | The paper mentions several models and frameworks, such as Zero123-XL, CLIP, and ControlNet, but it does not provide specific version numbers for any software dependencies used for implementation (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup | Yes | We use a batch size of 30 images and the AdamW (Kingma and Ba 2015) optimizer with a learning rate of 10^-4 and constant warmup scheduling. We finetune our model on the HO3D (Hampali et al. 2020) dataset, selecting 110,00 image pairs. Training the diffusion model takes about 10 hours on 8 NVIDIA A100 GPUs for 30k steps. For the 3D Gaussian Splatting part, we train for 500 steps with the SDS loss. The 3D Gaussians are initialized with an opacity of 0.1 and a grey color inside a sphere with a radius of 0.2. We sample random camera poses at a fixed radius of 1, with a y-axis field of view (FOV) of 49 degrees; the azimuth ranges from -180 to 180 degrees and the elevation from -30 to 30 degrees.