Free-Moving Object Reconstruction and Pose Estimation with Virtual Camera

Authors: Haixin Shi, Yinlin Hu, Daniel Koguciuk, Juan-Ting Lin, Mathieu Salzmann, David Ferstl

AAAI 2025

Reproducibility Variable | Result | LLM Response

Research Type: Experimental
  "We evaluate our method on the standard HO3D dataset and a collection of egocentric RGB sequences captured with a head-mounted device. We demonstrate that our approach outperforms most methods significantly, and is on par with recent techniques that assume prior information. [...] We evaluate our method systematically in this section. [...] We first evaluate our method on the standard HO3D dataset (Hampali et al. 2020), which includes video captures of daily objects with a fixed camera."

Researcher Affiliation: Collaboration
  "Haixin Shi¹, Yinlin Hu², Daniel Koguciuk², Juan-Ting Lin², Mathieu Salzmann¹, David Ferstl² (¹EPFL, ²Magic Leap)"

Pseudocode: No
  The paper describes the approach using text, mathematical equations, and figures, but does not include any explicit pseudocode or algorithm blocks.

Open Source Code: Yes
  Project page: https://haixinshi.github.io/fmov/

Open Datasets: Yes
  "We evaluate our method on both the HO3D dataset (Hampali et al. 2020) with fixed camera and a collection of data captured using a head-mounted AR device with egocentric views."

Dataset Splits: Yes
  "We report results on the 9 sequences of HO3D as in (Hampali et al. 2023; Ye, Gupta, and Tulsiani 2022)."

Hardware Specification: Yes
  "On a typical NVIDIA V100 GPU, the training of a 100-frame sequence takes about 3 hours for initialization and 7 hours for refinement."

Software Dependencies: No
  The paper mentions specific models and optimizers, such as the ADAM optimizer (Kingma and Ba 2015) and LoFTR (Sun et al. 2021), but does not provide version numbers for software libraries, programming languages, or other tools used in the implementation.

Experiment Setup: Yes
  "During training, the learning rate warms up linearly from 0 to 5e-4 during the first 5k iterations and then follows a cosine decay schedule with alpha=0.05. For the Pose MLP, we use another ADAM optimizer with a cosine decay schedule of alpha=0.5. We randomly sample 512 rays from the input image batch. During the optimization with the guided virtual camera, we only sample 32 points along each ray for efficiency. We progressively train our model with B consecutive images as a group. For every group, we train the networks for a fixed number of training steps (typically 1K). We sample 20% of the rays from images within previously-converged groups and 80% from the images within the newly added group. We train the networks for 150K training steps for refinement."