TB-HSU: Hierarchical 3D Scene Understanding with Contextual Affordances
Authors: Wenting Xu, Viorela Ila, Luping Zhou, Craig T. Jin
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our TB-HSU model by comparing it with two non-neural-network models, three baseline neural network models, and several published models. We run the comparisons using three different datasets. |
| Researcher Affiliation | Academia | Wenting Xu¹, Viorela Ila², Luping Zhou¹, Craig T. Jin¹; ¹School of Electrical and Computer Engineering, The University of Sydney; ²School of Aerospace, Mechanical and Mechatronic Engineering, The University of Sydney |
| Pseudocode | No | The paper describes the model architecture and training process, including mathematical formulas and a diagram (Figure 2), but does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | The code and dataset are publicly available: github.com/WentingXu3o3/TB-HSU |
| Open Datasets | Yes | To support this work, we create a custom 3DHSG dataset that provides ground truth data for local spatial regions with region-specific affordances, as well as object-specific affordances for each object. We employ three datasets in the evaluation experiments: the 3DHSG dataset; ScanNet (Dai et al. 2017), an RGB-D video dataset; and Matterport3D (Chang et al. 2017), an RGB-D dataset. |
| Dataset Splits | Yes | 3DHSG: We split our custom 3DHSG dataset into 96 scenes (80%) for training and 24 scenes (20%) for testing. We exclude object labels such as wall, floor, and ceiling because these regions are not annotated. ScanNet (Dai et al. 2017): ScanNet is an RGB-D video dataset containing 1513 scans annotated with instance-level semantic segmentations. We split it into 1013 scenes for training and 500 scenes for testing, the same as (Huang, Usvyatsov, and Schindler 2020). Matterport3D (Chang et al. 2017): Matterport3D is an RGB-D dataset consisting of 90 reconstructions of indoor building-scale scenes with 2194 rooms. There are 30 room types in the dataset. We split the dataset in the same way as the benchmark and discard rooms that contain fewer than 3 objects, the same as (Hughes, Chang, and Carlone 2022). |
| Hardware Specification | Yes | Note that all network models are trained with the SGD optimizer, with a base learning rate of 1×10⁻³ (except for the TB-HSU model trained on ScanNet20, which uses a base learning rate of 1×10⁻⁴), on a single NVIDIA GeForce GTX 3070 within 500 training epochs (except for the TB-HSU model trained on Matterport3D, which trains within 30 epochs). |
| Software Dependencies | No | The paper mentions several software tools and models such as GPT, LLaMA, CLIP, ViT, and BERT, but does not provide specific version numbers for any of these dependencies, which would be needed to replicate the experiment. |
| Experiment Setup | Yes | Note that all network models are trained with the SGD optimizer, with a base learning rate of 1×10⁻³ (except for the TB-HSU model trained on ScanNet20, which uses a base learning rate of 1×10⁻⁴), on a single NVIDIA GeForce GTX 3070 within 500 training epochs (except for the TB-HSU model trained on Matterport3D, which trains within 30 epochs). The TB-HSU model employs 4 transformer layers with 384 dimensions across all experiments, adapting the room classification head size (12 for 3DHSG, 30 for Matterport3D, 21 for ScanNet20 and ScanNet200), the number of object label classes (191 for 3DHSG, 41 for Matterport3D, 20 for ScanNet20, and 200 for ScanNet200), and input sequence length (77 for 3DHSG, 230 for Matterport3D, 62 for ScanNet20, and 121 for ScanNet200), maintaining 7.62 ± 0.05 million parameters. |
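The per-dataset hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. This is a hedged reconstruction for a would-be replicator, not the authors' released code; names such as `BASE` and `config_for` are illustrative.

```python
# Shared training settings quoted from the paper's setup description.
BASE = {
    "optimizer": "SGD",
    "lr": 1e-3,            # base learning rate
    "epochs": 500,
    "transformer_layers": 4,
    "hidden_dim": 384,
}

# Dataset-specific values: room-classification head size, number of object
# label classes, and input sequence length, plus the two stated exceptions
# (ScanNet20 uses lr 1e-4; Matterport3D trains for only 30 epochs).
DATASET_CONFIGS = {
    "3DHSG":        {"room_classes": 12, "object_labels": 191, "seq_len": 77},
    "Matterport3D": {"room_classes": 30, "object_labels": 41,  "seq_len": 230, "epochs": 30},
    "ScanNet20":    {"room_classes": 21, "object_labels": 20,  "seq_len": 62,  "lr": 1e-4},
    "ScanNet200":   {"room_classes": 21, "object_labels": 200, "seq_len": 121},
}

def config_for(dataset: str) -> dict:
    """Merge the shared base settings with a dataset's overrides."""
    cfg = dict(BASE)
    cfg.update(DATASET_CONFIGS[dataset])
    return cfg
```

For example, `config_for("ScanNet20")` yields the base settings with the learning rate lowered to 1e-4, matching the exception noted in the paper.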