Riemann-based Multi-scale Attention Reasoning Network for Text-3D Retrieval
Authors: Wenrui Li, Wei Han, Yandu Chen, Yeyu Chai, Yidan Lu, Xingtao Wang, Xiaopeng Fan
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments We conducted comparative experiments on the T3DR-HIT dataset, utilizing different text and point cloud feature extractors while keeping the retrieval framework unchanged. The experimental results demonstrated the superior retrieval performance of our model. Table 1 summarizes the performance of various models on the T3DR-HIT dataset, including their respective hyperparameter configurations. |
| Researcher Affiliation | Academia | Wenrui Li¹, Wei Han¹, Yandu Chen¹, Yeyu Chai¹, Yidan Lu¹, Xingtao Wang¹²*, Xiaopeng Fan¹²³. ¹Harbin Institute of Technology; ²Harbin Institute of Technology Suzhou Research Institute; ³Peng Cheng Laboratory |
| Pseudocode | No | The paper describes the components of RMARN using mathematical formulations and textual explanations (e.g., equations for Attention, FFN, similarity calculation) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/liwrui/RMARN |
| Open Datasets | Yes | In this paper, to address the scarcity of paired text-3D data, we developed a large-scale, high-quality open-source dataset named T3DR-HIT, containing over 3,380 pairs of text and point cloud data. The dataset comprises two main parts: one part contains coarse-grained alignments between indoor 3D scenes and text, consisting of 1,380 text-3D pairs; the other part contains fine-grained alignments between Chinese cultural heritage scenes and text, with over 2,000 text-3D pairs. The release of the T3DR-HIT dataset provides robust support for multi-scale text-3D retrieval tasks. ... Building on the open-source Stanford 2D-3D-Semantics Dataset, we developed the Indoor Text Point Pairs dataset... |
| Dataset Splits | No | The paper describes the composition of the T3DR-HIT dataset, including its division into 'coarse-grained Indoor 3D Scenes' and 'fine-grained Chinese Artifact Scenes', and the total number of pairs. However, it does not provide specific details on how these datasets are split into training, validation, and test sets for experimentation. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for conducting the experiments. |
| Software Dependencies | No | The paper mentions several software components, models, or functions such as CLIP text encoder, Point Net, Adam optimizer (with beta values), GELU activation function (with parameters), dropout rate, Open3D, and LLaVA (v1.6-mistral-7b-hf). While some parameters are provided for the optimizer and activation, and a specific LLaVA model version is named, the paper does not list multiple key software libraries or frameworks with their specific version numbers that are critical for reproducing the RMARN model's implementation. |
| Experiment Setup | Yes | We trained the model for 100 epochs, utilizing the Adam optimizer, which is well-regarded for its ability to adapt learning rates during training. The learning rate was set to 0.008, providing a balance between making steady progress and avoiding potential overshooting of minima. The β values for the Adam optimizer were configured as (0.91, 0.9993). ... For the activation function, we utilized GELU (Gaussian Error Linear Unit) with the parameters 0.5 and 0.044715... A dropout rate of 0.1 is applied... Both the Attention layer and the Feed-Forward Network (FFN) in the self-attention encoder are configured with a dimensionality of 512. ... Table 1 also shows hyperparameters for the best-performing model: Low Rank (256), Epochs (100), Batch Size (64), Nhead (32), SA Layer (8). |
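The reported setup can be summarized in a short sketch. The constants 0.5 and 0.044715 quoted for GELU match the standard tanh approximation of GELU, so that formula is shown here as an assumption; the dictionary key names are illustrative, not identifiers from the released RMARN code at https://github.com/liwrui/RMARN.

```python
import math


def gelu_tanh_approx(x: float) -> float:
    """Tanh approximation of GELU, whose constants 0.5 and 0.044715
    match the parameters quoted in the paper's experiment setup:
    0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))."""
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


# Training hyperparameters as reported in the table above.
# Key names are illustrative assumptions, not from the released code.
TRAIN_CONFIG = {
    "optimizer": "Adam",
    "betas": (0.91, 0.9993),   # Adam beta values from the paper
    "learning_rate": 0.008,
    "dropout": 0.1,
    "hidden_dim": 512,         # Attention and FFN dimensionality
    "low_rank": 256,
    "epochs": 100,
    "batch_size": 64,
    "nhead": 32,
    "sa_layers": 8,
}
```

As a sanity check, this approximation satisfies gelu(0) = 0 and approaches the identity for large positive inputs, consistent with GELU's usual behavior.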