RRT-MVS: Recurrent Regularization Transformer for Multi-View Stereo
Authors: Jianfei Jiang, Liyong Wang, Haochen Yu, Tianyu Hu, Jiansheng Chen, Huimin Ma
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that RRT-MVS achieves state-of-the-art performance on the DTU and Tanks and Temples datasets. Notably, RRT-MVS ranks first on both the Tanks and Temples intermediate and advanced benchmarks among all published methods. We conducted extensive experiments to demonstrate the effectiveness and efficiency of RRT-MVS. As shown in Figure 1(a), compared with the current state-of-the-art method GoMVS (Wu et al. 2024), RRT-MVS achieves the best performance on the DTU (Aanæs et al. 2016) dataset while reducing GPU memory consumption by 72.17%. Additionally, RRT-MVS also demonstrates strong generalization capability on the Tanks and Temples (Knapitsch et al. 2017) benchmark, as illustrated in Figure 1(b). We conducted quantitative ablation experiments to validate the effectiveness of the RRT-MVS design. All ablation experiments were performed on the DTU (Aanæs et al. 2016) dataset with the same parameters, reconstructing point clouds using the normal fusion strategy (Schönberger et al. 2016). |
| Researcher Affiliation | Academia | Jianfei Jiang, Liyong Wang, Haochen Yu, Tianyu Hu, Jiansheng Chen, Huimin Ma* School of Computer and Communication Engineering, University of Science and Technology Beijing, China EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology using textual explanations and architectural diagrams (e.g., Figure 3: Pipeline of RRT-MVS, Figure 4: The detailed structure of the proposed Recurrent Regularization Transformer), but it does not include explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | Datasets The DTU (Aanæs et al. 2016) dataset includes 128 indoor scenes, each captured from 49 or 64 viewpoints under 7 different lighting conditions using a fixed camera trajectory. As described in (Yao et al. 2018), the dataset is divided into training, test, and validation sets, with a total of 27,097 training samples. The Tanks and Temples (Knapitsch et al. 2017) dataset is an extensive collection of real-world scenes divided into an intermediate subset and an advanced subset. The BlendedMVS (Yao et al. 2020) dataset is a large-scale synthetic dataset with both indoor and outdoor scenes, consisting of training and validation data. |
| Dataset Splits | Yes | As described in (Yao et al. 2018), the dataset is divided into training, test, and validation sets, with a total of 27,097 training samples. We initially trained our model on the DTU (Aanæs et al. 2016) training set using 5-view images with a resolution of 512×640. Subsequently, we fine-tuned the model on the BlendedMVS (Yao et al. 2020) dataset using 11-view images with a resolution of 576×768. Evaluation We trained our model on the DTU training set and evaluated its performance on the DTU test set using 5-view images with a resolution of 832×1152. |
| Hardware Specification | Yes | We trained our model on the DTU training set and evaluated its performance on the DTU test set using 5-view images with a resolution of 832×1152, which required 0.35s and 3.5GB of memory on an NVIDIA RTX 3090 GPU. |
| Software Dependencies | No | Our network was developed using PyTorch (Paszke et al. 2019) and employed the Adam (Kingma and Ba 2014) optimizer. While PyTorch is mentioned as the framework and Adam as the optimizer, specific version numbers for these software dependencies are not provided in the text. |
| Experiment Setup | Yes | Training Our network was developed using PyTorch (Paszke et al. 2019) and employed the Adam (Kingma and Ba 2014) optimizer. Following standard procedures, we initially trained our model on the DTU (Aanæs et al. 2016) training set using 5-view images with a resolution of 512×640. We started with a learning rate of 0.001 for 10 epochs with a batch size of 2. Subsequently, we fine-tuned the model on the BlendedMVS (Yao et al. 2020) dataset using 11-view images with a resolution of 576×768. This phase began with a learning rate of 0.001 for an additional 15 epochs with a batch size of 2. Our approach involved inverse depth sampling ranging from 425mm to 935mm, with 4 coarse-to-fine stages that incorporated both depth hypotheses and group feature correlations of 8-8-4-4. Evaluation We trained our model on the DTU training set and evaluated its performance on the DTU test set using 5-view images with a resolution of 832×1152... |
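The experiment setup quotes "inverse depth sampling ranging from 425mm to 935mm," a standard MVS technique in which depth hypotheses are spaced uniformly in inverse-depth (1/d) space rather than in depth, concentrating hypotheses at nearer depths. The sketch below illustrates the idea under that interpretation; the function name and NumPy implementation are illustrative and are not taken from the authors' code.

```python
import numpy as np

def inverse_depth_samples(d_min, d_max, num_samples):
    """Sample depth hypotheses uniformly in inverse-depth (1/d) space.

    Uniform steps in 1/d place more hypotheses at near depths, which is
    the usual motivation for inverse depth sampling in multi-view stereo.
    Note: this is a generic illustration, not RRT-MVS's implementation.
    """
    inv = np.linspace(1.0 / d_max, 1.0 / d_min, num_samples)
    return 1.0 / inv

# Example: 8 hypotheses over the DTU depth range quoted above (425mm-935mm),
# matching the coarsest stage's 8 depth hypotheses.
samples = inverse_depth_samples(425.0, 935.0, 8)
```

The returned hypotheses run from the far bound (935mm) down to the near bound (425mm), with spacing that tightens toward the camera.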