CityAnchor: City-scale 3D Visual Grounding with Multi-modality LLMs
Authors: Jinpeng Li, Haiping Wang, Jiabin Chen, Yuan Liu, Zhiyang Dou, Yuexin Ma, Sibei Yang, Yuan Li, Wenping Wang, Zhen Dong, Bisheng Yang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on the CityRefer dataset and a new synthetic dataset annotated by us, both of which demonstrate our method can produce accurate 3D visual grounding on a city-scale 3D point cloud. The source code is available at https://github.com/WHU-USI3DV/CityAnchor. |
| Researcher Affiliation | Academia | 1 LISMARS, Wuhan University; 2 Hong Kong University of Science and Technology; 3 University of Pennsylvania; 4 ShanghaiTech University; 5 Sun Yat-Sen University; 6 Texas A&M University |
| Pseudocode | No | The paper describes its methodology in natural language text and illustrates it with architectural diagrams (e.g., Figure 2 and Figure 3), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code is available at https://github.com/WHU-USI3DV/CityAnchor. |
| Open Datasets | Yes | To evaluate the performance of CityAnchor, we conduct experiments on the CityRefer dataset and a new synthetic self-annotated dataset. CityRefer (Miyanishi et al., 2023) is a 3D visual grounding dataset annotated from the city-scale SensatUrban dataset (Hu et al., 2021b). CityAnchor is a city-scale 3D visual grounding dataset. We use 25 city-scale point clouds of the STPLS3D (Chen et al., 2022) dataset and manually annotate them with text prompts. We will release our CityAnchor dataset under the MIT license. |
| Dataset Splits | Yes | CityRefer (Miyanishi et al., 2023)...We use 85% of them for training and 15% of them for evaluation. CityAnchor is a city-scale 3D visual grounding dataset...There are 1448 text-object pairs. 80% of these pairs are used in training while the rest are used in tests. |
| Hardware Specification | Yes | All the experiments are implemented with PyTorch on a single NVIDIA A100 GPU (40 GB). |
| Software Dependencies | No | The paper mentions "PyTorch", "LISA (Lai et al., 2023)", "Vicuna-7b-v1.3 (Zheng et al., 2024)", "LLaVA architecture", and "LoRA layers (Hu et al., 2021a)". While these are software components or models, specific version numbers for general software dependencies such as PyTorch, Python, or CUDA are not provided. |
| Experiment Setup | Yes | We use the AdamW optimizer with a batch size of 8 and a learning rate decaying from 2e-5 to 2e-7 with a cosine annealing scheduler. The training of CLM and FMM takes about 12 and 15 hours to converge. We set the threshold θ for the candidate object detection in CLM to a fixed 0.3 (except for the specialized analysis of RoI threshold) and the number of neighboring objects K for spatial context-aware feature enhancement in FMM to 5. We select positive and negative samples in a ratio of 1:3 for FMM training. |
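The learning-rate schedule quoted in the "Experiment Setup" row (cosine annealing from 2e-5 down to 2e-7) can be reproduced from the standard cosine-annealing formula. The sketch below is an assumption for illustration only: the function name and the step count are not from the paper, and the paper itself uses PyTorch's built-in scheduler rather than a hand-rolled one.

```python
import math

def cosine_annealed_lr(step, total_steps, lr_max=2e-5, lr_min=2e-7):
    """Cosine-annealed learning rate matching the reported 2e-5 -> 2e-7 decay.

    Hypothetical helper; equivalent in shape to PyTorch's
    torch.optim.lr_scheduler.CosineAnnealingLR with eta_min=2e-7.
    """
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + math.cos(math.pi * step / total_steps)
    )

# At step 0 the rate equals lr_max; at the final step it equals lr_min.
print(cosine_annealed_lr(0, 100))    # 2e-05
print(cosine_annealed_lr(100, 100))  # 2e-07
```

In PyTorch this corresponds to constructing `CosineAnnealingLR(optimizer, T_max=total_steps, eta_min=2e-7)` on an AdamW optimizer initialized with `lr=2e-5`, as the quoted setup describes.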