More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding

Authors: Yuan Tang, Xu Han, Xianzhi Li, Qiao Yu, Jinfeng Xu, Yixue Hao, Long Hu, Min Chen

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results show that GreenPLM requires only 12% of the 3D training data used by existing state-of-the-art models to achieve superior 3D understanding. Remarkably, GreenPLM also achieves competitive performance using text-only data.
Researcher Affiliation | Academia | Huazhong University of Science and Technology; South China University of Technology
Pseudocode | No | The paper describes methods and equations, such as Equation (6) for 0M-Pooling, but does not present any explicitly labeled pseudocode or algorithm blocks in a structured format.
Open Source Code | Yes | Code: https://github.com/TangYuan96/GreenPLM
Open Datasets | Yes | We bring the T3D dataset, a 6M text dataset of 3D object descriptions and conversations for free, the largest to our knowledge, to expand the text space for better text-LLM alignment and compensate for the scarcity of expensive 3D data. ... Generative 3D object classification task on the ModelNet40 dataset (Wu et al. 2015) and Objaverse dataset (Deitke et al. 2023)
Dataset Splits | Yes | Generative 3D object classification task on the ModelNet40 dataset (Wu et al. 2015) and Objaverse dataset (Deitke et al. 2023), using I-type and C-type prompts, with results shown in Tab. 2. For close-set zero-shot classification on ModelNet40, we let Qwen2 select the closest matching category in the 40 classes as the model's output.
Hardware Specification | Yes | Together, we can complete training in just 26.6 hours using a single 3090 GPU (24GB)
Software Dependencies | No | The paper mentions models like Phi-3, EVA-CLIP-E, ViT, Uni3D, and Qwen2-72B-Instruct, but it does not provide specific version numbers for software libraries, programming languages, or other ancillary software components.
Experiment Setup | Yes | We use Phi-3 (Abdin et al. 2024) as the LLM backbone, with EVA-CLIP-E (Sun et al. 2023) and ViT (Dosovitskiy et al. 2020) both trained by Uni3D (Zhou et al. 2023) as the text encoder and point encoder, respectively. ... The MLP projector consists of two linear layers and a GeLU activation, mapping the encoder's output tokens to tokens with 3072 dimensions of Phi-3. Our GreenPLM has 63.3M trainable parameters and requires only 26.6 hours of training on a single 3090 GPU. ... We continue using LoRA (Hu et al. 2021) to train f_LLM for efficient point-LLM alignment. ... As the standard deviation (std) of the noise increases from 0 to 0.06, GreenPLM's accuracy initially increases and then decreases, reaching its peak at 0.05.
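The projector described in the setup row (two linear layers with a GeLU in between, mapping encoder tokens to Phi-3's 3072-dimensional hidden space) can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: the 1024 input width is a hypothetical encoder dimension (the real value depends on the Uni3D-trained encoders), and the weight initialization is arbitrary; only the two-layer-plus-GeLU shape and the 3072 output size come from the paper.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GeLU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class MLPProjector:
    """Two linear layers with a GeLU in between, mapping encoder output
    tokens to the LLM hidden size (3072 for Phi-3, per the paper)."""
    def __init__(self, in_dim, out_dim=3072, seed=0):
        rng = np.random.default_rng(seed)
        # small random init; the paper does not specify an init scheme
        self.w1 = rng.standard_normal((in_dim, out_dim)) * 0.02
        self.b1 = np.zeros(out_dim)
        self.w2 = rng.standard_normal((out_dim, out_dim)) * 0.02
        self.b2 = np.zeros(out_dim)

    def __call__(self, tokens):
        # tokens: (num_tokens, in_dim) -> (num_tokens, out_dim)
        return gelu(tokens @ self.w1 + self.b1) @ self.w2 + self.b2

proj = MLPProjector(in_dim=1024)      # 1024 is a hypothetical encoder width
out = proj(np.zeros((16, 1024)))      # 16 encoder tokens in, 16 LLM tokens out
print(out.shape)                      # (16, 3072)
```

In practice such a projector is the only bridge trained between a frozen encoder and the LLM in many point-language pipelines, which is consistent with the paper's small 63.3M trainable-parameter budget.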