CLIP-PCQA: Exploring Subjective-Aligned Vision-Language Modeling for Point Cloud Quality Assessment

Authors: Yating Liu, Yujie Zhang, Ziyu Shan, Yiling Xu

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results show that our CLIP-PCQA outperforms other State-Of-The-Art (SOTA) approaches. We conduct comprehensive experiments on multiple benchmarks. Experimental results indicate that CLIP-PCQA achieves superior performance and further analyses reveal the model's robustness under different settings.
Researcher Affiliation Academia Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, Shanghai, China EMAIL
Pseudocode No The paper describes the proposed method using text, mathematical formulations, and diagrams, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code https://github.com/Olivialyt/CLIP-PCQA
Open Datasets Yes To illustrate the effectiveness of our method, we employ three benchmarks with available raw opinion scores: SJTU-PCQA (Yang et al. 2020a), LS-PCQA Part I (Liu et al. 2023b) and BASICS (Ak et al. 2024).
Dataset Splits Yes We partition the databases according to content (reference point clouds) and k-fold cross-validation is used for training. Specifically, 9-fold cross-validation is applied for SJTU-PCQA following (Zhang et al. 2022b), and we adopt a 5-fold cross-validation both for LS-PCQA and BASICS. For each fold, the test performance with minimal training loss is recorded and the average result across all folds is recorded to alleviate randomness.
Hardware Specification No The paper does not provide specific hardware details such as GPU or CPU models used for the experiments. It only mentions general training strategies.
Software Dependencies No The paper mentions using a 'Vision Transformer (ViT-B/16)' and 'Adam optimizer', but does not provide specific version numbers for software libraries, frameworks, or programming languages used.
Experiment Setup Yes The initial learning rate is set as 4e-6 and the model is trained for 50 epochs with a default batch size of 16. We use the Adam optimizer (Kingma and Ba 2014) with a weight decay of 1e-4. The number of projection views M = 6 and the images are randomly cropped into 224×224×3 as inputs. We set the number of context tokens W as 16. For the loss function, we set θ = [0.25, 0.50, 0.75]. α is set to 1/K and β is set to 0.08. Depending on the raw score ranges of different databases, we evenly divide them into five thresholds as the quantitative values q. For example, we set q = [5, 4, 3, 2, 1] for LS-PCQA and q = [10, 8, 6, 4, 2] for SJTU-PCQA, respectively.
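The split protocol and experiment setup reported above can be sketched in code. This is a minimal illustration only, not the authors' released implementation (which is at the GitHub link above); all function and variable names (CONFIG, content_kfold, split_samples, quality_anchors) are assumptions introduced here for clarity.

```python
# Hedged sketch of the evaluation protocol and training configuration
# quoted in the reproducibility table. Names are illustrative, not from
# the paper's code release.

# Stated hyperparameters, gathered in one place for reference.
CONFIG = {
    "lr": 4e-6,                    # initial learning rate (Adam)
    "weight_decay": 1e-4,
    "epochs": 50,
    "batch_size": 16,
    "num_views_M": 6,              # projection views per point cloud
    "crop_size": (224, 224, 3),    # random crop fed to the ViT-B/16 backbone
    "context_tokens_W": 16,
    "theta": [0.25, 0.50, 0.75],
    "beta": 0.08,                  # alpha is 1/K, K = number of quality levels
}

def content_kfold(contents, k):
    """Partition reference-content IDs into k folds, matching the
    content-based k-fold cross-validation described in the table
    (9-fold for SJTU-PCQA, 5-fold for LS-PCQA and BASICS)."""
    contents = sorted(set(contents))
    return [contents[i::k] for i in range(k)]

def split_samples(samples, test_contents):
    """Split (content_id, sample) pairs so that every distortion of a
    held-out reference point cloud is excluded from training."""
    train = [s for c, s in samples if c not in test_contents]
    test = [s for c, s in samples if c in test_contents]
    return train, test

def quality_anchors(score_max, score_min=0.0, levels=5):
    """Evenly divide a raw score range into `levels` quantitative
    values q, from best to worst; quality_anchors(10) matches the
    q = [10, 8, 6, 4, 2] used for SJTU-PCQA, and quality_anchors(5)
    matches the q = [5, 4, 3, 2, 1] used for LS-PCQA."""
    step = (score_max - score_min) / levels
    return [score_max - i * step for i in range(levels)]
```

Splitting by reference content (rather than by individual distorted sample) matters for quality assessment: it guarantees the model is always tested on unseen source content, which is what the averaged cross-fold result in the table measures.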