Advancing Comprehensive Aesthetic Insight with Multi-Scale Text-Guided Self-Supervised Learning

Authors: Yuti Liu, Shice Liu, Junyuan Gao, Peng-tao Jiang, Hao Zhang, Jinwei Chen, Bo Li

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The empirical evidence indicates that, accompanied by extensive instruction tuning, our model sets new state-of-the-art benchmarks across multiple tasks, including aesthetic scoring, aesthetic commenting, and personalized image aesthetic assessment.
Researcher Affiliation | Industry | vivo Mobile Communication Co., Ltd, Shanghai, China
Pseudocode | No | The paper describes the architecture of CALM and its training process in textual descriptions and diagrams (Figures 2, 3, 4) but does not include a dedicated pseudocode or algorithm block.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository.
Open Datasets | Yes | Self-supervised pre-training encourages the three Q-Formers in the MFAM to learn aesthetic attributes in a self-supervised manner, utilizing unlabeled images from diverse sources, including AVA (Murray et al. 2012), AADB (Kong et al. 2016), EVA (Kang et al. 2020), ICAA, PCCD (Chang et al. 2017), Pexels (Pfister et al. 2021), SPAQ (Fang et al. 2020), and TAD66K (He et al. 2022). The training data comprises a 558K subset of LAION-CC-SBU (Schuhmann et al. 2022; Changpinyo et al. 2021; Saleh and Elgammal 2015) and ShareGPT4V (Chen et al. 2023).
Dataset Splits | Yes | The AVA dataset comprises over 250,000 images with scores rated by users on the DPChallenge website. We used the official split, designating 19,928 images as the test set and the remainder for training. The AVA-Captions dataset contains approximately 230,000 images, each with an average of 5 user comments. To prevent data leakage, images from the AVA test set are excluded from AVA-Captions training, resulting in 210,000 images for training and 9,361 for testing. FLICKR-AES includes 35,263 images rated by 173 annotators in the training set and 4,737 images evaluated by 37 annotators in the test set, along with user identifications.
Hardware Specification | Yes | Training was conducted on eight 80GB A100 GPUs, utilizing the Adam optimizer (Kingma and Ba 2014).
Software Dependencies | No | The paper mentions models like Vicuna-7B and GPT-3.5, and the Adam optimizer, but does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA versions) required to replicate the experiments.
Experiment Setup | Yes | The peak learning rate was set to 1e-3 for the pre-training stage, and 2.5e-5 and 7e-5 for the two processes in the fine-tuning stage, respectively. Both stages commenced with a linear warm-up, followed by a cosine annealing schedule (Loshchilov and Hutter 2016), with durations of 5 hours and 16.5 hours, respectively.
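The warm-up plus cosine-annealing schedule described in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: the step counts (`total_steps`, `warmup_steps`) are hypothetical, since the paper reports stage durations in hours rather than steps; only the peak learning rate (1e-3 for pre-training) comes from the paper.

```python
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_steps: int) -> float:
    """Linear warm-up to peak_lr, then cosine annealing down to zero.

    Hypothetical schedule sketch; step counts are illustrative, and
    only peak_lr (e.g. 1e-3 for pre-training) is taken from the paper.
    """
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr over the warm-up phase.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Illustrative values with the pre-training peak LR of 1e-3:
print(lr_at_step(500, 10_000, 1e-3, 1_000))     # mid-warm-up: 5e-4
print(lr_at_step(1_000, 10_000, 1e-3, 1_000))   # end of warm-up: 1e-3
print(lr_at_step(10_000, 10_000, 1e-3, 1_000))  # end of training: ~0
```

The same function covers both fine-tuning processes by swapping in their peak rates (2.5e-5 and 7e-5).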