Advancing Comprehensive Aesthetic Insight with Multi-Scale Text-Guided Self-Supervised Learning
Authors: Yuti Liu, Shice Liu, Junyuan Gao, Peng-tao Jiang, Hao Zhang, Jinwei Chen, Bo Li
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The empirical evidence indicates that, accompanied by extensive instruction tuning, our model sets new state-of-the-art benchmarks across multiple tasks, including aesthetic scoring, aesthetic commenting, and personalized image aesthetic assessment. |
| Researcher Affiliation | Industry | vivo Mobile Communication Co., Ltd, Shanghai, China EMAIL; EMAIL |
| Pseudocode | No | The paper describes the architecture of CALM and its training process in textual descriptions and diagrams (Figures 2, 3, 4) but does not include a dedicated pseudocode or algorithm block. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | Self-Supervised Pre-Training encourages the three Q-Formers in the MFAM to learn aesthetic attributes in a self-supervised manner, utilizing unlabeled images from diverse sources, including AVA (Murray et al. 2012), AADB (Kong et al. 2016), EVA (Kang et al. 2020), ICAA, PCCD (Chang et al. 2017), Pexels (Pfister et al. 2021), SPAQ (Fang et al. 2020), and TAD66K (He et al. 2022). The training data comprises a 558K subset of LAION-CC-SBU (Schuhmann et al. 2022; Changpinyo et al. 2021; Saleh and Elgammal 2015) and ShareGPT4V (Chen et al. 2023). |
| Dataset Splits | Yes | The AVA dataset comprises over 250,000 images with scores rated by users on the DPChallenge website. We used the official split, designating 19,928 images as the test set and the remainder for training. The AVA-Captions dataset contains approximately 230,000 images, each with an average of 5 user comments. To prevent data leakage, images from the AVA test set are excluded from AVA-Captions training, resulting in 210,000 images for training and 9,361 for testing. FLICKR-AES includes 35,263 images rated by 173 annotators in the training set and 4,737 images evaluated by 37 annotators in the test set, along with user identifications. |
| Hardware Specification | Yes | Training was conducted on eight 80GB A100 GPUs, utilizing the Adam optimizer (Kingma and Ba 2014). |
| Software Dependencies | No | The paper mentions models like Vicuna-7B and GPT-3.5, and the Adam optimizer, but does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA versions) required to replicate the experiments. |
| Experiment Setup | Yes | The peak learning rate was set to 1e-3 for the pre-training stage, and to 2.5e-5 and 7e-5 for the two processes in the fine-tuning stage, respectively. Both stages commenced with a linear warm-up, followed by a cosine annealing schedule (Loshchilov and Hutter 2016), with durations of 5 hours and 16.5 hours, respectively. |
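The learning-rate schedule quoted in the Experiment Setup row (linear warm-up into cosine annealing) can be sketched as below. The function name, step counts, and the final learning rate of zero are illustrative assumptions; the paper reports only the peak rates and the schedule shape, not these parameters.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr):
    """Hypothetical sketch of a linear warm-up followed by cosine annealing.

    Ramps linearly from ~0 to peak_lr over warmup_steps, then decays
    to 0 along a half-cosine over the remaining steps (assumed floor of 0).
    """
    if step < warmup_steps:
        # Linear warm-up: fraction of peak_lr proportional to progress.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine annealing: progress runs from 0 (at end of warm-up) to 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, with `peak_lr=1e-3` (the quoted pre-training peak), the rate climbs to 1e-3 at the end of warm-up and decays smoothly toward zero by the final step.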