Ranking-aware adapter for text-driven image ordering with CLIP

Authors: Wei-Hsiang Yu, Yen-Yu Lin, Ming-Hsuan Yang, Yi-Hsuan Tsai

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method on four tasks spanning diverse numerical concepts, including facial aging estimation (Samek et al., 2017), object count sorting (Singh et al., 2024), image quality/aesthetics assessment (Hosu et al., 2020; Murray et al., 2012), and dating historical colored images (Palermo et al., 2012). Our approach consistently performs favorably against CLIP baselines and state-of-the-art methods in terms of ranking and retrieval qualities, even though these competing methods are fine-tuned for target tasks.
Researcher Affiliation | Collaboration | Wei-Hsiang Yu (1), Yen-Yu Lin (1), Ming-Hsuan Yang (2), Yi-Hsuan Tsai (3); (1) National Yang Ming Chiao Tung University, (2) UC Merced, (3) Atmanity Inc.
Pseudocode | No | The paper includes architectural diagrams (Figure 2, Figure 3) but does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The source code is available at https://github.com/uynaes/RankingAwareCLIP
Open Datasets | Yes | Facial age estimation. Facial age estimation predicts the age of a given face. We use the Adience dataset (Samek et al., 2017), which includes 13,027 images labeled across 8 age groups, following the data split from Wang et al. (2023d). Historical colored image dating. The historical colored image dataset (Palermo et al., 2012) is a widely-used benchmark for predicting the decade of a given historical colored image, consisting of 1,325 images labeled across 5 decades ranging from the 1930s to the 1970s. Image quality and aesthetics assessment. For ranking images based on subjective preference and objective properties, we employ the KonIQ-10k dataset (Hosu et al., 2020) for assessing image quality and the Aesthetic Visual Analysis (AVA) dataset (Murray et al., 2012) for evaluating image aesthetics. Object count sorting. ... The COCO-REM dataset (Singh et al., 2024), an annotation-revised version of the COCO dataset, serves as the test bed. Table 7: Facial Age Estimation Results on the UTKFace Dataset. ... UTKFace dataset (Zhang et al., 2017). A.2 OBJECT COUNT SORTING ON THE CLEVR DATASET. The CLEVR dataset (Johnson et al., 2017). As shown in Figure 9, we evaluate our ranking adapter on unseen categories using the LVIS dataset (Gupta et al., 2019). As shown in Figure 10, we sample five images from the highest to lowest MOS at the same interval, finding that our model shows high agreement with subjective scores. AGIQA-3k dataset (Li et al., 2023a).
Dataset Splits | Yes | Facial age estimation. ...following the data split from Wang et al. (2023d). Historical colored image dating. ...We follow the standard ordinal regression setting as that in Wang et al. (2023d). Image quality and aesthetics assessment. ...We evaluate models using the official splits, which have 2,015 and 19,930 test images in the KonIQ-10k and AVA datasets, respectively. Object count sorting. ...comprising 118,287 training and 5,000 testing images. A.1 FACIAL AGE ESTIMATION ON THE UTKFACE DATASET. ...Following the preprocessing and data split described in Kuprashevich & Tolstykh (2023), we train our ranking-aware adapter for 20k steps with a batch size of 64, a learning rate of 5e-5, and a weight decay of 0.01.
Hardware Specification | Yes | We conduct all experiments on one NVIDIA RTX-3090Ti GPU.
Software Dependencies | No | The paper mentions using 'OpenCLIP' and optimizing with 'Smooth L1 loss', 'Hinge loss', and 'AdamW optimizer' but does not specify version numbers for these or other software components such as programming languages or libraries.
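Since the paper pins no dependency versions, a reproducibility run can at least record what is installed locally. The sketch below uses only the Python standard library; the package names in the usage example (`torch`, `open_clip_torch`) are illustrative guesses at the stack, not versions stated by the paper.

```python
import platform
from importlib import metadata

def capture_versions(packages):
    """Return interpreter and package versions for a reproducibility log."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions

# Example call (package names are illustrative, not pinned by the paper):
# capture_versions(["torch", "open_clip_torch"])
```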
Experiment Setup | Yes | We implement the ranking adapter upon the OpenCLIP (Ilharco et al., 2021) framework and optimize the model using a combination of Smooth L1 loss and Hinge loss with an AdamW optimizer at a learning rate of 1e-5, weight decay of 0.01, and batch size of 64. We use 220k steps for object count sorting and 144k steps for image quality assessment, facial age estimation, and historical image dating tasks. ...where α is a hyperparameter (set to 0.2 in our experiments) to balance the importance of the regression and ranking objectives. We apply random horizontal flipping as data augmentation and resize the images to 320×320 without cropping.
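The loss setup quoted above can be sketched as follows. This is a minimal PyTorch illustration assuming a batch of scalar scores: α = 0.2 follows the paper, but `margin`, the pairwise pairing scheme, and the weighting direction (regression term plus α times ranking term) are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def combined_loss(pred, target, alpha=0.2, margin=0.1):
    """Smooth L1 regression loss plus a pairwise hinge ranking loss.

    `alpha` follows the 0.2 reported in the paper; `margin` and the exact
    pairing scheme are illustrative assumptions.
    """
    # Regression term: Smooth L1 between predicted and ground-truth scores.
    reg = F.smooth_l1_loss(pred, target)

    # Ranking term: for every ordered pair (i, j), require the predicted
    # score difference to agree in sign with the ground-truth difference
    # by at least `margin`.
    d_pred = pred.unsqueeze(0) - pred.unsqueeze(1)      # d_pred[i, j] = pred[j] - pred[i]
    d_true = target.unsqueeze(0) - target.unsqueeze(1)
    sign = torch.sign(d_true)
    pairs = sign != 0                                   # ignore tied pairs
    if pairs.any():
        rank = F.relu(margin - sign * d_pred)[pairs].mean()
    else:
        rank = pred.new_zeros(())

    return reg + alpha * rank

# Optimizer settings quoted in the row above would correspond to:
# torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
```

A correctly ordered batch contributes only the (small) regression term, since every pairwise hinge is satisfied once predicted differences exceed the margin in the right direction.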