PRIME: Deep Imbalanced Regression with Proxies
Authors: Jongin Lim, Sucheol Lee, Daeho Um, Sung-Un Park, Jinwoo Shin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the effectiveness and broad applicability of PRIME, achieving state-of-the-art performance on four real-world regression benchmark datasets across diverse target domains. |
| Researcher Affiliation | Collaboration | 1AI Center, Samsung Electronics 2Korea Advanced Institute of Science and Technology (KAIST). |
| Pseudocode | No | The paper describes the proposed method using mathematical formulations and descriptive text, but it does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We will release the code after publication. |
| Open Datasets | Yes | We conduct experiments on four real-world imbalanced regression benchmarks introduced by (Yang et al., 2021): (i) AgeDB-DIR is a facial age estimation dataset derived from AgeDB (Moschoglou et al., 2017). (ii) IMDB-WIKI-DIR is an age estimation dataset constructed from IMDB-WIKI (Rothe et al., 2018). (iii) NYUD2-DIR is derived from the NYU Depth Dataset V2 (Silberman et al., 2012) for depth prediction from RGB indoor scenes. (iv) STS-B-DIR is a natural language dataset based on STS-B (Cer et al., 2017; Wang, 2018), providing continuous similarity scores between pairs of sentences. |
| Dataset Splits | Yes | For all datasets, we report results for four subsets: All, Many, Median, and Few. All refers to the entire test set. Based on the number of training samples per label, Many includes labels with over 100 samples, Median covers those with 20 to 100 samples, and Few consists of labels with fewer than 20 samples. [Table 10: Overall dataset statistics. AgeDB-DIR: # Training 12,208, # Val. 2,140, # Test 2,140; IMDB-WIKI-DIR: # Training 191,509, # Val. 11,022, # Test 11,022; NYUD2-DIR: # Training 50,688, # Val. , # Test 654; STS-B-DIR: # Training 5,249, # Val. 1,000, # Test 1,000] |
| Hardware Specification | Yes | To analyze the computational efficiency of PRIME, we compute the average wall-clock training time (in seconds) using four NVIDIA Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions using ResNet50, BiLSTM, and GloVe word embeddings, but does not specify version numbers for any software libraries or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | The number of proxies C is empirically determined for each dataset. Proxy embeddings {z_i^p}_{i=1}^C are initialized with He initialization (He et al., 2015) and trained jointly with the model. The Proxy lr refers to the multiplication factor applied to the learning rate of the proxy. The hyperparameters λ_p, λ_a, τ_f, τ_t, and α are set empirically. ... Tables 11, 12, 13, and 14 summarize the implementation details for AgeDB-DIR, IMDB-WIKI-DIR, NYUD2-DIR, and STS-B-DIR, respectively. ... For AgeDB-DIR and IMDB-WIKI-DIR, we use ResNet50 ... Epoch 80, Batch size 64, Learning rate 2.5e-4, Weight decay 1.0e-4, Optimizer Adam, Scheduler StepLR (60/0.1). |
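The Many/Median/Few evaluation protocol quoted in the Dataset Splits row amounts to bucketing each label by its training-set frequency. The sketch below mirrors the reported cutoffs (over 100 samples is Many, 20 to 100 is Median, fewer than 20 is Few); the function name `shot_buckets` and its signature are illustrative, not from the paper.

```python
from collections import Counter

def shot_buckets(train_labels, many_thresh=100, few_thresh=20):
    """Assign each label to a shot region by training-sample count.

    Thresholds follow the reported protocol: Many for counts over
    many_thresh, Median for counts in [few_thresh, many_thresh],
    Few for counts below few_thresh.
    """
    counts = Counter(train_labels)
    buckets = {}
    for label, n in counts.items():
        if n > many_thresh:
            buckets[label] = "many"
        elif n >= few_thresh:
            buckets[label] = "median"
        else:
            buckets[label] = "few"
    return buckets
```

For example, a label with 150 training samples lands in Many, one with 50 in Median, and one with 5 in Few; the All subset is simply the union of the three.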
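The optimization recipe quoted in the Experiment Setup row (Adam, base learning rate 2.5e-4, StepLR decaying by 0.1 at epoch 60, plus a multiplicative "Proxy lr" factor on the proxy parameters) can be sketched as a per-group learning-rate schedule. The `proxy_mult` default of 10.0 here is a hypothetical placeholder; the paper tunes the actual factor per dataset.

```python
def group_lrs(epoch, base_lr=2.5e-4, proxy_mult=10.0, step=60, gamma=0.1):
    """Per-group learning rates under the reported StepLR (60/0.1) schedule.

    proxy_mult stands in for the paper's per-dataset "Proxy lr"
    multiplication factor (hypothetical default).
    """
    lr = base_lr * (gamma ** (epoch // step))
    return {"backbone": lr, "proxy": lr * proxy_mult}
```

Under this schedule the backbone trains at 2.5e-4 for the first 60 epochs and at 2.5e-5 for the remaining 20, with the proxy group scaled by the same factor throughout.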