Dimension Agnostic Neural Processes

Authors: Hyungi Lee, Chaeyun Jang, Dong Bok Lee, Juho Lee

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through comprehensive experimentation on various synthetic and practical regression tasks, we empirically show that DANP outperforms previous NP variations, showcasing its effectiveness in overcoming the limitations of traditional NP models and its potential for broader applicability in diverse regression scenarios."
Researcher Affiliation | Academia | "Hyungi Lee, Chaeyun Jang, Dong Bok Lee, Juho Lee (KAIST)"
Pseudocode | No | The paper describes the methodology using textual explanations and mathematical formulations, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "To ensure reproducibility, we have included our experiment code in the supplementary material. Our code builds upon the official implementation of TNP (Nguyen & Grover, 2022)."
Open Datasets | Yes | "We employed the EMNIST Balanced dataset (Cohen et al., 2017), which consists of 112,800 training samples and 18,800 test samples. We utilized the CelebA dataset (Liu et al., 2015), which includes 162,770 training samples, 19,867 validation samples, and 19,962 test samples. We utilized the HPO-B benchmark (Pineda Arango et al., 2021), which consists of 176 search spaces (algorithms) evaluated sparsely on 196 datasets, with a total of 6.4 million hyperparameter evaluations. To further validate the practicality, we conducted additional experiments on time series data using the blood pressure estimation task from the MIMIC-III dataset (Johnson et al., 2016)."
Dataset Splits | Yes | "Specifically, we first determine the number of context points |c| by drawing from a uniform distribution Unif(n²/50, n²/5). ... Next, we sample the number of target points |t| from a uniform distribution Unif(n²/50, n²/5 - |c|) to keep the total number of points within a manageable range. For the EMNIST dataset, our training and test datasets comprise 26,400 and 4,400 samples, respectively. ... We sample the number of context points |c| ~ Unif(5, 45) and the number of target points |t| ~ Unif(5, 50 - |c|). For the CelebA dataset, we sampled the number of context points |c| from Unif(5, 45) and the number of target points |t| from Unif(5, 50 - |c|). For the HPO-B benchmark, to construct a meta-batch from each task, we sample the number of context points |c| from Unif(5, 50) and the number of target points |t| from Unif(5, 50 - |c|). For the MIMIC-III dataset, we trained the models with 32,000 training samples and fine-tuned them with only 320 samples. ... Here, we considered observations from time 0, ..., t as context points and t+1, ..., T as target points."
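The context/target sampling rule quoted above (|c| ~ Unif(5, 45), |t| ~ Unif(5, 50 - |c|)) can be sketched as follows. This is a minimal illustrative sketch using NumPy, not the paper's actual data pipeline; the function name and seed are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_context_target(n_points=50, c_low=5, c_high=45):
    """Sample |c| ~ Unif(5, 45) and |t| ~ Unif(5, 50 - |c|), then
    partition a shuffled index set into context and target points."""
    num_ctx = int(rng.integers(c_low, c_high + 1))          # |c| in [5, 45]
    num_tar = int(rng.integers(5, n_points - num_ctx + 1))  # |t| in [5, 50 - |c|]
    idx = rng.permutation(n_points)                         # disjoint split
    return idx[:num_ctx], idx[num_ctx:num_ctx + num_tar]

ctx, tar = split_context_target()
```

Drawing |t| with an upper bound of 50 - |c| guarantees the combined context and target set never exceeds 50 points, which is what the quoted text means by "a manageable range".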
Hardware Specification | Yes | "All experiments were conducted on either a single NVIDIA GeForce RTX 3090 GPU or an NVIDIA RTX A6000 GPU."
Software Dependencies | No | The paper mentions software like PyTorch, BayesO, BoTorch, GPyTorch, and the Adam optimizer but does not provide specific version numbers for these software components.
Experiment Setup | Yes | "Unless otherwise specified, we selected the base learning rate, weight decay, and batch size from the following grids: {5e-5, 7e-5, 9e-5, 1e-4, 3e-4, 5e-4} for learning rate, {0, 1e-5} for weight decay, and {16, 32} for batch size, based on validation task log-likelihood."