Specialized Foundation Models Struggle to Beat Supervised Baselines

Authors: Zongzhe Xu, Ritvik Gupta, Wenduo Cheng, Alexander Shen, Junhong Shen, Ameet Talwalkar, Mikhail Khodak

ICLR 2025

Reproducibility: Variable | Result | LLM Response
Research Type: Experimental. "To answer, we look at three modalities, genomics, satellite imaging, and time series, with multiple recent FMs and compare them to a standard supervised learning workflow... Across these three specialized domains, we find that it is consistently possible to train simple supervised models, no more complicated than a lightly modified wide ResNet or UNet, that match or even outperform the latest foundation models. Our work demonstrates that the benefits of large-scale pretraining have yet to be realized in many specialized areas..." (Section 4: Empirical Results)
Researcher Affiliation: Collaboration. Zongzhe Xu, Ritvik Gupta, Wenduo Cheng, Alexander Shen, Junhong Shen (Carnegie Mellon University, EMAIL; equal contribution, with author order decided by coin flip); Ameet Talwalkar (Carnegie Mellon University & Datadog, Inc., EMAIL); Mikhail Khodak (Princeton University, EMAIL).
Pseudocode: Yes. "Algorithm 1: Pseudocode for the DASHA workflow. Starting with a set of backbone CNNs, we use DASH (Shen et al., 2022) to set the right kernel size and dilation rate for each of its convolutional layers and then use ASHA (Li et al., 2020) to configure a training routine for the resulting architecture. Lastly, we pick the best backbone using validation data and train it."
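The workflow in Algorithm 1 can be sketched as a short driver function. This is a minimal illustration, not the authors' implementation: the four callables (`dash_search`, `asha_tune`, `val_score`, `retrain`) are hypothetical placeholders standing in for the DASH architecture search, the ASHA tuning step, validation scoring, and final training.

```python
def dasha(backbones, dash_search, asha_tune, val_score, retrain):
    """Sketch of the DASHA workflow: per-backbone architecture search
    and hyperparameter tuning, then selection on validation data.
    All four callables are illustrative placeholders."""
    candidates = []
    for backbone in backbones:
        # DASH step: choose kernel size and dilation for each conv layer
        arch = dash_search(backbone)
        # ASHA step: configure a training routine for the resulting model
        config = asha_tune(arch)
        candidates.append((val_score(arch, config), arch, config))
    # pick the best backbone by validation score and train it fully
    _, best_arch, best_config = max(candidates, key=lambda t: t[0])
    return retrain(best_arch, best_config)
```

Passing in stubs for the four steps is enough to exercise the selection logic, which is the only part the sketch actually pins down.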
Open Source Code: Yes. "To facilitate ongoing research in these and other domains, we make code associated with both our CNN-tuning pipeline (DASHA) and our AR-on-GPU workflow (Auto-AR) publicly available." Available at https://github.com/ritvikgupta199/DASHA and https://github.com/Zongzhe-Xu/Auto AR.
Open Datasets: Yes. "To evaluate them, we consider the Nucleotide Transformer (NT) benchmark of Dalla-Torre et al. (2023)... our evaluation includes GeoBench (Lacoste et al., 2024)... we add four additional tasks: BigEarthNet (Sumbul et al., 2019), EuroSAT (Helber et al., 2019), Canadian Cropland (Jacques et al., 2023), and fMoW-Sentinel (Cong et al., 2022)... We focus on long-horizon forecasting, which has a standard set of datasets (Goswami et al., 2024, Table 11), of which we consider seven."
Dataset Splits: Yes. "In alignment with the leaderboard, we apply a 0.1 validation split for DASHA during our evaluation. Additionally, we use an architecture set that includes both Wide ResNet and UNet for the search with DASHA on these datasets."
Hardware Specification: Yes. "All experiments were conducted on L40 GPUs (L40S for Genomics)."
Software Dependencies: No. The paper describes methods such as DASHA and Auto-AR and builds on ASHA (Li et al., 2020) and DASH (Shen et al., 2022). While these indicate software components, the paper provides no version numbers for programming languages, frameworks (e.g., Python, PyTorch, TensorFlow), or other libraries.
Experiment Setup: Yes. "The hyperparameter search space includes learning rate, weight decay, momentum, drop rate, and random seed for model initialization. We define a continuous search space, with further specific details provided in Table 7. Using ASHA, we evaluate 200 sample configurations over a maximum of 20 epochs, using a reduction factor of 2. The low-performing configurations are pruned based on their validation scores."
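The pruning schedule described above can be illustrated with a small successive-halving loop. This is a simplified synchronous sketch (real ASHA promotes configurations asynchronously), and the search ranges in `sample_config` are illustrative assumptions only; the paper's actual ranges are given in its Table 7.

```python
import random

def sample_config(rng):
    # Illustrative continuous search space; the real ranges are
    # an assumption here, not taken from the paper.
    return {
        "lr": 10 ** rng.uniform(-4, -1),
        "weight_decay": 10 ** rng.uniform(-6, -2),
        "momentum": rng.uniform(0.8, 0.99),
        "drop_rate": rng.uniform(0.0, 0.5),
        "seed": rng.randrange(10_000),
    }

def successive_halving(configs, evaluate, max_epochs=20, reduction_factor=2):
    """Prune low-scoring configs at geometrically spaced budgets:
    at each rung, keep the top 1/reduction_factor fraction and
    double (reduction_factor x) the training budget."""
    epochs = 1
    survivors = list(configs)
    while epochs < max_epochs and len(survivors) > 1:
        # score each surviving config at the current epoch budget
        scored = [(evaluate(cfg, epochs), cfg) for cfg in survivors]
        scored.sort(key=lambda t: t[0], reverse=True)  # higher score is better
        keep = max(1, len(scored) // reduction_factor)
        survivors = [cfg for _, cfg in scored[:keep]]
        epochs *= reduction_factor
    return survivors[0]
```

With a reduction factor of 2, a pool of 200 configurations shrinks roughly by half at each rung until the 20-epoch budget is exhausted, which matches the schedule quoted above.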