Specialized Foundation Models Struggle to Beat Supervised Baselines

Authors: Zongzhe Xu, Ritvik Gupta, Wenduo Cheng, Alexander Shen, Junhong Shen, Ameet Talwalkar, Mikhail Khodak

ICLR 2025

Reproducibility: Variable | Result | LLM Response
Research Type: Experimental. "To answer, we look at three modalities, genomics, satellite imaging, and time series, with multiple recent FMs and compare them to a standard supervised learning workflow... Across these three specialized domains, we find that it is consistently possible to train simple supervised models, no more complicated than a lightly modified wide ResNet or UNet, that match or even outperform the latest foundation models. Our work demonstrates that the benefits of large-scale pretraining have yet to be realized in many specialized areas..." (Section 4: Empirical Results)
Researcher Affiliation: Collaboration. Zongzhe Xu, Ritvik Gupta, Wenduo Cheng, Alexander Shen, Junhong Shen (Carnegie Mellon University, EMAIL; equal contribution, with author order decided by coin flip); Ameet Talwalkar (Carnegie Mellon University & Datadog, Inc., EMAIL); Mikhail Khodak (Princeton University, EMAIL).
Pseudocode: Yes. "Algorithm 1: Pseudocode for the DASHA workflow. Starting with a set of backbone CNNs, we use DASH (Shen et al., 2022) to set the right kernel size and dilation rate for each of its convolutional layers and then use ASHA (Li et al., 2020) to configure a training routine for the resulting architecture. Lastly, we pick the best backbone using validation data and train it."
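The workflow in Algorithm 1 can be sketched as a short driver function. This is a minimal illustration, not the authors' implementation: the four callables (`dash_search`, `asha_tune`, `val_score`, `retrain`) are hypothetical placeholders standing in for the DASH architecture search, the ASHA tuning step, validation scoring, and final training.

```python
def dasha(backbones, dash_search, asha_tune, val_score, retrain):
    """Sketch of the DASHA workflow: per-backbone architecture search
    and hyperparameter tuning, then selection on validation data.
    All four callables are illustrative placeholders."""
    candidates = []
    for backbone in backbones:
        # DASH step: choose kernel size and dilation for each conv layer
        arch = dash_search(backbone)
        # ASHA step: configure a training routine for the resulting model
        config = asha_tune(arch)
        candidates.append((val_score(arch, config), arch, config))
    # pick the best backbone by validation score and train it fully
    _, best_arch, best_config = max(candidates, key=lambda t: t[0])
    return retrain(best_arch, best_config)
```

Passing in stubs for the four steps is enough to exercise the selection logic, which is the only part the sketch actually pins down.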
Open Source Code: Yes. "To facilitate ongoing research in these and other domains, we make code associated with both our CNN-tuning pipeline (DASHA) and our AR-on-GPU workflow (Auto-AR) publicly available." Available at https://github.com/ritvikgupta199/DASHA and https://github.com/Zongzhe-Xu/Auto AR.
Open Datasets: Yes. "To evaluate them, we consider the Nucleotide Transformer (NT) benchmark of Dalla-Torre et al. (2023)... our evaluation includes GeoBench (Lacoste et al., 2024)... we add four additional tasks: BigEarthNet (Sumbul et al., 2019), EuroSAT (Helber et al., 2019), Canadian Cropland (Jacques et al., 2023), and fMoW-Sentinel (Cong et al., 2022)... We focus on long-horizon forecasting, which has a standard set of datasets (Goswami et al., 2024, Table 11), of which we consider seven."
Dataset Splits: Yes. "In alignment with the leaderboard, we apply a 0.1 validation split for DASHA during our evaluation. Additionally, we use an architecture set that includes both Wide ResNet and UNet for the search with DASHA on these datasets."
Hardware Specification: Yes. "All experiments were conducted on L40 GPUs (L40S for Genomics)."
Software Dependencies: No. The paper describes methods such as DASHA and Auto-AR and builds on ASHA (Li et al., 2020) and DASH (Shen et al., 2022). While these indicate software components, the paper provides no version numbers for programming languages, frameworks (e.g., Python, PyTorch, TensorFlow), or other libraries.
Experiment Setup: Yes. "The hyperparameter search space includes learning rate, weight decay, momentum, drop rate, and random seed for model initialization. We define a continuous search space, with further specific details provided in Table 7. Using ASHA, we evaluate 200 sample configurations over a maximum of 20 epochs, using a reduction factor of 2. The low-performing configurations are pruned based on their validation scores."
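The pruning schedule described above can be illustrated with a small successive-halving loop. This is a simplified synchronous sketch (real ASHA promotes configurations asynchronously), and the search ranges in `sample_config` are illustrative assumptions only; the paper's actual ranges are given in its Table 7.

```python
import random

def sample_config(rng):
    # Illustrative continuous search space; the real ranges are
    # an assumption here, not taken from the paper.
    return {
        "lr": 10 ** rng.uniform(-4, -1),
        "weight_decay": 10 ** rng.uniform(-6, -2),
        "momentum": rng.uniform(0.8, 0.99),
        "drop_rate": rng.uniform(0.0, 0.5),
        "seed": rng.randrange(10_000),
    }

def successive_halving(configs, evaluate, max_epochs=20, reduction_factor=2):
    """Prune low-scoring configs at geometrically spaced budgets:
    at each rung, keep the top 1/reduction_factor fraction and
    double (reduction_factor x) the training budget."""
    epochs = 1
    survivors = list(configs)
    while epochs < max_epochs and len(survivors) > 1:
        # score each surviving config at the current epoch budget
        scored = [(evaluate(cfg, epochs), cfg) for cfg in survivors]
        scored.sort(key=lambda t: t[0], reverse=True)  # higher score is better
        keep = max(1, len(scored) // reduction_factor)
        survivors = [cfg for _, cfg in scored[:keep]]
        epochs *= reduction_factor
    return survivors[0]
```

With a reduction factor of 2, a pool of 200 configurations shrinks roughly by half at each rung until the 20-epoch budget is exhausted, which matches the schedule quoted above.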