IM-Context: In-Context Learning for Imbalanced Regression Tasks

Authors: Ismail Nejjar, Faez Ahmed, Olga Fink

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations across a variety of real-world datasets demonstrate that in-context learning substantially outperforms existing in-weight learning methods in scenarios with high levels of imbalance. To empirically validate these findings, we use two pre-trained models (Garg et al., 2022; Müller et al., 2023), and evaluate our methodology across eight imbalanced regression tasks.
Researcher Affiliation | Academia | Ismail Nejjar (EMAIL), École Polytechnique Fédérale de Lausanne (EPFL) and Massachusetts Institute of Technology (MIT); Faez Ahmed (EMAIL), Massachusetts Institute of Technology (MIT); Olga Fink (EMAIL), École Polytechnique Fédérale de Lausanne (EPFL)
Pseudocode | No | The paper describes the methodology in text and illustrates a high-level overview in Figure 5, but it does not contain a structured pseudocode block or algorithm.
Open Source Code | Yes | The code is available at https://github.com/ismailnejjar/IM-Context.
Open Datasets | Yes | We use three benchmark datasets curated by Yang et al. (2021) specifically for imbalanced regression tasks: AgeDB-DIR, derived from the AgeDB dataset for age estimation (Moschoglou et al., 2017); IMDB-WIKI-DIR, another age estimation dataset sourced from IMDB-WIKI (Rothe et al., 2018); and STS-B-DIR, which measures text similarity between two sentences based on the Semantic Textual Similarity Benchmark (Wang et al., 2018). Additionally, we use six tabular datasets, namely Boston (Harrison & Rubinfeld, 1978), Concrete (Yeh, 2007), Abalone (Nash et al., 1995), Communities (Redmond, 2009), Kin8nm, and an engineering design dataset: Airfoil (Chen et al., 2019).
Dataset Splits | Yes | For the UCI datasets, we created training and testing sets, ensuring the test sets were balanced, using Algorithm ??. Due to its small size, the Boston dataset was split 90% for training and 10% for testing, and the bin size was set to 15. The other datasets were split 80% for training and 20% for testing, and the bin size was set to 50. We present our results across four predefined shot regions (All, Many, Median, and Few), which categorize subsets of the datasets based on the number of training samples available per label within each discretized bin. Specifically, the Few category includes bins with fewer than 20 samples, Median encompasses bins with 20 to 100 samples, and Many refers to bins with over 100 samples per label.
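The split protocol quoted above (discretize the labels into bins, hold out an even number of test samples per bin so the test set is balanced, and label bins Few/Median/Many by their training counts) can be sketched as follows. This is a minimal illustration, not the paper's actual Algorithm ??; `balanced_split` and its parameters are hypothetical.

```python
import numpy as np

def balanced_split(y, n_bins=50, test_frac=0.2, seed=0):
    """Hold out roughly the same number of test samples from every
    non-empty label bin, so the test set is balanced across bins."""
    rng = np.random.default_rng(seed)
    # interior bin edges; digitize assigns each label to a bin 0..n_bins-1
    edges = np.linspace(y.min(), y.max(), n_bins + 1)[1:-1]
    bins = np.digitize(y, edges)
    per_bin = max(1, int(len(y) * test_frac / n_bins))
    test_idx = []
    for b in np.unique(bins):
        members = np.where(bins == b)[0]
        take = min(per_bin, len(members) // 2)  # keep at least half for training
        if take:
            test_idx.extend(rng.choice(members, size=take, replace=False))
    test_idx = np.asarray(test_idx, dtype=int)
    train_mask = np.ones(len(y), dtype=bool)
    train_mask[test_idx] = False
    return np.where(train_mask)[0], test_idx

def shot_region(n_train_in_bin):
    # thresholds from the report: Few < 20, Median 20-100, Many > 100
    if n_train_in_bin < 20:
        return "Few"
    return "Median" if n_train_in_bin <= 100 else "Many"
```

Rare label bins may contribute fewer test samples than `per_bin`, so the resulting test set is only approximately balanced when the imbalance is extreme.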
Hardware Specification | Yes | An NVIDIA RTX 2080 GPU was used for all the experiments.
Software Dependencies | No | For the tabular datasets, we used baselines from scikit-learn (Pedregosa et al., 2011). For the benchmark datasets, we preprocess images and text into embeddings using the Hugging Face implementations of CLIP (Radford et al., 2021) and BERT (all-mpnet-base-v2 model) (Reimers & Gurevych, 2019).
Experiment Setup | Yes | To ensure the robustness and reproducibility of our results, we conducted three separate experiments using different random seeds between 0 and 3. In all experiments, for the GPT2 model from Garg et al. (2022), we retrieve ks = 10 nearest neighbors; for the PFN model we retrieve ks = 15, except for the IMDB dataset, where ks was set to 5. This choice is motivated by the preliminary experiments conducted in Section 3.3, as observed in Figure 3, where the minimum error in all regions is achieved for 10 neighbors. Standard scaling was applied to the input features for these models. For in-context learning, we applied both standard scaling and a power transform to the features and then concatenated the two representations. Table 7: Architecture details of GPT2 and PFN models.
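The retrieval pipeline quoted above (standard scaling plus a power transform, concatenated, then nearest-neighbor retrieval of ks in-context examples) can be sketched with scikit-learn. `build_context` is a hypothetical helper, not the authors' released code, and the final step of feeding the retrieved pairs to the frozen GPT2/PFN model is omitted.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import PowerTransformer, StandardScaler

def build_context(X_train, y_train, X_query, k=10):
    """Select the k nearest training points, in the concatenated
    scaled + power-transformed feature space, as in-context examples."""
    scaler = StandardScaler().fit(X_train)
    power = PowerTransformer().fit(X_train)  # Yeo-Johnson by default
    Z_train = np.hstack([scaler.transform(X_train), power.transform(X_train)])
    Z_query = np.hstack([scaler.transform(X_query), power.transform(X_query)])
    nn = NearestNeighbors(n_neighbors=k).fit(Z_train)
    _, idx = nn.kneighbors(Z_query)  # idx sorted by ascending distance
    # each query row gets its own (features, labels) context for the frozen model
    return X_train[idx], y_train[idx]
```

For a single query `x`, `build_context(X_train, y_train, x[None], k=10)` returns the 10 neighbor feature rows and their labels, which form the prompt for the pre-trained in-context regressor.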