Local Distribution-Based Adaptive Oversampling for Imbalanced Regression

Authors: Shayan Alahyari, Mike Domaratzki

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In extensive evaluations on 45 imbalanced datasets, LDAO outperforms state-of-the-art oversampling methods on both frequent and rare target values, demonstrating its effectiveness for addressing the challenge of imbalanced regression.
Researcher Affiliation | Academia | Shayan Alahyari (EMAIL), Western University; Mike Domaratzki (EMAIL), Western University
Pseudocode | Yes | The complete LDAO procedure is presented in Algorithm 1, which outlines all four steps of our proposed method.
Open Source Code | Yes | Our code is available at https://github.com/ShayanAlahyari/LDAO.
Open Datasets | Yes | We evaluated our method using 45 datasets from three sources: the Keel repository (Alcalá-Fdez et al., 2011), the collection at https://paobranco.github.io/DataSets-IR (Branco et al., 2019), and the repository at https://github.com/JusciAvelino/imbalancedRegression (Avelino et al., 2024).
Dataset Splits | Yes | We employed 5 runs of 5-fold cross-validation for all experiments. This outer fold cross-validation divided each dataset into five equal portions, with each fold using four portions (80% of data) for training and one portion (20% of data) as the test set. Each data portion served as a test set exactly once across the five folds in each run. For hyperparameter tuning within each fold, we further divided the training data into sub-training (80%) and validation (20%) sets.
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments. It only mentions using an MLP with specific layer configurations and the Adam optimizer, but no details on the computational resources (e.g., GPU/CPU models, memory) are provided.
Software Dependencies | No | For our evaluation metrics, we utilized the SERA implementation from the Imbalanced Learning Regression Python package (Wu et al., 2022). The SMOGN method was implemented using the package developed by Kunz (Kunz, 2020). We implemented the Dense Loss and G-SMOTE methods based on their original papers, carefully following the authors' descriptions and guidelines to ensure faithful reproduction of their approaches. To optimize the hyperparameters for each method, we employed the Optuna framework (Akiba et al., 2019). While specific packages and frameworks are mentioned, concrete version numbers for general software dependencies like Python, PyTorch/TensorFlow, or CUDA are not provided.
Experiment Setup | Yes | Following the approach of Steininger et al. (2021), we evaluated all methods using a Multi-Layer Perceptron (MLP) with three hidden layers (10 neurons each) and ReLU activations. The output layer uses linear activation for regression. We trained models for 1000 epochs using the Adam optimizer with early stopping to prevent overfitting. We compared LDAO against four approaches: Baseline (no resampling, using the original imbalanced data), SMOGN (an extension of SMOTER that incorporates Gaussian noise during oversampling), G-SMOTE (Geometric SMOTE adapted for regression tasks, using geometric interpolation), and Dense Loss (a cost-sensitive approach that weights errors by target density). ... We utilized Bayesian optimization with 15 trials to efficiently search the parameter space ... LDAO's parameters include the oversampling multiplier and KDE bandwidth. SMOGN uses neighborhood size, sampling approach, and relevance threshold. G-SMOTE involves quantile for rarity, truncation factor, deformation factor, number of neighbors, and oversampling factor. Dense Loss works with the density weighting parameter.
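
The splitting protocol in the Dataset Splits row (5 runs of 5-fold cross-validation, with an inner 80/20 split of each training portion for tuning) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names, the seeds, and the use of simple random shuffling (rather than any stratified scheme) are assumptions made here.

```python
import random

def repeated_kfold_indices(n_samples, n_splits=5, n_repeats=5, seed=0):
    """Yield (run, fold, train_idx, test_idx) for repeated k-fold CV.

    Assumption: plain random shuffling per run; the source does not say
    whether the folds were stratified by target value.
    """
    rng = random.Random(seed)
    for run in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        # Strided slicing keeps fold sizes within one sample of each other.
        folds = [order[i::n_splits] for i in range(n_splits)]
        for fold, test_idx in enumerate(folds):
            train_idx = [i for j, f in enumerate(folds) if j != fold for i in f]
            yield run, fold, train_idx, test_idx

def inner_split(train_idx, val_fraction=0.2, seed=0):
    """Split the outer-training indices into (sub_train, validation)."""
    rng = random.Random(seed)
    shuffled = list(train_idx)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

if __name__ == "__main__":
    for run, fold, train_idx, test_idx in repeated_kfold_indices(100):
        sub_train, val = inner_split(train_idx, seed=run * 5 + fold)
        # Hyperparameter tuning would fit on sub_train and score on val;
        # test_idx is held out for the final evaluation of this fold.
        print(run, fold, len(sub_train), len(val), len(test_idx))
```

With 100 samples, each of the 25 outer folds has 80 training and 20 test indices, and the inner split leaves 64 sub-training and 16 validation indices.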
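
The network described in the Experiment Setup row could be approximated with scikit-learn's MLPRegressor, as in the sketch below. The summary does not name the deep-learning framework, learning rate, batch size, or early-stopping patience, so the choice of scikit-learn, the helper name build_mlp, the validation fraction, and the patience value are all assumptions for illustration only.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def build_mlp(seed=0):
    """Three hidden layers of 10 ReLU units, Adam, up to 1000 epochs."""
    return MLPRegressor(
        hidden_layer_sizes=(10, 10, 10),
        activation="relu",          # hidden-layer activation
        solver="adam",
        max_iter=1000,              # for Adam this is the number of epochs
        early_stopping=True,        # stop when the validation score stalls
        validation_fraction=0.2,    # assumption: not stated in the source
        n_iter_no_change=10,        # assumption: patience not stated
        random_state=seed,
    )

if __name__ == "__main__":
    # Synthetic data only, to show the call pattern.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    y = X[:, 0] ** 2 + 0.1 * rng.normal(size=200)
    model = build_mlp().fit(X, y)
    print(model.predict(X[:3]))
```

MLPRegressor applies an identity (linear) activation at the output layer, which matches the regression output described in the row.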