Adaptive Partitioning Schemes for Optimistic Optimization

Authors: Raja Sunkara, Ardhendu Tripathy

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental When the function is a low-dimensional multi-index function, we theoretically prove improved regret bounds, shown in Table 1. Empirically, we demonstrate the improvement in optimization error on several benchmark functions, including Rastrigin (multi-modal), Branin (multiple minima), and Sharp Ridge (non-differentiable). We pose the quantization of a Large Language Model (LLM) as a high-dimensional black-box optimization problem and obtain an improved perplexity value.
Researcher Affiliation Collaboration ¹Missouri University of Science & Technology, Rolla, MO, US; ²Ops Canvas, Alexandria, VA, US. Correspondence to: Raja Sunkara <EMAIL>, Ardhendu Tripathy <EMAIL>.
Pseudocode Yes Algorithm 1: Obtaining directions for an adaptive partitioning scheme. Require: T, oracle for f which is a multi-index function defined using A (see (1))... Algorithm 2: SequOOL on an adaptive partitioning scheme with a direction selection strategy. Require: Total number of openings n, number of samples T for updating f, integer c stating how often f is updated, number of dimensions m, oracle for f, direction selection strategy τh... Algorithm 3: Implementing lookahead direction selection strategy τh(f). Require: Current partition tree T, height h, estimated function f
Open Source Code Yes All implementation details, benchmark functions, and experiment scripts can be found at our GitHub repository: https://github.com/raja-sunkara/Learned-Partitions-SequOOL
Open Datasets Yes We evaluated our approach on the OPT-1.3B model (Zhang et al., 2022), with results presented in Table 2. Our proposed objective function using SequOOL over 72 dimensions outperformed AWQ, achieving lower perplexity on both WikiText-2 (Merity et al., 2016) and the calibration set (Pile dataset (Gao et al., 2020)).
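The perplexity metric used for the comparison above is the exponential of the average per-token negative log-likelihood. A minimal illustrative sketch (the actual evaluation runs the OPT-1.3B model over WikiText-2 and the Pile calibration set; the helper name below is ours, not from the paper's code):

```python
import math

def perplexity(token_log_probs):
    """Corpus perplexity: exp of the average negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)
```

For example, a model assigning probability 0.5 to every token has perplexity exactly 2.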
Dataset Splits No Here X denotes the input features to the block, cached from a calibration dataset. AWQ uses the parameterization s = s_X^α, where s_X is the activation scale computed from X, α ∈ [0, 1], Q is the quantization function, and W are the original full-precision weights. To determine the optimal α, AWQ applies a 1D grid search over the interval [0, 1]; this parameter controls the scale of activations and influences quantization error. The paper mentions using a "calibration dataset" for LLM quantization but does not provide specific details on how this dataset is split into training, validation, or test sets, nor does it give percentages or counts for these splits. The text focuses on the use of the dataset for caching input features and calculating perplexity, not on its partitioning for model training/evaluation.
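The 1D grid search over α described above can be sketched as follows. This is a simplified illustration under assumed details (uniform symmetric quantization, mean-absolute activation scale, MSE on the block output); function names are ours, not AWQ's actual API:

```python
import numpy as np

def quantize(w, n_bits=4):
    """Simplified uniform symmetric quantization of a weight matrix."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / q_max + 1e-12
    return np.round(w / scale).clip(-q_max, q_max) * scale

def grid_search_alpha(W, X, n_grid=20, n_bits=4):
    """Search alpha in [0, 1] minimizing output error of the quantized block.

    Uses s = s_X ** alpha, with s_X a per-channel activation scale from X.
    Weights are multiplied by s before quantization and inputs divided by s,
    so the full-precision product X @ W is unchanged by the rescaling.
    """
    s_X = np.abs(X).mean(axis=0) + 1e-12     # per-input-channel activation scale
    ref = X @ W                              # full-precision block output
    best_alpha, best_err = 0.0, np.inf
    for alpha in np.linspace(0.0, 1.0, n_grid):
        s = s_X ** alpha
        W_q = quantize(W * s[:, None], n_bits)          # scale up, then quantize
        err = np.mean((ref - (X / s) @ W_q) ** 2)       # MSE vs. full precision
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha, best_err
```

The paper replaces this per-block 1D search with a higher-dimensional black-box optimization solved by SequOOL.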
Hardware Specification Yes We implemented our Large Language Model (LLM) code on hardware equipped with one Quadro RTX 5000 GPU having 16GB VRAM.
Software Dependencies No We employed the Ray package for hyper-parameter tuning. We used the Adam optimizer, and our search space included hidden layer sizes (500, 1000, 2000, 3000), learning rates (log-uniform from 1×10⁻⁴ to 1×10⁻¹), weight decay (log-uniform from 1×10⁻² to 1×10⁻¹), learning-rate step decay with gamma values (uniform from 0.9 to 0.99), and step sizes (500, 1000, 2000). The paper mentions software such as the "Ray package" and "Adam optimizer" but does not specify their version numbers or the version of Python/PyTorch (or similar frameworks) used.
Experiment Setup Yes We used the Adam optimizer, and our search space included hidden layer sizes (500, 1000, 2000, 3000), learning rates (log-uniform from 1×10⁻⁴ to 1×10⁻¹), weight decay (log-uniform from 1×10⁻² to 1×10⁻¹), and learning-rate step decay with gamma values (uniform from 0.9 to 0.99) and step sizes (500, 1000, 2000). We utilized early stopping to prevent overfitting.
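The search space described above can be sketched as a single-sample draw. The paper used the Ray package (which provides equivalent samplers such as tune.choice, tune.loguniform, and tune.uniform); the standard-library version below only mirrors the stated ranges:

```python
import math
import random

def sample_config(rng: random.Random) -> dict:
    """Draw one hyper-parameter configuration from the stated search space."""
    def loguniform(lo, hi):
        # Sample uniformly in log space, then exponentiate.
        return math.exp(rng.uniform(math.log(lo), math.log(hi)))
    return {
        "hidden_size": rng.choice([500, 1000, 2000, 3000]),
        "lr": loguniform(1e-4, 1e-1),             # log-uniform learning rate
        "weight_decay": loguniform(1e-2, 1e-1),   # log-uniform weight decay
        "gamma": rng.uniform(0.9, 0.99),          # step-decay factor
        "step_size": rng.choice([500, 1000, 2000]),
    }
```

In Ray Tune, each entry would map directly to a search-space primitive passed to the tuner.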