MixMax: Distributional Robustness in Function Space via Optimal Data Mixtures

Authors: Anvith Thudi, Chris Maddison

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we found that MixMax matched or outperformed the standard group DRO baselines; in particular, MixMax improved the performance of XGBoost over the only baseline, data balancing, for variations of the ACSIncome and CelebA annotations datasets.
Researcher Affiliation | Academia | Anvith Thudi, Department of Computer Science, University of Toronto and Vector Institute, EMAIL; Chris J. Maddison, Department of Computer Science, University of Toronto and Vector Institute, EMAIL
Pseudocode | Yes | Algorithm 1: Empirical MixMax
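The report only names Algorithm 1, not its body. As a hedged illustration of the kind of iterative update such an algorithm performs, here is a minimal exponentiated-gradient (entropic mirror ascent) loop on mixture weights over the probability simplex, run for a fixed number of steps with step size η as the Experiment Setup row describes; the gradient oracle and its toy values below are stand-ins, not the paper's objective.

```python
import math

def mirror_ascent_step(weights, grads, eta):
    """One exponentiated-gradient step on the simplex: multiply by
    exp(eta * grad) componentwise, then renormalize to sum to 1."""
    new = [w * math.exp(eta * g) for w, g in zip(weights, grads)]
    z = sum(new)
    return [w / z for w in new]

def emixmax_sketch(objective_grad, k, steps=10, eta=2.0):
    """Iterate mirror ascent from the uniform mixture over k groups.
    objective_grad maps current weights to per-group gradient estimates
    (a stand-in for the paper's group-loss terms)."""
    lam = [1.0 / k] * k
    for _ in range(steps):
        lam = mirror_ascent_step(lam, objective_grad(lam), eta)
    return lam

# Toy gradient oracle with constant per-group values: the update piles
# mass onto the group with the largest gradient (group 0 here).
lam = emixmax_sketch(lambda lam: [1.0, 0.5], k=2, steps=10, eta=2.0)
```

With constant gradients the iterates concentrate exponentially fast on the largest-gradient group, which is why only a handful of steps can already bring the objective within a small tolerance between iterates.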
Open Source Code | No | The paper mentions 'GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, Mar. 2021. URL https://doi.org/10.5281/zenodo.5297715. If you use this software, please cite it using these metadata.' This refers to a third-party tool used by the authors, not their own implementation of the methodology described in the paper.
Open Datasets | Yes | We selected ACSIncome (Ding et al., 2021; released under the MIT license) and CelebA annotations (Liu et al., 2015; released for non-commercial use) to test on.
Dataset Splits | Yes | We used random 80%/20% train-test splits in all settings. We applied E2MixMax given a small transformer trained for next-token prediction on 600 of the 800 training samples per length (leaving the other 200 training samples per length to run EMixMax).
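The quoted split procedure (random 80%/20% train-test) can be sketched with the standard library alone; the fixed seed below is illustrative, not something the paper specifies.

```python
import random

def train_test_split(samples, test_frac=0.2, seed=0):
    """Randomly partition samples into train/test sets.
    The seed is an illustrative assumption for reproducibility."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    n_test = int(round(test_frac * len(samples)))
    test_idx = set(idx[:n_test])
    train = [s for i, s in enumerate(samples) if i not in test_idx]
    test = [s for i, s in enumerate(samples) if i in test_idx]
    return train, test

# 1000 samples -> 800 train / 200 test, matching the 80%/20% ratio
train, test = train_test_split(list(range(1000)))
```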
Hardware Specification | Yes | We used NVIDIA RTX 2080 Ti and A100 GPUs to accelerate our experiments involving small transformers, and otherwise used Intel Xeon Silver 4210 CPUs and AMD EPYC 7643 CPUs.
Software Dependencies | No | The paper mentions specific models and optimizers (GPT-Neo, XGBoost, AdamW, PyTorch hyperparameters) and methods (Gaussian kernel density estimation with Scott's rule), but does not provide version numbers for any software libraries or frameworks used in its implementation.
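For the Gaussian kernel density estimation with Scott's rule mentioned above, a one-dimensional sketch is straightforward; this is a generic textbook construction (Scott's bandwidth h = σ̂ · n^(-1/5) in 1-D), not the paper's implementation, which the report notes is unversioned.

```python
import math

def scott_bandwidth(xs):
    """Scott's rule for a 1-D Gaussian KDE: h = sigma_hat * n^(-1/5),
    with sigma_hat the sample standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return math.sqrt(var) * n ** (-0.2)

def gaussian_kde(xs, x, h=None):
    """Evaluate the Gaussian KDE of samples xs at point x,
    defaulting to the Scott's-rule bandwidth."""
    if h is None:
        h = scott_bandwidth(xs)
    norm = len(xs) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs) / norm
```

SciPy's `scipy.stats.gaussian_kde` applies the same rule (generalized to d dimensions as n^(-1/(d+4))) as its default bandwidth selector.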
Experiment Setup | Yes | In our experiments, we ran EMixMax for 10 steps with η = 2.0 for all the sequence modeling tasks, and for 20 steps with η = 0.1 for the tabular datasets unless otherwise specified; preliminary testing showed that this was enough for the objective to converge within 0.01 between iterates. The transformer used is GPT-Neo (Black et al., 2021) with 6 hidden states, 2 hidden layers, 2 attention heads, intermediate size 8, and 12 max position embeddings... The proxy model is trained for 20 epochs using AdamW with learning rates 0.01, 0.001, and 0.0001 (and otherwise default PyTorch hyperparameters).
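The quoted GPT-Neo dimensions can be assembled into a Hugging Face `transformers` config as a hedged sketch; `vocab_size` and `attention_types` are assumptions not stated in the report, so treat this as a config fragment rather than the authors' exact setup.

```python
from transformers import GPTNeoConfig, GPTNeoForCausalLM

# Reconstruction of the paper's small transformer from the quoted numbers.
config = GPTNeoConfig(
    hidden_size=6,                      # "6 hidden states"
    num_layers=2,                       # 2 hidden layers
    num_heads=2,                        # 2 attention heads
    intermediate_size=8,                # intermediate size 8
    max_position_embeddings=12,         # 12 max position embeddings
    attention_types=[[["global"], 2]],  # assumed: all-global attention
    vocab_size=256,                     # assumed: small vocabulary
)
model = GPTNeoForCausalLM(config)
```

The proxy model would then be trained for 20 epochs with `torch.optim.AdamW` at each of the quoted learning rates (0.01, 0.001, 0.0001), keeping PyTorch's default hyperparameters otherwise.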