MixMax: Distributional Robustness in Function Space via Optimal Data Mixtures
Authors: Anvith Thudi, Chris Maddison
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we found that MixMax matched or outperformed the standard group DRO baselines, and in particular, MixMax improved the performance of XGBoost over the only baseline, data balancing, for variations of the ACSIncome and CelebA annotations datasets. |
| Researcher Affiliation | Academia | Anvith Thudi, Department of Computer Science, University of Toronto and Vector Institute (EMAIL); Chris J. Maddison, Department of Computer Science, University of Toronto and Vector Institute (EMAIL) |
| Pseudocode | Yes | Algorithm 1: Empirical MixMax |
| Open Source Code | No | The paper mentions 'GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, Mar. 2021. URL https://doi.org/10.5281/zenodo.5297715. If you use this software, please cite it using these metadata.' This refers to a third-party tool used by the authors, not their own implementation code for the methodology described in the paper. |
| Open Datasets | Yes | We selected ACSIncome Ding et al. (2021) (released under the MIT license) and CelebA annotations Liu et al. (2015) (released for non-commercial use) to test on. |
| Dataset Splits | Yes | We used random 80%/20% train-test splits in all settings. We applied E2MixMax given a small transformer trained for next-token prediction on 600 of the 800 training samples per length (leaving the other 200 training samples per length to run EMixMax). |
| Hardware Specification | Yes | We used Nvidia RTX 2080 Ti and A100 GPUs to accelerate our experiments involving small transformers, and otherwise used Intel Xeon Silver 4210 CPUs and AMD EPYC 7643 CPUs. |
| Software Dependencies | No | The paper mentions using specific models and optimizers (GPT-Neo, XGBoost, AdamW, PyTorch hyperparameters) and methods (Gaussian kernel density estimation with Scott's method), but does not provide specific version numbers for any software libraries or frameworks used in its implementation. |
| Experiment Setup | Yes | In our experiments, we ran EMixMax for 10 steps with η = 2.0 for all the sequence modeling tasks, and for 20 steps with η = 0.1 for the tabular datasets unless otherwise specified; preliminary testing showed that this was enough for the objective to converge within 0.01 between iterates. The transformer used is GPT-Neo Black et al. (2021) with 6 hidden states, 2 hidden layers, 2 attention heads, 8 intermediate size, and with 12 max position embeddings... The proxy model is trained for 20 epochs using AdamW with learning rates 0.01, 0.001, and 0.0001 (and otherwise default PyTorch hyperparameters). |
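The Experiment Setup row describes EMixMax as an iterative procedure over mixture weights with a step size η, stopped once the objective changes by less than 0.01 between iterates. A minimal stdlib-only sketch of one plausible reading, exponentiated-gradient ascent over the probability simplex, is below; the `grad_fn`/`obj_fn` oracles and the toy objective are hypothetical placeholders, not the paper's actual MixMax objective.

```python
import math

def emixmax_sketch(grad_fn, obj_fn, k, eta, max_steps, tol=0.01):
    """Exponentiated-gradient ascent over the k-dimensional simplex.

    grad_fn(w) -> list of partial derivatives of the mixture objective at w;
    obj_fn(w)  -> scalar objective value. Both oracles are hypothetical
    stand-ins for the paper's empirical MixMax objective.
    """
    w = [1.0 / k] * k                                  # start at the uniform mixture
    prev = obj_fn(w)
    for _ in range(max_steps):
        g = grad_fn(w)
        w = [wi * math.exp(eta * gi) for wi, gi in zip(w, g)]
        z = sum(w)
        w = [wi / z for wi in w]                       # renormalize onto the simplex
        cur = obj_fn(w)
        if abs(cur - prev) < tol:                      # "converge within 0.01 between iterates"
            break
        prev = cur
    return w

# Toy usage with eta = 2.0 and 10 steps, mirroring the sequence-modeling
# settings: a concave objective whose maximizer puts more weight on group 2.
obj = lambda w: -(w[0] - 0.2) ** 2 - (w[1] - 0.8) ** 2
grad = lambda w: [-2 * (w[0] - 0.2), -2 * (w[1] - 0.8)]
weights = emixmax_sketch(grad, obj, k=2, eta=2.0, max_steps=10)
```

The multiplicative update keeps the weights strictly positive and the final renormalization keeps them on the simplex, so no explicit projection step is needed.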