MixMax: Distributional Robustness in Function Space via Optimal Data Mixtures

Authors: Anvith Thudi, Chris Maddison

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we found that MixMax matched or outperformed the standard group DRO baselines; in particular, MixMax improved the performance of XGBoost over the only baseline, data balancing, for variations of the ACSIncome and CelebA annotations datasets.
Researcher Affiliation | Academia | Anvith Thudi, Department of Computer Science, University of Toronto and Vector Institute, EMAIL; Chris J. Maddison, Department of Computer Science, University of Toronto and Vector Institute, EMAIL
Pseudocode | Yes | Algorithm 1: Empirical MixMax
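The report only names Algorithm 1, not its body. As a hedged illustration of the kind of iterative update such an algorithm performs, here is a minimal exponentiated-gradient (entropic mirror ascent) loop on mixture weights over the probability simplex, run for a fixed number of steps with step size η as the Experiment Setup row describes; the gradient oracle and its toy values below are stand-ins, not the paper's objective.

```python
import math

def mirror_ascent_step(weights, grads, eta):
    """One exponentiated-gradient step on the simplex: multiply by
    exp(eta * grad) componentwise, then renormalize to sum to 1."""
    new = [w * math.exp(eta * g) for w, g in zip(weights, grads)]
    z = sum(new)
    return [w / z for w in new]

def emixmax_sketch(objective_grad, k, steps=10, eta=2.0):
    """Iterate mirror ascent from the uniform mixture over k groups.
    objective_grad maps current weights to per-group gradient estimates
    (a stand-in for the paper's group-loss terms)."""
    lam = [1.0 / k] * k
    for _ in range(steps):
        lam = mirror_ascent_step(lam, objective_grad(lam), eta)
    return lam

# Toy gradient oracle with constant per-group values: the update piles
# mass onto the group with the largest gradient (group 0 here).
lam = emixmax_sketch(lambda lam: [1.0, 0.5], k=2, steps=10, eta=2.0)
```

With constant gradients the iterates concentrate exponentially fast on the largest-gradient group, which is why only a handful of steps can already bring the objective within a small tolerance between iterates.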
Open Source Code | No | The paper mentions 'GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, Mar. 2021. URL https://doi.org/10.5281/zenodo.5297715. If you use this software, please cite it using these metadata.' This refers to a third-party tool used by the authors, not their own implementation of the methodology described in the paper.
Open Datasets | Yes | We selected ACSIncome (Ding et al., 2021; released under the MIT license) and CelebA annotations (Liu et al., 2015; released for non-commercial use) to test on.
Dataset Splits | Yes | We used random 80%/20% train-test splits in all settings. We applied E2MixMax given a small transformer trained for next-token prediction on 600 of the 800 training samples per length (leaving the other 200 training samples per length to run EMixMax).
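The quoted split procedure (random 80%/20% train-test) can be sketched with the standard library alone; the fixed seed below is illustrative, not something the paper specifies.

```python
import random

def train_test_split(samples, test_frac=0.2, seed=0):
    """Randomly partition samples into train/test sets.
    The seed is an illustrative assumption for reproducibility."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    n_test = int(round(test_frac * len(samples)))
    test_idx = set(idx[:n_test])
    train = [s for i, s in enumerate(samples) if i not in test_idx]
    test = [s for i, s in enumerate(samples) if i in test_idx]
    return train, test

# 1000 samples -> 800 train / 200 test, matching the 80%/20% ratio
train, test = train_test_split(list(range(1000)))
```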
Hardware Specification | Yes | We used NVIDIA RTX 2080 Ti and A100 GPUs to accelerate our experiments involving small transformers, and otherwise used Intel Xeon Silver 4210 CPUs and AMD EPYC 7643 CPUs.
Software Dependencies | No | The paper mentions specific models and optimizers (GPT-Neo, XGBoost, AdamW, PyTorch hyperparameters) and methods (Gaussian kernel density estimation with Scott's rule), but does not provide version numbers for any software libraries or frameworks used in its implementation.
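For the Gaussian kernel density estimation with Scott's rule mentioned above, a one-dimensional sketch is straightforward; this is a generic textbook construction (Scott's bandwidth h = σ̂ · n^(-1/5) in 1-D), not the paper's implementation, which the report notes is unversioned.

```python
import math

def scott_bandwidth(xs):
    """Scott's rule for a 1-D Gaussian KDE: h = sigma_hat * n^(-1/5),
    with sigma_hat the sample standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return math.sqrt(var) * n ** (-0.2)

def gaussian_kde(xs, x, h=None):
    """Evaluate the Gaussian KDE of samples xs at point x,
    defaulting to the Scott's-rule bandwidth."""
    if h is None:
        h = scott_bandwidth(xs)
    norm = len(xs) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs) / norm
```

SciPy's `scipy.stats.gaussian_kde` applies the same rule (generalized to d dimensions as n^(-1/(d+4))) as its default bandwidth selector.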
Experiment Setup | Yes | In our experiments, we ran EMixMax for 10 steps with η = 2.0 for all the sequence modeling tasks, and for 20 steps with η = 0.1 for the tabular datasets unless otherwise specified; preliminary testing showed that this was enough for the objective to converge within 0.01 between iterates. The transformer used is GPT-Neo (Black et al., 2021) with 6 hidden states, 2 hidden layers, 2 attention heads, intermediate size 8, and 12 max position embeddings... The proxy model is trained for 20 epochs using AdamW with learning rates 0.01, 0.001, and 0.0001 (and otherwise default PyTorch hyperparameters).
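The quoted GPT-Neo dimensions can be assembled into a Hugging Face `transformers` config as a hedged sketch; `vocab_size` and `attention_types` are assumptions not stated in the report, so treat this as a config fragment rather than the authors' exact setup.

```python
from transformers import GPTNeoConfig, GPTNeoForCausalLM

# Reconstruction of the paper's small transformer from the quoted numbers.
config = GPTNeoConfig(
    hidden_size=6,                      # "6 hidden states"
    num_layers=2,                       # 2 hidden layers
    num_heads=2,                        # 2 attention heads
    intermediate_size=8,                # intermediate size 8
    max_position_embeddings=12,         # 12 max position embeddings
    attention_types=[[["global"], 2]],  # assumed: all-global attention
    vocab_size=256,                     # assumed: small vocabulary
)
model = GPTNeoForCausalLM(config)
```

The proxy model would then be trained for 20 epochs with `torch.optim.AdamW` at each of the quoted learning rates (0.01, 0.001, 0.0001), keeping PyTorch's default hyperparameters otherwise.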