DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback

Authors: Zaid Khan, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experiment with DATAENVGYM environments in four domains: visual question answering, mathematics, programming, and tool-use. For visual question answering, we use GQA (Hudson & Manning, 2019) and NaturalBench (Li et al., 2024); for mathematics, we use MATH (Hendrycks et al., 2021); for programming, we use LiveCodeBench (Jain et al., 2024); for tool-use, we use m&m's (Ma et al., 2024). ... Tab. 2 presents results on example instantiations of environments within DATAENVGYM. Here, we compare students before and after a multi-step trajectory of training across environments, with different data generation policies.
Researcher Affiliation | Academia | Zaid Khan, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal (UNC Chapel Hill)
Pseudocode | No | The paper describes the DATAENVGYM framework, its components, and processes using prose and diagrams (e.g., Figure 1, Figure 2). It also provides LLM prompt templates (e.g., Figures 13-18), which serve as prompt examples but not as general pseudocode or algorithm blocks for the overall methodology.
Open Source Code | Yes | REPRODUCIBILITY STATEMENT: We will publicly release our code and leaderboard. For all experiments, we use publicly available datasets and student models. Project page: https://DataEnvGym.github.io.
Open Datasets | Yes | For visual question answering, we use GQA (Hudson & Manning, 2019) and NaturalBench (Li et al., 2024); for mathematics, we use MATH (Hendrycks et al., 2021); for programming, we use LiveCodeBench (Jain et al., 2024); for tool-use, we use m&m's (Ma et al., 2024).
Dataset Splits | Yes | For GQA, we create validation and test splits by doing a balanced stratified sampling of the validation and testdev sets repeatedly. Specifically, we sample 5 questions from each of the 100 question types in GQA, following Gupta & Kembhavi (2023). For MATH, we create a validation set by doing balanced stratified sampling of the test set across all levels and topics in MATH, selecting 50 from each group. We use the official test set for MATH. For LiveCodeBench, we create a validation set by choosing all problems that are in the 2nd release but not in the 1st release as our validation set, and use the entire 1st release as our test set. This results in a relatively small validation set of only 100 problems.
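The balanced stratified sampling described in the split construction above (e.g., 5 questions per GQA question type, 50 problems per MATH level/topic group) can be sketched as follows. The function name and data layout are illustrative assumptions, not the paper's released code.

```python
import random
from collections import defaultdict

def balanced_stratified_sample(examples, group_key, per_group, seed=0):
    """Sample `per_group` items from each group (e.g., a GQA question type).

    `examples` is a list of dicts; `group_key` names the field to stratify on.
    A hypothetical sketch of balanced stratified sampling, not the authors' code.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    groups = defaultdict(list)
    for ex in examples:
        groups[ex[group_key]].append(ex)
    sample = []
    for key in sorted(groups):  # deterministic group order
        pool = groups[key]
        sample.extend(rng.sample(pool, min(per_group, len(pool))))
    return sample
```

Sampling repeatedly with different seeds, as the quoted text describes, yields multiple balanced splits from the same source set.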
Hardware Specification | Yes | Training time for most environments, even with a single A6000 GPU, is typically less than 30 minutes.
Software Dependencies | Yes | We use supervised finetuning for training using the Transformers (Wolf et al., 2020) library. We use the Transformers (Wolf et al., 2020) and Llama-Factory (Zheng et al., 2024) libraries for training. We train PaliGemma-3b-pt-224 (Beyer et al., 2024) for 10 epochs using the AdamW (Loshchilov & Hutter, 2017) optimizer with 2 warmup steps, a learning rate of 2×10⁻⁵, a weight decay of 10⁻⁶, the BF16 datatype, and a batch size of 16. We apply LoRA (Hu et al., 2022) with a rank of 16, an alpha of 32, no bias, and a dropout of 0.05.
Experiment Setup | Yes | We train PaliGemma-3b-pt-224 (Beyer et al., 2024) for 10 epochs using the AdamW (Loshchilov & Hutter, 2017) optimizer with 2 warmup steps, a learning rate of 2×10⁻⁵, a weight decay of 10⁻⁶, the BF16 datatype, and a batch size of 16. We apply LoRA (Hu et al., 2022) with a rank of 16, an alpha of 32, no bias, and a dropout of 0.05. We apply LoRA to all linear layers. We use the Adam optimizer with a batch size of 16 and a cosine learning rate scheduler with a warmup ratio of 0.1 and train for 3 epochs in the FP16 datatype. We apply LoRA to all linear layers with a rank of 16, an alpha of 32, no bias, and a dropout of 0.05. We truncate all training examples to a maximum length of 1024 tokens.
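As a rough illustration of the quoted hyperparameters, here is a hedged sketch of an equivalent Transformers + PEFT configuration. `LoraConfig` and `TrainingArguments` are real library classes, but the adapter targeting and paths are assumptions made for illustration, not the authors' released code.

```python
# Hypothetical reconstruction of the quoted fine-tuning setup
# (PaliGemma student, LoRA rank 16 / alpha 32, AdamW, BF16).
# This sketches the stated hyperparameters; it is not the authors' code.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,                         # LoRA rank
    lora_alpha=32,                # LoRA scaling alpha
    lora_dropout=0.05,
    bias="none",                  # "no bias"
    target_modules="all-linear",  # "all linear layers" (requires a recent peft)
)

training_args = TrainingArguments(
    output_dir="out",             # placeholder path
    num_train_epochs=10,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=1e-6,
    warmup_steps=2,
    bf16=True,                    # BF16 datatype; AdamW is the default optimizer
)
```

The second quoted configuration (3 epochs, FP16, cosine schedule with warmup ratio 0.1) would swap in `num_train_epochs=3`, `fp16=True`, `lr_scheduler_type="cosine"`, and `warmup_ratio=0.1`.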