DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback
Authors: Zaid Khan, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment with DATAENVGYM environments in four domains: visual question answering, mathematics, programming and tool-use. For visual question answering, we use GQA (Hudson & Manning, 2019) and NaturalBench (Li et al., 2024); for mathematics, we use MATH (Hendrycks et al., 2021); for programming, we use LiveCodeBench (Jain et al., 2024); for tool-use, we use MnMs (Ma et al., 2024). ... Tab. 2 presents results on example instantiations of environments within DATAENVGYM. Here, we compare students before and after a multi-step trajectory of training across environments, with different data generation policies. |
| Researcher Affiliation | Academia | Zaid Khan Elias Stengel-Eskin Jaemin Cho Mohit Bansal UNC Chapel Hill EMAIL |
| Pseudocode | No | The paper describes the DATAENVGYM framework, its components, and processes using prose and diagrams (e.g., Figure 1, Figure 2). It also provides LLM prompt templates (e.g., Figures 13-18), but these are prompts rather than pseudocode or algorithm blocks for the overall methodology. |
| Open Source Code | Yes | REPRODUCIBILITY STATEMENT: We will publicly release our code and leaderboard. For all experiments, we use publicly available datasets and student models. Project page: https://dataenvgym.github.io. |
| Open Datasets | Yes | For visual question answering, we use GQA (Hudson & Manning, 2019) and NaturalBench (Li et al., 2024); for mathematics, we use MATH (Hendrycks et al., 2021); for programming, we use LiveCodeBench (Jain et al., 2024); for tool-use, we use MnMs (Ma et al., 2024). |
| Dataset Splits | Yes | For GQA, we create validation and test splits by doing a balanced stratified sampling of the validation and testdev sets. Specifically, we sample 5 questions from each of the 100 question types in GQA, following Gupta & Kembhavi (2023). For MATH, we create a validation set by doing balanced stratified sampling of the test set across all levels and topics in MATH, selecting 50 from each group. We use the official test set for MATH. For LiveCodeBench, we create a validation set by choosing all problems that are in the 2nd release but not in the 1st release, and use the entire 1st release as our test set. This results in a relatively small validation set of only 100 problems. |
| Hardware Specification | Yes | Training time for most environments, even with a single A6000 GPU, is typically less than 30 minutes. |
| Software Dependencies | Yes | We use supervised finetuning for training using the Transformers (Wolf et al., 2020) library. We use Transformers (Wolf et al., 2020) and Llama-Factory (Zheng et al., 2024) libraries for training. We train PaliGemma-3b-pt-224 (Beyer et al., 2024) for 10 epochs using the AdamW (Loshchilov & Hutter, 2017) optimizer with 2 warmup steps and a learning rate of 2×10⁻⁵, a weight decay of 10⁻⁶ using the BF16 datatype and batch size of 16. We apply LoRA (Hu et al., 2022) with a rank of 16 and an alpha of 32, no bias, and a dropout of 0.05. |
| Experiment Setup | Yes | We train PaliGemma-3b-pt-224 (Beyer et al., 2024) for 10 epochs using the AdamW (Loshchilov & Hutter, 2017) optimizer with 2 warmup steps and a learning rate of 2×10⁻⁵, a weight decay of 10⁻⁶ using the BF16 datatype and batch size of 16. We apply LoRA (Hu et al., 2022) with a rank of 16 and an alpha of 32, no bias, and a dropout of 0.05. We apply LoRA to all linear layers. We use the Adam optimizer with a batch size of 16 and a cosine learning rate scheduler with a warmup ratio of 0.1 and train for 3 epochs in the FP16 datatype. We apply LoRA to all linear layers with a rank of 16 and an alpha of 32, no bias, and a dropout of 0.05. We truncate all training examples to a maximum length of 1024 tokens. |
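The Dataset Splits row describes balanced stratified sampling (e.g., 5 questions per GQA question type). A minimal sketch of that split construction, assuming a simple list-of-dicts record layout; the function name and fields are illustrative, not taken from the paper's released code:

```python
import random
from collections import defaultdict

def balanced_stratified_sample(questions, per_group=5, key=lambda q: q["type"], seed=0):
    """Sample up to `per_group` items from each stratum (e.g., each GQA question type).

    Illustrative sketch only: record layout and signature are assumptions.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for q in questions:
        groups[key(q)].append(q)  # bucket items by stratum
    sample = []
    for items in groups.values():
        # sample without replacement; take the whole group if it is small
        sample.extend(rng.sample(items, min(per_group, len(items))))
    return sample

# Usage: 100 question types (as in GQA), 5 samples each -> 500-item split.
questions = [{"type": t, "qid": i} for t in range(100) for i in range(10)]
split = balanced_stratified_sample(questions, per_group=5)
```

The MATH validation split described in the row is the same procedure with strata defined by (level, topic) pairs and `per_group=50`.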
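The hyperparameters quoted in the Software Dependencies and Experiment Setup rows can be collected into a single Trainer/Llama-Factory-style configuration. A hedged sketch for the PaliGemma student; the dict keys follow common Hugging Face argument names as an assumption, and `finetune_config` is an illustrative name, not from the paper:

```python
# LoRA finetuning configuration for PaliGemma-3b-pt-224 as reported in the table.
# Key names mirror Hugging Face TrainingArguments / peft.LoraConfig conventions
# (an assumption); the values themselves are quoted from the paper.
finetune_config = {
    "model_name_or_path": "google/paligemma-3b-pt-224",
    "num_train_epochs": 10,
    "optim": "adamw",              # AdamW (Loshchilov & Hutter, 2017)
    "warmup_steps": 2,
    "learning_rate": 2e-5,
    "weight_decay": 1e-6,
    "bf16": True,
    "per_device_train_batch_size": 16,
    # LoRA (Hu et al., 2022), applied to all linear layers
    "lora": {
        "r": 16,
        "lora_alpha": 32,
        "bias": "none",
        "lora_dropout": 0.05,
        "target_modules": "all-linear",
    },
}
```

The second setup quoted in the Experiment Setup row differs mainly in precision (FP16), scheduler (cosine with 0.1 warmup ratio), epochs (3), and a 1024-token truncation length, while keeping the same LoRA settings.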