Adaptive teachers for amortized samplers

Authors: Minsu Kim, Sanghyeok Choi, Taeyoung Yun, Emmanuel Bengio, Leo Feng, Jarrid Rector-Brooks, Sungsoo Ahn, Jinkyoo Park, Nikolay Malkin, Yoshua Bengio

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the effectiveness of this approach in a synthetic environment designed to present an exploration challenge, two diffusion-based sampling tasks, and four biochemical discovery tasks, demonstrating its ability to improve sample efficiency and mode coverage. ... 5 EXPERIMENTS: This section provides empirical validation of our method.
Researcher Affiliation | Collaboration | Minsu Kim (Mila, KAIST); Sanghyeok Choi (KAIST); Taeyoung Yun (KAIST); Emmanuel Bengio (Recursion); Leo Feng (Mila, Université de Montréal); Jarrid Rector-Brooks (Mila, Université de Montréal); Sungsoo Ahn (KAIST); Jinkyoo Park (KAIST); Nikolay Malkin (University of Edinburgh); Yoshua Bengio (Mila, Université de Montréal)
Pseudocode | Yes | Algorithm 1: Teacher-Student Training of GFlowNets
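The teacher-student loop of Algorithm 1 can be sketched in miniature. The toy below is an illustrative reconstruction, not the authors' implementation: it trains a tabular GFlowNet "student" with the trajectory balance objective on a tiny binary-tree environment, and approximates the "teacher" by greedily training on the candidate trajectory where the student's TB error is largest (the paper instead learns a teacher policy whose reward is derived from the student's loss). All names, constants, and the ε-uniform behavior policy are assumptions standing in for the student/teacher/buffer mixture described below.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 3                                            # 3 binary choices -> 8 terminal states
R = np.array([1., 1., 1., 1., 1., 1., 1., 10.])  # toy reward with one sharp mode

# Tabular student: logits of P_F(a | s) at each internal node (heap-indexed tree).
logits = np.zeros((2 ** L - 1, 2))
log_Z = 0.0                                      # learnable log partition function

def rollout(eps):
    """Sample one trajectory with an eps-uniform behavior policy
    (a stand-in for the paper's student/teacher/buffer mixture)."""
    node, acts = 0, []
    for _ in range(L):
        p = np.exp(logits[node]); p /= p.sum()
        a = rng.choice(2, p=(1 - eps) * p + eps * 0.5)
        acts.append(a)
        node = 2 * node + 1 + a
    return acts

def tb_residual(acts):
    """Trajectory balance residual: log Z + sum log P_F - log R(x).
    P_B = 1 here because every state in a tree has a unique parent."""
    node, logpf = 0, 0.0
    for a in acts:
        p = np.exp(logits[node]); p /= p.sum()
        logpf += np.log(p[a])
        node = 2 * node + 1 + a
    return log_Z + logpf - np.log(R[node - (2 ** L - 1)])

for step in range(4000):
    # Crude "teacher": among a few candidates, train on the trajectory
    # where the student's TB error is largest.
    cands = [rollout(eps=0.3) for _ in range(4)]
    acts = max(cands, key=lambda t: abs(tb_residual(t)))
    res = tb_residual(acts)
    node = 0
    for a in acts:                               # manual gradient of res**2
        p = np.exp(logits[node]); p /= p.sum()
        g = -p; g[a] += 1.0                      # d log P_F(a|s) / d logits[s]
        logits[node] -= 0.05 * 2 * res * g
        node = 2 * node + 1 + a
    log_Z -= 0.05 * 2 * res                      # d res / d log_Z = 1
```

At convergence the residual vanishes on every trajectory, which forces the student's terminal distribution toward R(x)/Z with log Z ≈ log Σ R(x); prioritizing high-error trajectories only changes which residuals get driven down first.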
Open Source Code | Yes | Source code is available at https://github.com/alstn12088/adaptive-teacher.
Open Datasets | Yes | QM9. The objects being sampled are small molecular graphs. ... The reward function is the HOMO-LUMO gap, which is obtained via a pre-trained MXMNet proxy from Zhang et al. (2020). ... TFbind8. The generated objects are DNA sequences with 8 nucleotides. The reward function is a binding affinity to a human transcription factor (Barrera et al., 2016), which is obtained via a pre-trained proxy model provided by Trabucco et al. (2022). ... L14-RNA1. The generated objects are RNA sequences of length 14. The reward function is a binding affinity to a human transcription factor, which is obtained via a pre-trained proxy model from Sinai et al. (2020).
Dataset Splits | No | No training/validation/test splits are provided, as the paper focuses on sampling from target distributions rather than training on a fixed dataset. Evaluation instead compares samples generated by the trained models against target distributions or known modes.
Hardware Specification | No | The research was enabled in part by computational resources provided by the Digital Research Alliance of Canada (https://alliancecan.ca), Mila (https://mila.quebec), and NVIDIA.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or particular library versions) are explicitly mentioned in the paper.
Experiment Setup | Yes | For all tasks, we set α = 0.5, except for the exploration-intensive deceptive grid world tasks, where we use α = 0.0. C is set to 19 for all tasks. ... For neural networks, we use identical architectures for both the Student and Teacher models. Specifically, for the GFN architecture design, we match the architectures used by each baseline for every task. ... When selecting the behavior policy during training, we periodically choose among the Student, Teacher, and buffer in specific proportions: a ratio of 1:1:0 for Grid World tasks, 3:1:2 for Diffusion Sampler tasks, and 2:1:3 for Biochemical tasks. ... For the deceptive grid world, we use a two-layer MLP with 256 hidden units for the parameterized policy P_F(·; θ), along with a learnable parameter for log Z_θ. We train them using the Adam optimizer with a learning rate of 10^-3 for the policy and 10^-1 for log Z_θ. The backward policy P_B is fixed as a uniform random policy. ... We use a batch size of 16. ... We set σ^2 = 5.0 for 25GMM and σ = 1.0 for Manywell, with the number of time steps T = 100... We employ the same architecture as Zhang & Chen (2022) and Sendera et al. (2024), increasing the hidden dimension from 64 to 256 for Manywell... For training GFlowNets, we use the Adam optimizer (Kingma & Ba, 2015) with learning rates of 10^-2 for log Z_θ, 10^-4 for the forward policy, and 5 × 10^-4 for the teacher policy. ... For the QM9 and sEH tasks, we employ a two-layer architecture with 1024 hidden units, while for the other tasks we use a two-layer architecture with 128 hidden units. We initialize log Z_θ to 5.0 for all methods. For the backward policy, we use a fixed uniform policy. In terms of reward exponent, we use a value of 20 for both QM9 and TFbind8. For sEH and L14-RNA1, we use relatively higher values, 200 and 40, respectively.
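The behavior-policy proportions quoted above (e.g. 2:1:3 for the biochemical tasks) can be realized with a simple round-robin schedule. The helper below is a hypothetical illustration: the paper says only that the choice is made "periodically ... in specific proportions", so deterministic cycling, the function name, and the policy labels are all assumptions.

```python
from itertools import cycle

def behaviour_schedule(n_student, n_teacher, n_buffer):
    """Cycle through behavior-policy choices in fixed proportions,
    e.g. 3:1:2 for the diffusion-sampler tasks or 1:1:0 for grid world.
    Each drawn label says which policy generates the next batch."""
    pattern = (["student"] * n_student
               + ["teacher"] * n_teacher
               + ["buffer"] * n_buffer)
    return cycle(pattern)

# Example: the 2:1:3 ratio used for the biochemical tasks.
sched = behaviour_schedule(2, 1, 3)
sources = [next(sched) for _ in range(12)]   # two full cycles of six
```

Over any window that is a multiple of the cycle length, the empirical proportions match the stated ratio exactly; sampling the labels at random would only match them in expectation.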