Speeding up Policy Simulation in Supply Chain RL
Authors: Vivek Farias, Joren Gijsbrechts, Aryan I. Khojandi, Tianyi Peng, Andrew Zheng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate practical speedups of 400x on large-scale SCO problems even with a single GPU, and also demonstrate practical efficacy in other RL environments. We implement two practical optimizations that improve performance (but do not impact our theoretical analysis): First, if t_reset is the smallest t for which α^k_t ≠ α^{k-1}_t, then we know that the actions evaluated for times t < t_reset are correct, and it is sufficient to start the k-th Picard iteration at time t = t_reset. Second, as opposed to running Picard iteration over the entire horizon, we run the iteration in chunks of size max_steps and move on to the next chunk only after convergence of the preceding one. More precisely, we run the for loop in Line 5 of the algorithm over t ∈ [t_reset, min(T, t_reset + max_steps)]. Tuning the max_steps parameter thus trades off the need for synchronization (the number of iterations of the while loop in Line 2) against the potential for wasted computation (the number of iterations of the for loop in Line 5). |
| Researcher Affiliation | Academia | 1Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA 02139 2Esade Business School, Ramon Llull University, Barcelona, Spain 3Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139 4Columbia Business School, New York, NY 10027 5Sauder School of Business, The University of British Columbia, Vancouver, Canada. Correspondence to: Vivek Farias <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 The Picard Iteration |
| Open Source Code | Yes | The source code has been made publicly available (anonymized in the supplementary material for review). The code is implemented in JAX (Bradbury et al., 2018) and is available on GitHub: https://github.com/atzheng/picard-iteration-icml. See further details in Appendix A.3. |
| Open Datasets | No | Our experiments use synthetic data based broadly on fulfillment-network and demand-distribution patterns observed at modern (moderately) large industrial-scale retailers. We consider a base setting inspired by Walmart, which operates over 4,000 stores (Walmart, 2022) to fulfill their online demand. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits. It describes generating synthetic data for problem instances and collecting policy iterates for simulation, but not specific partitioning percentages, counts, or references to standard splits for experimental reproduction. |
| Hardware Specification | Yes | All experiments were conducted on a single A100 GPU with 40GB of VRAM. |
| Software Dependencies | No | The code is implemented in JAX (Bradbury et al., 2018). The paper mentions the JAX library but does not provide a specific version number for JAX or any other key software dependencies. |
| Experiment Setup | Yes | For policy evaluation, we approximate a greedy-like policy using a simple MLP with two hidden layers of width 64, with the goal of constructing a policy with predictable behavior that represents a realistic computational workload. For optimizing θ, we perform 1K gradient steps using Adam with learning rate 3e-3. In a separate setting, for numerical stability, we set a learning rate of 3e-5 and perform one update per trajectory instead of 10. We use 4,000 parallel workers and set max_steps=130; each policy call is an MLP with two hidden layers of width 512. |
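The two optimizations quoted in the Research Type row (restarting at t_reset, and iterating in chunks of max_steps) can be sketched as follows. This is a minimal illustrative reconstruction, not the paper's JAX implementation: `picard_rollout`, `replay_state`, `step`, and `policy` are hypothetical names, the environment is a deterministic scalar system, and the "parallel" policy evaluation is a single batched call rather than a GPU-distributed one.

```python
import numpy as np

def replay_state(step, s0, actions, t):
    """Replay the (already-converged) prefix of actions to recover the state at time t."""
    s = s0
    for u in range(t):
        s = step(s, actions[u])
    return s

def picard_rollout(policy, step, s0, T, max_steps=32, max_iters=1000):
    """Chunked Picard iteration for policy simulation (sketch).

    policy(states) maps a batch of states to a batch of actions;
    step(s, a) is a deterministic transition function.
    """
    actions = np.zeros(T)          # current action iterate α^k
    t_reset = 0                    # first time whose action may still change
    for _ in range(max_iters):     # the "while" loop of the algorithm
        if t_reset >= T:
            break                  # all T actions have converged
        chunk_end = min(T, t_reset + max_steps)
        # Roll states forward over the chunk using the previous action iterate.
        states = np.zeros(chunk_end + 1)
        states[t_reset] = replay_state(step, s0, actions, t_reset)
        for t in range(t_reset, chunk_end):
            states[t + 1] = step(states[t], actions[t])
        # Re-evaluate the policy on all chunk states in one batched call
        # (the step that runs in parallel on the GPU in the paper's setting).
        new_actions = policy(states[t_reset:chunk_end])
        changed = np.nonzero(new_actions != actions[t_reset:chunk_end])[0]
        actions[t_reset:chunk_end] = new_actions
        if changed.size == 0:
            t_reset = chunk_end            # chunk converged; advance to next chunk
        else:
            t_reset += changed[0]          # actions before the first change are correct
    return actions
```

As a usage example, with `step = lambda s, a: s + a` and a threshold policy `lambda states: np.where(states < 5, 1.0, 0.0)` starting from `s0 = 0.0`, the rollout converges to five 1-actions followed by 0-actions. The chunking trades synchronization rounds (outer-loop iterations) against wasted inner-loop work, exactly the max_steps trade-off described above.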