Speeding up Policy Simulation in Supply Chain RL
Authors: Vivek Farias, Joren Gijsbrechts, Aryan I. Khojandi, Tianyi Peng, Andrew Zheng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate practical speedups of 400x on large-scale SCO problems even with a single GPU, and also demonstrate practical efficacy in other RL environments. We implement two practical optimizations that improve performance (but do not impact our theoretical analysis): First, if t_reset is the smallest t for which α^k_t ≠ α^{k-1}_t, then we know that the actions evaluated for times t < t_reset are correct, and it is sufficient to start the k-th Picard iteration at time t = t_reset. Second, as opposed to running Picard iteration over the entire horizon, we run the iteration in chunks of size max_steps and move on to the next chunk only after convergence of the preceding one. More precisely, we run the for loop in Line 5 of the algorithm over t ∈ [t_reset, min(T, t_reset + max_steps)]. Tuning the max_steps parameter thus trades off the need for synchronization (the number of iterations of the while loop in Line 2) against the potential for wasted computation (the number of iterations of the for loop in Line 5). |
| Researcher Affiliation | Academia | 1Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA 02139 2Esade Business School, Ramon Llull University, Barcelona, Spain 3Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139 4Columbia Business School, New York, NY 10027 5Sauder School of Business, The University of British Columbia, Vancouver, Canada. Correspondence to: Vivek Farias <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 The Picard Iteration |
| Open Source Code | Yes | The source code has been made publicly available (anonymized in the supplementary material for review). The code is implemented in JAX (Bradbury et al., 2018) and is available on GitHub: https://github.com/atzheng/picard-iteration-icml. See further details in Appendix A.3. |
| Open Datasets | No | Our experiments use synthetic data based broadly on fulfillment-network and demand-distribution patterns observed at modern (moderately) large industrial-scale retailers. We consider a base setting inspired by Walmart, which operates over 4,000 stores (Walmart, 2022) to fulfill their online demand. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits. It describes generating synthetic data for problem instances and collecting policy iterates for simulation, but not specific partitioning percentages, counts, or references to standard splits for experimental reproduction. |
| Hardware Specification | Yes | All experiments were conducted on a single A100 GPU with 40GB of VRAM. |
| Software Dependencies | No | The code is implemented in JAX (Bradbury et al., 2018). The paper mentions the JAX library but does not provide a specific version number for JAX or any other key software dependencies. |
| Experiment Setup | Yes | For policy evaluation, we approximate a greedy-like policy using a simple MLP with two hidden layers of width 64, with the goal of constructing a policy with predictable behavior that represents a realistic computational workload. For optimizing θ, we perform 1K gradient steps using Adam with learning rate 3e-3. In a separate setting, for numerical stability, we set a learning rate of 3e-5 and perform one update per trajectory instead of 10. We use 4,000 parallel workers and set max_steps=130; each policy call is an MLP with two hidden layers of width 512. |
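The two optimizations quoted in the Research Type row (restarting at t_reset, and iterating in chunks of max_steps) can be sketched as follows. This is a minimal illustrative reconstruction, not the paper's JAX implementation: `picard_rollout`, `replay_state`, `step`, and `policy` are hypothetical names, the environment is a deterministic scalar system, and the "parallel" policy evaluation is a single batched call rather than a GPU-distributed one.

```python
import numpy as np

def replay_state(step, s0, actions, t):
    """Replay the (already-converged) prefix of actions to recover the state at time t."""
    s = s0
    for u in range(t):
        s = step(s, actions[u])
    return s

def picard_rollout(policy, step, s0, T, max_steps=32, max_iters=1000):
    """Chunked Picard iteration for policy simulation (sketch).

    policy(states) maps a batch of states to a batch of actions;
    step(s, a) is a deterministic transition function.
    """
    actions = np.zeros(T)          # current action iterate α^k
    t_reset = 0                    # first time whose action may still change
    for _ in range(max_iters):     # the "while" loop of the algorithm
        if t_reset >= T:
            break                  # all T actions have converged
        chunk_end = min(T, t_reset + max_steps)
        # Roll states forward over the chunk using the previous action iterate.
        states = np.zeros(chunk_end + 1)
        states[t_reset] = replay_state(step, s0, actions, t_reset)
        for t in range(t_reset, chunk_end):
            states[t + 1] = step(states[t], actions[t])
        # Re-evaluate the policy on all chunk states in one batched call
        # (the step that runs in parallel on the GPU in the paper's setting).
        new_actions = policy(states[t_reset:chunk_end])
        changed = np.nonzero(new_actions != actions[t_reset:chunk_end])[0]
        actions[t_reset:chunk_end] = new_actions
        if changed.size == 0:
            t_reset = chunk_end            # chunk converged; advance to next chunk
        else:
            t_reset += changed[0]          # actions before the first change are correct
    return actions
```

As a usage example, with `step = lambda s, a: s + a` and a threshold policy `lambda states: np.where(states < 5, 1.0, 0.0)` starting from `s0 = 0.0`, the rollout converges to five 1-actions followed by 0-actions. The chunking trades synchronization rounds (outer-loop iterations) against wasted inner-loop work, exactly the max_steps trade-off described above.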