Learning to Search from Demonstration Sequences
Authors: Dixant Mittal, Liwei Kang, Wee Sun Lee
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study problems from these two scenarios, including Game of 24, 2D grid navigation, and Procgen games, to understand when D-TSN is more helpful. Through our experiments, we show that D-TSN is effective, especially when the world model with a latent state space is jointly learned. The code is available at https://github.com/dixantmittal/differentiable-tree-search-network. Section 4, titled 'EXPERIMENTS', further details empirical evaluations with results presented in tables such as Table 1, Table 2, and Table 3. |
| Researcher Affiliation | Collaboration | Dixant Mittal1,2 Liwei Kang1 Wee Sun Lee1 1National University of Singapore 2 Moovita EMAIL. One author, Dixant Mittal, is affiliated with both National University of Singapore (an academic institution) and Moovita (an industry affiliation). |
| Pseudocode | Yes | A DIFFERENTIABLE TREE SEARCH NETWORK ALGORITHM. A.1 DIFFERENTIABLE TREE SEARCH NETWORK PSEUDO-CODE. Algorithm 1: Differentiable Tree Search (D-TSN) |
| Open Source Code | Yes | The code is available at https://github.com/dixantmittal/differentiable-tree-search-network. |
| Open Datasets | No | For Game of 24, the authors state: 'We collected all valid Game of 24 problems and their solutions through an exhaustive search of all combinations, then randomly selected 530 problems for evaluation. The remaining 527 problems have 16k valid solutions, from which we randomly sampled a subset for training.' For Navigation and Procgen, they state: 'We use a behavior policy, which can be optimal or sub-optimal, to collect demonstration sequences for training.' The paper describes collecting its own datasets for experiments and does not provide public access links or specific citations for these collected datasets. |
| Dataset Splits | Yes | We collected all valid Game of 24 problems and their solutions through an exhaustive search of all combinations, then randomly selected 530 problems for evaluation. The remaining 527 problems have 16k valid solutions, from which we randomly sampled a subset for training. |
| Hardware Specification | No | The paper states: 'Notably, greater depths, such as 3 or more, are infeasible since the resulting computation graph exceeds the memory capacity (roughly 11GB) of a typical consumer-grade GPU.' This describes a general limitation or characteristic of a type of hardware, rather than explicitly specifying the hardware used for the experiments conducted in the paper. |
| Software Dependencies | No | The paper mentions using 'Llama3-8B (Dubey et al., 2024)' and refers to 'Phasic Policy Gradient (PPG) (Cobbe et al., 2021)' as a baseline, but it does not provide specific version numbers for any programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow) used in the implementation of D-TSN or its experiments. |
| Experiment Setup | Yes | We train D-TSN using 8 search iterations in training and compare the resulting value function with a supervised fine-tuned model... For our empirical evaluations, we set the maximum limit for search iterations at 10. For our evaluations, we perform 10 search iterations for each input state. To train this model, we compute the Q-value, Q_θ, without performing the search and optimize the loss defined as: L_Search = λ₁L_Q + λ₂L_D + λ₃L_{T_θ} + λ₄L_{R_θ}. For evaluations, we adhere to a depth of 2 for TreeQN... We limit the number of trajectories to 1000 for each domain to evaluate the sample complexity and generalization capabilities of each method. We fine-tune the hyperparameters, λ₁, λ₂, λ₃ and λ₄, using grid search on a log scale. |
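
The loss quoted in the Experiment Setup row is a weighted sum of four components whose weights are tuned by grid search on a log scale. A minimal sketch of that combination (the function name, weight values, and grid are illustrative assumptions, not the authors' implementation):

```python
# Illustrative sketch of the combined search loss
#   L_Search = λ1·L_Q + λ2·L_D + λ3·L_Tθ + λ4·L_Rθ
# All names and numbers below are hypothetical.

def combined_search_loss(l_q, l_d, l_t, l_r, lambdas):
    """Weighted sum of the four loss components reported in the paper."""
    lam1, lam2, lam3, lam4 = lambdas
    return lam1 * l_q + lam2 * l_d + lam3 * l_t + lam4 * l_r

# The paper tunes λ1..λ4 by grid search on a log scale; a plausible
# candidate grid for each weight might look like:
log_scale_grid = [0.01, 0.1, 1.0, 10.0]
```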