Simple Regret Optimization in Online Planning for Markov Decision Processes
Authors: Z. Feldman, C. Domshlak
JAIR 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical evaluation shows that both BRUE and its generalization, BRUE(α), are also very effective in practice and compare favorably to the state-of-the-art. We have evaluated BRUE empirically on the MDP Sailing domain (Péret & Garcia, 2004), used in previous works for evaluating MCTS algorithms (Péret & Garcia, 2004; Kocsis & Szepesvári, 2006; Tolpin & Shimony, 2012), as well as on an MDP version of random game trees used in the original empirical evaluation of UCT (Kocsis & Szepesvári, 2006). |
| Researcher Affiliation | Academia | Zohar Feldman EMAIL Carmel Domshlak EMAIL Faculty of Industrial Engineering & Management, Technion Israel Institute of Technology, Haifa, Israel |
| Pseudocode | Yes | MCTS [input: ⟨S, A, Tr, R⟩; s0 ∈ S]: while time permits do ρ ← Rollout() // generate rollout; Update(ρ); return arg max_a Q̂(s0·H, a). procedure Rollout ... (Figure 1). procedure Update(ρ): for d = \|ρ\|, ..., 1 do ... (Figure 2). procedure MC-backup(s·h, a, r): nα ← ⌈α·n(s·h, a)⌉; n ← n(s·h, a); L(s·h, a)[n] ← r; Q̂(s·h, a) ← (1/nα) Σ_{i=n−nα+1}^{n} L(s·h, a)[i] (Figure 4). |
| Open Source Code | No | The paper does not contain any explicit statement about making the source code available, nor does it provide a link to a code repository for the methodology described. |
| Open Datasets | Yes | We have evaluated BRUE empirically on the MDP Sailing domain (Péret & Garcia, 2004), used in previous works for evaluating MCTS algorithms (Péret & Garcia, 2004; Kocsis & Szepesvári, 2006; Tolpin & Shimony, 2012), as well as on an MDP version of random game trees used in the original empirical evaluation of UCT (Kocsis & Szepesvári, 2006). |
| Dataset Splits | No | For the Sailing domain: Each algorithm was run on 1000 randomly chosen initial states s0, with the target being fixed at one of the corners of the grid. For random game trees: Figure 6 shows the performance of the four algorithms for two game configurations, B = 6, D = 8 and B = 2, D = 22, with each configuration being represented by 1000 game trees. The paper describes using environments (Sailing domain, game trees) and generating instances/initial states for evaluation, but does not provide details on training/test/validation splits for a static dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details (like CPU/GPU models, memory, or specific computing infrastructure) used for running the experiments. |
| Software Dependencies | No | The paper mentions that 'All four algorithms were implemented within a single software infrastructure' but does not specify any programming languages, libraries, or solvers with version numbers. |
| Experiment Setup | Yes | In the Sailing domain, we set H to 4n, where n is the grid size of the problem instance... the exploration coefficient for UCT and u UCT (parameter c in Eq. 2) was set to the difference between the largest possible and the smallest possible values of the H-step rollouts from the root. In the Sailing domain, this corresponds to the maximal move duration, 6, multiplied by the number of steps-to-go h. ...the exploration coefficient for UCT and u UCT was set to the range of the game values, 127H, since rewards are bounded by the interval [−127/2, 127/2]. ...we used α = n_H / n_h, where n_H denotes the average number of samples of leaf nodes, and n_h denotes the average number of samples of nodes at the same depth as the value estimator under consideration. |
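To make the MC-backup rule quoted in the Pseudocode row concrete (estimating Q̂ from only the most recent ⌈α·n⌉ rewards), here is a minimal Python sketch. The function name and list-based sample store are illustrative choices, not code from the paper:

```python
import math

def mc_backup(samples, r, alpha):
    """Append reward r to the sample list for a (state, action) pair and
    return a BRUE(alpha)-style estimate: the mean of the most recent
    ceil(alpha * n) samples. With alpha = 1 this is the plain Monte-Carlo
    average over all n samples."""
    samples.append(r)
    n = len(samples)
    n_alpha = max(1, math.ceil(alpha * n))  # size of the averaging window
    recent = samples[n - n_alpha:]          # keep only the newest samples
    return sum(recent) / len(recent)
```

For example, with samples [1, 2, 3, 4] and α = 0.5, the window covers the last two rewards, so the estimate is 3.5; the intuition is that later rollouts reflect improved behavior deeper in the tree, so older samples are discarded.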
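The Experiment Setup row fixes the exploration coefficient c to the range of possible rollout values. Assuming the paper's Eq. 2 is the standard UCB1-style selection rule used by UCT (an assumption here, since Eq. 2 itself is not quoted), a sketch of how c enters action selection:

```python
import math

def ucb1_select(q, counts, total, c):
    """Return the action maximizing Q(a) + c * sqrt(ln(N) / n(a)),
    where N = total visits to the node and n(a) = visits of action a.
    Untried actions are selected first. The exploration coefficient c
    is set to the value range of H-step rollouts from the root."""
    best_a, best_v = None, -math.inf
    for a in q:
        if counts[a] == 0:
            return a  # each action is tried at least once
        v = q[a] + c * math.sqrt(math.log(total) / counts[a])
        if v > best_v:
            best_a, best_v = a, v
    return best_a
```

Setting c to the rollout value range (e.g. 6h in Sailing, 127H in the game trees) keeps the exploration bonus on the same scale as the reward estimates, which is the usual rationale for this choice.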