Simple Regret Optimization in Online Planning for Markov Decision Processes

Authors: Z. Feldman, C. Domshlak

JAIR 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical evaluation shows that both BRUE and its generalization, BRUE(α), are also very effective in practice and compare favorably to the state-of-the-art. We have evaluated BRUE empirically on the MDP Sailing domain (Péret & Garcia, 2004), used in previous works for evaluating MCTS algorithms (Péret & Garcia, 2004; Kocsis & Szepesvári, 2006; Tolpin & Shimony, 2012), as well as on an MDP version of random game trees used in the original empirical evaluation of UCT (Kocsis & Szepesvári, 2006).
Researcher Affiliation | Academia | Zohar Feldman (EMAIL), Carmel Domshlak (EMAIL), Faculty of Industrial Engineering & Management, Technion Israel Institute of Technology, Haifa, Israel
Pseudocode | Yes | MCTS [input: ⟨S, A, Tr, R⟩; s0 ∈ S]: while time permits do ρ ← Rollout() // generate rollout; Update(ρ); return arg max_a Q̂(⟨s0, H⟩, a); procedure Rollout ... (Figure 1). procedure Update(ρ): for d = |ρ|, ..., 1 do ... (Figure 2). procedure MC-backup(⟨s, h⟩, a, r): nα ← ⌈α · n(⟨s, h⟩, a)⌉; n ← n(⟨s, h⟩, a); L(⟨s, h⟩, a)[n] ← r; Q̂(⟨s, h⟩, a) ← (1/nα) · Σ_{i=n−nα+1}^{n} L(⟨s, h⟩, a)[i] (Figure 4).
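The MC-backup rule quoted in that row, where the Q-estimate of a (state, depth, action) triple is the mean of only the freshest ⌈α·n⌉ observed rollout returns rather than of all n, can be sketched as follows. This is a minimal illustration, not the paper's (unreleased) implementation; the class and method names are ours.

```python
import math
from collections import defaultdict

class BrueEstimator:
    """Sketch of BRUE(alpha)-style MC-backup: average the last
    ceil(alpha * n) returns observed for (state, depth, action)."""

    def __init__(self, alpha: float = 0.5):
        self.alpha = alpha
        self.returns = defaultdict(list)  # (state, h, action) -> observed returns

    def mc_backup(self, state, h, action, r: float) -> float:
        key = (state, h, action)
        self.returns[key].append(r)               # L(<s,h>, a)[n] <- r
        n = len(self.returns[key])                # n <- n(<s,h>, a)
        n_alpha = max(1, math.ceil(self.alpha * n))
        window = self.returns[key][-n_alpha:]     # freshest n_alpha samples
        return sum(window) / n_alpha              # Q-hat(<s,h>, a)
```

With α = 1 this degenerates to the ordinary Monte-Carlo mean; smaller α discards stale samples, e.g. after observing returns 0, 0, 1, 1 with α = 0.5 the estimate is the mean of the last two samples, 1.0.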
Open Source Code | No | The paper does not contain any explicit statement about making the source code available, nor does it provide a link to a code repository for the methodology described.
Open Datasets | Yes | We have evaluated BRUE empirically on the MDP Sailing domain (Péret & Garcia, 2004), used in previous works for evaluating MCTS algorithms (Péret & Garcia, 2004; Kocsis & Szepesvári, 2006; Tolpin & Shimony, 2012), as well as on an MDP version of random game trees used in the original empirical evaluation of UCT (Kocsis & Szepesvári, 2006).
Dataset Splits | No | For the Sailing domain: Each algorithm was run on 1000 randomly chosen initial states s0, with the target being fixed at one of the corners of the grid. For random game trees: Figure 6 shows the performance of the four algorithms for two game configurations, B = 6, D = 8 and B = 2, D = 22, with each configuration being represented by 1000 game trees. The paper describes using environments (Sailing domain, game trees) and generating instances/initial states for evaluation, but does not provide details on training/test/validation splits for a static dataset.
Hardware Specification | No | The paper does not provide specific hardware details (such as CPU/GPU models, memory, or computing infrastructure) used for running the experiments.
Software Dependencies | No | The paper mentions that "All four algorithms were implemented within a single software infrastructure" but does not specify any programming languages, libraries, or solvers with version numbers.
Experiment Setup | Yes | In the Sailing domain, we set H to 4n, where n is the grid-size of the problem instance... the exploration coefficient for UCT and uUCT (parameter c in Eq. 2) was set to the difference between the largest possible and the smallest possible values of the H-step rollouts from the root. In the Sailing domain, this corresponds to the maximal move duration, 6, multiplied by the number of steps-to-go h. ... the exploration coefficient for UCT and uUCT was set to the range of the game values, 127H, since rewards are bounded by the interval [−127H/2, 127H/2]. ... we used α = n_H / n_h, where n_H denotes the average number of samples of leaf nodes, and n_h denotes the average number of samples of nodes at the same depth as the value estimator under consideration.
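The parameter choices quoted in that row reduce to a few one-line formulas. The helpers below are a hedged sketch of those formulas only; the function names are ours, and the paper's own experimental code is not publicly available.

```python
def sailing_exploration_coeff(steps_to_go: int, max_move_duration: float = 6.0) -> float:
    """UCT coefficient c for the Sailing domain: the range of possible
    h-step rollout values, i.e. maximal move duration (6) times steps-to-go h."""
    return max_move_duration * steps_to_go

def game_tree_exploration_coeff(horizon: int) -> float:
    """UCT coefficient c for the random game trees: the range of game
    values, 127 * H, per the quoted reward bounds."""
    return 127.0 * horizon

def brue_alpha(avg_leaf_samples: float, avg_depth_samples: float) -> float:
    """alpha = n_H / n_h: ratio of the average sample count at leaf nodes
    to the average sample count at the estimator's depth."""
    return avg_leaf_samples / avg_depth_samples
```

For example, with 10 steps-to-go the Sailing coefficient is 60, and with n_H = 50 and n_h = 200 the BRUE(α) averaging fraction is 0.25.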