Multiple-policy Evaluation via Density Estimation

Authors: Yilei Chen, Aldo Pacchiano, Ioannis Paschalidis

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Theoretical We propose an algorithm named CAESAR for this problem. Our approach is based on computing an approximately optimal sampling distribution and using the data sampled from it to perform the simultaneous estimation of the policy values. Up to low-order and logarithmic terms, CAESAR achieves a sample complexity.
Researcher Affiliation Academia (1) Boston University, Boston, USA; (2) Broad Institute of MIT and Harvard, Cambridge, USA. Correspondence to: Yilei Chen <EMAIL>.
Pseudocode Yes Algorithm 1: Importance Density Estimation (IDES)
Input: horizon $H$, accuracy $\epsilon$, target policy $\pi$, coarse estimators $\{\hat d_h^\pi\}_{h=1}^H$ and $\{\hat\mu_h\}_{h=1}^H$, and dataset $\mu$.
Define feasible sets $\{D_h\}_{h=1}^H$, where $D_h(s,a) = [0, 2\hat d_h^\pi(s,a)]$.
Initialize $w_h^0 = 0$ for $h = 1, \dots, H$, and set $\mu_0(s_0,a_0) = 1$, $P_0(s \mid s_0,a_0) = \nu(s)$, $\hat w_0 = \hat\mu_0 = 1$.
for $h = 1$ to $H$ do
    Set the iteration number of optimization, $n_h = \sum_{s,a} \frac{(\hat d_h^\pi(s,a))^2}{\hat\mu_h(s,a)} + (\hat d_{h-1}^\pi(s,a))^2$ […] is a known constant.
    for $i = 1$ to $n_h$ do
        Sample $\{s_h^i, a_h^i\}$ from $\mu_h$ and $\{s_{h-1}^i, a_{h-1}^i, s_h^i\}$ from $\mu_{h-1}$.
        Calculate the gradient $g(w_h^{i-1})$:
        $g(w_h^{i-1})(s,a) = \frac{w_h^{i-1}(s,a)}{\hat\mu_h(s,a)} \mathbb{I}(s_h^i = s, a_h^i = a) - \frac{\hat w_{h-1}(s_{h-1}^i, a_{h-1}^i)}{\hat\mu_{h-1}(s_{h-1}^i, a_{h-1}^i)} \pi(a \mid s)\, \mathbb{I}(s_h^i = s)$.
        Update $w_h^i = \mathrm{Proj}_{w \in D_h}\{w_h^{i-1} - \eta_h^i\, g(w_h^{i-1})\}$.
    end for
    Output the estimator $\hat w_h = \frac{1}{\sum_{i=1}^{n_h} i} \sum_{i=1}^{n_h} w_h^i$.
end for
Open Source Code No The paper does not provide any explicit statements about releasing code or links to source code repositories.
Open Datasets No The paper is theoretical and discusses a general 'offline dataset' or 'batch of data' within the problem formulation, but it does not specify or use any particular publicly available dataset for experiments.
Dataset Splits No The paper is theoretical and does not describe experiments with specific datasets; it therefore does not mention any training, test, or validation dataset splits.
Hardware Specification No The paper is theoretical and focuses on algorithm design and sample complexity analysis. It does not describe any experimental setup or the specific hardware used to run experiments.
Software Dependencies No The paper mentions theoretical concepts and algorithms such as 'stochastic gradient descent' and 'DualDICE', and references related works, but it does not specify any software libraries or packages with version numbers used for implementation.
Experiment Setup No The paper is theoretical and does not present experimental results; it therefore does not include details on experimental setup, hyperparameters, or training configurations.
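
Since the paper releases no code, the IDES pseudocode in the table above can be illustrated with a small tabular sketch. The following is not the authors' implementation. It assumes a stationary transition kernel, uniform sampling distributions with exact estimates (mu_hat = mu), a fixed iteration count in place of n_h, a step size of 2/(i+1), and an i-weighted average of the iterates (suggested by the normalizer in the Output step); none of these choices is stated in the text above. The names `true_occupancy`, `ides`, and `demo` are hypothetical.

```python
import numpy as np


def true_occupancy(P, nu, pi, H):
    """Exact d^pi_h(s, a) for h = 1..H by forward recursion (used only for checking)."""
    d = [nu[:, None] * pi]                              # d_1(s, a) = nu(s) * pi(a|s)
    for _ in range(H - 1):
        next_state = np.einsum("sa,sat->t", d[-1], P)   # next-state marginal
        d.append(next_state[:, None] * pi)
    return d


def ides(P, nu, pi, mu, mu_hat, d_hat, n_iters, seed=0):
    """Projected SGD per step h, followed by a weighted average of the iterates.

    mu, mu_hat, d_hat are lists of (S, A) arrays, one per step (0-indexed here).
    Returns a list of (S, A) arrays approximating d^pi_h.
    """
    rng = np.random.default_rng(seed)
    S, A = pi.shape
    w_hat = []
    for h in range(len(mu)):
        # Draw all samples for this step in one batch.
        cur = rng.choice(S * A, size=n_iters, p=mu[h].ravel())
        cs, ca = np.unravel_index(cur, (S, A))
        if h == 0:
            # Convention from the pseudocode: mu_0 is a point mass, P_0 = nu,
            # and w_hat_0 / mu_hat_0 = 1.
            nxt = rng.choice(S, size=n_iters, p=nu)
            ratio = np.ones(n_iters)
        else:
            prev = rng.choice(S * A, size=n_iters, p=mu[h - 1].ravel())
            ps, pa = np.unravel_index(prev, (S, A))
            # Inverse-CDF sampling of the next state from P(. | s, a).
            cdf = np.cumsum(P[ps, pa], axis=1)
            u = rng.random(n_iters)
            nxt = np.minimum((u[:, None] > cdf).sum(axis=1), S - 1)
            ratio = w_hat[h - 1][ps, pa] / mu_hat[h - 1][ps, pa]

        upper = 2.0 * d_hat[h]            # feasible set D_h = [0, 2 * d_hat_h]
        w = np.zeros((S, A))
        weighted_sum = np.zeros((S, A))
        for i in range(1, n_iters + 1):
            s, a, s2 = cs[i - 1], ca[i - 1], nxt[i - 1]
            g = np.zeros((S, A))
            g[s, a] += w[s, a] / mu_hat[h][s, a]
            g[s2, :] -= ratio[i - 1] * pi[s2, :]
            eta = 2.0 / (i + 1)           # assumed step size, not from the source
            w = np.clip(w - eta * g, 0.0, upper)
            weighted_sum += i * w         # assumed i-weighted averaging
        w_hat.append(weighted_sum / (n_iters * (n_iters + 1) / 2))
    return w_hat


def demo(seed=0, S=3, A=2, H=3, n_iters=20000):
    """Random small MDP; the coarse estimate d_hat is the true d scaled by [0.75, 1.5]."""
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(5 * np.ones(S), size=(S, A))      # shape (S, A, S)
    nu = rng.dirichlet(5 * np.ones(S))
    pi = rng.dirichlet(5 * np.ones(A), size=S)          # shape (S, A)
    d = true_occupancy(P, nu, pi, H)
    mu = [np.full((S, A), 1.0 / (S * A)) for _ in range(H)]
    d_hat = [dh * rng.uniform(0.75, 1.5, size=dh.shape) for dh in d]
    w_hat = ides(P, nu, pi, mu, mu, d_hat, n_iters, seed=seed + 1)
    return d, d_hat, w_hat
```

Calling `demo()` returns the exact occupancies, the coarse estimates, and the IDES estimates, so the per-step gap `np.max(np.abs(w_hat[h] - d[h]))` can be inspected directly. With mu_hat equal to mu, the expected gradient vanishes where w equals the push-forward of the previous step's estimate through the transition kernel and the policy, which is why each iterate average tracks d^pi_h in this sketch.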