Combinatorial Reinforcement Learning with Preference Feedback
Authors: Joongkyu Lee, Min-Hwan Oh
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we empirically evaluate the performance of our algorithm, MNL-VQL, in two settings: a synthetic environment (Subsection 6.1) and a real-world dataset (Subsection 6.2). We compare our algorithm against two baselines: Myopic and LSVI-UCB (Jin et al., 2020). |
| Researcher Affiliation | Academia | 1Seoul National University, Seoul, Korea. |
| Pseudocode | Yes | Algorithm 1 MNL-VQL, MNL Preference Model with Variance-weighted Item-level Q-Learning |
| Open Source Code | No | The paper does not provide any explicit statement about releasing code or a link to a code repository. |
| Open Datasets | Yes | The MovieLens dataset contains 25 million ratings on a 5-star scale for 62,000 movies (base items) provided by 162,000 users. |
| Dataset Splits | No | The paper mentions using a subset of the MovieLens dataset containing "1.1 × 10^3 users and a varying number of movies, N ∈ {50, 100, 200}". However, it does not specify how this data was split into training, validation, or test sets, nor does it mention any cross-validation setup. |
| Hardware Specification | No | The paper does not explicitly describe any specific hardware (e.g., GPU, CPU models, or cloud resources with specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the implementation of the algorithms. |
| Experiment Setup | Yes | We set the parameters as follows: K = 10000, H = 3, M = 4, |S| = 100 + (H-1)*4 = 400 (including the absorbing state), d = 26 (MNL feature dimension), d_lin = 204 (linear MDP feature dimension), N ∈ {50, 100, 200} (number of base items), and |A| = Σ_{m=1}^{M-1} C(N, m) ∈ {20875, 166750, 1333500}. The proportion of junk items is set to 30%. For our experiments, we use a subset of the dataset containing 1.1 × 10^3 users and a varying number of movies, N ∈ {50, 100, 200}. To construct MNL features, we follow a similar experimental setup as in Li et al. (2019), employing low-rank matrix factorization. For linear MDP features, we apply the same approach as used in our synthetic data experiments. |
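The reported action-space sizes follow directly from the combinatorial formula in the setup: an action is any nonempty assortment of at most M−1 base items, so |A| = Σ_{m=1}^{M-1} C(N, m). A minimal sketch verifying the three values quoted in the table (function name is ours, not from the paper):

```python
from math import comb

def num_assortments(N: int, M: int) -> int:
    """Count nonempty assortments of at most M-1 items drawn from N base items."""
    return sum(comb(N, m) for m in range(1, M))

# Reproduces the sizes reported for M = 4 and N in {50, 100, 200}
for N in (50, 100, 200):
    print(N, num_assortments(N, M=4))
# → 50 20875
# → 100 166750
# → 200 1333500
```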
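The setup constructs MNL features by low-rank matrix factorization of the rating matrix, following a Li et al. (2019)-style recipe. A hedged sketch of one common way to do this (the paper does not give its exact pipeline; the toy matrix, latent dimension, and scaling choices below are our assumptions, not the authors' code):

```python
import numpy as np

# Toy stand-in for a MovieLens-style user-by-movie rating matrix
# (1.1e3 users, N = 50 movies, ratings on a 0-5 scale).
rng = np.random.default_rng(0)
R = rng.integers(0, 6, size=(1100, 50)).astype(float)

# Truncated SVD: R ≈ U_d diag(s_d) V_d^T. Using the scaled right
# singular vectors as per-movie feature vectors is one standard choice;
# the latent dimension here is illustrative (the paper reports d = 26).
d = 25
U, s, Vt = np.linalg.svd(R, full_matrices=False)
item_features = Vt[:d].T * np.sqrt(s[:d])  # shape (50, d): one row per movie
```

Any factorization that yields a rank-d approximation (e.g. alternating least squares on observed entries only) would serve the same purpose; full SVD is used here just to keep the sketch self-contained.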