Offline Learning for Combinatorial Multi-armed Bandits

Authors: Xutong Liu, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, Carlee Joe-Wong, John C.S. Lui, Wei Chen

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical Validation: "Finally, extensive experiments on both synthetic and real-world datasets for learning to rank and LLM caching validate the superior performance of CLCB compared to baseline algorithms."
Researcher Affiliation | Collaboration | 1 ECE Department, Carnegie Mellon University, Pittsburgh, PA, United States; 2 CSE Department, Chinese University of Hong Kong, Hong Kong SAR, China; 3 CS Department, City University of Hong Kong, Hong Kong SAR, China; 4 Microsoft Research, Beijing, China.
Pseudocode | Yes | Algorithm 1 (CLCB: Combinatorial Lower Confidence Bound Algorithm for Off-CMAB)
Open Source Code | No | The paper neither provides an explicit link to source code for the described methodology nor makes an unambiguous statement of code release.
Open Datasets | Yes | "For real-world evaluation, we use the Yelp dataset, where users rate businesses (Dai et al., 2024c). ... We use the SciQ dataset (Welbl et al., 2017)."
Dataset Splits | No | The paper reports run lengths (e.g., "n = 100 rounds") and specific cache sizes, but gives no training/validation/test splits, so the data partitioning cannot be reproduced.
Hardware Specification | Yes | "All tests were performed on a macOS system equipped with an Apple M3 Pro processor and 18 GB of RAM."
Software Dependencies | No | The paper mentions GPT-4o and GPT-4-turbo, along with OpenAI's tiktoken library and the OpenAI LLM API, but does not provide version numbers for these software components.
Experiment Setup | Yes | "In the synthetic setup, we simulate 100 distinct queries with a cache size of 40, following a power-law frequency distribution (α = 0.9) as in (Zhu et al., 2023). ... For the evaluation, we work with 100 distinct prompts from the SciQ dataset in an offline setting, performing a total of 10,000 queries with cache sizes of K = 10 and K = 20, respectively."
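The CLCB algorithm named in the Pseudocode row can be sketched as pessimistic top-K selection from offline data: score each base arm by a lower confidence bound and pick the K highest-scoring arms. This is an illustrative reconstruction, not the paper's exact Algorithm 1; the function name, the Hoeffding-style confidence radius, and the assumption that the super-arm reward is a sum over base arms are ours.

```python
import math

def clcb_select(counts, mean_rewards, K, delta=0.05):
    """Pessimistic top-K selection for an offline combinatorial bandit.

    counts[i]       -- number of offline observations of base arm i
    mean_rewards[i] -- empirical mean reward of base arm i
    Returns the indices of the K arms with the highest lower confidence
    bounds (illustrative sketch; the paper's Algorithm 1 may differ).
    """
    n_arms = len(counts)
    lcb = []
    for i in range(n_arms):
        if counts[i] == 0:
            lcb.append(float("-inf"))  # no data: maximally pessimistic
        else:
            # Hoeffding-style radius; shrinks as an arm is observed more
            radius = math.sqrt(math.log(2 * n_arms / delta) / (2 * counts[i]))
            lcb.append(mean_rewards[i] - radius)
    # Super arm = top-K arms by LCB (valid when the reward is additive)
    return sorted(range(n_arms), key=lambda i: lcb[i], reverse=True)[:K]
```

Pessimism is the key design choice for the offline setting: an arm with a high empirical mean but very few observations gets a wide radius and a low LCB, so it is not selected on the strength of noisy data alone.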
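The synthetic LLM-caching setup quoted in the Experiment Setup row (100 distinct queries, power-law frequencies with α = 0.9, 10,000 total queries) can be simulated with a short sketch. The function name and the Zipf-style weighting P(query i) ∝ 1/i^α are our assumptions, not the paper's exact generator.

```python
import random

def powerlaw_query_stream(n_queries=100, alpha=0.9, total=10_000, seed=0):
    """Sample a stream of query IDs whose frequencies follow a power law:
    P(query i) proportional to 1 / (i + 1) ** alpha (Zipf-like).
    Illustrative sketch of the paper's synthetic caching workload."""
    rng = random.Random(seed)
    weights = [1.0 / (i + 1) ** alpha for i in range(n_queries)]
    return rng.choices(range(n_queries), weights=weights, k=total)
```

Under this distribution a small head of queries dominates the stream, which is exactly the regime where a cache of size K = 40 (or K = 10 / K = 20 in the SciQ runs) can absorb most of the traffic.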