Strategic A/B testing via Maximum Probability-driven Two-armed Bandit

Authors: Yu Zhang, Shanshan Zhao, Bokui Wan, Jinjuan Wang, Xiaodong Yan

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. The experimental results indicate a significant improvement in A/B testing efficiency, highlighting the potential to reduce experimental costs while maintaining high statistical power. "In this section, we conduct detailed comparisons between the proposed method and other state-of-the-art methods via synthetic data (Section 5.1) and real-world data (Section 5.2)."
Researcher Affiliation: Collaboration. 1 Zhongtai Securities Institute for Financial Studies, Shandong University, Jinan, China; 2 School of Mathematics, Shandong University, Jinan, China; 3 Didi Chuxing, Beijing, China; 4 School of Mathematics and Statistics, Beijing Institute of Technology, Beijing, China; 5 School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China.
Pseudocode: Yes. Algorithm 1, the Permuted WTAB algorithm. Input: data D = {(Xi, Yi, Ai), i = 1, . . . , n}; threshold τ; number of permutations B. Output: the aggregated p-value pa.
Open Source Code: No. The text contains no unambiguous statement that the authors are releasing the code for the work described in this paper, nor a direct link to a source-code repository.
Open Datasets: No. The application of the proposed method is demonstrated through an analysis of three real data sets obtained from a world-leading ride-sharing company; due to privacy considerations, they are referred to as data sets A, B, and C. Because real-world data distributions are often difficult to replicate using purely synthetic data, the authors additionally construct a semi-synthetic dataset based on real-world observations, following the approach proposed in (Kohavi et al., 2020).
Dataset Splits: Yes. In practice, the observation dataset D is divided into K equal subsets Dk. For each Dk, a LightGBM model is trained on the remaining data D \ Dk and applied to estimate counterfactual results for Dk. This procedure is repeated for all subsets Dk. The simulations in Section 5 show that good performance is achievable with K = 2.
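The cross-fitting step described above can be sketched as follows. This is a minimal, generic sketch: the function and variable names are illustrative, and `make_model` is a placeholder for any fit/predict estimator (the paper uses LightGBM, which plugs in via `lightgbm.LGBMRegressor`).

```python
import numpy as np

def cross_fit_counterfactuals(X, Y, make_model, K=2, seed=0):
    """K-fold cross-fitting: for each subset D_k, a fresh model is trained on
    the remaining data D \\ D_k and used to predict outcomes on D_k, so every
    prediction is out-of-fold. `make_model` must return an object exposing
    fit(X, y) and predict(X), e.g. lightgbm.LGBMRegressor."""
    n = len(Y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)  # K equal (up to 1) subsets
    mu_hat = np.empty(n)
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        model = make_model()
        model.fit(X[train_idx], Y[train_idx])
        mu_hat[test_idx] = model.predict(X[test_idx])
    return mu_hat
```

Keeping the learner abstract mirrors the report's observation that LightGBM and XGBoost behave similarly here: either can be swapped in without changing the cross-fitting logic.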
Hardware Specification: No. The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies: No. Specifically, LightGBM (Ke et al., 2017), a state-of-the-art gradient boosting algorithm, is employed within the double machine learning (DML) framework (Chernozhukov et al., 2018). Additionally, XGBoost (Chen & Guestrin, 2016) is used to estimate m1(x), m0(x), and e(x), exhibiting performance similar to LightGBM's. The paper names these packages but does not provide specific version numbers for the libraries used in its experiments.
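To make the roles of m1(x), m0(x), and e(x) concrete: in a DML-style analysis these three nuisance estimates are typically combined into doubly robust (AIPW) pseudo-outcomes whose average estimates the treatment effect. The sketch below shows that standard construction under the assumption that the nuisance predictions have already been computed; it is not claimed to be the paper's exact estimator.

```python
import numpy as np

def aipw_scores(Y, A, mu1_hat, mu0_hat, e_hat, clip=1e-3):
    """Doubly robust (AIPW) pseudo-outcomes built from the three nuisance
    estimates: mu1_hat ~ m1(x) = E[Y|X,A=1], mu0_hat ~ m0(x) = E[Y|X,A=0],
    and e_hat ~ e(x) = P(A=1|X). The mean of the returned scores is an
    estimate of the average treatment effect."""
    p = np.clip(e_hat, clip, 1 - clip)  # guard against extreme propensities
    return (mu1_hat - mu0_hat
            + A * (Y - mu1_hat) / p
            - (1 - A) * (Y - mu0_hat) / (1 - p))
```

In practice `mu1_hat`, `mu0_hat`, and `e_hat` would come from cross-fitted LightGBM or XGBoost models, as the report notes.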
Experiment Setup: Yes. Specifically, first determine a number of permutations, denoted B. The sequence {1, 2, . . . , n} is then reordered by applying a mapping πb : {1, 2, . . . , n} → {1, 2, . . . , n}; for each element i in the original sequence, its position in the reordered sequence is πb(i). For b = 1, . . . , B, the mapping πb is applied to the counterfactual outcomes {µ̂i, i = 1, . . . , n}, yielding reordered samples {µ̂πb(i), i = 1, . . . , n}. In this paper, B is set to 25, as increasing B further was observed not to substantially improve statistical power. The simulations in Section 5 show that good performance is achievable with K = 2. The sample size is fixed at n = 20000. A threshold (typically 0.03) is selected to regulate the magnitude of λ, that is, to identify the largest λ satisfying λσ/((1 − λ)√n) ≤ 0.03.
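The permutation scheme above can be sketched as a generic permutation-test wrapper. This is an assumption-laden sketch: the WTAB test statistic itself is not reproduced here, so `statistic` is a placeholder for whatever scalar the algorithm computes from the (possibly reordered) counterfactual outcomes, and the p-value aggregation across permutations may differ from the paper's.

```python
import numpy as np

def permutation_pvalue(mu_hat, statistic, B=25, seed=0):
    """Recompute a test statistic on B uniformly random reorderings pi_b of
    the estimated counterfactual outcomes {mu_hat_i}, then compare against
    the observed value using the standard +1 finite-sample correction."""
    rng = np.random.default_rng(seed)
    t_obs = statistic(mu_hat)
    t_perm = np.array([statistic(rng.permutation(mu_hat)) for _ in range(B)])
    return (1 + np.count_nonzero(t_perm >= t_obs)) / (B + 1)
```

With B = 25 the smallest attainable p-value is 1/26 ≈ 0.038, which is consistent with the report's note that modest B already suffices for the power levels studied.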