reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Counterfactual Realizability

Authors: Arvind Raghavan, Elias Bareinboim

ICLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Simulations in the online setting corroborate this finding. Fig. 6(c,d) shows the cumulative regret (CR) and optimal arm probability (OAP) over 2000 iterations averaged over 200 epochs (CI=95%). We adapt Thompson Sampling to implement the strategies in Table 1. Details of implementation are in App. F.3.1 here.
Researcher Affiliation	Academia	Arvind Raghavan and Elias Bareinboim Causal Artificial Intelligence Lab Columbia University EMAIL
Pseudocode	Yes	Algorithm 1 CTF-REALIZE 1: Input: L3-distribution Q = P(W); causal diagram G; action set A 2: Output: I.i.d. sample W(i) from Q; FAIL if Q is not realizable given G, A 3: Fix a topological ordering Top(G) 4: SELECT(i) for a new unit i 5: for V in order Top(G) do 6: INTV {Interventions for V} 7: OUTPUTV {Index in output vector} 8: for each term Wt in expression W do 9: if V An(W)GT and V = W then 10: Call COMPATIBLE(V, Wt) Alg. 2 11: end if 12: if V = W then 13: Add {Wt} to OUTPUTV 14: end if 15: end for
Open Source Code	No	The paper does not provide an explicit statement about releasing code or a link to a code repository. It mentions 'Proofs and experiment details are in the full technical report (Raghavan & Bareinboim, 2025)' and 'Details of implementation are in App. F.3.1 here', referring to another document, but not source code.
Open Datasets	No	The paper uses simulated examples (Example 2: 'holdout set of fake CVs', Example 3: 'user of a social media platform'). It does not provide access information (links, DOIs, formal citations) for any publicly available datasets.
Dataset Splits	No	The paper mentions a 'holdout set of fake CVs' in Example 2 but does not specify any training/test/validation splits. Example 3 involves simulations over iterations and epochs, which is not dataset splitting in the traditional sense.
Hardware Specification	No	The paper does not provide any specific details about the hardware used for running experiments, such as GPU models, CPU types, or cloud computing resources.
Software Dependencies	No	The paper mentions adapting 'Thompson Sampling' for implementation but does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow, specific libraries, or their versions).
Experiment Setup	No	The paper mentions '2000 iterations averaged over 200 epochs (CI=95%)' for simulations. However, it defers detailed experimental setup, such as specific hyperparameters or model initialization settings, to appendices of a separate technical report (e.g., 'Details of the SCM, latent confounders, and the optimal L3-strategy are in App. F.3 here.'). The main text does not contain these specific details.
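
The simulation protocol quoted in the Research Type and Experiment Setup rows (cumulative regret and optimal arm probability over 2000 iterations, averaged over 200 epochs with a 95% CI) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain Bernoulli Thompson Sampling with Beta(1, 1) priors rather than the adapted strategies of the paper's Table 1, and the arm means, the function names run_epoch and simulate, and the normal-approximation confidence interval are all assumptions made for illustration.

```python
import math
import random
from statistics import mean, stdev


def run_epoch(arm_means, n_iterations, rng):
    """Run one epoch of Bernoulli Thompson Sampling (illustrative only).

    Returns two lists with one entry per iteration: the cumulative expected
    regret so far, and a 0/1 flag for whether the optimal arm was played.
    """
    n_arms = len(arm_means)
    best_mean = max(arm_means)
    best_arm = arm_means.index(best_mean)
    successes = [0] * n_arms  # with failures, encodes a Beta(1, 1) prior per arm
    failures = [0] * n_arms
    regret, picked_best = [], []
    total = 0.0
    for _ in range(n_iterations):
        # Draw one plausible mean per arm from its posterior; play the largest.
        draws = [rng.betavariate(successes[a] + 1, failures[a] + 1)
                 for a in range(n_arms)]
        arm = max(range(n_arms), key=draws.__getitem__)
        reward = 1 if rng.random() < arm_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        total += best_mean - arm_means[arm]  # expected regret of this pull
        regret.append(total)
        picked_best.append(1 if arm == best_arm else 0)
    return regret, picked_best


def simulate(arm_means, n_iterations=2000, n_epochs=200, seed=0):
    """Average cumulative regret and optimal arm probability over epochs."""
    rng = random.Random(seed)
    runs = [run_epoch(arm_means, n_iterations, rng) for _ in range(n_epochs)]
    final_regrets = [regret[-1] for regret, _ in runs]
    # OAP at iteration t: share of epochs that played the optimal arm at t.
    oap = [mean(picked[t] for _, picked in runs) for t in range(n_iterations)]
    # Normal-approximation 95% CI half-width (an assumption, see lead-in).
    if n_epochs > 1:
        half_width = 1.96 * stdev(final_regrets) / math.sqrt(n_epochs)
    else:
        half_width = 0.0
    return {
        "mean_final_regret": mean(final_regrets),
        "ci95_half_width": half_width,
        "oap": oap,
    }
```

Calling simulate([0.2, 0.5, 0.8]) with the default arguments mirrors the quoted 2000-iteration, 200-epoch setup; the three arm means are hypothetical and do not come from the paper's SCM.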