Score-based Causal Representation Learning: Linear and General Transformations

Authors: Burak Varici, Emre Acartürk, Karthikeyan Shanmugam, Abhishek Kumar, Ali Tajer

JMLR 2025

Reproducibility assessment (Variable / Result / LLM Response):
Research Type: Experimental

In this section, we provide empirical assessments of our theoretical guarantees. Specifically, we empirically evaluate the performance of the LSCALE-I (Section 7.1) and GSCALE-I (Section 7.2) algorithms for recovering the latent causal variables and the latent DAG G on synthetic data. In Section 7.3, we compare the performance of LSCALE-I to that of existing algorithms in the closely related literature on both synthetic and biological data. Next, we apply GSCALE-I to image data to demonstrate the potential of our approach on realistic high-dimensional datasets (Section 7.5). Finally, we note that any desired score estimator can be modularly incorporated into our algorithms. In Section 7.6, we assess our performance's sensitivity to the estimator's quality. Additional results and further implementation details are deferred to Appendix E.

Evaluation metrics. The objectives are recovering the graph G and the latent variables Z. We use the following metrics to evaluate the accuracy of LSCALE-I and GSCALE-I in recovering them (depending on the specifics of the transformations and interventions, we also report more specific metrics). For each metric, we report the mean and standard error over multiple runs.

Structural Hamming distance: For assessing recovery of the latent DAG, we report the structural Hamming distance (SHD) between the estimate Ĝ and the true DAG G. This captures the number of edge operations (add, delete, flip) needed to transform Ĝ into G.

Mean correlation coefficient: For recovery of the latent variables, we use the mean correlation coefficient (MCC), which was introduced in (Khemakhem et al., 2020b) and is commonly used as a standard metric in CRL.
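To make the two metrics concrete, here is a minimal NumPy sketch of SHD and MCC as described above. The function names, the flip-counting convention (a reversed edge counts as one operation), and the brute-force permutation matching are illustrative choices, not the paper's implementation:

```python
import numpy as np
from itertools import permutations

def shd(A_true, A_est):
    """Structural Hamming distance between two DAG adjacency matrices.
    Additions and deletions count as one operation each; a flipped edge
    (present in both graphs but with opposite direction) counts as one."""
    diff = np.abs(A_true - A_est)
    # a flip shows up as two differing entries, (i, j) and (j, i)
    flips = np.logical_and(diff, diff.T).sum() // 2
    return int(diff.sum() - flips)

def mcc(Z_true, Z_est):
    """Mean correlation coefficient: match estimated latents to true
    latents and average the absolute Pearson correlations of the pairs.
    Brute force over permutations is fine for small n; use a Hungarian
    solver (e.g., scipy.optimize.linear_sum_assignment) for larger n."""
    n = Z_true.shape[1]
    corr = np.abs(np.corrcoef(Z_true, Z_est, rowvar=False)[:n, n:])
    return max(
        np.mean([corr[i, p[i]] for i in range(n)])
        for p in permutations(range(n))
    )
```

For example, a perfectly recovered latent space up to a permutation of the coordinates yields an MCC of 1, and a single reversed edge yields an SHD of 1.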
Researcher Affiliation: Collaboration

- Burak Varıcı, Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Emre Acartürk, Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
- Karthikeyan Shanmugam, Google DeepMind India, Bengaluru 560043, India
- Abhishek Kumar, Amazon AGI, USA
- Ali Tajer, Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
Pseudocode: Yes

Algorithm 1: Linear Score-based Causal Latent Estimation via Interventions (LSCALE-I) [...]
Algorithm 2: LSCALE-I for sufficiently nonlinear latent causal models [...]
Algorithm 3: Generalized Score-based Causal Latent Estimation via Interventions (GSCALE-I)
Open Source Code: Yes

The codebase for the algorithms and simulations is available at: https://github.com/acarturk-e/score-based-crl.
Open Datasets: Yes

In this section, we apply our algorithms to the Perturb-seq dataset of Norman et al. (2019), which is also used in Zhang et al. (2023).
Dataset Splits: No

The paper mentions generating 'ns independent and identically distributed (i.i.d.) samples of Z from each environment' (Section 7.1) and '10000 samples from each graph under each environment' (Section 7.5). It also details training parameters such as '100 epochs, Adam optimizer, batch size: 16' for the autoencoders (Table 17). However, it does not specify explicit training, validation, and test splits (e.g., percentages or counts) of the data itself.
Hardware Specification: No

The paper does not provide specific hardware details such as GPU models, CPU types, or memory capacities used to run the experiments; it only notes that the codebase for the algorithms and simulations is publicly available.
Software Dependencies: No

The paper mentions using 'Langevin sampling (Welling and Teh, 2011)', 'sliced score matching with variance reduction (SSM-VR) (Song et al., 2020)', a 'CNN-based model', the 'RMSprop optimizer', and the 'Adam optimizer'. These are techniques, models, and optimizers; specific software libraries or solvers with version numbers are not provided.
Experiment Setup: Yes

In the simulation results reported in Section 7.2, we set λ = 1 and solve (302) using the RMSprop optimizer with learning rate 10^-3 for 3 × 10^4 steps for n = 5 and 4 × 10^4 steps for n = 8. We also use early stopping when the training converges before the maximum number of steps.

For the image experiments, Table 17 details the training configurations:

Autoencoder-1 training: 100 epochs, Adam optimizer, batch size 16, learning rate 10^-3, weight decay 0.01; minimize the reconstruction loss E‖X − X̂‖².

Autoencoder-2 training: 100 epochs, Adam optimizer, batch size 16, weight decay 0.01; learning rate decayed by a factor of 0.95 per epoch for 75 epochs, then reset to 10^-3 for the final 25 epochs; minimize the score loss ‖D_t(h) − 1_{n×1}‖_{1,1} plus λ times the reconstruction loss E‖Y − Ŷ‖².
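The paper states that early stopping is applied when training converges before the step limit, but does not specify the stopping criterion. A common patience-based rule can be sketched as follows; the class name and the patience and min_delta values are assumptions for illustration, not the paper's settings:

```python
class EarlyStopper:
    """Illustrative patience-based early stopping: stop once the loss has
    failed to improve by at least min_delta for `patience` consecutive
    checks. The paper only states that training stops on convergence."""

    def __init__(self, patience=100, min_delta=1e-6):
        self.patience = patience      # allowed non-improving checks (assumed)
        self.min_delta = min_delta    # minimum improvement to reset (assumed)
        self.best = float("inf")
        self.bad_steps = 0

    def should_stop(self, loss):
        if loss < self.best - self.min_delta:
            self.best = loss
            self.bad_steps = 0
        else:
            self.bad_steps += 1
        return self.bad_steps >= self.patience
```

In a training loop, one would call `should_stop(loss)` once per step (or per evaluation interval) and break out of the loop when it returns True, otherwise running to the reported maximum of 3 × 10^4 or 4 × 10^4 steps.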