Large-Scale Targeted Cause Discovery via Learning from Simulated Data
Authors: Jang-Hyun Kim, Claudia Skok Gibbs, Sangdoo Yun, Hyun Oh Song, Kyunghyun Cho
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results demonstrate superior performance in identifying causal relationships within large-scale gene regulatory networks, outperforming existing methods that emphasize full-graph discovery. We validate our model's generalization capability across out-of-distribution graph structures and generating mechanisms, including gene regulatory networks of E. coli and the human K562 cell line. Empirical evaluations demonstrate that our method effectively identifies causal relationships within complex systems involving thousands of variables, under the availability of high-fidelity simulators (Section 5). |
| Researcher Affiliation | Collaboration | Jang-Hyun Kim (EMAIL), Department of Computer Science, Seoul National University; Claudia Skok Gibbs (EMAIL), Center for Data Science, New York University; Sangdoo Yun (EMAIL), NAVER AI Lab; Hyun Oh Song (EMAIL), Department of Computer Science, Seoul National University; Kyunghyun Cho (EMAIL), Center for Data Science, New York University and Prescient Design, Genentech |
| Pseudocode | Yes | Algorithms 1 and 2 describe the pseudocode of our final training and inference algorithms. |
| Open Source Code | Yes | Implementation codes are available at https://github.com/snu-mllab/Targeted-Cause-Discovery. |
| Open Datasets | Yes | The test data is generated from biological structures of E. coli (1,565 genes) and yeast (4,441 genes) as obtained from Marbach et al. (2009). Real-world human cell analysis. We test our simulator-trained model in a real-world scenario using a Perturb-seq dataset derived from the K562 cell line of a patient with chronic myelogenous leukemia (Replogle et al., 2022). Additionally, we evaluate the methods on a real-world dataset called Sachs, which contains 11 variables (Sachs et al., 2005). We obtain interventional data from bnlearn, including six intervention types, each applied to a single node. |
| Dataset Splits | Yes | Table 6: Dataset configuration. Note for abbreviations used: ER (Erdős–Rényi), SF (Scale-Free), SFdirect (directional Scale-Free), and SBM (Stochastic Block Model) (Drobyshevskiy & Turdakov, 2019). For the training data, we randomly select the graph structure and edge degree independently from the candidate sets. We use a slash (/) symbol to separately denote the statistics for E. coli and yeast GRNs. For the exact configuration of technical noise, please refer to Table 4 in Lorch et al. (2022). |
| Hardware Specification | Yes | We conduct all experiments including training and inference, using a NVIDIA RTX 3090 GPU with 24GB memory. |
| Software Dependencies | No | The paper mentions software such as the AdamW optimizer but does not provide specific version numbers for any software libraries or frameworks used. |
| Experiment Setup | Yes | Training configuration. We train a neural network using the AdamW optimizer (Loshchilov & Hutter, 2019), with training configurations detailed in Table 4. Table 4: Training configuration: batch size 32, 40,000 training steps, learning rate 8e-4, cosine learning rate scheduler, weight decay 1e-5. |
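The cosine learning rate schedule reported in the experiment setup can be sketched as a small standalone function. This is a hedged illustration, not the paper's implementation: the minimum learning rate (`min_lr = 0`) and the absence of warmup are assumptions, since Table 4 specifies only the base rate (8e-4), the step count (40,000), and the scheduler type.

```python
import math

# Hyperparameters listed in Table 4 of the paper.
BASE_LR = 8e-4
TOTAL_STEPS = 40_000

def cosine_lr(step: int, base_lr: float = BASE_LR,
              total_steps: int = TOTAL_STEPS, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate at a given training step.

    Assumes annealing from base_lr down to min_lr over total_steps with
    no warmup phase (neither detail is specified in the paper).
    """
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(cosine_lr(0))        # starts at the base rate, 8e-4
print(cosine_lr(20_000))   # halfway through training, approximately 4e-4
print(cosine_lr(40_000))   # fully annealed to min_lr, 0.0
```

In a PyTorch training loop this corresponds to pairing AdamW (with weight decay 1e-5, per Table 4) with a cosine annealing scheduler stepped once per training iteration.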