Large-Scale Targeted Cause Discovery via Learning from Simulated Data
Authors: Jang-Hyun Kim, Claudia Skok Gibbs, Sangdoo Yun, Hyun Oh Song, Kyunghyun Cho
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results demonstrate superior performance in identifying causal relationships within large-scale gene regulatory networks, outperforming existing methods that emphasize full-graph discovery. We validate our model's generalization capability across out-of-distribution graph structures and generating mechanisms, including gene regulatory networks of E. coli and the human K562 cell line. Empirical evaluations demonstrate that our method effectively identifies causal relationships within complex systems involving thousands of variables, under the availability of high-fidelity simulators (Section 5). |
| Researcher Affiliation | Collaboration | Jang-Hyun Kim (EMAIL), Department of Computer Science, Seoul National University; Claudia Skok Gibbs (EMAIL), Center for Data Science, New York University; Sangdoo Yun (EMAIL), NAVER AI Lab; Hyun Oh Song (EMAIL), Department of Computer Science, Seoul National University; Kyunghyun Cho (EMAIL), Center for Data Science, New York University and Prescient Design, Genentech |
| Pseudocode | Yes | Algorithms 1 and 2 describe the pseudocode of our final training and inference algorithms. |
| Open Source Code | Yes | Implementation codes are available at https://github.com/snu-mllab/Targeted-Cause-Discovery. |
| Open Datasets | Yes | The test data is generated from biological structures of E. coli (1,565 genes) and yeast (4,441 genes) as obtained from Marbach et al. (2009). Real-world human cell analysis. We test our simulator-trained model in a real-world scenario using a Perturb-seq dataset derived from the K562 cell line of a patient with chronic myelogenous leukemia (Replogle et al., 2022). Additionally, we evaluate the methods on a real-world dataset called Sachs, which contains 11 variables (Sachs et al., 2005). We obtain interventional data from bnlearn, including six intervention types, each applied to a single node. |
| Dataset Splits | Yes | Table 6: Dataset configuration. Note for abbreviations used: ER (Erdős–Rényi), SF (Scale-Free), SFdirect (directional Scale-Free), and SBM (Stochastic Block Model) (Drobyshevskiy & Turdakov, 2019). For the training data, we randomly select the graph structure and edge degree independently from the candidate sets. We use a slash (/) symbol to separately denote the statistics for E. coli and yeast GRNs. For the exact configuration of technical noise, please refer to Table 4 in Lorch et al. (2022). |
| Hardware Specification | Yes | We conduct all experiments including training and inference, using a NVIDIA RTX 3090 GPU with 24GB memory. |
| Software Dependencies | No | The paper mentions software such as the AdamW optimizer but does not provide specific version numbers for any software libraries or frameworks used. |
| Experiment Setup | Yes | Training configuration. We train a neural network using the AdamW optimizer (Loshchilov & Hutter, 2019), with training configurations detailed in Table 4. Table 4: Training configuration: batch size 32, 40,000 training steps, learning rate 8e-4, cosine learning rate scheduler, weight decay 1e-5. |
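The cosine learning rate schedule reported in the experiment setup can be sketched as a small standalone function. This is a hedged illustration, not the paper's implementation: the minimum learning rate (`min_lr = 0`) and the absence of warmup are assumptions, since Table 4 specifies only the base rate (8e-4), the step count (40,000), and the scheduler type.

```python
import math

# Hyperparameters listed in Table 4 of the paper.
BASE_LR = 8e-4
TOTAL_STEPS = 40_000

def cosine_lr(step: int, base_lr: float = BASE_LR,
              total_steps: int = TOTAL_STEPS, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate at a given training step.

    Assumes annealing from base_lr down to min_lr over total_steps with
    no warmup phase (neither detail is specified in the paper).
    """
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(cosine_lr(0))        # starts at the base rate, 8e-4
print(cosine_lr(20_000))   # halfway through training, approximately 4e-4
print(cosine_lr(40_000))   # fully annealed to min_lr, 0.0
```

In a PyTorch training loop this corresponds to pairing AdamW (with weight decay 1e-5, per Table 4) with a cosine annealing scheduler stepped once per training iteration.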