Causal Graph Transformer for Treatment Effect Estimation Under Unknown Interference

Authors: Anpeng Wu, Haiyi Qiu, Zhengming Chen, Zijian Li, Ruoxuan Xiong, Fei Wu, Kun Zhang

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on two widely-used benchmarks demonstrate the effectiveness and superiority of CauGramer.
Researcher Affiliation | Academia | Zhejiang University, MBZUAI, Guangdong University of Technology, Emory University, Carnegie Mellon University
Pseudocode | Yes | The pseudo-code is placed in Algorithm 1.
Open Source Code | Yes | The code is available at https://github.com/anpwu/CauGramer.
Open Datasets | Yes | The BlogCatalog and Flickr datasets are available at: https://github.com/songjiang0909/Causal-Inference-on-Networked-Data.
Dataset Splits | Yes | Jiang & Sun (2022) use METIS (Karypis & Kumar, 1998) to partition the original networks into three sub-networks as train/valid/test data, with 2482, 2461, and 2358 samples in Flickr, and 1784, 1716, and 1696 samples in BlogCatalog.
Hardware Specification | Yes | Hardware used: (1) MacBook Pro with Apple M2 Pro. (2) Ubuntu 16.04.3 LTS with 2 × Intel Xeon E5-2660 v3 @ 2.60GHz CPUs (40 CPU cores: 10 cores per physical CPU, 2 threads per core), 256 GB of RAM, and 4 × GeForce GTX TITAN X GPUs with 12 GB of VRAM each.
Software Dependencies | Yes | Software used: Python 3.9 with numpy 1.26.4, scipy 1.13.0, pandas 2.2.2, torch 2.3.0, scikit-learn 1.4.2, openpyxl 3.1.2, torch-geometric 2.5.2, torch-scatter 1.1.0.
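For reference, the reported versions map directly onto a pinned requirements file (a sketch; package names follow PyPI conventions, and the torch-scatter 1.1.0 pin is reproduced as reported, even though it is unusually old relative to torch 2.3.0):

```
# requirements.txt — versions exactly as reported in the paper
numpy==1.26.4
scipy==1.13.0
pandas==2.2.2
torch==2.3.0
scikit-learn==1.4.2
openpyxl==3.1.2
torch-geometric==2.5.2
torch-scatter==1.1.0
```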
Experiment Setup | Yes | In this paper, we propose an L-layer (default: 2), M-head (default: 3) cross-attention GCN to learn the representation R_x = g_x(x, A). In each attention head, all neural networks consist of one layer comprising 32 hidden units. We then perform cross-attention computation, which yields the concatenation of the M head embeddings, followed by a feed-forward network that outputs a 32-dimensional representation. ... Then, we use three two-layer linear networks, where each layer comprises 64 hidden units, to regress the treatments T and potential outcomes {Y0, Y1}. ... We use the ReLU activation function and set the dropout rate to 0.1 to mitigate overfitting. Then, the objective function is: ... Then, we adopt Adam optimization with a learning rate of 0.01 and set the number of epochs to 300 to alternately train W_A and {T̂, Ŷ0, Ŷ1}.
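The described forward pass can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the weight shapes, random inputs, and adjacency-masked attention scheme are assumptions, and dropout, the training loop, and the objective function are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, A, Wq, Wk, Wv):
    """One attention head over the graph: scores are masked by the
    adjacency matrix A so each node attends only to its neighbours."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    scores = np.where(A > 0, scores, -1e9)     # mask non-edges
    return softmax(scores, axis=1) @ V

d_in, d_h, M, d_rep = 10, 32, 3, 32            # 32 units per head, M=3 heads
n = 5                                          # toy graph with 5 nodes
X = rng.normal(size=(n, d_in))                 # node covariates x
A = (rng.random((n, n)) < 0.5).astype(float)   # random adjacency A
np.fill_diagonal(A, 1.0)                       # allow self-attention

# M heads, each built from single 32-unit layers (hypothetical shapes)
heads = [attention_head(X, A,
                        rng.normal(size=(d_in, d_h)),
                        rng.normal(size=(d_in, d_h)),
                        rng.normal(size=(d_in, d_h)))
         for _ in range(M)]
H = np.concatenate(heads, axis=1)              # concat of M head embeddings

# feed-forward network down to the 32-dimensional representation R_x
W_ff = rng.normal(size=(M * d_h, d_rep))
Rx = relu(H @ W_ff)

def two_layer_head(R, d_hidden=64):
    """Two-layer regression head with 64 hidden units, as described."""
    W1 = rng.normal(size=(R.shape[1], d_hidden))
    W2 = rng.normal(size=(d_hidden, 1))
    return relu(R @ W1) @ W2

# three separate heads regress T, Y0, and Y1 from R_x
T_hat, Y0_hat, Y1_hat = (two_layer_head(Rx) for _ in range(3))
print(Rx.shape, T_hat.shape)                   # (5, 32) (5, 1)
```

In a trained model these weights would of course be learned (here, via Adam with lr 0.01 for 300 epochs, per the row above); the sketch only traces the tensor shapes through the architecture.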