Isolated Causal Effects of Natural Language
Authors: Victoria Lin, Louis-Philippe Morency, Eli Ben-Michael
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate the validity of our framework on both semi-synthetic and real-world data. Using evaluation settings where the ground truth is known, we observe that our estimation framework is able to recover the true isolated effect across multiple interventions. |
| Researcher Affiliation | Academia | Victoria Lin, Louis-Philippe Morency, Eli Ben-Michael; Carnegie Mellon University, Pittsburgh, PA, USA. Correspondence to: Victoria Lin <EMAIL>. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It provides mathematical derivations in Appendix A but no algorithm pseudo-code. |
| Open Source Code | Yes | Our data and code are publicly available at https://github.com/torylin/isolated-text-effects. |
| Open Datasets | Yes | The Amazon dataset (McAuley & Leskovec, 2013) consists of reviews from the Amazon e-commerce site... The SvT dataset (Dhawan et al., 2024) consists of posts from weight-loss communities on the social media site Reddit... |
| Dataset Splits | Yes | For each non-focal language representation a^c_s(X), we use 5-fold cross-fitting to train an outcome model ĝ to predict Y given a^c_s(X) and a classifier to predict a(X) given a^c_s(X). Within the training folds, we conduct 5-fold cross-validation to select model hyperparameters. |
| Hardware Specification | No | All experiments were conducted on consumer-level machines. Experiments involving language models, such as those with MPNet and SenteCon embeddings, were conducted using consumer-level NVIDIA GPUs. |
| Software Dependencies | Yes | To implement our lexicons, we use the third-party liwc Python library and the empath library released by its creators. SenteCon-LIWC and SenteCon-Empath representations are obtained using the sentecon library released by its creators. BERT and RoBERTa embeddings are obtained via the Hugging Face transformers library using the pre-trained models bert-base-uncased and roberta-base, respectively. MPNet and MiniLM embeddings are obtained via the Hugging Face sentence-transformers library using the pre-trained models all-mpnet-base-v2 and all-MiniLM-L6-v2, respectively. Finally, LLM (GPT-3.5) prompting covariates are taken directly from the SvT dataset released by Dhawan et al. (2024). Additional technical details are provided in Table 2. ... All outcome models and a(X) classifiers are implemented using the scikit-learn Python library (version 1.3.0). |
| Experiment Setup | Yes | Gradient boosting models use a subsample proportion of 0.7, i.e., 70% of training samples are used to fit the individual base learners. Neural networks used for outcome models in the nonlinear Amazon setting are implemented with the MLPRegressor class and tuned over the following possible layer counts and sizes: (128,), (128, 128), (128, 256, 128). Logistic and linear regression models are tuned for L1 ratio over the grid [0.0, 0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0], where 1.0 corresponds to L1 penalty only and 0.0 corresponds to L2 penalty only. Logistic regression models are further tuned for C (inverse regularization strength) over the following search space: [0.001, 0.01, 0.1, 1.0, 10, 100]. For all interventions, the optimal hyperparameters are a linear regression L1 ratio of 0.5, logistic regression L1 ratio of 0.0, and C of 0.001. |
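The "Dataset Splits" row describes 5-fold cross-fitting: for each fold, an outcome model ĝ and a classifier for a(X) are trained on the other four folds and used to produce out-of-fold predictions for the held-out fold. The sketch below illustrates that procedure with scikit-learn (the library named in the table); the variable names, toy data, and estimator settings are illustrative assumptions, not taken from the released code.

```python
# Minimal sketch of 5-fold cross-fitting, assuming toy data in place of
# the paper's non-focal language representation a^c_s(X), focal attribute
# a(X), and outcome Y. Estimator choices here are placeholders.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import ElasticNet, LogisticRegression

rng = np.random.default_rng(0)
X_nonfocal = rng.normal(size=(100, 8))   # stand-in for a^c_s(X)
a_X = rng.integers(0, 2, size=100)       # stand-in for binary a(X)
y = rng.normal(size=100)                 # stand-in for outcome Y

y_hat = np.empty_like(y)                 # out-of-fold outcome predictions
a_hat = np.empty(100)                    # out-of-fold P(a(X) = 1)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X_nonfocal):
    # Outcome model ĝ: predict Y from the non-focal representation,
    # fit only on the training folds.
    g = ElasticNet(alpha=0.1, max_iter=10_000)
    g.fit(X_nonfocal[train_idx], y[train_idx])
    y_hat[test_idx] = g.predict(X_nonfocal[test_idx])

    # Classifier: predict a(X) from the non-focal representation.
    clf = LogisticRegression(max_iter=10_000)
    clf.fit(X_nonfocal[train_idx], a_X[train_idx])
    a_hat[test_idx] = clf.predict_proba(X_nonfocal[test_idx])[:, 1]
```

Because each prediction comes from models that never saw that sample, the fitted nuisance functions can be plugged into the downstream effect estimator without the overfitting bias that in-sample predictions would introduce.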
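The hyperparameter grids in the "Experiment Setup" row map directly onto scikit-learn's grid search with the 5-fold cross-validation depth stated in the table. This is a hedged sketch of that tuning setup on toy data; the solver choice and toy inputs are assumptions, while the grids themselves are copied from the table.

```python
# Sketch of the paper's stated hyperparameter grids, wired into 5-fold
# GridSearchCV. Only the grid values come from the table; the data and
# solver settings below are illustrative.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import ElasticNet, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
y_cont = X @ rng.normal(size=6) + 0.1 * rng.normal(size=120)
y_bin = (y_cont > 0).astype(int)

l1_ratios = [0.0, 0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0]  # 1.0 = pure L1
Cs = [0.001, 0.01, 0.1, 1.0, 10, 100]                    # inverse reg. strength

# Linear regression with elastic-net penalty, tuned over the L1 ratio.
lin_search = GridSearchCV(ElasticNet(max_iter=10_000),
                          {"l1_ratio": l1_ratios}, cv=5)
lin_search.fit(X, y_cont)

# Logistic regression tuned jointly over the L1 ratio and C
# (saga is the sklearn solver that supports elastic-net penalties).
log_search = GridSearchCV(
    LogisticRegression(penalty="elasticnet", solver="saga", max_iter=10_000),
    {"l1_ratio": l1_ratios, "C": Cs}, cv=5)
log_search.fit(X, y_bin)

# MLPRegressor layer grid from the table (search not run here for brevity).
mlp_layer_grid = {"hidden_layer_sizes": [(128,), (128, 128), (128, 256, 128)]}
```

On the paper's data this search reportedly selects an L1 ratio of 0.5 for linear regression and an L1 ratio of 0.0 with C = 0.001 for logistic regression; on the toy data above the selected values will of course differ.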