Robust Text Classification under Confounding Shift
Authors: Virgile Landeiro, Aron Culotta
JAIR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach does not make any causal conclusions but by experimenting on 6 datasets, we show that our approach is able to outperform baselines 1) in controlled cases where confounding shift is manually injected between fitting time and prediction time 2) in natural experiments where confounding shift appears either abruptly or gradually 3) in cases where there is one or multiple confounders. |
| Researcher Affiliation | Academia | Virgile Landeiro EMAIL Aron Culotta EMAIL Department of Computer Science Illinois Institute of Technology Chicago, IL 60616 |
| Pseudocode | No | The paper describes the method using mathematical formulations and prose, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We also provide the code and datasets required to reproduce our experiments on GitHub¹. 1. https://github.com/tapilab/jair-2018-confound |
| Open Datasets | Yes | We also provide the code and datasets required to reproduce our experiments on GitHub¹. 1. https://github.com/tapilab/jair-2018-confound. To build this dataset, we use the data from Maas, Daly, Pham, Huang, Ng, and Potts (2011). It contains 50,000 movie reviews from IMDb labeled with positive or negative sentiment. ... For these experiments, we obtain the data from the 8th round of the Yelp Dataset Challenge. |
| Dataset Splits | Yes | For each btrain, btest pair, we sample 5 train/test splits and report the average accuracy. ... To do so, we fix the training data to an initial time period t, then sample testing data from future time periods t + g. The gap size g determines the time between the training and testing set. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or memory amounts used for running experiments. |
| Software Dependencies | No | The paper mentions using "L2-regularized logistic regression" but does not specify any particular software library or its version number. |
| Experiment Setup | Yes | In our experiments, we use L2-regularized logistic regression. ... L(D, θ) = Σ_{i∈D} log p_θ(y_i \| x_i, z_i) − λ_x Σ_k (θ_k^x)² − λ_z Σ_k (θ_k^z)² (7), where the terms λ_x and λ_z control the regularization strength of the term coefficients and confound coefficients, respectively. A default implementation would set λ_x = λ_z = 1. ... we only assigned the values 1 or 10 to the tuning parameter v of our approach. |
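The objective quoted above (Eq. 7) is a penalized logistic-regression log-likelihood in which the term coefficients θ^x and the confound coefficients θ^z receive separate L2 penalties λ_x and λ_z. A minimal sketch of that objective is below; it is an illustration under stated assumptions, not the authors' released implementation, and all variable names, the synthetic data, and the specific λ values are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical sketch of a logistic regression whose term features (x) and
# confound features (z) get separate L2 regularization strengths, as in the
# paper's Eq. 7. Data and names here are illustrative only.
rng = np.random.default_rng(0)
n, dx, dz = 200, 10, 2
X = rng.normal(size=(n, dx))                          # term features
Z = rng.integers(0, 2, size=(n, dz)).astype(float)    # confound features
y = (X[:, 0] + Z[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(float)

F = np.hstack([X, Z])  # combined design matrix [x; z]

def neg_penalized_loglik(theta, lam_x=1.0, lam_z=10.0):
    """Negative log-likelihood plus separate L2 penalties on theta_x, theta_z."""
    logits = F @ theta
    # Logistic log-likelihood, written stably: y*logit - log(1 + e^logit)
    log_lik = np.sum(y * logits - np.logaddexp(0.0, logits))
    penalty = lam_x * np.sum(theta[:dx] ** 2) + lam_z * np.sum(theta[dx:] ** 2)
    return -log_lik + penalty

theta0 = np.zeros(dx + dz)
res = minimize(neg_penalized_loglik, theta0, method="L-BFGS-B")
theta_hat = res.x
```

Setting λ_z larger than λ_x (here 10 vs. 1, echoing the paper's tuning values of 1 or 10) shrinks the confound coefficients more aggressively than the term coefficients, which is the knob the paper tunes to trade off accuracy against robustness to confounding shift.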