Learning by Self-Explaining
Authors: Wolfgang Stammer, Felix Friedrich, David Steinmann, Manuel Brack, Hikaru Shindo, Kristian Kersting
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide an overview of important components of LSX and, based on this, perform extensive experimental evaluations via three different example instantiations. Our results indicate improvements via Learning by Self-Explaining on several levels: in terms of model generalization, reducing the influence of confounding factors, and providing more task-relevant and faithful model explanations. |
| Researcher Affiliation | Academia | 1Artificial Intelligence and Machine Learning Group, TU Darmstadt 2Hessian Center for Artificial Intelligence (hessian.AI), Darmstadt 3Centre for Cognitive Science, TU Darmstadt 4German Center for Artificial Intelligence (DFKI) |
| Pseudocode | Yes | Algorithm 1 Learning by Self-Explaining in pseudocode. Given: the two submodels, a learner model, f, and internal critic model, c, dataset X = (X, Y) (i.e., input images and corresponding class labels), Xc ⊂ X, iteration budget T, and base task, i.e., image classification. |
| Open Source Code | Yes | Code available at: https://github.com/ml-research/learning-by-self-explaining |
| Open Datasets | Yes | To provide evidence for the benefits of LSX, we examine each instantiation via several suited datasets. Particularly, we examine i) CNN-LSX on the MNIST (LeCun et al., 1989) and ChestMNIST (Yang et al., 2023; Wang et al., 2017) datasets, ii) NeSy-LSX on the concept-based datasets CLEVR-Hans3 (Stammer et al., 2021) and a variant of the Caltech-UCSD Birds-200-2011 dataset (Wah et al., 2011), CUB-10, and iii) VLM-LSX on the VQA-X dataset (Park et al., 2018). Furthermore, for investigating the effect of confounding factors (Q3), we also use the decoy version of CLEVR-Hans3, as well as Decoy MNIST (Ross et al., 2017) and Color MNIST (Kim et al., 2019; Rieger et al., 2020). |
| Dataset Splits | Yes | To provide evidence for the benefits of LSX, we examine each instantiation via several suited datasets. Particularly, we examine i) CNN-LSX on the MNIST (LeCun et al., 1989) and ChestMNIST (Yang et al., 2023; Wang et al., 2017) datasets... We provide results as mean values with standard deviations over five runs with random seeds. For the results in Tab. 2 via CNN-LSX (for both MNIST and ChestMNIST), Xc represented about 1/3 and 1/2 of X, from left column to right column, respectively. For the results in Tab. 3 via CNN-LSX on Decoy MNIST, we present the critic with 512 samples from approximately 60000 training samples for w/ conf. and 1024 test set samples for w/ deconf. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. It mentions computational costs but not the hardware specifications. |
| Software Dependencies | No | The explanation method of CNN-LSX corresponds to the differentiable Input XGradient method described in Shrikumar et al. (2017) and Hechtlinger (2016) and implemented via the Captum PyTorch package. While Captum and PyTorch are mentioned, the specific version numbers required for reproducibility are not provided. |
| Experiment Setup | Yes | When comparing these different training configurations, we use the same setup, i.e., training steps, datasets, hyperparameters etc. In Revise the learner performs the original base task for one epoch via l_B while jointly optimizing for the critic's explanatory feedback of the previous Reflect module. Specifically, the learner optimizes a joint loss: L = l_B + λ l_CE(c(e_i), y_i) for (x_i, y_i) ∈ X and e_i based on Eq. 1. Hereby, λ represents a scaling hyperparameter which we set high (e.g., λ ≈ 100) in our evaluations to prevent the learner from mainly optimizing for good prediction. ... We select temperatures T ∈ {0.01, 0.1, 0.3, 0.6, 0.9}. This process overall results in N_e = 25 different natural language explanations per data sample. ... The number of LSX iterations was set to T = 8 |
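The Fit/Explain/Reflect/Revise cycle summarized in the pseudocode row above can be sketched as a minimal Python loop. The four module callables and the trace-recording usage below are illustrative assumptions about the control flow described in Algorithm 1, not the authors' implementation:

```python
def lsx_loop(fit, explain, reflect, revise, T):
    """Run the LSX cycle for an iteration budget T.

    fit/explain/reflect/revise are placeholder callables standing in
    for the paper's modules (learner f, internal critic c are assumed
    to be closed over by these callables).
    """
    fit()                                 # Fit: learner trains on the base task over X
    for _ in range(T):
        explanations = explain()          # Explain: learner explains its predictions on Xc
        feedback = reflect(explanations)  # Reflect: critic assesses the explanations
        revise(feedback)                  # Revise: learner jointly re-optimizes base task
                                          #         and critic feedback

# Toy usage recording the call order for T = 2 iterations:
trace = []
lsx_loop(lambda: trace.append("fit"),
         lambda: trace.append("explain") or "E",
         lambda E: trace.append("reflect") or "feedback",
         lambda fb: trace.append("revise"),
         T=2)
assert trace == ["fit", "explain", "reflect", "revise",
                 "explain", "reflect", "revise"]
```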
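The joint Revise loss quoted in the experiment-setup row, L = l_B + λ l_CE(c(e_i), y_i), can be sketched as follows; the function names and toy probabilities are assumptions for illustration only, not the paper's code:

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy for a single example given predicted class probabilities."""
    return -math.log(probs[label])

def revise_loss(base_loss, critic_probs, label, lam=100.0):
    """Joint Revise loss: base-task loss l_B plus the critic's
    classification loss on the learner's explanation e_i, scaled by
    lambda (set high, e.g. ~100, so that the learner does not optimize
    for prediction accuracy alone)."""
    return base_loss + lam * cross_entropy(critic_probs, label)

# Toy usage: if the critic assigns high probability to the true class
# from the explanation alone, the added penalty is small.
loss_good = revise_loss(0.5, [0.9, 0.1], label=0)
loss_bad = revise_loss(0.5, [0.1, 0.9], label=0)
assert loss_bad > loss_good
```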