Intrinsic User-Centric Interpretability through Global Mixture of Experts

Authors: Vinitra Swamy, Syrielle Montariol, Julian Blackwell, Jibril Frej, Martin Jaggi, Tanja Käser

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply InterpretCC for text, time series and tabular data across several real-world datasets, demonstrating comparable performance with non-interpretable baselines and outperforming intrinsically interpretable baselines. Through a user study involving 56 teachers, InterpretCC explanations are found to have higher actionability and usefulness over other intrinsically interpretable approaches. (Abstract) ... 5 EXPERIMENTAL RESULTS: Through the following three experiments, we demonstrate that our InterpretCC models do not compromise performance compared to black-box models and provide explanations that are faithful as well as human-centered.
Researcher Affiliation | Academia | Vinitra Swamy (EPFL); Syrielle Montariol (EPFL); Julian Blackwell (EPFL); Jibril Frej (EPFL); Martin Jaggi (EPFL); Tanja Käser (EPFL)
Pseudocode | No | The paper describes the methodology in Section 3 and illustrates architectures in Figure 1, but it does not include explicit pseudocode blocks or algorithms formatted as code.
Open Source Code | Yes | We provide our code open source: https://github.com/epfl-ml4ed/interpretcc.
Open Datasets | Yes | For news categorization (AG News), we classify news into four categories ... (Zhang et al., 2015). ... For sentiment prediction (Stanford Sentiment Treebank, SST), we use 11,855 sentences from movie reviews ... (Socher et al., 2013). ... The Wisconsin Breast Cancer dataset identifies cancerous tissue ... (Wolberg et al., 1995). ... We use OpenXAI's synthetic dataset (Agarwal et al., 2022), which includes ground truth labels and explanations...
Dataset Splits | Yes | We perform an 80-10-10 train-validation-test data split stratified on the output label, to conserve the class imbalance in each subset.
Hardware Specification | No | The paper mentions using fine-tuned DistilBERT models and notes the computational demands, but does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for their experiments.
Software Dependencies | No | The paper references several software components and models like the "Gumbel-Softmax trick (Jang et al., 2017)", "Sentence-BERT (Reimers and Gurevych, 2019)", and "DistilBERT variations", but it does not specify exact version numbers for these or other key software dependencies required for reproducibility.
Experiment Setup | Yes | We run hyperparameter tuning and three different random seeds for each reported model (reproducibility details in Appendix F). Since EDU MOOC courses have a low passing rate (below 30%), and thus the dataset has a heavy class imbalance, we use balanced accuracy for evaluation. The other datasets are more balanced (AG News, SST, Breast Cancer, Synthetic), hence we use accuracy as our evaluation metric. ... For both education and health tasks, a τ of 10 and a Gumbel-Softmax threshold of around 0.7 to 0.8 are performant, sparse in activated features, and relatively stable. ... InterpretCC Top-K expert network solution with k=2 for group routing.
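The reproducibility details quoted above (an 80-10-10 label-stratified split, Gumbel-Softmax gating with τ = 10 and an activation threshold of 0.7–0.8, and top-k group routing with k = 2) can be illustrated with a minimal sketch. The function names and the per-feature Gumbel-sigmoid gating form below are assumptions for illustration, not the paper's exact architecture:

```python
import numpy as np

def stratified_split_indices(y, fracs=(0.8, 0.1, 0.1), rng=None):
    """80-10-10 train/val/test split stratified on the label (sketch of the
    paper's protocol): each class is divided in the same proportions, so the
    class imbalance is conserved in every subset."""
    rng = rng if rng is not None else np.random.default_rng(0)
    train, val, test = [], [], []
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        n = len(idx)
        a, b = int(fracs[0] * n), int((fracs[0] + fracs[1]) * n)
        train += list(idx[:a]); val += list(idx[a:b]); test += list(idx[b:])
    return np.array(train), np.array(val), np.array(test)

def gumbel_feature_gates(logits, tau=10.0, threshold=0.7, rng=None):
    """Stochastic per-feature gates via Gumbel noise, hard-thresholded into a
    sparse binary mask. tau=10 and threshold in [0.7, 0.8] are the values the
    paper reports as performant; the gating form itself is an assumption."""
    rng = rng if rng is not None else np.random.default_rng(0)
    g1 = -np.log(-np.log(rng.random(logits.shape)))  # Gumbel(0, 1) noise
    g2 = -np.log(-np.log(rng.random(logits.shape)))
    soft = 1.0 / (1.0 + np.exp(-(logits + g1 - g2) / tau))  # relaxed gate
    hard = (soft >= threshold).astype(float)  # sparse binary feature mask
    return soft, hard

def top_k_groups(routing_scores, k=2):
    """Route to the k highest-scoring feature groups (k=2 for group routing)."""
    return np.argsort(routing_scores)[::-1][:k]

# Toy usage with placeholder data (not the paper's datasets)
y = np.array([0] * 70 + [1] * 30)          # ~30% positive, heavy imbalance
tr, va, te = stratified_split_indices(y)   # 80 / 10 / 10 examples
soft, mask = gumbel_feature_gates(np.array([4.0, -3.0, 0.5, -6.0]))
experts = top_k_groups(np.array([0.1, 0.7, 0.2, 0.9]))  # -> groups 3 and 1
```

The split helper shows why stratification matters here: with a sub-30% passing rate, a naive random split could leave a validation or test fold with almost no positive examples, which is also why balanced accuracy is used on the imbalanced EDU datasets.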