Concept Bottleneck Large Language Models
Authors: Chung-En Sun, Tuomas Oikarinen, Berk Ustun, Tsui-Wei Weng
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We build CB-LLMs for two essential NLP tasks: text classification and text generation. In text classification, CB-LLMs are competitive with, and at times outperform, traditional black-box models while providing explicit and interpretable reasoning. In this section, we evaluate our CB-LLMs (classification) in terms of three crucial aspects: Accuracy, Efficiency, and Faithfulness. The test accuracy is shown in Table 2. In general, our CB-LLMs (classification) demonstrate high accuracy across various datasets, including large ones such as Yelp P and DBpedia. We conduct human evaluations through Amazon Mechanical Turk (MTurk) for Tasks 1 and 2 to compare our CB-LLMs (classification) with TBM & C3M. |
| Researcher Affiliation | Academia | Chung-En Sun, Tuomas Oikarinen, Berk Ustun, Tsui-Wei Weng, University of California San Diego |
| Pseudocode | No | The paper describes the method using a 5-step process for classification and two modules for generation, along with equations and diagrams (Figure 1, Figure 3). However, it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured procedural steps in a code-like format. |
| Open Source Code | Yes | Our code is available at https://github.com/TrustworthyML-Lab/CB-LLMs. |
| Open Datasets | Yes | We work with four datasets for text classification: SST2 [15], Yelp Polarity (Yelp P) [22], AGnews [22], and DBpedia [6]. [15] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013. [22] Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In NeurIPS, 2015. [6] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, S. Auer, and Christian Bizer. DBpedia: a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 2015. |
| Dataset Splits | No | The paper mentions 'Yelp P and DBpedia contain 560, 000 training samples' for classification and 'reduce the size of Yelp P, AGnews, and DBpedia to 100k samples' for generation. It also refers to 'test accuracy'. However, it does not explicitly provide the specific percentages or counts for training, validation, and test splits for all datasets, nor does it detail the methodology for creating these splits in a reproducible manner for all cases. |
| Hardware Specification | No | The paper acknowledges 'computing support from CIS230154 in Advanced Cyberinfrastructure Coordination Ecosystem' but does not specify any particular hardware components like GPU models, CPU models, or detailed computer specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'all-mpnet-base-v2 from Huggingface [18]' and 'RoBERTa-base [7] and GPT2 [12] pretrained model as the backbone'. It also refers to 'Llama3-8B-Instruct [1]' and 'bart-large-mnli'. While these are specific models and frameworks, the paper does not provide specific version numbers for underlying software dependencies such as Python, PyTorch/TensorFlow, or the Huggingface library itself, which are necessary for full reproducibility. |
| Experiment Setup | Yes | After obtaining A⁺_N, we train a final linear layer with a sparsity constraint to make the final text classification interpretable: min_{W_F, b_F} (1/\|D\|) Σ_{(x,y)∈D} L_CE(W_F A⁺_N(x) + b_F, y) + λ R(W_F) (Eq. 4), where W_F ∈ ℝ^{n×k} is the weight matrix and b_F ∈ ℝ^n is the bias vector of the final linear layer, y is the label of x, and R(W) = α‖W‖₁ + (1−α)·(1/2)‖W‖₂² is the elastic-net regularizer, a combination of the ℓ1 and ℓ2 penalties. λ is set to 0.0007 and α is set to 0.99. Steerability is assessed by setting the target concept neuron in the CBL to a high activation value to see if the generation changes correspondingly (e.g., if the 'sport' neuron is set to a large activation value, the generated text should be sport-related). We fine-tuned a chatbot using Llama3-8B with a combination of Toxic DPOqa and toxic-chat, incorporating four interpretable neurons. |
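To make the objective in Eq. (4) concrete, here is a minimal NumPy sketch of the elastic-net-regularized cross-entropy loss over concept activations. The function and variable names (`elastic_net_penalty`, `interpretable_loss`, the shapes of `A`, `W_F`, `b_F`) are illustrative assumptions, not taken from the paper's released code; the defaults λ = 0.0007 and α = 0.99 follow the values quoted above.

```python
import numpy as np

def elastic_net_penalty(W, alpha=0.99):
    """R(W) = alpha * ||W||_1 + (1 - alpha) * 0.5 * ||W||_2^2,
    the elastic-net regularizer from Eq. (4): the l1 term encourages a
    sparse, interpretable weight matrix; the l2 term keeps weights small."""
    l1 = np.abs(W).sum()
    l2 = 0.5 * (W ** 2).sum()
    return alpha * l1 + (1.0 - alpha) * l2

def interpretable_loss(W_F, b_F, A, y, lam=0.0007, alpha=0.99):
    """Mean cross-entropy of logits W_F @ a + b_F plus the sparsity penalty.

    A:   (m, k) concept activations A+_N(x) for m samples, k concepts
    y:   (m,)   integer class labels
    W_F: (n, k) weights mapping k concepts to n classes
    b_F: (n,)   bias vector
    """
    logits = A @ W_F.T + b_F                       # (m, n)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(y)), y].mean()
    return ce + lam * elastic_net_penalty(W_F, alpha)
```

In practice this objective would be minimized with a gradient-based optimizer over W_F and b_F while the concept activations are held fixed; the sketch only shows how the two terms of Eq. (4) combine.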