Concept Bottleneck Large Language Models
Authors: Chung-En Sun, Tuomas Oikarinen, Berk Ustun, Tsui-Wei Weng
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We build CB-LLMs for two essential NLP tasks: text classification and text generation. In text classification, CB-LLMs are competitive with, and at times outperform, traditional black-box models while providing explicit and interpretable reasoning. In this section, we evaluate our CB-LLMs (classification) in terms of three crucial aspects: Accuracy, Efficiency, and Faithfulness. The test accuracy is shown in Table 2. In general, our CB-LLMs (classification) demonstrate high accuracy across various datasets, including large ones such as Yelp P and DBpedia. We conduct human evaluations through Amazon Mechanical Turk (MTurk) for Tasks 1 and 2 to compare our CB-LLMs (classification) with TBM & C3M. |
| Researcher Affiliation | Academia | Chung-En Sun, Tuomas Oikarinen, Berk Ustun, Tsui-Wei Weng, University of California San Diego |
| Pseudocode | No | The paper describes the method using a 5-step process for classification and two modules for generation, along with equations and diagrams (Figure 1, Figure 3). However, it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured procedural steps in a code-like format. |
| Open Source Code | Yes | Our code is available at https://github.com/TrustworthyML-Lab/CB-LLMs. |
| Open Datasets | Yes | We work with four datasets for text classification: SST2 [15], Yelp Polarity (Yelp P) [22], AGnews [22], and DBpedia [6]. [15] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013. [22] Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In NeurIPS, 2015. [6] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, S. Auer, and Christian Bizer. DBpedia: a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 2015. |
| Dataset Splits | No | The paper mentions 'Yelp P and DBpedia contain 560, 000 training samples' for classification and 'reduce the size of Yelp P, AGnews, and DBpedia to 100k samples' for generation. It also refers to 'test accuracy'. However, it does not explicitly provide the specific percentages or counts for training, validation, and test splits for all datasets, nor does it detail the methodology for creating these splits in a reproducible manner for all cases. |
| Hardware Specification | No | The paper acknowledges 'computing support from CIS230154 in Advanced Cyberinfrastructure Coordination Ecosystem' but does not specify any particular hardware components like GPU models, CPU models, or detailed computer specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'all-mpnet-base-v2 from Huggingface [18]' and 'RoBERTa-base [7] and GPT2 [12] pretrained model as the backbone'. It also refers to 'Llama3-8B-Instruct [1]' and 'bart-large-mnli'. While these are specific models and frameworks, the paper does not provide specific version numbers for underlying software dependencies such as Python, PyTorch/TensorFlow, or the Huggingface library itself, which are necessary for full reproducibility. |
| Experiment Setup | Yes | After obtaining A⁺_N, we train a final linear layer with a sparsity constraint to make the final text classification interpretable: min_{W_F, b_F} (1/\|D\|) Σ_{(x,y)∈D} L_CE(W_F A⁺_N(x) + b_F, y) + λ R(W_F) (Eq. 4), where W_F ∈ ℝ^{n×k} is the weight matrix and b_F ∈ ℝ^n is the bias vector of the final linear layer, y is the label of x, and R(W) = α‖W‖₁ + (1−α)·(1/2)‖W‖₂² is the elastic-net regularizer, a combination of the ℓ1 and ℓ2 penalties. λ is set to 0.0007 and α is set to 0.99. Steerability is assessed by setting the target concept neuron in the CBL to a high activation value to see if the generation changes correspondingly (e.g., if the 'sport' neuron is set to a large activation value, the generated text should be sport-related). We fine-tuned a chatbot using Llama3-8B with a combination of Toxic DPOqa and toxic-chat, incorporating four interpretable neurons. |
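To make the objective in Eq. (4) concrete, here is a minimal NumPy sketch of the elastic-net-regularized cross-entropy loss over concept activations. The function and variable names (`elastic_net_penalty`, `interpretable_loss`, the shapes of `A`, `W_F`, `b_F`) are illustrative assumptions, not taken from the paper's released code; the defaults λ = 0.0007 and α = 0.99 follow the values quoted above.

```python
import numpy as np

def elastic_net_penalty(W, alpha=0.99):
    """R(W) = alpha * ||W||_1 + (1 - alpha) * 0.5 * ||W||_2^2,
    the elastic-net regularizer from Eq. (4): the l1 term encourages a
    sparse, interpretable weight matrix; the l2 term keeps weights small."""
    l1 = np.abs(W).sum()
    l2 = 0.5 * (W ** 2).sum()
    return alpha * l1 + (1.0 - alpha) * l2

def interpretable_loss(W_F, b_F, A, y, lam=0.0007, alpha=0.99):
    """Mean cross-entropy of logits W_F @ a + b_F plus the sparsity penalty.

    A:   (m, k) concept activations A+_N(x) for m samples, k concepts
    y:   (m,)   integer class labels
    W_F: (n, k) weights mapping k concepts to n classes
    b_F: (n,)   bias vector
    """
    logits = A @ W_F.T + b_F                       # (m, n)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(y)), y].mean()
    return ce + lam * elastic_net_penalty(W_F, alpha)
```

In practice this objective would be minimized with a gradient-based optimizer over W_F and b_F while the concept activations are held fixed; the sketch only shows how the two terms of Eq. (4) combine.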