Large Language Models are Interpretable Learners
Authors: Ruochen Wang, Si Si, Felix Yu, Dorothea Wiesmann, Cho-Jui Hsieh, Inderjit Dhillon
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the effectiveness of LSPs in extracting interpretable and accurate knowledge from data, we introduce IL-Bench, a collection of diverse tasks, including both synthetic and real-world scenarios across different modalities. Empirical results demonstrate LSP's superior performance compared to traditional neurosymbolic programs and vanilla automatic prompt tuning methods. |
| Researcher Affiliation | Collaboration | Ruochen Wang UCLA Si Si Google Felix Yu Google Dorothea Wiesmann Google Cho-Jui Hsieh Google, UCLA Inderjit Dhillon Google |
| Pseudocode | Yes | Algorithm 1 learn_llm_module: Learning LLM Module by summarizing predictive rules Algorithm 2 Complete pipeline of optimizing LSPs |
| Open Source Code | No | Our code and benchmark will be released for future research. |
| Open Datasets | Yes | To evaluate model proficiency in intricate real-world scenarios, we utilize Fine-Grained Visual Classification (FGVC) datasets (Maji et al., 2013; Wah et al., 2011; Kramberger & Potočnik, 2020; Nilsback & Zisserman, 2008; Van Horn et al., 2015), such as CUB, commonly used in XAI research. |
| Dataset Splits | No | The paper does not explicitly provide specific training/test/validation dataset split percentages, sample counts, or detailed splitting methodologies. |
| Hardware Specification | No | The paper mentions specific LLM models (e.g., GPT-4V, Gemini-Vision) used for experiments, but does not provide specific hardware details such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | Yes | For language tasks, we test popular LLMs, including GPT-3.5 (turbo-1104) (Ouyang et al., 2022), GPT-4 (1106-preview) (Achiam et al., 2023), and Gemini-M (1.0-pro) (Team et al., 2023). For vision tasks, GPT-4V (1106-vision-preview) and Gemini-Vision (1.5-flash) are utilized. |
| Experiment Setup | Yes | LSP: Throughout our main experiment, we use an expansion ratio of 4, a batch size of 64, a maximum of four iterations, and a maximum of 8 candidate (LLM module) proposals per iteration. The beam-search settings follow those of APO, which uses a beam size of 4 and deploys the UCB-Bandits algorithm with a sample size of 32 to speed up candidate ranking (Pryzant et al., 2023). The only exception is that for vision tasks, we use a batch size of 4 to reduce cost. The temperature for all API models is set to the default (0.7). |
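The Experiment Setup row describes an optimization loop: each beam entry proposes candidate LLM modules (expansion ratio 4), and a UCB-bandit procedure ranks candidates on minibatches before the beam (size 4) is pruned, over at most four iterations. The sketch below illustrates that control flow under stated assumptions. All function names (`propose_candidates`, `minibatch_accuracy`, `optimize_lsp`) and the toy scoring proxy are illustrative inventions, not the authors' released code; real LLM proposal and evaluation calls are replaced with deterministic stubs.

```python
import math
import random

# Hyperparameters taken from the paper's reported setup.
BEAM_SIZE = 4        # beam size (follows APO)
EXPANSION_RATIO = 4  # candidate prompts proposed per beam entry
MAX_ITERS = 4        # maximum number of optimization iterations
SAMPLE_SIZE = 32     # minibatch size used by the UCB-bandit ranking

def propose_candidates(prompt, k=EXPANSION_RATIO):
    """Stand-in for asking an LLM to propose k refined LLM modules."""
    return [f"{prompt}+v{i}" for i in range(k)]

def minibatch_accuracy(prompt, rng):
    """Stand-in for scoring a candidate on SAMPLE_SIZE held-out examples."""
    # Toy proxy: more refinement rounds score slightly higher, plus noise
    # averaged over a SAMPLE_SIZE-example minibatch.
    noise = sum(rng.uniform(-0.1, 0.1) for _ in range(SAMPLE_SIZE)) / SAMPLE_SIZE
    return min(1.0, 0.5 + 0.05 * prompt.count("+")) + noise

def ucb_rank(candidates, rng, rounds=3, top_k=BEAM_SIZE):
    """UCB-bandit ranking: spend more evaluations on promising candidates."""
    pulls = {c: 0 for c in candidates}
    total = {c: 0.0 for c in candidates}
    for t in range(1, rounds * len(candidates) + 1):
        def ucb(c):
            if pulls[c] == 0:
                return float("inf")  # evaluate every arm at least once
            return total[c] / pulls[c] + math.sqrt(2 * math.log(t) / pulls[c])
        arm = max(candidates, key=ucb)
        total[arm] += minibatch_accuracy(arm, rng)
        pulls[arm] += 1
    mean = lambda c: total[c] / max(pulls[c], 1)
    return sorted(candidates, key=mean, reverse=True)[:top_k]

def optimize_lsp(seed_prompt="classify"):
    """Beam search over LLM-module prompts, pruned by UCB ranking."""
    rng = random.Random(0)
    beam = [seed_prompt]
    for _ in range(MAX_ITERS):
        pool = []
        for p in beam:
            pool.extend(propose_candidates(p))
        beam = ucb_rank(pool, rng)
    return beam

best = optimize_lsp()
print(best[0])
```

The bandit step is what makes the paper's sample size of 32 matter: instead of fully evaluating every `beam_size × expansion_ratio` candidate on the whole training set, each UCB pull costs only one minibatch, concentrating evaluation budget on the current front-runners.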