Large Language Models are Interpretable Learners

Authors: Ruochen Wang, Si Si, Felix Yu, Dorothea Wiesmann, Cho-Jui Hsieh, Inderjit Dhillon

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate the effectiveness of LSPs in extracting interpretable and accurate knowledge from data, we introduce IL-Bench, a collection of diverse tasks, including both synthetic and real-world scenarios across different modalities. Empirical results demonstrate LSP's superior performance compared to traditional neurosymbolic programs and vanilla automatic prompt tuning methods.
Researcher Affiliation | Collaboration | Ruochen Wang (UCLA), Si Si (Google), Felix Yu (Google), Dorothea Wiesmann (Google), Cho-Jui Hsieh (Google, UCLA), Inderjit Dhillon (Google)
Pseudocode | Yes | Algorithm 1 (learn_llm_module): learning an LLM module by summarizing predictive rules; Algorithm 2: complete pipeline of optimizing LSPs
Open Source Code | No | Our code and benchmark will be released for future research.
Open Datasets | Yes | To evaluate model proficiency in intricate real-world scenarios, we utilize Fine-Grained Visual Classification (FGVC) datasets (Maji et al., 2013; Wah et al., 2011; Kramberger & Potočnik, 2020; Nilsback & Zisserman, 2008; Van Horn et al., 2015), such as CUB, commonly used in XAI research.
Dataset Splits | No | The paper does not explicitly provide specific training/test/validation dataset split percentages, sample counts, or detailed splitting methodologies.
Hardware Specification | No | The paper mentions specific LLM models (e.g., GPT-4V, Gemini-Vision) used for experiments, but does not provide specific hardware details such as GPU models, CPU types, or memory specifications.
Software Dependencies | Yes | For language tasks, we test popular LLMs, including GPT-3.5 (turbo-1104) (Ouyang et al., 2022), GPT-4 (1106-preview) (Achiam et al., 2023), and Gemini-M (1.0-pro) (Team et al., 2023). For vision tasks, GPT-4V (1106-vision-preview) and Gemini-Vision (1.5-flash) are utilized.
Experiment Setup | Yes | LSP: Throughout our main experiment, we use an expansion ratio of 4, a batch size of 64, a maximum of four iterations, and a maximum of 8 candidate (LLM module) proposals per iteration. The settings for beam search follow those of APO, which uses a beam size of 4 and deploys a UCB-bandits algorithm with a sample size of 32 to speed up candidate ranking (Pryzant et al., 2023). The only exception is that for vision tasks we use a batch size of 4 to reduce cost. The temperature for all API models is set to the default (0.7).
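The experiment-setup row above describes a beam-search loop over candidate LLM modules with the paper's stated hyperparameters. A minimal, runnable sketch of that loop is shown below; only the hyperparameter values come from the paper, while `propose_candidates` and `score` are hypothetical stand-ins for the LLM proposal step and the (UCB-bandit-accelerated) minibatch accuracy evaluation.

```python
# Hyperparameters as stated in the paper's experiment setup.
EXPANSION_RATIO = 4    # child proposals per beam entry
MAX_ITERATIONS = 4     # maximum optimization iterations
MAX_PROPOSALS = 8      # candidate (LLM module) proposals per iteration
BEAM_SIZE = 4          # beam width, following APO (Pryzant et al., 2023)
UCB_SAMPLE_SIZE = 32   # samples used to rank candidates (UCB bandit in APO)

def propose_candidates(program: str, n: int) -> list[str]:
    """Hypothetical stand-in for the LLM proposal step: in the real
    pipeline, an LLM rewrites the module's rules from error examples."""
    return [f"{program}+rule{i}" for i in range(n)]

def score(program: str, batch: list) -> float:
    """Hypothetical stand-in for minibatch accuracy (deterministic stub)."""
    return (len(program) % 10) / 10

def beam_search(dataset: list, seed_program: str = "root") -> str:
    """Greedy beam search over candidate programs, APO-style."""
    beam = [seed_program]
    for _ in range(MAX_ITERATIONS):
        # Expand each beam entry, capping total proposals per iteration.
        candidates = []
        for prog in beam:
            candidates.extend(propose_candidates(prog, EXPANSION_RATIO))
        candidates = candidates[:MAX_PROPOSALS]
        # Rank on a sampled minibatch; APO's UCB bandit allocates these
        # evaluations adaptively instead of scoring every candidate fully.
        batch = dataset[:UCB_SAMPLE_SIZE]
        candidates.sort(key=lambda p: score(p, batch), reverse=True)
        beam = candidates[:BEAM_SIZE]
    return beam[0]
```

This sketches only the outer optimization (Algorithm 2 in the paper's terminology); the rule-summarization step inside each proposal is where the LLM does the interpretable learning.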