Large Language Models are Interpretable Learners

Authors: Ruochen Wang, Si Si, Felix Yu, Dorothea Wiesmann, Cho-Jui Hsieh, Inderjit Dhillon

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate the effectiveness of LSPs in extracting interpretable and accurate knowledge from data, we introduce IL-Bench, a collection of diverse tasks, including both synthetic and real-world scenarios across different modalities. Empirical results demonstrate LSP's superior performance compared to traditional neurosymbolic programs and vanilla automatic prompt tuning methods.
Researcher Affiliation | Collaboration | Ruochen Wang (UCLA), Si Si (Google), Felix Yu (Google), Dorothea Wiesmann (Google), Cho-Jui Hsieh (Google, UCLA), Inderjit Dhillon (Google)
Pseudocode | Yes | Algorithm 1 (learn_llm_module): learning an LLM module by summarizing predictive rules; Algorithm 2: complete pipeline of optimizing LSPs
Open Source Code | No | Our code and benchmark will be released for future research.
Open Datasets | Yes | To evaluate model proficiency in intricate real-world scenarios, we utilize Fine-Grained Visual Classification (FGVC) datasets (Maji et al., 2013; Wah et al., 2011; Kramberger & Potočnik, 2020; Nilsback & Zisserman, 2008; Van Horn et al., 2015), such as CUB, commonly used in XAI research.
Dataset Splits | No | The paper does not explicitly provide specific training/test/validation dataset split percentages, sample counts, or detailed splitting methodologies.
Hardware Specification | No | The paper mentions specific LLM models (e.g., GPT-4V, Gemini-Vision) used for experiments, but does not provide specific hardware details such as GPU models, CPU types, or memory specifications.
Software Dependencies | Yes | For language tasks, we test popular LLMs, including GPT-3.5 (turbo-1104) (Ouyang et al., 2022), GPT-4 (1106-preview) (Achiam et al., 2023), and Gemini-M (1.0-pro) (Team et al., 2023). For vision tasks, GPT-4V (1106-vision-preview) and Gemini-Vision (1.5-flash) are utilized.
Experiment Setup | Yes | LSP: Throughout our main experiment, we use an expansion ratio of 4, a batch size of 64, a maximum of four iterations, and a maximum of 8 candidate (LLM module) proposals per iteration. The settings for beam search follow those of APO, which uses a beam size of 4 and deploys a UCB-bandits algorithm with a sample size of 32 to speed up candidate ranking (Pryzant et al., 2023). The only exception is that for vision tasks we use a batch size of 4 to reduce cost. The temperature for all API models is set to the default (0.7).
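The experiment-setup row above describes a beam-search loop over candidate LLM modules with the paper's stated hyperparameters. A minimal, runnable sketch of that loop is shown below; only the hyperparameter values come from the paper, while `propose_candidates` and `score` are hypothetical stand-ins for the LLM proposal step and the (UCB-bandit-accelerated) minibatch accuracy evaluation.

```python
# Hyperparameters as stated in the paper's experiment setup.
EXPANSION_RATIO = 4    # child proposals per beam entry
MAX_ITERATIONS = 4     # maximum optimization iterations
MAX_PROPOSALS = 8      # candidate (LLM module) proposals per iteration
BEAM_SIZE = 4          # beam width, following APO (Pryzant et al., 2023)
UCB_SAMPLE_SIZE = 32   # samples used to rank candidates (UCB bandit in APO)

def propose_candidates(program: str, n: int) -> list[str]:
    """Hypothetical stand-in for the LLM proposal step: in the real
    pipeline, an LLM rewrites the module's rules from error examples."""
    return [f"{program}+rule{i}" for i in range(n)]

def score(program: str, batch: list) -> float:
    """Hypothetical stand-in for minibatch accuracy (deterministic stub)."""
    return (len(program) % 10) / 10

def beam_search(dataset: list, seed_program: str = "root") -> str:
    """Greedy beam search over candidate programs, APO-style."""
    beam = [seed_program]
    for _ in range(MAX_ITERATIONS):
        # Expand each beam entry, capping total proposals per iteration.
        candidates = []
        for prog in beam:
            candidates.extend(propose_candidates(prog, EXPANSION_RATIO))
        candidates = candidates[:MAX_PROPOSALS]
        # Rank on a sampled minibatch; APO's UCB bandit allocates these
        # evaluations adaptively instead of scoring every candidate fully.
        batch = dataset[:UCB_SAMPLE_SIZE]
        candidates.sort(key=lambda p: score(p, batch), reverse=True)
        beam = candidates[:BEAM_SIZE]
    return beam[0]
```

This sketches only the outer optimization (Algorithm 2 in the paper's terminology); the rule-summarization step inside each proposal is where the LLM does the interpretable learning.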