VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models
Authors: Lisa Dunlap, Krishna Mandal, Trevor Darrell, Jacob Steinhardt, Joseph E. Gonzalez
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We run VibeCheck on several datasets to evaluate its effectiveness across different scenarios in Section 5. First, we validate that the vibes discovered by VibeCheck align well with human-annotated differences between ChatGPT and human responses using the Human ChatGPT Comparison Corpus (HC3). Next, we demonstrate that VibeCheck outperforms a predefined list of vibes in predicting user preferences on real-world comparison data from Chatbot Arena, achieving 80% accuracy at predicting model identity and 61% accuracy at predicting user preference. Lastly, in Section 6 we apply VibeCheck to several applications: text summarization on CNN/Daily Mail, math problem-solving on MATH, and image captioning on COCO. |
| Researcher Affiliation | Academia | Lisa Dunlap (UC Berkeley), Krishna Mandal (UC Berkeley), Trevor Darrell (UC Berkeley), Jacob Steinhardt (UC Berkeley), Joseph Gonzalez (UC Berkeley) |
| Pseudocode | No | The paper describes the VibeCheck system and its three stages (vibe discovery, vibe validation, and vibe iteration) in Section 4, detailing the implementation and prompts used. However, it does not present this method in a structured pseudocode or algorithm block format. |
| Open Source Code | Yes | Code can be found at https://github.com/lisadunlap/VibeCheck |
| Open Datasets | Yes | We validate that the vibes generated by VibeCheck align with those found in human discovery and run VibeCheck on pairwise preference data from real-world user conversations with Llama-3-70b vs GPT-4. VibeCheck reveals that Llama has a friendly, funny, and somewhat controversial vibe. These vibes predict model identity with 80% accuracy and human preference with 61% accuracy. Lastly, we run VibeCheck on a variety of models and tasks including summarization, math, and captioning to provide insight into differences in model behavior. VibeCheck discovers vibes like Command X prefers to add concrete intros and conclusions when summarizing in comparison to TNLG, Llama-405b often overexplains its thought process on math problems compared to GPT-4o, and GPT-4 prefers to focus on the mood and emotions of the scene when captioning compared to Gemini-1.5-Flash. Code can be found at https://github.com/lisadunlap/VibeCheck |
| Dataset Splits | Yes | Table 3: Dataset Statistics |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory, or cloud computing instances used for running its experiments or training the models discussed. It mentions LLMs like GPT-4o and Llama-3-70b as judges or models being compared, but not the underlying hardware. |
| Software Dependencies | No | The paper mentions several LLMs and models used, such as GPT-4o (OpenAI, 2024), Llama-3-70b (AI@Meta, 2024), GPT-4o-mini (OpenAI, 2024), Claude 3.5 Sonnet, Llama-3-405b, GPT-4V (OpenAI, 2023), Gemini-1.5-Flash (Reid et al., 2024), TNLG v2 (Smith et al., 2022), Cohere Command X large Beta (Inc., 2023), and the hkunlp/instructor-xl model. However, it does not provide version numbers for general ancillary software components like programming languages (e.g., Python), libraries (e.g., PyTorch), or operating systems, which are crucial for full reproducibility. |
| Experiment Setup | Yes | Experimental setup. Unless otherwise stated, we run VibeCheck for 3 iterations, use a proposer batch size of 5, and set Ddiscovery to be 20 samples per iteration. Some datasets such as MATH, CNN/Daily Mail, and COCO captions have no pre-computed preference labels; to simulate preferences, we apply LLM-as-a-judge and ensemble GPT-4o and Claude 3.5 Sonnet as a judge using a similar procedure to (Zheng et al., 2023), removing any samples declared a tie. Additional details on the experimental setup and hyperparameters are given in Section A. Table 5: VibeCheck Hyperparameters [lists specific parameters like d, batch, num eval vibes, num final vibes, iterations for each dataset]. |
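The preference-simulation step in the experiment-setup row can be sketched as follows. This is a minimal illustration, not the authors' code: `simulate_preferences`, the stub vote tables, and the "drop on any disagreement or tie" rule are assumptions standing in for the actual GPT-4o and Claude 3.5 Sonnet judge calls and the (Zheng et al., 2023)-style judging procedure.

```python
def ensemble_preference(vote1: str, vote2: str):
    """Return the agreed winner ("A" or "B"), or None on a tie/disagreement."""
    if vote1 == vote2 and vote1 in ("A", "B"):
        return vote1
    return None


def simulate_preferences(samples, judge1, judge2):
    """Label each sample with an ensembled preference, dropping ties.

    `judge1` and `judge2` are hypothetical callables standing in for the
    two LLM judges; each maps a sample to "A", "B", or "tie".
    """
    labeled = []
    for sample in samples:
        winner = ensemble_preference(judge1(sample), judge2(sample))
        if winner is not None:  # samples declared a tie are removed
            labeled.append((sample, winner))
    return labeled


# Toy usage with dictionary lookups standing in for the LLM judge calls.
votes1 = {"q1": "A", "q2": "tie", "q3": "B"}
votes2 = {"q1": "A", "q2": "A", "q3": "B"}
kept = simulate_preferences(["q1", "q2", "q3"], votes1.get, votes2.get)
# q2 is dropped because one judge declared a tie.
```

Requiring both judges to agree before keeping a label is one plausible ensembling rule; the paper itself only states that the two judges are ensembled and tied samples removed.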