Idiosyncrasies in Large Language Models

Authors: Mingjie Sun, Yida Yin, Zhiqiu Xu, J. Zico Kolter, Zhuang Liu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate this synthetic task across various groups of LLMs and find that simply fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy. Notably, we achieve 97.1% accuracy on held-out validation data in the five-way classification problem involving ChatGPT, Claude, Grok, Gemini, and DeepSeek.
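The five-way setup above can be sketched as follows. This is a minimal illustration only: the classifier itself (a fine-tuned text embedding model in the paper) is abstracted away, and the toy predictions are assumptions; only the label set and the accuracy metric come from the quoted text.

```python
# Five source LLMs in the classification problem described in the paper.
LABELS = ["ChatGPT", "Claude", "Grok", "Gemini", "DeepSeek"]

def accuracy(predictions, ground_truth):
    """Fraction of held-out responses whose source LLM is predicted correctly."""
    assert len(predictions) == len(ground_truth)
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

# Toy example: 4 of 5 validation responses classified correctly.
preds = ["ChatGPT", "Claude", "Claude", "Gemini", "DeepSeek"]
truth = ["ChatGPT", "Claude", "Grok", "Gemini", "DeepSeek"]
print(accuracy(preds, truth))  # 0.8
```

The reported 97.1% is this metric computed on the 1K-sequence validation set.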
Researcher Affiliation | Academia | Carnegie Mellon University, UC Berkeley, University of Pennsylvania, Princeton University.
Pseudocode | No | The paper describes the methodology in narrative text and presents results in tables and figures, but no structured pseudocode or algorithm blocks are provided.
Open Source Code | Yes | Code is available at github.com/locuslab/llm-idiosyncrasies.
Open Datasets | Yes | For chat APIs and instruct LLMs, we generate outputs from UltraChat (Ding et al., 2023), a diverse dialogue and instruction dataset. For base LLMs, we synthesize new texts using prompts from FineWeb (Penedo et al., 2024), a high-quality LLM pretraining dataset. [...] To evaluate this, we collect responses from instruct LLMs across four diverse datasets: UltraChat, Cosmopedia (Ben Allal et al., 2024), LMSYS-Chat (Zheng et al., 2024), and WildChat (Zhao et al., 2024).
Dataset Splits | Yes | For a given prompt dataset, we collect 11K text sequences, splitting them into 10K and 1K as training and validation sets, respectively. The same split is used across all LLMs.
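The 10K/1K split described above can be sketched as below. The fixed seed and the shuffle-by-index approach are our assumptions, chosen so that the same index split can be reused across every LLM's responses, as the quote requires.

```python
import random

def train_val_split(sequences, n_train=10_000, n_val=1_000, seed=0):
    """Shuffle indices once with a fixed seed so the identical split
    can be applied to each LLM's 11K collected sequences."""
    assert len(sequences) >= n_train + n_val
    rng = random.Random(seed)
    indices = list(range(len(sequences)))
    rng.shuffle(indices)
    train = [sequences[i] for i in indices[:n_train]]
    val = [sequences[i] for i in indices[n_train:n_train + n_val]]
    return train, val

# 11K collected sequences -> 10K train, 1K validation.
seqs = [f"response-{i}" for i in range(11_000)]
train, val = train_val_split(seqs)
print(len(train), len(val))  # 10000 1000
```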
Hardware Specification | No | The paper describes the models and training setup, but does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types).
Software Dependencies | No | The paper mentions several software components and models such as ELMo, BERT, T5, GPT-2, LLM2Vec, LoRA, and AdamW. However, it does not provide specific version numbers for these software dependencies, which are required for reproducibility.
Experiment Setup | Yes | In Appendix A.2, the paper details the fine-tuning process, stating 'Input sequences are truncated to a maximum length of 512 tokens.' and 'We employ the parameter-efficient LoRA (Hu et al., 2022) fine-tuning method with a rank of 16, LoRA α of 32, a dropout rate of 0.05, and a base learning rate of 5e-5.' Table 9 provides further specifics: 'optimizer AdamW', 'weight decay 0.001', 'optimizer momentum β1, β2 = 0.9, 0.999', 'training epochs 3', 'batch size 8', 'learning rate schedule cosine decay', 'warmup schedule linear warmup ratio 10%', 'gradient clip 0.3'.
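The hyperparameters quoted above can be gathered into a single config for reference. This is a plain-Python sketch; the field names are our own and do not come from the paper or any specific library, though they map naturally onto, e.g., a PEFT `LoraConfig` plus trainer arguments in a Hugging Face setup.

```python
# Fine-tuning hyperparameters from Appendix A.2 / Table 9, in one place.
FINETUNE_CONFIG = {
    "max_seq_length": 512,         # input sequences truncated to 512 tokens
    "lora_rank": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "learning_rate": 5e-5,         # base learning rate
    "lr_schedule": "cosine decay",
    "warmup": {"schedule": "linear", "ratio": 0.10},
    "optimizer": "AdamW",
    "betas": (0.9, 0.999),         # optimizer momentum β1, β2
    "weight_decay": 0.001,
    "epochs": 3,
    "batch_size": 8,
    "gradient_clip": 0.3,
}
```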