Idiosyncrasies in Large Language Models

Authors: Mingjie Sun, Yida Yin, Zhiqiu Xu, J. Zico Kolter, Zhuang Liu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate this synthetic task across various groups of LLMs and find that simply fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy. Notably, we achieve 97.1% accuracy on held-out validation data in the five-way classification problem involving ChatGPT, Claude, Grok, Gemini, and DeepSeek.
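The five-way setup above can be sketched as follows. This is a minimal illustration only: the classifier itself (a fine-tuned text embedding model in the paper) is abstracted away, and the toy predictions are assumptions; only the label set and the accuracy metric come from the quoted text.

```python
# Five source LLMs in the classification problem described in the paper.
LABELS = ["ChatGPT", "Claude", "Grok", "Gemini", "DeepSeek"]

def accuracy(predictions, ground_truth):
    """Fraction of held-out responses whose source LLM is predicted correctly."""
    assert len(predictions) == len(ground_truth)
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

# Toy example: 4 of 5 validation responses classified correctly.
preds = ["ChatGPT", "Claude", "Claude", "Gemini", "DeepSeek"]
truth = ["ChatGPT", "Claude", "Grok", "Gemini", "DeepSeek"]
print(accuracy(preds, truth))  # 0.8
```

The reported 97.1% is this metric computed on the 1K-sequence validation set.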
Researcher Affiliation | Academia | Carnegie Mellon University, UC Berkeley, University of Pennsylvania, Princeton University.
Pseudocode | No | The paper describes the methodology in narrative text and presents results in tables and figures, but no structured pseudocode or algorithm blocks are provided.
Open Source Code | Yes | Code is available at github.com/locuslab/llm-idiosyncrasies.
Open Datasets | Yes | For chat APIs and instruct LLMs, we generate outputs from UltraChat (Ding et al., 2023), a diverse dialogue and instruction dataset. For base LLMs, we synthesize new texts using prompts from FineWeb (Penedo et al., 2024), a high-quality LLM pretraining dataset. [...] To evaluate this, we collect responses from instruct LLMs across four diverse datasets: UltraChat, Cosmopedia (Ben Allal et al., 2024), LMSYS-Chat (Zheng et al., 2024), and WildChat (Zhao et al., 2024).
Dataset Splits | Yes | For a given prompt dataset, we collect 11K text sequences, splitting them into 10K and 1K as training and validation sets, respectively. The same split is used across all LLMs.
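The 10K/1K split described above can be sketched as below. The fixed seed and the shuffle-by-index approach are our assumptions, chosen so that the same index split can be reused across every LLM's responses, as the quote requires.

```python
import random

def train_val_split(sequences, n_train=10_000, n_val=1_000, seed=0):
    """Shuffle indices once with a fixed seed so the identical split
    can be applied to each LLM's 11K collected sequences."""
    assert len(sequences) >= n_train + n_val
    rng = random.Random(seed)
    indices = list(range(len(sequences)))
    rng.shuffle(indices)
    train = [sequences[i] for i in indices[:n_train]]
    val = [sequences[i] for i in indices[n_train:n_train + n_val]]
    return train, val

# 11K collected sequences -> 10K train, 1K validation.
seqs = [f"response-{i}" for i in range(11_000)]
train, val = train_val_split(seqs)
print(len(train), len(val))  # 10000 1000
```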
Hardware Specification | No | The paper describes the models and training setup, but does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types).
Software Dependencies | No | The paper mentions several software components and models such as ELMo, BERT, T5, GPT-2, LLM2Vec, LoRA, and AdamW. However, it does not provide specific version numbers for these software dependencies, which are required for reproducibility.
Experiment Setup | Yes | In Appendix A.2, the paper details the fine-tuning process, stating 'Input sequences are truncated to a maximum length of 512 tokens.' and 'We employ the parameter-efficient LoRA (Hu et al., 2022) fine-tuning method with a rank of 16, LoRA α of 32, a dropout rate of 0.05, and a base learning rate of 5e-5.' Table 9 provides further specifics: 'optimizer AdamW', 'weight decay 0.001', 'optimizer momentum β1, β2 = 0.9, 0.999', 'training epochs 3', 'batch size 8', 'learning rate schedule cosine decay', 'warmup schedule linear warmup ratio 10%', 'gradient clip 0.3'.
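The hyperparameters quoted above can be gathered into a single config for reference. This is a plain-Python sketch; the field names are our own and do not come from the paper or any specific library, though they map naturally onto, e.g., a PEFT `LoraConfig` plus trainer arguments in a Hugging Face setup.

```python
# Fine-tuning hyperparameters from Appendix A.2 / Table 9, in one place.
FINETUNE_CONFIG = {
    "max_seq_length": 512,         # input sequences truncated to 512 tokens
    "lora_rank": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "learning_rate": 5e-5,         # base learning rate
    "lr_schedule": "cosine decay",
    "warmup": {"schedule": "linear", "ratio": 0.10},
    "optimizer": "AdamW",
    "betas": (0.9, 0.999),         # optimizer momentum β1, β2
    "weight_decay": 0.001,
    "epochs": 3,
    "batch_size": 8,
    "gradient_clip": 0.3,
}
```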