Idiosyncrasies in Large Language Models
Authors: Mingjie Sun, Yida Yin, Zhiqiu Xu, J Zico Kolter, Zhuang Liu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate this synthetic task across various groups of LLMs and find that simply fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy. Notably, we achieve 97.1% accuracy on held-out validation data in the five-way classification problem involving ChatGPT, Claude, Grok, Gemini, and DeepSeek. |
| Researcher Affiliation | Academia | 1Carnegie Mellon University 2UC Berkeley 3University of Pennsylvania 4Princeton University. |
| Pseudocode | No | The paper describes the methodology in narrative text and presents results in tables and figures, but no structured pseudocode or algorithm blocks are provided. |
| Open Source Code | Yes | Code is available at github.com/locuslab/llm-idiosyncrasies. |
| Open Datasets | Yes | For chat APIs and instruct LLMs, we generate outputs from UltraChat (Ding et al., 2023), a diverse dialogue and instruction dataset. For base LLMs, we synthesize new texts using prompts from FineWeb (Penedo et al., 2024), a high-quality LLM pretraining dataset. [...] To evaluate this, we collect responses from instruct LLMs across four diverse datasets: UltraChat, Cosmopedia (Ben Allal et al., 2024), LMSYS-Chat (Zheng et al., 2024), and WildChat (Zhao et al., 2024). |
| Dataset Splits | Yes | For a given prompt dataset, we collect 11K text sequences, splitting them into 10K and 1K as training and validation sets, respectively. The same split is used across all LLMs. |
| Hardware Specification | No | The paper describes the models and training setup, but does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types). |
| Software Dependencies | No | The paper mentions several software components and models such as ELMo, BERT, T5, GPT-2, LLM2vec, LoRA, and AdamW. However, it does not provide specific version numbers for these software dependencies, which are required for reproducibility. |
| Experiment Setup | Yes | In Appendix A.2, the paper details the fine-tuning process, stating 'Input sequences are truncated to a maximum length of 512 tokens.' and 'We employ the parameter-efficient LoRA (Hu et al., 2022) fine-tuning method with a rank of 16, LoRA α of 32, a dropout rate of 0.05, and a base learning rate of 5e-5.' Table 9 provides further specifics: 'optimizer AdamW', 'weight decay 0.001', 'optimizer momentum β1, β2 = 0.9, 0.999', 'training epochs 3', 'batch size 8', 'learning rate schedule cosine decay', 'warmup schedule linear', 'warmup ratio 10%', 'gradient clip 0.3'. |
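The 10K/1K train/validation split reported above can be sketched as follows. The 11K → 10K + 1K sizes come from the paper; the shuffle, fixed seed, and function name are our own assumptions, added so the same split can be reproduced deterministically across all LLMs.

```python
import random

def split_sequences(sequences, n_train=10_000, n_val=1_000, seed=0):
    """Shuffle collected text sequences and split into train/val sets.

    Sizes follow the paper (10K train, 1K val from 11K collected);
    the shuffle and fixed seed are assumptions for reproducibility.
    """
    assert len(sequences) >= n_train + n_val, "not enough sequences collected"
    rng = random.Random(seed)
    indices = list(range(len(sequences)))
    rng.shuffle(indices)
    train = [sequences[i] for i in indices[:n_train]]
    val = [sequences[i] for i in indices[n_train:n_train + n_val]]
    return train, val

# Usage with placeholder sequences standing in for LLM responses:
seqs = [f"response {i}" for i in range(11_000)]
train_set, val_set = split_sequences(seqs)
```

Because the seed is fixed, rerunning the function on the response lists from different LLMs reuses the same prompt indices, matching the paper's statement that the same split is used across all models.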
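The hyperparameters quoted from Appendix A.2 and Table 9 can be collected into a single configuration sketch. The numeric values are transcribed from the paper; the dictionary layout and key names are our own choice (not the paper's code), kept framework-agnostic so they could be mapped onto any LoRA fine-tuning setup.

```python
# Fine-tuning hyperparameters transcribed from Appendix A.2 / Table 9.
# Key names are illustrative; only the values come from the paper.
FINETUNE_CONFIG = {
    "max_seq_length": 512,        # input sequences truncated to 512 tokens
    "lora": {
        "rank": 16,
        "alpha": 32,
        "dropout": 0.05,
    },
    "optimizer": {
        "name": "AdamW",
        "base_learning_rate": 5e-5,
        "weight_decay": 0.001,
        "betas": (0.9, 0.999),    # optimizer momentum β1, β2
    },
    "schedule": {
        "type": "cosine_decay",
        "warmup": "linear",
        "warmup_ratio": 0.10,
    },
    "training_epochs": 3,
    "batch_size": 8,
    "gradient_clip": 0.3,
}
```

Since the paper does not pin software versions (see the Software Dependencies row), anyone reproducing the setup would still need to fix library versions separately; this block only captures the reported hyperparameters.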