Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking

Authors: Benjamin Feuer, Micah Goldblum, Teresa Datta, Sanjana Nambiar, Raz Besaleli, Samuel Dooley, Max Cembalest, John P Dickerson

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct (to the best of our knowledge) the largest controlled meta-analysis of publicly available post-training methods to date, and show that data scaling in the SFT stage as well as prompt diversity are the most important predictors of improved alignment.
Researcher Affiliation | Collaboration | Benjamin Feuer 1,2, Micah Goldblum 3, Teresa Datta 1, Sanjana Nambiar 2, Raz Besaleli, Samuel Dooley, Max Cembalest, John P. Dickerson 1 (1 Arthur AI, 2 NYU, 3 Columbia University)
Pseudocode | No | The paper describes methods and processes in narrative text and bullet points (e.g., the 'Implementation Roadmap' and 'How it works' sections) and provides prompt templates in an appendix, but does not include a formally structured pseudocode or algorithm block.
Open Source Code | Yes | Our codebase and complete results can be found at https://github.com/penfever/sos-bench.
Open Datasets | Yes | SOS-BENCH (Substance Over Style Benchmark) combines 19 existing world knowledge, instruction following, and safety benchmarks for a holistic view of model performance. For the complete list of benchmarks we use, please refer to Table 8.
Dataset Splits | No | The paper evaluates models against the test sets of various benchmarks (e.g., 'Arena-Hard has a test set of 500 questions' and 'All in all, we test models on 152,380 data points' for SOS-BENCH). However, it does not provide specific train/validation/test splits for the datasets used to train the models in its own experiments, such as the fine-tuned LLAMA-3-8B models described in Section 6 and Appendix A.
Hardware Specification | Yes | Compute costs. The compute cost for Figure 3 was 250 A100-hours. Table 4, which required more model training, was 850 A100-hours. Table 5 was 225 A100-hours.
Software Dependencies | No | The paper mentions software such as 'Axolotl' and the 'OpenAI API', and libraries such as 'spaCy, pdfminer, Tesseract OCR', but does not provide specific version numbers for any of these components.
Experiment Setup | Yes | Our Llama3-8B models were fine-tuned for 10,000 steps or 2 epochs (whichever came first), at a learning rate of 2e-5. Our Mistral-7B models were fine-tuned for 3 epochs at a learning rate of 5e-6. All models were trained at sequence lengths of 8192, with an AdamW optimizer and a cosine LR scheduler. We utilized gradient checkpointing, flash attention and sample packing.
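
The reported Llama3-8B setup maps naturally onto an Axolotl training config. The sketch below is a hypothetical reconstruction, not the authors' published config: the hyperparameter values come from the Experiment Setup row above, while the field names follow Axolotl conventions and everything else (the base model identifier, the exact optimizer variant) is an assumption.

```yaml
# Hypothetical Axolotl config for the Llama3-8B runs described above.
# Values marked "from paper" are reported; all other fields are assumptions.
base_model: meta-llama/Meta-Llama-3-8B   # assumed checkpoint name
sequence_len: 8192                        # from paper
sample_packing: true                      # from paper
flash_attention: true                     # from paper
gradient_checkpointing: true              # from paper
optimizer: adamw_torch                    # AdamW per paper; exact variant assumed
learning_rate: 2.0e-5                     # from paper (Llama3-8B)
lr_scheduler: cosine                      # from paper
num_epochs: 2                             # from paper: 10,000 steps or 2 epochs,
max_steps: 10000                          # whichever comes first
```

For the Mistral-7B runs, the paper reports 3 epochs at a learning rate of 5e-6, with the same sequence length, optimizer, and scheduler.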