Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking

Authors: Benjamin Feuer, Micah Goldblum, Teresa Datta, Sanjana Nambiar, Raz Besaleli, Samuel Dooley, Max Cembalest, John P Dickerson

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct (to the best of our knowledge) the largest controlled meta-analysis of publicly available post-training methods to date, and show that data scaling in the SFT stage as well as prompt diversity are the most important predictors of improved alignment.
Researcher Affiliation | Collaboration | Benjamin Feuer 1,2, Micah Goldblum 3, Teresa Datta 1, Sanjana Nambiar 2, Raz Besaleli, Samuel Dooley, Max Cembalest, John P. Dickerson 1 (1 Arthur AI, 2 NYU, 3 Columbia University)
Pseudocode | No | The paper describes methods and processes in narrative text and bullet points (e.g., the 'Implementation Roadmap' and 'How it works' sections) and provides prompt templates in an appendix, but does not include a formally structured pseudocode or algorithm block.
Open Source Code | Yes | Our codebase and complete results can be found at https://github.com/penfever/sos-bench.
Open Datasets | Yes | SOS-BENCH (Substance Over Style Benchmark) combines 19 existing world knowledge, instruction following, and safety benchmarks for a holistic view of model performance. For the complete list of benchmarks we use, please refer to Table 8.
Dataset Splits | No | The paper evaluates models against the test sets of various benchmarks (e.g., 'Arena-Hard has a test set of 500 questions' and 'All in all, we test models on 152,380 data points' for SOS-BENCH). However, it does not provide specific train/validation/test splits for the datasets used to train the models in its own experiments, such as the fine-tuned LLAMA-3-8B models described in Section 6 and Appendix A.
Hardware Specification | Yes | Compute costs. The compute cost for Figure 3 was 250 A100-hours. Table 4, which required more model training, was 850 A100-hours. Table 5 was 225 A100-hours.
Software Dependencies | No | The paper mentions software such as 'Axolotl' and the 'OpenAI API', and libraries such as 'spaCy, pdfminer, Tesseract OCR', but does not provide specific version numbers for any of these components.
Experiment Setup | Yes | Our Llama3-8B models were fine-tuned for 10,000 steps or 2 epochs (whichever came first), at a learning rate of 2e-5. Our Mistral-7B models were fine-tuned for 3 epochs at a learning rate of 5e-6. All models were trained at sequence lengths of 8192, with an AdamW optimizer and a cosine LR scheduler. We utilized gradient checkpointing, flash attention and sample packing.
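
The reported Llama3-8B setup maps naturally onto an Axolotl training config. The sketch below is a hypothetical reconstruction, not the authors' published config: the hyperparameter values come from the Experiment Setup row above, while the field names follow Axolotl conventions and everything else (the base model identifier, the exact optimizer variant) is an assumption.

```yaml
# Hypothetical Axolotl config for the Llama3-8B runs described above.
# Values marked "from paper" are reported; all other fields are assumptions.
base_model: meta-llama/Meta-Llama-3-8B   # assumed checkpoint name
sequence_len: 8192                        # from paper
sample_packing: true                      # from paper
flash_attention: true                     # from paper
gradient_checkpointing: true              # from paper
optimizer: adamw_torch                    # AdamW per paper; exact variant assumed
learning_rate: 2.0e-5                     # from paper (Llama3-8B)
lr_scheduler: cosine                      # from paper
num_epochs: 2                             # from paper: 10,000 steps or 2 epochs,
max_steps: 10000                          # whichever comes first
```

For the Mistral-7B runs, the paper reports 3 epochs at a learning rate of 5e-6, with the same sequence length, optimizer, and scheduler.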