Lawma: The Power of Specialization for Legal Annotation
Authors: Ricardo Dominguez-Olmedo, Vedant Nanda, Rediet Abebe, Stefan Bechtold, Christoph Engel, Jens Frankenreiter, Krishna Gummadi, Moritz Hardt, Michael Livermore
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we present a comprehensive analysis of large language models' current abilities to perform legal annotation tasks. To do so, we construct CaselawQA, a benchmark comprising 260 legal text classification tasks, nearly all new to the machine learning community. We demonstrate that commercial models, such as GPT-4.5 and Claude 3.7 Sonnet, achieve non-trivial accuracy but generally fall short of the performance required for legal work. We then demonstrate that small, lightly fine-tuned models vastly outperform commercial models. |
| Researcher Affiliation | Academia | 1Max Planck Institute for Intelligent Systems, Tübingen, and Tübingen AI Center 2Max Planck Institute for Software Systems, Saarbrücken 3ELLIS Institute, Tübingen 4ETH Zurich 5Max Planck Institute for Research on Collective Goods, Bonn 6Washington University in St. Louis School of Law 7University of Virginia School of Law |
| Pseudocode | No | The paper describes methodologies and experiments but does not include any explicit pseudocode or algorithm blocks. The methods are described in narrative text. |
| Open Source Code | Yes | Code, datasets, and fine-tuned models are available at https://github.com/socialfoundations/lawma. |
| Open Datasets | Yes | Code, datasets, and fine-tuned models are available at https://github.com/socialfoundations/lawma. ... The tasks we introduce are actual legal annotation tasks based on the U.S. Supreme Court (Spaeth et al., 2023) and Court of Appeals (Songer) databases. |
| Dataset Splits | Yes | We match a total of 24,916 court cases, which we divide into a 70%/10%/20% train/validation/test split. |
| Hardware Specification | Yes | We fine-tune on a cluster consisting of NVIDIA H100 GPUs. Fine-tuning on all tasks simultaneously required approximately 600 H100 hours for the 8B model and 1600 GPU hours for the 70B model. In total, the experiments presented in the paper required approximately 8000 H100 GPU hours. ... We train on a node of 7 H100s using DeepSpeed ZeRO-2, with a global batch size of 56. For Lawma 70B, we fine-tune Llama 3 70B Instruct for 1 epoch. We train on 8 nodes of 8 H100s each using DeepSpeed ZeRO-3, with a global batch size of 64. |
| Software Dependencies | No | We pack samples using the axolotl library (Cloud, 2024), which improves training efficiency by approximately 40%. While a library is named and cited, a specific version number for 'axolotl' is not provided. |
| Experiment Setup | Yes | We fine-tune with a maximum sequence length of 8192 tokens. We use the AdamW optimizer with full precision, β1 = 0.9, β2 = 0.95, ϵ = 10⁻⁸. We use a peak learning rate of 2·10⁻⁶. We use a cosine learning rate schedule, with 180 warm-up steps (approx. 4% of a full epoch) and decay to 10% of the peak learning rate. We use a weight decay of 0.1. We clip gradients to a max norm of 1.0. ... For Lawma 8B, we fine-tune Llama 3 8B Instruct for 3 epochs. ... with a global batch size of 56. For Lawma 70B, we fine-tune Llama 3 70B Instruct for 1 epoch. ... with a global batch size of 64. |
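The 70%/10%/20% train/validation/test split over the 24,916 matched cases can be sketched as follows. This is a minimal illustration, not the authors' code: the seed, the shuffling, and the `split_cases` helper are assumptions, since the paper does not state how the split was drawn.

```python
import random

def split_cases(case_ids, seed=0):
    """Partition case IDs into 70%/10%/20% train/validation/test,
    matching the proportions reported in the paper."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)  # assumed: a simple seeded shuffle
    n = len(ids)
    n_train = int(0.70 * n)
    n_val = int(0.10 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# 24,916 matched court cases, as reported in the paper
train, val, test = split_cases(range(24916))
```

With 24,916 cases this yields 17,441 training, 2,491 validation, and 4,984 test cases (the remainder falls into the test split).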
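The reported learning-rate schedule (linear warm-up over 180 steps, then cosine decay from the peak of 2·10⁻⁶ down to 10% of the peak) can be written as a small pure-Python function. This is a sketch under assumptions: the exact total step count is not quoted above, so `total_steps` is a free parameter here.

```python
import math

PEAK_LR = 2e-6        # peak learning rate reported in the paper
WARMUP_STEPS = 180    # approx. 4% of a full epoch
FLOOR = 0.1           # decay to 10% of the peak learning rate

def learning_rate(step, total_steps):
    """Linear warm-up followed by cosine decay to FLOOR * PEAK_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (total_steps - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (FLOOR + (1.0 - FLOOR) * cosine)
```

For example, with a hypothetical 4,500 total steps, the rate peaks at 2·10⁻⁶ when warm-up ends (step 180) and decays to 2·10⁻⁷ at the final step.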