Lawma: The Power of Specialization for Legal Annotation
Authors: Ricardo Dominguez-Olmedo, Vedant Nanda, Rediet Abebe, Stefan Bechtold, Christoph Engel, Jens Frankenreiter, Krishna Gummadi, Moritz Hardt, Michael Livermore
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we present a comprehensive analysis of large language models' current abilities to perform legal annotation tasks. To do so, we construct CaselawQA, a benchmark comprising 260 legal text classification tasks, nearly all new to the machine learning community. We demonstrate that commercial models, such as GPT-4.5 and Claude 3.7 Sonnet, achieve non-trivial accuracy but generally fall short of the performance required for legal work. We then demonstrate that small, lightly fine-tuned models vastly outperform commercial models. |
| Researcher Affiliation | Academia | 1Max Planck Institute for Intelligent Systems, Tübingen, and Tübingen AI Center 2Max Planck Institute for Software Systems, Saarbrücken 3ELLIS Institute, Tübingen 4ETH Zurich 5Max Planck Institute for Research on Collective Goods, Bonn 6Washington University in St. Louis School of Law 7University of Virginia School of Law |
| Pseudocode | No | The paper describes methodologies and experiments but does not include any explicit pseudocode or algorithm blocks. The methods are described in narrative text. |
| Open Source Code | Yes | Code, datasets, and fine-tuned models are available at https://github.com/socialfoundations/lawma. |
| Open Datasets | Yes | Code, datasets, and fine-tuned models are available at https://github.com/socialfoundations/lawma. ... The tasks we introduce are actual legal annotation tasks based on the U.S. Supreme Court (Spaeth et al., 2023) and Court of Appeals (Songer) databases. |
| Dataset Splits | Yes | We match a total of 24,916 court cases, which we divide into a 70%/10%/20% train/validation/test split. |
| Hardware Specification | Yes | We fine-tune on a cluster consisting of NVIDIA H100 GPUs. Fine-tuning on all tasks simultaneously required approximately 600 H100 hours for the 8B model and 1600 GPU hours for the 70B model. In total, the experiments presented in the paper required approximately 8000 H100 GPU hours. ... We train on a node of 7 H100s using DeepSpeed ZeRO-2, with a global batch size of 56. For Lawma 70B, we fine-tune Llama 3 70B Instruct for 1 epoch. We train on 8 nodes of 8 H100s each using DeepSpeed ZeRO-3, with a global batch size of 64. |
| Software Dependencies | No | We pack samples using the axolotl library (Cloud, 2024), which improves training efficiency by approximately 40%. While a library is named and cited, a specific version number for 'axolotl' is not provided. |
| Experiment Setup | Yes | We fine-tune with a maximum sequence length of 8192 tokens. We use the AdamW optimizer with full precision, β1 = 0.9, β2 = 0.95, ϵ = 10⁻⁸. We use a peak learning rate of 2·10⁻⁶. We use a cosine learning rate schedule, with 180 warm-up steps (approx. 4% of a full epoch) and decay to 10% of the peak learning rate. We use a weight decay of 0.1. We clip gradients to a max norm of 1.0. ... For Lawma 8B, we fine-tune Llama 3 8B Instruct for 3 epochs. ... with a global batch size of 56. For Lawma 70B, we fine-tune Llama 3 70B Instruct for 1 epoch. ... with a global batch size of 64. |
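The 70%/10%/20% train/validation/test split over the 24,916 matched cases can be sketched as follows. This is a minimal illustration, not the authors' code: the seed, the shuffling, and the `split_cases` helper are assumptions, since the paper does not state how the split was drawn.

```python
import random

def split_cases(case_ids, seed=0):
    """Partition case IDs into 70%/10%/20% train/validation/test,
    matching the proportions reported in the paper."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)  # assumed: a simple seeded shuffle
    n = len(ids)
    n_train = int(0.70 * n)
    n_val = int(0.10 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# 24,916 matched court cases, as reported in the paper
train, val, test = split_cases(range(24916))
```

With 24,916 cases this yields 17,441 training, 2,491 validation, and 4,984 test cases (the remainder falls into the test split).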
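The reported learning-rate schedule (linear warm-up over 180 steps, then cosine decay from the peak of 2·10⁻⁶ down to 10% of the peak) can be written as a small pure-Python function. This is a sketch under assumptions: the exact total step count is not quoted above, so `total_steps` is a free parameter here.

```python
import math

PEAK_LR = 2e-6        # peak learning rate reported in the paper
WARMUP_STEPS = 180    # approx. 4% of a full epoch
FLOOR = 0.1           # decay to 10% of the peak learning rate

def learning_rate(step, total_steps):
    """Linear warm-up followed by cosine decay to FLOOR * PEAK_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (total_steps - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (FLOOR + (1.0 - FLOOR) * cosine)
```

For example, with a hypothetical 4,500 total steps, the rate peaks at 2·10⁻⁶ when warm-up ends (step 180) and decays to 2·10⁻⁷ at the final step.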