Holistically Evaluating the Environmental Impact of Creating Language Models

Authors: Jacob Morrison, Clara Na, Jared Fernandez, Tim Dettmers, Emma Strubell, Jesse Dodge

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. "In this work, we estimate the real-world environmental impact of developing a series of language models, ranging from 20 million to 13 billion active parameters, trained on up to 5.6 trillion tokens each. [...] We measure and report the environmental impact of our model development; to the best of our knowledge we are the first to do so for LLMs, and we find that model development, the impact of which is generally not disclosed by most model developers, amounted to 50% of that of training."
Researcher Affiliation: Collaboration. Jacob Morrison (Allen Institute for AI), Clara Na (Carnegie Mellon University), Jared Fernandez (Carnegie Mellon University), Tim Dettmers (Allen Institute for AI, Carnegie Mellon University), Emma Strubell (Allen Institute for AI, Carnegie Mellon University), Jesse Dodge (Allen Institute for AI).
Pseudocode: No. The paper describes methodologies and calculations but does not present any structured pseudocode or algorithm blocks.
Open Source Code: No. The paper mentions using third-party tools such as SGLang and Code Carbon, but there is no explicit statement or link indicating that the authors released their own source code for the methodology described in the paper.
Open Datasets: Yes. "The requests themselves come from the ShareGPT dataset, and each inference scenario involves the same sample of 2400 prompts (same random seed). Input and output lengths, therefore, are the same in theory for a given model, but due to differences in tokenization and model context length, there are slight variations in mean input/output lengths across models, 225-250 and 190-230 tokens respectively." (Dataset: anon8231489123/ShareGPT_Vicuna_unfiltered, https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json)
Dataset Splits: No. The paper states that the models were trained on 1.7 to 5.6 trillion tokens and that a sample of 2400 prompts was used for inference simulation, but it does not specify explicit training, validation, or test splits for the main model training.
Hardware Specification: Yes. "Each model was trained on standard HGX servers with 8 NVIDIA H100 GPUs per server, with high-speed interconnect between each node, and between 2 and 128 nodes concurrently per training run. All models except the 13B were trained in the same data center."
Software Dependencies: Yes. "We additionally estimate the environmental impact from mining rare earth metals used during manufacturing, assuming an H100 is 0.1% rare earth metal by mass. Mining 1 kg of rare earth materials consumes about 11 kL of water and releases 65.4 kg CO2eq (Browning et al., 2016), and one 12-inch silicon wafer weighs 125 grams and produces about 63 H100s. Together, these add an additional 2.2 liters of water consumed and 0.013 kg CO2eq per GPU. [...] In our inference experiments, we measure cumulative energy consumption using Code Carbon (Courty et al., 2024) tracking, which was verified against the same time series monitoring used throughout training."
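The per-GPU figures quoted above are internally consistent with the stated mining constants, which a few lines of arithmetic confirm. The implied rare earth mass per H100 (about 0.2 g) is a back-calculation of ours, not a number the paper states directly:

```python
# Constants quoted in the paper's embodied-impact estimate.
WATER_PER_KG_RARE_EARTH_L = 11_000  # 11 kL of water per kg mined
CO2_PER_KG_RARE_EARTH_KG = 65.4     # kg CO2eq per kg mined

# Working backwards from the reported 2.2 L of water per GPU gives the
# implied rare earth mass per H100; this back-calculation is ours.
rare_earth_per_gpu_kg = 2.2 / WATER_PER_KG_RARE_EARTH_L

co2_per_gpu_kg = rare_earth_per_gpu_kg * CO2_PER_KG_RARE_EARTH_KG
print(f"{rare_earth_per_gpu_kg * 1000:.1f} g rare earth per GPU")
print(f"{co2_per_gpu_kg:.3f} kg CO2eq per GPU")  # matches the reported 0.013
```

The 0.2 g figure also squares with the "0.1% rare earth metal by mass" assumption for a GPU package on the order of a few hundred grams.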
Experiment Setup: No. The paper states, "Before launching our final training runs for each model, we ran a series of controlled experiments to stabilize and improve our training setup, to explore different parameter initializations and mid-training recipes, and to determine our final hyperparameters and data mixtures through scaling law experiments (Bhagia et al., 2024)." However, it does not explicitly list the chosen hyperparameters or specific training configuration details for the final models.