Holistically Evaluating the Environmental Impact of Creating Language Models

Authors: Jacob Morrison, Clara Na, Jared Fernandez, Tim Dettmers, Emma Strubell, Jesse Dodge

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. "In this work, we estimate the real-world environmental impact of developing a series of language models, ranging from 20 million to 13 billion active parameters, trained on up to 5.6 trillion tokens each. [...] We measure and report the environmental impact of our model development; to the best of our knowledge we are the first to do so for LLMs, and we find that model development, the impact of which is generally not disclosed by most model developers, amounted to 50% of that of training."
Researcher Affiliation: Collaboration. Jacob Morrison (Allen Institute for AI), Clara Na (Carnegie Mellon University), Jared Fernandez (Carnegie Mellon University), Tim Dettmers (Allen Institute for AI, Carnegie Mellon University), Emma Strubell (Allen Institute for AI, Carnegie Mellon University), Jesse Dodge (Allen Institute for AI).
Pseudocode: No. The paper describes methodologies and calculations but does not present any structured pseudocode or algorithm blocks.
Open Source Code: No. The paper mentions using third-party tools such as SGLang and Code Carbon, but there is no explicit statement or link indicating that the authors released their own source code for the methodology described in the paper.
Open Datasets: Yes. "The requests themselves come from the ShareGPT dataset, and each inference scenario involves the same sample of 2400 prompts (same random seed). Input and output lengths, therefore, are the same in theory for a given model, but due to differences in tokenization and model context length, there are slight variations in mean input/output lengths across models, 225-250 and 190-230 tokens respectively." (Dataset: anon8231489123/ShareGPT_Vicuna_unfiltered, https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json)
Dataset Splits: No. The paper states that the models were trained on 1.7 to 5.6 trillion tokens and that a sample of 2400 prompts was used for inference simulation, but it does not specify explicit training, validation, or test splits for the main model training.
Hardware Specification: Yes. "Each model was trained on standard HGX servers with 8 NVIDIA H100 GPUs per server, with high-speed interconnect between each node, and between 2 and 128 nodes concurrently per training run. All models except the 13B were trained in the same data center."
Software Dependencies: Yes. "We additionally estimate the environmental impact from mining rare earth metals used during manufacturing, assuming an H100 is 0.1% rare earth metal by mass. Mining 1 kg of rare earth materials consumes about 11 kL of water and releases 65.4 kg CO2eq (Browning et al., 2016), and one 12-inch silicon wafer weighs 125 grams and produces about 63 H100s. Together, these add an additional 2.2 liters of water consumed and 0.013 kg CO2eq per GPU. [...] In our inference experiments, we measure cumulative energy consumption using Code Carbon (Courty et al., 2024) tracking, which was verified against the same time series monitoring used throughout training."
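The per-GPU figures quoted above are internally consistent with the stated mining constants, which a few lines of arithmetic confirm. The implied rare earth mass per H100 (about 0.2 g) is a back-calculation of ours, not a number the paper states directly:

```python
# Constants quoted in the paper's embodied-impact estimate.
WATER_PER_KG_RARE_EARTH_L = 11_000  # 11 kL of water per kg mined
CO2_PER_KG_RARE_EARTH_KG = 65.4     # kg CO2eq per kg mined

# Working backwards from the reported 2.2 L of water per GPU gives the
# implied rare earth mass per H100; this back-calculation is ours.
rare_earth_per_gpu_kg = 2.2 / WATER_PER_KG_RARE_EARTH_L

co2_per_gpu_kg = rare_earth_per_gpu_kg * CO2_PER_KG_RARE_EARTH_KG
print(f"{rare_earth_per_gpu_kg * 1000:.1f} g rare earth per GPU")
print(f"{co2_per_gpu_kg:.3f} kg CO2eq per GPU")  # matches the reported 0.013
```

The 0.2 g figure also squares with the "0.1% rare earth metal by mass" assumption for a GPU package on the order of a few hundred grams.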
Experiment Setup: No. The paper states, "Before launching our final training runs for each model, we ran a series of controlled experiments to stabilize and improve our training setup, to explore different parameter initializations and mid-training recipes, and to determine our final hyperparameters and data mixtures through scaling law experiments (Bhagia et al., 2024)." However, it does not explicitly list the chosen hyperparameters or specific training configuration details for the final models.