A Simple Model of Inference Scaling Laws
Authors: Noam Itzhak Levi
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our main predictions are verified by empirical results on mathematical reasoning tasks for several LLMs, following Brown et al. (2024). Lastly, we test the universality of our theory on an entirely different generative model. We train a Variational Autoencoder (VAE) (Kingma and Welling, 2022) to generate reconstructions of its training data by sampling from a latent space with an associated temperature. We find that the same behavior persists for both LLMs and the VAE setup, in spite of the vast differences in models and tasks. |
| Researcher Affiliation | Academia | 1École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland. Correspondence to: Noam Levi <EMAIL>. |
| Pseudocode | No | The paper describes methods and models using mathematical equations and textual descriptions, but it does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code for the methodology described, nor does it include links to a code repository. |
| Open Datasets | Yes | To test these predictions, we utilize the reported pass@k results of (Brown et al., 2024), which evaluated Gemma2B (Team et al., 2024), Llama3-8B (AI@Meta, 2024) and Pythia-2.8B (Biderman et al., 2023) on mathematical and coding tasks. Here, we take the results for the MATH dataset, which consists of difficult math word problems (Chen et al., 2024). ...we train a VAE with a temperature parameter to study how errors propagate over multiple trials and to compare empirical pass@k with theoretical predictions under correlated trials. We refer to this as the VAE reconstruction task. ...We utilize a Variational Autoencoder (VAE) with the following architectural details: Input dimension: 28 × 28 = 784, corresponding to the flattened pixel values of the Fashion MNIST dataset. ...We utilized a subset of the MNIST dataset (Deng, 2012). |
| Dataset Splits | Yes | Here, we take the results for the MATH dataset, which consists of difficult math word problems (Chen et al., 2024), where 128 random problems from the test set were chosen for evaluation. The VAE was trained on the first 400 samples from the Fashion MNIST dataset. We utilized a subset of the MNIST dataset (Deng, 2012). Subset Size: We selected the first Nsamples = 1000 images from the MNIST training set. |
| Hardware Specification | No | The paper discusses 'Floating Point Operations Per Second (FLOPS)' as a cost metric but does not specify the actual hardware (e.g., GPU, CPU models) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'Adam optimizer (Kingma and Ba, 2017)' and refers to specific LLM models (Gemma2B, Llama3-8B, Pythia-2.8B), but it does not specify version numbers for any software libraries, frameworks, or programming languages used for implementation. |
| Experiment Setup | Yes | We utilize a Variational Autoencoder (VAE) with the following architectural details: Input dimension: 28 × 28 = 784, corresponding to the flattened pixel values of the Fashion MNIST dataset. Hidden dimension: 400. Latent dimension: 20, controlling the bottleneck for information in the latent space. Decoder: The decoder reconstructs the original input through two fully connected layers, outputting a 784-dimensional vector followed by a sigmoid activation to ensure pixel values remain between 0 and 1. Temperature parameter: A temperature parameter T = 1.1 is applied during the reparameterization step to control the variance of the latent variables, allowing us to model uncertainty in the latent space more effectively. The VAE was trained on the first 400 samples from the Fashion MNIST dataset. The loss function combines binary cross-entropy for reconstruction and the Kullback-Leibler divergence to regularize the latent variables. We ran the training for 1000 epochs using the Adam optimizer with a learning rate of 1 × 10−3. B.2. Memory Model The “memory” component was implemented as a simple Multi-Layer Perceptron (MLP). Architecture: The MLP consisted of: 1. An input layer accepting Din = 784 features. 2. A first hidden layer with 256 neurons, followed by a ReLU activation function. 3. A second hidden layer with 128 neurons, followed by a ReLU activation function. 4. An output layer with Nsamples neurons (one for each unique class), producing logits. Training Objective: The model was trained to perform classification, mapping each unique input sample to its assigned unique class index. Training Parameters: Loss Function: Cross-Entropy Loss (LCE). Optimizer: Adam optimizer (Kingma and Ba, 2017). Learning Rate: η = 0.001. Epochs: The model was trained for E = 50 epochs. Batch Size: B = 32. |
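The pass@k results quoted above (Brown et al., 2024) are conventionally computed with the unbiased estimator of Chen et al. (2021), which the reviewed paper does not spell out; a minimal sketch, where the sample count `n`, correct count `c`, and the averaging over 128 problems are illustrative assumptions:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n generations is correct, given
    that c of the n generations were correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem correct counts, averaged as in an evaluation
# over e.g. 128 MATH test problems (numbers here are illustrative).
correct_counts = [0, 3, 50]
mean_pass_at_10 = sum(pass_at_k(100, c, 10) for c in correct_counts) / len(correct_counts)
```

This estimator avoids the bias of naively subsampling k of the n generations and checking for a hit.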
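The temperature parameter T = 1.1 described in the Experiment Setup row enters through the VAE's reparameterization step. A minimal sketch of that step, assuming the standard reparameterization trick with T scaling the latent standard deviation (the function name and pure-Python form are illustrative, not the paper's implementation):

```python
import math
import random

def reparameterize(mu, log_var, temperature=1.1):
    """Sample z_i = mu_i + T * sigma_i * eps_i with eps_i ~ N(0, 1).

    T multiplies the latent standard deviation sigma = exp(log_var / 2),
    so T > 1 draws higher-variance latents and yields more diverse
    reconstructions across repeated sampling trials.
    """
    return [
        m + temperature * math.exp(0.5 * lv) * random.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]

# A 20-dimensional latent, matching the paper's latent dimension.
z = reparameterize([0.0] * 20, [0.0] * 20, temperature=1.1)
```

Setting `temperature=0.0` recovers the deterministic mean, while larger T widens the sampling distribution, which is what lets pass@k-style repeated trials be studied in the VAE reconstruction task.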