Verbalized Machine Learning: Revisiting Machine Learning with Language Models

Authors: Tim Z. Xiao, Robert Bamler, Bernhard Schölkopf, Weiyang Liu

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We empirically verify the effectiveness of VML, and hope that VML can serve as a stepping stone to stronger interpretability. We conduct empirical studies on the injection of verbalized inductive bias and show that it is promising to use natural language as a unified way to encode prior knowledge. Moreover, we validate the effectiveness of VML in different applications (Section 4; Appendices A, E, F, G, H)."
Researcher Affiliation | Academia | Tim Z. Xiao (Max Planck Institute for Intelligent Systems, Tübingen & University of Tübingen); Robert Bamler (University of Tübingen); Bernhard Schölkopf (Max Planck Institute for Intelligent Systems, Tübingen); Weiyang Liu (Max Planck Institute for Intelligent Systems, Tübingen & University of Cambridge)
Pseudocode | Yes | Algorithm 1 (Training in VML):
    Initialize model parameters θ_0, iteration number T, batch size M, and optimizer parameters ψ
    for i = 1, ..., T do
        Sample M training examples x_1, ..., x_M
        for m = 1, 2, ..., M do
            ŷ_m = f_model(x_m; θ_{i−1})
        end
        θ_i = f_opt({(x_m, ŷ_m, y_m)}_{m=1}^{M}, θ_{i−1}; ψ)
    end
Open Source Code | No | No explicit statement or link to the authors' own source code for the methodology described in this paper is provided. Mentions of code refer to third-party tools such as vLLM or open-interpreter, or to code for comparison methods such as APE, but not to the authors' own implementation of VML.
Open Datasets | Yes | "We create a subset of the dataset PneumoniaMNIST [64]." — [64] Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Bilian Ke, Hanspeter Pfister, and Bingbing Ni. MedMNIST v2: A large-scale lightweight benchmark for 2D and 3D biomedical image classification. Scientific Data, 2023.
Dataset Splits | Yes | "The training set for each task consists of 100 data points. For classifications, we use additional test sets (20 data points), and evaluate both training and testing accuracies." "Our dataset consists of 100 training data and 100 test data (half pneumonia and half normal for both sets)."
Hardware Specification | Yes | "The LLM is run on a node of 8 A100 GPUs using the inference engine provided by vLLM [20]."
Software Dependencies | No | The paper mentions software such as vLLM and open-interpreter, and models such as Llama-3 70B and GPT-4o, but does not specify version numbers for any libraries or programming languages used in the authors' own implementation.
Experiment Setup | Yes | "The training set for each task consists of 100 data points. For all tasks, we use a batch size of 10 for each optimization step (see Figure 2 (right) as an example), which corresponds to 10 steps per training epoch. For classifications, we use additional test sets (20 data points), and evaluate both training and testing accuracies. Models are trained for 5 epochs." "We use a batch size of 10 and train for 10 steps."
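The training procedure from the paper's Algorithm 1 can be sketched in ordinary Python. Everything below is an illustrative assumption rather than the authors' implementation: `llm` stands for any callable that maps a prompt string to a text completion, and the prompt formats, function names, and parameters are hypothetical. The key idea it demonstrates is that in VML the model "parameters" θ are a natural-language description, and the optimizer step is itself an LLM call that revises that description from a batch of (input, prediction, target) triples.

```python
# Hypothetical sketch of VML training (Algorithm 1), not the authors' code.
# `llm` is assumed to be any prompt -> completion callable (e.g. an API client).
import random

def f_model(x, theta, llm):
    """Learner: theta is a natural-language model description, not a weight vector."""
    return llm(f"Model description: {theta}\nInput: {x}\nOutput:")

def f_opt(batch, theta, psi, llm):
    """Optimizer: an LLM revises the verbalized parameters given batch feedback."""
    feedback = "\n".join(
        f"input={x}, prediction={y_hat}, target={y}" for x, y_hat, y in batch
    )
    return llm(
        f"Optimizer instructions: {psi}\n"
        f"Current model description: {theta}\n"
        f"Batch results:\n{feedback}\n"
        f"Revised model description:"
    )

def train(data, theta0, T, M, psi, llm):
    """Run T optimization steps with batch size M, as in Algorithm 1."""
    theta = theta0
    for _ in range(T):
        samples = random.sample(data, M)          # sample M training examples
        batch = [(x, f_model(x, theta, llm), y) for x, y in samples]
        theta = f_opt(batch, theta, psi, llm)     # verbal "gradient step"
    return theta
```

With the setup reported above (100 training points, batch size 10, 5 epochs), one epoch corresponds to 10 such optimizer steps, i.e. `T = 50` in total.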