Verbalized Machine Learning: Revisiting Machine Learning with Language Models

Authors: Tim Z. Xiao, Robert Bamler, Bernhard Schölkopf, Weiyang Liu

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We empirically verify the effectiveness of VML, and hope that VML can serve as a stepping stone to stronger interpretability. We conduct empirical studies on the injection of verbalized inductive bias and show that it is promising to use natural language as a unified way to encode prior knowledge. Moreover, we validate the effectiveness of VML in different applications (Section 4; Appendices A, E, F, G, H)."
Researcher Affiliation | Academia | Tim Z. Xiao (Max Planck Institute for Intelligent Systems, Tübingen & University of Tübingen); Robert Bamler (University of Tübingen); Bernhard Schölkopf (Max Planck Institute for Intelligent Systems, Tübingen); Weiyang Liu (Max Planck Institute for Intelligent Systems, Tübingen & University of Cambridge)
Pseudocode | Yes | Algorithm 1 (Training in VML):
    Initialize model parameters θ_0, iteration number T, batch size M, and optimizer parameters ψ
    for i = 1, ..., T do
        Sample M training examples x_1, ..., x_M
        for m = 1, 2, ..., M do
            ŷ_m = f_model(x_m; θ_{i−1})
        end
        θ_i = f_opt({(x_m, ŷ_m, y_m)}_{m=1}^{M}, θ_{i−1}; ψ)
    end
Open Source Code | No | No explicit statement or link to the authors' own source code for the methodology described in this paper is provided. Mentions of code refer to third-party tools such as vLLM or open-interpreter, or to code for comparison methods such as APE, but not to the authors' own implementation of VML.
Open Datasets | Yes | "We create a subset of the dataset PneumoniaMNIST [64]." — [64] Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Bilian Ke, Hanspeter Pfister, and Bingbing Ni. MedMNIST v2: A large-scale lightweight benchmark for 2D and 3D biomedical image classification. Scientific Data, 2023.
Dataset Splits | Yes | "The training set for each task consists of 100 data points. For classifications, we use additional test sets (20 data points), and evaluate both training and testing accuracies." "Our dataset consists of 100 training data and 100 test data (half pneumonia and half normal for both sets)."
Hardware Specification | Yes | "The LLM is run on a node of 8 A100 GPUs using the inference engine provided by vLLM [20]."
Software Dependencies | No | The paper mentions software such as vLLM and open-interpreter, and models such as Llama-3 70B and GPT-4o, but does not specify version numbers for any libraries or programming languages used in the authors' own implementation.
Experiment Setup | Yes | "The training set for each task consists of 100 data points. For all tasks, we use a batch size of 10 for each optimization step (see Figure 2 (right) as an example), which corresponds to 10 steps per training epoch. For classifications, we use additional test sets (20 data points), and evaluate both training and testing accuracies. Models are trained for 5 epochs." "We use a batch size of 10 and train for 10 steps."
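The training procedure from the paper's Algorithm 1 can be sketched in ordinary Python. Everything below is an illustrative assumption rather than the authors' implementation: `llm` stands for any callable that maps a prompt string to a text completion, and the prompt formats, function names, and parameters are hypothetical. The key idea it demonstrates is that in VML the model "parameters" θ are a natural-language description, and the optimizer step is itself an LLM call that revises that description from a batch of (input, prediction, target) triples.

```python
# Hypothetical sketch of VML training (Algorithm 1), not the authors' code.
# `llm` is assumed to be any prompt -> completion callable (e.g. an API client).
import random

def f_model(x, theta, llm):
    """Learner: theta is a natural-language model description, not a weight vector."""
    return llm(f"Model description: {theta}\nInput: {x}\nOutput:")

def f_opt(batch, theta, psi, llm):
    """Optimizer: an LLM revises the verbalized parameters given batch feedback."""
    feedback = "\n".join(
        f"input={x}, prediction={y_hat}, target={y}" for x, y_hat, y in batch
    )
    return llm(
        f"Optimizer instructions: {psi}\n"
        f"Current model description: {theta}\n"
        f"Batch results:\n{feedback}\n"
        f"Revised model description:"
    )

def train(data, theta0, T, M, psi, llm):
    """Run T optimization steps with batch size M, as in Algorithm 1."""
    theta = theta0
    for _ in range(T):
        samples = random.sample(data, M)          # sample M training examples
        batch = [(x, f_model(x, theta, llm), y) for x, y in samples]
        theta = f_opt(batch, theta, psi, llm)     # verbal "gradient step"
    return theta
```

With the setup reported above (100 training points, batch size 10, 5 epochs), one epoch corresponds to 10 such optimizer steps, i.e. `T = 50` in total.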