Verbalized Machine Learning: Revisiting Machine Learning with Language Models
Authors: Tim Z. Xiao, Robert Bamler, Bernhard Schölkopf, Weiyang Liu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically verify the effectiveness of VML, and hope that VML can serve as a stepping stone to stronger interpretability. We conduct empirical studies for the injection of verbalized inductive bias and show that it is promising to use natural language as a unified way to encode prior knowledge. Moreover, we validate the effectiveness of VML in different applications (Section 4, Appendices A, E, F, G, H). |
| Researcher Affiliation | Academia | Tim Z. Xiao (Max Planck Institute for Intelligent Systems, Tübingen & University of Tübingen); Robert Bamler (University of Tübingen); Bernhard Schölkopf (Max Planck Institute for Intelligent Systems, Tübingen); Weiyang Liu (Max Planck Institute for Intelligent Systems, Tübingen & University of Cambridge) |
| Pseudocode | Yes | Algorithm 1 (Training in VML): Initialize model parameters θ₀, iteration number T, batch size M, and optimizer parameters ψ; for i = 1, …, T do: sample M training examples x₁, …, x_M; for m = 1, …, M do ŷ_m = f_model(x_m; θ_{i−1}); end; θ_i = f_opt({(x_m, ŷ_m, y_m)}_{m=1}^{M}, θ_{i−1}; ψ); end |
| Open Source Code | No | No explicit statement or link to the authors' own source code for the methodology described in this paper is provided. Mentions of code refer to third-party tools like vLLM or open-interpreter, or code for comparison methods like APE, but not their own implementation of VML. |
| Open Datasets | Yes | We create a subset of the dataset Pneumonia MNIST [64]. [64] Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Bilian Ke, Hanspeter Pfister, and Bingbing Ni. Medmnist v2-a large-scale lightweight benchmark for 2d and 3d biomedical image classification. Scientific Data, 2023. |
| Dataset Splits | Yes | The training set for each task consists of 100 data points. For classifications, we use additional test sets (20 data points), and evaluate both training and testing accuracies. Our dataset consists of 100 training data and 100 test data (half pneumonia and half normal for both sets). |
| Hardware Specification | Yes | The LLM is run on a node of 8 A100 GPUs using the inference engine provided by vLLM [20]. |
| Software Dependencies | No | The paper mentions software like "vLLM" and "open-interpreter" and models like "Llama-3 70B" and "GPT-4o" but does not specify version numbers for any libraries or programming languages used in their own implementation. |
| Experiment Setup | Yes | The training set for each task consists of 100 data points. For all tasks, we use a batch size of 10 for each optimization step (see Figure 2 (right) as an example), which corresponds to 10 steps per training epoch. For classifications, we use additional test sets (20 data points), and evaluate both training and testing accuracies. Models are trained for 5 epochs. We use a batch size of 10 and train for 10 steps. |
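The control flow of Algorithm 1 quoted above can be sketched in Python. This is a minimal sketch, not the authors' implementation: in VML, `f_model` and `f_opt` are calls to an LLM and θ is a natural-language model description, whereas the stubs below (a scalar `scale` parameter and a hand-written gradient-style update, both hypothetical) exist only so the loop structure is runnable.

```python
import random

def f_model(x, theta):
    # Stand-in for the learner LLM: in VML this would prompt an LLM
    # with the natural-language model description theta and input x.
    return theta["scale"] * x

def f_opt(batch, theta, psi):
    # Stand-in for the optimizer LLM: in VML an LLM revises theta
    # given the batch of (x, y_hat, y) triples. Here we mimic it with
    # a simple least-squares update (hypothetical, for illustration).
    grad = sum((y_hat - y) * x for x, y_hat, y in batch) / len(batch)
    return {"scale": theta["scale"] - psi["lr"] * grad}

def train(data, T=20, M=10, psi=None, seed=0):
    # data: list of (x, y) pairs; T iterations; batch size M,
    # matching the batch size of 10 reported in the paper.
    psi = psi or {"lr": 0.01}
    random.seed(seed)
    theta = {"scale": 0.0}  # theta_0
    for _ in range(T):
        batch_xy = random.sample(data, M)
        batch = [(x, f_model(x, theta), y) for x, y in batch_xy]
        theta = f_opt(batch, theta, psi)  # theta_i from theta_{i-1}
    return theta
```

On a toy regression target y = 2x, the stub optimizer drives `scale` toward 2, which illustrates the outer loop; in actual VML both inner calls would be replaced by prompted LLM inference.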