Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Pareto-Optimal Fronts for Benchmarking Symbolic Regression Algorithms

Authors: Kei Sen Fong, Mehul Motani

ICML 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | In this paper, we explore absolute Pareto-optimal (APO) solutions instead, which have the optimal tradeoff between the multiple SR objectives, for 34 datasets in the widely-used SR benchmark, SRBench, by performing exhaustive search. Additionally, we include comparisons between eight numerical optimization methods. |
| Researcher Affiliation | Academia | Kei Sen Fong (1), Mehul Motani (1,2). (1) Department of Electrical and Computer Engineering, National University of Singapore, Singapore. (2) N.1 Institute for Health, Institute for Digital Medicine (WisDM), Institute of Data Science, National University of Singapore, Singapore. Correspondence to: Kei Sen Fong <EMAIL>, Mehul Motani <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 outlines the steps for our exhaustive search SR algorithm. At the start of the algorithm, all possible K-expressions are constructed. |
| Open Source Code | Yes | The APO fronts provided serve as an important benchmark and performance limit for SR algorithms and are made publicly available at: https://github.com/kentridgeai/SRParetoFronts |
| Open Datasets | Yes | In this paper, we explore absolute Pareto-optimal (APO) solutions instead, which have the optimal tradeoff between the multiple SR objectives, for 34 datasets in the widely-used SR benchmark, SRBench, by performing exhaustive search. |
| Dataset Splits | No | The optimized expression and its R2 score on the dataset X are then stored. This forms the raw data which we make available. For head length = 3, we ran Algorithm 1 on 34 datasets from SRBench (see Appendix A for dataset details), which had the condition that there were fewer than 1000 datapoints and fewer than 10 features, and for head length = 4, we reduced this to 30 datasets by excluding datasets with more than 6 features. The text does not provide specific train/test/validation split percentages or methodologies within the main paper for the APO front generation; it evaluates on the entire dataset X. |
| Hardware Specification | No | Within our high budget of 1,480,000 core-compute-hours, we repeated the search for 10 random seeds (11284, 11964, 15795, 21575, 22118, 23654, 29802, 5390, 6265, 860) for head length = 3 and did the search for one random seed (11284) for head length = 4. The computational work for this article was partially performed on resources of the National Supercomputing Centre, Singapore (https://www.nscc.sg). The text mentions computational resources but lacks specific hardware models (e.g., CPU/GPU types, memory). |
| Software Dependencies | No | Specifically, we consider the following additional 7 methods: (i) L-BFGS-B (Liu & Nocedal, 1989; Zhu et al., 1997), (ii) conjugate gradient (CG) (Hestenes et al., 1952), (iii) Nelder-Mead (Nelder & Mead, 1965), (iv) Powell (Powell, 1977), (v) sequential least squares programming (SLSQP) (Lawson & Hanson, 1995), (vi) truncated Newton constrained (TNC) (Nash, 2000), (vii) trust-region constrained (trust-constr) (Conn et al., 2000). The paper lists various numerical optimization methods but does not provide specific software library names with version numbers. |
| Experiment Setup | Yes | In our experiments, we use a primitive function set of {Add, Sub, Mul, Div, Pow}, representing addition, subtraction, multiplication, division and power (the absolute value of the base is taken) respectively, all of which have arity two. ... We select two values of head length. ... for 10 random seeds (11284, 11964, 15795, 21575, 22118, 23654, 29802, 5390, 6265, 860) for head length = 3 and did the search for one random seed (11284) for head length = 4. The random seeds are the same values as the 10 used in SRBench and were used to generate initial guesses for numerical optimization for the range (-1,1). |
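The "Pseudocode" excerpt notes that Algorithm 1 begins by constructing all possible K-expressions. For a Karva-style encoding with arity-2 primitives, a head of length h requires a tail of h + 1 terminals, so the raw search space is a Cartesian product of symbol choices. The sketch below illustrates this enumeration under assumed details: the terminal set (`x0`, `x1`, plus a constant slot `c`) is hypothetical and not taken from the paper.

```python
from itertools import product

primitives = ["Add", "Sub", "Mul", "Div", "Pow"]  # arity-2 set quoted in the paper
terminals = ["x0", "x1", "c"]  # hypothetical: two features plus a constant slot

def all_k_expressions(head_len):
    # In Karva notation, a head of length h over arity-2 functions needs a
    # tail of h + 1 terminals to guarantee a valid expression tree.
    tail_len = head_len + 1
    head_symbols = primitives + terminals
    for head in product(head_symbols, repeat=head_len):
        for tail in product(terminals, repeat=tail_len):
            yield head + tail

# Raw encoding count for head length 3: (5 + 3)^3 * 3^4 = 41472
print(sum(1 for _ in all_k_expressions(3)))
```

This only counts raw encodings; distinct encodings can decode to equivalent expressions, so the number of unique expression trees actually evaluated is smaller.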
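The optimizers listed under "Software Dependencies" (L-BFGS-B, CG, Nelder-Mead, Powell, SLSQP, TNC, trust-constr) all correspond to method names accepted by SciPy's `scipy.optimize.minimize`; the paper does not name its software stack, so SciPy is an assumption here. A minimal sketch of fitting the constants of a fixed candidate expression, with initial guesses drawn from (-1, 1) as the "Experiment Setup" row describes (the target function and expression are toy examples, not from the paper):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(11284)  # one of the 10 SRBench seeds quoted above
X = rng.uniform(-1, 1, size=(100, 2))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] ** 2  # toy target, not from the paper

def expr(c, X):
    # Hypothetical candidate expression with two tunable constants c[0], c[1].
    return c[0] * X[:, 0] + c[1] * X[:, 1] ** 2

def mse(c):
    return np.mean((expr(c, X) - y) ** 2)

methods = ["BFGS", "L-BFGS-B", "CG", "Nelder-Mead", "Powell",
           "SLSQP", "TNC", "trust-constr"]
for m in methods:
    c0 = rng.uniform(-1, 1, size=2)  # initial guess in (-1, 1), per the setup
    res = minimize(mse, c0, method=m)
    print(f"{m:12s} mse={res.fun:.3e}")
```

Since the toy objective is quadratic in the constants, every method should converge to nearly the same minimum; on real candidate expressions with nested nonlinearities, the methods can differ substantially, which is presumably why the paper compares them.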
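The APO front itself, over (complexity, error) pairs with both objectives minimized, is simply the set of non-dominated points among all exhaustively scored expressions. A minimal sketch of extracting such a front (the scores below are illustrative toy values, not results from the paper):

```python
def pareto_front(points):
    """Return the non-dominated (complexity, error) pairs, both minimized."""
    front = []
    for size, err in sorted(points):         # sort by complexity, then error
        if not front or err < front[-1][1]:  # keep only strict error improvements
            front.append((size, err))
    return front

# Toy (expression size, 1 - R^2) pairs, not results from the paper
scores = [(3, 0.40), (3, 0.25), (5, 0.30), (5, 0.10), (7, 0.12), (9, 0.05)]
print(pareto_front(scores))  # → [(3, 0.25), (5, 0.10), (9, 0.05)]
```

Because the paper's fronts come from exhaustive search, they are absolute: no SR algorithm searching the same expression space can produce a point that dominates them, which is what makes them usable as a performance limit for benchmarking.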