AKORN: Adaptive Knots generated Online for RegressioN splines

Authors: Sunil Madhow, Dheeraj Baby, Yu-Xiang Wang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental 5. Experimental Results

5.1. Local adaptivity. As a matter of interest, we begin by noting that the fitted values from ADDLE can be expressed as $\hat{Y}_{\text{ADDLE}} = W_{\text{ADDLE}} Y$ for some hat matrix $W_{\text{ADDLE}}$ that depends on the weights of the ADDLE instance. We can do the same for AKORN: $\hat{Y}_{\text{AKORN}} = W_{\text{AKORN}} Y$, where $W_{\text{AKORN}} = H_K^\top (H_K H_K^\top)^{-1} H_K$. Because the bandwidth of these matrices around each diagonal element $(i, i)$ corresponds to the neighborhood of the data used in forming the $i$th prediction $\hat{y}_i$, we dub $W_{\text{ADDLE}}$ and $W_{\text{AKORN}}$ attention maps. It is informative to compare these learned attention maps to the static attention induced by local linear regression, as we do in Figure 1 for Donoho & Johnstone (1994)'s Doppler function. Unlike local linear regression, ADDLE has learned to adaptively optimize a bias-variance trade-off in the neighborhood of each data point by choosing a spatially varying bandwidth. Similarly, we see that AKORN inherits ADDLE's learned knowledge of the geometry of the ground truth. It is well known that this kind of local adaptivity is essential to obtaining optimal rates over TV classes (Donoho & Johnstone, 1994).

5.2. Performance comparison. While ADDLE's ability to learn the local smoothness of the ground truth is remarkable, it performs comparatively poorly on offline datasets, as we demonstrate in Figure 5 in Appendix G.2. In this section, we show that AKORN is able to use ADDLE's adaptivity while efficiently leveraging offline data. We compare several policies.

1. Oracle linear trend filtering. We solve the variational problem described in Section A for a grid of possible $\lambda$. We then measure the true MSE against the ground truth and return the best fit. We emphasize that this policy requires oracle knowledge of the ground truth.

2. DoF linear trend filtering. We solve the same trend filtering optimization problem for the same set of possible $\lambda$. We then form the Stein estimate of the risk by estimating the degrees of freedom of each model (Tibshirani & Taylor, 2012), and choose the best fit. Note that this procedure has no theoretical guarantees, is computationally intensive, and can require high-precision arithmetic when the dataset is large.

3. Wavelets. We use the soft-thresholding estimator of Donoho & Johnstone (1998) with Daubechies-2 wavelets. To our knowledge, apart from AKORN, this is the only optimal and parameter-free algorithm for estimating $TV^1$ functions.

4. AKORN. We run AKORN as described in Section 4.2, with failure probability $\delta = 0.1$.

In Figure 2, we display a log-log plot of the error of each policy for various ground truths, against exponentially increasing values of $n$. In Table 1, we report the slope of the lines corresponding to Oracle Trend Filtering and AKORN, as estimated by a linear least-squares fit. This gives the approximate rate of each estimator. Across all ground truths, we observe that AKORN competes closely with Trend Filtering, despite the fact that the latter is provided with access to the ground truth. The first example in Figure 2, together with the first column in Table 1, suggests that AKORN adaptively achieves the parametric rate $1/n$ on piecewise linear functions, as does trend filtering. The second example in Figure 2 and Table 1 validates that AKORN achieves the claimed rate of $O(n^{-4/5})$ on spatially heterogeneous functions like the Doppler function of Donoho & Johnstone (1994). The final example in Figure 2 represents runs of each policy on the ground truths $\theta_n = [0^\top_{\mathbb{R}^{n-5}}, 1, 2, 3, 4, 5]^\top$. In this setting, the fast rates for Trend Filtering from Guntuboyina et al. (2020) do not apply, because the final linear segment of the ground truth is small. Notice that in this case $TV^1[\theta] = \Theta(n)$, which means that the rate predicted both by our theory and by that of Trend Filtering is $O(n^{-4/5} \cdot n^{2/5}) = O(n^{-2/5})$. While this rate seems to be accurate for Oracle Trend Filtering, our experimental results indicate that AKORN substantially outperforms it: the least-squares slope of the orange line is about $-0.7$ (Table 1). These results indicate that AKORN's enhanced adaptivity leads to favorable performance on highly irregular problem instances. In Table 2, we report the MSEs of each policy for fixed $n$ and various noise levels $\sigma$. From the table, we gather that AKORN is competitive with both Oracle and DoF Trend Filtering, especially for larger values of $\sigma$; Wavelets is substantially behind all other policies.
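The hat-matrix expression quoted above is easy to check numerically. Below is a minimal sketch (not taken from the paper's code) that forms $W = H^\top (H H^\top)^{-1} H$ for a plain linear design and verifies that it is an orthogonal projection, which is what makes each row of $W$ interpretable as attention weights over the data points:

```python
import numpy as np

def hat_matrix(H):
    """W = H^T (H H^T)^{-1} H: orthogonal projection onto the row space of H."""
    return H.T @ np.linalg.solve(H @ H.T, H)

# Toy design: ordinary linear regression on n points (2 x n design matrix).
# With a spline basis H_K in place of H, the same formula gives W_AKORN.
n = 8
x = np.linspace(0.0, 1.0, n)
H = np.vstack([np.ones(n), x])
W = hat_matrix(H)

assert np.allclose(W, W.T)           # symmetric
assert np.allclose(W @ W, W)         # idempotent: W is a projection
assert np.isclose(np.trace(W), 2.0)  # trace = rank = degrees of freedom
# Row i of W gives the weight each y_j receives in prediction i ("attention").
```

For a spline basis, the rows of $W$ concentrate around the diagonal with a width set by the local knot spacing, which is the bandwidth picture described in the quote.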
Researcher Affiliation Collaboration ¹Halicioglu Data Science Institute, UC San Diego; ²Amazon (work was completed prior to joining). Correspondence to: Sunil Madhow <EMAIL>.
Pseudocode Yes

Algorithm 1 Find Knots
  Input: data {(x_t, y_t)}_{t=1}^n, variance σ²
  b ← 0; K ← {}
  Start ADDLE instance A
  for t ∈ {1, ..., n} do
    θ_t ← prediction for x_t from A
    â_t ← LinearLeastSquares(z_{b:(t−1)})   {â_t ∈ R²}
    for all j ∈ {b, ..., t}, set ŵ_j^t ← â_t^⊤ x_j
    s ← Σ_{j=b}^{t} (ŵ_j^t − θ_t)²
    if s > 5σ² log(2n²/δ) then
      K ← K ∪ {x_{t−1}}
      b ← t
      Restart A
    end if
    Update A with y_t
  end for
  Output: K

Algorithm 2 AKORN
  Input: data {(x_t, y_t)}_{t=1}^n, variance σ²
  K_f ← FindKnots({(x_t, y_t)}_{t=1}^n, σ²)
  K_b ← FindKnots({(x_t, y_t)}_{t=n}^1, σ²)
  g ← P_{F(K_f)} Y
  h ← P_{F(K_b)} Y
  K_× ← Crossovers(g, h, {x_t}_{t=1}^n)
  K_collision ← {x_t : (t < n − 1) ∧ x_{t+1} ∈ K_f ∩ K_b}
  K ← K_f ∪ K_b ∪ K_× ∪ K_collision
  Output: f̂ = P_{S(K)} Y

Algorithm 3 Bounded Linear Regression Expert
  Input: history H ∈ (X × [0, 1])*, feature x, bound B
  {Generates a B-bounded prediction given history H}
  if H is empty then
    Output 0
  else if H = {(x_1^H, y_1^H)} then
    Output [y_1^H]_{[−B,B]}   {[z]_{[A,B]} clips a real number z to [A, B]}
  else
    l̂ ← linear least-squares fit on H
    Output [l̂(x)]_{[−B,B]}
  end if

Algorithm 4 Linear Regression Expert
  Input: history H ∈ (X × [0, 1])*, feature x
  {Generates a prediction given history H}
  if H is empty then
    Output 0
  else if H = {(x_1^H, y_1^H)} then
    Output y_1^H
  else
    l̂ ← linear least-squares fit on H
    Output l̂(x)
  end if

Algorithm 5 ADDLE
  Input: D, σ
  B ← max_i |y_i| + max{σ√(2 log(4n/δ)), 1}
  Run FLH with learning rate η = 1 / (8(1 + σ log(2n/δ))²) and base learners
  E_j(t) = Predict({(x_k, y_k)}_{k=j}^{t−1}, x_t, B)
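The restart rule in Algorithm 1 can be illustrated in a few lines. This sketch makes one strong simplification: the ADDLE predictor is replaced by an ordinary least-squares fit on the current segment, so only the thresholded-restart control flow and the 5σ² log(2n²/δ) threshold match the pseudocode. The name `find_knots_sketch` is ours, not the paper's implementation.

```python
import numpy as np

def find_knots_sketch(x, y, sigma, delta=0.1):
    """Sketch of the restart logic in Find Knots (Algorithm 1).

    ADDLE is replaced by a plain least-squares fit over the current
    segment, so this illustrates only the control flow, not the
    paper's aggregated online predictor.
    """
    n = len(x)
    knots, b = [], 0                            # b = start of current segment
    thresh = 5 * sigma**2 * np.log(2 * n**2 / delta)
    for t in range(2, n + 1):
        A = np.vstack([np.ones(t - b), x[b:t]]).T
        coef, *_ = np.linalg.lstsq(A, y[b:t], rcond=None)
        s = np.sum((y[b:t] - A @ coef) ** 2)    # squared residual on segment
        if s > thresh:                          # segment no longer looks linear
            knots.append(x[t - 1])              # record a knot and restart
            b = t
    return knots

# Piecewise-linear ground truth with a single kink at x = 0.5 (noiseless).
x = np.linspace(0.0, 1.0, 200)
y = np.abs(x - 0.5)
knots = find_knots_sketch(x, y, sigma=0.01)
```

On this example the residual test stays near zero on each linear piece and fires shortly after the kink enters the window, so a single knot is placed just past x = 0.5.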
Open Source Code Yes Code is available at github.com/SunilMadhow/AKORN.
Open Datasets No Consider the problem of non-parametric regression over total variation smoothness classes. For some covariates $\{x_i\}_{i=1}^n$, we observe data $y_i = f(x_i) + \epsilon_i$ (1), where $\{\epsilon_i\}$ are i.i.d. $N(0, \sigma^2)$ random variables and $f$ has bounded $k$-th order total variation, which means that the variation in its $k$-th derivative is controlled.
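The observation model in the quote is straightforward to instantiate. A sketch generating noisy Doppler data, the ground truth used in the paper's Figure 1, with the standard Donoho & Johnstone formulation (warping constant 0.05) and the n and σ quoted in the experiment setup:

```python
import numpy as np

def doppler(x, eps=0.05):
    """Doppler test function of Donoho & Johnstone (1994)."""
    return np.sqrt(x * (1.0 - x)) * np.sin(2.0 * np.pi * (1.0 + eps) / (x + eps))

rng = np.random.default_rng(0)
n, sigma = 5000, 0.3                        # values quoted for Figure 2
x = np.linspace(0.0, 1.0, n)
y = doppler(x) + rng.normal(0.0, sigma, n)  # y_i = f(x_i) + eps_i, eps ~ N(0, sigma^2)
```

The envelope sqrt(x(1−x)) keeps |f| ≤ 0.5 while the frequency blows up near x = 0, which is the spatial heterogeneity that local adaptivity has to handle.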
Dataset Splits No The paper describes generating synthetic data for experiments based on mathematical functions and adding noise, e.g., "Generate ε ~ N(0, σ²I_n)" and "20 Monte-Carlo runs for each n". It does not mention predefined training, validation, or test splits for a specific dataset.
Hardware Specification No The paper does not provide specific hardware details (e.g., CPU, GPU models, or memory) used for running its experiments.
Software Dependencies No We use the library glmgen: https://github.com/glmgen/glmgen. The paper mentions the 'glmgen' library but does not provide a specific version number.
Experiment Setup Yes 4. AKORN. We run AKORN as described in Section 4.2, with failure probability δ = 0.1. Figure 2. Estimated rates for AKORN and Oracle Trend Filtering formed using 20 Monte-Carlo runs for each n, together with representative fits for n = 5000; σ = 0.3, δ = 0.1. Algorithm 5 ADDLE: input D, σ; B ← max_i |y_i| + max{σ√(2 log(4n/δ)), 1}; run FLH with learning rate η = 1 / (8(1 + σ log(2n/δ))²) and base learners E_j(t) = Predict({(x_k, y_k)}_{k=j}^{t−1}, x_t, B).
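For concreteness, the bound B and FLH learning rate η from the quoted Algorithm 5 can be computed directly. The toy observations below are illustrative stand-ins for the dataset D; n, σ, and δ follow the quoted setup:

```python
import numpy as np

# Toy observations standing in for D; n, sigma, delta follow the quoted setup.
rng = np.random.default_rng(0)
n, sigma, delta = 5000, 0.3, 0.1
y = rng.normal(0.0, sigma, n)

# Prediction bound B and FLH learning rate eta, as given in Algorithm 5.
B = np.abs(y).max() + max(sigma * np.sqrt(2.0 * np.log(4.0 * n / delta)), 1.0)
eta = 1.0 / (8.0 * (1.0 + sigma * np.log(2.0 * n / delta)) ** 2)
```

Both quantities depend only on the observed data, σ, n, and δ, which is consistent with the paper's claim that the procedure is parameter-free apart from the noise level.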