Minimum Width for Universal Approximation using Squashable Activation Functions

Authors: Jonghyun Shin, Namjun Kim, Geonho Hwang, Sejun Park

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this work, we study the minimum width of networks using general activation functions. Specifically, we focus on squashable activation functions, which can approximate the identity function and the binary step function by alternately composing with affine transformations. We show that for networks using a squashable activation function to universally approximate L^p functions from [0, 1]^{d_x} to R^{d_y}, the minimum width is max{d_x, d_y, 2} unless d_x = d_y = 1; the same bound holds for d_x = d_y = 1 if the activation function is monotone. We then provide sufficient conditions for squashability and show that all non-affine analytic functions and a class of piecewise functions are squashable, i.e., our minimum width result holds for these general classes of activation functions. Our result can therefore characterize the minimum width for a broad class of practical activation functions by establishing their squashability. For example, we show that any non-affine analytic function (e.g., non-affine polynomial, sigmoid, tanh, sin, exp) is squashable (Lemma 4). Furthermore, a wide class of piecewise continuously differentiable functions, including leaky ReLU and hardswish, is also squashable (Lemma 5). Hence, our result significantly extends the prior exact minimum width results for ReLU and its variants. We prove our main result in Section 4 and conclude the paper in Section 5. Proofs of technical lemmas are deferred to the Appendix.
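The squashability property quoted above — a single activation emulating both the identity and the binary step via affine pre- and post-composition — can be illustrated numerically. The sketch below is ours, not the paper's; `approx_step` and `approx_identity` are hypothetical names. It shows the two primitives for sigmoid: a large affine slope squashes it toward the step 1[x > 0], while a tiny slope keeps inputs in its near-linear region around 0 so an affine post-map recovers the identity.

```python
import math

def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def approx_step(x, k=1e4):
    """Affine pre-scaling by a large slope k squashes sigmoid
    toward the binary step 1[x > 0]."""
    return sigmoid(k * x)

def approx_identity(x, eps=1e-4):
    """An affine map with tiny slope eps stays inside sigmoid's
    near-linear region; the affine post-map undoes the shift and
    scale (sigma(0) = 1/2, sigma'(0) = 1/4)."""
    return (sigmoid(eps * x) - 0.5) / (0.25 * eps)

for x in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(x, approx_step(x), approx_identity(x))
```

On [-1, 1], `approx_step` is within machine precision of the step function (and exactly 1/2 at 0), while `approx_identity` matches the identity to roughly `eps**2` accuracy.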
Researcher Affiliation | Academia | 1Department of Mathematics Education, Korea University; 2Department of Artificial Intelligence, Korea University; 3Department of Mathematical Sciences, GIST. Correspondence to: Sejun Park <EMAIL>.
Pseudocode | No | The paper describes theoretical constructions and proofs, such as: "We use the coding scheme (Park et al., 2021b) to prove our result. In particular, we construct our decoder f_dec as a curve that densely fills the codomain of a target function f so that sup_{x ∈ [0,1]^{d_x}} inf_{y ∈ f_dec([0,1])} ||f(x) − y|| is small. We then construct our encoder to map each x ∈ [0, 1]^{d_x} to a neighborhood of f_dec^{-1}(z) for some z close to f(x)." However, it does not include any pseudocode or algorithm blocks.
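The quoted encoder/decoder idea can be mimicked with a toy numerical analogue. This is not the paper's construction (which builds the maps from narrow networks): here bit interleaving stands in for the space-filling decoder curve, and `encode`/`decode` are hypothetical names of ours. Each coordinate of x ∈ [0,1]^2 is quantized to n bits, the bits are interleaved into one scalar code z ∈ [0,1], and the decoder maps z back to a point within 2^-n of x — mirroring how f_dec fills the codomain densely while the encoder picks a code near f_dec^{-1}(f(x)).

```python
def encode(x, n=8):
    """Quantize each coordinate of x in [0,1]^2 to n bits and
    interleave the bits (MSB first) into one scalar in [0,1]."""
    qs = [min(int(xi * 2**n), 2**n - 1) for xi in x]
    z = 0
    for b in range(n):
        for q in qs:
            z = 2 * z + ((q >> (n - 1 - b)) & 1)
    return z / 2.0**(2 * n)

def decode(z, n=8):
    """De-interleave the 2n-bit code back into two n-bit
    coordinates, recovering x up to quantization error 2^-n."""
    code = int(z * 2**(2 * n))
    qs = [0, 0]
    for b in range(2 * n):
        qs[b % 2] = 2 * qs[b % 2] + ((code >> (2 * n - 1 - b)) & 1)
    return [q / 2.0**n for q in qs]

x = [0.3, 0.8]
x_hat = decode(encode(x))  # agrees with x up to 2^-8 per coordinate
```

As n grows, the image of `decode` becomes dense in [0,1]^2, which is the same density property the paper requires of the curve f_dec([0,1]).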
Open Source Code | No | The paper does not contain any statements about releasing code, links to code repositories, or mentions of code in supplementary materials.
Open Datasets | No | The paper states: "We show that for networks using a squashable activation function to universally approximate L^p functions from [0, 1]^{d_x} to R^{d_y}, the minimum width is max{d_x, d_y, 2} unless d_x = d_y = 1; the same bound holds for d_x = d_y = 1 if the activation function is monotone," i.e., σ networks of width w are dense in L^p([0, 1]^{d_x}, R^{d_y}) but σ networks of width w − 1 are not dense. The paper uses theoretical function spaces rather than empirical datasets.
Dataset Splits | No | The paper analyzes theoretical properties of neural networks using abstract function spaces (L^p functions from [0, 1]^{d_x} to R^{d_y}) and does not involve empirical datasets or their splits.
Hardware Specification | No | The paper focuses on theoretical mathematical proofs regarding neural network properties and does not describe any experimental setup or hardware used for computations.
Software Dependencies | No | The paper is theoretical and focuses on mathematical proofs and definitions; therefore, it does not mention any software dependencies or version numbers.
Experiment Setup | No | The paper provides theoretical analysis and mathematical proofs for the minimum width for universal approximation. It does not describe any experimental setup, hyperparameter values, or training configurations.