Minimum Width for Universal Approximation using Squashable Activation Functions
Authors: Jonghyun Shin, Namjun Kim, Geonho Hwang, Sejun Park
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this work, we study the minimum width of networks using general activation functions. Specifically, we focus on squashable functions that can approximate the identity function and the binary step function by alternately composing with affine transformations. We show that for networks using a squashable activation function to universally approximate L^p functions from [0,1]^{d_x} to R^{d_y}, the minimum width is max{d_x, d_y, 2} unless d_x = d_y = 1; the same bound holds for d_x = d_y = 1 if the activation function is monotone. We then provide sufficient conditions for squashability and show that all non-affine analytic functions and a class of piecewise functions are squashable, i.e., our minimum width result holds for those general classes of activation functions. Our result can be used to characterize the minimum width for a general class of practical activation functions by showing their squashability. For example, we show that any non-affine analytic function (e.g., non-affine polynomial, sigmoid, tanh, sin, exp, etc.) is squashable (Lemma 4). Furthermore, we also show that a wide class of piecewise continuously differentiable functions, including leaky ReLU and hardswish, is also squashable (Lemma 5). Hence, our result significantly extends the prior exact minimum width results for ReLU and its variants. We prove our main result in Section 4 and conclude the paper in Section 5. Proofs of technical lemmas are deferred to the Appendix. |
| Researcher Affiliation | Academia | ¹Department of Mathematics Education, Korea University; ²Department of Artificial Intelligence, Korea University; ³Department of Mathematical Sciences, GIST. Correspondence to: Sejun Park <EMAIL>. |
| Pseudocode | No | The paper describes theoretical constructions and proofs, such as: "We use the coding scheme (Park et al., 2021b) to prove our result. In particular, we construct our decoder f_dec as a curve that densely fills the codomain of a target function so that sup_{x ∈ f([0,1]^{d_x})} inf_{y ∈ f_dec([0,1])} ‖x − y‖ is small. We then construct our encoder to map each x ∈ [0,1]^{d_x} to a neighborhood of f_dec^{-1}(z) for some z ≈ f(x)." It does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any statements about releasing code, links to code repositories, or mentions of code in supplementary materials. |
| Open Datasets | No | We show that for networks using a squashable activation function to universally approximate L^p functions from [0,1]^{d_x} to R^{d_y}, the minimum width is max{d_x, d_y, 2} unless d_x = d_y = 1; the same bound holds for d_x = d_y = 1 if the activation function is monotone. Here the minimum width is the least w such that σ networks of width w are dense in L^p([0,1]^{d_x}, R^{d_y}) while σ networks of width w − 1 are not. The paper works with these theoretical function spaces rather than empirical datasets. |
| Dataset Splits | No | The paper analyzes theoretical properties of neural networks over abstract function spaces (L^p functions from [0,1]^{d_x} to R^{d_y}) and does not involve empirical datasets or their splits. |
| Hardware Specification | No | The paper focuses on theoretical mathematical proofs regarding neural network properties and does not describe any experimental setup or hardware used for computations. |
| Software Dependencies | No | The paper is theoretical and focuses on mathematical proofs and definitions; therefore, it does not mention any software dependencies or version numbers. |
| Experiment Setup | No | The paper provides theoretical analysis and mathematical proofs for the minimum width for universal approximation. It does not describe any experimental setup, hyperparameter values, or training configurations. |
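For quick reference, the paper's exact bound can be encoded as a small helper (a sketch; the function name `min_width` is ours, and recall that for d_x = d_y = 1 the bound of 2 is shown only under the additional assumption that the activation is monotone):

```python
def min_width(dx: int, dy: int) -> int:
    """Minimum network width for universal approximation of
    L^p([0,1]^dx, R^dy) with a squashable activation, as stated
    in the paper: max{dx, dy, 2}. For dx = dy = 1 this value (2)
    holds only for monotone squashable activations."""
    return max(dx, dy, 2)


# The bound is driven by the larger of the input/output
# dimensions, but never drops below 2.
print(min_width(3, 1))  # 3
print(min_width(1, 1))  # 2
```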