Mask in the Mirror: Implicit Sparsification
Authors: Tom Jacobs, Rebekka Burkholz
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of PILoT in extensive experiments covering three different scenarios. Firstly, we confirm our theoretical results on the gradient flow in Theorem 2.3. Secondly, we compare PILoT with other state-of-the-art continuous sparsification methods such as STR (Kusupati et al., 2020) and spred (Ziyin & Wang, 2023) in a one-shot setting. In this context, we also isolate the individual contribution of our initialization. Finally, we combine PILoT with iterative pruning methods such as WR (Frankle & Carbin, 2019) and LRR (Maene et al., 2021). ... In experiments for diagonal linear networks and vision benchmarks (including ImageNet), PILoT consistently outperforms baseline sparsification methods such as STR and spred, which demonstrates the utility of our theoretical insights. |
| Researcher Affiliation | Academia | Tom Jacobs, CISPA Helmholtz Center for Information Security; Rebekka Burkholz, CISPA Helmholtz Center for Information Security |
| Pseudocode | Yes | Algorithm 1 PILoT<br>Require: epochs T, schedule α_init, initialization x_init, scaling constant β<br>Initialize m_0, w_0 such that m_0 ⊙ w_0 = x_init and m_0² + w_0² = β; set δ > 1, K, and α_0 ← α_init<br>Current training acc ← 0<br>Set f(m, w, α_0) := f(m ⊙ w) + α_0(‖m‖²_{L2} + ‖w‖²_{L2})<br>for k in 1 … T do<br>  (m_k, w_k) = OptimizerStep(f(m_{k−1}, w_{k−1}, α_{k−1}))<br>  if Training acc ≥ Current training acc and ‖m_k ⊙ w_k‖_{L1} ≤ K and k ≤ T/2 then α_k ← α_{k−1}·δ else α_k ← α_{k−1}/δ end if<br>  Current training acc ← Training acc<br>end for<br>return Model f(x_T) with x_T = m_T ⊙ w_T |
| Open Source Code | No | The codebase for the experiments is written in PyTorch and torchvision and their relevant primitives for model construction and data-related operations. |
| Open Datasets | Yes | Firstly, we compare our method PILoT with STR, spred, and LASSO on CIFAR10 and CIFAR100 training a ResNet-20 or ResNet-18, respectively. ... In Table 1, we compare PILoT to both STR and spred on ImageNet (Deng et al., 2009). |
| Dataset Splits | No | The paper uses well-known datasets (CIFAR10, CIFAR100, ImageNet) that come with standard training, validation, and test splits. However, it does not explicitly state the split percentages or counts, cite the splits used in its experiments, or describe any custom splitting methodology. |
| Hardware Specification | Yes | The experiments in the paper are trained on an NVIDIA A6000. In addition, the diagonal linear network is trained on a 13th Gen Intel(R) Core(TM) i9-13900H CPU. |
| Software Dependencies | No | The codebase for the experiments is written in PyTorch and torchvision and their relevant primitives for model construction and data-related operations. |
| Experiment Setup | Yes | Table 2: One-shot experiment. Optimizer: SGD; Momentum: 0.9; Batch size: 256; Activation function: ReLU; Weight decay: 10⁻⁴; Base learning rate: {0.1, 0.2}; Epochs: 150; Warmup period: 0; Initialization: Kaiming normal; Scaling: 1 (only for m ⊙ w); δ: 1.01; K: 8000; Learning rate schedule: cosine warmup. ... Table 3: ResNet-50 on ImageNet configurations for each sparsity (%). ... Table 5: WR and LRR experiment on ImageNet. Optimizer: SGD; Momentum: 0.9; Batch size: 512; Activation function: ReLU; Weight decay: {0, 10⁻⁴}; Learning rate schedule: step warmup; Base learning rate: {0.1, 0.2}; Cycles: 25; Pruning rate: 0.8; Epochs per cycle: 90; Warmup period: 10; Initialization: Kaiming normal; L2 regularization: 5·10⁻⁵ (only for m ⊙ w); PILoT regularization: {0} (only for m ⊙ w); Scaling: 1 (only for m ⊙ w); δ: 1 (only for m ⊙ w); K: (only for m ⊙ w). |
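The core of Algorithm 1 above is a dynamic schedule for the regularization strength α: it is multiplied by δ while training accuracy does not drop, the L1 norm of m ⊙ w stays below the budget K, and training is in its first half; otherwise it is divided by δ. A minimal pure-Python sketch of that update rule (the function name and argument names are ours, not from the paper):

```python
def update_alpha(alpha, delta, train_acc, prev_acc, l1_norm, K, step, T):
    """One step of the alpha schedule from Algorithm 1 (PILoT).

    Increase regularization (alpha * delta) while training accuracy has not
    dropped, ||m * w||_L1 stays below the budget K, and we are still in the
    first half of the T training epochs; otherwise relax it (alpha / delta).
    """
    if train_acc >= prev_acc and l1_norm <= K and step <= T // 2:
        return alpha * delta
    return alpha / delta
```

With the one-shot settings of Table 2 (δ = 1.01, K = 8000, T = 150), a caller would invoke this once per epoch after the optimizer step, feeding back the current training accuracy and the L1 norm of the effective weights.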
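The reason the m ⊙ w parametrization with L2 regularization (as in Algorithm 1 and the spred baseline) sparsifies is that, for a fixed product x = m·w, the minimal value of (m² + w²)/2 over all factorizations is |x|, attained at the balanced point |m| = |w| = √|x|; so L2 on the factors acts as L1 on the product. A small numeric illustration of that identity (the helper name is hypothetical):

```python
import math


def balanced_factor_penalty(x):
    """Minimal L2 penalty (m^2 + w^2) / 2 over factorizations m * w = x.

    The minimum is attained at the balanced factorization
    |m| = |w| = sqrt(|x|), where the penalty equals |x| exactly,
    i.e. the L1 norm of the product.
    """
    m = math.sqrt(abs(x))          # balanced magnitude
    w = math.copysign(m, x)        # carry the sign so that m * w == x
    return 0.5 * (m * m + w * w)
```

This is a sketch of the identity only; the paper's contribution is the time-dependent schedule and mirror-flow analysis on top of this parametrization, not the identity itself.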