Revisit Micro-batch Clipping: Adaptive Data Pruning via Gradient Manipulation
Authors: Lun Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we aim to elucidate the behavior of micro-batch clipping through a combination of theoretical analysis and empirical evaluation. Specifically, we conceptualize micro-batch clipping as a specialized form of data pruning (Sorscher et al., 2022). Unlike traditional data pruning techniques, which deterministically exclude redundant data, micro-batch clipping adaptively suppresses samples that hinder convergence, referred to as "draggers", recognizing that a data sample's helpfulness can change throughout training. Guided by this intuition, we introduce Assumption 4.4 to capture certain properties of the draggers' gradients, which are later empirically verified in Section 5.1. Based on this assumption, we analyze the convergence-to-stationary-points rate for both standard SGD and micro-batch clipping on smooth loss manifolds and summarize the results in Table 1. |
| Researcher Affiliation | Industry | Lun Wang Google EMAIL |
| Pseudocode | Yes | Algorithm 1: Pseudocode for SGD with adaptive micro-batch clipping. g_t denotes the mini-batch gradient in the t-th iteration; ĝ_t^j denotes the j-th micro-batch's gradient in the t-th iteration. Input: initial parameters w_0, loss function L, training data D, #iterations T, micro-batch size b, mini-batch size B, learning rate η. 1: for t = 1, 2, ..., T do 2: sample a mini-batch {d_i^t}_{i ∈ {1,...,B}} from D 3: for j = 1, 2, ..., B/b (in parallel) do 4: load a micro-batch {d_i^t}_{i ∈ {b(j−1)+1,...,bj}} 5: ĝ_t^j = ∇L(w_{t−1}, {d_i^t}_{i ∈ {b(j−1)+1,...,bj}}) ▷ average gradient of a micro-batch 6: ρ_t = min_j ‖ĝ_t^j‖_2 7: for j = 1, 2, ..., B/b (in parallel) do 8: ĝ_t^j = (ρ_t / ‖ĝ_t^j‖_2) ĝ_t^j ▷ adaptive clipping 9: g_t = (b/B) Σ_{j ∈ {1,...,B/b}} ĝ_t^j 10: w_t = w_{t−1} − η g_t ▷ update the model parameters |
| Open Source Code | No | The paper does not contain any explicit statement about providing source code or links to a code repository. |
| Open Datasets | Yes | Specifically, we fine-tune 600M Conformer XL (Zhang et al., 2020) models on the LibriSpeech dataset (Panayotov et al., 2015) and handcrafted canaries (Wang et al., 2024b). The model's encoder is pre-trained using BEST-RQ (Chiu et al., 2022) on the Libri-Light dataset (Kahn et al., 2020). The key advantage of this setup is the ability to treat the inserted canaries as surrogate draggers, thereby circumventing the technical challenge of identifying natural draggers in large models. |
| Dataset Splits | Yes | Performance is assessed using word error rate (WER) on two splits of the LibriSpeech test dataset: test-clean, consisting of relatively clean utterances, and test-other, consisting of noisier utterances. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for experiments. |
| Software Dependencies | No | The paper does not provide specific software dependency details with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | To answer question 1), we adopt the experimental setup from Wang et al. (Wang et al., 2024a). Specifically, we fine-tune 600M Conformer XL (Zhang et al., 2020) models on the LibriSpeech dataset (Panayotov et al., 2015) and handcrafted canaries (Wang et al., 2024b). The model's encoder is pre-trained using BEST-RQ (Chiu et al., 2022) on the Libri-Light dataset (Kahn et al., 2020). |
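The adaptive micro-batch clipping step described in Algorithm 1 can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: `grad_fn` is a hypothetical callback returning the average gradient over a micro-batch, and the example assumes B is divisible by b and that no micro-batch gradient is exactly zero (which would make the clipping bound ρ_t degenerate).

```python
import numpy as np

def microbatch_clip_step(w, grad_fn, minibatch, b, lr):
    """One SGD step with adaptive micro-batch clipping (Algorithm 1 sketch).

    w         : current parameters (np.ndarray)
    grad_fn   : grad_fn(w, micro) -> average gradient over a micro-batch
    minibatch : array of B samples; split into B/b micro-batches of size b
    b, lr     : micro-batch size and learning rate
    """
    B = len(minibatch)
    micro_batches = [minibatch[j:j + b] for j in range(0, B, b)]
    # Step 5: per-micro-batch average gradients.
    g_hats = [grad_fn(w, m) for m in micro_batches]
    # Step 6: adaptive clipping bound = smallest micro-batch gradient norm.
    rho = min(np.linalg.norm(g) for g in g_hats)
    # Step 8: rescale every micro-batch gradient to norm rho.
    clipped = [g * (rho / np.linalg.norm(g)) for g in g_hats]
    # Step 9: aggregate, g_t = (b/B) * sum_j g_hat_j.
    g = (b / B) * np.sum(clipped, axis=0)
    # Step 10: SGD update.
    return w - lr * g
```

Because every micro-batch gradient is rescaled to the minimum norm ρ_t, a micro-batch dominated by a dragger (whose gradient norm is much larger than the rest) has its influence on the update suppressed, which is the data-pruning interpretation the paper develops. For example, with loss L(w, d) = ½‖w − d‖² per sample (so the micro-batch gradient is `w - m.mean(axis=0)`), a micro-batch whose mean sits three times farther from `w` contributes no more to the step than the nearest one.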