Revisit Micro-batch Clipping: Adaptive Data Pruning via Gradient Manipulation
Authors: Lun Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we aim to elucidate the behavior of micro-batch clipping through a combination of theoretical analysis and empirical evaluation. Specifically, we conceptualize micro-batch clipping as a specialized form of data pruning (Sorscher et al., 2022). Unlike traditional data pruning techniques, which deterministically exclude redundant data, micro-batch clipping adaptively suppresses samples that hinder convergence, referred to as "draggers", recognizing that a data sample's helpfulness can change throughout training. Guided by this intuition, we introduce Assumption 4.4 to capture certain properties of the draggers' gradients, which are later empirically verified in Section 5.1. Based on this assumption, we analyze the convergence-to-stationary-points rate for both standard SGD and micro-batch clipping on smooth loss manifolds and summarize the results in Table 1. |
| Researcher Affiliation | Industry | Lun Wang Google EMAIL |
| Pseudocode | Yes | Algorithm 1: Pseudocode for SGD with adaptive micro-batch clipping. g_t denotes the mini-batch gradient in the t-th iteration; ĝ_t^j denotes the j-th micro-batch's gradient in the t-th iteration. Input: initial parameters w_0, loss function L, training data D, #iterations T, micro-batch size b, mini-batch size B, learning rate η. 1: for t = 1, 2, ..., T do 2: sample a mini-batch {d_i^t}_{i ∈ {1,...,B}} from D 3: for j = 1, 2, ..., B/b (in parallel) do 4: load a micro-batch {d_i^t}_{i ∈ {b(j−1)+1,...,bj}} 5: ĝ_t^j = ∇L(w_{t−1}, {d_i^t}_{i ∈ {b(j−1)+1,...,bj}}) ▷ average gradient of a micro-batch 6: ρ_t = min_j ‖ĝ_t^j‖_2 7: for j = 1, 2, ..., B/b (in parallel) do 8: ĝ_t^j = (ρ_t / ‖ĝ_t^j‖_2) ĝ_t^j ▷ adaptive clipping 9: g_t = (b/B) Σ_{j ∈ {1,...,B/b}} ĝ_t^j 10: w_t = w_{t−1} − η g_t ▷ update the model parameters |
| Open Source Code | No | The paper does not contain any explicit statement about providing source code or links to a code repository. |
| Open Datasets | Yes | Specifically, we fine-tune 600M Conformer XL (Zhang et al., 2020) models on the LibriSpeech dataset (Panayotov et al., 2015) and handcrafted canaries (Wang et al., 2024b). The model's encoder is pre-trained using BEST-RQ (Chiu et al., 2022) on the Libri-Light dataset (Kahn et al., 2020). The key advantage of this setup is the ability to treat the inserted canaries as surrogate draggers, thereby circumventing the technical challenge of identifying natural draggers in large models. |
| Dataset Splits | Yes | Performance is assessed using word error rate (WER) on two splits of the LibriSpeech test dataset: test-clean, consisting of relatively clean utterances, and test-other, consisting of noisier utterances. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for experiments. |
| Software Dependencies | No | The paper does not provide specific software dependency details with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | To answer question 1), we adopt the experimental setup from Wang et al. (Wang et al., 2024a). Specifically, we fine-tune 600M Conformer XL (Zhang et al., 2020) models on the LibriSpeech dataset (Panayotov et al., 2015) and handcrafted canaries (Wang et al., 2024b). The model's encoder is pre-trained using BEST-RQ (Chiu et al., 2022) on the Libri-Light dataset (Kahn et al., 2020). |
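The adaptive micro-batch clipping step described in Algorithm 1 can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: `grad_fn` is a hypothetical callback returning the average gradient over a micro-batch, and the example assumes B is divisible by b and that no micro-batch gradient is exactly zero (which would make the clipping bound ρ_t degenerate).

```python
import numpy as np

def microbatch_clip_step(w, grad_fn, minibatch, b, lr):
    """One SGD step with adaptive micro-batch clipping (Algorithm 1 sketch).

    w         : current parameters (np.ndarray)
    grad_fn   : grad_fn(w, micro) -> average gradient over a micro-batch
    minibatch : array of B samples; split into B/b micro-batches of size b
    b, lr     : micro-batch size and learning rate
    """
    B = len(minibatch)
    micro_batches = [minibatch[j:j + b] for j in range(0, B, b)]
    # Step 5: per-micro-batch average gradients.
    g_hats = [grad_fn(w, m) for m in micro_batches]
    # Step 6: adaptive clipping bound = smallest micro-batch gradient norm.
    rho = min(np.linalg.norm(g) for g in g_hats)
    # Step 8: rescale every micro-batch gradient to norm rho.
    clipped = [g * (rho / np.linalg.norm(g)) for g in g_hats]
    # Step 9: aggregate, g_t = (b/B) * sum_j g_hat_j.
    g = (b / B) * np.sum(clipped, axis=0)
    # Step 10: SGD update.
    return w - lr * g
```

Because every micro-batch gradient is rescaled to the minimum norm ρ_t, a micro-batch dominated by a dragger (whose gradient norm is much larger than the rest) has its influence on the update suppressed, which is the data-pruning interpretation the paper develops. For example, with loss L(w, d) = ½‖w − d‖² per sample (so the micro-batch gradient is `w - m.mean(axis=0)`), a micro-batch whose mean sits three times farther from `w` contributes no more to the step than the nearest one.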