lo-fi: distributed fine-tuning without communication
Authors: Mitchell Wortsman, Suchin Gururangan, Shen Li, Ali Farhadi, Ludwig Schmidt, Michael Rabbat, Ari S. Morcos
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section presents our experiments which test whether communication is required during fine-tuning. First we use the DeiT-III codebase (Touvron et al., 2022) to fine-tune their pre-trained ImageNet-21k models on ImageNet, where we observe that lo-fi matches the baseline but without communication between nodes (Section 3.1). Next, we fine-tune CLIP (Radford et al., 2021) on ImageNet, WILDS-FMoW (Koh et al., 2021; Christie et al., 2018) and WILDS-iWildCam (Koh et al., 2021; Beery et al., 2021) (Section 3.2). Finally, we show preliminary experiments applying lo-fi outside of computer vision (Section 3.3) and benchmark the associated speed-ups by removing communication (Section 3.4). |
| Researcher Affiliation | Collaboration | Mitchell Wortsman EMAIL University of Washington Suchin Gururangan EMAIL University of Washington Shen Li EMAIL Meta AI Research, FAIR Team Ali Farhadi EMAIL University of Washington Ludwig Schmidt EMAIL University of Washington Michael Rabbat EMAIL Meta AI Research, FAIR Team Ari S. Morcos EMAIL Meta AI Research, FAIR Team |
| Pseudocode | No | The paper describes the 'lo-fi' method and 'With communication' method in prose. For example: 'lo-fi. With local fine-tuning (lo-fi), we partition the n devices into K disjoint groups...Then, at the end of fine-tuning there is a single communication and the parameters from each group are averaged to produce a final solution θ = (1/K) ∑_{k=1}^{K} θ_k.' However, there are no clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The text does not explicitly state that the authors are releasing their own code for the lo-fi methodology. It mentions using 'the DeiT-III codebase (Touvron et al., 2022)' and 'Huggingface Transformers library (Wolf et al., 2020)', which are third-party tools, but not specific code for the paper's contributions. |
| Open Datasets | Yes | In particular, we fine-tune their ImageNet-21k models on ImageNet-1k (Deng et al., 2009) with and without lo-fi... We also test CLIP ViT-L on two further datasets, WILDS-FMoW (Koh et al., 2021; Christie et al., 2018), a satellite image recognition dataset with a temporal distribution shift, and WILDS-iWildCam (Koh et al., 2021; Beery et al., 2021)... We fine-tune on the Pile's Common Crawl subset (Gao et al., 2021)... we expand our results to also consider fine-tuning a RoBERTa-base (Liu et al., 2019) model on tasks from the GLUE benchmark (Wang et al., 2018). |
| Dataset Splits | Yes | In addition to evaluating on ImageNet (IN), the task used for fine-tuning, we also evaluate on the distribution shifts ImageNet-V2 (IN-V2, (Recht et al., 2019)), ImageNet-R (IN-R, (Hendrycks et al., 2021a)), ImageNet-Sketch (Wang et al., 2019), and ImageNet-A (IN-A, (Hendrycks et al., 2021b))... WILDS-FMoW: ID... WILDS-FMoW: OOD... WILDS-iWildCam: ID... WILDS-iWildCam: OOD... We also found that scheduling a single-node job was notably faster than multi-node jobs, taking 45 minutes for 1 node, 2 hours for 2-4 nodes, and 3 hours for 8 nodes. |
| Hardware Specification | Yes | We examine the wall-clock training time advantage once nodes are allocated and also the time it takes for node allocation on a slurm cluster... To examine the wall-clock advantage of lo-fi compared to the baseline we use A100 GPUs on AWS with fast interconnect of 400 Gbps (EFA)... Each node consists of 8 Volta 32GB GPUs connected with 400 Gbps interconnect. |
| Software Dependencies | Yes | A recent innovation in distributed training tooling is to overlap the backwards-pass computation and gradient communication... which is the default in PyTorch (Paszke et al., 2019) as of v1.5... We fine-tune on the Pile's Common Crawl subset (Gao et al., 2021) using the Huggingface Transformers library (Wolf et al., 2020). |
| Experiment Setup | Yes | We improved our own baseline over that in the paper with the following hyperparameter changes: (i) Instead of removing the classification layer of the pre-trained model, we implement a version of LP-FT (Kumar et al., 2022): to fine-tune, we preserved the ImageNet-21k classifier, then use a class mapping from ImageNet-21k to ImageNet classes. (ii) We remove the grayscale, solarization, and Gaussian blur augmentations, since we found this improves accuracy. (iii) We fine-tuned for fewer epochs, which also required a switch to a cosine scheduler that updates every iteration instead of every epoch so the schedule could complete. We also considered different values for the learning rate and stochastic depth... Lo-fi was run using identical hyperparameters except we decreased the stochastic depth drop rate by 0.05... For the 125M parameter model, we set the learning rate to 6e-5, with 1024-length sequence blocks, and 500K tokens per batch. For the 1.3B parameter model, we set the learning rate to 1e-5, with 512-length sequence blocks, and 1M tokens per batch. We use fp16 mixed precision (Micikevicius et al., 2017) for all experiments. |
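The paper's single end-of-training merge, θ = (1/K) ∑_{k=1}^{K} θ_k, amounts to an element-wise average over the K independently fine-tuned parameter sets. The sketch below is illustrative only (the function name and toy scalar parameters are ours, not the authors' released code); the same dict-averaging pattern applies unchanged when the values are PyTorch tensors from `model.state_dict()`.

```python
def lofi_average(state_dicts):
    """Average K independently fine-tuned parameter sets:
    theta = (1/K) * sum_k theta_k (lo-fi's single end-of-training merge)."""
    K = len(state_dicts)
    return {name: sum(sd[name] for sd in state_dicts) / K
            for name in state_dicts[0]}

# Toy example: two "groups" whose scalar parameters stand in for tensors.
theta_1 = {"w": 2.0, "b": 0.0}
theta_2 = {"w": 4.0, "b": 1.0}
theta_avg = lofi_average([theta_1, theta_2])
print(theta_avg)  # {'w': 3.0, 'b': 0.5}
```

Because `sum` and `/` broadcast element-wise over tensors, averaging real checkpoints reduces to calling this on each group's state dict and loading the result with `load_state_dict`, which is why lo-fi needs only one communication round.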