lo-fi: distributed fine-tuning without communication
Authors: Mitchell Wortsman, Suchin Gururangan, Shen Li, Ali Farhadi, Ludwig Schmidt, Michael Rabbat, Ari S. Morcos
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section presents our experiments which test whether communication is required during fine-tuning. First we use the DeiT-III codebase (Touvron et al., 2022) to fine-tune their pre-trained ImageNet-21k models on ImageNet, where we observe that lo-fi matches the baseline but without communication between nodes (Section 3.1). Next, we fine-tune CLIP (Radford et al., 2021) on ImageNet, WILDS-FMoW (Koh et al., 2021; Christie et al., 2018) and WILDS-iWildCam (Koh et al., 2021; Beery et al., 2021) (Section 3.2). Finally, we show preliminary experiments applying lo-fi outside of computer vision (Section 3.3) and benchmark the associated speed-ups by removing communication (Section 3.4). |
| Researcher Affiliation | Collaboration | Mitchell Wortsman EMAIL University of Washington Suchin Gururangan EMAIL University of Washington Shen Li EMAIL Meta AI Research, FAIR Team Ali Farhadi EMAIL University of Washington Ludwig Schmidt EMAIL University of Washington Michael Rabbat EMAIL Meta AI Research, FAIR Team Ari S. Morcos EMAIL Meta AI Research, FAIR Team |
| Pseudocode | No | The paper describes the 'lo-fi' method and 'With communication' method in prose. For example: 'lo-fi. With local fine-tuning (lo-fi), we partition the n devices into K disjoint groups...Then, at the end of fine-tuning there is a single communication and the parameters from each group are averaged to produce a final solution θ = (1/K) ∑_{k=1}^{K} θ_k.' However, there are no clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The text does not explicitly state that the authors are releasing their own code for the lo-fi methodology. It mentions using 'the DeiT-III codebase (Touvron et al., 2022)' and 'Huggingface Transformers library (Wolf et al., 2020)', which are third-party tools, but not specific code for the paper's contributions. |
| Open Datasets | Yes | In particular, we fine-tune their ImageNet-21k models on ImageNet-1k (Deng et al., 2009) with and without lo-fi... We also test CLIP ViT-L on two further datasets, WILDS-FMoW (Koh et al., 2021; Christie et al., 2018), a satellite image recognition dataset with a temporal distribution shift, and WILDS-iWildCam (Koh et al., 2021; Beery et al., 2021)... We fine-tune on the Pile's Common Crawl subset (Gao et al., 2021)... we expand our results to also consider fine-tuning a RoBERTa-base (Liu et al., 2019) model on tasks from the GLUE benchmark (Wang et al., 2018). |
| Dataset Splits | Yes | In addition to evaluating on ImageNet (IN), the task used for fine-tuning, we also evaluate on the distribution shifts ImageNet-V2 (IN-V2, (Recht et al., 2019)), ImageNet-R (IN-R, (Hendrycks et al., 2021a)), ImageNet-Sketch (Wang et al., 2019), and ImageNet-A (IN-A, (Hendrycks et al., 2021b))... WILDS-FMoW: ID... WILDS-FMoW: OOD... WILDS-iWildCam: ID... WILDS-iWildCam: OOD... We also found that scheduling a single-node job was notably faster than multi-node jobs, taking 45 minutes for 1 node, 2 hours for 2-4 nodes, and 3 hours for 8 nodes. |
| Hardware Specification | Yes | We examine the wall-clock training time advantage once nodes are allocated and also the time it takes for node allocation on a slurm cluster... To examine the wall-clock advantage of lo-fi compared to the baseline we use A100 GPUs on AWS with fast interconnect of 400 Gbps (EFA)... Each node consists of 8 Volta 32GB GPUs connected with 400 Gbps interconnect. |
| Software Dependencies | Yes | A recent innovation in distributed training tooling is to overlap the backwards-pass computation and gradient communication... which is the default in PyTorch (Paszke et al., 2019) as of v1.5... We fine-tune on the Pile's Common Crawl subset (Gao et al., 2021) using the Huggingface Transformers library (Wolf et al., 2020). |
| Experiment Setup | Yes | We improved our own baseline over that in the paper with the following hyperparameter changes: (i) Instead of removing the classification layer of the pre-trained model, we implement a version of LP-FT (Kumar et al., 2022): to fine-tune, we preserved the ImageNet-21k classifier, then use a class mapping from ImageNet-21k to ImageNet classes. (ii) We remove the grayscale, solarization, and Gaussian blur augmentations, since we found this improves accuracy. (iii) We fine-tuned for fewer epochs, which also required a switch to a cosine scheduler that updates every iteration instead of every epoch so the schedule could complete. We also considered different values for the learning rate and stochastic depth... Lo-fi was run using identical hyperparameters except we decreased the stochastic depth drop rate by 0.05... For the 125M parameter model, we set the learning rate to 6e-5, with 1024-length sequence blocks, and 500K tokens per batch. For the 1.3B parameter model, we set the learning rate to 1e-5, with 512-length sequence blocks, and 1M tokens per batch. We use fp16 mixed precision (Micikevicius et al., 2017) for all experiments. |
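The paper's single end-of-training merge, θ = (1/K) ∑_{k=1}^{K} θ_k, amounts to an element-wise average over the K independently fine-tuned parameter sets. The sketch below is illustrative only (the function name and toy scalar parameters are ours, not the authors' released code); the same dict-averaging pattern applies unchanged when the values are PyTorch tensors from `model.state_dict()`.

```python
def lofi_average(state_dicts):
    """Average K independently fine-tuned parameter sets:
    theta = (1/K) * sum_k theta_k (lo-fi's single end-of-training merge)."""
    K = len(state_dicts)
    return {name: sum(sd[name] for sd in state_dicts) / K
            for name in state_dicts[0]}

# Toy example: two "groups" whose scalar parameters stand in for tensors.
theta_1 = {"w": 2.0, "b": 0.0}
theta_2 = {"w": 4.0, "b": 1.0}
theta_avg = lofi_average([theta_1, theta_2])
print(theta_avg)  # {'w': 3.0, 'b': 0.5}
```

Because `sum` and `/` broadcast element-wise over tensors, averaging real checkpoints reduces to calling this on each group's state dict and loading the result with `load_state_dict`, which is why lo-fi needs only one communication round.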