On the Crucial Role of Initialization for Matrix Factorization
Authors: Bingcong Li, Liang Zhang, Aryan Mokhtari, Niao He
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation starts with a few-shot learning task following (Malladi et al., 2023). The objective is to rapidly adapt a language model with a small training set. The datasets for this experiment are drawn from the GLUE and SuperGLUE benchmarks (Wang et al., 2019b;a). The performance of different algorithms is summarized in Tab. 2. It is evident that OLoRA, PiSSA, NoRA, and NoRA+ all outperform LoRA because their initialization strategies have provided more favorable directions for optimization. |
| Researcher Affiliation | Academia | 1 ETH Zurich, 2 The University of Texas at Austin |
| Pseudocode | Yes | We summarize NoRA and NoRA+ in Algs. 1 and 2, respectively, in the appendix, with additional explanations in Apdx. A.3. |
| Open Source Code | Yes | Code is available at https://github.com/BingcongLi/NoRA. |
| Open Datasets | Yes | The datasets for this experiment are drawn from the GLUE and SuperGLUE benchmarks (Wang et al., 2019b;a). Consistent with (Malladi et al., 2023), we randomly sample 1,000 data points for training and another 1,000 for testing. ... The base model is selected as Stable Diffusion v1.4 (Rombach et al., 2022) (0.98B parameters in total). ... We tackle commonsense reasoning tasks following the setup in (Hu et al., 2023). Training data are merged from 8 datasets listed in Tab. 4. ... For mathematical problems, we consider the GSM8K (Cobbe et al., 2021) dataset ... We also adopt the MetaMathQA dataset (Yu et al., 2024)... We also use SQuAD (question answering, (Rajpurkar et al., 2016)) in our experiments... |
| Dataset Splits | Yes | Consistent with (Malladi et al., 2023), we randomly sample 1,000 data points for training and another 1,000 for testing. |
| Hardware Specification | Yes | The experiments are conducted with PyTorch (Paszke et al., 2019) on NVIDIA H100 GPUs. |
| Software Dependencies | No | The experiments are conducted with PyTorch (Paszke et al., 2019) on NVIDIA H100 GPUs. |
| Experiment Setup | Yes | The hyperparameters adopted are searched over values in Tab. 5. Adam is adopted for optimization. ... For this experiment, we first search for the best batch sizes for LoRA, and the same batch size is applied for other tested algorithms as well. Then we search additionally for the best learning rate for each algorithm. |
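The quoted finding that OLoRA, PiSSA, NoRA, and NoRA+ beat LoRA traces back to how the low-rank adapter factors are initialized. As a minimal sketch (not the paper's actual implementation, and using toy NumPy matrices rather than model weights), the contrast can be illustrated by comparing LoRA's zero-product initialization with an SVD-based initialization in the style of PiSSA, which aligns the adapter with the principal directions of the pretrained weight:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 32, 16, 4
W = rng.standard_normal((d, k))  # toy stand-in for a pretrained weight matrix

# LoRA-style init: B = 0 and A small random, so the adapter update
# B @ A contributes exactly zero at the start of training.
A_lora = 0.01 * rng.standard_normal((r, k))
B_lora = np.zeros((d, r))

# SVD-based init (PiSSA-style sketch): split the top-r singular
# factors of W between B and A, so the adapter starts aligned with
# W's dominant spectral directions instead of at zero.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
B_svd = U[:, :r] * np.sqrt(S[:r])          # shape (d, r)
A_svd = np.sqrt(S[:r])[:, None] * Vt[:r]   # shape (r, k)

# By Eckart-Young, B_svd @ A_svd is the best rank-r approximation of W,
# and the residual norm equals the norm of the discarded singular values.
err = np.linalg.norm(W - B_svd @ A_svd)
```

The sketch only illustrates the two starting points; which one yields "more favorable directions for optimization" is the empirical question the paper's Tab. 2 addresses.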