Efficient Learning with Sine-Activated Low-Rank Matrices
Authors: Yiping Ji, Hemanth Saratchandran, Cameron Gordon, Zeyu Zhang, Simon Lucey
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we propose a novel theoretical framework that integrates a sinusoidal function within the low-rank decomposition. This approach not only preserves the benefits of the parameter efficiency of low-rank methods but also increases the decomposition's rank, thereby enhancing model performance. Our method proves to be a plug-in enhancement for existing low-rank methods, as evidenced by its successful application in Vision Transformers (ViT), Large Language Models (LLMs), Neural Radiance Fields (NeRF) and 3D shape modelling. |
| Researcher Affiliation | Academia | Yiping Ji, Australian Institute for Machine Learning, University of Adelaide; DATA61, CSIRO. Hemanth Saratchandran*, Australian Institute for Machine Learning, University of Adelaide. Cameron Gordon, Australian Institute for Machine Learning, University of Adelaide. Zeyu Zhang, Australian National University. Simon Lucey, Australian Institute for Machine Learning, University of Adelaide. |
| Pseudocode | No | The paper describes its methodology using mathematical equations and textual explanations, particularly in Section 3, 'Methodology', and Appendix A.1, 'Theoretical Framework'. It does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is publicly available at https://yipingji.github.io/sine_activated_PEL/. |
| Open Datasets | Yes | Dataset. We evaluate the natural language understanding (NLU) task performance on the RoBERTa V3 base model (Reimers & Gurevych, 2019). Specifically, we adopt the widely recognized GLUE benchmark (Wang et al., 2018), including CoLA (Warstadt et al., 2018), MRPC (Dolan & Brockett, 2005), QQP, STS-B (Cer et al., 2017), MNLI (Williams et al., 2018), QNLI (Rajpurkar et al., 2016), and RTE (Dagan et al., 2006; Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009). Furthermore, we evaluate sine LoRA by fine-tuning the large-scale language model LLaMA3-8B on commonsense reasoning tasks, which include BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2019), SIQA (Sap et al., 2019), HellaSwag (HS) (Zellers et al., 2019), WinoGrande (WG) (Sakaguchi et al., 2021), ARC-c, ARC-e (Clark et al., 2018) and OBQA (Mihaylov et al., 2018). Experimental setup. We trained the ViT-Small and ViT-Base models from scratch, utilizing the CIFAR-100 and ImageNet-1k datasets, respectively, to establish our baseline performance metrics (Deng et al., 2009; Krizhevsky, 2012). Neural Radiance Fields (NeRFs) represent 3D scene signals by utilizing a set of 2D sparse images (Mildenhall et al., 2020). We evaluate our methods by training a NeRF model on the standard benchmark LLFF dataset, which consists of 8 real-world scenes captured by hand-held cameras (Mildenhall et al., 2019). We use the Thai Statue, Dragon and Lucy instances from the Stanford Scanning Repository (available at https://graphics.stanford.edu/data/3Dscanrep/). |
| Dataset Splits | Yes | Dataset. We evaluate the natural language understanding (NLU) task performance on the RoBERTa V3 base model... Specifically, we adopt the widely recognized GLUE benchmark (Wang et al., 2018)... Experimental setup. We trained the ViT-Small and ViT-Base models from scratch, utilizing the CIFAR-100 and ImageNet-1k datasets, respectively... We evaluate our methods by training a NeRF model on the standard benchmark LLFF dataset... |
| Hardware Specification | Yes | Finetuning Llama3-8B takes roughly 6 hours using LoRA, 7 hours using Sine LoRA, 11 hours using DoRA, and 11 hours using Sine DoRA on an NVIDIA H100 GPU with 96GB of memory. |
| Software Dependencies | No | The paper mentions using established frameworks like LoRA and DoRA, and references the Timm codebase for ConvNeXt experiments, but does not provide specific version numbers for these or other software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Setting. In the Transformer architecture, there are four weight matrices in the self-attention module (Wq, Wk, Wv, Wo) and two in the MLP module (Wup, Wdown). To evaluate RoBERTa V3, we follow the LoRA architecture and implement low-rank adaptation only on Wq and Wv. We study the performance of LoRA and sine LoRA for different ranks k = 1, 2, 4, 8. For sine LoRA, we use frequency = 200 across all ranks. We use a different learning rate and number of epochs for each dataset, as shown in Table 4. Implementation details for Llama3-8B: We followed the settings in (Liu et al., 2024). We study the performance of LoRA and sine LoRA for different ranks k = 4, 8, 16, 32; configurations are shown in Table 5. Experimental setup. We trained the ViT-Small and ViT-Base models from scratch... We use learning rate 1e-3, batch size 512 and train for 200 epochs. Choices of frequency for different ranks are shown in Table 9. Implementation details for ViT-Base on ImageNet-1k: We followed the settings in (He et al., 2022a). We use a batch size of 1024, a learning rate of 3e-4 and train for 300 epochs. Implementation details: We use 8 fully connected layers each with 256 neurons, a learning rate of 5e-4 and train for 500k iterations. Implementation details: We use 2 fully connected layers each with 256 neurons, a learning rate of 1e-3 and train for 200 epochs. |
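
The mechanism the paper describes (an elementwise sinusoid applied inside the low-rank product so that the decomposition's rank increases without adding parameters) can be illustrated in a few lines of NumPy. This is only a sketch under stated assumptions, not the authors' released implementation; the function name `sine_lowrank` is invented here, and the default frequency of 200 mirrors the value quoted in the experiment setup above.

```python
import numpy as np

def sine_lowrank(B, A, omega=200.0):
    """Apply an elementwise sinusoid to the rank-r product B @ A.

    B @ A has rank at most r, but sin(omega * (B @ A)) is generically
    full rank; this rank lift is the effect the paper exploits while
    keeping only the n*r + r*n parameters of B and A.
    """
    return np.sin(omega * (B @ A))

rng = np.random.default_rng(0)
n, r = 8, 2
B = rng.standard_normal((n, r))   # n x r factor
A = rng.standard_normal((r, n))   # r x n factor

plain = B @ A                     # rank at most r = 2
lifted = sine_lowrank(B, A)       # typically full rank (n = 8)

print(np.linalg.matrix_rank(plain), np.linalg.matrix_rank(lifted))
```

In a fine-tuning context the same nonlinearity would be applied to the adapter update before it is added to the frozen pretrained weight, so the trainable parameter count of plain LoRA is unchanged.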