Seq-VCR: Preventing Collapse in Intermediate Transformer Representations for Enhanced Reasoning
Authors: Md Rifat Arefin, Gopeshh Raaj Subbaraj, Nicolas Gontier, Yann LeCun, Irina Rish, Ravid Shwartz-Ziv, Christopher Pal
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our method on challenging arithmetic reasoning tasks. Notably, on the 5×5 integer multiplication task, our approach achieves 99.5% exact match accuracy, surpassing models of the same size (which yield 0% accuracy) and GPT-4 with five-shot CoT prompting (44%). We also demonstrate significant improvements on arithmetic expression and longest increasing subsequence (LIS) datasets. |
| Researcher Affiliation | Collaboration | Md Rifat Arefin1,2,3, Gopeshh Subbaraj1,2, Nicolas Gontier3, Yann LeCun5,6, Irina Rish1,2, Ravid Shwartz-Ziv6, Christopher Pal3,4 — 1Université de Montréal, 2Mila, 3ServiceNow, 4Polytechnique Montréal, 5Meta FAIR, 6New York University |
| Pseudocode | No | The paper describes the methodology using mathematical formulations (Equations 1-5) and textual descriptions, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/rarefin/seq_vcr |
| Open Datasets | Yes | We conduct experiments on three tasks: we first consider the multi-digit multiplication task from the BIG-bench benchmark (Srivastava et al., 2022)... Next, we focus on the Arithmetic Expressions (Feng et al., 2024) dataset... Longest Increasing Sub-sequence (LIS) as described in the Introduction to Algorithms book (Cormen et al., 2022)... utilizing the training data generated by Deng et al. (2023)... We trained GPT-2-small from scratch on the C4 dataset... We performed further experiments to fine-tune the GPT-2 Small model using an enhanced version of the GSM8K dataset... we finetuned a Code-GPT-2 Small model on the CodeXGLUE text-to-code benchmark. |
| Dataset Splits | No | The paper notes that 'Size refers to the training set' in Table 3 and refers to a 'validation set' in the table's footnote. However, it does not provide specific percentages or absolute counts for the training, validation, and test splits of the datasets used in the experiments. For example, it lists '808k' for the 4x4 Mult training set, but the validation and test set sizes are not specified, nor is the splitting methodology. |
| Hardware Specification | Yes | Total Compute Resources: 1× 32 GB GPU, 6 CPUs, 32 GB RAM |
| Software Dependencies | No | The paper mentions 'Optimizer AdamW' in Table 2 but does not provide version numbers for any software dependencies, such as programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | Fine-tuning is performed for 40 epochs with a learning rate of 5×10⁻⁴ and a batch size of 32. Training is conducted for 100 epochs with a learning rate of 1×10⁻⁴ and a batch size of 128. For multiplication tasks λ1 = 1.0 and λ2 = 0.004, and for other tasks λ1 = 0.1 and λ2 = 0.5. Other hyperparameters include: Learning Rate 0.0001, Batch Size 128, Optimizer AdamW, Dropout 0.1, Attn. Heads 4, Epochs 100. |