Seq-VCR: Preventing Collapse in Intermediate Transformer Representations for Enhanced Reasoning
Authors: Md Rifat Arefin, Gopeshh Raaj Subbaraj, Nicolas Gontier, Yann LeCun, Irina Rish, Ravid Shwartz-Ziv, Christopher Pal
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our method on challenging arithmetic reasoning tasks. Notably, on the 5×5 integer multiplication task, our approach achieves 99.5% exact match accuracy, surpassing models of the same size (which yield 0% accuracy) and GPT-4 with five-shot CoT prompting (44%). We also demonstrate significant improvements on arithmetic expression and longest increasing subsequence (LIS) datasets. |
| Researcher Affiliation | Collaboration | Md Rifat Arefin1,2,3, Gopeshh Subbaraj1,2, Nicolas Gontier3, Yann LeCun5,6, Irina Rish1,2, Ravid Shwartz-Ziv6, Christopher Pal3,4 — 1Université de Montréal, 2Mila, 3ServiceNow, 4Polytechnique Montréal, 5Meta FAIR, 6New York University |
| Pseudocode | No | The paper describes the methodology using mathematical formulations (Equations 1-5) and textual descriptions, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/rarefin/seq_vcr |
| Open Datasets | Yes | We conduct experiments on three tasks: we first consider the multi-digit multiplication task from the BIG-bench benchmark (Srivastava et al., 2022)... Next, we focus on the Arithmetic Expressions (Feng et al., 2024) dataset... Longest Increasing Sub-sequence (LIS) as described in the Introduction to Algorithms book (Cormen et al., 2022)... utilizing the training data generated by Deng et al. (2023)... We trained GPT-2-small from scratch on the C4 dataset... We performed further experiments to fine-tune the GPT-2 Small model using an enhanced version of the GSM8K dataset... we finetuned a Code-GPT-2 Small model on the CodeXGLUE text-to-code benchmark. |
| Dataset Splits | No | The paper notes that 'Size refers to the training set' in Table 3 and refers to a 'validation set' in the table's footnote. However, it does not provide specific percentages or absolute counts for the training, validation, and test splits of the datasets used in the experiments. For example, it lists '808k' for the 4x4 Mult training set, but the validation and test set sizes are not specified, nor is the splitting methodology. |
| Hardware Specification | Yes | Total Compute Resources: 1× 32 GB GPU, 6 CPUs, 32 GB RAM |
| Software Dependencies | No | The paper mentions 'Optimizer AdamW' in Table 2 but does not provide version numbers for any software dependencies, such as programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | Fine-tuning is performed for 40 epochs with a learning rate of 5×10⁻⁴ and a batch size of 32. Training is conducted for 100 epochs with a learning rate of 1×10⁻⁴ and a batch size of 128. For multiplication tasks λ1 = 1.0 and λ2 = 0.004, and for other tasks λ1 = 0.1 and λ2 = 0.5. Other hyperparameters include: Learning Rate 0.0001, Batch Size 128, Optimizer AdamW, Dropout 0.1, Attn. Heads 4, Epochs 100. |