A Lens into Interpretable Transformer Mistakes via Semantic Dependency
Authors: Ruo-Jing Dong, Yu Yao, Bo Han, Tongliang Liu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on models including the BERT series, GPT, and LLaMA, we uncover the following key findings: (1) most tokens primarily retain their original semantic information even as they propagate through multiple layers; (2) models can encode truthful semantic dependencies in tokens in the final layer; (3) mistakes in model answers often stem from specific tokens encoded with incorrect semantic dependencies. Furthermore, we found that addressing the incorrectness by directly adjusting parameters is challenging, because the same parameters can encode both correct and incorrect semantic dependencies depending on the context. |
| Researcher Affiliation | Academia | (1) Sydney AI Centre, The University of Sydney; (2) TMLR Group, Hong Kong Baptist University. Correspondence to: Tongliang Liu <EMAIL>. |
| Pseudocode | Yes | A.5. Pseudocode for Section 5; Algorithm 1: Evaluation of Semantic Dependencies |
| Open Source Code | No | The paper does not provide any explicit statement about releasing open-source code for the methodology described, nor does it provide a link to a code repository. It mentions using existing tools like SpaCy and Stanza, but not its own implementation code. |
| Open Datasets | Yes | Experiments: We validate the model's self-information retention and sequence-level semantic aggregation using various sentences from six datasets, including GSM8K (Cobbe et al., 2021), Yelp (Zhang et al., 2015), GLUE (Wang et al., 2019), CNN/Daily Mail (Hermann et al., 2015), OpenOrca (Lian et al., 2023), and WikiText (Merity et al., 2016). For each model, over 100,000 token cases were evaluated per dataset (each token perturbation is treated as one case, 600,000 cases in total). Our analysis involves 10 Transformer-based models, including BERT (Devlin et al., 2018), RoBERTa (Liu, 2019), ALBERT (Lan, 2019), DistilBERT (Sanh, 2019), DeBERTa (He et al., 2020), MobileBERT (Sun et al., 2020), MiniLM (Wang et al., 2020), GPT (Radford et al., 2019), and LLaMA (Touvron et al., 2023). |
| Dataset Splits | Yes | Experiments: We validate the model's self-information retention and sequence-level semantic aggregation using various sentences from six datasets, including GSM8K (Cobbe et al., 2021), Yelp (Zhang et al., 2015), GLUE (Wang et al., 2019), CNN/Daily Mail (Hermann et al., 2015), OpenOrca (Lian et al., 2023), and WikiText (Merity et al., 2016). For each model, over 100,000 token cases were evaluated per dataset (each token perturbation is treated as one case, 600,000 cases in total). Our analysis involves processing over 100,000 QA validation cases across 10 Transformer-based models. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | To identify a semantically dependent token group G_{z'_i}, we leverage existing semantic dependency parsing methods to obtain the semantically dependent word group W_{w'_i} of the word w'_i, then convert it into a token group. We leverage existing semantic dependency parsing tools: spaCy (Honnibal et al., 2020)... We additionally conducted experiments using another widely adopted dependency parser, Stanza (Stanford NLP) (Qi et al., 2020)... The paper mentions software names but not specific version numbers for any software used in its experiments. |
| Experiment Setup | No | The paper mentions evaluating models over a certain number of cases (e.g., "over 100,000 token cases," "over 10,000 cases") and defining specific parameters for its perturbation method (e.g., "K = 5 in our experiments," "L = 5 when choosing the top 5 semantically dependent tokens"). It also discusses using an F1 score threshold for identifying incorrect answers. However, it does not specify typical hyperparameters for model training or fine-tuning, such as learning rate, batch size, number of epochs, or optimizer settings, which are crucial for reproducing the experimental setup of the models used. |
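The dependency-parsing step quoted in the Software Dependencies row (obtaining the semantically dependent word group of a word, then converting it into a token group) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the head-index parse and the word-to-subword alignment are hand-crafted stand-ins for what spaCy or Stanza would produce.

```python
# Minimal sketch: build a semantically dependent token group for a word.
# The parse is given as head indices (the root points to itself), and
# word2tokens maps each word index to its subword-token indices; both
# structures here are illustrative assumptions.

def dependent_word_group(heads, i):
    """Words directly linked to word i: itself, its head, and its children."""
    group = {i}
    if heads[i] != i:  # add the syntactic head unless i is the root
        group.add(heads[i])
    group.update(j for j, h in enumerate(heads) if h == i and j != i)
    return sorted(group)

def to_token_group(word_group, word2tokens):
    """Convert a word-index group into the corresponding subword-token indices."""
    return sorted(t for w in word_group for t in word2tokens[w])

# "the cat sat on the mat": head indices chosen by hand for illustration,
# with "sat" (index 2) as the root; "mat" splits into two subword tokens.
heads = [1, 2, 2, 2, 5, 3]
word2tokens = {0: [0], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5, 6]}

group = dependent_word_group(heads, 2)       # words tied to "sat"
tokens = to_token_group(group, word2tokens)  # their subword-token indices
```

With this toy parse, the group for "sat" contains its children "cat" and "on" plus "sat" itself; a real pipeline would take `heads` from the parser's output instead.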
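The Experiment Setup row notes that the paper flags incorrect answers via an F1-score threshold without stating the exact computation. A common choice for QA evaluation is SQuAD-style token-level F1 between the predicted and reference answer strings; the sketch below assumes that metric, and the threshold value 0.5 is an illustrative assumption, not the paper's setting.

```python
from collections import Counter

def token_f1(prediction, reference):
    """SQuAD-style token-level F1 between two answer strings (whitespace tokens)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = Counter(pred) & Counter(ref)  # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def is_incorrect(prediction, reference, threshold=0.5):
    """Flag an answer as incorrect when token-level F1 falls below the threshold.

    The 0.5 default is an assumed value for illustration only.
    """
    return token_f1(prediction, reference) < threshold
```

For example, a one-word prediction that appears in a four-word reference scores F1 = 0.4 and would be flagged under this assumed threshold.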