Exploring Model Editing for LLM-based Aspect-Based Sentiment Classification

Authors: Shichen Li, Zhongqing Wang, Zheyu Zhao, Yue Zhang, Peifeng Li

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our in-domain and out-of-domain experiments demonstrate that this approach achieves competitive results compared to the currently strongest methods with significantly fewer trainable parameters, highlighting a more efficient and interpretable fine-tuning strategy."
Researcher Affiliation | Academia | Shichen Li1, Zhongqing Wang1*, Zheyu Zhao1, Yue Zhang2, Peifeng Li1. 1Natural Language Processing Lab, Soochow University, Suzhou, China; 2Westlake University.
Pseudocode | No | The paper describes the methodology using textual explanations and figures, but no explicit "Pseudocode" or "Algorithm" blocks are provided.
Open Source Code | No | The paper cites "https://huggingface.co/meta-llama/Llama-2-7b-hf" in footnote 1, but this refers to the base LLM used, not the authors' own implementation of the proposed method. There is no statement about releasing their source code, and no link to a code repository is provided.
Open Datasets | Yes | "The labeled dataset used in our experiments includes reviews from four different domains: Restaurant (R), Laptop (L), Device (D), and Service (S). Restaurant (R) is a combination of the restaurant reviews from SemEval 2014/2015/2016 (Pontiki et al. 2014, 2015, 2016). Laptop (L) is sourced from SemEval 2014 (Pontiki et al. 2014). Device (D) consists of all the digital device reviews collected by Toprak, Jakob, and Gurevych (2010). Service (S) contains reviews from web services introduced by Hu and Liu (2004)."
Dataset Splits | Yes | Table 1 (distribution of reviews across different domains): Device: 1,394 train / 691 test; Laptop: 2,297 train / 631 test; Restaurant: 4,284 train / 2,252 test; Service: 1,840 train / 886 test.
Hardware Specification | Yes | "All comparison experiments are conducted on a single NVIDIA 3090 GPU and we take accuracy as the evaluation metric."
Software Dependencies | No | The paper mentions "AdamW (Loshchilov and Hutter 2018) is used as the optimizer" and "Llama-2-7b (Touvron et al. 2023) as our primary base large language model," but it does not specify any software libraries or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | "AdamW (Loshchilov and Hutter 2018) is used as the optimizer, with a learning rate of 3×10⁻⁴ for the low-rank weight projection part and 1×10⁻⁵ for the representation editing part. For the comparison methods, we adopt standard experimental settings and commonly used parameters. Specifically, LoRA and DoRA utilize a learning rate of 1×10⁻⁴ with a rank of 32. Additionally, we include LoReFT with a learning rate of 2×10⁻⁵ and a rank of 8. All comparison experiments are conducted on a single NVIDIA 3090 GPU and we take accuracy as the evaluation metric. The experimental results are obtained by averaging three runs with random initialization. The PEFT methods are trained for one epoch, while the full-parameter methods are trained for three epochs."
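The hyperparameters reported in the experiment setup can be collected into a single reference structure. This is a minimal illustrative sketch, not the authors' code (which is not released); the method keys and field names are my own, and only the numeric values come from the paper.

```python
# Hyperparameters as reported in the paper's experiment setup.
# Keys and structure are illustrative; the authors' implementation is not public.
CONFIGS = {
    "proposed": {
        "optimizer": "AdamW",
        "lr_low_rank_projection": 3e-4,     # low-rank weight projection part
        "lr_representation_editing": 1e-5,  # representation editing part
        "epochs": 1,                        # PEFT methods: one epoch
    },
    "LoRA":   {"optimizer": "AdamW", "lr": 1e-4, "rank": 32, "epochs": 1},
    "DoRA":   {"optimizer": "AdamW", "lr": 1e-4, "rank": 32, "epochs": 1},
    "LoReFT": {"optimizer": "AdamW", "lr": 2e-5, "rank": 8,  "epochs": 1},
    "full_finetune": {"optimizer": "AdamW", "epochs": 3},  # full-parameter: three epochs
}

def lr_of(method: str) -> float:
    """Return the single learning rate of a comparison method (hypothetical helper)."""
    return CONFIGS[method]["lr"]
```

Note that the proposed method uses two learning rates (one per component), so it does not fit the single-`lr` shape of the comparison methods; results in the paper are averages over three runs with random initialization.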