RAZOR: Sharpening Knowledge by Cutting Bias with Unsupervised Text Rewriting

Authors: Shuo Yang, Bardh Prenkaj, Gjergji Kasneci

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Compared with unsupervised SoTA models, RAZOR improves by 3.5% on the FEVER and 6.5% on MNLI and SNLI datasets according to the F1 score. Additionally, RAZOR effectively mitigates specific known biases, reducing bias-related terms by 2 without requiring prior bias information, a result that is on par with SoTA models that leverage prior information. ... We train the classifiers using the LoRA algorithm (Hu et al. 2022) and the AdamW optimizer with a batch size of 16 and a learning rate of 3e-5. ... Table 1 shows this effect on the two classifiers mentioned above for RAZOR and SoTA methods. ... 4.2 Ablation Studies
Researcher Affiliation | Academia | Shuo Yang*, Bardh Prenkaj*, Gjergji Kasneci, Technical University of Munich
Pseudocode | No | The paper does not contain a specific section labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps in a code-like format. It includes mathematical definitions and equations, but not pseudocode.
Open Source Code | Yes | Code: https://github.com/ShuoYangtum/RAZOR
Open Datasets | Yes | FEVER dataset (Thorne et al. 2018)... Multi-Genre Natural Language Inference (MNLI) (Williams, Nangia, and Bowman 2018) corpus and the Stanford Natural Language Inference (SNLI) (Bowman et al. 2015) corpus
Dataset Splits | No | Our preprocessing steps involve truncating sequences to 512 tokens. We train BERT, RoBERTa, and DistilBERT classifiers on the original FEVER dataset to represent our baseline (hereafter Original). Then, for each SoTA method, we train the two classifiers on the modified version of FEVER according to the proposed algorithms. Notice that RAZOR and other SoTA methods only rewrite the training set of the dataset. The test set remains unchanged. ... we exchanged MNLI and SNLI test sets to simulate a real-world data distribution. The paper describes which datasets are used for training and testing across experiments (e.g., training on MNLI and testing on SNLI), and mentions that for FEVER, the training set is modified while the test set is unchanged. However, it does not specify explicit percentages or sample counts for the splits (e.g., 80/10/10) within these datasets, nor does it refer to a specific predefined split with a citation that would contain such details.
Hardware Specification | No | The paper mentions using LLMs like GPT-3.5-Turbo and LLaMA-3.1-8B-Instruct, and training classifiers (BERT, RoBERTa, DistilBERT) with specific algorithms and optimizers. However, it does not specify any hardware details such as GPU models, CPU models, or memory configurations used for running the experiments.
Software Dependencies | No | We rely on three classification models to measure RAZOR's effectiveness, namely BERT, RoBERTa, and DistilBERT taken from Hugging Face followed by a linear classification layer. We train the classifiers using the LoRA algorithm (Hu et al. 2022) and the AdamW optimizer with a batch size of 16 and a learning rate of 3e-5. In the sentence rewriting, we set Gα = GPT-3.5-Turbo and Gβ = GPT-3.5-Turbo. ... To provide details on the impact of the selected LLM to rewrite the sentences and verify their labels, we change both Gα and Gβ to LLaMA-3.1-8B-Instruct (see Tables 1 and 2). The paper lists several software components and models (BERT, RoBERTa, DistilBERT, LoRA algorithm, AdamW, GPT-3.5-Turbo, LLaMA-3.1-8B-Instruct). While GPT-3.5-Turbo and LLaMA-3.1-8B-Instruct are specific model versions, the general software dependencies (like Python, PyTorch/TensorFlow, Hugging Face libraries) are not listed with their specific version numbers.
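The LoRA fine-tuning cited above (Hu et al. 2022) trains a low-rank additive update on top of frozen pretrained weights: W' = W + (α/r)·B·A, where only the small matrices A (r × d_in) and B (d_out × r) are updated. A minimal pure-Python sketch of that weight adaptation; the matrices, rank, and α here are toy illustrative values, not the paper's configuration:

```python
# LoRA-style low-rank weight update: W' = W + (alpha / r) * (B @ A).
# W is the frozen base weight; only A and B would be trained.
def matmul(X, Y):
    """Naive matrix product for small nested-list matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_update(W, A, B, alpha, r):
    """Return the adapted weight W + (alpha / r) * (B @ A)."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Toy example: d_out = d_in = 2, rank r = 1, alpha = 2.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (identity)
A = [[1.0, 2.0]]               # trainable, shape r x d_in
B = [[3.0], [4.0]]             # trainable, shape d_out x r
print(lora_update(W, A, B, alpha=2, r=1))
# -> [[7.0, 12.0], [8.0, 17.0]]
```

The point of the decomposition is that A and B together hold r·(d_in + d_out) parameters instead of d_in·d_out, which is why the paper can fine-tune BERT-scale classifiers cheaply.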
Experiment Setup | Yes | Our preprocessing steps involve truncating sequences to 512 tokens. We rely on three classification models to measure RAZOR's effectiveness, namely BERT, RoBERTa, and DistilBERT taken from Hugging Face followed by a linear classification layer. We train the classifiers using the LoRA algorithm (Hu et al. 2022) and the AdamW optimizer with a batch size of 16 and a learning rate of 3e-5. In the sentence rewriting, we set Gα = GPT-3.5-Turbo and Gβ = GPT-3.5-Turbo. Here, we control the diversity of generated sequences by setting the top-p value to 0.9 and the temperature to 0.7.
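The generation settings quoted above (top-p = 0.9, temperature = 0.7) correspond to temperature-scaled nucleus sampling: logits are divided by the temperature, softmaxed, and sampling is restricted to the smallest set of tokens whose cumulative probability reaches top-p. A self-contained sketch of how the two parameters interact, using a toy next-token score vector (the logits are illustrative, not taken from the paper):

```python
import math
import random

def top_p_filter(logits, temperature=0.7, top_p=0.9):
    """Temperature-scaled softmax, then keep the smallest set of tokens
    whose cumulative probability reaches top_p (the 'nucleus')."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Accumulate tokens in descending probability until top_p is covered.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalise over the surviving nucleus.
    z = sum(probs[i] for i in kept)
    return {i: probs[i] / z for i in kept}

def sample(logits, rng, temperature=0.7, top_p=0.9):
    """Draw one token id from the truncated, renormalised distribution."""
    nucleus = top_p_filter(logits, temperature, top_p)
    r, cum = rng.random(), 0.0
    for token, p in nucleus.items():
        cum += p
        if r <= cum:
            return token
    return token  # numerical safety net

logits = [2.0, 1.0, 0.1, -1.0]  # toy next-token scores for a 4-token vocabulary
print(sorted(top_p_filter(logits)))  # -> [0, 1]: only two tokens survive the cutoff
```

Lower temperatures sharpen the distribution before the cutoff is applied, so at temperature 0.7 the nucleus is typically smaller than at 1.0; together the two settings trade diversity against fidelity in the rewritten sentences.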