Semantic similarity prediction is better than other semantic similarity measures
Authors: Steffen Herbold
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Within this paper, we present the results of a confirmatory study that demonstrates that downloading a finetuned RoBERTa model (Liu et al., 2019) for the STS-B task (Cer et al., 2017) from the GLUE benchmark from Huggingface and using this model to predict the similarity of sentences fulfills our expectations on a robust similarity measure better than the other models we consider. We refer to this approach as STSScorer. To demonstrate this empirically, we compute the similarity score for similarity-related GLUE tasks and show that while the predictions with the STSScorer are not perfect, the distribution of the predicted scores is closer to what we would expect given the task description than for the other measures. |
| Researcher Affiliation | Academia | Steffen Herbold, Faculty of Computer Science and Mathematics, University of Passau, Passau, Germany |
| Pseudocode | Yes | Listing 1: A simple class to define a fully functional semantic similarity scorer based on a pre-trained model for the STS-B task. `import transformers; class STSScorer: def __init__(self): model_name = "WillHeld/roberta-base-stsb"; self._sts_tokenizer = transformers.AutoTokenizer.from_pretrained(model_name); self._sts_model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name); self._sts_model.eval(); def score(self, sentence1, sentence2): sts_tokenizer_output = self._sts_tokenizer(sentence1, sentence2, padding=True, truncation=True, return_tensors="pt"); sts_model_output = self._sts_model(**sts_tokenizer_output); return sts_model_output["logits"].item() / 5` (the logits contain regression values; the division by five rescales STS-B's 0 to 5 label range to [0, 1]) |
| Open Source Code | Yes | All implementations we created for this work are publicly available online: https://github.com/aieng-lab/stsscore |
| Open Datasets | Yes | There are four such data sets within the GLUE benchmark: the Semantic Textual Similarity Benchmark (STS-B) data we already discussed above; the Microsoft Research Paraphrase Corpus (MRPC, Dolan & Brockett (2005)) data, where the task is to determine if two sentences are paraphrases; the Quora Question Pairs (QQP, Iyer et al. (2017)) data, where the task is to determine if two questions are duplicates; and the Chinese to English translations from the WMT22 metrics challenge (WMT22-ZH-EN, Freitag et al. (2022)), where the translation quality is labeled using the MQM schema (Burchardt, 2013). |
| Dataset Splits | Yes | For STS-B, MRPC, and WMT22, we use the test data. Since the labels for QQP's test data are not shared, we use the training data instead. |
| Hardware Specification | No | The paper does not provide specific hardware details such as CPU, GPU, or TPU models used for running the experiments. It only mentions the software libraries and models utilized. |
| Software Dependencies | No | The paper mentions several software libraries and packages such as 'Huggingface transformer library', 'Huggingface evaluation library', 'python package provided by Reimers & Gurevych (2019)' for S-BERT, 'python package provided by Zhang et al. (2020)' for BERTScore, 'Seaborn', and 'Pandas'. However, specific version numbers for these components are not provided. |
| Experiment Setup | Yes | The RoBERTa model we used for STSScore (Held, 2022) was trained with a learning rate of 2 × 10⁻⁵, a linear scheduler with a warmup ratio of 0.06 for 10 epochs using an Adam optimizer with β1 = 0.9, β2 = 0.999, and ϵ = 10⁻⁸ using MSE loss. |
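The only post-processing the STSScorer in Listing 1 applies is dividing the model's regression logit by five, which maps STS-B's 0 to 5 label range onto [0, 1]. A minimal sketch of that rescaling step in isolation (the input value is hypothetical, not a real model output):

```python
def rescale_sts_logit(logit: float) -> float:
    """Map an STS-B regression output (labels range from 0 to 5) to a [0, 1] score."""
    return logit / 5.0

# A hypothetical regression output of 4.2 corresponds to a similarity of 0.84.
print(rescale_sts_logit(4.2))
```

Note that the scorer itself does not clamp the result, so an out-of-range regression output can yield a score slightly below 0 or above 1.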
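The reported fine-tuning hyperparameters map directly onto Hugging Face `TrainingArguments`. The configuration below is a reconstruction for reference, not the authors' training script; the output directory and batch size are assumptions, since the report does not state them, and MSE loss needs no explicit flag because the `Trainer` applies it automatically for single-label regression:

```python
from transformers import TrainingArguments

# Reconstruction of the reported STS-B fine-tuning setup (Held, 2022).
training_args = TrainingArguments(
    output_dir="roberta-base-stsb",  # assumed, not stated in the report
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.06,
    num_train_epochs=10,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```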