Semantic similarity prediction is better than other semantic similarity measures
Authors: Steffen Herbold
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Within this paper, we present the results of a confirmatory study that demonstrates that downloading a finetuned RoBERTa model (Liu et al., 2019) for the STS-B task (Cer et al., 2017) from the GLUE benchmark from Huggingface and using this model to predict the similarity of sentences fulfills our expectations on a robust similarity measure better than the other models we consider. We refer to this approach as STSScorer. To demonstrate this empirically, we compute the similarity score for similarity-related GLUE tasks and show that while the predictions with the STSScorer are not perfect, the distribution of the predicted scores is closer to what we would expect given the task description than for the other measures. |
| Researcher Affiliation | Academia | Steffen Herbold, Faculty of Computer Science and Mathematics, University of Passau, Passau, Germany |
| Pseudocode | Yes | Listing 1: A simple class to define a fully functional semantic similarity scorer based on a pre-trained model for the STS-B task. `import transformers; class STSScorer: def __init__(self): model_name = "WillHeld/roberta-base-stsb"; self._sts_tokenizer = transformers.AutoTokenizer.from_pretrained(model_name); self._sts_model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name); self._sts_model.eval(); def score(self, sentence1, sentence2): sts_tokenizer_output = self._sts_tokenizer(sentence1, sentence2, padding=True, truncation=True, return_tensors="pt"); sts_model_output = self._sts_model(**sts_tokenizer_output); return sts_model_output["logits"].item() / 5` (the logits contain regression values; the division by five rescales STS-B's 0 to 5 label range to [0, 1]) |
| Open Source Code | Yes | All implementations we created for this work are publicly available online: https://github.com/aieng-lab/stsscore |
| Open Datasets | Yes | There are four such data sets within the GLUE benchmark: the Semantic Textual Similarity Benchmark (STS-B) data we already discussed above; the Microsoft Research Paraphrase Corpus (MRPC, Dolan & Brockett (2005)) data, where the task is to determine if two sentences are paraphrases; the Quora Question Pairs (QQP, Iyer et al. (2017)) data, where the task is to determine if two questions are duplicates; and the Chinese to English translations from the WMT22 metrics challenge (WMT22-ZH-EN, Freitag et al. (2022)), where the translation quality is labeled using the MQM schema (Burchardt, 2013). |
| Dataset Splits | Yes | For STS-B, MRPC, and WMT22, we use the test data. Since the labels for QQP's test data are not shared, we use the training data instead. |
| Hardware Specification | No | The paper does not provide specific hardware details such as CPU, GPU, or TPU models used for running the experiments. It only mentions the software libraries and models utilized. |
| Software Dependencies | No | The paper mentions several software libraries and packages such as 'Huggingface transformer library', 'Huggingface evaluation library', 'python package provided by Reimers & Gurevych (2019)' for S-BERT, 'python package provided by Zhang et al. (2020)' for BERTScore, 'Seaborn', and 'Pandas'. However, specific version numbers for these components are not provided. |
| Experiment Setup | Yes | The RoBERTa model we used for STSScore (Held, 2022) was trained with a learning rate of 2 × 10⁻⁵, a linear scheduler with a warmup ratio of 0.06 for 10 epochs using an Adam optimizer with β1 = 0.9, β2 = 0.999, and ϵ = 10⁻⁸ using MSE loss. |
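The only post-processing the STSScorer in Listing 1 applies is dividing the model's regression logit by five, which maps STS-B's 0 to 5 label range onto [0, 1]. A minimal sketch of that rescaling step in isolation (the input value is hypothetical, not a real model output):

```python
def rescale_sts_logit(logit: float) -> float:
    """Map an STS-B regression output (labels range from 0 to 5) to a [0, 1] score."""
    return logit / 5.0

# A hypothetical regression output of 4.2 corresponds to a similarity of 0.84.
print(rescale_sts_logit(4.2))
```

Note that the scorer itself does not clamp the result, so an out-of-range regression output can yield a score slightly below 0 or above 1.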
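The reported fine-tuning hyperparameters map directly onto Hugging Face `TrainingArguments`. The configuration below is a reconstruction for reference, not the authors' training script; the output directory and batch size are assumptions, since the report does not state them, and MSE loss needs no explicit flag because the `Trainer` applies it automatically for single-label regression:

```python
from transformers import TrainingArguments

# Reconstruction of the reported STS-B fine-tuning setup (Held, 2022).
training_args = TrainingArguments(
    output_dir="roberta-base-stsb",  # assumed, not stated in the report
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.06,
    num_train_epochs=10,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```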