A Survey on Lexical Simplification
Authors: Gustavo H. Paetzold, Lucia Specia
JAIR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this survey we review the literature for each step in this typical Lexical Simplification pipeline and provide a benchmarking of existing approaches for these steps on publicly available datasets. We also provide pointers for datasets and resources available for the task. For each of these sections, we provide a benchmark of existing approaches using publicly available datasets and standard metrics, as well as a critical analysis of the findings. For an overview on the performance of a complete LS pipeline, in section 7 we report a full pipeline evaluation that compares various simplifiers built from combining the approaches described in sections 3 through 6. |
| Researcher Affiliation | Academia | Gustavo H. Paetzold (EMAIL), Lucia Specia (EMAIL), The University of Sheffield, Western Bank, Sheffield, United Kingdom |
| Pseudocode | No | The paper describes various algorithms and methods in detail but does not present any of them in a structured pseudocode or algorithm block format. The procedures are explained in paragraph text. |
| Open Source Code | Yes | For those interested in using approaches described in this survey, all of the implementations devised for our benchmarkings can be found in the LEXenstein framework (http://ghpaetzold.github.io/LEXenstein). |
| Open Datasets | Yes | We present a more detailed and up to date survey on the many strategies used to address each step of the LS pipeline. First, in section 2 we introduce datasets and resources that have been used in the creation and evaluation of many of the lexical simplifiers featured in this survey. We hope that this section will shed light on the design decisions made by research in previous work, as well as help foster future work on LS. Datasets of manually annotated LS cases are very useful since they can be used for both training and evaluation. These datasets contain instances composed of a sentence, a target complex word, and a set of suitable substitutions provided and ranked by humans with respect to their simplicity. There are currently seven datasets of this kind: SemEval 2012 (Specia, Jauhar, & Mihalcea, 2012): 2,010 instances for English. Contains simplicity rankings produced by non-native English speakers for the datasets of the Lexical Substitution Task of SemEval 2007 (McCarthy & Navigli, 2007). (https://www.cs.york.ac.uk/semeval-2012/task1) |
| Dataset Splits | Yes | The training and test sets used are composed of 2,237 and 88,221 instances, respectively, where each instance contains a target word in a sentence. The rankers are evaluated over the datasets from the English Lexical Simplification task of SemEval 2012 (Specia et al., 2012). The training set is composed of 300 instances, and the test set, 1,710 instances. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments or benchmarks, such as GPU or CPU models, or cloud computing specifications. |
| Software Dependencies | No | The paper mentions several software components, tools, and frameworks like "LEXenstein", "Stanford Tagger", "GloVe", "word2vec", "NLTK's Porter stemmer", and "SVM rank". However, it does not provide specific version numbers for these components, which are necessary for full reproducibility. |
| Experiment Setup | Yes | To find t, an exhaustive search was performed on the training set over 10,000 equally distant values in the interval between the minimum and maximum value of each metric. We train word embeddings with 1,300-dimension vectors with the continuous bag-of-words (CBOW) method from word2vec. We select 10 candidates for each complex word in the dataset. The model used is the exact same linear model described by Paetzold and Specia (2016d). The weights are estimated through 5-fold cross-validation over the set of values {-2, -1, 0, 1, 2}. [...] training the model with SVM rank and 10-fold cross-validation [...] with three hidden layers with eight nodes each and a model trained for 500 epochs. |
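
The threshold search quoted under Experiment Setup can be sketched in a few lines. This is an illustrative reconstruction, not the LEXenstein implementation: the function name, the binary "complex if score >= t" decision rule, and the toy data are assumptions; only the idea of scanning 10,000 equally spaced values between a metric's minimum and maximum training value comes from the paper.

```python
def find_threshold(scores, labels, n_steps=10_000):
    """Exhaustive search for a complexity threshold t.

    Scans n_steps equally spaced candidate values between the minimum
    and maximum metric score observed in training, and returns the t
    that maximizes accuracy of the (assumed) rule "complex if score >= t".
    """
    lo, hi = min(scores), max(scores)
    step = (hi - lo) / (n_steps - 1)
    best_t, best_acc = lo, -1.0
    for i in range(n_steps):
        t = lo + i * step
        # Fraction of training items the rule classifies correctly at this t.
        acc = sum((s >= t) == y for s, y in zip(scores, labels)) / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```

With a metric that cleanly separates simple from complex words, the search recovers a threshold in the gap between the two groups; on real metrics (frequency, length, etc.) it simply picks the cut-off with the highest training accuracy.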