Explicit Document Modeling through Weighted Multiple-Instance Learning
Authors: Nikolaos Pappas, Andrei Popescu-Belis
JAIR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our model achieves state-of-the-art performance on multi-aspect sentiment analysis, improving over several baselines. Moreover, the predicted saliency weights are close to human estimates obtained by crowdsourcing, and increase the performance of lexical and topical features for review segmentation and summarization. ... Figure 3 displays the performance of the proposed model for aspect rating prediction. ... Table 2 displays the mean squared error (MSE) on a test set with 1,000 reviews from Beer Advocate, using 260k reviews for training. ... Table 3 shows the performance of the models with various MIR assumptions for aspect rating prediction (columns 1 to 5). ... Figure 5: Accuracy on review segmentation (top) and on summarization (bottom) of the CRF models with BOW+MIR features, compared to several baselines. |
| Researcher Affiliation | Academia | Nikolaos Pappas EMAIL Andrei Popescu-Belis EMAIL Idiap Research Institute, Rue Marconi 19 CH-1920 Martigny, Switzerland |
| Pseudocode | Yes | Algorithm 1: SGDWeights: jointly learning the parameters of the objective in Eq. 6. |
| Open Source Code | Yes | Our code is available at https://github.com/idiap/wmil-sgd. |
| Open Datasets | Yes | We use eight public datasets (Table 1). ... The Beer Advocate, Ratebeer (ES), Ratebeer (FR), Audiobooks and Toys & Games datasets include aspect ratings assigned by the authors of the reviews, with 3 to 5 aspect dimensions. ... on the TED talks that we gathered and released earlier (Pappas & Popescu-Belis, 2013),2 we aim to predict the 12-dimensional talk-level emotion ratings assigned by viewers through voting... 2. Available at https://www.idiap.ch/dataset/ted/. ... we designed a new dataset called HATDOC (Pappas & Popescu-Belis, 2016).4 ... 4. We make this dataset available at https://www.idiap.ch/paper/hatdoc/. |
| Dataset Splits | Yes | We use the same protocol as McAuley et al., i.e. a uniform split of the data into 50% for training and 50% for testing. ... All the models are optimized (when applicable) on a development set, i.e. a 25% subset of the training data... We experiment with 5-fold cross-validation on equal-size samples of 1,200 instances per dataset. ... For segmentation and summarization, we report the average scores of each method over five runs (Section 9.2). We compare our model with the methods used by McAuley et al. (2012). As Lei et al. (2016) used a modified version of McAuley's segmentation task to evaluate their word-based selection method, this is not directly comparable with McAuley's or our method. ... we evaluate them in Section 9.3 over five random splits, 80% for training and 20% for testing. |
| Hardware Specification | No | No specific hardware details such as GPU/CPU models, processor types, or memory amounts are mentioned. The paper only discusses computational complexity and potential for parallelization without specifying the actual hardware used for experiments. |
| Software Dependencies | No | For the regression models and evaluation, we use the scikit-learn library (Pedregosa et al., 2012). ... we computed sentence features based on 300-dimensional word embeddings trained on Wikipedia with word2vec (Mikolov et al., 2013). |
| Experiment Setup | Yes | The hyper-parameters to optimize for the various MIR assumptions are the regularization terms λ2 and λ1 of their regression model f. ... The hyper-parameters to optimize for APWeights are the three regularization terms ϵ1, ϵ2, ϵ3 of the ℓ2-norm for the f1, f2 and f3 regression models. ... for the Clustering MIR assumption (Wagstaff et al., 2008), we use the f2 regression model, which relies on ϵ2 and the number of clusters k, optimized over {5, ..., 50} with step 5, for its clustering algorithm, which is here the k-Means one. All the regularization terms are optimized over the same range of possible values, noted a·10^b with a ∈ {1, ..., 9} and b ∈ {−4, ..., +4}, hence 81 values per term. The hyper-parameters for SGDWeights are the same ones as for APWeights, plus the learning rate or step size ϵ, the minibatch size m, and the gradient step strategy (learning rate decay, ADAGRAD, or ADAM). ... minibatch size (set to 50 here) ... Based on tests over a development subset, our model is trained with SGDWeights and ADAGRAD (see Section 4.2 above), with a step size of 0.001. |
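The split protocol reported above (a uniform 50/50 train/test split, with 25% of the training portion held out as a development set) can be sketched as follows. This is a minimal illustration with a hypothetical `split_reviews` helper, not the authors' actual code:

```python
import random

def split_reviews(reviews, seed=0):
    """Sketch of the reported protocol: shuffle, split 50/50 into
    train/test, then hold out 25% of the training half as a dev set."""
    rng = random.Random(seed)
    shuffled = reviews[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    train, test = shuffled[:half], shuffled[half:]
    dev_size = len(train) // 4          # 25% of training data
    dev, train = train[:dev_size], train[dev_size:]
    return train, dev, test

train, dev, test = split_reviews(list(range(1000)))
print(len(train), len(dev), len(test))  # 375 125 500
```

For the 5-fold cross-validation and the five random 80/20 splits also mentioned, the same shuffling logic would simply be repeated with different fold boundaries or seeds.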
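The regularization search range described in the setup row, values of the form a·10^b with a ∈ {1, ..., 9} and b ∈ {−4, ..., +4}, enumerates to exactly 81 candidates per term. A one-line sketch of that grid (an illustration of the stated range, not code from the paper):

```python
# Enumerate the regularization grid: a * 10**b for a in 1..9, b in -4..4,
# giving 9 * 9 = 81 candidate values per regularization term.
grid = sorted(a * 10.0 ** b for b in range(-4, 5) for a in range(1, 10))
print(len(grid), grid[0], grid[-1])  # 81 0.0001 90000.0
```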