Time-Sensitive Bayesian Information Aggregation for Crowdsourcing Systems
Authors: Matteo Venanzi, John Guiver, Pushmeet Kohli, Nicholas R. Jennings
JAIR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using two real-world public datasets for entity linking tasks, we show that BCCTime produces up to 11% more accurate classifications and up to 100% more informative estimates of a task's duration compared to state-of-the-art methods. |
| Researcher Affiliation | Collaboration | Matteo Venanzi EMAIL Microsoft, 2 Waterhouse Square, London EC1N 2ST UK John Guiver EMAIL Microsoft Research, 21 Station Road, Cambridge CB1 2FB UK Pushmeet Kohli EMAIL Microsoft Research, One Microsoft Way, Redmond WA 98052-6399 US Nicholas R. Jennings EMAIL Imperial College, South Kensington, London SW7 2AZ UK |
| Pseudocode | No | The paper includes a factor graph (Figure 5) and describes the probabilistic inference process mathematically, but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using Infer.NET for its implementation: "Using Infer.NET, we are able to train BCCTime on our largest dataset of 12,190 judgments within seconds using approximately 80MB of RAM on a standard laptop." However, it does not provide any statement about releasing their own source code for BCCTime or a link to a repository. |
| Open Datasets | Yes | Using two real-world public datasets for entity linking tasks, we show that BCCTime produces up to 11% more accurate classifications and up to 100% more informative estimates of a task's duration compared to state-of-the-art methods. Zen Crowd India (ZC-IN): contains a set of links between the names of entities extracted from news articles and uniform resource identifiers (URIs) describing the entity in Freebase and DBpedia (Demartini et al., 2012). Zen Crowd USA (ZC-US): This dataset was also provided by Demartini et al. (2012) and contains judgements for the same set of tasks as ZC-IN... Weather Sentiment AMT (WS-AMT): The Weather Sentiment dataset was provided by CrowdFlower for the 2013 Crowdsourcing at Scale shared task challenge. It includes 300 tweets with 1,720 judgements from 461 workers and has been used in several experimental evaluations of crowdsourcing models (Simpson et al., 2015; Venanzi et al., 2014; Venanzi, Teacy, Rogers, & Jennings, 2015b). In detail, the workers were asked to classify the sentiment of tweets with respect to the weather into the following categories: negative (0), neutral (1), positive (2), tweet not related to weather (3) and can't tell (4). As a result, this dataset pertains to a multi-class classification problem. However, the original dataset used in the shared task challenge did not contain any time information about the collected judgments. Therefore, a new dataset (WS-AMT) was recollected for the same tasks as in the CrowdFlower shared task dataset using the AMT platform, acquiring exactly 20 judgements per task and recording the elapsed time for each judgment (Venanzi, Rogers, & Jennings, 2015a). |
| Dataset Splits | No | The paper evaluates performance over sub-samples of judgments (Figure 8) and reports accuracy metrics such as AUC and average recall, which imply evaluation on held-out labels. However, it does not explicitly specify how the datasets were split into training, validation, or test sets for reproducibility (e.g., percentages, sample counts, or predefined splits). |
| Hardware Specification | No | Using Infer.NET, we are able to train BCCTime on our largest dataset of 12,190 judgments within seconds using approximately 80MB of RAM on a standard laptop. This mentions a "standard laptop" and 80MB of RAM, but lacks specific details such as CPU or GPU models, or exact processor types to be considered a specific hardware specification. |
| Software Dependencies | Yes | In particular, we use the well-known EP algorithm (Minka, 2001) that has been shown to provide good quality approximations for BCC models (Venanzi et al., 2014). This method leverages a factorised distribution of the joint probability to approximate the marginal posterior distributions through an iterative message passing scheme implemented on the factor graph. Specifically, we use the EP implementation provided by Infer.NET (Minka, Winn, Guiver, & Knowles, 2014), which is a standard framework for running Bayesian inference in probabilistic models. Bibliography: Minka, T., Winn, J., Guiver, J., & Knowles, D. (2014). Infer.NET 2.6. Microsoft Research Cambridge. |
| Experiment Setup | Yes | Therefore, the workers' confusion matrices are initialised with a slightly higher value on the diagonal (0.6) and lower values on the rest of the matrix. Then, the Dirichlet priors for p and s are set uninformatively with uniform counts. The priors of the confusion matrices were initialised with a higher diagonal value (0.7), meaning that a priori the workers are assumed to be better than random. The Gaussian priors for the tasks' time durations are set with means σ0 = 10 and λ0 = 50 and precisions γ0 = δ0 = 10⁻¹, meaning that a priori each entity linking task is expected to be completed within 10 and 50 seconds. Furthermore, we initialise the Beta prior of ψk as a function of the number of tasks with α0 = 0.7N and β0 = 0.3N to represent the fact that a priori the worker is considered reliable if she makes valid labelling attempts for 70% of the tasks. Importantly, given the shape of the distribution of the workers' time completion data observed in the datasets (see Figure 2), we apply a logarithmic transformation to τ_i^(k) in order to obtain a more uniform distribution of workers' completion times in the training data. Finally, the priors of all the benchmarks were set equivalently to BCCTime. |
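The Experiment Setup row lists the paper's prior hyperparameters (diagonal confusion-matrix mass 0.7, Gaussian duration priors with precision 10⁻¹, a Beta(0.7N, 0.3N) reliability prior, and a log transform of completion times). The sketch below is a minimal, hypothetical illustration of assembling those numbers; the function name `make_priors` and all data structures are assumptions for exposition only, since the paper's actual model is implemented in Infer.NET, not Python.

```python
import numpy as np

def make_priors(num_tasks, num_classes=2, diag=0.7):
    """Illustrative prior setup using the hyperparameters quoted above.

    All names here are hypothetical; only the numeric values come
    from the paper's description of the BCCTime experiment setup.
    """
    # Dirichlet pseudo-counts for each row of a worker's confusion matrix:
    # extra mass on the diagonal encodes "better than random" a priori.
    off = (1.0 - diag) / (num_classes - 1)
    confusion_prior = np.full((num_classes, num_classes), off)
    np.fill_diagonal(confusion_prior, diag)

    # Gaussian priors over task time durations: means of 10 and 50 seconds,
    # each with precision 10**-1, as quoted in the setup.
    duration_prior = {"means": (10.0, 50.0), "precisions": (0.1, 0.1)}

    # Beta prior on worker reliability psi_k, scaled by the number of tasks N:
    # a priori a worker makes valid labelling attempts for 70% of tasks.
    reliability_prior = {"alpha": 0.7 * num_tasks, "beta": 0.3 * num_tasks}
    return confusion_prior, duration_prior, reliability_prior

# Completion times are log-transformed before inference to reduce the
# heavy right skew observed in the raw timing data.
times = np.array([3.0, 8.0, 21.0, 55.0])
log_times = np.log(times)
```

Each row of `confusion_prior` sums to 1, so it can be read directly as Dirichlet pseudo-counts favouring the correct label; scaling the Beta prior by N keeps its strength proportional to the amount of evidence a worker could in principle supply.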