spark-crowd: A Spark Package for Learning from Crowdsourced Big Data

Authors: Enrique G. Rodrigo, Juan A. Aledo, José A. Gámez

JMLR 2019 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Here, however, we include a comparison for those methods considered in all three packages, i.e., Majority Voting and Dawid-Skene. We used the same environment for t1 (execution using only one core). For tc we used an Apache Spark cluster with 3 executor nodes of 10 cores and 30 GB of memory each. The same code was executed on both platforms. This is, actually, one of the advantages of the library presented here. In Table 2, we show the accuracy and execution time (in seconds) obtained for four data sets of increasing size by the three libraries (the documentation of the package contains the details for these data sets).
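The Majority Voting baseline compared above can be sketched in plain Scala over (example, annotator, value) triples. This is an illustrative sketch only: the `Annotation` case class and `majorityVoting` function are hypothetical names, not the spark-crowd API, which operates on Spark Datasets of `MulticlassAnnotation` rather than in-memory sequences.

```scala
// Hypothetical sketch of the Majority Voting baseline (not the spark-crowd API).
// Each annotation is a triple: which example, which annotator, which label.
case class Annotation(example: Long, annotator: Long, value: Int)

// For every example, pick the label chosen by the most annotators.
def majorityVoting(annotations: Seq[Annotation]): Map[Long, Int] =
  annotations
    .groupBy(_.example)
    .map { case (ex, anns) =>
      // Count votes per label value and keep the most frequent one
      val winner = anns.groupBy(_.value).maxBy(_._2.size)._1
      ex -> winner
    }
```

Ties between equally frequent labels are resolved arbitrarily here; a production implementation would need an explicit tie-breaking rule.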
Researcher Affiliation Academia Enrique G. Rodrigo EMAIL Juan A. Aledo EMAIL José A. Gámez EMAIL University of Castilla-La Mancha Escuela Superior de Ingeniería Informática de Albacete Avenida de España s/n Albacete (02071), Spain
Pseudocode Yes An example of the library usage can be found in Listing 1. Although this code seems simple, it can be executed both locally and in a distributed environment (using the Apache Spark platform) without any modifications. The user may refer to the documentation in order to find more involved examples.

import com.enriquegrodrigo.spark.crowd.methods.DawidSkene
import com.enriquegrodrigo.spark.crowd.types._
// Loading file (any Spark-compatible format)
val data = spark.read.parquet("datafile.parquet").as[MulticlassAnnotation]
// Applying algorithm (data with columns [example, annotator, value])
val mode = DawidSkene(data.as[MulticlassAnnotation])
// Get MulticlassLabel with the class predictions
val pred = mode.getMu().as[MulticlassLabel]
// Annotator precision matrices
val annprec = mode.getAnnotatorPrecision()

Listing 1: Example of spark-crowd usage
Open Source Code Yes 1. The package is available in https://github.com/enriquegrodrigo/spark-crowd.
Open Datasets No The paper refers to external documentation for dataset details ("the documentation of the package contains the details for these data sets") rather than providing concrete access information (link, DOI, citation with authors/year) within the paper itself. The names "binary1", "binary2", etc., are not well-known public datasets without further context.
Dataset Splits No No specific information about training/test/validation splits (percentages, counts, or explicit split methodologies) is provided in the paper. It refers to 'test data sets' but without specifying how the data was partitioned.
Hardware Specification Yes For tc we used an Apache Spark cluster with 3 executor nodes of 10 cores and 30Gb of memory each.
Software Dependencies No The paper mentions Apache Spark and its own package 'spark-crowd' (version 0.2.1) but does not provide specific version numbers for other ancillary software dependencies, libraries, or solvers used in the experiments.
Experiment Setup No The paper describes the general execution environment for the comparison (e.g., using one core or an Apache Spark cluster with specific node configurations), but it does not provide specific hyperparameters, optimizer settings, or other detailed training configurations for the learning algorithms (Majority Voting, Dawid Skene, etc.) beyond their names.