reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Comprehensive Algorithm Portfolio Evaluation using Item Response Theory

Authors: Sevvandi Kandanaarachchi, Kate Smith-Miles

JMLR 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We test this framework on algorithm portfolios for a wide range of applications, demonstrating the broad applicability of this method as an insightful algorithm evaluation tool. Furthermore, the explainable nature of IRT parameters yields an increased understanding of algorithm portfolios.
Researcher Affiliation	Academia	Sevvandi Kandanaarachchi EMAIL CSIRO's Data61 Research Way, Clayton VIC 3168, Australia Kate Smith-Miles EMAIL School of Mathematics and Statistics University of Melbourne Parkville, VIC 3010, Australia
Pseudocode	Yes	Algorithm 1: AIRT framework. Input: the matrix Y_{N×n}, containing accuracy measures of n algorithms for N datasets/problem instances. Output: 1. AIRT indicators of algorithms and dataset/problem difficulty 2. The strengths and weaknesses of algorithms 3. airt algorithm portfolio 4. Model goodness measures
Open Source Code	Yes	As a further contribution, we make this work available in the R package airt (Kandanaarachchi, 2020).
Open Datasets	Yes	In Section 5 we illustrate the complete functionality of AIRT including the algorithm metrics, problem space analysis, strengths and weaknesses of algorithms, algorithm portfolio evaluation and model goodness results using the detailed case study of OpenML-Weka classification algorithms and test instances available at ASlib repository (Bischl et al., 2016). We refer the reader to Appendix A where further results are summarized on nine more case studies using a variety of ASlib scenarios
Dataset Splits	Yes	For each algorithm scenario we use 10-fold cross validation and report the average cross validated performance gap for Shapley, topset and airt portfolios.
Hardware Specification	No	The paper does not provide specific hardware details used for running its experiments. It focuses on the methodology and datasets rather than computational environment specifications.
Software Dependencies	Yes	As a further contribution, we make this work available in the R package airt (Kandanaarachchi, 2020). The R package airt fits the continuous IRT models described in Section 2.2 using the updated log-likelihood function and assumption. To fit polytomous models airt uses the functionality of the existing R package mirt (Chalmers, 2012).
Experiment Setup	Yes	For all algorithms in the ASlib repository certain hyperparameters and parameters were used which we do not vary. Any conclusions we draw about algorithm performance are therefore dependent on the actual algorithm implementation they use.
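
The Dataset Splits entry above mentions a 10-fold cross-validated performance gap for algorithm portfolios. The sketch below is one plausible reading of that procedure, not the paper's implementation. It assumes a performance matrix with one row per problem instance and one column per algorithm, where higher values are better. It also assumes the gap is the per-instance difference between the best algorithm overall and the best algorithm in the portfolio. The selection rule `top_k_portfolio` is a hypothetical stand-in for the Shapley, topset and airt portfolios, and all function names are illustrative.

```python
import random
from statistics import mean


def top_k_portfolio(train_rows, k):
    """Hypothetical selection rule: indices of the k algorithms with the
    highest mean performance on the training instances."""
    n_algos = len(train_rows[0])
    mean_perf = [mean(row[a] for row in train_rows) for a in range(n_algos)]
    ranked = sorted(range(n_algos), key=lambda a: mean_perf[a], reverse=True)
    return ranked[:k]


def performance_gap(test_rows, portfolio):
    """Assumed definition: mean over instances of (best performance of any
    algorithm) minus (best performance of any algorithm in the portfolio)."""
    return mean(max(row) - max(row[a] for a in portfolio) for row in test_rows)


def cross_validated_gap(perf, k, n_folds=10, seed=0):
    """Average held-out performance gap over n_folds folds.

    Requires at least n_folds instances so that no fold is empty.
    """
    order = list(range(len(perf)))
    random.Random(seed).shuffle(order)
    # Deal the shuffled instance indices into n_folds roughly equal folds.
    folds = [order[i::n_folds] for i in range(n_folds)]
    gaps = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_rows = [perf[i] for i in order if i not in held_out]
        test_rows = [perf[i] for i in test_idx]
        # Choose the portfolio on training instances only, then score it
        # on the held-out instances.
        portfolio = top_k_portfolio(train_rows, k)
        gaps.append(performance_gap(test_rows, portfolio))
    return mean(gaps)
```

Under these assumptions the gap is never negative, and it is zero whenever the portfolio contains every algorithm, which gives a quick sanity check for any alternative selection rule.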