reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Estimating Agent Skill in Continuous Action Domains

Authors: Christopher Archibald, Delma Nieves-Rivera

JAIR 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results in several domains evaluate the estimation accuracy of the estimators, especially focusing on how robust they are as agents and their decision-making methods are varied. These results demonstrate that reasoning about both types of skill together significantly improves the robustness and accuracy of execution skill estimation. A case study is presented using the proposed methods to estimate the skill of Major League Baseball pitchers, demonstrating how these methods can be applied to real-world data sources.
Researcher Affiliation	Academia	Christopher Archibald EMAIL Brigham Young University, Provo, UT 84604 USA; Delma Nieves-Rivera EMAIL Mississippi State University, Starkville, MS 93444 USA. Both authors are affiliated with universities with .edu email domains.
Pseudocode	No	The paper provides derivations and mathematical equations for the methods (e.g., Equations 1, 2, 3, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) and describes the steps involved in the algorithms (OR, AXE, JEEDS, MEEDS) in paragraph form. However, there are no explicitly labeled pseudocode blocks or algorithm figures with numbered steps.
Open Source Code	No	The paper mentions 'We accessed this data using pybaseball (Le Doux, 2017), a Python package that provides access to baseball data collected from different public sources.' This refers to a third-party tool used, but there is no explicit statement or link indicating that the authors' own code for the methodologies described in the paper is open-source or available.
Open Datasets	Yes	A case study is presented using the proposed methods to estimate the skill of Major League Baseball pitchers, demonstrating how these methods can be applied to real-world data sources. We accessed this data using pybaseball (Le Doux, 2017), a Python package that provides access to baseball data collected from different public sources.
Dataset Splits	No	The paper describes experimental procedures involving processing a sequence of observations (e.g., 'Each experiment consisted of 1,000 observations.', 'the most recent 1000 pitches from the 2021 season were obtained from the data.'). The methods utilize an online approach, producing estimates after each new observation, rather than traditional machine learning training/test/validation splits.
Hardware Specification	No	The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. It describes computational domains but lacks explicit hardware information.
Software Dependencies	No	The paper mentions 'pybaseball (Le Doux, 2017), a Python package' for accessing baseball data. While 'Python' is a language, and 'pybaseball' is a package, no specific version numbers for Python, pybaseball, or any other software libraries, frameworks, or operating systems are provided, which is necessary for a reproducible description of software dependencies.
Experiment Setup	Yes	The action space for all agents was discretized with a resolution of 0.01 (1D-Darts) and 5.0 mm (2D-Darts, 201-Darts). For 201-Darts, the Qσ function for each hypothesis execution skill level was precomputed using value iteration... The range of execution skill noise levels was [0.5, 4.5] (1D-Darts) and [3.0, 150.5] mm (2D-Darts, 201-Darts). The agent rationality parameter ranges were: λf, λd ∈ [0.0, 1.0] (for all domains), λs ∈ [0.001, 100.0] (1D-Darts) and λs ∈ [0.001, 32.0] (2D-Darts, 201-Darts). All estimation methods used 17 (1D-Darts) and 33 (2D-Darts, 201-Darts) hypothesis skill levels for execution skill, whereas 33 were used for decision-making skill on all domains. All beliefs were initialized to be uniform over the space of possible skill parameters. For the AXE method, the set of focal actions in a state consisted of the optimal actions for the set of execution skill hypotheses. Experiments were conducted with each β ∈ {0.50, 0.75, 0.85, 0.90, 0.95, 0.99}.
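
The hypothesis grids and uniform initial beliefs described in the Experiment Setup row could be laid out roughly as in the sketch below. It is not the authors' code, which the entry above reports as unavailable. The ranges and hypothesis counts are taken from the row; the `SETUP` dictionary, the `linspace` and `uniform_joint_belief` helpers, the even spacing of hypotheses, and the use of a single joint grid over execution skill and λs are all assumptions made for illustration, since the quoted text does not say how the hypotheses were spaced or stored.

```python
# Ranges and counts quoted in the Experiment Setup row.
# The dictionary layout and key names are hypothetical.
SETUP = {
    "1D-Darts": {"sigma": (0.5, 4.5), "n_sigma": 17,
                 "lambda_s": (0.001, 100.0), "n_lambda": 33},
    "2D-Darts": {"sigma": (3.0, 150.5), "n_sigma": 33,
                 "lambda_s": (0.001, 32.0), "n_lambda": 33},
}


def linspace(lo, hi, n):
    """Return n evenly spaced values from lo to hi inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def uniform_joint_belief(domain):
    """Build skill hypothesis grids and a uniform belief over their product.

    Assumption: hypotheses are evenly spaced; the source does not state
    the spacing actually used.
    """
    cfg = SETUP[domain]
    sigmas = linspace(cfg["sigma"][0], cfg["sigma"][1], cfg["n_sigma"])
    lambdas = linspace(cfg["lambda_s"][0], cfg["lambda_s"][1], cfg["n_lambda"])
    # Every (execution skill, decision-making skill) pair gets equal mass.
    mass = 1.0 / (len(sigmas) * len(lambdas))
    belief = [[mass for _ in lambdas] for _ in sigmas]
    return sigmas, lambdas, belief


if __name__ == "__main__":
    sigmas, lambdas, belief = uniform_joint_belief("1D-Darts")
    print(len(sigmas), len(lambdas), sum(map(sum, belief)))
```

An online estimator would then reweight this belief after each observation and renormalize; that update step is omitted here because the entry does not reproduce the paper's equations.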