Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Debiasing Evaluations That Are Biased by Evaluations

Authors: Jingyan Wang, Ivan Stelmakh, Yuting Wei, Nihar Shah

JMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide theoretical guarantees on the performance of our algorithm, as well as experimental evaluations.
Researcher Affiliation | Academia | Jingyan Wang (EMAIL), Toyota Technological Institute at Chicago, Chicago, IL 60637, USA; Ivan Stelmakh (EMAIL), New Economic School, Moscow, Russia; Yuting Wei (EMAIL), Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA 19104, USA; Nihar Shah (EMAIL), School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Pseudocode | Yes | Algorithm 1: Cross-validation. Inputs: observations Y, partial ordering O, and set Λ.
Open Source Code | Yes | The code to reproduce our results is available at https://github.com/jingyanw/outcome-induced-debiasing.
Open Datasets | Yes | We use the grading data from Indiana University Bloomington (Indiana University Bloomington, 2020), where the possible grades that students receive are A+ through D-, and F. ... https://gradedistribution.registrar.indiana.edu/index.php [Online; accessed 30-Sep-2020]. We now move to a real-world dataset (Kerzendorf et al., 2020) collected for proposal peer review at the European Southern Observatory (ESO).
Dataset Splits | Yes | In the data-splitting step, our algorithm splits the observations {y_ij}_{i∈[d], j∈[n]} into a training set Ω^t ⊆ [d]×[n] and a validation set Ω^v ⊆ [d]×[n]. ... For each consecutive pair of elements in this sub-ordering, we assign one element in this pair to the training set and the other element to the validation set uniformly at random (Lines 5-7).
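The pair-wise splitting step quoted above can be sketched as follows. This is an illustrative Python sketch, not the authors' released code; the function name is hypothetical, and it assumes the "consecutive pairs" are disjoint (elements 1-2, 3-4, ...), which the quoted text does not fully pin down:

```python
import random

def split_pairs(sub_ordering, seed=0):
    """Assign one element of each disjoint consecutive pair in the
    sub-ordering to the training set and the other to the validation
    set, uniformly at random (any leftover element is ignored here)."""
    rng = random.Random(seed)
    train, val = [], []
    # Walk the sub-ordering two elements at a time: (s0, s1), (s2, s3), ...
    for a, b in zip(sub_ordering[0::2], sub_ordering[1::2]):
        if rng.random() < 0.5:
            train.append(a)
            val.append(b)
        else:
            train.append(b)
            val.append(a)
    return train, val
```

Each pair contributes exactly one element to the training set and one to the validation set, so the two sets always have equal size and partition the paired elements.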
Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments. It discusses experimental evaluations but lacks details on specific GPU/CPU models or computing resources.
Software Dependencies | No | The paper mentions using the CVXPY package but does not specify a version number. No other software dependencies are listed with version numbers.
Experiment Setup | Yes | Throughout the experiments, we use Λ = {2^i : −9 ≤ i ≤ 5, i ∈ ℤ} ∪ {0, ∞}. We also plot the error incurred by the best fixed choice of λ ∈ Λ, where for each point in the plots, we pick the value of λ ∈ Λ which minimizes the empirical ℓ2 error over all fixed choices in Λ. ... Throughout the experiments we set x* = 0 without loss of generality, because, as explained in Proposition 18 in Appendix C.2.1, the results remain the same for any value of x*. ... We set η = 1 σ, and consider different choices of σ.
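The λ grid and the "best fixed λ" baseline described in this row can be sketched as below. This is an illustrative sketch under my reading of the quoted setup (grid of powers of two from 2^−9 to 2^5, plus 0 and ∞); the names `LAMBDAS` and `best_fixed_lambda` are hypothetical, not from the paper's code:

```python
# The regularization grid from the quoted setup:
# Lambda = {2^i : -9 <= i <= 5, i integer} together with {0, inf}.
LAMBDAS = [0.0] + [2.0 ** i for i in range(-9, 6)] + [float("inf")]

def best_fixed_lambda(errors_by_lambda):
    """Return the lambda in the grid minimizing the empirical l2 error,
    given a mapping from each lambda to its observed error."""
    return min(LAMBDAS, key=lambda lam: errors_by_lambda[lam])
```

The baseline is an oracle over fixed choices: for each plotted point it reports the single grid value whose empirical ℓ2 error is smallest.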