Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Debiasing Evaluations That Are Biased by Evaluations
Authors: Jingyan Wang, Ivan Stelmakh, Yuting Wei, Nihar Shah
JMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide theoretical guarantees on the performance of our algorithm, as well as experimental evaluations. |
| Researcher Affiliation | Academia | Jingyan Wang EMAIL Toyota Technological Institute at Chicago Chicago, IL 60637, USA; Ivan Stelmakh EMAIL New Economic School Moscow, Russia; Yuting Wei EMAIL Department of Statistics and Data Science University of Pennsylvania Philadelphia, PA 19104, USA; Nihar Shah EMAIL School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213, USA |
| Pseudocode | Yes | Algorithm 1: Cross-validation. Inputs: observations Y, partial ordering O, and set Λ. |
| Open Source Code | Yes | The code to reproduce our results is available at https://github.com/jingyanw/outcome-induced-debiasing. |
| Open Datasets | Yes | We use the grading data from Indiana University Bloomington (2020), where the possible grades that students receive are A+ through D-, and F. ... https://gradedistribution.registrar.indiana.edu/index.php [Online; accessed 30-Sep-2020]. We now move to a real-world dataset (Kerzendorf et al., 2020) collected for proposal peer review at the European Southern Observatory (ESO). |
| Dataset Splits | Yes | In the data-splitting step, our algorithm splits the observations {y_ij}_{i∈[d], j∈[n]} into a training set Ω_t ⊆ [d]×[n] and a validation set Ω_v ⊆ [d]×[n]. ... For each consecutive pair of elements in this sub-ordering, we assign one element in this pair to the training set and the other element to the validation set uniformly at random (Lines 5-7). |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments. It discusses experimental evaluations but lacks details on specific GPU/CPU models or computing resources. |
| Software Dependencies | No | The paper mentions using 'CVXPY package' but does not specify a version number. No other software dependencies are listed with version numbers. |
| Experiment Setup | Yes | Throughout the experiments, we use Λ = {2^i : −9 ≤ i ≤ 5, i ∈ Z} ∪ {0, ∞}. We also plot the error incurred by the best fixed choice of λ ∈ Λ, where for each point in the plots, we pick the value of λ ∈ Λ which minimizes the empirical ℓ2 error over all fixed choices in Λ. ... Throughout the experiments we set x* = 0 without loss of generality, because, as explained in Proposition 18 in Appendix C.2.1, the results remain the same for any value of x*. ... We set η = 1 σ, and consider different choices of σ. |
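The pairwise data-splitting step quoted under "Dataset Splits" can be sketched as follows. This is illustrative Python, not the authors' code; treating the "consecutive pairs" as non-overlapping and assigning any leftover element to the training set are assumptions.

```python
import random

def pairwise_split(ordering, seed=0):
    """Sketch of the data-splitting step: walk the sub-ordering in
    non-overlapping consecutive pairs (an assumption) and assign one
    element of each pair to the training set and the other to the
    validation set uniformly at random."""
    rng = random.Random(seed)
    train, val = [], []
    for k in range(0, len(ordering) - 1, 2):
        a, b = ordering[k], ordering[k + 1]
        if rng.random() < 0.5:
            train.append(a)
            val.append(b)
        else:
            train.append(b)
            val.append(a)
    if len(ordering) % 2 == 1:
        # Leftover unpaired element: assumed to go to the training set.
        train.append(ordering[-1])
    return train, val
```

Together the two lists always partition the input, so every observation is used exactly once across the two splits.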
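The λ-selection criterion quoted under "Experiment Setup" (pick the λ ∈ Λ minimizing the empirical ℓ2 error on the validation split) can be sketched as below. The `fit` callable and the function name are hypothetical placeholders standing in for the debiasing estimator; only the grid mirrors the one quoted above.

```python
import numpy as np

# Grid mirroring the quoted Λ = {2^i : −9 ≤ i ≤ 5, i ∈ Z} ∪ {0, ∞}.
LAMBDAS = [2.0 ** i for i in range(-9, 6)] + [0.0, np.inf]

def select_lambda(y, train_idx, val_idx, fit, lambdas=LAMBDAS):
    """Sketch of λ selection: for each candidate λ, fit on the training
    split and keep the λ with the smallest squared (ℓ2) error on the
    validation split. `fit` is an assumed callable mapping
    (observations, training indices, λ) to predictions for all indices."""
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        y_hat = fit(y, train_idx, lam)
        err = float(np.sum((np.asarray(y_hat)[val_idx] - y[val_idx]) ** 2))
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```

Any estimator with the assumed signature can be plugged in; candidates whose validation error is infinite (e.g. λ = ∞ under a degenerate fit) are simply never selected.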