Predictive Inference with Weak Supervision
Authors: Maxime Cauchois, Suyash Gupta, Alnur Ali, John C. Duchi
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We corroborate the hypothesis that the new coverage definition allows for tighter and more informative (but valid) confidence sets through several experiments." ... Keywords: Conformal inference, Confidence sets, Coverage validity, Weak supervision, Partial labels ... "To provide some initial insights into the methods and potential applications, we provide experiments on several real-world domains; in the main body (Section 5) we investigate ranking, while the appendices (see Appendix C) provide additional examples with structured prediction, matching for pedestrian tracking in videos, and prediction intervals for county-level voting in the United States." |
| Researcher Affiliation | Academia | Maxime Cauchois, EMAIL, Department of Statistics, Stanford University, Stanford, CA 94305-4020, USA; Suyash Gupta, EMAIL, Department of Statistics, Stanford University, Stanford, CA 94305-4020, USA; Alnur Ali, EMAIL, Department of Statistics, Stanford University, Stanford, CA 94305-4020, USA; John Duchi, EMAIL, Departments of Statistics and Electrical Engineering, Stanford University, Stanford, CA 94305-4020, USA |
| Pseudocode | Yes | Algorithm 1 Partially supervised conformalization Algorithm 2 Greedy weakly supervised scoring mechanism Algorithm 3 Sequential partitioning |
| Open Source Code | No | The paper does not contain an explicit statement about open-sourcing the code for the described methodology, nor does it provide a direct link to a code repository. It mentions the license for the paper itself, but not for software implementation. |
| Open Datasets | Yes | 5.2.2 Ranking experiment with Microsoft LETOR dataset ...Learning to rank with Microsoft LETOR dataset (Qin and Liu, 2013)... C.2 Pedestrian tracking with partial matching information ...Predicting trajectories in the MOT2D15 data set (Leal-Taixé et al., 2015)... C.3 Prediction intervals for weakly supervised regression ...Our data comes from the 2013–2017 American Community Survey 5-Year Estimates... |
| Dataset Splits | Yes | 5.1 A toy classification example ...we simulate n = 10^4 data points, splitting them into training (30%), calibration (20%) and test (50%) sets. 5.2.1 Ranking simulation study ...we simulate n = 10^4 i.i.d. different users, using the same (30, 20, 50) train/validation/test split as in Section 5.1. C.3 Prediction intervals for weakly supervised regression ...we split it into thirds: 33% of the counties (and their associated fractions of Democratic voters) go into the training set, 33% go into the calibration set, and the rest go into the test set; as our splits are random, they are exchangeable. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It describes models and datasets but lacks hardware specifications. |
| Software Dependencies | No | The paper mentions several algorithms and procedures like "List Net procedure (Cao et al., 2007)", "structured S-SVM approach (Tsochantaridis et al., 2004)", and "Hungarian algorithm". However, it does not specify any software libraries, frameworks, or operating systems with their version numbers that would be needed to replicate the experiments. |
| Experiment Setup | Yes | 5 Experiments 5.1 A toy classification example ...We vary the signal-to-noise ratio σ ∈ {10^-2, ..., 10^2}... we simulate n = 10^4 data points... We draw each θ_y uniformly on S^{d-1}, {X_i}_{i=1}^n i.i.d. N(0, I_d), choosing weak threshold T ∼ Uni[min_{y∈Y} S^Oracle_y, max_{y∈Y} S^Oracle_y]. We repeat the entire process N_trials = 20 times... 5.2.1 Ranking simulation study ...With K = 7 and d = 2... We use the same scoring model for both the fully supervised conformal (FSC) and weakly supervised conformal (WSC) procedures... scoring mechanism (10) with ψ(x, y) := (y − x)_+... 5.2.2 Ranking experiment with Microsoft LETOR dataset ...For each split (calibration/test), we first sample n = 2000 queries... select K ∈ {2, 4, 6, 8, 10, 20} documents... We repeat the entire simulation procedure N_trials = 20 times... pairwise comparison function ψ_c(r1, r2) := exp(−c·r1)(r2 − r1)_+ with c ∈ {0, 2, 5, 8}... C.3 Prediction intervals for weakly supervised regression ...for various values of µ ∈ {.01, .05, .1, .15, .2}... We set the miscoverage level α = .05. |
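The split sizes and miscoverage level reported above can be sketched as follows. This is a generic split-conformal illustration under our own assumptions (the paper reports no public implementation, so all names, the placeholder scores, and the calibration step below are ours, not the authors'):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reconstruction of the 30/20/50 train/calibration/test split
# reported in Section 5.1 (n = 10^4 simulated points).
n = 10_000
idx = rng.permutation(n)
n_train, n_cal = int(0.3 * n), int(0.2 * n)
train_idx = idx[:n_train]
cal_idx = idx[n_train:n_train + n_cal]
test_idx = idx[n_train + n_cal:]

# Standard split-conformal calibration at miscoverage alpha = 0.05:
# rank the conformity scores on the calibration set and take the
# ceil((n_cal + 1)(1 - alpha))-th smallest as the inclusion threshold.
alpha = 0.05
cal_scores = rng.exponential(size=n_cal)  # placeholder scores, not the paper's
k = int(np.ceil((n_cal + 1) * (1 - alpha)))
threshold = np.sort(cal_scores)[min(k, n_cal) - 1]

# A candidate label y enters the confidence set iff score(x, y) <= threshold;
# the paper's weakly supervised procedure replaces these scores with ones
# computed from partial labels (its Algorithms 1-3).
```

With exchangeable splits, this threshold rule gives the usual (1 − α) marginal coverage guarantee for the fully supervised baseline the paper compares against.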