Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Adaptive Conformal Regression with Split-Jackknife+ Scores
Authors: Nicolas Deutschmann, Mattia Rigotti, Maria Rodriguez Martinez
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show through empirical validation that our method is more robust to overfitting effects than the original method, while being more sample-efficient than modern ECDF-based methods. This construction yields satisfying empirical results, avoiding issues of the original MADSplit when models overfit and obtaining better results than methods based on conditional score ECDF in low data regimes. We display the calibration-set-size dependence of PI metrics in Fig. 2. These results highlight a more general trend: our method is more robust than LVD in the low data regime but tends to be similar with enough data. We detail our results in Table 1. |
| Researcher Affiliation | Collaboration | Nicolas Deutschmann, IBM Research; Mattia Rigotti (mrg@foobarzurich.barfooibm.com), IBM Research; María Rodríguez Martínez, IBM Research. Reviewed on OpenReview: https://openreview.net/forum?id=1fbTGC3BUD. Footnotes: now at Cradle; corresponding author; now at Yale School of Medicine. |
| Pseudocode | Yes | In this section, we describe the calibration and PI prediction procedures for two versions of our method in the form of pseudocode. The simplest is Algorithm 1, where the kernel is assumed to be fixed. Indeed, we've found in practice that using a KNN kernel with K = 10 is an efficient and performant approach. We do, however, also define Algorithm 2, where we tune N kernels on calibration data, with an objective motivated by the bound of Proposition 4. |
| Open Source Code | No | All source code required for conducting and analyzing the experiments will be made publicly available upon publication of the paper with a license that allows free usage for research purposes. Yes. |
| Open Datasets | Yes | Two TDC molecule property datasets (Huang et al., 2021): drug solubility and clearance prediction from respectively 4.2k and 1.1k SMILES sequences. The AlphaSeq antibody binding dataset (Engelhart et al., 2022): binding score prediction for a SARS-CoV-2 target peptide from 71k amino-acid sequences. Regression-MNIST (LeCun & Cortes, 2005): regression of the numeric label of MNIST images with test-time augmentations (e.g. predicting the number "9.0" when presented with a picture of the digit "9"). |
| Dataset Splits | Yes | We use 1000 calibration and 10000 test points to evaluate our method on a random-forest regressor trained on independent data... The datasets are already divided into training, validation and test samples, however, we merge the validation and test data and re-divide it randomly into calibration and test to allow for enough statistics... The data is divided into a training/validation/test+calibration split of sizes 39466/12517/17314. The test+calibration set is randomly subsampled into a test dataset of size 7314 and a calibration dataset of variable size. |
| Hardware Specification | Yes | All our experiments are run on a Linux HPC cluster. For model training and inference, we used 32-core, 64 GB RAM environments with a single NVIDIA A100 GPU. |
| Software Dependencies | Yes | Experiments were run in Python 3.9 with pytorch v2.0.0+cu117. |
| Experiment Setup | Yes | Throughout this section, we fix the kernel for our method to a simple (approximate (PyNNDescent; Dong et al., 2011)) K-nearest-neighbor (KNN) kernel (Kij = 1 if j is among the KNN of i, otherwise Kij = 0), setting K = 10... Our fine-tuned ChemBERTa models are defined as Hugging Face AutoModelForSequenceClassification models with DeepChem/ChemBERTa-77M-MTR weights, and sequence data is processed with the adapted AutoTokenizer, with maximum sequence lengths set to the maximum sequence length in the training data. The training is performed with the mean-squared-error loss, using a batch size of 64 and a learning rate of 4.0 × 10⁻⁵ over 100 epochs... This model is trained with the Adam optimizer using a learning rate of 1.0 × 10⁻⁵, a batch size of 128, and is regularized with early stopping, monitoring the validation mean-squared error. |
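The KNN kernel quoted in the experiment setup (Kij = 1 if j is among the K nearest neighbors of i, with K = 10) is straightforward to reproduce. Below is a minimal NumPy sketch using exact brute-force neighbor search rather than the approximate PyNNDescent index the authors used; the `knn_kernel` helper name and its signature are illustrative assumptions, not from the paper:

```python
import numpy as np

def knn_kernel(X, k=10):
    """Binary KNN kernel: K[i, j] = 1 if j is among the k nearest
    neighbors of i (self excluded), else 0. Exact brute-force search;
    the paper uses approximate KNN via PyNNDescent instead."""
    # Pairwise squared Euclidean distances, shape (n, n).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    # Column indices of the k smallest distances in each row.
    nbrs = np.argsort(d2, axis=1)[:, :k]
    K = np.zeros_like(d2)
    np.put_along_axis(K, nbrs, 1.0, axis=1)
    return K

X = np.random.default_rng(0).normal(size=(50, 3))
K = knn_kernel(X, k=10)
```

Each row of `K` then sums to exactly `k`, and the kernel is generally asymmetric (j being a neighbor of i does not imply the converse), which matches the quoted elementwise definition.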
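The dataset-splits row describes merging the validation and test data and randomly re-dividing the pool into calibration and test sets (e.g. a 17314-point pool subsampled into a 7314-point test set plus calibration data). That re-split can be sketched as follows; `recalibration_split` and its `seed` argument are hypothetical names introduced here for illustration:

```python
import numpy as np

def recalibration_split(n_pool, n_test, n_cal, seed=0):
    """Randomly re-divide a merged validation+test pool of n_pool points
    into disjoint test and calibration index sets."""
    assert n_test + n_cal <= n_pool
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_pool)  # random order over the whole pool
    return perm[:n_test], perm[n_test:n_test + n_cal]

# Split in the style reported above: 17314-point pool -> 7314 test points,
# with a calibration set of variable size (here 10000).
test_idx, cal_idx = recalibration_split(17314, 7314, 10000)
```

Drawing both sets from one permutation guarantees they are disjoint, and varying `n_cal` while holding the test set size fixed mirrors the calibration-set-size sweeps reported in the paper.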