Distribution-Free Data Uncertainty for Neural Network Regression

Authors: Domokos M. Kelen, Ádám Jung, Péter Kersch, András Benczúr

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the method on a variety of synthetic and real-world tasks, including uni- and multivariate problems, function inverse approximation, and standard regression uncertainty benchmarks. Our approach allows the model to learn well-calibrated, arbitrary uni- and multivariate output distributions. We evaluate our approach on multiple tasks, including synthetic uni- and multivariate examples, the inverse problem of MNIST classification, and the standard UCI regression uncertainty benchmark. The method consistently exhibits favorable behavior in practice, making it an appealing choice for modeling aleatoric uncertainty. Finally, we extensively test the resulting method, comparing it against prior work such as BNNs (Gawlikowski et al., 2023) or Mixture Density Networks (MDN) (Bishop, 1994). (Section 5: Experiments)
Researcher Affiliation | Collaboration | Emails: EMAIL, EMAIL. Affiliations: HUN-REN SZTAKI; Ericsson Hungary; Széchenyi University, Győr, Hungary
Pseudocode | No | The paper describes methods and architectures but does not contain explicitly labeled pseudocode or algorithm blocks. It provides formulas and conceptual descriptions of the network.
Open Source Code | Yes | Finally, we make all experiment code publicly available. We make all code and hyperparameters used to run the experiments in the paper publicly available in our source code repository at https://github.com/proto-n/torch-naut.
Open Datasets | Yes | We evaluate the method on a variety of synthetic and real-world tasks, including uni- and multivariate problems, function inverse approximation, and standard regression uncertainty benchmarks. We reverse the input-output structure of the MNIST classification task (LeCun, 1998), turning it into a multivariate regression problem. The UCI Benchmark (Hernández-Lobato & Adams, 2015) has become the standard benchmark to measure regression uncertainty. DISCO Nets is originally measured on the NYU Hand Pose Estimation dataset (Tompson et al., 2014) as preprocessed by Oberweger et al. (2015).
Dataset Splits | Yes | We run our experiments on the exact train-test splits defined by Hernández-Lobato & Adams (2015), using the setup and code from the repository of Gal & Ghahramani (2016). We use early stopping with a patience value of 50 on validation sets consisting of 20% of the training set for each split. The processed dataset contains 72,757 training and 8,254 testing frames, with the sets consisting of RGBD images of hands taken from 3 viewpoints.
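The split-and-early-stopping protocol quoted above (20% of each training split held out for validation, patience of 50 epochs) can be sketched in plain Python. This is an illustrative reconstruction, not the paper's actual code; the function names and the seeded shuffle are assumptions.

```python
import random

def train_val_split(indices, val_fraction=0.2, seed=0):
    """Hold out val_fraction (20% in the paper's setup) of the training
    indices as a validation set, after a seeded shuffle."""
    rng = random.Random(seed)
    idx = list(indices)
    rng.shuffle(idx)
    n_val = int(len(idx) * val_fraction)
    return idx[n_val:], idx[:n_val]  # (train, val)

def should_stop(val_losses, patience=50):
    """Early stopping: stop once the best validation loss occurred more
    than `patience` epochs ago (patience=50 as in the paper)."""
    best_epoch = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_epoch >= patience
```

For example, with 100 training points the split yields 80 training and 20 validation indices, and training halts once 50 epochs pass without a new best validation loss.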
Hardware Specification | Yes | All of our implementations were run on an AMD EPYC 7F72 workstation with 2x24 CPU cores (48 cores, 96 threads in total) and 384 MiB of L3 cache. We used a single NVIDIA A100-SXM4-40GB GPU per experiment, with CUDA version 12.2 and driver version 535.183.01, ensuring consistent conditions across experiments.
Software Dependencies | No | The paper mentions "CUDA version 12.2" and the "PyTorch framework" but provides a specific version only for CUDA. It also mentions the awKDE Python library, without a version number. The instructions require multiple key software components with their versions for a 'Yes' answer.
Experiment Setup | Yes | We do not optimize hyperparameters individually for each dataset; however, we do use adaptive batch sizes and larger L2 regularization for the 4 smallest datasets. In our experiments we use the AdamW optimizer with a learning rate of 0.001 and a linear learning-rate warm-up schedule starting with the coefficient 0.1 and ending with 1 after 10 epochs. We use early stopping with a patience value of 50 on validation sets consisting of 20% of the training set for each split. We set L2 regularization over the weights to 10^-6 on all datasets except the four smallest ones (boston, concrete, energy, yacht), where we use 10^-4. Batch size is chosen to be max(m/5, 1), with m being the training set size, for a balance of computational efficiency and predictive accuracy. We run Ensembles and ensembling-based variants of WCRPS using 5 networks. We use 20 heads for all multi-head variants of WCRPS on the UCI benchmark, and 10 layers for all layered-multi-head variants. For evaluating the CRPS formula variants, we take 100 samples during training and 1000 samples for evaluation during early stopping.
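The optimizer settings quoted above (linear warm-up from coefficient 0.1 to 1 over 10 epochs, and the adaptive batch size derived from the training-set size m) can be sketched as small helper functions. This is a hedged reconstruction of the described schedule, not the paper's code; in particular, integer division in the batch-size formula is an assumption, as the extracted text gives only "max(m/5, 1)".

```python
def warmup_coefficient(epoch, start=0.1, end=1.0, warmup_epochs=10):
    """Linear learning-rate warm-up: multiply the base learning rate
    (0.001 in the paper) by `start` at epoch 0, rising linearly to
    `end` at `warmup_epochs`, then holding constant."""
    if epoch >= warmup_epochs:
        return end
    return start + (end - start) * epoch / warmup_epochs

def adaptive_batch_size(m):
    """Adaptive batch size max(m/5, 1) for training-set size m
    (integer division assumed)."""
    return max(m // 5, 1)
```

In a PyTorch training loop, `warmup_coefficient` would typically be passed to `torch.optim.lr_scheduler.LambdaLR` as the `lr_lambda` argument alongside an AdamW optimizer with `lr=1e-3` and the quoted weight decay.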