A Pseudo-Metric between Probability Distributions based on Depth-Trimmed Regions

Authors: Guillaume Staerman, Pavlo Mozharovskyi, Pierre Colombo, Stephan Clémençon, Florence d'Alché-Buc

TMLR 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | The quality of this approximation and the performance of the proposed approach are illustrated in numerical experiments. Applications to robust clustering of images and automatic evaluation of natural language generation (NLG) show the benefits of this approach when benchmarked with state-of-the-art probability metrics. |
| Researcher Affiliation | Collaboration | Guillaume Staerman (LTCI, Télécom Paris, Institut Polytechnique de Paris); Pavlo Mozharovskyi (LTCI, Télécom Paris, Institut Polytechnique de Paris); Pierre Colombo (Equall.ai and MICS, CentraleSupélec, Université Paris-Saclay); Stephan Clémençon (LTCI, Télécom Paris, Institut Polytechnique de Paris); Florence d'Alché-Buc (LTCI, Télécom Paris, Institut Polytechnique de Paris) |
| Pseudocode | Yes | Algorithm 1: Approximation of DRp,ε; Algorithm 2: Approximation of the halfspace depth; Algorithm 3: Approximation of the projection depth; Algorithm 4: Approximation of the AI-IRW depth |
| Open Source Code | No | The text does not contain an explicit statement that the authors' code is released, nor does it provide a link to a code repository for the methodology described in the paper. |
| Open Datasets | Yes | The first dataset (FM) is constructed by taking the first 100 images in each class of the Fashion-MNIST dataset. Following previous BERT-based metrics, the authors evaluate DRp,ε (with p = 2, ε = 0.01, using the AI-IRW depth (Staerman et al., 2021b)) on two NLG tasks: data2text generation, using the WebNLG 2020 dataset (Ferreira et al., 2020), and summarization, using the dataset from Bhandari et al. (2020). |
| Dataset Splits | No | The paper describes how two datasets were constructed from Fashion-MNIST (FM and Cont. FM) and how contamination was introduced (5% contamination for Cont. FM), but it does not specify explicit training, validation, or test splits for these or any other datasets. |
| Hardware Specification | Yes | The authors thank the Jean Zay supercomputer operated by GENCI IDRIS with the compute grant 2023AD011014668R1 and Adastra with the grant AD010614770, where the NLP experiments have been done. |
| Software Dependencies | No | The paper mentions the scikit-learn spectral clustering implementation and a RoBERTa-based model from the Hugging Face hub (Wolf et al., 2019), but does not provide version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | DRp,ε (using the projection depth) is benchmarked, setting p = 2 and ε = 0.1, against the Wasserstein (W), Sliced-Wasserstein (Sliced-W), and Maximum Mean Discrepancy (MMD; Gretton et al., 2007) distances. DRp,ε and the Sliced-Wasserstein are approximated by Monte Carlo using 100 directions, while the MMD distance is computed using a Gaussian kernel with a bandwidth equal to 1. As a baseline, spectral clustering is also applied to images treated as vectors under the Euclidean distance, using the standard parameters of the scikit-learn spectral clustering implementation with the number of clusters fixed to 10. For the NLG tasks, the authors follow previous BERT-based metrics and evaluate DRp,ε with p = 2, ε = 0.01, using the AI-IRW depth (Staerman et al., 2021b). |
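The experiment-setup row fixes a few concrete estimator choices: 100 Monte-Carlo directions for the sliced distance and the projection depth, and a Gaussian-kernel MMD with bandwidth 1. A minimal NumPy sketch of those building blocks is given below. It assumes equal sample sizes, an order-1 one-dimensional Wasserstein distance inside the sliced estimator, and the biased V-statistic MMD estimator; these are simplifying assumptions for illustration, not the authors' implementation, and all function names are ours.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_dirs=100, rng=None):
    """Monte-Carlo Sliced-Wasserstein: average 1D W1 distance over
    random unit directions (assumes X and Y have equal sample sizes)."""
    rng = np.random.default_rng(rng)
    dirs = rng.normal(size=(n_dirs, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    total = 0.0
    for u in dirs:
        # 1D W1 between empirical measures = mean gap of sorted projections
        total += np.mean(np.abs(np.sort(X @ u) - np.sort(Y @ u)))
    return total / n_dirs

def mmd_gaussian(X, Y, bandwidth=1.0):
    """Squared MMD (biased V-statistic) with a Gaussian kernel."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2.0 * bandwidth ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

def projection_depth(x, X, n_dirs=100, rng=None):
    """Monte-Carlo projection depth of point x w.r.t. sample X:
    1 / (1 + max outlyingness over random directions), where
    outlyingness = |u.x - median(u.X)| / MAD(u.X)."""
    rng = np.random.default_rng(rng)
    dirs = rng.normal(size=(n_dirs, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    proj = X @ dirs.T                                  # (n, n_dirs)
    med = np.median(proj, axis=0)
    mad = np.median(np.abs(proj - med), axis=0)
    outlyingness = np.max(np.abs(dirs @ x - med) / mad)
    return 1.0 / (1.0 + outlyingness)
```

With 100 directions the two Monte-Carlo quantities match the direction budget stated in the setup; increasing `n_dirs` trades runtime for a tighter approximation of the supremum/integral over directions.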