Contradiction Retrieval via Contrastive Learning with Sparsity
Authors: Haike Xu, Zongyu Lin, Kai-Wei Chang, Yizhou Sun, Piotr Indyk
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct contradiction retrieval experiments on Arguana, MSMARCO, and HotpotQA, where our method produces an average improvement of 11.0% across different models. We also validate our method on downstream tasks like natural language inference and cleaning corrupted corpora. |
| Researcher Affiliation | Academia | 1MIT 2University of California, Los Angeles. Correspondence to: Haike Xu <EMAIL>. |
| Pseudocode | No | The paper describes the SPARSECL method and its components in Section 3 and details training procedures in Section 4, but it does not present any formal pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that code is being released, nor does it provide a link to a code repository in the main text. |
| Open Datasets | Yes | We first evaluate our method on the counter-argument detection dataset Arguana (Wachsmuth et al., 2018) and two contradiction retrieval datasets adapted from HotpotQA (Yang et al., 2018) and MSMARCO (Nguyen et al., 2016). |
| Dataset Splits | Yes | The dataset is split into the training set (60% of the data), the validation set (20%), and the test set (20%). This ensures that data from each individual debate is included in only one set and that debates from every theme are represented in every set. ... We generate the paraphrases and contradictions for the validation set, test set, and a randomly sampled 10000 documents from the training set. Please refer to Appendix G for details. |
| Hardware Specification | Yes | Most of our experiments are not very computationally intensive and can be run on a single A6000 GPU. We run our major experiments on A6000 and A100 GPUs. |
| Software Dependencies | No | Table 10 mentions specific models like GTE-large-en-v1.5, UAE-Large-V1, and bge-base-en-v1.5, and their backbones (BERT + RoPE + GLU), but does not provide specific version numbers for underlying software libraries like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | Please refer to Table 10 for our training parameters. ... We set the max sequence length to 512 for the Arguana dataset and 256 for the HotpotQA and MSMARCO datasets. ... We select α based on the best NDCG@10 score on the validation set. |
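The setup row above notes that the hyperparameter α is selected by the best NDCG@10 on the validation set. For readers reproducing this selection step, a minimal sketch of the standard NDCG@10 metric follows; the function names are illustrative and not taken from the paper (whose code is not released).

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked relevances."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Toy example: the single relevant document is ranked second,
# so NDCG@10 = (1/log2(3)) / (1/log2(2)) ≈ 0.631.
print(ndcg_at_k([0, 1, 0]))
```

In practice one would average this score over all validation queries for each candidate α and keep the α with the highest mean.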