Refine Knowledge of Large Language Models via Adaptive Contrastive Learning

Authors: Yinghui Li, Haojing Huang, Jiayi Kuang, Yangning Li, Shu-Yu Guo, Chao Qu, Xiaoyu Tan, Hai-Tao Zheng, Ying Shen, Philip Yu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. Extensive experiments and detailed analyses on widely used datasets demonstrate the method's effectiveness. The authors conduct experiments on several advanced LLMs and test on both in-distribution and out-of-distribution data; the results show the approach achieves the highest Truthful rate, verifying the proposed Adaptive Contrastive Learning strategy. Section 4 is dedicated to "EXPERIMENT".
Researcher Affiliation: Collaboration. The authors are affiliated with Tsinghua University (academic), Sun Yat-sen University (academic), INFLY TECH (Shanghai) Co., Ltd. (industry), Peng Cheng Laboratory (academic/public research), and University of Illinois Chicago (academic). The mix of academic institutions and an industry company (INFLY TECH) indicates a collaboration.
Pseudocode: No. The paper describes its methodology in Section 3, including mathematical formulations for the loss functions (Equations 1-7) and a detailed explanation of its strategy, but it does not contain a distinct, structured pseudocode block or algorithm box.
Open Source Code: No. The paper provides neither an explicit statement of code release nor a link to a source code repository.
Open Datasets: Yes. The paper uses and cites several publicly available datasets: TriviaQA (Joshi et al., 2017), Natural Questions (Kwiatkowski et al., 2019), and ALCUNA (Yin et al., 2023a).
Dataset Splits: Yes. For TriviaQA, the paper states: "we use 90% of the training set to construct a training set for comparative learning data and 10% as a validation set. Since there is no standard answer in Trivia QA's test set, we select 11,313 Q&A pairs from the development set to build our final test set." For Natural Questions, it mentions: "The development set containing 3,610 instances is used to build our test set." For ALCUNA, it states: "We randomly sampled 1000 instances from the ALCUNA dataset to serve as our out-of-domain test set."
Hardware Specification: Yes. All experiments are conducted on NVIDIA A100 80GB GPUs.
Software Dependencies: No. The paper mentions using specific base models (LLaMA-2-7B-chat, Mistral-7B-Instruct-v0.1) and the vLLM framework, but it does not specify version numbers for these software components or for other implementation libraries.
Experiment Setup: Yes. The paper provides specific experimental setup details: "During the training of the LLaMA model, we used a batch size of 16, a learning rate of 5e-5, a context length of 1024, and trained for 2 epochs. For the Mistral model, we used a batch size of 16, a learning rate of 1e-5, a context length of 1024, and also trained for 2 epochs. The τ is set to 0.01 and the λ is set to 1."
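The hyperparameters quoted under Experiment Setup can be collected into a single reference structure. This is a minimal sketch for anyone attempting reproduction; the dictionary keys and model identifiers are illustrative assumptions, since the paper releases no code:

```python
# Hyperparameters as reported in the paper's experiment setup.
# Key names are illustrative; no official config file exists.
TRAIN_CONFIG = {
    "llama-2-7b-chat": {
        "batch_size": 16,
        "learning_rate": 5e-5,
        "context_length": 1024,
        "epochs": 2,
    },
    "mistral-7b-instruct-v0.1": {
        "batch_size": 16,
        "learning_rate": 1e-5,
        "context_length": 1024,
        "epochs": 2,
    },
    "tau": 0.01,     # temperature τ in the contrastive loss
    "lambda": 1.0,   # loss-weighting coefficient λ
}
```

Note that the two models share every reported setting except the learning rate (5e-5 for LLaMA vs. 1e-5 for Mistral).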
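The split procedure quoted under Dataset Splits (90%/10% of the TriviaQA training set, test set drawn from the development set) can be sketched as follows. The function name, seed handling, and shuffling are assumptions for illustration; the paper does not describe its exact sampling code:

```python
import random

def build_splits(train_pairs, dev_pairs, seed=0):
    """Sketch of the described TriviaQA splits: 90% of the training
    set for training, 10% for validation, and a test set drawn from
    the development set (the paper selects 11,313 Q&A pairs)."""
    rng = random.Random(seed)  # fixed seed for a reproducible shuffle
    pairs = list(train_pairs)
    rng.shuffle(pairs)
    cut = int(0.9 * len(pairs))
    train, valid = pairs[:cut], pairs[cut:]
    test = dev_pairs[:11313]  # selection criterion in the paper is unspecified
    return train, valid, test
```

For Natural Questions and ALCUNA the paper instead uses the 3,610-instance development set and a random sample of 1,000 instances, respectively, so an analogous helper would differ only in the test-set construction.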
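Since the paper provides no pseudocode for its loss (Equations 1-7 are not reproduced here), the sketch below shows only a generic InfoNCE-style contrastive objective of the kind such a strategy builds on. The function name and scoring interface are assumptions; this is not the paper's Adaptive Contrastive Learning formulation, though it uses the same temperature τ = 0.01:

```python
import math

def info_nce_loss(pos_score, neg_scores, tau=0.01):
    """Generic InfoNCE-style contrastive loss (illustrative only):
    -log( exp(s+/τ) / (exp(s+/τ) + Σ exp(s-/τ)) ), computed in a
    numerically stable way via the max-subtraction trick."""
    logits = [pos_score / tau] + [s / tau for s in neg_scores]
    m = max(logits)  # subtract the max before exponentiating
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denom)
```

As expected of a contrastive objective, the loss shrinks toward zero when the positive score dominates the negatives and grows when a negative outscores the positive.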