Navigating Conflicting Views: Harnessing Trust for Learning

Authors: Jueqing Lu, Wray Buntine, Yuanyuan Qi, Joanna Dipnall, Belinda Gabbe, Lan Du

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate our method on six real-world datasets using Top-1 Accuracy, Fleiss Kappa, and a new metric, Multi-View Agreement with Ground Truth, to assess prediction reliability. We also assess the effectiveness of uncertainty in indicating prediction correctness via AUROC. Additionally, we test the scalability of our method through end-to-end training on a large-scale dataset. The experimental results show that computational trust can effectively resolve conflicts, paving the way for more reliable multi-view classification models in real-world applications.
Researcher Affiliation Academia (1) Department of Data Science & AI, Monash University; (2) College of Engineering and Computer Science, VinUniversity; (3) School of Public Health and Preventive Medicine, Monash University. Correspondence to: Lan Du <EMAIL>.
Pseudocode Yes Algorithm 1: Algorithm For Training (simplified version); Algorithm 2: Algorithm For Training; Algorithm 3: Algorithm For Testing.
Open Source Code Yes Code available at: https://github.com/OverfitFlow/Trust4Conflict
Open Datasets Yes Following previous work (Han et al., 2021; 2022; Jung et al., 2022; Xu et al., 2024a), we conducted experiments on six benchmark datasets: Handwritten (https://archive.ics.uci.edu/ml/datasets/Multiple+Features), Caltech101 (Fei-Fei et al., 2004), PIE (http://www.cs.cmu.edu/afs/cs/project/PIE/MultiPie/Multi-Pie/Home.html), Scene15 (Fei-Fei & Perona, 2005), HMDB (Kuehne et al., 2011) and CUB (Wah et al., 2011), with a train-test split of 80% vs. 20%.
Dataset Splits Yes Following previous work (Han et al., 2021; 2022; Jung et al., 2022; Xu et al., 2024a), we conducted experiments on six benchmark datasets: Handwritten, Caltech101 (Fei-Fei et al., 2004), PIE, Scene15 (Fei-Fei & Perona, 2005), HMDB (Kuehne et al., 2011) and CUB (Wah et al., 2011), with a train-test split of 80% vs. 20%. Table 11 (Summary of Datasets) reports, e.g., for Handwritten: Size 2000, K = 10 classes, view dimensions 240/76/216/47/64/6, #Train 1600, #Test 400.
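The 80%/20% split and the Handwritten sizes reported above (2000 samples into 1600 train / 400 test) can be sketched as follows. This is a minimal illustration, not the authors' code; the fixed seed and shuffling strategy are assumptions, since the excerpt does not specify how the split was randomized.

```python
import numpy as np

def split_indices(n_samples, train_frac=0.8, seed=0):
    """Shuffle sample indices and split them into train/test subsets.

    The seed value is an illustrative assumption; the paper excerpt
    does not state how the 80/20 split was randomized.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(n_samples * train_frac)
    return idx[:n_train], idx[n_train:]

# Handwritten: 2000 samples -> 1600 train / 400 test, matching Table 11.
train_idx, test_idx = split_indices(2000)
print(len(train_idx), len(test_idx))  # 1600 400
```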
Hardware Specification Yes All methods were run on a single NVIDIA RTX 3090 GPU with 24GB of memory for fair comparison.
Software Dependencies Yes Specifically, we used PyTorch (Paszke et al., 2019) version 1.13.0, built with CUDA 11.7, to implement our code. The Python version is 3.8, and the operating system is Ubuntu 22.04.4.
Experiment Setup Yes Table 10. TF and ETF hyper-parameters:
Hyper-parameter: Handwritten / Caltech101 / PIE / Scene15 / HMDB / CUB
lr: 3e-3 / 1e-4 / 3e-3 / 1e-2 / 3e-4 / 1e-3
rlr: 3e-4 / 3e-5 / 1e-3 / 3e-3 / 1e-4 / 3e-4
weight-decay: 1e-4 for all datasets
warm-up epochs: 1 for all datasets
The Adam optimizer (Kingma & Ba, 2015) is used for updating model parameters with beta coefficients = (0.9, 0.999) and epsilon = 1e-8.
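The Table 10 settings can be gathered into per-dataset optimizer configurations as sketched below. The dictionary layout and the `adam_config` helper are illustrative assumptions; the role of `rlr` (a second learning rate) is not detailed in this excerpt, so it is carried through unchanged.

```python
# Per-dataset learning rates from Table 10. "rlr" is the paper's second
# learning rate; its exact role is not specified in this excerpt.
HPARAMS = {
    "Handwritten": {"lr": 3e-3, "rlr": 3e-4},
    "Caltech101":  {"lr": 1e-4, "rlr": 3e-5},
    "PIE":         {"lr": 3e-3, "rlr": 1e-3},
    "Scene15":     {"lr": 1e-2, "rlr": 3e-3},
    "HMDB":        {"lr": 3e-4, "rlr": 1e-4},
    "CUB":         {"lr": 1e-3, "rlr": 3e-4},
}

def adam_config(dataset):
    """Combine the shared Adam settings (betas, eps, weight decay,
    warm-up) with the dataset-specific learning rates from Table 10."""
    cfg = {
        "betas": (0.9, 0.999),
        "eps": 1e-8,
        "weight_decay": 1e-4,
        "warmup_epochs": 1,
    }
    cfg.update(HPARAMS[dataset])
    return cfg

print(adam_config("Scene15")["lr"])  # 0.01
```

In PyTorch, such a dictionary maps directly onto `torch.optim.Adam(params, lr=cfg["lr"], betas=cfg["betas"], eps=cfg["eps"], weight_decay=cfg["weight_decay"])`.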