Multiple Instance Verification

Authors: Xin Xu, Eibe Frank, Geoffrey Holmes

JMLR 2025

Reproducibility checklist — each variable, its result, and the supporting LLM response:
Research Type: Experimental
  "Through empirical studies on three different verification tasks, we demonstrate that CAP outperforms adaptations of SOTA MIL methods and the baseline by substantial margins, in terms of both classification accuracy and the ability to detect key instances. Ablation studies attribute the superior ability to identify key instances to the new attention functions."
Researcher Affiliation: Academia
  "Xin Xu, Eibe Frank, Geoffrey Holmes — Department of Computer Science, University of Waikato, Hamilton, New Zealand"
Pseudocode: No
  The paper describes the model architectures and components using mathematical equations and textual descriptions in Sections 4.1.1 and 4.1.2, and illustrates them with figures, but does not include any explicitly labelled pseudocode or algorithm blocks.
Open Source Code: Yes
  "We share our code at https://github.com/xxweka/MIV."
Open Datasets: Yes
  "We collected QMNIST data (BSD-style license, https://github.com/facebookresearch/qmnist/blob/main/LICENSE) according to Yadav and Bottou (2019) and from links therein. ... We collected raw images of signatures, both authentic and forged, based on Liwicki et al. (2011) and from the link http://www.iapr-tc11.org/mediawiki/index.php/ICDAR_2011_Signature_Verification_Competition_(SigComp2011) ... For fact extraction and verification (FEVER), we collected the raw data of claims and evidence as in Thorne et al. (2018) and from the FEVER (2018) website https://fever.ai/dataset/fever.html"
Dataset Splits: Yes
  "For QMNIST: by construction, the sample sizes of the train/dev/test datasets are 21,509/2,408/2,253 respectively. ... For signature verification: the sample size for training is around 82,000, and for validation/test it is close to 10,000. For FEVER: we used the full set of raw validation/test data (6,616/6,613 exemplars respectively) as our validation/test datasets. To construct our training dataset in each round of experiments, we randomly sampled 33,000 exemplars from the raw training set..."
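The per-round FEVER training subsample described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the fixed seed, and the toy placeholder exemplars are all assumptions.

```python
import random


def sample_training_round(raw_train, n=33000, seed=0):
    """Draw one round's training subsample without replacement.

    `raw_train` stands in for the raw FEVER training exemplars; the
    authors' actual sampling procedure and seeding are not specified.
    """
    rng = random.Random(seed)
    return rng.sample(raw_train, n)


# Toy usage with placeholder exemplars standing in for FEVER claims.
raw_train = [{"claim": f"claim-{i}"} for i in range(50000)]
round1 = sample_training_round(raw_train, n=33000, seed=1)
print(len(round1))  # 33000
```

Re-sampling with a different seed per round would yield a fresh 33,000-exemplar training set each time while keeping the fixed validation/test sets untouched.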
Hardware Specification: Yes
  "All experiments were run on a cluster of four NVIDIA RTX A6000 GPUs; run duration varied depending on when early stopping was triggered."
Software Dependencies: Yes
  "All models were developed using TensorFlow 2.9.3, with some use of the TensorFlow official models 2.9.0 (Yu et al. (2020), Apache License 2.0) and scikit-learn 1.2.0 (Pedregosa et al. (2011), BSD License)."
Experiment Setup: Yes
  "For QMNIST: the learning rate of the RMSprop optimizer was piecewise constant: 1e-4 for the first 5 epochs, 5e-5 for the next 15 epochs, and 2e-5 for the remaining epochs. The mini-batch size was 768, and the early-stopping criterion was non-improvement of validation accuracy for 30 epochs. The number of heads was two for multi-head attention, when applicable."
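As a minimal sketch, the piecewise-constant QMNIST learning-rate schedule quoted above can be written as an epoch-indexed step function. The exact boundary epochs (0-indexed here) are an assumption; the paper only gives the three rates and their epoch counts.

```python
def qmnist_lr_schedule(epoch):
    """Piecewise-constant RMSprop learning rate as described:
    1e-4 for the first 5 epochs, 5e-5 for the next 15, 2e-5 afterwards.
    Zero-based epoch indexing is an assumption."""
    if epoch < 5:
        return 1e-4
    if epoch < 20:  # 5 initial + 15 subsequent epochs
        return 5e-5
    return 2e-5


rates = [qmnist_lr_schedule(e) for e in (0, 4, 5, 19, 20, 100)]
print(rates)  # [0.0001, 0.0001, 5e-05, 5e-05, 2e-05, 2e-05]
```

In TensorFlow 2.9, a function like this could be plugged in via `tf.keras.callbacks.LearningRateScheduler`, with `tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=30)` implementing the stated early-stopping criterion; whether the authors wired it up this way is not stated in the quoted setup.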