Concept Matching with Agent for Out-of-Distribution Detection

Authors: Yuxiao Lee, Xiaofeng Cao, Jingcai Guo, Wei Ye, Qing Guo, Yi Chang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experimental results showcase the superior performance of CMA over both zero-shot and training-required methods in a diverse array of real-world scenarios. We conducted experiments on various datasets with distinct ID scenarios and demonstrated that CMA achieves superior performance across a wide range of real-world tasks.
Researcher Affiliation | Academia | (1) School of Artificial Intelligence, Jilin University, China; (2) The Hong Kong Polytechnic University; (3) College of Electronic and Information Engineering, Tongji University, China; (4) CFAR and IHPC, Agency for Science, Technology and Research (A*STAR), Singapore; (5) Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Ministry of Education, China
Pseudocode | No | The paper describes its method using textual explanations and mathematical formulas (e.g., S_CMA(x; Y_in, Y_ntc, T, I) in Section 3.3) but does not include any clearly labeled pseudocode or algorithm blocks.
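Since the paper provides no pseudocode, the following is a minimal illustrative sketch of what a CLIP-style zero-shot OOD score of this general shape (ID labels plus negative text concepts, compared against an image embedding) can look like. This is not a reconstruction of the paper's actual S_CMA formula; the function name, argument names, and scoring choice here are our assumptions.

```python
import numpy as np

def zero_shot_ood_score(image_emb, id_text_embs, neg_text_embs, temperature=1.0):
    """Illustrative zero-shot OOD score (not the paper's S_CMA).

    All inputs are L2-normalized embeddings: one image vector, a matrix of
    ID-label text embeddings, and a matrix of negative-concept text
    embeddings. The score is the softmax probability mass assigned to the
    ID labels; a low value suggests the image is out-of-distribution.
    """
    sims = np.concatenate([id_text_embs @ image_emb, neg_text_embs @ image_emb])
    sims = sims / temperature
    sims -= sims.max()  # numerical stability before exponentiation
    probs = np.exp(sims) / np.exp(sims).sum()
    return float(probs[: len(id_text_embs)].sum())
```

An image embedding that aligns closely with an ID-label embedding pushes the score toward 1, while alignment with the negative concepts pushes it toward 0.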
Open Source Code | Yes | Code: https://github.com/yuxiaoLeeMarks/CMA
Open Datasets | Yes | Datasets: We conducted a comprehensive evaluation of the performance of our method across various dimensions and compared it with widely employed OOD detection algorithms. (1) We assessed our approach on the ImageNet-1k OOD benchmark. This benchmark utilizes the large-scale visual dataset ImageNet-1k (Deng et al. 2009) as the ID data and four OOD datasets (including subsets of iNaturalist (Van Horn et al. 2018), SUN (Xiao et al. 2010), Places (Zhou et al. 2017), and Textures (Cimpoi et al. 2014))... (2) We evaluated our method on various small-scale datasets. Specifically, we considered the following ID datasets: Fashion-MNIST (Xiao, Rasul, and Vollgraf 2017), STL-10 (Coates, Ng, and Lee 2011), Oxford-IIIT Pet (Parkhi et al. 2012), Food-101 (Bossard, Guillaumin, and Van Gool 2014), CUB-200 (Wah et al. 2011), PlantVillage (Hughes, Salathé et al. 2015), LFW (Huang et al. 2012), Stanford Dogs (Khosla et al. 2011), FGVC-Aircraft (Maji et al. 2013), Grocery Store (Klasson, Zhang, and Kjellström 2019), and CIFAR-10 (Krizhevsky, Hinton et al. 2009).
Dataset Splits | No | The paper mentions using 'ImageNet-1k as the ID data and four OOD datasets' for evaluation, along with 'subsets of ImageNet-1k, namely ImageNet-10 and ImageNet-20'. However, it does not specify explicit training, validation, or testing splits (e.g., percentages or sample counts) for the datasets used in its experiments, which would be needed to reproduce the data partitioning.
Hardware Specification | No | The paper mentions using 'CLIP-B/16 as the foundational evaluation model' but does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | The paper mentions using 'CLIP (Radford et al. 2021)' and 'CLIP-B/16' as the pre-trained model backbone, which uses a 'ViT-B/16 transformer (Dosovitskiy et al. 2020)' and a 'masked self-attention Transformer (Vaswani et al. 2017)'. These are models and architectures, but no specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) are provided.
Experiment Setup | Yes | Additionally, unless otherwise specified, the temperature coefficient is uniformly set to 1 across all algorithms. In the Section 5 Discussion, 'empirical analysis on ImageNet-1k reveals optimal performance at k = 1', where k = M/N (M denotes agent count).
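To make concrete where a temperature coefficient of T = 1 enters a softmax-based confidence score, here is a minimal sketch of a maximum-softmax-probability (MSP) style baseline score. This is a generic baseline for illustration, not CMA itself, and the function name is our own assumption.

```python
import numpy as np

def msp_score(logits: np.ndarray, temperature: float = 1.0) -> float:
    """Maximum softmax probability (MSP) confidence score.

    `logits` are similarity scores between an image embedding and the
    candidate label embeddings; `temperature` plays the role of the
    coefficient the report says is uniformly set to 1. Lower scores
    indicate less confident, potentially out-of-distribution inputs.
    """
    scaled = logits / temperature
    scaled -= scaled.max()  # numerical stability before exponentiation
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return float(probs.max())
```

Raising the temperature flattens the softmax distribution and lowers the score; with T = 1 the raw similarity gaps are used directly.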