BCE vs. CE in Deep Feature Learning

Authors: Qiufu Li, Huibin Xiao, Linlin Shen

ICML 2025

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | The experimental results are aligned with the above analysis, and show that BCE could improve classification and lead to better compactness and distinctiveness among sample features. The code has been released. We conduct extensive experiments and find that, compared to CE, BCE can more quickly lead to NC (neural collapse) on the training dataset and achieves better feature compactness and distinctiveness, resulting in higher classification performance on the test dataset.

Researcher Affiliation | Academia | ¹School of Artificial Intelligence, Shenzhen University, Shenzhen, 518060, China; ²Department of Computer Science, Wenzhou Kean University, Wenzhou, 325060, China; ³Guangdong Provincial Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen, 518060, China. Correspondence to: Linlin Shen <EMAIL>.

Pseudocode | No | The paper describes its methods using mathematical formulations and descriptive text, but does not include any explicitly labeled pseudocode or algorithm blocks.

Open Source Code | Yes | The experimental results are aligned with the above analysis, and show that BCE could improve classification and lead to better compactness and distinctiveness among sample features. The code has been released.

Open Datasets | Yes | To compare CE and BCE in deep feature learning, we train deep classification models, ResNet18, ResNet50 (He et al., 2016), and DenseNet121 (Huang et al., 2017), on three popular datasets: MNIST (LeCun et al., 1998), CIFAR10, and CIFAR100 (Krizhevsky et al., 2009).

Dataset Splits | No | The paper mentions training and test datasets (e.g., "The classification on the test set of CIFAR10 and CIFAR100."), but does not explicitly provide percentages, sample counts, or a detailed methodology for creating these splits in the main text.

Hardware Specification | No | The paper does not explicitly mention the hardware (e.g., GPU or CPU models, or cloud resources) used to run the experiments; it only discusses the models and optimizers used for training.

Software Dependencies | No | The paper mentions deep learning models (ResNet18, ResNet50, DenseNet121), optimizers (SGD and AdamW), and data augmentation techniques (Mixup, CutMix), but does not provide version numbers for any software libraries, frameworks, or programming languages (e.g., PyTorch, TensorFlow, Python).

Experiment Setup | Yes | We train the models using SGD and AdamW for 100 epochs with a batch size of 128. The initial learning rate is set to 0.01 for SGD and 0.001 for AdamW, decayed with step and cosine schedulers, respectively. In training, we set λ_W = λ_H = λ_b = 5×10⁻⁴, and no weight decay is applied to the other parameters of model M. In the experiments, we take a global weight decay factor λ for all parameters in the models, including the classifiers and biases, with λ = 5×10⁻⁴ for SGD and λ = 0.05 for AdamW. The other hyper-parameters are presented in the supplementary material.
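To make the CE-vs-BCE distinction concrete: in this setting, CE is the usual multi-class cross-entropy (softmax over class logits), while BCE treats each class logit as an independent one-vs-all binary problem against a one-hot target. The sketch below is a minimal NumPy illustration of the two loss definitions on random logits; it is not the paper's released code, and the function names and toy shapes (batch 128, 10 classes, matching the CIFAR10 setup quoted above) are our own assumptions.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class axis.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_loss(logits, targets):
    # Multi-class cross-entropy: -log softmax probability of the true class.
    p = softmax(logits)
    return -np.log(p[np.arange(len(targets)), targets]).mean()

def bce_loss(logits, targets, num_classes):
    # One-vs-all BCE: every logit is scored as an independent binary
    # classifier against the one-hot encoding of the label.
    one_hot = np.eye(num_classes)[targets]
    p = 1.0 / (1.0 + np.exp(-logits))  # sigmoid per class
    eps = 1e-12  # guard against log(0)
    return -(one_hot * np.log(p + eps)
             + (1.0 - one_hot) * np.log(1.0 - p + eps)).mean()

# Toy usage: a batch of 128 random 10-class logits.
rng = np.random.default_rng(0)
logits = rng.normal(size=(128, 10))
targets = rng.integers(0, 10, size=128)
print("CE:", ce_loss(logits, targets), "BCE:", bce_loss(logits, targets, 10))
```

Note the structural difference: CE couples all class logits through the softmax normalization, whereas BCE applies an independent sigmoid per class, which is the property the paper's analysis of feature compactness and distinctiveness builds on.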