BCE vs. CE in Deep Feature Learning

Authors: Qiufu Li, Huibin Xiao, Linlin Shen

ICML 2025

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | The experimental results are aligned with the above analysis, and show that BCE could improve classification and lead to better compactness and distinctiveness among sample features. The code has been released. We conduct extensive experiments and find that, compared to CE, BCE can more quickly lead to NC (neural collapse) on the training dataset and achieves better feature compactness and distinctiveness, resulting in higher classification performance on the test dataset.

Researcher Affiliation | Academia | ¹School of Artificial Intelligence, Shenzhen University, Shenzhen, 518060, China; ²Department of Computer Science, Wenzhou Kean University, Wenzhou, 325060, China; ³Guangdong Provincial Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen, 518060, China. Correspondence to: Linlin Shen <EMAIL>.

Pseudocode | No | The paper describes its methods using mathematical formulations and descriptive text, but does not include any explicitly labeled pseudocode or algorithm blocks.

Open Source Code | Yes | The experimental results are aligned with the above analysis, and show that BCE could improve classification and lead to better compactness and distinctiveness among sample features. The code has been released.

Open Datasets | Yes | To compare CE and BCE in deep feature learning, we train deep classification models, ResNet18, ResNet50 (He et al., 2016), and DenseNet121 (Huang et al., 2017), on three popular datasets: MNIST (LeCun et al., 1998), CIFAR10, and CIFAR100 (Krizhevsky et al., 2009).

Dataset Splits | No | The paper mentions training and test datasets (e.g., "The classification on the test set of CIFAR10 and CIFAR100."), but does not explicitly provide percentages, sample counts, or a detailed methodology for creating these splits in the main text.

Hardware Specification | No | The paper does not explicitly mention the hardware (e.g., GPU or CPU models, or cloud resources) used to run the experiments; it only discusses the models and optimizers used for training.

Software Dependencies | No | The paper mentions deep learning models (ResNet18, ResNet50, DenseNet121), optimizers (SGD and AdamW), and data augmentation techniques (Mixup, CutMix), but does not provide version numbers for any software libraries, frameworks, or programming languages (e.g., PyTorch, TensorFlow, Python).

Experiment Setup | Yes | We train the models using SGD and AdamW for 100 epochs with a batch size of 128. The initial learning rate is set to 0.01 for SGD and 0.001 for AdamW, decayed with step and cosine schedulers, respectively. In training, we set λ_W = λ_H = λ_b = 5×10⁻⁴, and no weight decay is applied to the other parameters of model M. In the experiments, we take a global weight decay factor λ for all parameters in the models, including the classifiers and biases, with λ = 5×10⁻⁴ for SGD and λ = 0.05 for AdamW. The other hyper-parameters are presented in the supplementary material.
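To make the CE-vs-BCE distinction concrete: in this setting, CE is the usual multi-class cross-entropy (softmax over class logits), while BCE treats each class logit as an independent one-vs-all binary problem against a one-hot target. The sketch below is a minimal NumPy illustration of the two loss definitions on random logits; it is not the paper's released code, and the function names and toy shapes (batch 128, 10 classes, matching the CIFAR10 setup quoted above) are our own assumptions.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class axis.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_loss(logits, targets):
    # Multi-class cross-entropy: -log softmax probability of the true class.
    p = softmax(logits)
    return -np.log(p[np.arange(len(targets)), targets]).mean()

def bce_loss(logits, targets, num_classes):
    # One-vs-all BCE: every logit is scored as an independent binary
    # classifier against the one-hot encoding of the label.
    one_hot = np.eye(num_classes)[targets]
    p = 1.0 / (1.0 + np.exp(-logits))  # sigmoid per class
    eps = 1e-12  # guard against log(0)
    return -(one_hot * np.log(p + eps)
             + (1.0 - one_hot) * np.log(1.0 - p + eps)).mean()

# Toy usage: a batch of 128 random 10-class logits.
rng = np.random.default_rng(0)
logits = rng.normal(size=(128, 10))
targets = rng.integers(0, 10, size=128)
print("CE:", ce_loss(logits, targets), "BCE:", bce_loss(logits, targets, 10))
```

Note the structural difference: CE couples all class logits through the softmax normalization, whereas BCE applies an independent sigmoid per class, which is the property the paper's analysis of feature compactness and distinctiveness builds on.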