BCE vs. CE in Deep Feature Learning
Authors: Qiufu Li, Huibin Xiao, Linlin Shen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results are aligned with the above analysis, and show that BCE can improve classification and lead to better compactness and distinctiveness among sample features. The code has been released. We conduct extensive experiments and find that, compared to CE, BCE can more quickly lead to NC on the training dataset and achieves better feature compactness and distinctiveness, resulting in higher classification performance on the test dataset. |
| Researcher Affiliation | Academia | 1School of Artificial Intelligence, Shenzhen University, Shenzhen, 518060, China 2Department of Computer Science, Wenzhou Kean University, Wenzhou, 325060, China 3Guangdong Provincial Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen, 518060, China. Correspondence to: Linlin Shen <EMAIL>. |
| Pseudocode | No | The paper describes methods using mathematical formulations and descriptive text, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The experimental results are aligned with the above analysis, and show that BCE can improve classification and lead to better compactness and distinctiveness among sample features. The code has been released. |
| Open Datasets | Yes | To compare CE and BCE in deep feature learning, we train deep classification models, ResNet18, ResNet50 (He et al., 2016), and DenseNet121 (Huang et al., 2017), on three popular datasets, including MNIST (LeCun et al., 1998), CIFAR10, and CIFAR100 (Krizhevsky et al., 2009). |
| Dataset Splits | No | The paper mentions using training and test datasets (e.g., 'The classification on the test set of CIFAR10 and CIFAR100.'), but does not explicitly provide specific percentages, sample counts, or a detailed methodology for creating these splits in the main text. |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware (e.g., GPU, CPU models, or cloud resources) used for running the experiments. It only discusses the models and optimizers used for training. |
| Software Dependencies | No | The paper mentions deep learning models (ResNet18, ResNet50, DenseNet121), optimizers (SGD and AdamW), and data augmentation techniques (Mixup, CutMix), but does not provide specific version numbers for any software libraries, frameworks, or programming languages used (e.g., PyTorch, TensorFlow, Python versions). |
| Experiment Setup | Yes | We train the models using SGD and AdamW for 100 epochs with a batch size of 128. The initial learning rate is set to 0.01 for SGD and 0.001 for AdamW, decayed with step and cosine schedulers, respectively. In the training, we set λ_W = λ_H = λ_b = 5×10⁻⁴, and no weight decay is applied to the other parameters of model M. In the experiments, we take a global weight decay factor λ for all parameters in the models, including the classifiers and biases, with λ = 5×10⁻⁴ for SGD and λ = 0.05 for AdamW. The other hyper-parameters are presented in the supplementary. |
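To make the CE-vs-BCE comparison at the heart of the paper concrete, the following is a minimal sketch of the two losses on a single example, using only the standard library. Averaging the BCE terms over the number of classes is one common convention and an assumption here, not a detail taken from the paper; the function names are illustrative.

```python
import math

def ce_loss(logits, target):
    """Cross-entropy: softmax over all logits, then the negative
    log-likelihood of the target class (log-sum-exp for stability)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def bce_loss(logits, target):
    """Binary cross-entropy: each logit acts as an independent
    one-vs-rest binary classifier with a sigmoid activation."""
    total = 0.0
    for k, z in enumerate(logits):
        y = 1.0 if k == target else 0.0
        # numerically stable form: max(z,0) - z*y + log(1 + e^{-|z|})
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)  # assumption: mean over classes

logits = [2.0, -1.0, 0.5]  # toy 3-class logits
print(round(ce_loss(logits, 0), 4))   # → 0.2414
print(round(bce_loss(logits, 0), 4))  # → 0.4714
```

The key structural difference the paper analyzes is visible here: CE couples all logits through a single softmax normalization, while BCE scores each class independently, which changes the gradients that shape feature compactness and distinctiveness.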