Structural-Entropy-Based Sample Selection for Efficient and Effective Learning

Authors: Tianchi Xie, Jiangning Zhu, Guozu Ma, Minzhi Lin, Wei Chen, Weikai Yang, Shixia Liu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Comprehensive experiments in three learning scenarios (supervised learning, active learning, and continual learning) clearly demonstrate the effectiveness of our method."
Researcher Affiliation | Collaboration | (1) BNRist, Tsinghua University; (2) China Telecom Wanwei Information Technology Co., Ltd; (3) Microsoft Research; (4) Hong Kong University of Science and Technology (Guangzhou)
Pseudocode | No | The paper describes the method's steps in prose, for example, in Section 4.2: "Our sampling process contains two steps: 1) identifying the candidate sample with the highest importance score, 2) rejecting the sample if its similarity with any selected neighboring samples exceeds a threshold θ; otherwise, accepting it as a selected sample. These two steps are performed iteratively until no more samples can be selected."
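The two-step process quoted above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the similarity measure (cosine similarity here), the candidate ordering, and the check against all previously selected samples (rather than only "neighboring" ones, which the paper restricts via its structural-entropy-based graph) are simplifying assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def select_samples(features, scores, theta):
    """Greedy selection sketch: repeatedly take the candidate with the
    highest importance score, rejecting it if its similarity to any
    already-selected sample exceeds the threshold theta; otherwise
    accept it. Returns the indices of the selected samples."""
    # Visit candidates in descending order of importance score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = []
    for i in order:
        if all(cosine_similarity(features[i], features[j]) <= theta
               for j in selected):
            selected.append(i)
    return selected
```

For example, with three 2-D features where the second is nearly parallel to the first, the redundant sample is rejected: `select_samples([(1.0, 0.0), (0.99, 0.14), (0.0, 1.0)], [3, 2, 1], theta=0.9)` returns `[0, 2]`.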
Open Source Code | Yes | The implementation is available at https://github.com/thu-vis/SE-based-sample-selection.
Open Datasets | Yes | "For image classification, we use the widely used datasets, CIFAR10, CIFAR100 (Krizhevsky, 2009), and ImageNet-1K (Deng et al., 2009). For text classification, we use the ANLI dataset (Nie et al., 2020)... and the IMDB Review dataset (Maas et al., 2011). For object detection, we use the PASCAL VOC dataset (Everingham et al., 2010). For visual question answering, we use the CC-SBU-Align dataset (Zhu et al., 2024). We use the datasets commonly used in continual learning, including Permuted MNIST, Split MNIST, Split CIFAR10, Split CIFAR100, and Split Tiny-ImageNet."
Dataset Splits | Yes | "The CIFAR10 and CIFAR100 datasets each consist of 50,000 images ... for the training set, with an additional 10,000 images for testing. The ImageNet-1K dataset includes 1,281,167 images ... for training, along with 50,000 images for validation. The ANLI dataset... consists of 100,459 training samples and 1,200 test samples. The IMDB Review dataset contains 25,000 movie reviews each in the training and test splits."
Hardware Specification | Yes | "Specifically, our method reduces the fine-tuning time on a single Nvidia Tesla V100 GPU from approximately 30 minutes to 3 minutes, adding only a negligible selection overhead of 2 seconds."
Software Dependencies | No | The paper mentions various models and frameworks like ResNet-18, RoBERTa, SSD, MiniGPT4, CLIP, and Sentence-BERT, but does not specify any software libraries with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x, CUDA 11.x).
Experiment Setup | Yes | "For image classification... ResNet-18 for 200 epochs on CIFAR10 and CIFAR100... The batch size is set to 64. We use an SGD optimizer with an initial learning rate of 0.1, momentum of 0.9, and weight decay of 0.0002. We use a cosine annealing learning rate scheduler with a minimum learning rate of 0.0001."
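The reported schedule can be reproduced from the standard cosine-annealing formula, which decays the learning rate from its initial value to the stated minimum over the training run. The sketch below assumes the conventional formulation (as implemented, e.g., by PyTorch's `CosineAnnealingLR`); it is an illustration of the reported hyperparameters, not code from the paper.

```python
import math


def cosine_annealing_lr(epoch, total_epochs, lr_max=0.1, lr_min=0.0001):
    """Standard cosine annealing: decays lr_max to lr_min over total_epochs.
    Defaults match the reported setup (initial lr 0.1, minimum lr 0.0001,
    200 epochs on CIFAR10/CIFAR100)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + math.cos(math.pi * epoch / total_epochs)
    )
```

With the reported settings, the schedule starts at 0.1 at epoch 0, passes through roughly 0.05 at epoch 100, and reaches 0.0001 at epoch 200.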