Synthesizing Minority Samples for Long-tailed Classification via Distribution Matching
Authors: Zhuo Li, He Zhao, Jinke Ren, Anningzhe Gao, Dandan Guo, Xiang Wan, Hongyuan Zha
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on several standard benchmark datasets demonstrate the effectiveness of our method in both long-tailed classification and synthesizing high-quality synthetic minority samples. |
| Researcher Affiliation | Academia | Zhuo Li (EMAIL), Shenzhen International Center for Industrial and Applied Mathematics, Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen; He Zhao (EMAIL), CSIRO's Data61, Australia; Jinke Ren (EMAIL), Shenzhen Future Network of Intelligence Institute, School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, and Guangdong Provincial Key Laboratory of Future Networks of Intelligence; Anningzhe Gao (EMAIL), Shenzhen Research Institute of Big Data; Dandan Guo (EMAIL), Jilin University; Xiang Wan (EMAIL), Shenzhen Research Institute of Big Data; Hongyuan Zha (EMAIL), The Chinese University of Hong Kong, Shenzhen |
| Pseudocode | Yes | We give a whole training paradigm in Alg. 3 in Appendix. B. Algorithm 1: Oversampling Minority Samples via Our Method (In-Distribution). ... Algorithm 2: Oversampling minority samples via our framework (Out-of-Distribution). ... Algorithm 3: Joint Training Paradigm with Synthetic Sample Generation |
| Open Source Code | Yes | Code is available on https://github.com/BIRlz/TMLR_Syn-LT. |
| Open Datasets | Yes | We evaluate our method on CIFAR-LT-10 / CIFAR-LT-100, ImageNet-LT and Places-LT. We build CIFAR-LT-10 / CIFAR-LT-100 from the standard CIFAR-10/CIFAR-100 datasets (Krizhevsky et al., 2009) with IF ∈ {50, 100, 200} (Kim et al., 2020; Kang et al., 2019; Li et al., 2021). ImageNet-LT is a subset of the ImageNet-2012 dataset (Deng et al., 2009) with 1000 classes and IF = 1280/5 (Kim et al., 2020; Ren et al., 2020). Places-LT is a subset of the Places-365 dataset (Zhou et al., 2017) with 365 classes and IF = 4980/5 (Cao et al., 2019; Ren et al., 2020). |
| Dataset Splits | Yes | The original CIFAR-10/CIFAR-100 datasets (Krizhevsky et al., 2009) include 60,000 images and 10/100 classes with a size of 32×32, where there are 50,000 images for training and 10,000 for testing. By following (Kim et al., 2020), we create CIFAR-LT-10 and CIFAR-LT-100 by randomly under-sampling the original datasets with IF = {200, 100, 50}. We use the original test dataset to evaluate our method. |
| Hardware Specification | Yes | We use SGD with momentum 0.9 and weight decay 5e-4 and conduct all the experiments on 8 Tesla-V100 GPUs. |
| Software Dependencies | No | The paper mentions 'SGD with momentum 0.9 and weight decay 5e-4' and refers to neural networks and optimization, implying the use of machine learning frameworks. However, it does not specify any particular software dependencies with version numbers (e.g., PyTorch 1.x, TensorFlow 2.x, Python 3.x, CUDA 11.x). |
| Experiment Setup | Yes | Unless otherwise stated, we set the imbalance factor as IF = N1/NK and use T = 5 iterations with a step size of η = 0.1 to optimize the synthetic samples at each training iteration. The hyper-parameter for the OT entropy constraint is γ = 0.1 and the maximum iteration number in the Sinkhorn algorithm is 200. ... We employ 200 epochs for training f with an initial learning rate α of 0.1, which is decayed by 1e-2 at the 160-th epoch and 180-th epoch. We set the batch size as 32 and start our method at the 160-th epoch, where we set λ1 and λ2 as 0.5, β as 0.999 and τ as 0.9. |
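The imbalance factor IF = N1/NK quoted above fixes only the ratio between the largest and smallest class. The table does not state the per-class profile, but CIFAR-LT benchmarks conventionally use an exponential decay between the head and tail (an assumption here, following the long-tailed literature cited in the paper, e.g. Cao et al., 2019); a minimal sketch:

```python
def long_tailed_counts(n_max, num_classes, imbalance_factor):
    """Per-class sample counts decaying exponentially from head to tail,
    so that counts[0] / counts[-1] == imbalance_factor (IF = N1/NK)."""
    return [
        int(n_max * imbalance_factor ** (-k / (num_classes - 1)))
        for k in range(num_classes)
    ]

# CIFAR-LT-10 with IF = 100: the head class keeps all 5000 training
# images, the tail class keeps 5000 / 100 = 50.
counts = long_tailed_counts(5000, 10, 100)
```

A CIFAR-LT split is then obtained by randomly under-sampling each class of the original training set down to its count, while the test set is left untouched, as stated in the Dataset Splits row.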
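The setup row mentions an OT entropy constraint with γ = 0.1 and at most 200 Sinkhorn iterations. For readers checking those hyper-parameters, here is a generic entropy-regularized Sinkhorn sketch (not the authors' implementation; `gamma` and `max_iter` mirror the reported values):

```python
import numpy as np

def sinkhorn(a, b, C, gamma=0.1, max_iter=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    a, b  : source/target marginals (each sums to 1)
    C     : cost matrix of shape (len(a), len(b))
    gamma : entropy-regularization weight (paper reports 0.1)
    max_iter : iteration cap (paper reports 200)
    Returns the transport plan P with row sums ~ a and column sums ~ b.
    """
    K = np.exp(-C / gamma)            # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(max_iter):
        v = b / (K.T @ u)             # scale columns toward b
        u = a / (K @ v)               # scale rows toward a
    return u[:, None] * K * v[None, :]
```

Smaller γ makes the plan closer to unregularized OT but slows convergence, which is why an iteration cap like 200 is reported alongside it.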