Learning from Noisy Pairwise Similarity and Unlabeled Data

Authors: Songhua Wu, Tongliang Liu, Bo Han, Jun Yu, Gang Niu, Masashi Sugiyama

JMLR 2022

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental
"In this section, we experimentally investigate the behavior of the proposed method and the baselines for nSU classification on both synthetic and benchmark datasets. All experiments were conducted with 3.10GHz Intel(R) Core(TM) i9-9900 CPU and NVIDIA 2080Ti."
Researcher Affiliation: Academia
Songhua Wu and Tongliang Liu: Sydney AI Centre, The University of Sydney, Sydney, Australia
Bo Han: Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
Jun Yu: Department of Automation, University of Science and Technology of China, Hefei, China
Gang Niu: Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan
Masashi Sugiyama: Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan, and Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan
Pseudocode: Yes
Algorithm 1: nSU classification.
  Input: noisy similar data pairs Ds and unlabeled data Du.
  Output: the classifier ˆf.
  Stage 1. Estimate the similar rate πs and the noise rate ρd:
    intermediate parameters (γ, κ) = MPE(Ds, Du);
    compute (πs, ρd, π+, π−) from (γ, κ).
  Stage 2. Obtain the classifier ˆf:
    if squared loss: compute the analytical solution ˆw by Eq. (19);
    if logistic loss: approximate the optimal classifier f by SGD.
  return ˆf.
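The squared-loss branch of Stage 2 admits a closed-form solution. As a hedged sketch only, the snippet below illustrates that kind of analytical step with a generic regularized least-squares solve; `squared_loss_solution` is a hypothetical stand-in and is not the paper's actual Eq. (19), whose corrected targets depend on the estimated rates from Stage 1.

```python
import numpy as np

def squared_loss_solution(X, z, lam=1e-4):
    """Generic regularized least-squares solve: a sketch of an analytical
    squared-loss step, NOT the paper's Eq. (19). X: (n, d) features,
    z: (n,) targets, lam: L2 regularization strength."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ z)

# Tiny demo on separable synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) + np.where(rng.random(200) < 0.5, 1.5, -1.5)[:, None]
y = np.sign(X[:, 0] + X[:, 1])
w = squared_loss_solution(X, y)
pred = np.sign(X @ w)
```

The closed-form solve is what makes the squared-loss branch attractive: no iterative optimization is needed, in contrast to the SGD path used for the logistic loss.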
Open Source Code: No
The paper provides no links to code repositories and does not state that code for the described methodology is available, either as open source or in supplementary materials.
Open Datasets: Yes
"Here datasets were obtained from the LIBSVM data (Chang and Lin, 2011) and UCI Machine Learning Repository (Dua and Graff, 2017). ... SMS Spam (Almeida et al., 2011) is a public set of short message service (SMS) labeled messages... News20 is a collection of approximately 20,000 newsgroup documents... CIFAR-10 (Krizhevsky et al., 2009) has 32×32×3 color images including 50,000 training images and 10,000 test images of 10 classes."
Dataset Splits: Yes
"To obtain nSU data, first, we collected raw binary classification datasets which consist of positive and negative data, while leaving 10% of the data as test data. ... For all the experiments, the sample size of the noisy similar data pairs was fixed to 4000, while the sample size of the unlabeled data was fixed to 2000."
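The split described above (10% held-out test data, a fixed number of similar pairs, a fixed number of unlabeled points) can be sketched as follows. This is an assumption-laden illustration: `make_nsu_sample` is a hypothetical helper, same-label sampling is one plausible way to form "similar" pairs, and the similarity-noise injection that makes the data "noisy" is omitted.

```python
import numpy as np

def make_nsu_sample(X, y, n_pairs=4000, n_unlabeled=2000, test_frac=0.10, seed=0):
    """Hypothetical sketch of the data construction: hold out test_frac of
    the data as a test set, then draw same-label index pairs and unlabeled
    indices from the remaining pool. Noise on the similarity labels (the
    'noisy' part of nSU) is deliberately omitted here."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(test_frac * len(X))
    test_idx, pool = idx[:n_test], idx[n_test:]
    pair_idx = []
    for _ in range(n_pairs):
        i = rng.choice(pool)                 # first point of the pair
        same = pool[y[pool] == y[i]]         # candidates sharing its label
        pair_idx.append((i, rng.choice(same)))
    unlabeled_idx = rng.choice(pool, size=n_unlabeled, replace=True)
    return pair_idx, unlabeled_idx, test_idx
```

With the paper's settings this would be called as `make_nsu_sample(X, y, n_pairs=4000, n_unlabeled=2000, test_frac=0.10)`.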
Hardware Specification: Yes
"All experiments were conducted with 3.10GHz Intel(R) Core(TM) i9-9900 CPU and NVIDIA 2080Ti."
Software Dependencies: No
"For unsupervised learning methods, we directly used the implementations on scikit-learn (Pedregosa et al., 2011)." While scikit-learn is named, no version number is specified, and no other software versions are mentioned.
Experiment Setup: Yes
"The regularization parameter λ was fixed to 10^-4. For the deep model, we employed a 3-layer MLP (multilayer perceptron) with the softsign activation function (softsign(x) = x/(1 + |x|)). We used the stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.002, which decays every 40 epochs by a factor of 0.1 with 200 epochs in total."
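The two fully specified ingredients of this setup, the softsign activation and the step-decay learning-rate schedule (initial rate 0.002, multiplied by 0.1 every 40 epochs over 200 epochs), are easy to express directly. A minimal sketch; the helper names are my own:

```python
def softsign(x):
    """Softsign activation used for the 3-layer MLP: x / (1 + |x|)."""
    return x / (1 + abs(x))

def lr_at_epoch(epoch, lr0=0.002, decay_every=40, factor=0.1):
    """Step-decay schedule: multiply the rate by `factor` every
    `decay_every` epochs, starting from `lr0`."""
    return lr0 * factor ** (epoch // decay_every)
```

For example, `lr_at_epoch(0)` gives 0.002, and by epoch 80 (two decay steps) the rate has dropped to 2e-5; over the full 200 epochs the schedule decays four times.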