Linear Separation Capacity of Self-Supervised Representation Learning
Authors: Shulei Wang
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretical analysis further underscores that the performance of downstream linear classifiers primarily hinges on the linear separability of data representations rather than the size of the labeled data set, reaffirming the viability of constructing efficient classifiers with limited labeled data amid an expansive unlabeled data set. These theoretical findings are also validated through numerical examples in Section 6. Specifically, the numerical experiments suggest that 1) data augmentation can improve the linear separability of the learned representation and 2) the performance of the downstream linear model relies heavily on the linear separability of the representation rather than on the number of labeled data. In this section, we present a series of numerical experiments to validate the theoretical results from the previous sections and to compare the performance of unsupervised and self-supervised learning methods. Specifically, we utilize the MNIST data set (LeCun et al., 1998), which consists of 60,000 training and 10,000 testing images of 28×28 gray-scale handwritten digits. |
| Researcher Affiliation | Academia | Shulei Wang, Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL 61820, USA |
| Pseudocode | Yes | Appendix A. Algorithm for Augmentation Invariant Manifold Learning. This section presents the stochastic optimization algorithm for augmentation invariant manifold learning introduced in Wang (2025), summarized in Algorithm 1. Algorithm 1 (Augmentation Invariant Manifold Learning). Input: a set of data $\{X_1, \ldots, X_n\}$, batch size $n'$, encoder $\Theta_\beta$, stochastic data augmentation transformation $\mathcal{T}$, and tuning parameters $(r, \lambda_1, \lambda_2)$. 1: **for** each sampled minibatch $\{X_i : i \in S\}$ **do** 2: Generate two independent augmented copies of each sample, $X'_i = \mathcal{T}(X_i)$ and $X''_i = \mathcal{T}(X_i)$ for $i \in S$. 3: Evaluate the representation of each augmented sample, $Z' = \{\Theta_\beta(X'_i)\}_{i \in S}$ and $Z'' = \{\Theta_\beta(X''_i)\}_{i \in S}$, where $Z', Z'' \in \mathbb{R}^{n' \times s}$. 4: Evaluate the kernel matrix $W = \{\mathbf{1}(\lVert X'_i - X'_j \rVert \le r)\}_{i,j \in S} \in \mathbb{R}^{n' \times n'}$ and the corresponding Laplacian matrix $L$. 5: Evaluate the loss $\hat{\ell}(\beta) = \mathrm{tr}(Z'^{\top} L Z') + \lambda_1 \lVert Z' - Z'' \rVert_F^2 + \lambda_2 \lVert Z'^{\top} Z'' - I_s \rVert_F^2$, where $I_s$ is an $s \times s$ identity matrix and $\lVert \cdot \rVert_F$ is the Frobenius norm of a matrix. 6: Update $\Theta_\beta$ to minimize $\hat{\ell}(\beta)$. 7: **end for** Output: encoder $\Theta_\beta$. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide any links to a code repository. |
| Open Datasets | Yes | In this section, we present a series of numerical experiments to validate the theoretical results from the previous sections and to compare the performance of unsupervised and self-supervised learning methods. Specifically, we utilize the MNIST data set (LeCun et al., 1998), which consists of 60,000 training and 10,000 testing images of 28×28 gray-scale handwritten digits. |
| Dataset Splits | Yes | Specifically, we utilize the MNIST data set (LeCun et al., 1998), which consists of 60,000 training and 10,000 testing images of 28×28 gray-scale handwritten digits. For our self-supervised learning approach, we adopt the transformation introduced by Wang (2025) to generate augmented data. This involves random resizing, cropping, and rotation of the images. In all the numerical experiments, we compare the performance of Augmentation Invariant Manifold Learning (AIML), as defined in (5), with that of a continuous version of the classical graph Laplacian-based method (CML). To comprehensively study the impact of data augmentation, we formulate the unsupervised graph Laplacian-based method as an optimization problem similar to (5), but with all components related to data augmentation removed. We conduct two sets of numerical experiments to evaluate the influence of sample size on representation learning (via unsupervised or self-supervised methods) and classifier training in the downstream task. In the first set of experiments, we use all training samples (without labels) to learn data representations, and the linear classifier is trained using 20%, 40%, 60%, 80%, and 100% of labeled samples. The left figure in Figure 3 presents the misclassification rates for AIML and CML representations. It is evident that the performance of AIML and CML remains stable as the sample size increases from 40% to 100%, confirming the findings in Theorem 6. In the second set of experiments, we learn the representation from 40%, 60%, 80%, and 100% of unlabeled training samples, and the downstream classifier is trained with 40% of labeled samples. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using a 'convolutional neural network encoder' and 'tuning parameters and optimization algorithms' but does not specify any software libraries or their version numbers (e.g., PyTorch, TensorFlow, scikit-learn versions). |
| Experiment Setup | No | For a fair comparison, we employ the same convolutional neural network encoder with two convolution+ReLU layers, two pooling layers, and a fully connected layer. Additionally, we use identical tuning parameters and optimization algorithms for both AIML and CML. |
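The loss evaluation in Algorithm 1, and its contrast with the augmentation-free CML baseline, can be sketched numerically. The following is a minimal NumPy illustration of a single loss evaluation, not the paper's implementation: a random linear map stands in for the CNN encoder $\Theta_\beta$, additive Gaussian noise stands in for the resize/crop/rotate augmentation, and all constants (`n_prime`, `r`, `lam1`, `lam2`) are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minibatch: n' = 8 samples with 16 features; representation dimension s = 3.
n_prime, d, s = 8, 16, 3
X = rng.normal(size=(n_prime, d))

B = rng.normal(size=(d, s)) / np.sqrt(d)   # hypothetical linear stand-in for Theta_beta
def encode(x):
    return x @ B

def augment(x):                            # stand-in for the stochastic augmentation T
    return x + 0.05 * rng.normal(size=x.shape)

# Steps 2-3: two independent augmented copies and their representations.
X1, X2 = augment(X), augment(X)
Z1, Z2 = encode(X1), encode(X2)

# Step 4: kernel matrix W = 1(||X'_i - X'_j|| <= r) and unnormalized Laplacian L.
r = 6.0
dist = np.linalg.norm(X1[:, None, :] - X1[None, :, :], axis=-1)
W = (dist <= r).astype(float)
L = np.diag(W.sum(axis=1)) - W

# Step 5: AIML loss tr(Z'^T L Z') + lam1*||Z'-Z''||_F^2 + lam2*||Z'^T Z'' - I_s||_F^2.
lam1, lam2 = 1.0, 0.1
aiml = (np.trace(Z1.T @ L @ Z1)
        + lam1 * np.linalg.norm(Z1 - Z2, "fro") ** 2
        + lam2 * np.linalg.norm(Z1.T @ Z2 - np.eye(s), "fro") ** 2)

# CML baseline: the same objective with all augmentation-related terms removed,
# built on the un-augmented data.
distX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
WX = (distX <= r).astype(float)
LX = np.diag(WX.sum(axis=1)) - WX
Z = encode(X)
cml = np.trace(Z.T @ LX @ Z) + lam2 * np.linalg.norm(Z.T @ Z - np.eye(s), "fro") ** 2

print(round(float(aiml), 3), round(float(cml), 3))
```

In an actual training loop, step 6 would backpropagate this loss through the encoder; here the point is only how the three penalty terms fit together.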
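The downstream experiment quoted under "Dataset Splits" trains a linear classifier on frozen representations using increasing fractions of labeled samples. A hedged sketch of those mechanics, on synthetic linearly separable "representations" rather than MNIST (the data, the least-squares classifier, and the split sizes are illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for frozen encoder outputs: two classes in R^3,
# separated by 3 units along the first coordinate.
n, s = 1000, 3
y = rng.integers(0, 2, size=n)
Z = rng.normal(size=(n, s)) + 3.0 * y[:, None] * np.array([1.0, 0.0, 0.0])

train, test = np.arange(800), np.arange(800, 1000)

def misclassification(frac):
    # Fit a least-squares linear classifier on a fraction of the labeled set.
    idx = train[: int(frac * len(train))]
    A = np.hstack([Z[idx], np.ones((len(idx), 1))])        # add intercept
    w, *_ = np.linalg.lstsq(A, 2.0 * y[idx] - 1.0, rcond=None)
    At = np.hstack([Z[test], np.ones((len(test), 1))])
    pred = (At @ w > 0).astype(int)
    return float(np.mean(pred != y[test]))

# Mimic the paper's labeled fractions 20%, 40%, ..., 100%.
rates = {f: misclassification(f) for f in (0.2, 0.4, 0.6, 0.8, 1.0)}
print(rates)
```

Because the synthetic representation is already linearly separable, the test error changes little across labeled fractions, which is the qualitative behavior the quoted experiment reports for AIML and CML representations.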