URLOST: Unsupervised Representation Learning without Stationarity or Topology
Authors: Zeyu Yun, Juexiao Zhang, Yann LeCun, Yubei Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate its effectiveness on three diverse data modalities including simulated biological vision data, neural recordings from the primary visual cortex, and gene expressions. Compared to state-of-the-art unsupervised learning methods like SimCLR and MAE, our model excels at learning meaningful representations across diverse modalities without knowing their stationarity or topology. It also outperforms other methods that are not dependent on these factors, setting a new benchmark in the field. |
| Researcher Affiliation | Collaboration | Zeyu Yun (1), Juexiao Zhang (3), Yann LeCun (3, 4), Yubei Chen (2); 1 UC Berkeley, 2 UC Davis, 3 New York University, 4 FAIR at Meta |
| Pseudocode | Yes | We follow the steps from [51] to perform spectral clustering with a modification to adjust the density: 1. Define D to be the diagonal matrix whose (i, i)-element is the sum of A's i-th row. Construct the matrix L = P^{1/2} D^{-1/2} A D^{-1/2} P^{1/2} (the standard normalized affinity D^{-1/2} A D^{-1/2} of [51], rescaled symmetrically by the density matrix P). 2. Find x_1, x_2, ..., x_k, the k largest eigenvectors of L, and form the matrix X = [x_1, x_2, ..., x_k] ∈ R^{n×k} by stacking the eigenvectors in columns. 3. Form the matrix Y from X by renormalizing each of X's rows to have unit norm (i.e. Y_ij = X_ij / (Σ_j X_ij^2)^{1/2}). 4. Treating each row of Y as a point in R^k, cluster the rows into k clusters via K-means or another algorithm. |
| Open Source Code | Yes | Code is available at this repository. |
| Open Datasets | Yes | The synthetic dataset is referred to as Foveated CIFAR-10. To make a comprehensive comparison, we also conduct experiments on the original CIFAR-10, and a Permuted CIFAR-10 dataset obtained by randomly permuting the image. ... V1 neural response dataset. The dataset, published by [57], contains responses from over 10,000 V1 neurons captured via two-photon calcium imaging. ... Gene expression dataset. The dataset comes from The Cancer Genome Atlas (TCGA) [79; 84] |
| Dataset Splits | Yes | We also randomly partition the dataset to do five-fold cross-validation and report the average performance in Table 2. |
| Hardware Specification | Yes | The experiment is performed using a single RTX 2080TI and is averaged over 500 trials. |
| Software Dependencies | No | The paper mentions 'Adam optimizer' and 'cosine annealing learning rate scheduler' but does not specify any software libraries or their version numbers. For example, it doesn't mention specific Python packages like PyTorch or TensorFlow with versions. |
| Experiment Setup | Yes | β-VAE. β-VAE was trained for 1000 epochs and 300 epochs on the V1 neural and TCGA gene expression datasets respectively. We use the Adam optimizer with a learning rate of 0.001 and a cosine annealing learning rate scheduler. The encoder is composed of a 2-layer MLP with batch normalization and LeakyReLU activation. We use hidden dimensions 2048 and 1024 for the V1 and Gene datasets respectively. ... For CIFAR-10, we ran our model for 10,000 epochs. We use the Adam optimizer with a learning rate of 0.00015 and cosine annealing. To fit our tasks, we use 8-layer encoders and 4-layer decoders with hidden dimension 192. |
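The spectral clustering steps quoted in the Pseudocode row can be sketched in NumPy. This is a minimal sketch of the Ng-Jordan-Weiss procedure from [51]; the paper's density adjustment is represented here as an optional symmetric rescaling by a diagonal density matrix, which is an assumption about the exact form of the modification (the `density_adjusted_spectral_embedding` name and its signature are hypothetical):

```python
import numpy as np

def density_adjusted_spectral_embedding(A, k, density=None):
    """Spectral embedding per the quoted steps (Ng et al. [51]).

    A: (n, n) symmetric affinity matrix.
    k: number of clusters / eigenvectors to keep.
    density: optional (n,) density weights; the symmetric rescaling
        below is an assumed form of the paper's density adjustment.
    """
    # Step 1: D is diagonal with the row sums of A; form D^{-1/2} A D^{-1/2}.
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = A * np.outer(d_inv_sqrt, d_inv_sqrt)
    if density is not None:
        # Assumed density adjustment: P^{1/2} (D^{-1/2} A D^{-1/2}) P^{1/2}.
        p_sqrt = np.sqrt(density)
        L = L * np.outer(p_sqrt, p_sqrt)
    # Step 2: take the k largest eigenvectors (eigh returns ascending order).
    w, v = np.linalg.eigh(L)
    X = v[:, -k:]
    # Step 3: renormalize each row of X to unit norm.
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Step 4: the caller clusters the rows of Y with K-means (or similar).
    return Y
```

On a near-block-diagonal affinity, rows of `Y` belonging to the same block land at (nearly) the same point on the unit sphere, so any standard K-means on the rows recovers the clusters.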
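The cosine annealing learning rate scheduler mentioned in the Experiment Setup row follows the standard formula: the rate decays from its initial value to a minimum over the training run along half a cosine period. A minimal sketch (assuming a minimum learning rate of 0, the common default):

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine annealing: decay lr_max -> lr_min over total_steps.

    lr(t) = lr_min + (lr_max - lr_min) * (1 + cos(pi * t / T)) / 2
    """
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1.0 + math.cos(math.pi * step / total_steps)
    )
```

For example, with the β-VAE setting of `lr_max = 0.001`, the schedule starts at 0.001, passes through 0.0005 at the midpoint, and reaches `lr_min` at the final step.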