Eigenspectrum Analysis of Neural Networks without Aspect Ratio Bias
Authors: Yuanzhe Hu, Kinshuk Goel, Vlad Killiakov, Yaoqing Yang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate the effectiveness of FARMS, we conduct experiments across various application domains, including CV, SciML, and LLM pruning. In addition, these experiments are conducted with different parameter settings and model architectures. We compare FARMS with several prior HT-SR approaches that use weight eigenspectrum analysis for layer-wise hyperparameter assignment (Zhou et al., 2024; Lu et al., 2024; Liu et al., 2024). We also apply FARMS to measure various post-training and pruned models, making FARMS useful for model compression. Our findings demonstrate that models optimized using FARMS exhibit lower mean and variation in HT-SR metrics across layers, a sign of good-quality training as reported in prior work (Martin et al., 2021; Liu et al., 2024). |
| Researcher Affiliation | Academia | 1 Department of Computer Science and Engineering, University of California, San Diego, CA, USA; 2 Department of Computer Science, Dartmouth College, NH, USA; 3 Department of Computer Science & Engineering, SRM Institute of Science & Technology, India; 4 Independent Researcher, University of California, Berkeley, CA, USA. |
| Pseudocode | No | The paper includes a flowchart in Figure 3 titled 'Main Steps in FARMS', but it does not contain any structured pseudocode or algorithm blocks with step-by-step instructions in a programmatic format. |
| Open Source Code | Yes | Our code is available here1. 1https://github.com/HUST-AI-HYZ/FARMS |
| Open Datasets | Yes | Datasets. For image classification, we consider the CIFAR-100 dataset (Krizhevsky, 2012). CIFAR-100 consists of 50K pictures for the training set and 10K pictures for the testing set with 100 categories. For evaluating LLM pruning methods, we calculate model perplexity on the held-out WikiText (Merity et al., 2017) validation set and use seven tasks, including BoolQ (Clark et al., 2019), RTE (Wang et al., 2019), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC Easy and Challenge (Clark et al., 2018), and OpenBookQA (Mihaylov et al., 2018) for downstream zero-shot evaluation (Gao et al., 2021). For SciML, we fine-tune the models on simulated solutions of the time-dependent PDE dataset 2D Compressible Navier-Stokes (CFD) from PDEBench (Takamoto et al., 2022). |
| Dataset Splits | Yes | For image classification, we consider the CIFAR-100 dataset (Krizhevsky, 2012). CIFAR-100 consists of 50K pictures for the training set and 10K pictures for the testing set with 100 categories. For evaluating LLM pruning methods, we calculate model perplexity on the held-out WikiText (Merity et al., 2017) validation set and use seven tasks, including BoolQ (Clark et al., 2019), RTE (Wang et al., 2019), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC Easy and Challenge (Clark et al., 2018), and OpenBookQA (Mihaylov et al., 2018) for downstream zero-shot evaluation (Gao et al., 2021). |
| Hardware Specification | Yes | In the LLM pruning task, we use 8 L40 GPUs for weight analysis and record the PL Alpha Hill values for each layer. ... In CV and Sci ML experiments, we use a single L40 GPU and do weight analysis every epoch during the training and fine-tuning. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies used in their experiments, such as deep learning frameworks (e.g., PyTorch, TensorFlow) or other libraries. |
| Experiment Setup | Yes | First, we report the common hyperparameters shared by the image classification experiments (Section 4.3): the optimizer is SGD, batch size 128, number of total training epochs 200, weight decay 5e-4, and momentum 0.9. For each experiment setting, we repeat our experiments with three random seeds {43, 37, 13}. We also report the mean and standard deviation of the test accuracy across these seeds. In Table 13, we report the details of experiments for each model and method. We use the same learning rate range as (Zhou et al., 2024) and expand the scaling ratio range to nine choices: [(0.1, 1.9), (0.2, 1.8), (0.3, 1.7), (0.4, 1.6), (0.5, 1.5), (0.6, 1.4), (0.7, 1.3), (0.8, 1.2), (0.9, 1.1)]. Second, we provide the hyperparameters used in the LLM pruning and SciML experiments. We follow the common hyperparameter settings described in Lu et al. (2024); Liu et al. (2024). See more details for other hyperparameters, such as τ in LLM pruning and the scaling ratios in SciML, in Table 14. |
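The HT-SR metric the report refers to, PL Alpha Hill, is a Hill-type power-law exponent estimated from the eigenvalues of each layer's correlation matrix W^T W. The sketch below shows the general idea only; the `k_frac` tail-fraction choice and the eigenvalue handling are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def pl_alpha_hill(W, k_frac=0.5):
    """Hill estimate of the power-law exponent of a layer's eigenspectrum.

    W      : 2-D weight matrix of one layer.
    k_frac : fraction of the largest eigenvalues used as the heavy tail
             (an illustrative default, not the paper's setting).
    """
    # Eigenvalues of the correlation matrix X = W^T W, sorted descending.
    evals = np.sort(np.linalg.eigvalsh(W.T @ W))[::-1]

    k = max(1, int(k_frac * len(evals)))
    lam_k = evals[k]  # threshold eigenvalue below the k-eigenvalue tail

    # Hill estimator over the k largest eigenvalues:
    # alpha = 1 + k / sum_i log(lambda_i / lambda_k)
    return 1.0 + k / np.sum(np.log(evals[:k] / lam_k))
```

Layer-wise schemes like those compared in the table typically compute such an alpha per layer and assign per-layer hyperparameters (e.g., learning rates or pruning ratios) from its mean and spread.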