Persistent Topological Features in Large Language Models

Authors: Yuri Gardinazzi, Karthik Viswanathan, Giada Panerai, Alessio Ansuini, Alberto Cazzaniga, Matteo Biagetti

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To demonstrate the expressivity and applicability of our framework, we highlight how sensitive these descriptors are to different models and a variety of datasets. As a showcase application to a downstream task, we use zigzag persistence to establish a criterion for layer pruning, achieving results comparable to state-of-the-art methods while preserving the system-level perspective.
Researcher Affiliation | Collaboration | *Equal contribution. 1Area Science Park, 2University of Trieste, 3University of Amsterdam. Correspondence to: Matteo Biagetti <EMAIL>.
Pseudocode | Yes | Algorithm 1 (Zigzag algorithm):
    Require: model, dataset, kNN, m
    reps ← extractRepresentations(model, dataset)
    K ← []
    for i = 1 to model.getNumLayers() do
        graph ← kNearestNeighborsGraph(reps[i], kNN)
        K.append(graphExpansion(graph, m))
    end for
    Kint ← computeIntersectionLayers(K)
    f, times ← computeFiltrationTimes(K, Kint)
    Φ ← FastZigZag(f, times)
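The per-layer kNN-graph step of Algorithm 1 can be sketched as below. This is a minimal illustration only: it builds one k-nearest-neighbor graph per layer representation using scikit-learn and networkx, while the graph expansion (`graphExpansion`), intersection layers, and zigzag persistence steps are delegated in the paper to Dionysus2 and FastZigzag and are omitted here. Function names and the mock data are assumptions, not the authors' implementation.

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import kneighbors_graph

def knn_graph(points, k):
    """Undirected k-nearest-neighbor graph of a point cloud."""
    adj = kneighbors_graph(points, n_neighbors=k, mode="connectivity")
    # from_scipy_sparse_array symmetrizes the directed kNN relation
    return nx.from_scipy_sparse_array(adj)

def layer_graphs(reps, k):
    """One kNN graph per layer, as in the loop of Algorithm 1."""
    return [knn_graph(r, k) for r in reps]

# Mock layer representations: 4 "layers" of 100 points in 16 dimensions
rng = np.random.default_rng(0)
reps = [rng.normal(size=(100, 16)) for _ in range(4)]
K = layer_graphs(reps, k=4)
```

Each graph in `K` would then be expanded to a simplicial complex up to dimension m before computing the zigzag filtration.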
Open Source Code | Yes | All the results contained in this work are reproducible by means of a GitHub repository that can be found at this link: https://github.com/RitAreaSciencePark/ZigZagLLMs.
Open Datasets | Yes | We consider the following datasets: 1) The Stanford Sentiment Treebank (SST) dataset (Socher et al., 2013). 2) The Pile dataset (Gao et al., 2020), from which we take a subset of 10K prompts, accessible on Hugging Face. 3) A dataset of mathematical problems (Hendrycks et al., 2021b). 4) A dataset of code retrieved from GitHub.
Dataset Splits | Yes | Additionally, we divide the datasets into incremental subsets of {100, 200, ..., 1000} prompts and compute the mean and standard deviation across subsets to systematically evaluate the scalability of our descriptors and to quantify their sensitivity to changes in point-cloud size. For our experiments, we consider the 500-prompt subsets, amounting to 16 subsets.
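The splitting scheme described above can be sketched as follows. Whether the incremental subsets are nested prefixes and whether the 16 subsets of 500 prompts are disjoint chunks is not stated in the review, so both choices here are assumptions for illustration.

```python
import numpy as np

def incremental_subsets(prompts, sizes=range(100, 1100, 100)):
    """Nested subsets of {100, 200, ..., 1000} prompts (assumed prefixes)."""
    return {n: prompts[:n] for n in sizes if n <= len(prompts)}

def subset_stats(descriptor, prompts, size=500):
    """Mean and std of a descriptor over disjoint fixed-size subsets."""
    values = [descriptor(prompts[i:i + size])
              for i in range(0, len(prompts) - size + 1, size)]
    return float(np.mean(values)), float(np.std(values))

# Example: 2000 mock prompts split into four disjoint 500-prompt chunks
prompts = list(range(2000))
subsets = incremental_subsets(prompts)
mean, std = subset_stats(len, prompts)  # trivial descriptor: subset size
```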
Hardware Specification | Yes | With 10K points embedded in a space with dimension d = 4096, a number of neighbors for the kNN graph of kNN = 10, and a maximum homology dimension of m = 10, on an AMD EPYC 7H12 it takes approximately 2 hours.
Software Dependencies | No | The zigzag algorithm is schematically described below. It exploits two existing public codes developed for zigzag computations: DIONYSUS2 (Morozov) and FASTZIGZAG (Dey & Hou, 2022). DIONYSUS2 is a C++ library for computing persistent homology, with a specific module for zigzag persistence. In our case, it has the role of extracting the filtration f and computing the times array, i.e., the list of layer indices to be associated with the birth and death of features. FASTZIGZAG efficiently computes the persistence diagram Pers_p(Φ) by converting the input zigzag filtration to a non-zigzag filtration of an equivalent complex of the same length, and then converts the obtained persistence intervals back to zigzag. The benchmarks are evaluated for the models with the library lm-eval-harness (Gao et al., 2024) in a 5-shot setup. No specific version numbers are provided for Dionysus2, FastZigzag, or lm-eval-harness.
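Since no version numbers are pinned, anyone reproducing the results would need to record the versions actually installed. A minimal sketch, assuming the Python distribution names are `dionysus` and `lm_eval` (these names are assumptions and may differ in the reproduction environment):

```python
from importlib import metadata

def dependency_versions(packages):
    """Map each package to its installed version, or None if absent.

    Package names are assumptions; adjust to the distributions
    actually installed in the reproduction environment.
    """
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None
    return versions

versions = dependency_versions(["dionysus", "lm_eval"])
```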
Experiment Setup | Yes | We generate zigzag diagrams for each model and dataset and for each homology dimension up to p = 3, for a range of values kNN ∈ [1, 15]. We find that 0-, 2-, and 3-dimensional holes are relatively few in number, while 1-dimensional holes reach tens of thousands of elements per layer. This behavior might be expected for a kNN-graph-based construction, since connections are dense even for low values of kNN, especially if points are concentrated in low-dimensional regions of the representation space. We examine this behavior in detail to make sure that our construction is stable under different choices of the kNN graph; see Appendix D for details. The hyperparameter kNN is chosen so as to maximize the total number of holes. Therefore, in what follows, we show results for our topological descriptors for 1-dimensional holes and kNN = 4 only. [...] We use 3 benchmarks for layer-pruning performance evaluation: MMLU (Hendrycks et al., 2021a), HellaSwag (Zellers et al., 2019), and Winogrande (Sakaguchi et al., 2019), which have been widely used for similar purposes in previous analyses. The benchmarks are evaluated for the models with the library lm-eval-harness (Gao et al., 2024) in a 5-shot setup.
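The kNN sweep described above can be illustrated with a simplified proxy. The sketch below counts independent cycles (the first Betti number, E − V + number of components) of the raw kNN graph across kNN ∈ [1, 15] and picks the maximizer. This is not the paper's criterion: the authors count holes of the expanded complex via zigzag persistence, where the optimum is kNN = 4, whereas this raw-graph cycle count generally grows with k. The data and helper names are assumptions for illustration only.

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import kneighbors_graph

def graph_beta1(points, k):
    """First Betti number of the kNN graph:
    beta1 = E - V + number of connected components."""
    g = nx.from_scipy_sparse_array(
        kneighbors_graph(points, n_neighbors=k, mode="connectivity"))
    return (g.number_of_edges() - g.number_of_nodes()
            + nx.number_connected_components(g))

# Mock single-layer representation: 300 points in 8 dimensions
rng = np.random.default_rng(0)
cloud = rng.normal(size=(300, 8))
counts = {k: graph_beta1(cloud, k) for k in range(1, 16)}
best_k = max(counts, key=counts.get)
```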