From Logits to Hierarchies: Hierarchical Clustering made Simple
Authors: Emanuele Palumbo, Moritz Vandenhirtz, Alain Ryser, Imant Daunhawer, Julia E Vogt
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present the experimental results of our study. In the first part, we empirically demonstrate that existing specialized deep hierarchical clustering models face significant limitations in realistic scenarios. These limitations stem from their high computational demands and their difficulty in handling a large number of clusters. In contrast, our proposed method demonstrates compelling results on challenging vision datasets, achieving substantially better performance compared to these specialized models. We show our results in Table 1, including metrics to evaluate models at the leaf level and metrics to evaluate the quality of the produced hierarchy. |
| Researcher Affiliation | Academia | 1ETH AI Center, Zurich. 2Department of Computer Science, ETH Zurich. Correspondence to: Emanuele Palumbo <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Logits to Hierarchies (L2H). Given aggregation function Λ, and functions g_θ, g_θ^m defined as above for a pre-trained K-clustering model f_θ. Input: Dataset D. Output: Hierarchy H. |
| Open Source Code | Yes | In Figure 4, we provide a Python implementation of the L2H algorithm proposed in this work using standard scientific computing libraries (NumPy, SciPy). |
| Open Datasets | Yes | In this work, we run experiments on five challenging vision datasets, namely CIFAR-10 and CIFAR-100 (Krizhevsky, 2009), Food-101 (Bossard et al., 2014), ImageNet-1K (Deng et al., 2009), as well as iNaturalist21 (Van Horn et al., 2021), introduced in Appendix C. |
| Dataset Splits | Yes | The CIFAR-10 dataset consists of 60000 32x32 colored images, divided into 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. The train/test splits contain 50000 and 10000 images, respectively. The CIFAR-100 dataset likewise consists of 60000 32x32 colored images, but organized into 100 classes, which are further grouped into 20 superclasses. As for CIFAR-10, the train/test splits contain 50000 and 10000 images, respectively. The Food101 dataset is a fine-grained classification dataset of food images, consisting of 101000 images across 101 classes. Images are high-resolution, up to 512 pixels side length, and are split into 75750 training and 25250 test samples. The ImageNet-1K dataset, widely used in computer vision, consists of 1000 classes organized according to the WordNet hierarchy (Miller, 1995), with 1281167 training and 50000 test samples, respectively. |
| Hardware Specification | No | The text only mentions "on a CPU" and "on a GPU" generally, without providing specific models, manufacturers, or detailed specifications (e.g., NVIDIA A100, Intel Xeon). |
| Software Dependencies | No | The text mentions Python libraries such as NumPy, SciPy, and scikit-learn but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We train both TEMI and TURTLE with a number of clusters K equal to the true number of classes in each dataset (unless explicitly stated otherwise, e.g. in the sensitivity analysis reported in Figure 5). For each dataset, we train models on the training set, then report metrics on the test set. Note that the L2H algorithm takes as input logits from the training set to infer the hierarchy, while metrics that evaluate the quality of the hierarchy are computed on the test set. As the aggregation function Λ in the L2H algorithm (see Section 3) we employ [...] which we find to work well experimentally. However, other choices are possible (see also Table 7). |
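The pipeline the table describes, a pre-trained K-clustering model producing logits, an aggregation function Λ summarizing each cluster, and a hierarchy built on top, can be illustrated with a minimal sketch using the same libraries the paper names (NumPy, SciPy). This is not the authors' Algorithm 1 or their Figure 4 code: the random logits, the mean as a stand-in for Λ, and the use of SciPy average-linkage clustering are all illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)

# Hypothetical stand-in for logits produced by a pre-trained
# K-clustering model f_θ on the training set: N samples, K clusters.
N, K = 1000, 10
logits = rng.normal(size=(N, K))

# Hard cluster assignment per sample (argmax over the K logits).
assignments = logits.argmax(axis=1)

# Aggregation step: one simple choice for Λ (the paper evaluates
# several, see its Table 7) is to average the logit vectors of all
# samples assigned to each cluster, giving a K-dim profile per cluster.
profiles = np.stack(
    [logits[assignments == k].mean(axis=0) for k in range(K)]
)

# Build a binary hierarchy over the K clusters from their profiles.
# SciPy's agglomerative clustering returns a (K-1) x 4 linkage matrix
# encoding the merge tree, which plays the role of the output H.
Z = linkage(profiles, method="average")
print(Z.shape)
```

Note how cheap this post-hoc step is relative to training a specialized deep hierarchical clustering model: the hierarchy is inferred from a K x K array of aggregated logits, so it runs in well under a second on a CPU even for K in the hundreds.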