From Logits to Hierarchies: Hierarchical Clustering made Simple
Authors: Emanuele Palumbo, Moritz Vandenhirtz, Alain Ryser, Imant Daunhawer, Julia E Vogt
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present the experimental results of our study. In the first part, we empirically demonstrate that existing specialized deep hierarchical clustering models face significant limitations in realistic scenarios. These limitations stem from their high computational demands and their difficulty in handling a large number of clusters. In contrast, our proposed method demonstrates compelling results on challenging vision datasets, achieving substantially better performance compared to these specialized models. We show our results in Table 1, including metrics to evaluate models at the leaf level and metrics to evaluate the quality of the produced hierarchy. |
| Researcher Affiliation | Academia | 1ETH AI Center, Zurich. 2Department of Computer Science, ETH Zurich. Correspondence to: Emanuele Palumbo <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Logits to Hierarchies (L2H). Given aggregation function Λ, and functions g_θ, g_θ^m defined as above for a pre-trained K-clustering model f_θ. Input: Dataset D. Output: Hierarchy H. |
| Open Source Code | Yes | In Figure 4, we provide a Python implementation of the L2H algorithm proposed in this work using standard scientific computing libraries (NumPy, SciPy). |
| Open Datasets | Yes | In this work, we run experiments on five challenging vision datasets, namely CIFAR-10 and CIFAR-100 (Krizhevsky, 2009), Food-101 (Bossard et al., 2014), ImageNet-1K (Deng et al., 2009), as well as iNaturalist21 (Van Horn et al., 2021), introduced in Appendix C. |
| Dataset Splits | Yes | The CIFAR-10 dataset consists of 60000 32x32 colored images, divided into 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. The train/test splits contain 50000 and 10000 images, respectively. The CIFAR-100 dataset likewise consists of 60000 32x32 colored images, but organized into 100 classes, which are further grouped into 20 superclasses. As for CIFAR-10, the train/test splits contain 50000 and 10000 images, respectively. The Food101 dataset is a fine-grained classification dataset of food images, consisting of 101000 images across 101 classes. Images are high-resolution, up to 512 pixels side length, and are split into 75750 training and 25250 test samples. The ImageNet-1K dataset, widely used in computer vision, consists of 1000 classes organized according to the WordNet hierarchy (Miller, 1995), with 1281167 training and 50000 test samples, respectively. |
| Hardware Specification | No | The text only mentions "on a CPU" and "on a GPU" generally, without providing specific models, manufacturers, or detailed specifications (e.g., NVIDIA A100, Intel Xeon). |
| Software Dependencies | No | The text mentions Python libraries such as NumPy, SciPy, and scikit-learn but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We train both TEMI and TURTLE with a number of clusters K equal to the true number of classes in each dataset (unless explicitly stated otherwise, e.g. in the sensitivity analysis reported in Figure 5). For each dataset, we train models on the training set, then report metrics on the test set. Note that the L2H algorithm takes as input logits from the training set to infer the hierarchy, while metrics that evaluate the quality of the hierarchy are computed on the test set. As the aggregation function Λ in the L2H algorithm (see Section 3) we employ [...] which we find to work well experimentally. However, other choices are possible (see also Table 7). |
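The pipeline the table describes, a pre-trained K-clustering model producing logits, an aggregation function Λ summarizing each cluster, and a hierarchy built on top, can be illustrated with a minimal sketch using the same libraries the paper names (NumPy, SciPy). This is not the authors' Algorithm 1 or their Figure 4 code: the random logits, the mean as a stand-in for Λ, and the use of SciPy average-linkage clustering are all illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)

# Hypothetical stand-in for logits produced by a pre-trained
# K-clustering model f_θ on the training set: N samples, K clusters.
N, K = 1000, 10
logits = rng.normal(size=(N, K))

# Hard cluster assignment per sample (argmax over the K logits).
assignments = logits.argmax(axis=1)

# Aggregation step: one simple choice for Λ (the paper evaluates
# several, see its Table 7) is to average the logit vectors of all
# samples assigned to each cluster, giving a K-dim profile per cluster.
profiles = np.stack(
    [logits[assignments == k].mean(axis=0) for k in range(K)]
)

# Build a binary hierarchy over the K clusters from their profiles.
# SciPy's agglomerative clustering returns a (K-1) x 4 linkage matrix
# encoding the merge tree, which plays the role of the output H.
Z = linkage(profiles, method="average")
print(Z.shape)
```

Note how cheap this post-hoc step is relative to training a specialized deep hierarchical clustering model: the hierarchy is inferred from a K x K array of aggregated logits, so it runs in well under a second on a CPU even for K in the hundreds.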