Correlation Clustering with Active Learning of Pairwise Similarities
Authors: Linus Aronsson, Morteza Haghir Chehreghani
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our framework and the proposed query strategies via several experimental studies. ... In this section, we describe our experimental studies, where additional results are presented in Appendix C. ... Figures 1 and 2 illustrate the results for different real-world datasets with a random initialization of σ0 and noise levels γ = 0.2 and γ = 0.4, respectively. We observe that with even a fairly small amount of noise, all the baselines (nCOBRAS, COBRAS and QECC) perform very poorly. |
| Researcher Affiliation | Academia | Linus Aronsson, Chalmers University of Technology; Morteza Haghir Chehreghani, Chalmers University of Technology |
| Pseudocode | Yes | Algorithm 1 Active clustering procedure ... Algorithm 2 Max Correlation Clustering Algorithm A (dynamic k) |
| Open Source Code | No | Implementations of COBRAS and nCOBRAS are publicly available and are thus used in our experiments.5 Finally, we note that there exist other active semi-supervised clustering methods developed in the constraint clustering setting such as NPU (Xiong et al., 2014) used as a baseline in (Soenen et al., 2021). ... Link to the open-source implementations of COBRAS and nCOBRAS: https://github.com/jonassoenen/noise_robust_cobras |
| Open Datasets | Yes | 2. 20newsgroups: consists of 18846 newsgroups posts (in the form of text) on 20 topics (clusters). ... 3. CIFAR10: consists of 60000 32×32 color images in 10 classes, with 6000 images per class. ... 4. MNIST: consists of 60000 28×28 grayscale images of handwritten digits. |
| Dataset Splits | No | For a dataset of N objects, the number of pairwise similarities is |E| = N(N−1)/2, which implies the huge querying space that active learning needs to deal with. We use a batch size of B = |E|/1000 for all datasets, unless otherwise specified. ... For random initialization, we randomly assign each of the objects to one of the ten different clusters resulting in a clustering C. Then, for each (u, v) ∈ E, the initial similarity σ0(u, v) is set to +0.1 if u and v are in the same cluster according to C, and −0.1 otherwise. |
| Hardware Specification | No | The computations and data handling were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) and the Swedish National Infrastructure for Computing (SNIC) at Chalmers Centre for Computational Science and Engineering (C3SE), High Performance Computing Center North (HPC2N) and Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) partially funded by the Swedish Research Council through grant agreements no. 2022-06725 and no. 2018-05973. |
| Software Dependencies | No | We use the distilbert-base-uncased transformer model loaded from the Flair Python library (Akbik et al., 2018) in order to embed each of the 1000 documents (data points) into a 768-dimensional latent space, in which k-means is performed. ... We use a ResNet18 model (He et al., 2015) trained on the full CIFAR10 dataset in order to embed the 1000 images into a 512-dimensional space, in which k-means is performed. ... We use a simple CNN model trained on the MNIST dataset in order to embed the 1000 images into a 128-dimensional space, in which k-means is performed. |
| Experiment Setup | Yes | Initial pairwise similarities. For each experiment, we are given a dataset with ground-truth labels, where the ground-truth labels are only used for evaluations. Then, for each (u, v) ∈ E in a dataset, we set σ*(u, v) to +1 if u and v belong to the same class, and −1 otherwise. ... For random initialization, we randomly assign each of the objects to one of the ten different clusters resulting in a clustering C. Then, for each (u, v) ∈ E, the initial similarity σ0(u, v) is set to +0.1 if u and v are in the same cluster according to C, and −0.1 otherwise. ... Query strategies. We consider five different query strategies: uniform, uncertainty (Eq. 8), frequency (Eq. 9), maxmin (Eq. 10) and maxexp (Eq. 13). We set ϵ = 0.3, τ = 5 and β = 1 for all experiments unless otherwise specified. ... We use a batch size of B = |E|/1000 for all datasets, unless otherwise specified. ... In our experiments, we set T = 3, η = 2^−52 (double precision machine epsilon) and k = |z| in the first iteration of Algorithm 1, and then k = |Ci| for all remaining iterations where |Ci| denotes the number of clusters in the current clustering Ci (in Algorithm 1). |
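The setup quoted above fixes two reproducibility-relevant quantities: the querying space |E| = N(N−1)/2 over all object pairs (and the derived batch size B = |E|/1000), and the random initialization that sets σ0(u, v) = +0.1 for same-cluster pairs and −0.1 otherwise. A minimal sketch of both, assuming a plain dictionary representation of σ0 and a hypothetical helper name `random_initial_similarities` (not from the paper's code):

```python
import itertools
import random

def pair_count(n):
    # |E| = N(N-1)/2 pairwise similarities for N objects.
    return n * (n - 1) // 2

def random_initial_similarities(objects, num_clusters=10, seed=0):
    # Random initialization as described in the paper's setup:
    # assign each object uniformly to one of `num_clusters` clusters,
    # then set sigma0(u, v) = +0.1 if u and v share a cluster, -0.1 otherwise.
    rng = random.Random(seed)
    assignment = {o: rng.randrange(num_clusters) for o in objects}
    return {
        (u, v): 0.1 if assignment[u] == assignment[v] else -0.1
        for u, v in itertools.combinations(objects, 2)
    }

objects = list(range(1000))           # e.g. the 1000 embedded points per dataset
E = pair_count(len(objects))          # 499500 pairs for N = 1000
B = E // 1000                         # batch size B = |E|/1000 -> 499
sigma0 = random_initial_similarities(objects)
assert len(sigma0) == E
```

This illustrates why the paper emphasizes the size of the querying space: even at N = 1000 there are roughly half a million pairs, so active selection of which similarities to query matters.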