Accelerating Spectral Clustering under Fairness Constraints

Authors: Francesco Tonin, Alex Lambert, Johan Suykens, Volkan Cevher

ICML 2025

Reproducibility assessment. Each entry lists the variable, the result, and the supporting LLM response (quoted excerpts are from the paper):
Research Type: Experimental. "Numerical experiments demonstrate the effectiveness of our approach on both synthetic and real-world benchmarks, showing significant speedups in computation time over prior art, especially as the problem size grows. This work thus represents a considerable step forward towards the adoption of fair clustering in real-world applications."
Researcher Affiliation: Academia. "(1) LIONS, EPFL, Switzerland; (2) ESAT-STADIUS, KU Leuven, Belgium. Correspondence to: Francesco Tonin <EMAIL>."
Pseudocode: Yes. "Our algorithm is summarized in Algorithm 1. The penalty α^(i+1) is updated according to the standard rule suggested in (Boyd et al., 2011) and detailed in Appendix B."
Open Source Code: No. The paper does not explicitly state that the source code for the described methodology is being released, nor does it provide any links to a code repository.
Open Datasets: Yes. "We consider the following synthetic datasets: m-SBM, Rand Laplace, and Elliptical. The m-SBM benchmark (Kleindessner et al., 2019) is a modification of the stochastic block model (Holland et al., 1983) to take fairness into account. ... Elliptical has k = 2 clusters and h = 2 groups, as defined in (Feng et al., 2024). We consider the following real-world datasets, summarized in Table 10 in Appendix. Last FMNet (Rozemberczki & Sarkar, 2020) ... Thyroid (Quinlan, 1987) ... Census (Kohavi et al., 1996) ... 4area (Ahmadian et al., 2019)."
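The excerpt only names the synthetic benchmarks. As a rough illustration of the model family that m-SBM builds on, the sketch below draws a plain stochastic block model whose nodes also carry one of h group labels. The probabilities p_in and p_out, the uniform cluster/group assignment, and the function name are illustrative assumptions, not the exact m-SBM construction of Kleindessner et al. (2019).

```python
import numpy as np

def sbm_with_groups(n, k, h, p_in=0.6, p_out=0.1, seed=0):
    """Sketch of an SBM-style graph whose nodes also carry one of h
    demographic group labels (illustrative only; NOT the exact m-SBM
    of Kleindessner et al., 2019)."""
    rng = np.random.default_rng(seed)
    clusters = rng.integers(0, k, size=n)  # ground-truth cluster per node
    groups = rng.integers(0, h, size=n)    # protected group per node
    # Edge probability depends on whether the endpoints share a cluster.
    same = clusters[:, None] == clusters[None, :]
    probs = np.where(same, p_in, p_out)
    # Sample the upper triangle and symmetrize to get an undirected graph.
    upper = np.triu(rng.random((n, n)) < probs, 1)
    W = (upper + upper.T).astype(float)
    return W, clusters, groups

W, clusters, groups = sbm_with_groups(n=100, k=2, h=2)
```

In a fairness-aware variant, the group labels would additionally modulate the edge probabilities; here they are only attached to the nodes so that a fair-clustering method can read them off.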
Dataset Splits: No. The paper describes how synthetic datasets are generated (e.g., "Rand Laplace dataset is a graph induced by a random n × n symmetric adjacency matrix W with each node randomly assigned to one of h = 2 groups") and details parameter settings for the experiments (e.g., "m-SBM benchmark with k = 50, h = 5 and varying n"). However, it does not explicitly provide training/validation/test splits as typically used in supervised machine learning; the evaluation focuses on clustering quality and runtime.
Hardware Specification: Yes. "Experiments are implemented in Python 3.10 on a machine with a 3.6 GHz Intel i7-9700K processor and 64 GB RAM."
Software Dependencies: No. "Experiments are implemented in Python 3.10... Regarding optimization of the dual DC problem, we employ the LBFGS algorithm in the scipy implementation... The compared methods o-FSC (Kleindessner et al., 2019) and s-FSC (Wang et al., 2023) employ the scipy eigendecomposition routine." The paper pins Python 3.10, but for libraries such as scipy it names only the library without a version number, which is insufficient for full reproducibility.
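The quoted setup names scipy's LBFGS implementation without a version. The snippet below shows, on a toy strongly convex quadratic standing in for the dual DC objective, how the stopping tolerances reported later (gtol = 10^-3, ftol = 10^-4) map onto scipy.optimize.minimize with method="L-BFGS-B"; the objective, gradient, and problem size are invented for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-in for the dual DC objective: a strongly convex quadratic.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20))
Q = A @ A.T + np.eye(20)          # symmetric positive definite
b = rng.standard_normal(20)

fun = lambda x: 0.5 * x @ Q @ x - b @ x   # objective f(x)
jac = lambda x: Q @ x - b                 # analytic gradient

# Standard-normal initialization, as described in the paper's setup.
x0 = rng.standard_normal(20)
res = minimize(fun, x0, jac=jac, method="L-BFGS-B",
               options={"gtol": 1e-3, "ftol": 1e-4})
```

Note that scipy's L-BFGS-B uses its own internal line search; the paper's statement about strong Wolfe conditions describes the authors' configuration, which a specific scipy version may or may not expose directly.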
Experiment Setup: Yes. "Regarding optimization of the dual DC problem, we employ the LBFGS algorithm in the scipy implementation with backtracking line search using the strong Wolfe conditions, with initialization from the standard normal distribution; the stopping condition is given by gtol = 10^-3, ftol = 10^-4. The ADMM algorithm is run with α^(0) = 0.005, T = 10. The compared methods o-FSC (Kleindessner et al., 2019) and s-FSC (Wang et al., 2023) employ the scipy eigendecomposition routine. For s-FSC, we use L̂₁ as shift, as suggested in their paper. The k-means algorithm is run to obtain the clustering indicators for all compared methods. ... The update rule for α in Algorithm 1 is standard in ADMM (Boyd et al., 2011) and is given by

α^(i+1) = τ·α^(i)    if ‖R^(i)‖_F > µ‖S^(i)‖_F,
α^(i+1) = α^(i)/τ    if ‖S^(i)‖_F > µ‖R^(i)‖_F,
α^(i+1) = α^(i)      otherwise,

with primal and dual residuals R^(i) = MH^(i) − Y^(i) and S^(i) = α^(i)(Y^(i) − Y^(i+1)). The parameters τ and µ are set to 2 and 10, respectively."
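The residual-balancing rule quoted above is the standard ADMM penalty update from Boyd et al. (2011). A minimal sketch, using the paper's reported values τ = 2 and µ = 10 (this is a generic implementation, not the authors' released code):

```python
import numpy as np

def update_penalty(alpha, R, S, tau=2.0, mu=10.0):
    """Residual-balancing ADMM penalty update (Boyd et al., 2011).

    R: primal residual matrix, S: dual residual matrix.
    """
    r = np.linalg.norm(R, "fro")
    s = np.linalg.norm(S, "fro")
    if r > mu * s:
        return tau * alpha   # primal residual dominates: increase penalty
    if s > mu * r:
        return alpha / tau   # dual residual dominates: decrease penalty
    return alpha             # residuals balanced: keep penalty
```

Starting from α^(0) = 0.005 as in the paper, the rule doubles α whenever the primal residual exceeds 10 times the dual residual, and halves it in the symmetric case, which tends to keep the two residuals within an order of magnitude of each other.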