Accelerating Spectral Clustering under Fairness Constraints
Authors: Francesco Tonin, Alex Lambert, Johan Suykens, Volkan Cevher
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments demonstrate the effectiveness of our approach on both synthetic and real-world benchmarks, showing significant speedups in computation time over prior art, especially as the problem size grows. This work thus represents a considerable step forward towards the adoption of fair clustering in real-world applications. |
| Researcher Affiliation | Academia | 1LIONS, EPFL, Switzerland 2ESAT-STADIUS, KU Leuven, Belgium. Correspondence to: Francesco Tonin <EMAIL>. |
| Pseudocode | Yes | Our algorithm is summarized in Algorithm 1. The penalty α(i+1) is updated according to the standard rule suggested in (Boyd et al., 2011) and detailed in Appendix B. |
| Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is being released, nor does it provide any links to a code repository. |
| Open Datasets | Yes | We consider the following synthetic datasets: m-SBM, Rand Laplace, and Elliptical. The m-SBM benchmark (Kleindessner et al., 2019) is a modification of the stochastic block model (Holland et al., 1983) to take fairness into account. ... Elliptical has k = 2 clusters and h = 2 groups, as defined in (Feng et al., 2024). We consider the following real-world datasets, summarized in Table 10 in Appendix. LastFMNet (Rozemberczki & Sarkar, 2020) ... Thyroid (Quinlan, 1987) ... Census (Kohavi et al., 1996) ... 4area (Ahmadian et al., 2019). |
| Dataset Splits | No | The paper describes how synthetic datasets are generated (e.g., 'Rand Laplace dataset is a graph induced by a random n × n symmetric adjacency matrix W with each node randomly assigned to one of h = 2 groups') and details parameter settings for the experiments (e.g., 'm-SBM benchmark with k = 50, h = 5 and varying n'). However, it does not explicitly provide training/test/validation dataset splits as typically used in supervised machine learning tasks for reproducibility. The evaluation focuses on the clustering results and runtime. |
| Hardware Specification | Yes | Experiments are implemented in Python 3.10 on a machine with a 3.6GHz Intel i7-9700K processor and 64GB RAM. |
| Software Dependencies | No | Experiments are implemented in Python 3.10... Regarding optimization of the dual DC problem, we employ the LBFGS algorithm in the scipy implementation... The compared methods o-FSC (Kleindessner et al., 2019) and s-FSC (Wang et al., 2023) employ the scipy eigendecomposition routine. The paper mentions Python 3.10 as a specific version, but for other software like `scipy`, it only mentions the library name without a version number, which is insufficient for full reproducibility. |
| Experiment Setup | Yes | Regarding optimization of the dual DC problem, we employ the L-BFGS algorithm in the scipy implementation with backtracking line search using the strong Wolfe conditions, with initialization from the standard normal distribution; the stopping condition is given by gtol = 10⁻³, ftol = 10⁻⁴. The ADMM algorithm is run with α(0) = 0.005, T = 10. The compared methods o-FSC (Kleindessner et al., 2019) and s-FSC (Wang et al., 2023) employ the scipy eigendecomposition routine. For s-FSC, we use ‖L̂‖₁ as shift, as suggested in their paper. The k-means algorithm is run to obtain the clustering indicators for all compared methods. ... The update rule for α in Algorithm 1 is standard in ADMM (Boyd et al., 2011): α(i+1) = τα(i) if ‖R(i)‖_F > µ‖S(i)‖_F, α(i)/τ if ‖S(i)‖_F > µ‖R(i)‖_F, and α(i) otherwise, with primal and dual residuals R(i) = MH(i) − Y(i) and S(i) = α(i)(Y(i) − Y(i+1)). The parameters τ and µ are set to 2 and 10, respectively. |
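The optimizer settings quoted in the Experiment Setup row (scipy L-BFGS with a strong-Wolfe line search, gtol = 10⁻³, ftol = 10⁻⁴, standard-normal initialization) could be reproduced roughly as follows. This is a sketch only: `dual_dc_objective` below is a hypothetical quadratic stand-in for the paper's actual dual DC objective, which the report does not spell out.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def dual_dc_objective(x):
    # Hypothetical stand-in for the paper's dual DC objective:
    # a smooth quadratic, used only to illustrate the solver call.
    # Returns (function value, gradient) so that jac=True works.
    return 0.5 * float(np.dot(x, x)), x

x0 = rng.standard_normal(5)  # initialization from the standard normal distribution
res = minimize(
    dual_dc_objective,
    x0,
    jac=True,                 # objective returns (f, grad)
    method="L-BFGS-B",        # scipy's L-BFGS; its line search enforces strong Wolfe conditions
    options={"gtol": 1e-3, "ftol": 1e-4},
)
print(res.fun)
```

Note that scipy's `ftol` for L-BFGS-B is a relative tolerance on the objective decrease, so whether it matches the paper's stopping rule exactly depends on their convention.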
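The residual-balancing update for the ADMM penalty α quoted in the Experiment Setup row can be sketched as a small helper. The values `tau=2` and `mu=10` follow the paper; the function itself is an illustrative reading of the standard rule from Boyd et al. (2011), not the authors' code.

```python
def update_penalty(alpha, r_norm, s_norm, tau=2.0, mu=10.0):
    """Standard ADMM residual-balancing rule (Boyd et al., 2011).

    alpha  : current penalty parameter alpha(i)
    r_norm : Frobenius norm of the primal residual ||R(i)||_F
    s_norm : Frobenius norm of the dual residual   ||S(i)||_F
    """
    if r_norm > mu * s_norm:   # primal residual dominates -> increase penalty
        return tau * alpha
    if s_norm > mu * r_norm:   # dual residual dominates -> decrease penalty
        return alpha / tau
    return alpha               # residuals balanced -> keep penalty
```

With the paper's α(0) = 0.005, a dominant primal residual would double the penalty to 0.01, and a dominant dual residual would halve it to 0.0025.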