Information-Theoretic Generative Clustering of Documents

Authors: Xin Du, Kumiko Tanaka-Ishii

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We show that GC achieves state-of-the-art performance, outperforming previous clustering methods, often by a large margin. Furthermore, we show an application to generative document retrieval, in which documents are indexed via hierarchical clustering, and our method improves retrieval accuracy. We empirically show consistent improvement on four document-clustering datasets. The clustering results are summarized in Table 2, with rows representing methods or models and columns representing datasets and evaluation metrics. For instance, on the R2 dataset, GC achieved 96.1% accuracy, reducing the error rate from 8.0% to 3.9%. The NMI and ARI scores also improved to 77.8 (from 65.6) and 84.9 (from 70.5), respectively.
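The accuracy, NMI, and ARI figures quoted above follow standard clustering-evaluation practice. As an illustrative sketch only (not the authors' code), best-match clustering accuracy can be computed by aligning predicted cluster IDs to gold labels with the Hungarian algorithm, with NMI and ARI taken from scikit-learn; the label arrays below are toy data:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """Best-match accuracy: permute predicted cluster IDs to best align
    with gold labels (Hungarian algorithm), then score as accuracy."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    # contingency[p, t] = number of points in predicted cluster p with gold label t
    contingency = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        contingency[p, t] += 1
    row, col = linear_sum_assignment(contingency, maximize=True)
    return contingency[row, col].sum() / len(y_true)

# Toy example: cluster IDs are arbitrary; one point is misassigned.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [1, 1, 0, 0, 0, 0]
acc = clustering_accuracy(y_true, y_pred)           # 5 of 6 points align
nmi = normalized_mutual_info_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)
```

The Hungarian alignment is what lets raw cluster indices (which carry no label semantics) be scored against gold classes.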
Researcher Affiliation Academia Xin Du and Kumiko Tanaka-Ishii Waseda University 4-6-1 Okubo, Shinjuku-ku, Tokyo, 169-8555 Japan EMAIL, EMAIL
Pseudocode Yes Algorithm 1: Generative clustering of documents. Algorithm 2: Hierarchical clustering.
Open Source Code Yes Code https://github.com/kduxin/lmgc
Open Datasets Yes We conducted an evaluation of our method using four document clustering datasets: R2, R5, AG News, and Yahoo! Answers, as summarized in Table 1. R2 and R5 are subsets of Reuters-21578 and respectively contain documents from the largest two (EARN, ACQ) and largest five (also CRUDE, TRADE, MONEY-FX) topics, following Guan et al. (2022). For AG News (Gulli 2005), we used the version provided by Zhang, Zhao, and LeCun (2015). Yahoo! Answers is a more challenging dataset with more documents and clusters. Two datasets, NQ320K (Kwiatkowski et al. 2019) and MS Marco Lite (Du, Xiu, and Tanaka-Ishii 2024), were used for evaluation, with the numbers of documents listed in the top row of Table 3.
Dataset Splits Yes For AG News and Yahoo! Answers, we merged the training and test splits, as our method is unsupervised. Clustering algorithms are sensitive to initialization and affected by randomness. To mitigate this, we performed model selection by running each method 10 times with different seeds and selecting the run with the lowest total distortion. We repeated this process 100 times, and report the mean performance of the selected runs.
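The model-selection protocol described above (10 seeded runs, keep the run with the lowest total distortion, repeat, report the mean) can be sketched as follows. This is a stand-in, not the paper's pipeline: the paper's GC clusters next-token distributions under a KL-based distortion, while here ordinary K-means inertia on synthetic blobs plays the role of total distortion, and the repeat count is scaled down for speed:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import normalized_mutual_info_score

# Synthetic stand-in for document representations.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

def select_best_run(X, k, n_runs=10, rng=None):
    """Run clustering n_runs times with different seeds and keep the
    run with the lowest total distortion (here: K-means inertia)."""
    if rng is None:
        rng = np.random.default_rng(0)
    best = None
    for _ in range(n_runs):
        seed = int(rng.integers(2**31 - 1))
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X)
        if best is None or km.inertia_ < best.inertia_:
            best = km
    return best

# Repeat the selection several times and report the mean NMI of the
# selected runs (the paper repeats 100 times; 5 here for brevity).
rng = np.random.default_rng(42)
scores = [
    normalized_mutual_info_score(y, select_best_run(X, k=3, rng=rng).labels_)
    for _ in range(5)
]
mean_nmi = float(np.mean(scores))
```

Selecting by distortion rather than by the evaluation metric keeps the protocol fully unsupervised, since distortion needs no gold labels.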
Hardware Specification No On the R2 dataset, for example, calculation of P finishes within 10 minutes on a single GPU.
Software Dependencies No For the language model, we used the pretrained doc2query model all-with_prefix-t5-base-v1.1. For BERT-based baselines, we tested the original BERT model (Devlin et al. 2019) and multiple SBERT models (Reimers and Gurevych 2019) that were fine-tuned for document representation and clustering. The pretrained SBERT models are available at https://sbert.net/docs/sentence_transformer/pretrained_models.html.
Experiment Setup Yes We set to 0.25 and J to 1024 by default. We set K to the number of clusters in each dataset, as listed in the rightmost column of Table 1. To mitigate sensitivity to initialization and randomness, we performed the model-selection procedure described under Dataset Splits. For the retrieval experiments, we set K = 30 and J = 4096, and used a small retrieval model with about 30M parameters to highlight the clustering method's advantage.
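The retrieval setup above indexes documents by hierarchical clustering, so that each document's identifier is its path through the cluster tree. The paper's Algorithm 2 does this with GC's own clustering; the sketch below shows only the general recursive docid scheme, substituting plain Euclidean K-means on placeholder vectors (function and variable names are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

def assign_docids(X, indices, k, leaf_size, prefix, out):
    """Recursively split the documents in `indices` into k clusters;
    each document's ID is the tuple of cluster choices along its path,
    ending with its rank inside the leaf."""
    if len(indices) <= max(leaf_size, k):
        for rank, i in enumerate(indices):
            out[int(i)] = prefix + (rank,)
        return
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(X[indices])
    for c in range(k):
        assign_docids(X, indices[km.labels_ == c], k, leaf_size,
                      prefix + (c,), out)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 16))   # stand-in document vectors
docids = {}
assign_docids(X, np.arange(len(X)), k=4, leaf_size=8, prefix=(), out=docids)
```

Because sibling branches share a prefix but diverge at the next position, the resulting IDs are unique, and semantically similar documents share long ID prefixes, which is what a generative retrieval model exploits when decoding docids token by token.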