Information-Theoretic Generative Clustering of Documents

Authors: Xin Du, Kumiko Tanaka-Ishii

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We show that GC achieves state-of-the-art performance, outperforming previous clustering methods, often by a large margin. Furthermore, we show an application to generative document retrieval, in which documents are indexed via hierarchical clustering, and our method improves retrieval accuracy. We empirically show consistent improvement on four document-clustering datasets. The clustering results are summarized in Table 2, with rows representing methods or models and columns representing datasets and evaluation metrics. For instance, on the R2 dataset, GC achieved 96.1% accuracy, reducing the error rate from 8.0% to 3.9%. The NMI and ARI scores also improved to 77.8 (from 65.6) and 84.9 (from 70.5), respectively.
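The accuracy, NMI, and ARI figures quoted above follow standard clustering-evaluation practice. As an illustrative sketch only (not the authors' code), best-match clustering accuracy can be computed by aligning predicted cluster IDs to gold labels with the Hungarian algorithm, with NMI and ARI taken from scikit-learn; the label arrays below are toy data:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """Best-match accuracy: permute predicted cluster IDs to best align
    with gold labels (Hungarian algorithm), then score as accuracy."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    # contingency[p, t] = number of points in predicted cluster p with gold label t
    contingency = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        contingency[p, t] += 1
    row, col = linear_sum_assignment(contingency, maximize=True)
    return contingency[row, col].sum() / len(y_true)

# Toy example: cluster IDs are arbitrary; one point is misassigned.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [1, 1, 0, 0, 0, 0]
acc = clustering_accuracy(y_true, y_pred)           # 5 of 6 points align
nmi = normalized_mutual_info_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)
```

The Hungarian alignment is what lets raw cluster indices (which carry no label semantics) be scored against gold classes.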
Researcher Affiliation Academia Xin Du and Kumiko Tanaka-Ishii Waseda University 4-6-1 Okubo, Shinjuku-ku, Tokyo, 169-8555 Japan EMAIL, EMAIL
Pseudocode Yes Algorithm 1: Generative clustering of documents. Algorithm 2: Hierarchical clustering.
Open Source Code Yes Code https://github.com/kduxin/lmgc
Open Datasets Yes We conducted an evaluation of our method using four document clustering datasets: R2, R5, AG News, and Yahoo! Answers, as summarized in Table 1. R2 and R5 are subsets of Reuters-21578 and respectively contain documents from the largest two (EARN, ACQ) and largest five (also CRUDE, TRADE, MONEY-FX) topics, following Guan et al. (2022). For AG News (Gulli 2005), we used the version provided by Zhang, Zhao, and LeCun (2015). Yahoo! Answers is a more challenging dataset with more documents and clusters. Two datasets, NQ320K (Kwiatkowski et al. 2019) and MS Marco Lite (Du, Xiu, and Tanaka-Ishii 2024), were used for evaluation, with the numbers of documents listed in the top row of Table 3.
Dataset Splits Yes For AG News and Yahoo! Answers, we merged the training and test splits, as our method is unsupervised. Clustering algorithms are sensitive to initialization and affected by randomness. To mitigate this, we performed model selection by running each method 10 times with different seeds and selecting the run with the lowest total distortion. We repeated this process 100 times, and report the mean performance of the selected runs.
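The model-selection protocol described above (10 seeded runs, keep the run with the lowest total distortion, repeat, report the mean) can be sketched as follows. This is a stand-in, not the paper's pipeline: the paper's GC clusters next-token distributions under a KL-based distortion, while here ordinary K-means inertia on synthetic blobs plays the role of total distortion, and the repeat count is scaled down for speed:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import normalized_mutual_info_score

# Synthetic stand-in for document representations.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

def select_best_run(X, k, n_runs=10, rng=None):
    """Run clustering n_runs times with different seeds and keep the
    run with the lowest total distortion (here: K-means inertia)."""
    if rng is None:
        rng = np.random.default_rng(0)
    best = None
    for _ in range(n_runs):
        seed = int(rng.integers(2**31 - 1))
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X)
        if best is None or km.inertia_ < best.inertia_:
            best = km
    return best

# Repeat the selection several times and report the mean NMI of the
# selected runs (the paper repeats 100 times; 5 here for brevity).
rng = np.random.default_rng(42)
scores = [
    normalized_mutual_info_score(y, select_best_run(X, k=3, rng=rng).labels_)
    for _ in range(5)
]
mean_nmi = float(np.mean(scores))
```

Selecting by distortion rather than by the evaluation metric keeps the protocol fully unsupervised, since distortion needs no gold labels.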
Hardware Specification No On the R2 dataset, for example, calculation of P finishes within 10 minutes on a single GPU.
Software Dependencies No For the language model, we used the pretrained doc2query model all-with_prefix-t5-base-v1.1. For BERT-based baselines, we tested the original BERT model (Devlin et al. 2019) and multiple SBERT models (Reimers and Gurevych 2019) that were fine-tuned for document representation and clustering. The pretrained SBERT models are available at https://sbert.net/docs/sentence_transformer/pretrained_models.html.
Experiment Setup Yes We set to 0.25 and J to 1024 by default. We set K to the number of clusters in each dataset, as listed in the rightmost column of Table 1. To mitigate sensitivity to initialization and randomness, we performed the model-selection procedure described under Dataset Splits. For the retrieval experiments, we set K = 30 and J = 4096, and used a small retrieval model with about 30M parameters to highlight the clustering method's advantage.
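The retrieval setup above indexes documents by hierarchical clustering, so that each document's identifier is its path through the cluster tree. The paper's Algorithm 2 does this with GC's own clustering; the sketch below shows only the general recursive docid scheme, substituting plain Euclidean K-means on placeholder vectors (function and variable names are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

def assign_docids(X, indices, k, leaf_size, prefix, out):
    """Recursively split the documents in `indices` into k clusters;
    each document's ID is the tuple of cluster choices along its path,
    ending with its rank inside the leaf."""
    if len(indices) <= max(leaf_size, k):
        for rank, i in enumerate(indices):
            out[int(i)] = prefix + (rank,)
        return
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(X[indices])
    for c in range(k):
        assign_docids(X, indices[km.labels_ == c], k, leaf_size,
                      prefix + (c,), out)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 16))   # stand-in document vectors
docids = {}
assign_docids(X, np.arange(len(X)), k=4, leaf_size=8, prefix=(), out=docids)
```

Because sibling branches share a prefix but diverge at the next position, the resulting IDs are unique, and semantically similar documents share long ID prefixes, which is what a generative retrieval model exploits when decoding docids token by token.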