Fine-Grained Change Point Detection for Topic Modeling with Pitman-Yor Process

Authors: Feifei Wang, Zimeng Zhao, Ruimin Ye, Xiaoge Gu, Xiaoling Lu

JMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental evaluations on both synthetic and real-world datasets demonstrate the effectiveness of TOPIC-PYP in accurately detecting change points and generating high-quality topics. [...] In Section 4, the finite sample performance of the TOPIC-PYP model is demonstrated through various experiments on synthetic data. In Section 5 and Section 6, the TOPIC-PYP model is applied to two real datasets.
Researcher Affiliation | Collaboration | Feifei Wang (EMAIL), Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing 100872, China [...] Ruimin Ye (EMAIL), Game Security Department, Tencent Games, Shenzhen 518057, China
Pseudocode | Yes | Overall, the generative process of TOPIC-PYP is presented below, as illustrated in Figure 2. The generative process contains three stages: Stage 1 determines the number and locations of the change points, Stage 2 employs the Pitman-Yor process to model the changing patterns of topic meanings given the identified change points, and Stage 3 generates the documents.

1. Stage 1: Generation of Change Points. For topic k with 1 ≤ k ≤ K:
   (a) Generate the topic shift probability π_k: π_k ~ Beta(λ0, λ1).
   (b) For the t-th moment with 1 ≤ t ≤ T:
       i. Generate the topic shift indicator: when t = 1, set i_{k,t} = 0; when t > 1, generate i_{k,t} ~ Bernoulli(π_k).
       ii. Compute the index of the segment: s_{k,t} = Σ_{j=1}^{t} i_{k,j} + 1.
   (c) Compute the total number of segments: S_k = Σ_{t=1}^{T} i_{k,t} + 1.
2. Stage 2: Generation of Topics. For topic k with 1 ≤ k ≤ K:
   (a) Generate the basis prior topic-word distribution from a homogeneous Dirichlet distribution: h_k ~ Dir(γ).
   (b) For each segment s ∈ {1, ..., S_k}:
       i. Generate the topic-word distribution using a Pitman-Yor process: φ_{k,s} ~ PYP(a, b, h_k).
3. Stage 3: Generation of Documents. For document d with 1 ≤ d ≤ D_t and 1 ≤ t ≤ T:
   (a) Generate its document-topic distribution over the K topics: θ_{t,d} ~ Dir(α).
   (b) For each word n ∈ {1, ..., N_d}:
       i. Generate the word topic indicator: z_{t,d,n} ~ Multinomial(θ_{t,d}); denote z_{t,d,n} by k* for ease of illustration.
       ii. Find the segment index for topic k*: s* = s_{k*,t}.
       iii. Generate the specific word: w_{t,d,n} ~ Multinomial(φ_{k*,s*}).
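The three stages above can be sketched as a forward simulation. This is an illustrative sketch, not the authors' implementation: all sizes (K, T, V, documents per moment, words per document) are made up for the example, and the Pitman-Yor draw over a finite vocabulary is approximated with a truncated stick-breaking construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not taken from the paper)
K, T, V = 3, 10, 50          # topics, time moments, vocabulary size
D_t, N_d = 5, 20             # documents per moment, words per document
a, b = 0.5, 5.0              # PYP discount and concentration (values from the paper)
gamma, alpha = 0.1, 0.2      # Dirichlet hyperparameters (values from the paper)
lam0, lam1 = 2.0, 5.0        # Beta hyperparameters for the topic shift probability

def pyp_draw(a, b, base, n_atoms=200, rng=rng):
    """Approximate a PYP(a, b, base) draw over a finite vocabulary via
    truncated stick-breaking: weight j follows Beta(1 - a, b + (j + 1) * a)
    and atoms are sampled i.i.d. from the base distribution."""
    weights = np.empty(n_atoms)
    remaining = 1.0
    for j in range(n_atoms):
        v = rng.beta(1 - a, b + (j + 1) * a)
        weights[j] = remaining * v
        remaining *= 1 - v
    atoms = rng.choice(len(base), size=n_atoms, p=base)
    phi = np.zeros(len(base))
    np.add.at(phi, atoms, weights)       # accumulate weights on repeated atoms
    return phi / phi.sum()

# Stage 1: change points per topic
pi = rng.beta(lam0, lam1, size=K)                     # topic shift probabilities
shift = (rng.random((K, T)) < pi[:, None]).astype(int)
shift[:, 0] = 0                                       # i_{k,1} = 0 by definition
segment = shift.cumsum(axis=1) + 1                    # s_{k,t}
S = shift.sum(axis=1) + 1                             # number of segments per topic

# Stage 2: one topic-word distribution per (topic, segment)
h = rng.dirichlet(np.full(V, gamma), size=K)          # basis prior topics h_k
phi = {(k, s): pyp_draw(a, b, h[k])
       for k in range(K) for s in range(1, S[k] + 1)}

# Stage 3: generate the documents
corpus = []
for t in range(T):
    for d in range(D_t):
        theta = rng.dirichlet(np.full(K, alpha))      # document-topic distribution
        z = rng.choice(K, size=N_d, p=theta)          # word topic indicators
        words = [rng.choice(V, p=phi[(k, segment[k, t])]) for k in z]
        corpus.append(words)
```

The stick-breaking truncation (`n_atoms`) controls how closely the finite draw approximates the PYP; a larger truncation gives heavier-tailed reuse of the base topic's words, which is the property the model exploits to let topic meanings drift across segments.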
Open Source Code | No | No explicit statement about the public availability of the TOPIC-PYP model's source code is provided in the paper. The paper only mentions that code for *competing methods* is publicly available.
Open Datasets | No | No concrete access information (link, DOI, repository, or formal citation with authors/year) is provided for the collected Journal or Twitter datasets. The synthetic data is generated following the described process, but the generated data itself is not made openly available.
Dataset Splits | No | The paper describes how the synthetic data is generated (e.g., number of documents, words per document, total moments, number of topics) and the data collection process for the real-world datasets, but it does not specify any training/validation/test splits for experimental reproduction.
Hardware Specification | Yes | All methods are implemented on a server with 8 CPUs and 16 GB memory.
Software Dependencies | No | The paper mentions using the Python package `scweet` for data collection, the Python package `gensim` for the DTM comparison method, and the R package `rollinglda` for the RollingLDA comparison method. However, it does not provide version numbers for these or any other software dependencies crucial for reproducing the TOPIC-PYP implementation.
Experiment Setup | Yes | In all settings, the hyperparameters in PYP are set as a = 0.5 and b = 5. Other hyperparameters used in TOPIC-PYP are set as γ = 0.1, α = 0.2, and λ = {λ0, λ1} = {2, 5}. [...] we set the number of topics as K = 20. As for the hyperparameters, we set γ = 0.1, a = 0.5, b = 5, α = 0.1, and λ = {2, 5} for illustration purposes. [...] We set the number of topics as K = 5, since the whole Twitter dataset focuses on Burger King and the tweets are relatively short. The hyperparameters are set the same as those used in the Journal dataset.
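The quoted settings can be collected into one configuration table for reference. This is only a convenience summary of the values stated in the paper; the dictionary keys and setting names are illustrative and do not come from any released implementation.

```python
# PYP hyperparameters shared across all reported experiments
PYP_COMMON = {"a": 0.5, "b": 5}

# Per-experiment settings as quoted in the paper; "lambda" is (λ0, λ1)
SETTINGS = {
    "synthetic": {**PYP_COMMON, "gamma": 0.1, "alpha": 0.2, "lambda": (2, 5)},
    "journal":   {**PYP_COMMON, "K": 20, "gamma": 0.1, "alpha": 0.1, "lambda": (2, 5)},
    "twitter":   {**PYP_COMMON, "K": 5,  "gamma": 0.1, "alpha": 0.1, "lambda": (2, 5)},
}
```

Note that only α and K vary across datasets; the PYP discount a, concentration b, Dirichlet prior γ, and Beta prior λ are held fixed throughout.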