Fine-Grained Change Point Detection for Topic Modeling with Pitman-Yor Process
Authors: Feifei Wang, Zimeng Zhao, Ruimin Ye, Xiaoge Gu, Xiaoling Lu
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental evaluations on both synthetic and real-world datasets demonstrate the effectiveness of TOPIC-PYP in accurately detecting change points and generating high-quality topics. [...] In Section 4, the finite sample performance of the TOPIC-PYP model is demonstrated through various experiments on synthetic data. In Section 5 and Section 6, the TOPIC-PYP model is applied to two real datasets. |
| Researcher Affiliation | Collaboration | Feifei Wang, Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, 100872, China [...] Ruimin Ye, Game Security Department, Tencent Games, Shenzhen, 518057, China |
| Pseudocode | Yes | Overall, the generative process of TOPIC-PYP is presented below and illustrated in Figure 2. The generative process contains three stages: Stage 1 describes the process of determining the number and locations of the change points, Stage 2 employs the Pitman-Yor process to model the changing patterns of topic meanings given the identified change points, and Stage 3 is the final process of generating the documents. **Stage 1: Generation of Change Points.** For topic k with 1 ≤ k ≤ K: (a) generate the topic-shift probability π_k ~ Beta(λ0, λ1); (b) for the t-th moment with 1 ≤ t ≤ T: (i) generate the topic-shift indicator: when t = 1, set i_{k,t} = 0; when t > 1, generate i_{k,t} ~ Bernoulli(π_k); (ii) compute the segment index s_{k,t} = Σ_{j=1}^{t} i_{k,j} + 1; (c) compute the total number of segments S_k = Σ_{t=1}^{T} i_{k,t} + 1. **Stage 2: Generation of Topics.** For topic k with 1 ≤ k ≤ K: (a) generate the basis prior topic-word distribution from a homogeneous Dirichlet distribution: h_k ~ Dir(γ); (b) for each segment s ∈ {1, …, S_k}: (i) generate the topic-word distribution using a Pitman-Yor process: φ_{k,s} ~ PYP(a, b, h_k). **Stage 3: Generation of Documents.** For document d with 1 ≤ d ≤ D_t and 1 ≤ t ≤ T: (a) generate its document-topic distribution over K topics: θ_{t,d} ~ Dir(α); (b) for word n ∈ {1, …, N_d}: (i) generate the word-topic indicator z_{t,d,n} ~ Multinomial(θ_{t,d}), and denote z_{t,d,n} by k* for ease of illustration; (ii) find the segment index for topic k*: s* = s_{k*,t}; (iii) generate the word w_{t,d,n} ~ Multinomial(φ_{k*,s*}). |
| Open Source Code | No | No explicit statement about the public availability of the TOPIC-PYP model's source code is provided in the paper. The paper only notes that code for *competing methods* is publicly available. |
| Open Datasets | No | No concrete access information (link, DOI, repository, or formal citation with authors/year) is provided for the collected Journal or Twitter datasets. The synthetic data is generated following the described process, but the generated data itself is not made openly available. |
| Dataset Splits | No | The paper describes how synthetic data is generated (e.g., number of documents, words per document, total moments, number of topics) and the data collection process for real-world datasets, but it does not specify any training/test/validation dataset splits for experimental reproduction. |
| Hardware Specification | Yes | All methods are implemented on a server with 8 CPUs and 16 GB memory. |
| Software Dependencies | No | The paper mentions using the Python package `scweet` for data collection and `gensim` for the DTM comparison method, and the `R package rollinglda` for the Rolling LDA comparison method. However, it does not provide specific version numbers for these or any other software dependencies crucial for reproducing TOPIC-PYP's implementation. |
| Experiment Setup | Yes | In all settings, the hyperparameters in PYP are set as a = 0.5 and b = 5. Other hyperparameters used in TOPIC-PYP are set as γ = 0.1, α = 0.2, and λ = {λ0, λ1} = {2, 5}. [...] we set the number of topics as K = 20. As for the hyperparameters, we set γ = 0.1, a = 0.5, b = 5, α = 0.1, and λ = {2, 5} for illustration purpose. [...] We set the number of topics as K = 5, since the whole Twitter dataset focuses on Burger King and the tweets are relatively short. The hyperparameters are set as the same as those used in the Journal dataset. |
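The three-stage generative process quoted above can be sketched as a small forward simulation. This is an illustrative reconstruction, not the authors' implementation: the dimensions (`K`, `T`, `V`, `D_t`, `N_d`) are invented for the example, and the Pitman-Yor draw uses a truncated stick-breaking approximation with atoms sampled from the base distribution, which is one standard way to simulate PYP(a, b, h). The hyperparameter values (a = 0.5, b = 5, γ = 0.1, α = 0.2, λ = {2, 5}) are those reported in the paper's experiment setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (NOT from the paper)
K, T, V = 3, 5, 50           # topics, time moments, vocabulary size
D_t, N_d = 4, 20             # documents per moment, words per document

# Hyperparameters as reported in the paper's synthetic-data setting
a, b = 0.5, 5.0              # PYP discount and concentration
gamma, alpha = 0.1, 0.2      # Dirichlet priors for h_k and theta
lam0, lam1 = 2.0, 5.0        # Beta prior on the topic-shift probability

def pyp_draw(a, b, base, trunc=200):
    """Truncated stick-breaking approximation of a draw from PYP(a, b, base),
    where atoms are vocabulary indices sampled from the base distribution."""
    betas = rng.beta(1 - a, b + a * np.arange(1, trunc + 1))
    sticks = betas * np.concatenate(([1.0], np.cumprod(1 - betas)[:-1]))
    atoms = rng.choice(len(base), size=trunc, p=base)
    phi = np.zeros(len(base))
    np.add.at(phi, atoms, sticks)       # accumulate stick mass per word
    return phi / phi.sum()              # renormalize the truncated draw

# Stage 1: change points per topic
pi = rng.beta(lam0, lam1, size=K)                     # topic-shift probabilities
i_shift = np.zeros((K, T), dtype=int)                 # i_{k,1} = 0 by convention
i_shift[:, 1:] = rng.random((K, T - 1)) < pi[:, None]
seg = np.cumsum(i_shift, axis=1)                      # zero-based segment index s_{k,t} - 1

# Stage 2: one topic-word distribution per (topic, segment)
h = rng.dirichlet(np.full(V, gamma), size=K)          # basis prior topics h_k
phi = [[pyp_draw(a, b, h[k]) for _ in range(seg[k, -1] + 1)] for k in range(K)]

# Stage 3: generate documents
corpus = []
for t in range(T):
    for _ in range(D_t):
        theta = rng.dirichlet(np.full(K, alpha))      # document-topic distribution
        z = rng.choice(K, size=N_d, p=theta)          # word-topic indicators
        words = [rng.choice(V, p=phi[k][seg[k, t]]) for k in z]
        corpus.append(words)
```

Note that each topic's word distribution is resampled from the same base h_k at every change point, which is what lets the model capture shifting topic meanings while keeping segments of one topic related through the shared base.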