Fine-Grained Change Point Detection for Topic Modeling with Pitman-Yor Process

Authors: Feifei Wang, Zimeng Zhao, Ruimin Ye, Xiaoge Gu, Xiaoling Lu

JMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental evaluations on both synthetic and real-world datasets demonstrate the effectiveness of TOPIC-PYP in accurately detecting change points and generating high-quality topics. [...] In Section 4, the finite sample performance of the TOPIC-PYP model is demonstrated through various experiments on synthetic data. In Section 5 and Section 6, the TOPIC-PYP model is applied to two real datasets.
Researcher Affiliation | Collaboration | Feifei Wang (EMAIL), Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing 100872, China [...] Ruimin Ye (EMAIL), Game Security Department, Tencent Games, Shenzhen 518057, China
Pseudocode | Yes | Overall, the generative process of TOPIC-PYP is presented below, as illustrated in Figure 2. The generative process contains three stages: Stage 1 determines the number and locations of the change points, Stage 2 employs the Pitman-Yor process to model the changing patterns of topic meanings given the identified change points, and Stage 3 generates the documents.

1. Stage 1: Generation of Change Points. For topic k with 1 ≤ k ≤ K:
   (a) Generate the topic shift probability π_k: π_k ~ Beta(λ0, λ1).
   (b) For the t-th moment with 1 ≤ t ≤ T:
       i. Generate the topic shift indicator: when t = 1, set i_{k,t} = 0; when t > 1, generate i_{k,t} ~ Bernoulli(π_k).
       ii. Compute the index of the segment: s_{k,t} = Σ_{j=1}^{t} i_{k,j} + 1.
   (c) Compute the total number of segments: S_k = Σ_{t=1}^{T} i_{k,t} + 1.
2. Stage 2: Generation of Topics. For topic k with 1 ≤ k ≤ K:
   (a) Generate the basis prior topic-word distribution from a homogeneous Dirichlet distribution: h_k ~ Dir(γ).
   (b) For each segment s ∈ {1, ..., S_k}:
       i. Generate the topic-word distribution using a Pitman-Yor process: φ_{k,s} ~ PYP(a, b, h_k).
3. Stage 3: Generation of Documents. For document d with 1 ≤ d ≤ D_t and 1 ≤ t ≤ T:
   (a) Generate its document-topic distribution over the K topics: θ_{t,d} ~ Dir(α).
   (b) For each word n ∈ {1, ..., N_d}:
       i. Generate the word topic indicator: z_{t,d,n} ~ Multinomial(θ_{t,d}); denote z_{t,d,n} by k* for ease of illustration.
       ii. Find the segment index for topic k*: s* = s_{k*,t}.
       iii. Generate the specific word: w_{t,d,n} ~ Multinomial(φ_{k*,s*}).
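The three stages above can be sketched as a forward simulation. This is an illustrative sketch, not the authors' implementation: all sizes (K, T, V, documents per moment, words per document) are made up for the example, and the Pitman-Yor draw over a finite vocabulary is approximated with a truncated stick-breaking construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not taken from the paper)
K, T, V = 3, 10, 50          # topics, time moments, vocabulary size
D_t, N_d = 5, 20             # documents per moment, words per document
a, b = 0.5, 5.0              # PYP discount and concentration (values from the paper)
gamma, alpha = 0.1, 0.2      # Dirichlet hyperparameters (values from the paper)
lam0, lam1 = 2.0, 5.0        # Beta hyperparameters for the topic shift probability

def pyp_draw(a, b, base, n_atoms=200, rng=rng):
    """Approximate a PYP(a, b, base) draw over a finite vocabulary via
    truncated stick-breaking: weight j follows Beta(1 - a, b + (j + 1) * a)
    and atoms are sampled i.i.d. from the base distribution."""
    weights = np.empty(n_atoms)
    remaining = 1.0
    for j in range(n_atoms):
        v = rng.beta(1 - a, b + (j + 1) * a)
        weights[j] = remaining * v
        remaining *= 1 - v
    atoms = rng.choice(len(base), size=n_atoms, p=base)
    phi = np.zeros(len(base))
    np.add.at(phi, atoms, weights)       # accumulate weights on repeated atoms
    return phi / phi.sum()

# Stage 1: change points per topic
pi = rng.beta(lam0, lam1, size=K)                     # topic shift probabilities
shift = (rng.random((K, T)) < pi[:, None]).astype(int)
shift[:, 0] = 0                                       # i_{k,1} = 0 by definition
segment = shift.cumsum(axis=1) + 1                    # s_{k,t}
S = shift.sum(axis=1) + 1                             # number of segments per topic

# Stage 2: one topic-word distribution per (topic, segment)
h = rng.dirichlet(np.full(V, gamma), size=K)          # basis prior topics h_k
phi = {(k, s): pyp_draw(a, b, h[k])
       for k in range(K) for s in range(1, S[k] + 1)}

# Stage 3: generate the documents
corpus = []
for t in range(T):
    for d in range(D_t):
        theta = rng.dirichlet(np.full(K, alpha))      # document-topic distribution
        z = rng.choice(K, size=N_d, p=theta)          # word topic indicators
        words = [rng.choice(V, p=phi[(k, segment[k, t])]) for k in z]
        corpus.append(words)
```

The stick-breaking truncation (`n_atoms`) controls how closely the finite draw approximates the PYP; a larger truncation gives heavier-tailed reuse of the base topic's words, which is the property the model exploits to let topic meanings drift across segments.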
Open Source Code | No | No explicit statement about the public availability of the TOPIC-PYP model's source code is provided in the paper. The paper only mentions that code for *competing methods* is publicly available.
Open Datasets | No | No concrete access information (link, DOI, repository, or formal citation with authors/year) is provided for the collected Journal or Twitter datasets. The synthetic data is generated following the described process, but the generated data itself is not made openly available.
Dataset Splits | No | The paper describes how the synthetic data is generated (e.g., number of documents, words per document, total moments, number of topics) and the data collection process for the real-world datasets, but it does not specify any training/validation/test splits for experimental reproduction.
Hardware Specification | Yes | All methods are implemented on a server with 8 CPUs and 16 GB memory.
Software Dependencies | No | The paper mentions using the Python package `scweet` for data collection, the Python package `gensim` for the DTM comparison method, and the R package `rollinglda` for the RollingLDA comparison method. However, it does not provide version numbers for these or any other software dependencies crucial for reproducing the TOPIC-PYP implementation.
Experiment Setup | Yes | In all settings, the hyperparameters in PYP are set as a = 0.5 and b = 5. Other hyperparameters used in TOPIC-PYP are set as γ = 0.1, α = 0.2, and λ = {λ0, λ1} = {2, 5}. [...] we set the number of topics as K = 20. As for the hyperparameters, we set γ = 0.1, a = 0.5, b = 5, α = 0.1, and λ = {2, 5} for illustration purposes. [...] We set the number of topics as K = 5, since the whole Twitter dataset focuses on Burger King and the tweets are relatively short. The hyperparameters are set the same as those used in the Journal dataset.
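The quoted settings can be collected into one configuration table for reference. This is only a convenience summary of the values stated in the paper; the dictionary keys and setting names are illustrative and do not come from any released implementation.

```python
# PYP hyperparameters shared across all reported experiments
PYP_COMMON = {"a": 0.5, "b": 5}

# Per-experiment settings as quoted in the paper; "lambda" is (λ0, λ1)
SETTINGS = {
    "synthetic": {**PYP_COMMON, "gamma": 0.1, "alpha": 0.2, "lambda": (2, 5)},
    "journal":   {**PYP_COMMON, "K": 20, "gamma": 0.1, "alpha": 0.1, "lambda": (2, 5)},
    "twitter":   {**PYP_COMMON, "K": 5,  "gamma": 0.1, "alpha": 0.1, "lambda": (2, 5)},
}
```

Note that only α and K vary across datasets; the PYP discount a, concentration b, Dirichlet prior γ, and Beta prior λ are held fixed throughout.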