Estimation of a Low-rank Topic-Based Model for Information Cascades

Authors: Ming Yu, Varun Gupta, Mladen Kolar

JMLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on synthetic and real data demonstrate the improved performance and better interpretability of our model compared to existing state-of-the-art methods."
Researcher Affiliation | Academia | "Ming Yu (EMAIL), Varun Gupta (EMAIL), Mladen Kolar (EMAIL); Booth School of Business, The University of Chicago, Chicago, IL 60637, USA"
Pseudocode | Yes | "Algorithm 1: Proximal gradient descent for (14) with regularizer g1(·)"; "Algorithm 2: Gradient descent with hard thresholding for (14) with regularizer g2(·)"
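The core update in a proximal gradient method alternates a gradient step on the smooth loss with the proximal operator of the regularizer; for an ℓ1 penalty such as g1, the proximal operator is elementwise soft-thresholding. A minimal sketch on a toy lasso problem (the least-squares objective, the matrix `A`, and all parameter values here are illustrative stand-ins, not the paper's objective (14)):

```python
import numpy as np

def soft_threshold(x, tau):
    # Proximal operator of tau * ||x||_1: shrink each entry toward zero by tau
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def proximal_gradient(grad_f, prox_g, x0, step, n_iter=500):
    # Generic proximal gradient descent: x <- prox_g(x - step * grad_f(x), step)
    x = x0
    for _ in range(n_iter):
        x = prox_g(x - step * grad_f(x), step)
    return x

# Toy problem: minimize 0.5*||Ax - b||^2 + lam*||x||_1 (sparse ground truth)
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 20))
x_true = np.array([1.0] * 3 + [0.0] * 17)
b = A @ x_true + 0.01 * rng.normal(size=50)
lam = 0.5
grad_f = lambda x: A.T @ (A @ x - b)
step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L with L = largest singular value squared
x_hat = proximal_gradient(grad_f,
                          lambda v, s: soft_threshold(v, s * lam),
                          np.zeros(20), step)
```

The soft-thresholding step is what produces exact zeros in the iterates, which is how the ℓ1-regularized variant recovers a sparse influence/receptivity structure.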
Open Source Code | Yes | "The codes are available at https://github.com/ming93/Influence_Receptivity_Network"
Open Datasets | Yes | "The first data set is the MemeTracker data set (Leskovec et al., 2009). This data set contains 172 million news articles and blog posts from 1 million online sources over a period of one year from September 1, 2008 till August 31, 2009. ... Data available at http://www.memetracker.org/data.html"; "The second data set is the arXiv high-energy physics theory citation network data set (Leskovec et al., 2005; Gehrke et al., 2003). This data set includes all papers published in the arXiv high-energy physics theory section from 1992 to 2003. ... Data available at http://snap.stanford.edu/data/cit-HepTh.html"
Dataset Splits | Yes | "For all three models, we fit the model on a training data set and choose the regularization parameter λ on a validation data set. Each setting of n is repeated 5 times and we report the average value. We consider two metrics to compare our model with NetRate (Gomez-Rodriguez et al., 2011) and TopicCascade (Du et al., 2013b): (1) we generate independent n = 5000 test data and calculate the negative log-likelihood function on test data for the three models."; "Finally we check the performance of our method on about 1500 test cascades and compare with NetRate and TopicCascade."; "Finally we check the performance of our method on about 1200 test cascades and compare with NetRate and TopicCascade."
Hardware Specification | No | "We run the three methods on 12 kernels. For NetRate and TopicCascade, since they are separable in each column, we run 12 columns in parallel; for our method, we calculate the gradient in parallel. We use our Algorithm 1 for our method and the proximal gradient algorithm for the other two methods, as suggested in Gomez-Rodriguez et al. (2016). This work was completed in part with resources provided by the University of Chicago Research Computing Center."
Software Dependencies | No | "For our problem, we choose to develop a gradient-based algorithm. For the regularizer g1, since the ℓ1 norm is non-smooth, we develop a proximal gradient descent algorithm (Parikh and Boyd, 2014); for the regularizer g2, we use an iterative hard thresholding algorithm (Yu et al., 2020)."; "Topic Modeling (LDA) with the text information of each cascade."
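For the g2 regularizer, iterative hard thresholding is gradient descent with a projection onto the set of k-sparse vectors after every step. A hedged sketch on a synthetic sparse least-squares problem (the problem instance, sizes, and sparsity level are illustrative assumptions; the paper applies the idea to its cascade likelihood, not to least squares):

```python
import numpy as np

def hard_threshold(x, k):
    # Projection onto k-sparse vectors: keep the k largest-magnitude entries
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-k:]
    out[keep] = x[keep]
    return out

def iht(grad_f, x0, step, k, n_iter=300):
    # Gradient descent with hard thresholding after every gradient step
    x = hard_threshold(x0, k)
    for _ in range(n_iter):
        x = hard_threshold(x - step * grad_f(x), k)
    return x

# Noiseless toy problem with a 4-sparse ground truth
rng = np.random.default_rng(1)
A = rng.normal(size=(60, 25))
x_true = np.zeros(25)
x_true[:4] = 2.0
b = A @ x_true
step = 1.0 / np.linalg.norm(A, 2) ** 2
x_hat = iht(lambda x: A.T @ (A @ x - b), np.zeros(25), step, k=4)
```

Unlike the ℓ1 proximal step, hard thresholding enforces the sparsity level k exactly at every iterate, at the cost of a non-convex constraint set.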
Experiment Setup | Yes | "In simulation we set p = 200 nodes, K = 10 topics. We generate the true matrices B1 and B2 row by row. For each row, we randomly pick 2-3 topics and assign a random number Unif(0.8, 1.8) · ζ, where ζ = 3 with probability 0.3 and ζ = 1 with probability 0.7. We make 30% of the values 3 times larger to capture the large variability in interests. All other values are set to be 0 and we scale B1 and B2 to have the same column sum. To generate cascades, we randomly choose a node j as the source. We vary the number of cascades n ∈ {300, 500, 1000, 2000, 5000, 10000}. For all three models, we fit the model on a training data set and choose the regularization parameter λ on a validation data set. For fair comparison, for each method we set the step size, initialization, penalty λ, and tolerance level to be the same."
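One plausible reading of this generation protocol for B1 and B2, sketched in Python. The function name, the choice of a single common column-sum target, and the RNG seeds are assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def generate_topic_matrix(p=200, K=10, rng=None):
    # Each node (row) is active in 2-3 randomly chosen topics; each nonzero
    # entry is Unif(0.8, 1.8) * zeta with zeta = 3 w.p. 0.3 and zeta = 1
    # w.p. 0.7, so roughly 30% of the values are 3 times larger.
    if rng is None:
        rng = np.random.default_rng()
    B = np.zeros((p, K))
    for i in range(p):
        topics = rng.choice(K, size=rng.integers(2, 4), replace=False)
        zeta = np.where(rng.random(len(topics)) < 0.3, 3.0, 1.0)
        B[i, topics] = rng.uniform(0.8, 1.8, size=len(topics)) * zeta
    return B

rng = np.random.default_rng(0)
B1 = generate_topic_matrix(rng=rng)
B2 = generate_topic_matrix(rng=rng)
# Rescale so B1 and B2 share the same column sums (one reading of the
# "same column sum" normalization; the common target value is an assumption)
target = B1.sum(axis=0).mean()
B1 *= target / B1.sum(axis=0)
B2 *= target / B2.sum(axis=0)
```

The rescaling step resolves the scale ambiguity between the influence matrix B1 and the receptivity matrix B2, since only their interaction enters the model.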