Online Tensor Methods for Learning Latent Variable Models

Authors: Furong Huang, U. N. Niranjan, Mohammad Umar Hakeem, Animashree Anandkumar

JMLR 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct optimization of multilinear operations in SGD and avoid directly forming the tensors, to save computational and storage costs. We present optimized algorithm in two platforms. Our GPU-based implementation exploits the parallelism of SIMD architectures to allow for maximum speed-up by a careful optimization of storage and data transfer, whereas our CPU-based implementation uses efficient sparse matrix computations and is suitable for large sparse data sets. For the community detection problem, we demonstrate accuracy and computational efficiency on Facebook, Yelp and DBLP data sets, and for the topic modeling problem, we also demonstrate good performance on the New York Times data set. We compare our results to the state-of-the-art algorithms such as the variational method, and report a gain of accuracy and a gain of several orders of magnitude in the execution time.
Researcher Affiliation | Academia | Furong Huang (EMAIL), U. N. Niranjan (EMAIL), Mohammad Umar Hakeem (EMAIL), Animashree Anandkumar (EMAIL); Electrical Engineering and Computer Science Dept., University of California, Irvine, Irvine, USA 92697, USA
Pseudocode | Yes | Algorithm 1: Overall approach for learning latent variable models via a moment-based approach. Input: Observed data: social network graph or document samples. Output: Learned latent variable model and infer hidden attributes. (...) Algorithm 2: Randomized Tall-thin SVD. Input: Second moment matrix M2. Output: Whitening matrix W. (...) Algorithm 3: Randomized Pseudoinverse. Input: Pairs matrix Pairs(B, C). Output: Pseudoinverse of the pairs matrix, (Pairs(B, C))†.
Open Source Code | Yes | The code is available at http://github.com/FurongHuang/Fast-Detection-of-Overlapping-Communities-via-Online-Tensor-Methods
Open Datasets | Yes | We learn interesting hidden topics in New York Times corpus from UCI bag-of-words data set [1] with around 100,000 words and 300,000 documents in about two minutes. [1] https://archive.ics.uci.edu/ml/datasets/Bag+of+Words (...) The DBLP data contains bibliographic records [7] with various publication venues, such as journals and conferences, which we model as communities. [7] http://dblp.uni-trier.de/xml/Dblp.xml (...) Facebook Dataset: A snapshot of the Facebook network of UNC (Traud et al., 2010) is provided with user attributes.
Dataset Splits | No | The paper focuses on learning latent variable models from observed data and evaluates performance against ground truth or other methods. It does not describe specific training, validation, or testing dataset splits, percentages, or sample counts. Evaluation metrics such as recovery ratio and error function are applied to models learned from the full data sets rather than from held-out splits.
Hardware Specification | Yes | Table 3: System specifications (hardware / software: version). CPU: Dual 8-core Xeon @ 2.0GHz; Memory: 64GB DDR3; GPU: Nvidia Quadro K5000; CUDA Cores: 1536; Global memory: 4GB GDDR5; CentOS: Release 6.4 (Final); GCC: 4.4.7; CUDA: Release 5.0; CULA-Dense: R16a
Software Dependencies | Yes | Table 3: System specifications (hardware / software: version). CPU: Dual 8-core Xeon @ 2.0GHz; Memory: 64GB DDR3; GPU: Nvidia Quadro K5000; CUDA Cores: 1536; Global memory: 4GB GDDR5; CentOS: Release 6.4 (Final); GCC: 4.4.7; CUDA: Release 5.0; CULA-Dense: R16a
Experiment Setup | Yes | We choose θ = 1 in our experiments to ensure that there is sufficient penalty for non-orthogonality, which prevents us from obtaining degenerate solutions. (...) For the mixed membership model, we set the concentration parameter α0 = 1. (...) Table 7: Yelp, Facebook and DBLP main quantitative evaluation of the tensor method versus the variational method: k̂ is the community number specified to our algorithm, Thre is the threshold for picking significant estimated membership entries.
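
The Pseudocode entry above lists an algorithm that takes the second moment matrix M2 and returns a whitening matrix W. The sketch below illustrates what such a matrix does, using a generic randomized range finder followed by a small eigendecomposition. It is not the paper's Algorithm 2: the function name `randomized_whitening`, the `oversample` parameter, and the toy data are all assumptions made for illustration, and M2 is assumed to be symmetric positive semidefinite with rank at least k.

```python
import numpy as np


def randomized_whitening(M2, k, oversample=10, seed=0):
    """Return W (n x k) such that W.T @ M2 @ W is close to the identity.

    Hypothetical helper: a generic randomized sketch, not the paper's code.
    Assumes M2 is symmetric positive semidefinite with rank >= k.
    """
    rng = np.random.default_rng(seed)
    n = M2.shape[0]
    # Random test matrix; M2 @ omega samples the column space of M2.
    omega = rng.standard_normal((n, k + oversample))
    # Orthonormal basis Q for the sampled column space.
    Q, _ = np.linalg.qr(M2 @ omega)
    # Small (k + oversample)-sized problem that stands in for M2.
    B = Q.T @ M2 @ Q
    evals, evecs = np.linalg.eigh(B)  # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:k]  # indices of the k largest
    U = Q @ evecs[:, top]  # approximate top-k eigenvectors of M2
    # Scale column j by 1 / sqrt(eigenvalue j) to whiten.
    return U / np.sqrt(evals[top])


# Toy rank-3 second moment matrix (synthetic data, for illustration only).
toy_rng = np.random.default_rng(1)
A = toy_rng.standard_normal((50, 3))
M2 = A @ A.T
W = randomized_whitening(M2, k=3)
whitened = W.T @ M2 @ W
print(np.round(whitened, 6))
```

The small matrix B is only (k + oversample) on a side, so the dense eigendecomposition stays cheap even when M2 itself is large and sparse; only products of M2 with thin matrices touch the full data.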