Stabilized Sparse Online Learning for Sparse Data

Authors: Yuting Ma, Tian Zheng

JMLR 2017

Reproducibility assessment (each entry lists the variable, the assessed result, and the supporting LLM response):
Research Type: Experimental
LLM Response: "Numerical experiments show that our algorithm compares favorably with the original truncated gradient SGD in terms of prediction accuracy, achieving both better sparsity and stability. In Section 6, we evaluate the performance of the proposed algorithm on several real-world high-dimensional data sets with varying sparsity levels."
Researcher Affiliation: Academia
LLM Response: "Yuting Ma, EMAIL, Department of Statistics, Columbia University, New York, NY, USA; Tian Zheng, EMAIL, Department of Statistics, Columbia University, New York, NY, USA"
Pseudocode: Yes
LLM Response: The paper presents four algorithms in pseudocode:
- Algorithm 1, B0(w0, g0): a burst of K updates with truncation using a universal gravity, as in the truncated SGD of Langford et al. (2009).
- Algorithm 2, B1(w0, g0): a burst of K updates with informative truncation.
- Algorithm 3, B2(w0, g0, Ω̂, D(m)): an informative truncated burst with stability selection in thread m.
- Algorithm 4, Ψ(D): stabilized truncated stochastic gradient descent for sparse learning.
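The burst-then-truncate structure underlying Algorithms 1-2 can be sketched as follows. This is a minimal illustration of the truncated-gradient idea of Langford et al. (2009), not the authors' implementation: the squared loss, step size eta, gravity, and burst size K are placeholder choices.

```python
import numpy as np

def truncate(w, alpha):
    """Soft-threshold each coordinate toward zero by gravity alpha,
    zeroing it out if it crosses zero (the truncation operator of
    Langford et al., 2009)."""
    return np.sign(w) * np.maximum(np.abs(w) - alpha, 0.0)

def truncated_sgd_burst(w, X, y, eta=0.1, gravity=0.01, K=5):
    """One 'burst' of K plain SGD steps (squared loss, purely for
    illustration) followed by a single truncation, mirroring the
    burst structure of Algorithm 1."""
    for i in range(K):
        xi, yi = X[i], y[i]
        grad = (w @ xi - yi) * xi       # gradient of 0.5 * (w.x - y)^2
        w = w - eta * grad
    # apply truncation once per burst, with accumulated gravity K*eta*gravity
    return truncate(w, K * eta * gravity)
```

Applying the truncation only every K steps (rather than after every update) is what makes sparsity attainable with sparse data: a single small gravity step would rarely zero out a coordinate.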
Open Source Code: No
LLM Response: The paper does not provide any explicit statement about releasing code for the described methodology, nor does it include a link to a code repository.
Open Datasets: Yes
LLM Response: "The first four data sets were constructed for the NIPS 2003 Feature Selection Challenge (see Guyon et al., 2004, for details)... Reuters CV1 (RCV1) is a popular text classification data set with a bag-of-words representation. We use the binary version from the LIBSVM data set collection introduced by Cai and He (2012)." Footnotes give the data sources: https://archive.ics.uci.edu/ml/index.html (NIPS 2003 challenge sets) and https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ (LIBSVM collection).
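Data sets from the LIBSVM collection come in the svmlight/LIBSVM sparse text format, which scikit-learn can read directly. A small self-contained sketch (the two-line toy data below stands in for a downloaded file such as the binary RCV1 set; it is not part of the paper):

```python
from io import BytesIO
from sklearn.datasets import load_svmlight_file

# Two toy examples in LIBSVM format: "label index:value ...".
# The real RCV1 / NIPS-2003 files from the URLs above use the same layout.
data = BytesIO(b"+1 3:0.5 7:1.2\n-1 1:0.3 3:0.8\n")

# zero_based=False: LIBSVM files conventionally index features from 1
X, y = load_svmlight_file(data, zero_based=False)
# X is a sparse CSR matrix of shape (n_examples, n_features)
```

The sparse CSR representation is what makes these high-dimensional data sets (e.g., Dorothea with ~100k features) practical to stream through an online learner.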
Dataset Splits: Yes
LLM Response: "We create the training and validation set using a 70-30 random split. All datasets are normalized such that each feature has variance 1. The information of experiment data sets are summarized in Table 1." Table 1 includes 'Training Size' and 'Validation Size' columns for each data set (RCV1, Dexter, Dorothea, Gisette, Arcene).
Hardware Specification: No
LLM Response: "M = 16 is used running in parallel on a high-performance computing cluster." The acknowledgments credit Columbia University's Shared Research Computing Facility (supported by NIH Research Facility Improvement Grant 1G20RR03089301 and NYSTAR Contract C090171, both awarded April 15, 2010). While a computing facility and a 'high-performance computing cluster' are mentioned, no specific hardware models (e.g., CPU, GPU, memory) are provided.
Software Dependencies: No
LLM Response: The paper names several algorithms (standard stochastic gradient descent, truncated gradient, RDA, FOBOS) and techniques (L1 regularization) but does not provide specific software names with version numbers for their implementation (e.g., Python 3.x, PyTorch 1.x, scikit-learn 0.x).
Experiment Setup: Yes
LLM Response: The paper specifies the following settings:
- All five stochastic methods run over the training data for {5, 10, 20, 30, 40, 50, 60} passes.
- Standard SGD: the optimal learning rate is chosen between 0.1 and 0.5.
- Truncated gradient: the base gravity parameter is chosen from the range [0.001, 0.01].
- RDA: following the suggestions in (Xiao, 2009), γ = 5000 and ρ = 0.005 (effectively γρ = 25).
- RDA and FOBOS: λ is tuned over {5 × 10−5, 1 × 10−4, 5 × 10−4, 1 × 10−3, 0.01, 0.05, 0.1, 0.5, 1, 5, 10}; in FOBOS, ηt = 1/t as instructed by Duchi and Singer (2009).
- Proposed stabilized truncated SGD: the burst size (truncation frequency) is fixed at K = 5 and nK = 5 for all data sets; the rejection rate is initiated at β0 = 0.7 with annealing rate chosen from {−7, −5, −3, −1, 0, 1, 3}; the stability selection threshold π0 is tuned over {0.5, 0.6, ..., 0.9}.
- Multi-thread implementation (Algorithm 4): M = 16 threads run in parallel on a high-performance computing cluster.
- Two commonly used convex loss functions are considered: hinge loss and logistic loss.
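The two loss functions named in the setup can be written out with the (sub)gradients an SGD-style learner would use. A minimal sketch for binary labels y ∈ {−1, +1}; the function names are illustrative, not taken from the paper:

```python
import numpy as np

def hinge_loss(w, x, y):
    """Hinge loss max(0, 1 - y * w.x) and a subgradient in w."""
    margin = y * (w @ x)
    loss = max(0.0, 1.0 - margin)
    grad = -y * x if margin < 1.0 else np.zeros_like(w)
    return loss, grad

def logistic_loss(w, x, y):
    """Logistic loss log(1 + exp(-y * w.x)) and its gradient in w."""
    margin = y * (w @ x)
    loss = np.log1p(np.exp(-margin))
    grad = -y * x / (1.0 + np.exp(margin))
    return loss, grad
```

Both losses are convex in w, which is what the stochastic methods compared in the experiments require; the hinge loss is merely subdifferentiable at margin 1, hence the subgradient case split.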