Approximation Vector Machines for Large-scale Online Learning
Authors: Trung Le, Tu Dinh Nguyen, Vu Nguyen, Dinh Phung
JMLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted extensive experiments for classification and regression tasks in batch and online modes using several benchmark datasets. The quantitative results show that our proposed AVM obtained predictive performance comparable with current state-of-the-art methods while simultaneously achieving a significant computational speed-up, owing to the ability of the proposed AVM to maintain the model size. |
| Researcher Affiliation | Academia | Trung Le, Tu Dinh Nguyen, Vu Nguyen, Dinh Phung. Centre for Pattern Recognition and Data Analytics, School of Information Technology, Deakin University, Australia, Waurn Ponds Campus |
| Pseudocode | Yes | Algorithm 1: Stochastic Gradient Descent algorithm. Algorithm 2: Approximation Vector Machine. Algorithm 3: Constructing hypersphere δ-coverage. Algorithm 4: Constructing hyperrectangle δ-coverage. Algorithm 5: Multiclass Approximation Vector Machine. |
| Open Source Code | Yes | The source code and experimental scripts are published for reproducibility at https://github.com/tund/avm. |
| Open Datasets | Yes | Except for airlines, all of the datasets can be downloaded from the LIBSVM and UCI websites. The airlines dataset is provided by the American Statistical Association (ASA) and can be downloaded from http://stat-computing.org/dataexpo/2009/. |
| Dataset Splits | Yes | In batch classification experiments, we follow the original divisions of training and testing sets in LIBSVM and UCI sites wherever available. For KDDCup99, covtype and airlines datasets, we split the data into 90% for training and 10% for testing. In online classification and regression tasks, we either use the entire datasets or concatenate training and testing parts into one. The online learning algorithms are then trained in a single pass through the data. In both batch and online settings, for each dataset, the models perform 10 runs on different random permutations of the training data samples. |
| Hardware Specification | Yes | All experiments are conducted using a Windows machine with 3.46GHz Xeon processor and 96GB RAM. |
| Software Dependencies | No | Our models are implemented in Python with the NumPy package. Their C++ implementations with MATLAB interfaces are published as part of the LIBSVM, Budgeted SVM, and LSOKL toolboxes. |
| Experiment Setup | Yes | The hyperparameters are varied over certain ranges and selected for the best performance on the validation set. The ranges are: C ∈ {2^−5, 2^−3, ..., 2^15}, λ ∈ {2^−4/N, 2^−2/N, ..., 2^16/N}, γ ∈ {2^−8, 2^−4, 2^−2, 2^0, 2^2, 2^4, 2^8}, and η ∈ {16.0, 8.0, 4.0, 2.0, 0.2, 0.02, 0.002, 0.0002}, where N is the number of data points. The coverage diameter δ of AVM is selected following the approach described in Section 9.2. For the budget size B in the NOGD and Pegasos algorithms, and the feature dimension D in FOGD, we use for each dataset the same values as in Section 7.1.1 of Lu et al. (2015). We use the RBF kernel, i.e., K(x, x') = exp(−γ‖x − x'‖²), for all algorithms including ours. For a fair comparison, these hyperparameters are specified using cross-validation on the training subset. In particular, we further partition the training set into 80% for learning and 20% for validation. For large-scale databases, we use only 1% of the training set so that the search can finish within an acceptable time budget. |
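The hyperparameter grids and the 80/20 learning/validation split described above can be sketched as follows. This is a minimal illustration in Python/NumPy (the paper's implementation language), not the authors' actual script; the dataset size `N` and the random seed are hypothetical, and `rbf_kernel` simply evaluates the stated formula K(x, x') = exp(−γ‖x − x'‖²).

```python
import numpy as np

# Hypothetical dataset size for illustration; the paper sets N to the
# number of data points in each benchmark dataset.
N = 10000

# Hyperparameter grids as reported in the experiment setup.
C_grid = [2.0**k for k in range(-5, 16, 2)]           # {2^-5, 2^-3, ..., 2^15}
lambda_grid = [2.0**k / N for k in range(-4, 17, 2)]  # {2^-4/N, ..., 2^16/N}
gamma_grid = [2.0**k for k in (-8, -4, -2, 0, 2, 4, 8)]
eta_grid = [16.0, 8.0, 4.0, 2.0, 0.2, 0.02, 0.002, 0.0002]

def rbf_kernel(x, x_prime, gamma):
    """RBF kernel K(x, x') = exp(-gamma * ||x - x'||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return np.exp(-gamma * diff.dot(diff))

# 80% learning / 20% validation split over a random permutation of the
# training indices, as described in the setup.
rng = np.random.default_rng(0)  # seed is arbitrary here
idx = rng.permutation(N)
n_learn = int(0.8 * N)
learn_idx, valid_idx = idx[:n_learn], idx[n_learn:]
```

Cross-validation would then loop over the Cartesian product of these grids, train on `learn_idx`, and keep the configuration that scores best on `valid_idx`.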