Certifiably Optimal Low Rank Factor Analysis

Authors: Dimitris Bertsimas, Martin S. Copenhaver, Rahul Mazumder

JMLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our proposal on an array of synthetic and real-life datasets. To our knowledge, this is the first paper that demonstrates how a previously intractable rank-constrained optimization problem can be solved to provable optimality by coupling developments in convex analysis and in global and discrete optimization. ... In Section 5, we present computational results demonstrating the effectiveness of our proposed method in terms of (a) modeling flexibility in the choice of the number of factors r and the parameter q, (b) scalability, and (c) the quality of solutions obtained in a wide array of real and synthetic datasets; comparisons with several existing methods for FA are considered.
Researcher Affiliation | Academia | Dimitris Bertsimas, Sloan School of Management, MIT, Cambridge, MA 02139, USA; Martin S. Copenhaver, Operations Research Center, MIT, Cambridge, MA 02139, USA; Rahul Mazumder, Sloan School of Management, MIT, Cambridge, MA 02139, USA
Pseudocode | Yes | Algorithm 1: a CG-based algorithm for formulation (20). Algorithm 2: a spatial branch-and-bound scheme to solve (39); the inputs are as follows: (a) upper bounds u0 such that for any i and any Φ ∈ F_Σ, we have Φ_i ≤ u0_i (see Section 4.3); (b) optimality tolerance TOL; and (c) an initial feasible solution Φ^f ∈ F_Σ. Algorithm 3: a CG-based algorithm for Problem (16).
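The decomposition targeted by the quoted pseudocode, Σ ≈ Θ + Φ with Θ low-rank PSD (the common factors) and Φ nonnegative diagonal (the uniquenesses), can be illustrated by a minimal block-coordinate sketch. This is an assumption-laden simplification for intuition only, not the paper's Algorithm 1 (which uses a conditional-gradient scheme with its own updates and stopping rules); the function name `alternating_fa` is hypothetical.

```python
import numpy as np

def alternating_fa(Sigma, r, iters=100):
    """Illustrative alternating heuristic for low-rank factor analysis:
    Sigma ~ Theta + diag(phi), rank(Theta) <= r, Theta PSD, phi >= 0.
    A sketch only -- not the CG algorithm from the paper."""
    p = Sigma.shape[0]
    phi = np.zeros(p)
    for _ in range(iters):
        # Block 1: best PSD rank-<=r approximation of Sigma - diag(phi)
        # (keep the r largest eigenvalues, clipped at zero).
        vals, vecs = np.linalg.eigh(Sigma - np.diag(phi))
        top = np.argsort(vals)[::-1][:r]
        keep = np.clip(vals[top], 0.0, None)
        Theta = (vecs[:, top] * keep) @ vecs[:, top].T
        # Block 2: best nonnegative diagonal given Theta.
        phi = np.clip(np.diag(Sigma - Theta), 0.0, None)
    return Theta, phi
```

Because each block update exactly minimizes the Frobenius residual ||Σ − Θ − diag(φ)||_F over its own variable, the residual is non-increasing across iterations; the paper's contribution is precisely that it goes beyond such heuristics to certify global optimality.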
Open Source Code | Yes | An implementation of our approach is available at https://github.com/copenhaver/factoranalysis.
Open Datasets | Yes | The bfi data set has 2800 observations on 28 variables (25 personality self-reported items and 3 demographic variables). The neo data set has 1000 measurements for p = 30 dimensions. The Harman data set is a correlation matrix of 24 psychological tests given to 145 seventh- and eighth-grade children. The geomorphology data set is a collection of geomorphological data collected across p = 10 variables and 75 observations. (The geomorphology data set originally has p = 11, but we remove the one categorical feature, leaving p = 10.) The JO data set records athletic performance in Olympic events, and involves 24 observations with p = 58. This example is distinctive because the corresponding correlation matrix Σ is not full rank (having more variables than observations). We used the implementations available in the R package psych (Revelle, 2015) available from CRAN.
Dataset Splits | No | The paper describes generating synthetic datasets and using several real-world benchmark datasets. However, it does not explicitly detail any training, validation, or test splits. For synthetic data, it describes generation, not splitting; for real data, it references the datasets and their properties but does not specify how they were partitioned for experimental evaluation, beyond stating "In the finite sample setting, we set Σ to be the sample covariance matrix."
Hardware Specification | No | All computational experiments are performed in a shared cluster computing environment with highly variable demand, and therefore runtimes are not necessarily a reliable measure of problem complexity; hence, the number of nodes considered is always displayed.
Software Dependencies | No | The paper mentions several software tools and packages, including the R packages psych (Revelle, 2015), nFactors (Raiche and Magis, 2011), and GPArotation (Bernaards and Jennrich, 2005); Julia, using the SDO solvers MOSEK and SCS (O'Donoghue et al., 2016); and MATLAB. However, it does not provide specific version numbers for Julia, MOSEK, SCS, MATLAB, or the R packages, which are required for a reproducible description of ancillary software.
Experiment Setup | Yes | For the problems (CFA_q), q ∈ {1, 2}, we present the results of Algorithm 1. (Results obtained by the approach in Appendix B were similar.) In all the examples, with the exception of MTFA, we set the number of factors to be r = (R − 1), one less than the rank of the covariance matrix for the common underlying factors; in other words, the remaining rank-one component can be considered as noise. ... we took the convergence thresholds for Algorithm 1 as TOL = 10^-5 and ADMM as TOL_α = 10^-9. ... We always use the default tolerance TOL = 0.1 for algorithm termination. Parameters for branching, pruning, node selection, etc., are detailed throughout Section 4. ... For the computational experiments, we set ϵ = 0.4. ... We set β = 0.9 for all computational experiments.