Certifiably Optimal Low Rank Factor Analysis

Authors: Dimitris Bertsimas, Martin S. Copenhaver, Rahul Mazumder

JMLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our proposal on an array of synthetic and real-life datasets. To our knowledge, this is the first paper that demonstrates how a previously intractable rank-constrained optimization problem can be solved to provable optimality by coupling developments in convex analysis and in global and discrete optimization. ... In Section 5, we present computational results demonstrating the effectiveness of our proposed method in terms of (a) modeling flexibility in the choice of the number of factors r and the parameter q, (b) scalability, and (c) the quality of solutions obtained in a wide array of real and synthetic datasets; comparisons with several existing methods for FA are considered.
Researcher Affiliation | Academia | Dimitris Bertsimas, Sloan School of Management, MIT, Cambridge, MA 02139, USA; Martin S. Copenhaver, Operations Research Center, MIT, Cambridge, MA 02139, USA; Rahul Mazumder, Sloan School of Management, MIT, Cambridge, MA 02139, USA
Pseudocode | Yes | Algorithm 1: a CG-based algorithm for formulation (20). Algorithm 2: a spatial branch-and-bound scheme to solve (39); the inputs are as follows: (a) upper bounds u0 such that for any i and any Φ ∈ F_Σ, we have Φ_i ≤ u0_i (see Section 4.3); (b) optimality tolerance TOL; and (c) an initial feasible solution Φ^f ∈ F_Σ. Algorithm 3: a CG-based algorithm for Problem (16).
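The decomposition targeted by the quoted pseudocode, Σ ≈ Θ + Φ with Θ low-rank PSD (the common factors) and Φ nonnegative diagonal (the uniquenesses), can be illustrated by a minimal block-coordinate sketch. This is an assumption-laden simplification for intuition only, not the paper's Algorithm 1 (which uses a conditional-gradient scheme with its own updates and stopping rules); the function name `alternating_fa` is hypothetical.

```python
import numpy as np

def alternating_fa(Sigma, r, iters=100):
    """Illustrative alternating heuristic for low-rank factor analysis:
    Sigma ~ Theta + diag(phi), rank(Theta) <= r, Theta PSD, phi >= 0.
    A sketch only -- not the CG algorithm from the paper."""
    p = Sigma.shape[0]
    phi = np.zeros(p)
    for _ in range(iters):
        # Block 1: best PSD rank-<=r approximation of Sigma - diag(phi)
        # (keep the r largest eigenvalues, clipped at zero).
        vals, vecs = np.linalg.eigh(Sigma - np.diag(phi))
        top = np.argsort(vals)[::-1][:r]
        keep = np.clip(vals[top], 0.0, None)
        Theta = (vecs[:, top] * keep) @ vecs[:, top].T
        # Block 2: best nonnegative diagonal given Theta.
        phi = np.clip(np.diag(Sigma - Theta), 0.0, None)
    return Theta, phi
```

Because each block update exactly minimizes the Frobenius residual ||Σ − Θ − diag(φ)||_F over its own variable, the residual is non-increasing across iterations; the paper's contribution is precisely that it goes beyond such heuristics to certify global optimality.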
Open Source Code | Yes | An implementation of our approach is available at https://github.com/copenhaver/factoranalysis.
Open Datasets | Yes | The bfi data set has 2800 observations on 28 variables (25 personality self-reported items and 3 demographic variables). The neo data set has 1000 measurements for p = 30 dimensions. The Harman data set is a correlation matrix of 24 psychological tests given to 145 seventh- and eighth-grade children. The geomorphology data set is a collection of geomorphological data collected across p = 10 variables and 75 observations. (The geomorphology data set originally has p = 11, but we remove the one categorical feature, leaving p = 10.) The JO data set records athletic performance in Olympic events, and involves 24 observations with p = 58. This example is distinctive because the corresponding correlation matrix Σ is not full rank (having more variables than observations). We used the implementations available in the R package psych (Revelle, 2015) available from CRAN.
Dataset Splits | No | The paper describes generating synthetic datasets and using several real-world benchmark datasets. However, it does not explicitly detail any training, validation, or test splits. For synthetic data, it describes generation, not splitting; for real data, it references the datasets and their properties but does not specify how they were partitioned for experimental evaluation, beyond stating "In the finite sample setting, we set Σ to be the sample covariance matrix."
Hardware Specification | No | All computational experiments are performed in a shared cluster computing environment with highly variable demand, and therefore runtimes are not necessarily a reliable measure of problem complexity; hence, the number of nodes considered is always displayed.
Software Dependencies | No | The paper mentions several software tools and packages, including the R packages psych (Revelle, 2015), nFactors (Raiche and Magis, 2011), and GPArotation (Bernaards and Jennrich, 2005); Julia, using the SDO solvers MOSEK and SCS (O'Donoghue et al., 2016); and MATLAB. However, it does not provide specific version numbers for Julia, MOSEK, SCS, MATLAB, or the R packages, which are required for a reproducible description of ancillary software.
Experiment Setup | Yes | For the problems (CFA_q), q ∈ {1, 2}, we present the results of Algorithm 1. (Results obtained by the approach in Appendix B were similar.) In all the examples, with the exception of MTFA, we set the number of factors to be r = (R − 1), one less than the rank of the covariance matrix for the common underlying factors; in other words, the remaining rank-one component can be considered as noise. ... we took the convergence thresholds for Algorithm 1 as TOL = 10^-5 and ADMM as TOL_α = 10^-9. ... We always use the default tolerance TOL = 0.1 for algorithm termination. Parameters for branching, pruning, node selection, etc., are detailed throughout Section 4. ... For the computational experiments, we set ϵ = 0.4. ... We set β = 0.9 for all computational experiments.