CAREER: A Foundation Model for Labor Sequence Data
Authors: Keyon Vafa, Emil Palikot, Tianyu Du, Ayush Kanodia, Susan Athey, David Blei
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To study this model empirically, we pretrain CAREER on a dataset of 24 million passively-collected resumes provided by Zippia, a career planning company. We then fine-tune CAREER's representations of job sequences to make predictions on three widely-used economic datasets: the National Longitudinal Survey of Youth 1979 (NLSY79), another cohort from the same survey (NLSY97), and the Panel Study of Income Dynamics (PSID). In this study, we find that CAREER outperforms standard econometric models for predicting and forecasting occupations on these survey datasets. |
| Researcher Affiliation | Academia | Keyon Vafa (Harvard University), Emil Palikot (Stanford University), Tianyu Du (Stanford University), Ayush Kanodia (Stanford University), Susan Athey (Stanford University), David M. Blei (Columbia University) |
| Pseudocode | No | The paper describes the model's architecture and computations using mathematical equations and textual explanations (e.g., Section 2.3 CAREER Model, and Appendix C Transformer Details), along with a computation graph (Figure 1), but it does not present a distinct block of pseudocode or a formal algorithm. |
| Open Source Code | Yes | We release code so that practitioners can train CAREER for their own problems.1 1https://github.com/keyonvafa/career-code |
| Open Datasets | Yes | We pretrain CAREER on a large dataset of resumes provided by Zippia, a career planning company. This dataset contains resumes from 23.7 million working Americans. Each job is encoded into one of 330 occupational codes, using the coding scheme of Autor & Dorn (2013). We transfer CAREER to three widely-used survey datasets: two cohorts from the National Longitudinal Survey of Youth (NLSY79 and NLSY97) and the Panel Study of Income Dynamics (PSID). These datasets have been carefully constructed to be representative of the general population, and they are widely used by economists for estimating economic quantities. NLSY79 is a longitudinal panel survey following a cohort of Americans who were between 14 and 22 when the survey began in 1979, while NLSY97 follows a different cohort of individuals who were between 12 and 17 when the survey began in 1997. PSID is a longitudinal survey following a sample of American families, with individuals added over the years. These surveys are publicly available and allow for linking individuals over time. See Appendix F for more information about how the data is formed. |
| Dataset Splits | Yes | We divide all survey datasets into 70/10/20 train/validation/test splits, and train all models by optimizing the log-likelihood with Adam (Kingma & Ba, 2015). |
| Hardware Specification | No | Pretraining CAREER on the resumes data takes 18 hours on a single GPU. (See Appendix G for more details on the model and hyperparameters.) ... fine-tuning on one GPU takes 13 minutes on NLSY79, 7 minutes on NLSY97, and 23 minutes on PSID. |
| Software Dependencies | No | All models were trained using Fairseq (Ott et al., 2019). |
| Experiment Setup | Yes | We use 16,000 total tokens per minibatch, varying the batch size depending on the largest sequence length in the batch. We use the Adam learning rate scheduler (Kingma & Ba, 2015). All experiments on the resumes data warm up the learning rate from 10^-7 to 0.0005 over 4,000 steps, after which the inverse square root schedule is used (Vaswani et al., 2017). For the survey datasets, we also used the inverse square root scheduler, but experimented with various learning rates and warmup updates, using the one we found to work best for each model. For CAREER with pretrained representations, we used a learning rate of 0.0001 and 500 warmup updates; for CAREER without pretraining, we used a learning rate of 0.0005 and 500 warmup updates; for the bag of jobs model, we used a learning rate of 0.0005 and 5,000 warmup updates; for the regression model, we used a learning rate of 0.0005 and 4,000 warmup updates. We use a learning rate of 0.005 for job representation learning and Job2Vec, with 5,000 warmup updates. All models besides were also trained with 0.01 weight decay. All models were trained using Fairseq (Ott et al., 2019). |
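The Dataset Splits row reports a 70/10/20 train/validation/test division of each survey dataset. The paper does not publish the splitting code in this excerpt, so the sketch below is a generic, seeded reconstruction of that protocol (the function name and the use of a fixed seed are our assumptions, not the authors'):

```python
import random

def train_val_test_split(records, train_frac=0.7, val_frac=0.1, seed=0):
    """Deterministically shuffle, then cut into 70/10/20 train/val/test.

    This is a plain illustration of the split ratios reported in the
    paper, not the authors' actual preprocessing code.
    """
    idx = list(range(len(records)))
    random.Random(seed).shuffle(idx)  # fixed seed for reproducibility
    n_train = int(train_frac * len(records))
    n_val = int(val_frac * len(records))
    train = [records[i] for i in idx[:n_train]]
    val = [records[i] for i in idx[n_train:n_train + n_val]]
    test = [records[i] for i in idx[n_train + n_val:]]
    return train, val, test

# Example: 1,000 synthetic job-sequence IDs
train, val, test = train_val_test_split(list(range(1000)))
print(len(train), len(val), len(test))  # 700 100 200
```

Splitting individuals (rather than individual job records) would be the natural unit here, since each survey respondent contributes a whole career sequence.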
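The Experiment Setup row describes warming the learning rate up from 10^-7 to 0.0005 over 4,000 steps and then following an inverse square root schedule, which matches Fairseq's `inverse_sqrt` scheduler. A minimal sketch of that schedule follows (the function name and the linear warmup interpolation are assumptions based on Fairseq's documented behavior, not code from the paper):

```python
import math

def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_init_lr=1e-7, warmup_updates=4000):
    """Warmup then inverse-sqrt decay, as in Fairseq's inverse_sqrt scheduler.

    During warmup the LR rises linearly from warmup_init_lr to peak_lr;
    afterwards it decays proportionally to 1/sqrt(step).
    """
    if step < warmup_updates:
        return warmup_init_lr + (peak_lr - warmup_init_lr) * step / warmup_updates
    return peak_lr * math.sqrt(warmup_updates) / math.sqrt(step)

print(inverse_sqrt_lr(4000))   # peak: 0.0005
print(inverse_sqrt_lr(16000))  # decayed to half the peak: 0.00025
```

The per-model settings in the table (e.g. peak LR 0.0001 with 500 warmup updates for pretrained CAREER) would simply change the `peak_lr` and `warmup_updates` arguments.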