CAREER: A Foundation Model for Labor Sequence Data
Authors: Keyon Vafa, Emil Palikot, Tianyu Du, Ayush Kanodia, Susan Athey, David Blei
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To study this model empirically, we pretrain CAREER on a dataset of 24 million passively-collected resumes provided by Zippia, a career planning company. We then fine-tune CAREER's representations of job sequences to make predictions on three widely-used economic datasets: the National Longitudinal Survey of Youth 1979 (NLSY79), another cohort from the same survey (NLSY97), and the Panel Study of Income Dynamics (PSID). In this study, we find that CAREER outperforms standard econometric models for predicting and forecasting occupations on these survey datasets. |
| Researcher Affiliation | Academia | Keyon Vafa (Harvard University), Emil Palikot (Stanford University), Tianyu Du (Stanford University), Ayush Kanodia (Stanford University), Susan Athey (Stanford University), David M. Blei (Columbia University) |
| Pseudocode | No | The paper describes the model's architecture and computations using mathematical equations and textual explanations (e.g., Section 2.3 CAREER Model, and Appendix C Transformer Details), along with a computation graph (Figure 1), but it does not present a distinct block of pseudocode or a formal algorithm. |
| Open Source Code | Yes | We release code so that practitioners can train CAREER for their own problems.1 1https://github.com/keyonvafa/career-code |
| Open Datasets | Yes | We pretrain CAREER on a large dataset of resumes provided by Zippia, a career planning company. This dataset contains resumes from 23.7 million working Americans. Each job is encoded into one of 330 occupational codes, using the coding scheme of Autor & Dorn (2013). We transfer CAREER to three widely-used survey datasets: two cohorts from the National Longitudinal Survey of Youth (NLSY79 and NLSY97) and the Panel Study of Income Dynamics (PSID). These datasets have been carefully constructed to be representative of the general population, and they are widely used by economists for estimating economic quantities. NLSY79 is a longitudinal panel survey following a cohort of Americans who were between 14 and 22 when the survey began in 1979, while NLSY97 follows a different cohort of individuals who were between 12 and 17 when the survey began in 1997. PSID is a longitudinal survey following a sample of American families, with individuals added over the years. These surveys are publicly available and allow for linking individuals over time. See Appendix F for more information about how the data is formed. |
| Dataset Splits | Yes | We divide all survey datasets into 70/10/20 train/validation/test splits, and train all models by optimizing the log-likelihood with Adam (Kingma & Ba, 2015). |
| Hardware Specification | No | Pretraining CAREER on the resumes data takes 18 hours on a single GPU. (See Appendix G for more details on the model and hyperparameters.) ... fine-tuning on one GPU takes 13 minutes on NLSY79, 7 minutes on NLSY97, and 23 minutes on PSID. |
| Software Dependencies | No | All models were trained using Fairseq (Ott et al., 2019). |
| Experiment Setup | Yes | We use 16,000 total tokens per minibatch, varying the batch size depending on the largest sequence length in the batch. We use the Adam learning rate scheduler (Kingma & Ba, 2015). All experiments on the resumes data warm up the learning rate from 10^-7 to 0.0005 over 4,000 steps, after which the inverse square root schedule is used (Vaswani et al., 2017). For the survey datasets, we also used the inverse square root scheduler, but experimented with various learning rates and warmup updates, using the one we found to work best for each model. For CAREER with pretrained representations, we used a learning rate of 0.0001 and 500 warmup updates; for CAREER without pretraining, we used a learning rate of 0.0005 and 500 warmup updates; for the bag of jobs model, we used a learning rate of 0.0005 and 5,000 warmup updates; for the regression model, we used a learning rate of 0.0005 and 4,000 warmup updates. We use a learning rate of 0.005 for job representation learning and Job2Vec, with 5,000 warmup updates. All models besides were also trained with 0.01 weight decay. All models were trained using Fairseq (Ott et al., 2019). |
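The Dataset Splits row reports a 70/10/20 train/validation/test division of each survey dataset. The paper does not publish the splitting code in this excerpt, so the sketch below is a generic, seeded reconstruction of that protocol (the function name and the use of a fixed seed are our assumptions, not the authors'):

```python
import random

def train_val_test_split(records, train_frac=0.7, val_frac=0.1, seed=0):
    """Deterministically shuffle, then cut into 70/10/20 train/val/test.

    This is a plain illustration of the split ratios reported in the
    paper, not the authors' actual preprocessing code.
    """
    idx = list(range(len(records)))
    random.Random(seed).shuffle(idx)  # fixed seed for reproducibility
    n_train = int(train_frac * len(records))
    n_val = int(val_frac * len(records))
    train = [records[i] for i in idx[:n_train]]
    val = [records[i] for i in idx[n_train:n_train + n_val]]
    test = [records[i] for i in idx[n_train + n_val:]]
    return train, val, test

# Example: 1,000 synthetic job-sequence IDs
train, val, test = train_val_test_split(list(range(1000)))
print(len(train), len(val), len(test))  # 700 100 200
```

Splitting individuals (rather than individual job records) would be the natural unit here, since each survey respondent contributes a whole career sequence.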
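The Experiment Setup row describes warming the learning rate up from 10^-7 to 0.0005 over 4,000 steps and then following an inverse square root schedule, which matches Fairseq's `inverse_sqrt` scheduler. A minimal sketch of that schedule follows (the function name and the linear warmup interpolation are assumptions based on Fairseq's documented behavior, not code from the paper):

```python
import math

def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_init_lr=1e-7, warmup_updates=4000):
    """Warmup then inverse-sqrt decay, as in Fairseq's inverse_sqrt scheduler.

    During warmup the LR rises linearly from warmup_init_lr to peak_lr;
    afterwards it decays proportionally to 1/sqrt(step).
    """
    if step < warmup_updates:
        return warmup_init_lr + (peak_lr - warmup_init_lr) * step / warmup_updates
    return peak_lr * math.sqrt(warmup_updates) / math.sqrt(step)

print(inverse_sqrt_lr(4000))   # peak: 0.0005
print(inverse_sqrt_lr(16000))  # decayed to half the peak: 0.00025
```

The per-model settings in the table (e.g. peak LR 0.0001 with 500 warmup updates for pretrained CAREER) would simply change the `peak_lr` and `warmup_updates` arguments.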