ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation

Authors: Yupeng Hou, Jianmo Ni, Zhankui He, Noveen Sachdeva, Wang-Cheng Kang, Ed H. Chi, Julian McAuley, Derek Zhiyuan Cheng

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use three categories from the Amazon Reviews dataset (McAuley et al., 2015) for our experiments: Sports and Outdoors (Sports), Beauty (Beauty), and CDs and Vinyl (CDs). We compare the performance of ActionPiece against several baseline methods, and we use Recall@K and NDCG@K as metrics to evaluate them.
Researcher Affiliation | Collaboration | 1University of California, San Diego; 2Google DeepMind. Correspondence to: Yupeng Hou and Jianmo Ni <EMAIL, EMAIL>.
Pseudocode | Yes | Algorithm 1: ActionPiece Vocabulary Construction; Algorithm 2: ActionPiece Vocabulary Construction, Count (Figure 2); Algorithm 3: ActionPiece Vocabulary Construction, Update (Figure 3); Algorithm 4: Segmentation via Set Permutation Regularization (SPR) (Section 3.2.2); Figure 7: Pseudocode for a single iteration of the efficient vocabulary construction algorithm, illustrating how a max-heap with lazy updates is used to track and merge frequent token pairs.
Open Source Code | Yes | Our code is available at: https://github.com/google-deepmind/action_piece.
Open Datasets | Yes | We use three categories from the Amazon Reviews dataset (McAuley et al., 2015) for our experiments: Sports and Outdoors (Sports), Beauty (Beauty), and CDs and Vinyl (CDs).
Dataset Splits | Yes | To evaluate the models, we adopt the widely used leave-last-out protocol (Kang & McAuley, 2018; Zhao et al., 2022; Rajput et al., 2023), where the last item and second-to-last item in each action sequence are used for testing and validation, respectively.
Hardware Specification | Yes | Each model is trained on a single 40GB NVIDIA A100 GPU.
Software Dependencies | No | We implement BERT4Rec, SASRec, FDSA, and S3-Rec using the open-source recommendation library RecBole (Zhao et al., 2021). For other methods, we implement them ourselves with Hugging Face Transformers (Wolf et al., 2020) and PyTorch (Paszke et al., 2019). We use FAISS (Douze et al., 2024) to quantize sentence representations.
Experiment Setup | Yes | We train the GR models from scratch for up to 200 epochs, using early stopping if the model does not achieve a better NDCG@10 on the validation set for 20 consecutive epochs. The training batch size is set to 256. The learning rate is selected from {1e-3, 3e-3, 5e-3} with a warmup step of 10,000. We use a dropout rate of 0.1 and tune the weight decay from {0.07, 0.1, 0.15, 0.2}. For all methods implemented by us, we conduct five repeated experiments using random seeds {2024, 2025, 2026, 2027, 2028}. The model checkpoints with the best average NDCG@10 on the validation set are selected for evaluation on the test set, and we report these results. Table 7. Hyperparameter settings of ActionPiece for each dataset.
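The evaluation row above cites Recall@K and NDCG@K. Since the leave-last-out protocol leaves a single ground-truth item per sequence, both metrics reduce to simple rank-based formulas. The sketch below is illustrative and not taken from the paper's codebase; function names are our own.

```python
import math

def recall_at_k(ranked_items, target, k):
    # With one ground-truth item, Recall@K is 1 if the target
    # appears in the top-k ranked list, else 0.
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k):
    # With one ground-truth item, NDCG@K reduces to 1 / log2(rank + 1),
    # where rank is the 1-based position of the target in the top-k list.
    for idx, item in enumerate(ranked_items[:k]):
        if item == target:
            return 1.0 / math.log2(idx + 2)
    return 0.0

# Example: target item 1 is ranked second.
print(recall_at_k([3, 1, 2], target=1, k=2))  # 1.0
print(ndcg_at_k([3, 1, 2], target=1, k=10))   # 1/log2(3) ~ 0.6309
```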
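The leave-last-out protocol quoted in the Dataset Splits row can be stated concretely: the last item of each sequence is the test target, the second-to-last is the validation target, and earlier items form the training prefix. A minimal sketch (our own helper, not from the released code):

```python
def leave_last_out(seq):
    # seq is one user's chronologically ordered action sequence.
    # Returns the training prefix plus (input, target) pairs for
    # validation (second-to-last item) and testing (last item).
    train = seq[:-2]
    valid = (seq[:-2], seq[-2])
    test = (seq[:-1], seq[-1])
    return train, valid, test

train, valid, test = leave_last_out([10, 20, 30, 40])
# train -> [10, 20]; valid -> ([10, 20], 30); test -> ([10, 20, 30], 40)
```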
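The Experiment Setup row describes early stopping: training runs up to 200 epochs and halts when validation NDCG@10 fails to improve for 20 consecutive epochs. A generic patience loop capturing that rule (the `evaluate` callback is a placeholder, not the paper's training code):

```python
def train_with_early_stopping(evaluate, max_epochs=200, patience=20):
    """Run up to max_epochs; stop once `patience` consecutive epochs
    pass without a new best validation score.

    evaluate(epoch) should train one epoch and return the validation
    metric (e.g. NDCG@10). Returns (best_score, best_epoch)."""
    best, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        score = evaluate(epoch)
        if score > best:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best, best_epoch
```

In the paper's setting, the checkpoint from the best epoch would then be evaluated on the test set.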
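The Pseudocode row mentions a max-heap with lazy updates for tracking and merging frequent token pairs during vocabulary construction (Figure 7). The sketch below shows that heap pattern on plain BPE-style merging over flat token sequences; it does not reproduce ActionPiece's set-structured actions or its incremental count updates, and all names are our own.

```python
import heapq
from collections import Counter

def build_vocab(sequences, num_merges):
    """BPE-style sketch: repeatedly merge the most frequent adjacent
    token pair. A max-heap (negated counts) with lazy updates tracks
    frequencies: stale heap entries are discarded when popped rather
    than updated in place."""
    seqs = [list(s) for s in sequences]

    def pair_counts():
        c = Counter()
        for s in seqs:
            for a, b in zip(s, s[1:]):
                c[(a, b)] += 1
        return c

    counts = pair_counts()
    # repr(pair) acts as a tie-breaker so heap entries always compare.
    heap = [(-n, repr(p), p) for p, n in counts.items()]
    heapq.heapify(heap)

    merges = []
    while heap and len(merges) < num_merges:
        neg, _, pair = heapq.heappop(heap)
        if counts.get(pair) != -neg:  # stale entry: lazy invalidation
            continue
        merges.append(pair)
        # Replace every occurrence of the pair with one merged token.
        for i, s in enumerate(seqs):
            out, j = [], 0
            while j < len(s):
                if j + 1 < len(s) and (s[j], s[j + 1]) == pair:
                    out.append(pair)  # merged token represented as the pair
                    j += 2
                else:
                    out.append(s[j])
                    j += 1
            seqs[i] = out
        # Recount from scratch for clarity; an efficient version (as in
        # the paper's Figure 7) updates counts incrementally instead.
        counts = pair_counts()
        for p, n in counts.items():
            heapq.heappush(heap, (-n, repr(p), p))
    return merges
```

Lazy invalidation avoids the cost of deleting or rewriting arbitrary heap entries: outdated entries simply fail the count check when popped.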