ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation
Authors: Yupeng Hou, Jianmo Ni, Zhankui He, Noveen Sachdeva, Wang-Cheng Kang, Ed H. Chi, Julian McAuley, Derek Zhiyuan Cheng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use three categories from the Amazon Reviews dataset (McAuley et al., 2015) for our experiments: Sports and Outdoors (Sports), Beauty (Beauty), and CDs and Vinyl (CDs). We compare the performance of ActionPiece with the following methods: We use Recall@K and NDCG@K as metrics to evaluate the methods. |
| Researcher Affiliation | Collaboration | 1University of California, San Diego 2Google DeepMind. Correspondence to: Yupeng Hou and Jianmo Ni <EMAIL, EMAIL>. |
| Pseudocode | Yes | Algorithm 1 ActionPiece Vocabulary Construction; Algorithm 2 ActionPiece Vocabulary Construction Count (Figure 2); Algorithm 3 ActionPiece Vocabulary Construction Update (Figure 3); Algorithm 4 Segmentation via Set Permutation Regularization (SPR) (Section 3.2.2); Figure 7. Pseudocode for a single iteration of the efficient vocabulary construction algorithm, illustrating how a max-heap with lazy updates is used to track and merge frequent token pairs. |
| Open Source Code | Yes | Our code is available at: https://github.com/google-deepmind/action_piece. |
| Open Datasets | Yes | We use three categories from the Amazon Reviews dataset (McAuley et al., 2015) for our experiments: Sports and Outdoors (Sports), Beauty (Beauty), and CDs and Vinyl (CDs). |
| Dataset Splits | Yes | To evaluate the models, we adopt the widely used leave-last-out protocol (Kang & McAuley, 2018; Zhao et al., 2022; Rajput et al., 2023), where the last item and second-to-last item in each action sequence are used for testing and validation, respectively. |
| Hardware Specification | Yes | Each model is trained on a single 40G NVIDIA A100 GPU. |
| Software Dependencies | No | We implement BERT4Rec, SASRec, FDSA, and S3-Rec using the open-source recommendation library RecBole (Zhao et al., 2021). For other methods, we implement them ourselves with Hugging Face Transformers (Wolf et al., 2020) and PyTorch (Paszke et al., 2019). We use FAISS (Douze et al., 2024) to quantize sentence representations. |
| Experiment Setup | Yes | We train the GR models from scratch for up to 200 epochs, using early stopping if the model does not achieve a better NDCG@10 on the validation set for 20 consecutive epochs. The training batch size is set to 256. The learning rate is selected from {1×10⁻³, 3×10⁻³, 5×10⁻³} with a warmup step of 10,000. We use a dropout rate of 0.1 and tune the weight decay from {0.07, 0.1, 0.15, 0.2}. For all methods implemented by us, we conduct five repeated experiments using random seeds {2024, 2025, 2026, 2027, 2028}. The model checkpoints with the best average NDCG@10 on the validation set are selected for evaluation on the test set, and we report these results. Table 7. Hyperparameter settings of ActionPiece for each dataset. |
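The evaluation protocol quoted above (leave-last-out splitting, Recall@K and NDCG@K) is standard and can be sketched concisely. The sketch below is our own minimal illustration, not code from the paper's repository; with a single held-out target item per user, NDCG@K reduces to 1/log2(rank + 1) when the target falls in the top K.

```python
import math

def leave_last_out(seq):
    """Leave-last-out protocol: the last item is the test target,
    the second-to-last is the validation target, and the remaining
    prefix is the training history."""
    assert len(seq) >= 3, "need at least 3 interactions per user"
    return seq[:-2], seq[-2], seq[-1]

def recall_at_k(ranked, target, k):
    """1.0 if the held-out target appears among the top-k ranked items."""
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(ranked, target, k):
    """With exactly one relevant item, NDCG@k is 1/log2(rank + 1)
    if the target is ranked within the top k, and 0 otherwise."""
    if target in ranked[:k]:
        rank = ranked.index(target) + 1  # 1-based rank
        return 1.0 / math.log2(rank + 1)
    return 0.0

history, val_target, test_target = leave_last_out([10, 4, 7, 3, 9])
# history == [10, 4, 7], val_target == 3, test_target == 9
```

Per-user scores are averaged over all test users to produce the reported Recall@K and NDCG@K.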
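The pseudocode row mentions an efficient vocabulary construction loop that uses a max-heap with lazy updates to track and merge frequent token pairs (Figure 7 of the paper). A minimal BPE-style sketch of that idea is shown below; this is an illustration under our own simplifying assumptions (string tokens, global recounting after each merge), not the paper's actual algorithm, which merges feature pairs within action sets and updates only the affected counts.

```python
import heapq
from collections import Counter

def build_vocab(corpus, num_merges):
    """Repeatedly merge the most frequent adjacent token pair.
    A max-heap (simulated with negated counts) with lazy updates
    tracks candidate pairs: entries whose stored count no longer
    matches the current count are simply discarded when popped,
    instead of being updated in place."""
    seqs = [list(s) for s in corpus]

    def pair_counts():
        c = Counter()
        for s in seqs:
            for a, b in zip(s, s[1:]):
                c[(a, b)] += 1
        return c

    counts = pair_counts()
    heap = [(-n, p) for p, n in counts.items()]
    heapq.heapify(heap)
    merges = []

    while heap and len(merges) < num_merges:
        neg, pair = heapq.heappop(heap)
        if counts.get(pair, 0) != -neg:
            continue  # stale heap entry: lazily skip it
        merges.append(pair)
        new_tok = pair[0] + pair[1]
        for i, s in enumerate(seqs):
            out, j = [], 0
            while j < len(s):
                if j + 1 < len(s) and (s[j], s[j + 1]) == pair:
                    out.append(new_tok)
                    j += 2
                else:
                    out.append(s[j])
                    j += 1
            seqs[i] = out
        # For brevity we recount globally and push fresh entries;
        # an efficient version updates only counts touched by the merge.
        counts = pair_counts()
        for p, n in counts.items():
            heapq.heappush(heap, (-n, p))
    return merges, seqs

merges, seqs = build_vocab(["abab", "abc"], num_merges=1)
# merges == [('a', 'b')], seqs == [['ab', 'ab'], ['ab', 'c']]
```

The lazy-update trick avoids the cost of deleting or rewriting heap entries whose counts change after a merge, which is what makes heap-based pair tracking practical at vocabulary-construction scale.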