ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation

Authors: Yupeng Hou, Jianmo Ni, Zhankui He, Noveen Sachdeva, Wang-Cheng Kang, Ed H. Chi, Julian McAuley, Derek Zhiyuan Cheng

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use three categories from the Amazon Reviews dataset (McAuley et al., 2015) for our experiments: Sports and Outdoors (Sports), Beauty (Beauty), and CDs and Vinyl (CDs). We compare the performance of ActionPiece against several baseline methods, and we use Recall@K and NDCG@K as metrics to evaluate them.
Researcher Affiliation | Collaboration | 1University of California, San Diego; 2Google DeepMind. Correspondence to: Yupeng Hou and Jianmo Ni <EMAIL, EMAIL>.
Pseudocode | Yes | Algorithm 1: ActionPiece Vocabulary Construction; Algorithm 2: ActionPiece Vocabulary Construction, Count (Figure 2); Algorithm 3: ActionPiece Vocabulary Construction, Update (Figure 3); Algorithm 4: Segmentation via Set Permutation Regularization (SPR) (Section 3.2.2); Figure 7: Pseudocode for a single iteration of the efficient vocabulary construction algorithm, illustrating how a max-heap with lazy updates is used to track and merge frequent token pairs.
Open Source Code | Yes | Our code is available at: https://github.com/google-deepmind/action_piece.
Open Datasets | Yes | We use three categories from the Amazon Reviews dataset (McAuley et al., 2015) for our experiments: Sports and Outdoors (Sports), Beauty (Beauty), and CDs and Vinyl (CDs).
Dataset Splits | Yes | To evaluate the models, we adopt the widely used leave-last-out protocol (Kang & McAuley, 2018; Zhao et al., 2022; Rajput et al., 2023), where the last item and second-to-last item in each action sequence are used for testing and validation, respectively.
Hardware Specification | Yes | Each model is trained on a single 40GB NVIDIA A100 GPU.
Software Dependencies | No | We implement BERT4Rec, SASRec, FDSA, and S3-Rec using the open-source recommendation library RecBole (Zhao et al., 2021). For other methods, we implement them ourselves with Hugging Face Transformers (Wolf et al., 2020) and PyTorch (Paszke et al., 2019). We use FAISS (Douze et al., 2024) to quantize sentence representations.
Experiment Setup | Yes | We train the GR models from scratch for up to 200 epochs, using early stopping if the model does not achieve a better NDCG@10 on the validation set for 20 consecutive epochs. The training batch size is set to 256. The learning rate is selected from {1e-3, 3e-3, 5e-3} with a warmup step of 10,000. We use a dropout rate of 0.1 and tune the weight decay from {0.07, 0.1, 0.15, 0.2}. For all methods implemented by us, we conduct five repeated experiments using random seeds {2024, 2025, 2026, 2027, 2028}. The model checkpoints with the best average NDCG@10 on the validation set are selected for evaluation on the test set, and we report these results. Table 7. Hyperparameter settings of ActionPiece for each dataset.
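The evaluation row above cites Recall@K and NDCG@K. Since the leave-last-out protocol leaves a single ground-truth item per sequence, both metrics reduce to simple rank-based formulas. The sketch below is illustrative and not taken from the paper's codebase; function names are our own.

```python
import math

def recall_at_k(ranked_items, target, k):
    # With one ground-truth item, Recall@K is 1 if the target
    # appears in the top-k ranked list, else 0.
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k):
    # With one ground-truth item, NDCG@K reduces to 1 / log2(rank + 1),
    # where rank is the 1-based position of the target in the top-k list.
    for idx, item in enumerate(ranked_items[:k]):
        if item == target:
            return 1.0 / math.log2(idx + 2)
    return 0.0

# Example: target item 1 is ranked second.
print(recall_at_k([3, 1, 2], target=1, k=2))  # 1.0
print(ndcg_at_k([3, 1, 2], target=1, k=10))   # 1/log2(3) ~ 0.6309
```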
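The leave-last-out protocol quoted in the Dataset Splits row can be stated concretely: the last item of each sequence is the test target, the second-to-last is the validation target, and earlier items form the training prefix. A minimal sketch (our own helper, not from the released code):

```python
def leave_last_out(seq):
    # seq is one user's chronologically ordered action sequence.
    # Returns the training prefix plus (input, target) pairs for
    # validation (second-to-last item) and testing (last item).
    train = seq[:-2]
    valid = (seq[:-2], seq[-2])
    test = (seq[:-1], seq[-1])
    return train, valid, test

train, valid, test = leave_last_out([10, 20, 30, 40])
# train -> [10, 20]; valid -> ([10, 20], 30); test -> ([10, 20, 30], 40)
```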
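The Experiment Setup row describes early stopping: training runs up to 200 epochs and halts when validation NDCG@10 fails to improve for 20 consecutive epochs. A generic patience loop capturing that rule (the `evaluate` callback is a placeholder, not the paper's training code):

```python
def train_with_early_stopping(evaluate, max_epochs=200, patience=20):
    """Run up to max_epochs; stop once `patience` consecutive epochs
    pass without a new best validation score.

    evaluate(epoch) should train one epoch and return the validation
    metric (e.g. NDCG@10). Returns (best_score, best_epoch)."""
    best, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        score = evaluate(epoch)
        if score > best:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best, best_epoch
```

In the paper's setting, the checkpoint from the best epoch would then be evaluated on the test set.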
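The Pseudocode row mentions a max-heap with lazy updates for tracking and merging frequent token pairs during vocabulary construction (Figure 7). The sketch below shows that heap pattern on plain BPE-style merging over flat token sequences; it does not reproduce ActionPiece's set-structured actions or its incremental count updates, and all names are our own.

```python
import heapq
from collections import Counter

def build_vocab(sequences, num_merges):
    """BPE-style sketch: repeatedly merge the most frequent adjacent
    token pair. A max-heap (negated counts) with lazy updates tracks
    frequencies: stale heap entries are discarded when popped rather
    than updated in place."""
    seqs = [list(s) for s in sequences]

    def pair_counts():
        c = Counter()
        for s in seqs:
            for a, b in zip(s, s[1:]):
                c[(a, b)] += 1
        return c

    counts = pair_counts()
    # repr(pair) acts as a tie-breaker so heap entries always compare.
    heap = [(-n, repr(p), p) for p, n in counts.items()]
    heapq.heapify(heap)

    merges = []
    while heap and len(merges) < num_merges:
        neg, _, pair = heapq.heappop(heap)
        if counts.get(pair) != -neg:  # stale entry: lazy invalidation
            continue
        merges.append(pair)
        # Replace every occurrence of the pair with one merged token.
        for i, s in enumerate(seqs):
            out, j = [], 0
            while j < len(s):
                if j + 1 < len(s) and (s[j], s[j + 1]) == pair:
                    out.append(pair)  # merged token represented as the pair
                    j += 2
                else:
                    out.append(s[j])
                    j += 1
            seqs[i] = out
        # Recount from scratch for clarity; an efficient version (as in
        # the paper's Figure 7) updates counts incrementally instead.
        counts = pair_counts()
        for p, n in counts.items():
            heapq.heappush(heap, (-n, repr(p), p))
    return merges
```

Lazy invalidation avoids the cost of deleting or rewriting arbitrary heap entries: outdated entries simply fail the count check when popped.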