Learning the RoPEs: Better 2D and 3D Position Encodings with STRING

Authors: Connor Schenck, Isaac Reid, Mithun George Jacob, Alex Bewley, Joshua Ainslie, David Rendleman, Deepali Jain, Mohit Sharma, Kumar Avinava Dubey, Ayzaan Wahid, Sumeet Singh, René Wagner, Tianli Ding, Chuyuan Fu, Arunkumar Byravan, Jake Varley, Alexey A. Gritsenko, Matthias Minderer, Dmitry Kalashnikov, Jonathan Tompson, Vikas Sindhwani, Krzysztof Marcin Choromanski

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We complement our experiments with a rigorous mathematical analysis, proving the universality of our methods. Videos of STRING-based robotics controllers can be found here. We introduce STRING, a new family of position encodings for multidimensional token coordinates that respect both separability and translational invariance. We rigorously analyse STRING's theoretical properties (Sec. 3), proving that it is more general than RoPE. We provide computationally efficient implementations. We show strong accuracy gains across varied models using Transformers with STRING, on a range of robotics and general vision tasks (see Fig. 1 and Sec. 4).
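The core property claimed above, translational invariance of a multidimensional rotary-style encoding, can be illustrated with a small sketch. The snippet below uses antisymmetric circulant generators (in the spirit of the Circulant-STRING variant mentioned later in this report); the generator sizes, values, and helper names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import circulant, expm

def antisym_circulant(values, n):
    # Build an antisymmetric circulant generator: first column c with
    # c[0] = 0 and c[n-k] = -c[k], so the resulting matrix L satisfies
    # L.T == -L. (Illustrative construction, not the paper's code.)
    c = np.zeros(n)
    for k, v in enumerate(values, start=1):
        c[k] = v
        c[n - k] = -v
    return circulant(c)

rng = np.random.default_rng(0)
n = 8  # toy embedding dimension
L1 = antisym_circulant(rng.normal(size=3), n)  # generator for coordinate 1
L2 = antisym_circulant(rng.normal(size=3), n)  # generator for coordinate 2

def encode(x):
    # Position encoding for 2D coordinate x: an orthogonal matrix
    # exp(x[0]*L1 + x[1]*L2). Circulant generators commute, which is
    # what makes the relative-position property below hold.
    return expm(x[0] * L1 + x[1] * L2)

x, y = np.array([1.3, -0.7]), np.array([0.2, 2.1])
Rx, Ry = encode(x), encode(y)

# Orthogonality: exp of an antisymmetric matrix is orthogonal.
assert np.allclose(Rx.T @ Rx, np.eye(n))
# Translational invariance: the attention logit q^T Rx^T Ry k
# depends only on the relative displacement y - x.
assert np.allclose(Rx.T @ Ry, encode(y - x))
```

RoPE is recovered as the special case where the generators are block-diagonal 2x2 rotations; general commuting antisymmetric generators give the larger family.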
Researcher Affiliation | Collaboration | 1Google DeepMind, 2University of Cambridge, 3Google Research. Correspondence to: Krzysztof Choromanski <EMAIL>.
Pseudocode | No | The paper describes mathematical definitions and theoretical proofs, as well as experimental setups and results, but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | No | Videos of STRING-based robotics controllers can be found here: https://sites.google.com/view/string-robotics. This link points to videos, not source code for the methodology.
Open Datasets | Yes | We tested STRING for image classification tasks on the ImageNet2012 (Deng et al., 2009) and Places365 datasets... Next, we lift WebLI (Chen et al., 2023)... We demonstrate the efficacy of STRING on open-vocabulary object detection for localizing a 2D bounding box on standard RGB image benchmarks... We train on a simulated dataset of 4 million images of indoor and tabletop scenes with ground-truth 3D bounding box labels (Lin et al., 2025)... We utilize the dataset described in (Lin et al., 2025), which we briefly describe here. We use open-sourced 3D assets, specifically a subset of assets from the Amazon Berkeley Objects (ABO) dataset (Collins et al., 2022) for background and tabletop objects, and the YCB (Calli et al., 2015) and Google Scanned Objects (Downs et al., 2022) datasets for tabletop clutter placement.
Dataset Splits | Yes | We hold out 80 images for evaluation. We evaluate both ViT and ViTD variants... We generated four different 1-million-image datasets using the procedure above... We held out the first 20 images for each of these datasets for evaluation, and used the rest for training.
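The split protocol quoted above (a fixed prefix of each dataset held out for evaluation, the remainder used for training) can be sketched as follows; the function name and signature are assumptions for illustration, not the authors' code.

```python
def split_first_n_holdout(examples, n_eval=20):
    """Deterministic split matching the described protocol: the first
    n_eval examples are held out for evaluation and the remainder is
    used for training. (Illustrative sketch only.)"""
    return examples[n_eval:], examples[:n_eval]

# Toy usage: 100 example IDs -> 80 for training, first 20 for evaluation.
train, evals = split_first_n_holdout(list(range(100)), n_eval=20)
```

A fixed-prefix split like this is trivially reproducible, since it needs no stored random seed.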
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types) used for running the experiments. It mentions using a real industrial KUKA robot arm for some experiments, but not the computing hardware for model training.
Software Dependencies | No | The paper mentions several software tools and platforms used or referenced, such as Unity (Unity Technologies, 2023) for rendering, but does not provide specific version numbers for key libraries, frameworks, or programming languages used in their methodology to ensure reproducibility.
Experiment Setup | Yes | For ImageNet2012 and Places365, we used 224×224 image resolution, batch size 4096, and trained for 300 epochs. Training used the cosine decay learning rate schedule with 0.001 base learning rate and 10,000 warm-up steps. For ImageNet2012 there were a total of about 94k training steps, and for Places365 about 130k. For Circulant-STRING, block size 16 yielded the best results. All experiments were trained from scratch and used 256×256 image resolution... We trained with batch size 8192 for 20 epochs, amounting to about 155k training steps using the SigLIP (Zhai et al., 2023) pretraining setup. Training used the cosine decay learning rate schedule with 0.001 base learning rate and 5% warm-up steps (about 8k). For Circulant-STRING, block size 32 yielded the best results... We train with a batch size of 1,024 for 250k iterations with an initial learning rate of 1e-4... The policy is trained with the Adam optimizer with 1e-4 learning rate and 1e-4 weight decay. We use a linear learning-rate warm-up for the first 10,000 steps of training. The policy is trained for a total of 500,000 steps with batch size 256.
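The learning-rate schedule described above (linear warm-up followed by cosine decay) can be sketched as a plain function. The base rate (0.001), warm-up length (10,000 steps), and total steps (~94k) mirror the ImageNet2012 numbers quoted in the paper; the function name and the assumption that the schedule decays to zero are illustrative, not confirmed by the source.

```python
import math

def lr_schedule(step, base_lr=1e-3, warmup_steps=10_000, total_steps=94_000):
    """Linear warm-up to base_lr, then cosine decay to zero.
    (Minimal sketch of the schedule described in the setup; the decay
    floor of zero is an assumption.)"""
    if step < warmup_steps:
        # Linear ramp: 0 -> base_lr over the warm-up period.
        return base_lr * step / warmup_steps
    # Cosine decay: base_lr -> 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With these defaults the rate peaks at 1e-3 exactly at step 10,000 and reaches zero at step 94,000; the other setups quoted above use the same shape with different constants.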