Learning the RoPEs: Better 2D and 3D Position Encodings with STRING

Authors: Connor Schenck, Isaac Reid, Mithun George Jacob, Alex Bewley, Joshua Ainslie, David Rendleman, Deepali Jain, Mohit Sharma, Kumar Avinava Dubey, Ayzaan Wahid, Sumeet Singh, René Wagner, Tianli Ding, Chuyuan Fu, Arunkumar Byravan, Jake Varley, Alexey A. Gritsenko, Matthias Minderer, Dmitry Kalashnikov, Jonathan Tompson, Vikas Sindhwani, Krzysztof Marcin Choromanski

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We complement our experiments with a rigorous mathematical analysis, proving the universality of our methods. Videos of STRING-based robotics controllers can be found here. We introduce STRING, a new family of position encodings for multidimensional token coordinates that respect both separability and translational invariance. We rigorously analyse STRING's theoretical properties (Sec. 3), proving that it is more general than RoPE. We provide computationally efficient implementations. We show strong accuracy gains across varied models using Transformers with STRING, on a range of robotics and general vision tasks (see Fig. 1 and Sec. 4).
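The core property claimed above, translational invariance of a multidimensional rotary-style encoding, can be illustrated with a small sketch. The snippet below uses antisymmetric circulant generators (in the spirit of the Circulant-STRING variant mentioned later in this report); the generator sizes, values, and helper names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import circulant, expm

def antisym_circulant(values, n):
    # Build an antisymmetric circulant generator: first column c with
    # c[0] = 0 and c[n-k] = -c[k], so the resulting matrix L satisfies
    # L.T == -L. (Illustrative construction, not the paper's code.)
    c = np.zeros(n)
    for k, v in enumerate(values, start=1):
        c[k] = v
        c[n - k] = -v
    return circulant(c)

rng = np.random.default_rng(0)
n = 8  # toy embedding dimension
L1 = antisym_circulant(rng.normal(size=3), n)  # generator for coordinate 1
L2 = antisym_circulant(rng.normal(size=3), n)  # generator for coordinate 2

def encode(x):
    # Position encoding for 2D coordinate x: an orthogonal matrix
    # exp(x[0]*L1 + x[1]*L2). Circulant generators commute, which is
    # what makes the relative-position property below hold.
    return expm(x[0] * L1 + x[1] * L2)

x, y = np.array([1.3, -0.7]), np.array([0.2, 2.1])
Rx, Ry = encode(x), encode(y)

# Orthogonality: exp of an antisymmetric matrix is orthogonal.
assert np.allclose(Rx.T @ Rx, np.eye(n))
# Translational invariance: the attention logit q^T Rx^T Ry k
# depends only on the relative displacement y - x.
assert np.allclose(Rx.T @ Ry, encode(y - x))
```

RoPE is recovered as the special case where the generators are block-diagonal 2x2 rotations; general commuting antisymmetric generators give the larger family.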
Researcher Affiliation | Collaboration | 1Google DeepMind, 2University of Cambridge, 3Google Research. Correspondence to: Krzysztof Choromanski <EMAIL>.
Pseudocode | No | The paper describes mathematical definitions and theoretical proofs, as well as experimental setups and results, but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | No | Videos of STRING-based robotics controllers can be found here: https://sites.google.com/view/string-robotics. This link points to videos, not source code for the methodology.
Open Datasets | Yes | We tested STRING for image classification tasks on the ImageNet2012 (Deng et al., 2009) and Places365 datasets... Next, we lift WebLI (Chen et al., 2023)... We demonstrate the efficacy of STRING on open-vocabulary object detection for localizing a 2D bounding box on standard RGB image benchmarks... We train on a simulated dataset of 4 million images of indoor and tabletop scenes with ground-truth 3D bounding box labels (Lin et al., 2025)... We utilize the dataset described in (Lin et al., 2025), which we briefly describe here. We use open-sourced 3D assets, specifically a subset of assets from the Amazon Berkeley Objects (ABO) dataset (Collins et al., 2022) for background and tabletop objects, and the YCB (Calli et al., 2015) and Google Scanned Objects (Downs et al., 2022) datasets for tabletop clutter placement.
Dataset Splits | Yes | We hold out 80 images for evaluation. We evaluate both ViT and ViTD variants... We generated four different 1-million-image datasets using the procedure above... We held out the first 20 images for each of these datasets for evaluation, and used the rest for training.
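The split protocol quoted above (a fixed prefix of each dataset held out for evaluation, the remainder used for training) can be sketched as follows; the function name and signature are assumptions for illustration, not the authors' code.

```python
def split_first_n_holdout(examples, n_eval=20):
    """Deterministic split matching the described protocol: the first
    n_eval examples are held out for evaluation and the remainder is
    used for training. (Illustrative sketch only.)"""
    return examples[n_eval:], examples[:n_eval]

# Toy usage: 100 example IDs -> 80 for training, first 20 for evaluation.
train, evals = split_first_n_holdout(list(range(100)), n_eval=20)
```

A fixed-prefix split like this is trivially reproducible, since it needs no stored random seed.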
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types) used for running the experiments. It mentions using a real industrial KUKA robot arm for some experiments, but not the computing hardware for model training.
Software Dependencies | No | The paper mentions several software tools and platforms used or referenced, such as Unity (Unity Technologies, 2023) for rendering, but does not provide specific version numbers for key libraries, frameworks, or programming languages used in their methodology to ensure reproducibility.
Experiment Setup | Yes | For ImageNet2012 and Places365, we used 224×224 image resolution, batch size 4096, and trained for 300 epochs. Training used the cosine decay learning rate schedule with 0.001 base learning rate and 10,000 warm-up steps. For ImageNet2012 there were a total of about 94k training steps, and for Places365 about 130k. For Circulant-STRING, block size 16 yielded the best results. All experiments were trained from scratch and used 256×256 image resolution... We trained with batch size 8192 for 20 epochs, amounting to about 155k training steps using the SigLIP (Zhai et al., 2023) pretraining setup. Training used the cosine decay learning rate schedule with 0.001 base learning rate and 5% warm-up steps (about 8k). For Circulant-STRING, block size 32 yielded the best results... We train with a batch size of 1,024 for 250k iterations with an initial learning rate of 1e-4... The policy is trained with the Adam optimizer with 1e-4 learning rate and 1e-4 weight decay. We use a linear learning-rate warm-up for the first 10,000 steps of training. The policy is trained for a total of 500,000 steps with batch size 256.
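The learning-rate schedule described above (linear warm-up followed by cosine decay) can be sketched as a plain function. The base rate (0.001), warm-up length (10,000 steps), and total steps (~94k) mirror the ImageNet2012 numbers quoted in the paper; the function name and the assumption that the schedule decays to zero are illustrative, not confirmed by the source.

```python
import math

def lr_schedule(step, base_lr=1e-3, warmup_steps=10_000, total_steps=94_000):
    """Linear warm-up to base_lr, then cosine decay to zero.
    (Minimal sketch of the schedule described in the setup; the decay
    floor of zero is an assumption.)"""
    if step < warmup_steps:
        # Linear ramp: 0 -> base_lr over the warm-up period.
        return base_lr * step / warmup_steps
    # Cosine decay: base_lr -> 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With these defaults the rate peaks at 1e-3 exactly at step 10,000 and reaches zero at step 94,000; the other setups quoted above use the same shape with different constants.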