Vintix: Action Model via In-Context Reinforcement Learning

Authors: Andrei Polubarov, Nikita Lyubaykin, Alexander Derevyagin, Ilya Zisman, Denis Tarasov, Alexander Nikulin, Vladislav Kurenkov

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our results demonstrate that Algorithm Distillation, a framework designed to facilitate ICRL, offers a compelling and competitive alternative to expert distillation to construct versatile action models. These findings highlight the potential of ICRL as a scalable approach for generalist decision-making systems. ... We empirically demonstrate that the proposed model, Vintix, can self-correct to attain demonstrator-level performance on training tasks (Figure 1) and adapt to controlled parametric task variations at inference-time.
Researcher Affiliation Academia ¹AIRI, ²Skoltech, ³Research Center for Trusted Artificial Intelligence, ISP RAS, ⁴Innopolis University, ⁵HSE, ⁶MIPT. Correspondence to: Vladislav Kurenkov <EMAIL>. Work done by dunnolab.ai.
Pseudocode Yes Algorithm 1 Noise distillation for continuous action spaces
Require: Demonstrator policy π_D, task environment, noise schedule E, number of time steps in the trajectory T, trajectory buffer D, action space lower and upper bounds a_min, a_max
1: Sample s_0 from task environment
2: for i = 0 to T do
3:   Noise magnitude: ε_i = E(i)
4:   Noise: u ∼ Uniform(a_min, a_max)
5:   Current action: a_i = (1 − ε_i) · π_D(s_i) + ε_i · u
6:   Obtain {s_{i+1}, r_i, t_i} by executing a_i in task environment
7:   Append {s_i, a_i, s_{i+1}, r_i, t_i} to D
8: end for
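Algorithm 1 can be sketched in Python as follows (a minimal illustration only: the `ToyEnv`, demonstrator policy, and noise schedule below are stand-ins, not the authors' implementation):

```python
import numpy as np

def noise_distillation(policy, env, noise_schedule, T, a_min, a_max, rng):
    """Collect one trajectory mixing demonstrator actions with uniform
    noise, as in Algorithm 1 (env/policy interfaces are illustrative)."""
    buffer = []
    s = env.reset()                            # sample s_0 from the task environment
    for i in range(T):
        eps = noise_schedule(i)                # noise magnitude eps_i = E(i)
        u = rng.uniform(a_min, a_max)          # uniform noise within action bounds
        a = (1.0 - eps) * policy(s) + eps * u  # convex mix of policy action and noise
        s_next, r, done = env.step(a)          # execute a_i in the environment
        buffer.append((s, a, s_next, r, done)) # append transition to the buffer D
        s = s_next
    return buffer

# Toy 1-D environment and demonstrator, for demonstration purposes only.
class ToyEnv:
    def reset(self):
        self.s = 0.0
        return self.s
    def step(self, a):
        self.s += float(a)
        return self.s, -abs(self.s), False

rng = np.random.default_rng(0)
traj = noise_distillation(
    policy=lambda s: -0.5 * s,          # toy demonstrator pulls the state toward 0
    env=ToyEnv(),
    noise_schedule=lambda i: i / 10.0,  # noise magnitude grows along the trajectory
    T=10, a_min=-1.0, a_max=1.0, rng=rng,
)
print(len(traj))  # 10 transitions collected
```

Note the convex combination on the action line: early steps follow the demonstrator closely, while later steps are increasingly noisy, which is what lets the resulting data exhibit improvement within a trajectory.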
Open Source Code No Code to be released at dunnolab/vintix.
Open Datasets Yes Open Tools and Datasets for ICRL (Section 2.2): We publicly release datasets for 87 tasks across four domains (Meta-World, MuJoCo, Bi-DexHands, Industrial-Benchmark), along with data collection tools and instrumentation to support the development of action models eliciting ICRL behavior.
Dataset Splits Yes To validate the inference-time optimization capability of our model, we divided the overall set of 102 tasks into two disjoint subsets. The validation subset was excluded from the training dataset. Below, we provide details of the split for each domain.
A.2.2. META-WORLD: The standard ML45 split was selected, with 45 tasks assigned to the training set and 5 tasks reserved for validation: bin-picking, box-close, door-lock, door-unlock, and hand-insert.
A.2.3. BI-DEXHANDS: We adopted the ML20 benchmark setting proposed by the original authors (Chen et al., 2022), in which 15 tasks are assigned to the training set, while 5 tasks are reserved for validation: door-close-outward, door-open-inward, door-open-outward, hand-kettle, and hand-over.
A.2.4. INDUSTRIAL-BENCHMARK: For this domain, a global split based on the setpoint parameter was made, with setpoints ranging from 0 to 75 assigned to the training set, and setpoints from 80 to 100 assigned to the validation set.
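The Industrial-Benchmark split is easy to express programmatically. A sketch, under the assumption (not stated in the quote above) that setpoints lie on a grid of multiples of 5, which would yield 5 validation tasks and match the other domains:

```python
# Split Industrial-Benchmark tasks by setpoint: 0..75 -> training,
# 80..100 -> validation. The step of 5 is an assumption.
setpoints = list(range(0, 101, 5))          # 0, 5, ..., 100
train = [p for p in setpoints if p <= 75]   # training setpoints
val   = [p for p in setpoints if p >= 80]   # held-out validation setpoints

print(len(train), len(val))  # 16 5
```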
Hardware Specification Yes Training is conducted on 8 H100 GPUs with a batch size of 64 and 2 gradient accumulation steps.
Software Dependencies No The paper mentions using "Flash Attention library (Dao et al., 2022)" and "Stable-Baselines3 (Raffin et al., 2021)" for training demonstrators, but it does not specify exact version numbers for these software libraries.
Experiment Setup Yes Training is conducted on 8 H100 GPUs with a batch size of 64 and 2 gradient accumulation steps. The input sequence length L is set to 8192. For more detailed hyperparameter information, refer to Appendix D. (Appendix D, Table 2 lists: Learning Rate 0.0003, Optimizer Adam, Beta 1 0.9, Beta 2 0.99, Batch Size 64, Gradient Accumulation Steps 2, Transformer Layers 20, Transformer Heads 16, Context Length 8192, Transformer Hidden Dim 1024, FF Hidden Size 4096, MLP Type GPT-NeoX, MLP Normalization Type LayerNorm, Training Precision bf16, Parameters 332,100,768)
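For convenience, the Table 2 hyperparameters can be collected into a single config object (field names below are illustrative, not the authors'; values are as reported):

```python
# Training configuration from Appendix D, Table 2 of the Vintix paper.
config = {
    "learning_rate": 3e-4,
    "optimizer": "Adam",
    "betas": (0.9, 0.99),
    "batch_size": 64,
    "grad_accum_steps": 2,
    "n_layers": 20,
    "n_heads": 16,
    "context_length": 8192,
    "hidden_dim": 1024,
    "ff_hidden_dim": 4096,
    "mlp_type": "GPT-NeoX",
    "norm": "LayerNorm",
    "precision": "bf16",
    "n_params": 332_100_768,
}

# Batch size times gradient accumulation steps (whether the reported
# batch size is per-device or global is not specified in the paper).
effective_batch = config["batch_size"] * config["grad_accum_steps"]
print(effective_batch)  # 128
```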