AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors

Authors: Ruoxuan Feng, Jiangyu Hu, Wenke Xia, Tianci Gao, Ao Shen, Yuhao Sun, Bin Fang, Di Hu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct both quantitative and qualitative experiments to analyze the transferability of multi-sensor data and assess the impact of our framework on the multi-sensor representation space. Building on this, we comprehensively evaluate the static and dynamic tactile perception capabilities of AnyTouch across various tactile datasets and through a real-world experiment: fine-grained pouring. The experimental results demonstrate the static and dynamic perception abilities and cross-sensor transferability of AnyTouch.
Researcher Affiliation | Academia | 1 Renmin University of China, 2 Wuhan University of Science and Technology, 3 Beijing University of Posts and Telecommunications
Pseudocode | No | The paper describes the framework components and training paradigm in text and diagrams (e.g., Figure 2) but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | The code, TacQuad dataset and AnyTouch model are fully available at gewu-lab.github.io/AnyTouch/.
Open Datasets | Yes | The code, TacQuad dataset and AnyTouch model are fully available at gewu-lab.github.io/AnyTouch/. We use 9 different tactile datasets for training, including: Touch and Go (TAG) (Yang et al., 2022), VisGel (Li et al., 2019), Cloth (Yuan et al., 2018), ObjectFolder Real (OF Real) (Gao et al., 2023), TVL (Fu et al., 2024), YCB-Slide (Suresh et al., 2023), SSVTP (Kerr et al., 2022), Octopi (Yu et al., 2024), and the coarse-grained subset of our TacQuad.
Dataset Splits | Yes | We follow the data split in (Yang et al., 2024; Cheng et al., 2024) for Feel. ObjectFolder 1.0 and ObjectFolder 2.0 are two simulated object datasets using TACTO (Wang et al., 2022a) and Taxim (Si & Yuan, 2022). We use them as unseen datasets from unseen sensors, and follow the data split in Yang et al. (2024).
Hardware Specification | Yes | We train the first stage for 20 epochs and the second stage for 12 epochs on 4 NVIDIA A800 GPUs.
Software Dependencies | No | We base our encoders on OpenCLIP-Large (Cherti et al., 2023). No explicit version numbers are provided for software components such as Python, PyTorch, CUDA, or OpenCLIP-Large.
Experiment Setup | Yes | We use the AdamW (Loshchilov, 2017) optimizer with a learning rate of 2e-4. After a warm-up period of 1 epoch, we implement linear learning rate decay. For each tactile video clip, we use T = 3 frames. We train the first stage for 20 epochs and the second stage for 12 epochs... We use a mask ratio ρ = 0.75. During the alignment, we use the text modality as the anchor, freezing the text encoder while performing LoRA fine-tuning on the vision encoder. We set the alignment strength αTV = αTL = 1.0 and αVL = 0.2, and set the weight of cross-sensor matching λ = 0.1. Following (Yang et al., 2024), we use L = 5 sensor tokens for each type of sensor. In both stages, we set the probability of using universal sensor tokens pu to increase linearly from 0 to 0.75.
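The learning-rate schedule quoted above (AdamW at 2e-4, a 1-epoch linear warm-up, then linear decay) can be sketched as a small helper. This is a minimal illustration, not the authors' code: the function name `lr_at_epoch`, the per-epoch granularity, and decaying all the way to zero are our assumptions; the base rate, warm-up length, and 20-epoch first stage come from the paper.

```python
# Hedged sketch of the reported schedule: AdamW base lr 2e-4,
# 1 warm-up epoch, then linear decay. Constants from the paper's
# first training stage; everything else is an illustrative assumption.

BASE_LR = 2e-4
WARMUP_EPOCHS = 1
TOTAL_EPOCHS = 20  # first-stage epoch count reported in the paper


def lr_at_epoch(epoch: int) -> float:
    """Return the learning rate used during the given (0-indexed) epoch."""
    if epoch < WARMUP_EPOCHS:
        # linear ramp from 0 up to BASE_LR across the warm-up period
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    # linear decay toward 0 over the remaining epochs
    remaining = TOTAL_EPOCHS - epoch
    return BASE_LR * max(remaining, 0) / (TOTAL_EPOCHS - WARMUP_EPOCHS)


if __name__ == "__main__":
    for e in range(TOTAL_EPOCHS):
        print(f"epoch {e:2d}: lr = {lr_at_epoch(e):.2e}")
```

In a PyTorch setup, the same per-epoch multiplier would typically be handed to `torch.optim.lr_scheduler.LambdaLR` wrapped around an `AdamW` optimizer.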