AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors
Authors: Ruoxuan Feng, Jiangyu Hu, Wenke Xia, Tianci Gao, Ao Shen, Yuhao Sun, Bin Fang, Di Hu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct both quantitative and qualitative experiments to analyze the transferability of multi-sensor data and assess the impact of our framework on the multi-sensor representation space. Building on this, we comprehensively evaluate the static and dynamic tactile perception capabilities of AnyTouch across various tactile datasets and through a real-world experiment: fine-grained pouring. The experimental results demonstrate the static and dynamic perception abilities and cross-sensor transferability of AnyTouch. |
| Researcher Affiliation | Academia | 1 Renmin University of China, 2 Wuhan University of Science and Technology, 3 Beijing University of Posts and Telecommunications |
| Pseudocode | No | The paper describes the framework components and training paradigm in text and diagrams (e.g., Figure 2) but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code, TacQuad dataset and AnyTouch model are fully available at gewu-lab.github.io/AnyTouch/. |
| Open Datasets | Yes | The code, TacQuad dataset and AnyTouch model are fully available at gewu-lab.github.io/AnyTouch/. We use 9 different tactile datasets for training: Touch and Go (TAG) (Yang et al., 2022), VisGel (Li et al., 2019), Cloth (Yuan et al., 2018), ObjectFolder Real (OFReal) (Gao et al., 2023), TVL (Fu et al., 2024), YCB-Slide (Suresh et al., 2023), SSVTP (Kerr et al., 2022), Octopi (Yu et al., 2024), and the coarse-grained subset of our TacQuad. |
| Dataset Splits | Yes | We follow the data split in (Yang et al., 2024; Cheng et al., 2024) for Feel. ObjectFolder 1.0 and ObjectFolder 2.0 are two simulated object datasets using TACTO (Wang et al., 2022a) and Taxim (Si & Yuan, 2022). We use them as unseen datasets from unseen sensors, and follow the data split in Yang et al. (2024). |
| Hardware Specification | Yes | We train the first stage for 20 epochs and the second stage for 12 epochs on 4 NVIDIA A800 GPUs. |
| Software Dependencies | No | We base our encoders on OpenCLIP-Large (Cherti et al., 2023). No explicit version numbers are provided for software components such as Python, PyTorch, CUDA, or OpenCLIP. |
| Experiment Setup | Yes | We use the AdamW (Loshchilov, 2017) optimizer with a learning rate of 2e-4. After a warm-up period of 1 epoch, we implement linear learning rate decay. For each tactile video clip, we use T = 3 frames. We train the first stage for 20 epochs and the second stage for 12 epochs... We use a mask ratio ρ = 0.75. During the alignment, we use the text modality as the anchor, freezing the text encoder while performing LoRA fine-tuning on the vision encoder. We set the alignment strength αTV = αTL = 1.0 and αVL = 0.2, and set the weight of cross-sensor matching λ = 0.1. Following (Yang et al., 2024), we use L = 5 sensor tokens for each type of sensor. In both stages, we set the probability of using universal sensor tokens pu to increase linearly from 0 to 0.75. |
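The schedules quoted in the Experiment Setup row can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names (`lr_at_epoch`, `universal_token_prob`) are hypothetical, and it assumes the linear warm-up/decay and the pu ramp are computed per epoch over the reported stage-one length of 20 epochs.

```python
def lr_at_epoch(epoch: int, base_lr: float = 2e-4,
                warmup_epochs: int = 1, total_epochs: int = 20) -> float:
    """Learning rate under the reported schedule: linear warm-up for
    `warmup_epochs`, then linear decay toward zero (per-epoch granularity
    assumed here)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr * max(0.0, (total_epochs - epoch) / (total_epochs - warmup_epochs))


def universal_token_prob(epoch: int, total_epochs: int = 20,
                         p_max: float = 0.75) -> float:
    """Probability pu of using universal sensor tokens, ramped linearly
    from 0 at the first epoch to p_max at the last epoch."""
    return p_max * epoch / max(1, total_epochs - 1)


# Reported loss-combination weights (second stage); the weighted sum is an
# illustrative reading of the table, not a verified implementation detail.
alpha_TV = alpha_TL = 1.0   # tactile-vision / tactile-language alignment
alpha_VL = 0.2              # vision-language alignment
lam = 0.1                   # cross-sensor matching weight
mask_ratio = 0.75           # masked-modeling ratio rho
```

Under these assumptions, `lr_at_epoch(0)` returns the peak rate 2e-4 and the rate then declines linearly, while `universal_token_prob` reaches 0.75 only at the final epoch.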