Summary

Dexterous manipulation is a cornerstone capability for robotic systems aiming to interact with the physical world in a human-like manner. Although vision-based methods have advanced rapidly, tactile sensing remains crucial for fine-grained control, particularly in unstructured or visually occluded settings. We present ViTacFormer, a representation-learning approach that couples a cross-attention encoder, which fuses high-resolution vision and touch, with an autoregressive tactile-prediction head that anticipates future contact signals. Building on this architecture, we devise an easy-to-challenging curriculum that steadily refines the visuo-tactile latent space, boosting both accuracy and robustness. The learned cross-modal representation drives imitation learning for multi-fingered hands, enabling precise and adaptive manipulation. Across a suite of challenging real-world benchmarks, our method achieves approximately 50% higher success rates than prior state-of-the-art systems. To our knowledge, it is also the first to autonomously complete long-horizon dexterous manipulation tasks that demand highly precise control with an anthropomorphic hand, successfully executing up to 11 sequential stages and sustaining continuous operation for approximately 2.5 minutes.

Hardware Setup

Our hardware setup consists of two Realman robot arms, each fitted with a five-finger, 17-DoF SharpaWave dexterous hand (development version). Visual sensing combines wrist-mounted fisheye cameras for local views with a top-down ZED Mini stereo camera for global perception. Each fingertip carries a high-resolution (320×240) tactile sensor.
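
As a rough illustration of the resulting observation space, the Python sketch below groups the per-timestep sensor streams into one structure. Only the 320×240 fingertip resolution and the 17 hand DoF come from the setup above; the stream names, camera resolutions, and arm joint count are assumptions for exposition.

from dataclasses import dataclass, field
from typing import Dict, Tuple

# Illustrative grouping of the per-timestep observation streams for the bimanual rig.
# Only the 320x240 tactile resolution and the 17 hand DoF are taken from the setup above;
# all other names and shapes are assumptions.
@dataclass
class ObservationSpec:
    # Global top-down ZED Mini stereo view plus one wrist-mounted fisheye per arm (H, W, C).
    cameras: Dict[str, Tuple[int, int, int]] = field(default_factory=lambda: {
        "top_stereo": (720, 1280, 3),          # assumed resolution
        "left_wrist_fisheye": (480, 640, 3),   # assumed resolution
        "right_wrist_fisheye": (480, 640, 3),  # assumed resolution
    })
    # One tactile image per fingertip: 2 hands x 5 fingertips x 320 x 240 pixels.
    tactile: Tuple[int, int, int, int] = (2, 5, 320, 240)
    # Proprioception per hand (17 DoF) and per arm (joint count depends on the Realman model).
    hand_dof: int = 17
    arm_dof: int = 6  # assumed; adjust to the specific arm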

To collect high-quality visuo-tactile demonstrations, we use a custom exoskeleton-based teleoperation system. Operators wear exoskeleton gloves that are mechanically linked to the robot hands and receive immersive first-person feedback through a VR headset. The interface integrates the stereo top-down view, local wrist views, and real-time tactile overlays, enabling intuitive control in contact-rich tasks.

Framework: ViTacFormer

Method Pipeline

We propose ViTacFormer, a unified visuo-tactile framework for dexterous manipulation. At its core is a cross-modal representation built with cross-attention layers that fuse visual and tactile signals throughout the policy network. To enhance action relevance, ViTacFormer introduces a tactile-prediction head that encourages the latent space to capture meaningful touch dynamics: it autoregressively predicts future tactile feedback and leverages those predictions for action generation, moving beyond passive perception of the current touch signal.
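
For concreteness, the sketch below shows one way the two components could be wired together in PyTorch: a bidirectional cross-attention block that lets visual and tactile tokens attend to each other, and a small autoregressive head that rolls out future tactile features and conditions action generation on them. The module names, dimensions, and exact conditioning scheme are illustrative assumptions, not the released ViTacFormer implementation.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Bidirectional cross-attention between visual and tactile token sequences."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.vis_to_tac = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.tac_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, vis_tokens, tac_tokens):
        # Queries come from one modality; keys/values come from the other.
        v, _ = self.vis_to_tac(vis_tokens, tac_tokens, tac_tokens)
        t, _ = self.tac_to_vis(tac_tokens, vis_tokens, vis_tokens)
        return self.norm_v(vis_tokens + v), self.norm_t(tac_tokens + t)

class ViTacFormerSketch(nn.Module):
    """Fused visuo-tactile latent -> future tactile features + an action chunk."""

    def __init__(self, dim: int = 256, action_dim: int = 48, chunk: int = 20, horizon: int = 4):
        super().__init__()
        self.fusion = CrossModalFusion(dim)
        self.tactile_rnn = nn.GRU(dim, dim, batch_first=True)  # autoregressive tactile predictor
        self.tactile_proj = nn.Linear(dim, dim)
        self.action_head = nn.Linear(2 * dim + dim, action_dim * chunk)
        self.chunk, self.action_dim, self.horizon = chunk, action_dim, horizon

    def forward(self, vis_tokens, tac_tokens):
        vis, tac = self.fusion(vis_tokens, tac_tokens)
        latent = torch.cat([vis.mean(dim=1), tac.mean(dim=1)], dim=-1)  # (B, 2*dim)

        # Roll out future tactile features one step at a time, feeding each
        # prediction back in as the next input (autoregressive prediction).
        step, hidden, preds = tac.mean(dim=1, keepdim=True), None, []
        for _ in range(self.horizon):
            out, hidden = self.tactile_rnn(step, hidden)
            step = self.tactile_proj(out)
            preds.append(step)
        future_tactile = torch.cat(preds, dim=1)  # (B, horizon, dim)

        # Condition action generation on the fused latent and on a summary of
        # the predicted touch dynamics, not just the current tactile reading.
        cond = torch.cat([latent, future_tactile.mean(dim=1)], dim=-1)
        actions = self.action_head(cond).view(-1, self.chunk, self.action_dim)
        return actions, future_tactile

if __name__ == "__main__":
    # Random features standing in for encoded camera patches and fingertip tactile tokens.
    model = ViTacFormerSketch()
    vis = torch.randn(2, 64, 256)
    tac = torch.randn(2, 10, 256)
    actions, future_tac = model(vis, tac)
    print(actions.shape, future_tac.shape)  # torch.Size([2, 20, 48]) torch.Size([2, 4, 256])

In training, the predicted tactile features would be supervised against the recorded future tactile observations alongside the imitation loss, which is what pushes the shared latent space toward touch-relevant structure.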

Experiment Results

Qualitative Comparison

We present qualitative comparisons on four short-horizon tasks between ViTacFormer and baseline methods such as ACT and Diffusion Policy (DP). For each task, the Success case shows our method completing the task, while Failure 1 and Failure 2 show representative failure cases from the baselines. Our model substantially outperforms the baselines in both task success rate and robustness.

Long-horizon Demonstration


To demonstrate the effectiveness of ViTacFormer on long-horizon tasks, we evaluate it on a challenging 11-stage hamburger-making task. Our method successfully completes all stages in sequence, sustaining continuous, high-precision control with an anthropomorphic hand for about 2.5 minutes.

Our Team



¹University of California, Berkeley, ²Peking University, ³Sharpa
* Equal contributions, † Project lead
Correspondence to: Haoran Geng (ghr@berkeley.edu)

@misc{heng2025vitacformerlearningcrossmodalrepresentation,
      title={ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation}, 
      author={Liang Heng and Haoran Geng and Kaifeng Zhang and Pieter Abbeel and Jitendra Malik},
      year={2025},
      eprint={2506.15953},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2506.15953}, 
}

If you have any questions, please contact Haoran Geng.