As renowned author Margaret Atwood once wrote, "Touch comes before sight, before speech. It's the first language and the last, and it always tells the truth." This profound observation highlights the fundamental nature of human sensory perception.
While our tactile sense provides a direct channel to the physical world, our visual system helps us immediately grasp the full context of those tactile sensations. However, robots equipped with either vision or touch have historically struggled to integrate the two sensory streams effectively.
To address this significant challenge, researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a groundbreaking predictive artificial intelligence system capable of learning to see through touching and to feel through seeing. This innovative approach represents a major leap forward in AI multisensory integration technology.
The team's sophisticated system can generate realistic tactile signals from visual inputs while simultaneously identifying objects and their specific parts being touched through tactile data alone. For their research, they utilized a KUKA robot arm equipped with a special tactile sensor called GelSight, an innovative technology developed by another MIT research group.
Employing a standard web camera, the research team captured nearly 200 different objects—including tools, household products, and various fabrics—being touched more than 12,000 times. By breaking down these 12,000 video clips into static frames, the researchers compiled "VisGel," an extensive dataset containing over 3 million visually and tactilely paired images.
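As a rough illustration of how such visual-tactile pairs can be assembled, the sketch below (Python with OpenCV) walks two synchronized recordings, one from the web camera and one from the GelSight sensor, and saves corresponding frames side by side. The file names and directory layout are assumptions for illustration, not the actual VisGel format.

```python
# Hypothetical sketch of pairing camera and GelSight frames from one touch clip;
# the file names and output layout are assumptions, not the real VisGel format.
from pathlib import Path

import cv2  # OpenCV, used here to read video frames and write images


def extract_paired_frames(camera_video, gelsight_video, out_dir, stride=5):
    """Walk two synchronized videos and save corresponding frames as a pair."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    cam = cv2.VideoCapture(str(camera_video))
    gel = cv2.VideoCapture(str(gelsight_video))
    idx = pair_count = 0
    while True:
        ok_cam, cam_frame = cam.read()
        ok_gel, gel_frame = gel.read()
        if not (ok_cam and ok_gel):
            break  # one of the streams ended
        if idx % stride == 0:  # subsample so adjacent pairs are less redundant
            cv2.imwrite(str(out_dir / f"vision_{pair_count:06d}.png"), cam_frame)
            cv2.imwrite(str(out_dir / f"touch_{pair_count:06d}.png"), gel_frame)
            pair_count += 1
        idx += 1
    cam.release()
    gel.release()
    return pair_count
```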
"By examining a scene, our model can imagine the sensation of touching a smooth surface or a sharp edge," explains Yunzhu Li, a CSAIL PhD student and lead author of the research paper. "Conversely, by exploring through touch alone, our model can predict environmental interactions purely from tactile sensations. Integrating these two senses could significantly enhance robotic capabilities while reducing the data requirements for object manipulation and grasping tasks."
Recent efforts to equip robots with more human-like physical senses, such as MIT's 2016 project that used deep learning to visually indicate sounds, or models that predict how objects respond to physical forces, have relied on large datasets; no comparable dataset existed for understanding interactions between vision and touch.
The MIT team's innovative approach overcomes this limitation by leveraging the VisGel dataset alongside generative adversarial networks (GANs), a powerful machine learning technique.
GANs take an image from one sensory modality, visual or tactile, and generate the corresponding image in the other. They pit a "generator" against a "discriminator": the generator tries to produce images realistic enough to fool the discriminator, and each time the discriminator catches a fake, its feedback is passed back to the generator, which steadily learns to produce more convincing output.
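For readers unfamiliar with the setup, the sketch below shows one training step of a conditional GAN in PyTorch, roughly in the spirit of the vision-to-touch direction. The `generator`, `discriminator`, optimizers, and the L1 reconstruction weight are generic placeholders, not the networks or loss terms from the paper.

```python
# Minimal conditional-GAN training step, sketching the generator/discriminator
# game described above. Networks and data are placeholders for illustration.
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()


def train_step(generator, discriminator, g_opt, d_opt, vision, touch):
    """One adversarial update: vision frames are translated into tactile images."""
    # Discriminator step: learn to tell real (vision, touch) pairs from fakes.
    fake_touch = generator(vision).detach()  # detach so only D is updated here
    d_real = discriminator(vision, touch)
    d_fake = discriminator(vision, fake_touch)
    d_loss = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: each time the discriminator flags a fake, the gradient of
    # this loss is the feedback that nudges the generator toward realism.
    fake_touch = generator(vision)
    d_pred = discriminator(vision, fake_touch)
    g_adv = bce(d_pred, torch.ones_like(d_pred))      # try to fool the discriminator
    g_rec = nn.functional.l1_loss(fake_touch, touch)  # stay close to ground truth
    g_loss = g_adv + 100.0 * g_rec                    # weight is an assumption
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```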
Vision to Touch Translation
Humans naturally infer how an object feels simply by looking at it. To replicate this ability in machines, the system first had to identify the precise location of touch, then deduce information about the shape and texture of that specific area.
Reference images—captured without any robot-object interaction—helped the system encode detailed information about objects and their environments. During robot operation, the model could then compare the current frame with its reference image, easily identifying the location and scale of the touch interaction.
This capability enables the system to examine an image of a computer mouse and then "visualize" the optimal areas for grasping the object—potentially revolutionizing how machines plan safer and more efficient actions.
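The reference-image idea can be illustrated with a simple frame-differencing sketch: subtracting the touch-free reference frame from the current frame highlights where, and at what scale, contact occurs. This plain OpenCV approximation stands in for the learned comparison performed by the actual model.

```python
# Rough approximation of the reference-frame comparison: the region that differs
# from the no-contact reference frame marks the location and scale of the touch.
import cv2
import numpy as np


def locate_touch_region(reference_frame, current_frame, threshold=30):
    """Return a bounding box (x, y, w, h) around the area that changed."""
    ref_gray = cv2.cvtColor(reference_frame, cv2.COLOR_BGR2GRAY)
    cur_gray = cv2.cvtColor(current_frame, cv2.COLOR_BGR2GRAY)

    diff = cv2.absdiff(ref_gray, cur_gray)  # where the scene changed
    _, mask = cv2.threshold(diff, threshold, 255, cv2.THRESH_BINARY)
    mask = cv2.dilate(mask, np.ones((5, 5), np.uint8), iterations=2)

    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None  # no detectable interaction in this frame
    largest = max(contours, key=cv2.contourArea)
    return cv2.boundingRect(largest)  # location and scale of the touch
```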
Touch to Vision Translation
For touch-to-vision conversion, the objective was for the model to generate visual images based solely on tactile data. The model analyzed tactile information to determine the shape and material properties of the contact area, then referenced the visual dataset to "hallucinate" the interaction.
For instance, when provided with tactile data from a shoe during testing, the model could generate an image showing where that shoe was most likely being touched.
This capability could prove invaluable for tasks where visual information is unavailable, such as in darkness or when a robot must blindly reach into an unknown space or container.
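As a hypothetical illustration of the touch-to-vision direction, the sketch below feeds a GelSight image and a reference visual frame through a trained image-to-image generator to produce a predicted view of the touched region. The generator interface, input channels, and preprocessing are assumptions, not the released code.

```python
# Hypothetical inference sketch for touch-to-vision; the generator's signature
# (6-channel input: tactile + reference) and the 256x256 size are assumptions.
import torch
from PIL import Image
from torchvision import transforms

to_tensor = transforms.Compose([transforms.Resize((256, 256)), transforms.ToTensor()])


def predict_touch_view(generator, tactile_path, reference_path, device="cpu"):
    """Return the generator's predicted visual frame as a tensor in [0, 1]."""
    tactile = to_tensor(Image.open(tactile_path).convert("RGB")).unsqueeze(0)
    reference = to_tensor(Image.open(reference_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():  # inference only, no gradients needed
        predicted = generator(torch.cat([tactile, reference], dim=1).to(device))
    return predicted.clamp(0, 1).squeeze(0).cpu()
```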
Future Applications
The current dataset contains only examples from controlled environments. The research team hopes to enhance this by collecting data in unstructured settings or by utilizing a new MIT-designed tactile glove to expand the dataset's size and diversity.
Certain details remain challenging to infer through cross-modal translation, such as determining an object's color through touch alone or assessing how soft a sofa is without physically pressing it. The researchers suggest these limitations could be addressed by developing more robust models for uncertainty quantification, expanding the range of possible outcomes.
Looking ahead, this type of machine learning cross-modal perception technology could foster a more harmonious relationship between vision and robotics, particularly for object recognition, grasping, enhanced scene understanding, and facilitating seamless human-robot integration in assistive or manufacturing environments.
"This is the first method that can convincingly translate between visual and touch signals," notes Andrew Owens, a postdoc at the University of California at Berkeley. "Approaches like this hold tremendous potential for robotics, where you need to answer questions like 'is this object hard or soft?' or 'if I lift this mug by its handle, how good will my grip be?' This is an exceptionally challenging problem, since the signals are so different, and this model has demonstrated remarkable capability."
Li authored the paper alongside MIT professors Russ Tedrake and Antonio Torralba, and MIT postdoc Jun-Yan Zhu. The research was presented at the Conference on Computer Vision and Pattern Recognition (CVPR) in Long Beach, California.