If you've spent time browsing social media lately, you've likely noticed how GANs (generative adversarial networks) have become incredibly skilled at creating human faces. These advanced AI systems can predict your appearance in old age or even transform you into a celebrity lookalike. However, when tasked with generating broader environmental scenes, these same systems encounter significant limitations and produce strange results.
A new demonstration from the MIT-IBM Watson AI Lab reveals what happens when a model trained on church and monument imagery attempts to recreate famous landmarks like Paris's Pantheon or Rome's Piazza di Spagna. The study, titled "Seeing What a GAN Cannot Generate," was recently presented at the International Conference on Computer Vision.
"Most researchers concentrate on characterizing and enhancing what machine-learning systems can accomplish—their attention patterns and how specific inputs produce particular outputs," explains David Bau, a graduate student at MIT's Department of Electrical Engineering and Computer Science and Computer Science and Artificial Science Laboratory (CSAIL). "Through this work, we hope to encourage equal attention to characterizing the data these systems systematically ignore."
Within a GAN framework, two neural networks, a generator and a discriminator, work against each other to produce hyper-realistic images modeled on example data. Bau became fascinated with GANs as a way to examine black-box neural networks and understand their decision-making processes. An earlier tool he developed with his advisor, MIT Professor Antonio Torralba, and IBM researcher Hendrik Strobelt made it possible to identify the clusters of artificial neurons responsible for drawing real-world elements like doors, trees, and clouds. A related application, GANPaint, lets amateur artists add or remove these features from their personal photographs.
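For readers who want to see the mechanics, the following is a minimal sketch of that adversarial setup, written in PyTorch. The tiny fully connected networks and toy image size are illustrative assumptions; the models studied in this work are far larger scene generators.

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration only, not those of the models in the study.
latent_dim, img_dim = 64, 32 * 32 * 3

# Generator: maps a random latent vector to a flattened "image".
generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, img_dim), nn.Tanh(),
)

# Discriminator: scores whether a flattened image looks real or generated.
discriminator = nn.Sequential(
    nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def training_step(real_images: torch.Tensor) -> None:
    """One adversarial step: the discriminator learns to tell real images from
    generated ones, while the generator learns to fool the discriminator."""
    batch = real_images.size(0)
    z = torch.randn(batch, latent_dim)
    fake_images = generator(z)

    # Discriminator update: push real images toward 1, generated images toward 0.
    d_loss = bce(discriminator(real_images), torch.ones(batch, 1)) + \
             bce(discriminator(fake_images.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator update: try to make the discriminator label fakes as real.
    g_loss = bce(discriminator(fake_images), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```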
While assisting an artist using GANPaint, Bau encountered a revealing issue. "As usual, we were focused on optimizing numerical reconstruction loss to recreate the photo perfectly," he recalls. "But my advisor has always encouraged us to look beyond the numbers and carefully examine the actual images. When we did, the phenomenon was immediately obvious: People were being selectively removed from the generated scenes."
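The reconstruction loss Bau describes comes from projecting a real photo into the GAN: searching for a latent code whose generated image matches the photo as closely as possible. A rough sketch of that step, again in PyTorch and assuming a pretrained generator with the interface used above, shows why a good numerical score can still hide missing people: the objective only measures average pixel error.

```python
import torch
import torch.nn as nn

def invert(generator: nn.Module, target: torch.Tensor,
           latent_dim: int = 64, steps: int = 500) -> torch.Tensor:
    """Optimize a latent code so the generated image reconstructs `target`.

    A low average pixel error does not guarantee that every object in the
    photo (people, cars, signs) actually survives the reconstruction.
    """
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=0.01)
    for _ in range(steps):
        recon = generator(z)
        loss = ((recon - target) ** 2).mean()  # pixel-wise reconstruction loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```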
Just as GANs and other neural networks identify patterns in massive datasets, they also develop patterns of omission. Bau and his team trained various GAN types on both indoor and outdoor scenes. Regardless of the setting, these AI systems consistently omitted crucial elements such as people, vehicles, signs, fountains, and furniture—even when these objects featured prominently in the original images. In one striking GAN reconstruction, a newlywed couple kissing on church steps vanished completely, leaving only an eerie wedding-dress texture on the cathedral door.
"When GANs encounter objects they cannot generate, they seem to imagine how the scene would appear without them," notes Strobelt. "Sometimes people transform into bushes or vanish entirely into the structures behind them."
The researchers hypothesize that "machine laziness" might explain these omissions: although GANs are trained to create convincing images, they may learn that it is easier to concentrate on buildings and landscapes than to represent complex elements like people and vehicles. While researchers have long known that GANs tend to drop some statistically meaningful details, this may be the first study to demonstrate that state-of-the-art GANs systematically eliminate entire object categories within images.
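One way to make such omissions measurable, roughly in the spirit of the paper, is to run a semantic segmentation model over real photos and their GAN reconstructions and compare how often each object class appears. In the sketch below, `segment` is a hypothetical stand-in for a pretrained segmenter that returns the class labels found in an image; the actual models and thresholds used in the study differ.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set

def class_frequencies(images: Iterable,
                      segment: Callable[[object], List[str]]) -> Counter:
    """Count how many images contain each object class, according to the
    segmentation function passed in as `segment`."""
    counts = Counter()
    for img in images:
        counts.update(set(segment(img)))  # count each class at most once per image
    return counts

def dropped_classes(real_images, fake_images, segment,
                    ratio: float = 0.5) -> Set[str]:
    """Flag classes (people, cars, signs, ...) that appear far less often in
    GAN reconstructions than in the corresponding real photos."""
    real = class_frequencies(real_images, segment)
    fake = class_frequencies(fake_images, segment)
    return {cls for cls, n in real.items()
            if fake.get(cls, 0) < ratio * n}
```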
An AI that excludes certain objects from its representations might achieve its numerical objectives while missing details most important to humans, Bau explains. As engineers increasingly rely on GANs to generate synthetic images for training automated systems like self-driving cars, there's a significant risk that people, signs, and other critical information could be omitted without human operators realizing. This demonstrates why model performance shouldn't be measured by accuracy alone, Bau emphasizes. "We need to understand both what networks are and aren't doing to ensure they're making the choices we intend them to make."
Bau's research team included Jun-Yan Zhu, Jonas Wulff, William Peebles, and Torralba from MIT; Strobelt from IBM; and Bolei Zhou from the Chinese University of Hong Kong.