Researchers have developed a deep-learning system that can do the very human task of interpreting what's happening in a photo and guessing what's likely to happen next.
Better yet, the system, developed by machine-learning researchers at MIT, can express its idea of a plausible future by adding animations to still images, such as waves that would ultimately crash, people who might move in a field, or a train that might roll forward on its tracks.
The work could provide a new direction for exploring computer vision by giving machines the ability to understand how objects move in the real world.
The researchers achieved their objective by training two deep networks on thousands of hours of unlabelled video. As the researchers note, annotating video is expensive, but there's no shortage of unlabeled video to begin training machines to read other signals about the world.
Carl Vondrick, a PhD student at MIT, who specializes in machine learning and computer vision, told New Scientist that the ability to predict movements in a scene could ensure tomorrow's domestic helper robots don't become a hindrance. You wouldn't, for example, appreciate a robot pulling a chair out from under you as you're about to sit down, he said.
The model was trained on two million videos from Flickr amounting to 5,000 hours of content covering four main scene types, including golf courses, beaches, train stations, and hospital rooms, consisting mostly of images of babies.
As New Scientist notes, the videos the model produces are grainy and short, lasting about one second, but they do capture the right general movement of a given scene, such as a train moving forward or a baby scrunching up its face.
However, the model still has much to learn about how the world works, such as that a train doesn't infinitely depart from a scene. Still, it did show that machines can be taught to dream up brief plausible futures.
The model is also able to "hallucinate" fictional but reasonable motions for each of the scene categories they explored.
The model was based on a machine-learning technique called adversarial learning, where two deep networks compete against each other. One network generates synthetic video while the other tries to discriminate between generated video and real videos.
Vondrick has previously also trained deep-learning models on hundreds of hours of unlabeled YouTube videos and TV programs such as 'The Office' to predict human interactions and gestures, such as a handshake, hug or kiss.