To operate in augmented and virtual reality, Facebook believes artificial intelligence will need to develop an "egocentric perspective."
To that end, the company on Thursday announced Ego4D, a dataset of 2,792 hours of first-person video, and a set of benchmark tests for neural nets, designed to encourage the development of AI that is savvier about what it's like to move through virtual worlds from a first-person perspective.
The project is a collaboration between Facebook Reality Labs and scholars from 13 research institutions, including universities and research labs. The details are laid out in a paper lead-authored by Facebook's Kristen Grauman, "Ego4D: Around the World in 2.8K Hours of Egocentric Video."
Grauman is a scientist with the company's Facebook AI Research unit. Her background as a professor at UT Austin is in computer vision and machine learning.
The idea is that the dataset will propel researchers to develop neural nets that excel at performing tasks from a first-person perspective -- in the same way that large datasets such as ImageNet propelled the development of AI programs that work from a "spectator" perspective.
The point of egocentric perception, said Facebook, is to fix the problems a neural network has with basic tasks, such as image recognition, when the point of view of an image shifts from third person to first person.
"These benchmarks will catalyze research on the building blocks necessary to develop smarter AI assistants that can understand and interact not just in the real world but also in the metaverse, where physical reality, AR, and VR all come together in a single space," said Facebook.
The 2,792 hours of video were collected using a variety of cameras: the Vuzix Blade augmented reality headset, along with devices from GoPro, Pupil Labs, ZShades, and Wee-view. The purpose of mixing different camera types, write Grauman and collaborators, is to avoid "over-fitting," the phenomenon in which a neural network memorizes frames of video rather than learning to infer similarities across differences.
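One standard way to test for that kind of over-fitting is to hold out an entire camera source during training and evaluate only on the held-out device. The sketch below illustrates the idea; the device names, clip IDs, and split_by_device helper are hypothetical, not part of Ego4D's actual pipeline.

```python
# Minimal sketch: split clips by capture device so evaluation measures
# generalization across hardware rather than memorization of familiar frames.
# All device names and clip IDs here are hypothetical.
from typing import Dict, List, Tuple

def split_by_device(clips_by_device: Dict[str, List[str]],
                    held_out: str = "GoPro") -> Tuple[List[str], List[str]]:
    """Train on every device except `held_out`; evaluate on the held-out one."""
    train = [clip for device, clips in clips_by_device.items()
             if device != held_out for clip in clips]
    test = clips_by_device.get(held_out, [])
    return train, test

clips = {
    "Vuzix Blade": ["vuzix_001.mp4", "vuzix_002.mp4"],
    "GoPro": ["gopro_001.mp4"],
    "Pupil Labs": ["pupil_001.mp4"],
}
train_clips, test_clips = split_by_device(clips)  # GoPro clips held out for testing
```

A model whose accuracy collapses on the held-out device has likely memorized device-specific artifacts rather than learned the underlying activity.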
Facebook said the video was "captured by 750 unique camera wearers from 73 worldwide locations and 9 different countries." Some of that video was shot by Facebook staffers on the company's campus, and some by the university collaborators.
The "4D" in Ego4D references the temporal aspect of the video Facebook's staff spent 250,000 hours looking at and providing spoken narrations summarizing what's going on in the videos, with time-stamps attached.
Facebook says the narrations "are temporally dense," given that "On average we received 13.2 sentences per minute of video, for a total of 3.85M sentences. In total the narrations describe the Ego4D video using 1,772 unique verbs (activities) and 4,336 unique nouns (objects)."
The dataset is meant to be used to develop neural nets that will perform on a variety of new benchmark tests. To that end, Grauman and collaborators describe several new tests they've created that require a neural net to produce a response to tasks about the past, such as recalling an event; the present, such as categorizing an activity; or the future, such as describing the expected result of an action.
For example, one task for a neural net could be to answer a natural-language query that requires the program to match the content of the query to a frame of video. An example is to ask the computer, "When did I read to my children?" The computer would have to find the scene where the camera wearer was reading to their kids. The task is labeled by the human annotation staff, who are given a pre-formatted list of labels and have to assign those to clips.
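In practice, that kind of query-to-clip matching is often done with a joint text-video embedding: encode the query and each candidate clip into the same vector space and return the clip with the highest similarity. The sketch below shows the shape of that approach; the encoders are random stand-ins, and Ego4D does not prescribe this particular method.

```python
# Minimal sketch of query-to-clip matching via embedding similarity.
# encode_text and encode_clip are hypothetical stand-ins for trained encoders.
import numpy as np

EMBED_DIM = 512

def encode_text(query: str) -> np.ndarray:
    """Hypothetical text encoder returning a unit-length embedding."""
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    v = rng.standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)

def encode_clip(clip_id: str) -> np.ndarray:
    """Hypothetical video encoder returning a unit-length embedding."""
    rng = np.random.default_rng(abs(hash(clip_id)) % (2**32))
    v = rng.standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)

def answer_query(query: str, clip_ids: list) -> str:
    """Return the clip whose embedding is most similar to the query."""
    q = encode_text(query)
    scores = {cid: float(q @ encode_clip(cid)) for cid in clip_ids}
    return max(scores, key=scores.get)

best_clip = answer_query("When did I read to my children?",
                         ["kitchen_0412", "reading_0517", "garden_0533"])
```

With trained encoders in place of the random ones, the same loop would surface the reading scene as the best match.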
Facebook said it has 74,000 queries assigned in this way to 800 hours of video.
In a future prediction test, the computer might have to predict which object in a frame of video the camera wearer will next interact with. So, if they are at a table rolling dough, the next action predicted might be to grab a ball of dough on the table. The program makes the prediction by selecting one of a pre-set list of verbs that have been attached to video frames by the annotation staff, and appending a time estimate, such as "take dough in 0.8 seconds."
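A sketch of what the output side of such a forecasting model might look like: a classifier over a fixed verb vocabulary plus a regressor for the time estimate. The verb list and feature dimension below are illustrative assumptions, not Ego4D's actual specification.

```python
# Minimal PyTorch sketch of a forecasting head: predict a verb from a
# pre-set vocabulary plus a time estimate for when the interaction happens.
# VERBS and FEAT_DIM are hypothetical; real features would come from a
# video backbone, not torch.randn.
import torch
import torch.nn as nn

VERBS = ["take", "roll", "cut", "pour"]  # hypothetical annotation vocabulary
FEAT_DIM = 256

class AnticipationHead(nn.Module):
    def __init__(self, feat_dim: int = FEAT_DIM):
        super().__init__()
        self.verb = nn.Linear(feat_dim, len(VERBS))  # verb logits
        self.time = nn.Linear(feat_dim, 1)           # seconds until interaction

    def forward(self, feats: torch.Tensor):
        return self.verb(feats), self.time(feats).squeeze(-1)

head = AnticipationHead()
feats = torch.randn(1, FEAT_DIM)  # stand-in for extracted video features
logits, seconds = head(feats)
verb = VERBS[logits.argmax(dim=-1).item()]
print(f"{verb} dough in {seconds.item():.1f} seconds")  # matches the format above
```

The verb head would be trained with cross-entropy against the annotators' labels, and the time head with a regression loss against the attached time stamps.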