Over the past two years, Facebook AI Research (FAIR) has worked with 13 universities around the world to assemble the largest-ever data set of first-person video, specifically for training deep-learning image-recognition models. AI trained on the data set should be better at controlling robots that interact with humans, or at interpreting images from smart glasses. “Machines will only be able to help us in our daily lives if they really understand the world through our eyes,” said Kristen Grauman of FAIR, who leads the project.
Such technology could assist people who need help around the home, or guide people through tasks they are learning to complete. “The video in this data set is much closer to how people view the world,” said Michael Ryoo, a computer vision researcher at Google Brain and Stony Brook University in New York who was not involved in Ego4D.
But the potential abuses are clear and worrying. The research is funded by Facebook, a social media giant recently accused in the US Senate of prioritizing profits over people’s well-being — a charge corroborated by MIT Technology Review’s own investigations.
The business model of Facebook and other major technology companies is to extract as much data as possible from people’s behavior online and sell it to advertisers. The AI outlined in the project could extend that reach into people’s everyday offline behavior, revealing what objects are around your home, what activities you enjoyed, who you spent time with, and even where your gaze lingered — an unprecedented level of personal information.
“There’s privacy work that needs to be done as you take this out of the world of exploratory research and into something that’s a product,” Grauman said. “That work could even be inspired by this project.”
The largest previous data set of first-person video consisted of 100 hours of footage of people in their kitchens. The Ego4D data set consists of 3,025 hours of video recorded by 855 people in 73 locations across nine countries (the US, the UK, India, Japan, Italy, Singapore, Saudi Arabia, Colombia, and Rwanda).
The participants were of different ages and backgrounds; some were recruited for their visually interesting professions, such as bakers, mechanics, carpenters, and gardeners.
Previous data sets typically consist of semi-scripted video clips lasting only a few seconds. For Ego4D, participants wore head-mounted cameras for up to 10 hours at a time and captured first-person video of unscripted daily activities, including walking down the street, reading, doing laundry, shopping, playing with pets, playing board games, and interacting with other people. Some of the footage also includes audio, data on where the participants’ eyes were focused, and multiple perspectives on the same scene. It is the first data set of its kind, says Ryoo.