Each clip labels the person the AI should focus on, along with information describing that person's pose and stating whether they are interacting with another human or with an object. The clips were taken from popular movies and cover a range of genres as well as countries of origin. The catalog contains nearly 58,000 clips and more than 96,000 labeled humans, yet it highlights only 80 actions.
The activities have been categorized into three groups: one for pose and movement, one for person-to-person interaction, and one for person-object interaction. In clips where more than one person is present, each person has been labeled separately. This teaches the machine that some actions require two people, such as shaking hands, and helps it learn, for example, that people sometimes kiss while they are hugging. The algorithm can also recognize when a person is doing several things at once, like singing while playing an instrument, which helps it learn that some actions are frequently accompanied by certain others.
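To make that labeling scheme concrete, here is a minimal sketch of what per-person annotations like these might look like in code. The field names, IDs, and values are illustrative assumptions for this article, not the dataset's actual file format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# The three action groups described above. (Labels here are
# illustrative; the dataset's own naming may differ.)
POSE, PERSON_PERSON, PERSON_OBJECT = "pose/movement", "person-person", "person-object"

@dataclass
class PersonAnnotation:
    clip_id: str                              # which movie clip the labels belong to
    person_id: int                            # distinguishes people within one clip
    bbox: Tuple[float, float, float, float]   # (x1, y1, x2, y2) box around the person
    actions: List[Tuple[str, str]]            # (action, group); several may co-occur

# Two people in the same clip, labeled separately. Person 1 carries two
# concurrent actions (kissing while hugging), which is what lets a model
# learn that certain actions frequently accompany each other.
annotations = [
    PersonAnnotation("clip_0001", 0, (0.10, 0.20, 0.45, 0.90),
                     [("hug", PERSON_PERSON)]),
    PersonAnnotation("clip_0001", 1, (0.40, 0.15, 0.80, 0.95),
                     [("hug", PERSON_PERSON), ("kiss", PERSON_PERSON)]),
]

for ann in annotations:
    print(ann.clip_id, ann.person_id, [action for action, _ in ann.actions])
```

Tying every label to a specific person and bounding box, rather than to the clip as a whole, is what allows two-person actions and concurrent actions to be represented at all.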
Some of the other actions are a bit more complex than running or talking, such as swimming, brushing teeth, clinking glasses, playing a board game, putting on clothes, and rowing a boat.
In a recent blog post, Google explained the reasoning behind the effort: even though AI has made tremendous strides in finding and classifying objects within images, it still struggles to recognize human actions, which tend to be less well-defined in videos than objects are.
This technology could prove useful to Google in many ways, going far beyond simply allowing the company to analyze the videos processed on YouTube.
It could also help advertisers sharpen their targeting based on the kinds of actions people are most likely to watch. Advertising could be tailored, for example, according to whether you are watching videos of people arguing or hugging. A related research paper shows that the researchers want to understand not only what humans are doing but also what they might do next and what they are ultimately hoping to achieve.
This is not unlike the way that parents build up their children’s knowledge of the world around them – for example, by pointing to a ball and saying the word “ball.”
It’s hard to imagine what picture of humanity a series of YouTube clips might be painting. It is also worth noting that using movie clips can introduce a certain degree of bias, and Google's researchers admit the system is not perfect. If you’re curious what clips it’s using, you can explore the AVA dataset on their website and experience all of the bike riding, smoking, and window opening for yourself.