to achieve the goals of the AI. The first area it observes is user behavior via metadata. It determines things about a video based on the behavior of the person whose eyes are on the screen and whose fingers are doing the clicking. “Satisfaction signals” train the AI on what to suggest and what not to. There is a very specific list of these signals:
Which videos a user watches
Which videos they skip
Time they spend watching
Likes and dislikes
“Not interested” feedback
Surveys after watching a video
Whether they come back to rewatch or finish something unwatched
If they save and come back to watch later
All of these signals feed the Satisfaction Feedback Loop. This loop is built from the feedback the algorithm gets from your specific behavior: it “loops” the types of videos you like back through its suggestions. This is how it personalizes each user's experience.
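To make the loop concrete, here's a toy sketch in Python of how weighted signals could accumulate into per‐topic preferences that then steer suggestions. The signal names and weights are illustrative assumptions, not YouTube's actual model:

```python
# Toy model: each satisfaction signal nudges a topic score up or down.
# Signal names and weights are invented for illustration.
SIGNAL_WEIGHTS = {
    "watched": 1.0,
    "skipped": -1.0,
    "watch_minutes": 0.1,      # longer watch time, stronger signal
    "liked": 2.0,
    "disliked": -2.0,
    "not_interested": -3.0,
    "survey_satisfied": 2.5,
    "rewatched": 1.5,
    "saved_for_later": 1.0,
}

def satisfaction_score(events: dict) -> float:
    """Collapse one viewer's signals on one video into a single score."""
    return sum(SIGNAL_WEIGHTS[name] * value for name, value in events.items())

def update_preferences(prefs: dict, topic: str, events: dict) -> None:
    """The "loop": topics that satisfy get surfaced more next time."""
    prefs[topic] = prefs.get(topic, 0.0) + satisfaction_score(events)

prefs = {}
update_preferences(prefs, "physics", {"watched": 1, "liked": 1, "watch_minutes": 12})
update_preferences(prefs, "pranks", {"skipped": 1, "not_interested": 1})
print(sorted(prefs.items(), key=lambda kv: kv[1], reverse=True))
# [('physics', 4.2), ('pranks', -4.0)]  -> physics gets suggested more
```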
Gathering Metadata
To really get down to the details, here's an explanation of exactly how the AI gathers data. Observing metadata starts with the thumbnail. The YouTube AI uses the advanced technology of Google's suite of AI products. It operates a program called Cloud Vision (CV). CV uses optical character recognition (OCR) and image recognition to determine lots of things about a video based on what it finds in the thumbnail. It takes points from each image in the thumbnail and, using billions of data points already in the system, recognizes those images and feeds that information back into the algorithm. For example, a thumbnail including a close‐up of world‐renowned physicist Stephen Hawking's face is recognized as such in CV, so that video can be “grouped” in the suggested feed along with every other video on YouTube that has been tagged under the Stephen Hawking topic. This is how your videos get discovered and watched.
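Cloud Vision is publicly available, so you can try this kind of analysis yourself. Here's a minimal sketch using Google's Python client; it assumes you have Google Cloud credentials configured, and thumbnail.jpg is a placeholder for your own file:

```python
# pip install google-cloud-vision
# Assumes GOOGLE_APPLICATION_CREDENTIALS is set for your project.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("thumbnail.jpg", "rb") as f:  # placeholder filename
    image = vision.Image(content=f.read())

# Image recognition: which entities does CV see in the thumbnail?
labels = client.label_detection(image=image).label_annotations
for label in labels:
    print(f"{label.description}: {label.score:.2f}")

# OCR: what text is overlaid on the thumbnail?
texts = client.text_detection(image=image).text_annotations
if texts:
    print("Detected text:", texts[0].description)
```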
In addition, CV utilizes a “safety” tool that determines, based on the data it has gathered from the images in your thumbnail, whether your video is safe for all audiences to watch or whether it has adult themes, violence, or other questionable content, and it gives a “confidence” score for that determination. This score also reflects how accurately the content matches what the thumbnail shows. This means that you can create a thumbnail, plug it into Cloud Vision, and know before you finalize your video upload how the thumbnail will likely be rated in the system. Using Cloud Vision can help catch something that might, for whatever reason, be flagged as inappropriate on any data point, giving creators the opportunity to fix it before it goes live. This has cut down on demonetization and other issues creators have had in the past. It can be a very valuable tool to help you stay one step ahead of the problems. CV is not an exact replica of YouTube's safety measures, but it is close enough that creators can get a good idea of how their content will be rated by YouTube. CV might tolerate something YouTube will not, but it is still a sufficient prelaunch tool to utilize.
Figure 4.1 Thumbnail with data points
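You can run this kind of safety check yourself with Cloud Vision's SafeSearch feature, which rates an image on a likelihood scale per category. A minimal sketch, with the same placeholder filename as before:

```python
# pip install google-cloud-vision
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("thumbnail.jpg", "rb") as f:  # placeholder filename
    image = vision.Image(content=f.read())

# SafeSearch returns a likelihood bucket for each sensitive category.
safe = client.safe_search_detection(image=image).safe_search_annotation
likelihood_name = (
    "UNKNOWN", "VERY_UNLIKELY", "UNLIKELY",
    "POSSIBLE", "LIKELY", "VERY_LIKELY",
)
for category in ("adult", "violence", "racy", "medical", "spoof"):
    print(f"{category}: {likelihood_name[getattr(safe, category)]}")
```

If any category comes back LIKELY or VERY_LIKELY, that's your cue to rework the thumbnail before you upload.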
Video Intelligence
Once the thumbnail has been checked, the AI goes through every single frame of the video and creates shot lists and labels based on what it sees in the content of the video itself. For example, if you do a video in a parking lot, the AI detects the storefront, people, flowers, brands, and more, so it can log that info for recommendations and run it through the same safety routine that it uses to check thumbnail images. Be aware of what is in the frame in every scene of every video you create! It will be detected by the AI and sorted accordingly, and it also lets the AI validate that the thumbnail matches the content. The AI cuts through the “noise” of every single thing in every frame and determines what is most important according to that video and its metadata.
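Google also sells this capability as the Video Intelligence API, so you can preview what frame‐level analysis finds in your own footage. A rough sketch with the published Python client; the filename and timeout are placeholders, and this approximates the idea rather than YouTube's internal pipeline:

```python
# pip install google-cloud-videointelligence
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()

# Shot boundaries, per-shot labels, and an explicit-content scan:
# the same three jobs described above.
features = [
    videointelligence.Feature.SHOT_CHANGE_DETECTION,
    videointelligence.Feature.LABEL_DETECTION,
    videointelligence.Feature.EXPLICIT_CONTENT_DETECTION,
]

with open("my_video.mp4", "rb") as f:  # placeholder filename
    operation = client.annotate_video(
        request={"features": features, "input_content": f.read()}
    )

result = operation.result(timeout=600).annotation_results[0]
for label in result.shot_label_annotations:
    print("Shot label:", label.entity.description)
```

Run it on a video shot in a parking lot and you'd expect labels like “car,” “building,” and “person” to come back.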
Closed Captioning
The AI does the same thing with the language of the video. YouTube now has an auto‐caption feature, and the AI reads through the words of the captions to gather data as well. So while the shot lists show what is being “said” visually, the captions provide even more feedback through what is actually being verbalized. Everything goes into the system.
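Google's Speech‐to‐Text API does this same kind of transcription, so you can see what machine captioning pulls out of your audio. A minimal sketch; the filename, encoding, and sample rate are assumptions you'd match to your own file:

```python
# pip install google-cloud-speech
from google.cloud import speech

client = speech.SpeechClient()

with open("video_audio.wav", "rb") as f:  # placeholder filename
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,  # match your recording
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
transcript = " ".join(r.alternatives[0].transcript for r in response.results)
print(transcript)  # every one of these words becomes data
```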
Natural Language
The AI is also listening for actual sentence structure and breaking it down into a sentence diagram. This extracts the meaning of what is being said. It can differentiate language so it can group it categorically, but not just on the surface. For example, two different creators might both talk about Stephen Hawking in their videos, but one video might be biographical or scientific while the other might be humorous or entertaining. Even though both videos are talking about the same person, they are categorically different enough that the AI would categorize them differently and group them with different recommended content because of the language being used.
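Google's Natural Language API exposes this kind of categorization directly through its classify_text method. Here's a rough sketch of the Stephen Hawking example; the two transcript snippets are invented for illustration, and the classifier needs a reasonably long passage (roughly 20+ words) to work:

```python
# pip install google-cloud-language
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

def categorize(transcript: str):
    """Return the content categories the API assigns to a transcript."""
    document = language_v1.Document(
        content=transcript,
        type_=language_v1.Document.Type.PLAIN_TEXT,
    )
    response = client.classify_text(request={"document": document})
    return [(c.name, round(c.confidence, 2)) for c in response.categories]

# Same person, very different language around him.
science_talk = (
    "Stephen Hawking devoted his career to theoretical physics, publishing "
    "groundbreaking research on black holes, radiation, and the origins of "
    "the universe while teaching mathematics at Cambridge."
)
comedy_bit = (
    "We count down the funniest Stephen Hawking cameos from sitcoms and "
    "talk shows, ranking the jokes, sketches, and impressions that made "
    "studio audiences laugh the hardest."
)

print(categorize(science_talk))  # expect something like /Science/Physics
print(categorize(comedy_bit))    # expect an Arts & Entertainment category
```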
Video Title and Description
As you should expect, the algorithm also looks at the video's title and description to supplement what it has already learned from the thumbnail, the frame‐by‐frame analysis, and the language. But it only relies on this metadata for as long as it needs to, until real viewer data starts coming in. The AI “knows” that people can be deceptive with metadata, but they can't lie about what's actually in the content. Don't slap a title and description on your video haphazardly just to get it finished and uploaded. The verbiage matters, so choose your words wisely. Most creators don't leverage the video description to its fullest potential. It's another data point the AI looks at to help with search ranking and discovery.
Part 2: Algorithms with an “S”
Did you know that YouTube has more than one algorithm? The AI uses multiple systems, and each has its own objective and goal. The surface features viewers see are:
Browse Features: Homepage and Subscriptions
Suggested
Trending
Notifications
Search
Each of these features runs its own algorithm optimized for a higher hit rate, and they all feed into the YouTube AI. They have separate hit rates to determine what actually works for users in each particular system. Hit rate means how often viewers are able to find what they actually want to watch. Have you ever heard of a fisherman getting a “hit”? It's when a fish takes the bait. Imagine that you are the fisherman who has tossed his video into the water. Potential viewers are the fish swimming by your “bait.” Maybe 10 fish take a look at the bait and swim on by because it's not the brand of bait that they like. But along comes a fish who says, “That looks good,” and he bites. Say you toss this line 10 times, and while 100 fish swam past, 10 took the bait. There's your hit rate: 10 out of 100, or 10 percent. This hit rate is so important to each system in the AI. The algorithms are very sensitive to user behavior and the metadata on each traffic source so that they know how to change to increase the hit rate.
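In code, the hit rate from the analogy is just hits divided by impressions:

```python
# Hit rate: how often viewers actually "bite" on what they're shown.
impressions = 100  # fish that swam past the bait
hits = 10          # fish that took the bait

hit_rate = hits / impressions
print(f"Hit rate: {hit_rate:.0%}")  # Hit rate: 10%
```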
Additionally, YouTube is constantly running experiments, several thousand a year, and implements about 1 in 10 of the resulting changes, which translates to hundreds of changes annually. These changes help the system get smarter, and smarter means better at feeding viewers what they will watch.
Browse: Homepage
YouTube's Homepage has changed over time. Users no longer have to type a query in Search or put in the work to navigate. The Homepage used to show only video recommendations from channels a user had subscribed to. Now the Homepage has a personalized