computers were terrible at understanding what it was they were looking at. This is one domain where the human brain totally outstrips its silicon rivals. We are able to eyeball a picture very quickly and say what it is or to classify different regions of the image. A computer could analyse millions of pixels, but programmers found it very difficult to write an algorithm that could take all this data and make sense of it. How can you create an algorithm from the top down to identify a cat? Each image consists of a completely different arrangement of pixels and yet the human brain has an amazing ability to synthesise this data and integrate the input to output the answer, ‘cat’.
This ability of the human brain to recognise images has been used to create an extra layer of security at banks, and to make sure you aren’t a robot trawling for tickets online. In essence you needed to pass an inverse Turing Test. Shown an image or some strange handwriting, humans are very good at saying what the image or script is. Computers couldn’t cope with all the variations. But machine learning has changed all that.
Now, by training on data consisting of images of cats, the algorithm gradually builds up a hierarchy of questions it can ask an image that, with a high probability of accuracy, will identify it as a cat. These algorithms are slightly different in flavour to those we saw in the last chapter, and violate one of the four conditions we put forward for a good algorithm. They don’t work 100 per cent of the time. But they do work most of the time. The point is to get that ‘most’ as high as possible. The move from deterministic foolproof algorithms to probabilistic ones has been a significant psychological shift for those working in the industry. It’s a bit like moving from the mindset of the mathematician to that of the engineer.
You may wonder why, if this is the case, you are still being asked to identify bits of images when you want to buy tickets to the latest gig to prove you are human. What you are actually doing is helping to prepare the training data that will then be fed to the algorithms so that they can try to learn to do what you do so effortlessly. Algorithms need labelled data to learn from. What we are really doing is training the algorithms in visual recognition.
This training data is used to learn the best sorts of questions to ask to distinguish cats from non-cats. Every time it gets it wrong, the algorithm is altered so that the next time it will get it right. This might mean altering the parameters of the current algorithm or introducing a new feature to distinguish the image more accurately. The change isn’t communicated in a top-down manner by a programmer who is thinking up all of the questions in advance. The algorithm builds itself from the bottom up by interacting with more and more data.
I saw the power of this bottom-up learning process at work when I dropped in to the Microsoft labs in Cambridge to see how the Xbox which my kids use at home is able to identify what they’re doing in front of the camera as they move about. This algorithm has been created to distinguish hands from heads, and feet from elbows. The Xbox has a depth-sensing camera called Kinect which uses infrared technology to record how far obstacles are from the camera. If you stand in front of the camera in your living room it will detect that your body is nearer than the wall at the back of the room and will also be able to determine the contours of your body.
But people come in different shapes and sizes. They can be in strange positions, especially when playing Xbox. The challenge for the computer is to identify thirty-one distinct body parts, from your left knee to your right shoulder. Microsoft’s algorithm is able to do this on a single frozen image. It does not use the way you are moving (which requires more processing power to analyse and would slow the game down).
So how does it manage to do this? The algorithm has to decide for each pixel in each image which of the thirty-one body parts it belongs to. Essentially it plays a game of twenty questions. In fact, there’s a sneaky algorithm you can write for the game of twenty questions that will guarantee you get the right answer. First ask: ‘Is the word in the first half of the dictionary or the second?’ Then narrow down the region of the dictionary even more by asking: ‘Is it in the first or second half of the half you’ve just identified?’ After twenty questions this strategy divides the dictionary up into 21820 different regions. Here we see the power of doubling. That’s more than a million compartments – far more than there are entries in the Oxford English Dictionary, which roughly come to 300,000.
But what questions should we ask our pixels if we want to identify which body part they belong to? In the past we would have had to come up with a clever sequence of questions to solve this. But what if we programmed the computer so that it finds the best questions to ask? By interacting with more and more data – more and more images – it finds the set of questions that seem to work best. This is machine learning at work.
We have to start with some candidate questions that we think might solve this problem so this isn’t completely tabula rasa learning. The learning comes from refining our ideas into an effective strategy. So what sort of questions do you think might help us distinguish your arm from the top of your head?
Let’s call the pixel we’re trying to identify X. The computer knows the depth of each pixel, or how far away it is from the camera. The clever strategy the Microsoft team came up with was to ask questions of the surrounding pixels. For example, if X is a pixel on the top of my head, then if we look at the pixels north of pixel X they are much more likely not to be on my body and thus to have more depth. If we take pixels immediately south of X, they’ll be pixels on my face and will have a similar depth. But if the pixel is on my arm and my arm is outstretched, there will be one axis, along the length of the arm, along which the depth will be relatively unchanged, but if you move out ninety degrees from this direction it quickly pushes you off the body and onto the back wall. Asking about the depth of surrounding pixels could cumulatively build up to give you an idea of the body part that pixel belongs to.
This cumulative questioning can be thought of as building a decision tree. Each subsequent question produces another branch of the tree. The algorithm starts by choosing a series of arbitrary directions to head out from and some arbitrary depth threshold: for example, head north; if the difference in depth is less than y, go to the left branch of the decision tree; if it is greater, go right – and so on. We want to find questions that give us new information. Having started with an initial random set of questions, once we apply these questions to 10,000 labelled images we start getting somewhere. (We know, for instance, that pixel X in image 872 is an elbow, and in image 3339 it is part of the left foot.) We can think of each branch or body part as a separate bucket. We want the questions to ensure that all the images where pixel X is an elbow have gone into one bucket. That is unlikely to happen on the first random set of questions. But over time, as the algorithm starts refining the angles and the depth thresholds, it will get a better sorting of the pixels in each bucket.
By iterating this process, the algorithm alters the values, moving in the direction that does a better job at distinguishing the pixels. The key is to remember that we are not looking for perfection here. If a bucket ends up with 990 out of 1000 images in which pixel X is an elbow, then that means that in 99 per cent of cases it is identifying the right feature.
By the time the algorithm has found the best set of questions, the programmers haven’t really got a clue how it has come to this conclusion. They can look at any point in the tree and see the question it is asking before and after, but there are over a million different questions being asked across the tree, each one slightly different. It is difficult to reverse-engineer why the algorithm ultimately settled on this question to ask at this point in the decision tree.
Imagine trying to program something like this by hand. You’d have to come up with over a million different questions. This prospect would defeat even the most intrepid coder, but a computer is quite happy to sort through these kinds of numbers. The amazing thing is that it works so well. It took a certain creativity for the programming team to believe that questioning the depth of neighbouring pixels would be enough to tell you what body part you were looking at – but after that the creativity belonged to the machine.
One of the challenges of machine learning is something called ‘over-fitting’. It’s always possible to come up with enough questions to distinguish an image using the training data, but you want to come up with a program that isn’t too tailored to the data it has been trained on.