Machine learning and artificial intelligence are hard. Hard to understand. At least that's how it can seem. A lecturer once told me that the best definition he knew for "artificial intelligence" was "Anything we can't do yet". But if true artificial intelligence is still a few years away, machine learning is increasingly common in our day-to-day lives. We talk to our phones, we search Google. Amazon tells us what to buy. Twitter tells us what to think. But what is machine learning? Let's begin with a simple definition:
Machine learning is programming computers by examples.
And that's pretty much it. Programming by examples. Rather than giving a computer a series of instructions, we give it examples that show it the kind of thing we mean. Using those examples, the computer is able to spot patterns and behave in an appropriate way when it encounters something like those examples in the future.
Let's take something simple that a child of four could do: visual recognition. Show a small child a picture of a cat, and they will tell you it's a cat. But doing that kind of thing with computers has always been immensely complex. If we were going to write a list of instructions to tell the computer how to recognise a cat, where would we begin? With the ears? What does a cat's ear look like?
[Image caption: Rules-based visual recognition system meets its nemesis]
But with machine learning, we don't give the computer a series of explicit commands to identify a cat. We just give it thousands and thousands of cat photos and let the computer decide what it is that makes a cat look like a cat.
For this reason, we don't call this process "programming"; we call it "training".
Training to learn
Let's work through an example of a machine learning system. All machine learning systems do pretty much the same thing. They take an input. They process it using maths, processing and pixie dust, and they produce an output.
For example, let's say we're going to build an application to automatically classify news articles as "technology", "business" or "sport". We'll begin by collecting a bunch of example "technology" articles, feeding them into our machine learning system and telling it that the output should be "technology".
This is called "training" the system. Rather than giving it rules, you give it examples.
Then, once the system is trained, you give it something it hasn't seen before and ask it for the output it thinks is correct.
The machine learning system will then make its best attempt at giving you an answer, given its previous experience. If the answer is right, great. If not, you can always feed the correct answer back in as another example.
That's how machine learning systems get better over time. They learn. Hence the name.
Building one (the fun part)
There are several ways of creating a machine learning system. You might use a neural network. That's an application that tries to simulate the processes of neurons in a brain. But brains are quite tricky. They're like pasta makers. Everyone's got one. No-one knows how they work.
So we'll look at an alternative approach that uses statistics to learn from examples. Statistical machine learning is used in many online systems, from Google searches to Amazon recommendations. Most of them are built using the ideas of an 18th-century mathematician called the Rev. Thomas Bayes. In order to understand what Bayes said, we just need to understand this theorem:
[Image caption: You DO NOT need to learn this]
OK, just kidding. You don't need to understand that. All you need to understand is that the good Rev. Bayes discovered a way to turn probability on its head: a way of turning observations of what happened in the past into predictions about what will happen in the future.
In our example, we're going to build a system to read news stories and decide whether they're about sport, technology or business. We'll get a bunch of examples from the web and categorise them ourselves. We can then feed these into our system.
But what will the code do with them? Well, it looks at the words used in each of the example categories and counts how frequently they occur. In that way the computer can very easily estimate things like:
The probability that a business article contains the word "growth" [Result 1]
We can make that estimate by totalling up the number of instances of the word "growth" in all of our example business articles, and dividing it by the total number of words in those same business articles.
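If you'd like to see that counting step written down, here is a minimal sketch in Python. The two example sentences, the `tokenize` helper and the `word_probability` function are all made up for illustration. They are not taken from the real code or data.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())

def word_probability(word, articles):
    """Estimate how likely `word` is in a category: its count divided by all words in `articles`."""
    counts = Counter()
    for article in articles:
        counts.update(tokenize(article))
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

# Hypothetical stand-ins for the example business articles.
business_examples = [
    "Profits rose as growth in exports continued",
    "Analysts expect slower growth next year",
]

# "growth" appears 2 times among 13 words in these two made-up articles.
p_growth = word_probability("growth", business_examples)
print(p_growth)
```

With real articles you would have thousands of words rather than thirteen, but the calculation is the same.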
Now that's pretty easy to calculate, but it's not very useful. It's just a value derived from a set of historic articles.
But this is where the good Rev. Bayes comes in. Remember that pile of mathematics we ignored above? Well that allows us to take the numbers we've already calculated and work this out…
The probability that an article is a business article, given that it contains the word "growth" [Result 2]
That might seem like a really small change, but it's not. Re-read it and compare it to result 1 above. What result 2 gives us is the ability to look at any article in the future and get a hint about whether it's a business article.
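For anyone who does want to see the step written out, this is Bayes' theorem in its standard textbook form, applied to our example. The first term on top is Result 1. The other two terms would typically be estimated as the share of example articles that are business articles, and how often "growth" turns up across all the examples; that reading is an assumption about how you might estimate them, not a description of any particular piece of code.

```latex
P(\text{business} \mid \text{growth})
  = \frac{P(\text{growth} \mid \text{business}) \; P(\text{business})}{P(\text{growth})}
```

The left-hand side is Result 2. Everything on the right-hand side can be counted from the examples we already have.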
Now of course doing this with just the word "growth" really doesn't tell you much. But what if you did the same calculation with all the words that appear in the example set of articles? That would give you hundreds of words, and hundreds of pieces of evidence that would allow you to decide if any article you saw in the future was, or was not, about business.
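Combining the evidence from every word can be sketched roughly as follows. This is a generic "naive Bayes" classifier written for illustration, not the code from the repository: the `NaiveBayes` class, the one-sentence training examples, the use of logarithms (to avoid multiplying lots of tiny numbers) and the add-one smoothing (so a word never seen in a category doesn't wipe that category out) are all assumptions on my part.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into simple alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of examples
        self.vocabulary = set()                  # every word seen in training

    def train(self, text, category):
        """Add one labelled example to the counts."""
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def scores(self, text):
        """Return a log-score per category; higher means more likely."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        result = {}
        for category, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Start from how common the category is among the examples.
            score = math.log(self.doc_counts[category] / total_docs)
            for word in tokenize(text):
                # Add-one smoothing: unseen words get a small, non-zero probability.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            result[category] = score
        return result

    def classify(self, text):
        """Pick the category with the highest score."""
        scores = self.scores(text)
        return max(scores, key=scores.get)

# Hypothetical one-sentence "articles", just to show the shape of the data.
classifier = NaiveBayes()
classifier.train("shares fell as growth slowed and profits dropped", "business")
classifier.train("the new phone ships with a faster chip and more memory", "technology")
classifier.train("the striker scored twice as the team won the cup final", "sport")

label = classifier.classify("profits and growth beat forecasts")
print(label)  # business
```

Feeding a misclassified article back in is just another call to `train` with the correct category.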
Let's try our code out. If you want to follow along at home, I've posted the code on GitHub. Included in the source code are two sets of data: our example news articles that have already been categorised (these will be the examples that train our code) and then a whole bunch of other articles that we will ask the code to analyse.
To see how well the program will work, I've gone through the articles and given them a little prefix to show whereabouts they appeared in the news sites they were downloaded from. These prefixes are only to make it easy for us to analyse the results. They aren't taken into account by the code itself. Because that would be cheating.
Here are our news articles that we're going to analyse:
In fact there are about 80 articles, all taken from different web sites than our original examples.
So how does the code do? Does it do better than pure chance? When the code runs, it classifies these stories as "Technology" stories.
Actually it classified 37 stories as "Technology", 34 of which came from technology news sections, and three of which came from business. Of those three articles, one is about the benefits of hiring a virtual office assistant, which is kind of a business and technology story. The second is about recent changes in the stock market value of Google and Microsoft. And the third has a lot of references to office "windows". So that's either 34 or 36 out of 37 correctly analysed.
For sport, the code finds these nine articles:
And that's all of the sports articles. That tells you two things: the code is really good at spotting a sports story when it sees one, and also that I clearly don't like sport because I got bored after downloading nine examples.
Finally, here's what it thought were business stories:
That's thirty stories all about business, and it doesn't include the three business stories which it decided were more like technology stories.
So the result is really pretty good, and probably comparable with the accuracy of a human. Like a human, it can improve over time. If you find examples where the code went wrong – like the article that contained references to office "windows" – you can add these back in as examples in the correct category. Of course, unlike a human, this app could analyse thousands and thousands of stories a day, relentlessly, without the need for food or sleep. And it can run on a very, very low-powered computer. In fact, the phone in your pocket would have enough processing power.
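To put a number on "pretty good", the counts reported above can be tallied up. This little sketch assumes the 37, 9 and 30 stories mentioned are the whole test set, and that every story classified as sport or business really was one, which is how the results read.

```python
# Counts taken from the results described above.
technology_predicted = 37
technology_correct_strict = 34   # only stories from technology sections
technology_correct_lenient = 36  # also counting the two borderline stories
sport_correct = 9
business_correct = 30

total = technology_predicted + sport_correct + business_correct  # 76 stories

strict = (technology_correct_strict + sport_correct + business_correct) / total
lenient = (technology_correct_lenient + sport_correct + business_correct) / total

print(f"strict:  {strict:.1%}")   # strict:  96.1%
print(f"lenient: {lenient:.1%}")  # lenient: 98.7%
```

So depending on how generous you are with the borderline stories, the code gets somewhere between 73 and 75 out of 76 right.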
But not only can the application automatically classify news stories, it can also tell you why it decided to put a story in a particular category. Every word in the story is treated as a piece of evidence that is for, or against, the article being in a particular category. Here's one example, an article that the code decided was about business:
The "bluer" the word, the stronger the evidence that this article is a business article. The "redder" the word, the greater the evidence that this isn't a business article.
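One plausible way to score each word as evidence is to compare how common it is inside the category with how common it is outside it. The sketch below does that with a log ratio: positive means "bluer", negative means "redder". The `word_evidence` function, the sample sentences and the add-one smoothing are hypothetical, and the real code may well compute its colours differently.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())

def word_evidence(text, in_category, out_of_category):
    """Return (word, score) pairs; score > 0 favours the category, score < 0 counts against it."""
    inside = Counter(w for doc in in_category for w in tokenize(doc))
    outside = Counter(w for doc in out_of_category for w in tokenize(doc))
    vocab_size = len(set(inside) | set(outside))
    inside_total = sum(inside.values())
    outside_total = sum(outside.values())
    evidence = []
    for word in tokenize(text):
        # Add-one smoothing keeps both probabilities above zero.
        p_in = (inside[word] + 1) / (inside_total + vocab_size)
        p_out = (outside[word] + 1) / (outside_total + vocab_size)
        evidence.append((word, math.log(p_in / p_out)))
    return evidence

# Hypothetical examples: business articles versus everything else.
business = ["shares fell as growth slowed", "profits and growth beat forecasts"]
other = ["the team won the cup final", "the new phone has a faster chip"]

for word, score in word_evidence("growth helped the team", business, other):
    colour = "blue" if score > 0 else "red"
    print(f"{word:8s} {score:+.2f} {colour}")
```

Adding up the scores for every word in an article gives an overall verdict, which is why a handful of strongly "blue" words can outweigh a lot of faintly "red" ones.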
But why does this matter? (the profit part)
Thank you for getting this far into this post. You will hopefully have found our worked example interesting, but you might be wondering what relevance something like this has for your business. This kind of thing might be all well and good for the likes of Google, but is it really going to help you sell more jeans and shoes?
Well it just might. Let's say you are actually in the business of selling jeans and shoes. You may well have a large set of records that show what people have bought in previous years. Let's imagine, just for the moment, that you want to sell a pair of jeans. Look back at your data and consider each instance of someone buying those jeans previously. What products had they already purchased? How long ago was it? Treat all of those pieces of information as examples that build up a profile of what a "person about to buy jeans" is like. Then look at your current customers. For each of them calculate:
The probability they will buy jeans given they have already bought the following…
You will then be able to rank your customers in terms of likelihood of buying your product. Rather than marketing to everybody you've sold anything to, you'll be able to market to those people who are most likely to be interested in your product or service. In a sense, you could build a recommendation engine for customers that works in the same way as Amazon's recommendation engine for products.
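Here is a rough sketch of what that ranking could look like, reusing the same counting idea as the news classifier but with products in place of words. Everything in it is invented for illustration: the customer names, the purchase histories and the `train` and `probability_of_buying` functions. It also ignores how long ago each purchase was, which a real system would probably want to use.

```python
import math
from collections import Counter

def train(histories):
    """histories: list of (products_bought_before, bought_jeans) pairs."""
    counts = {True: Counter(), False: Counter()}
    totals = Counter()
    for products, bought in histories:
        counts[bought].update(products)
        totals[bought] += 1
    all_products = set(counts[True]) | set(counts[False])
    return counts, totals, all_products

def probability_of_buying(basket, model):
    """Estimate P(buys jeans | products already bought) with add-one smoothing."""
    counts, totals, all_products = model
    log_scores = {}
    for outcome in (True, False):
        score = math.log(totals[outcome] / sum(totals.values()))
        denominator = sum(counts[outcome].values()) + len(all_products)
        for product in basket:
            score += math.log((counts[outcome][product] + 1) / denominator)
        log_scores[outcome] = score
    # Turn the two log-scores into a probability between 0 and 1.
    top = max(log_scores.values())
    weights = {k: math.exp(v - top) for k, v in log_scores.items()}
    return weights[True] / (weights[True] + weights[False])

# Hypothetical past customers: what they had bought, and whether they then bought jeans.
past = [
    ({"belt", "boots"}, True),
    ({"belt", "t-shirt"}, True),
    ({"socks"}, False),
    ({"socks", "scarf"}, False),
]
model = train(past)

# Hypothetical current customers and what they have bought so far.
current = {"alice": {"belt"}, "bob": {"socks"}, "carol": {"boots", "scarf"}}

ranked = sorted(
    current,
    key=lambda name: probability_of_buying(current[name], model),
    reverse=True,
)
print(ranked)  # ['alice', 'carol', 'bob']
```

The customers at the top of the list are the ones to market to first.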
Machine learning may sound abstract. It might sound like space age technology that is only applicable to the likes of Elon Musk and Google. But it isn't. Turn your data sets into examples, and allow your business to learn and profit from them.