This talk will describe the big ideas behind machine learning: what it can do for us, and how it works.
We'll avoid two of the traps that machine learning introductions often fall into.
We'll avoid focusing on implementation, at the expense of understanding how things work; in other words, we'll try not to make this sound too much like magic.
We'll also avoid getting so far into the details of how things are implemented that we obscure simple ideas with complex mathematical notation.
Let's look at a problem I recently solved using machine learning, so we can see where it's useful.
I wanted to parse descriptions of ingredients in recipes to extract the quantity, unit of measure, and ingredient name.
At first this seemed simple, but the more examples I looked at, the more complex the problem became. In the end I gave up trying to use basic string parsing. While it was easy for me to look at an ingredient string and see the answer, it was hard to work out what rules a computer would need to follow to do the same.
This kind of problem is ideal for machine learning, because the killer feature of machine learning is generalisation. If a system can generalise, it can work with examples that weren't explicitly considered by the design.
As an aside, you can read more about how I solved this in practice in my article on Named Entity Recognition on the thoughtbot blog.
Think about what you do when you write a typical program: you consider all the possible types of input, and write down rules for the computer to follow.
If there are too many possibilities to consider all of them, or if our understanding of the rules is too vague to write them down precisely, we can't follow this typical approach.
Fortunately, you've probably built generalising systems before, even if you didn't realise you were doing it.
If your high school was anything like mine, you did a lot of experiments in science class. We're going to look at a simple experiment here, and use it as an analogy for how machine learning works.
One popular experiment is to measure how high a ball bounces when it is dropped from different heights.
The aim of the experiment is to discover if there's a mathematical relationship between the height of the drop and the height of the bounce. If we discover such a relationship, we'll be able to use it to predict the height of future bounces.
The first step is to collect data. We have to drop a ball a few times from different heights and record the heights of the bounces.
Once we have plotted our data, we can see a clear trend: the points on our chart are arranged in roughly a straight line. We can add a trend line to our chart to indicate the relationship between drop height and bounce height.
When we decide where to draw a straight trend line, we're really choosing two values:
A fixed point the line passes through. Any point would do, but we usually use the point where the line passes through the vertical axis and call it the intercept. In our experiment, that means we're picking the value of bounce height when drop height is zero, so we'd probably expect something around zero.
The gradient of the line, which is a measure of how steeply the line slopes. If the gradient is 2, that means the line goes up 2 units each time it goes across 1 unit. In our experiment, the gradient tells us how much we expect the bounce height to go up by each time we increase the drop height by 1 metre, so we'd probably expect something between zero and one.
Once we have these two values, we can calculate any point on our line with some simple Python code:
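A minimal sketch of that calculation follows. The function name `predict_bounce_height` matches the one used in the error calculation later in this talk; the constant names `GRADIENT` and `INTERCEPT`, and the values given to them, are illustrative assumptions rather than fitted results.

```python
# Illustrative values chosen by eye (assumptions, not measured results).
GRADIENT = 0.75   # extra bounce height per extra metre of drop height
INTERCEPT = 0.0   # bounce height when the drop height is zero


def predict_bounce_height(drop_height):
    """Return the point on the trend line for the given drop height."""
    return GRADIENT * drop_height + INTERCEPT


# Works for drop heights we never measured, such as 2.5 metres.
print(predict_bounce_height(2.5))
```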
This line is a mathematical model of how a ball bounces. It can make predictions about how a ball will bounce, even when the drop height isn't one of the ones we measured. In other words, we've built a generalising system.
So far, we've picked the gradient and intercept of the line by eye, plotting whichever trend line looks right to us, but that might not give the best possible result. To make sure we're getting the best result, we need some measure of how well our trend line fits the values we measured in our experiment.
The measure that's typically used is the average squared error, which is calculated like this:
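Written as a formula, in notation assumed here for illustration: $n$ is the number of measurements, $y_i$ is the $i$-th measured bounce height, and $\hat{y}_i$ is the bounce height our line predicts for the same drop height.

```latex
\text{error} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2
```

Squaring each difference makes every term non-negative, so overshoots and undershoots can't cancel each other out, and large misses count for more than small ones.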
We can visualise it like this:
As we've already seen, we can use our line to predict a bounce height based on a drop height. We can build on that to write some code to calculate the error:
import math

measurements = [
    (1, 0.72), (2, 1.48), (3, 2.26), # etc.
]

def cost():
    errors = [
        predict_bounce_height(drop_height) - measured_bounce_height
        for drop_height, measured_bounce_height
        in measurements
    ]
    return sum([math.pow(err, 2) for err in errors]) / len(measurements)
Now that we can attach a number to a line to tell us how good or bad it is, we can attempt to find the best possible line.
If we plot a chart of the error against the gradient, we can see a pattern: the error is lowest at a single specific point.
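To make that pattern concrete, here is a small sketch that scores a handful of candidate gradients against the three sample measurements shown earlier. It assumes the intercept is fixed at zero, and `cost_for_gradient` is a hypothetical helper introduced only for this illustration.

```python
# The three sample measurements from earlier: (drop height, bounce height).
measurements = [(1, 0.72), (2, 1.48), (3, 2.26)]


def cost_for_gradient(gradient, intercept=0.0):
    """Average squared error of the line with this gradient (hypothetical helper)."""
    errors = [
        (gradient * drop_height + intercept) - measured_bounce_height
        for drop_height, measured_bounce_height in measurements
    ]
    return sum(err ** 2 for err in errors) / len(measurements)


# Try a few candidate gradients and print the error for each one.
candidates = [0.5, 0.6, 0.7, 0.75, 0.8, 0.9, 1.0]
for gradient in candidates:
    print(f"gradient={gradient:.2f}  error={cost_for_gradient(gradient):.4f}")

# The candidate with the lowest error is the best of the ones we tried.
best = min(candidates, key=cost_for_gradient)
print(f"best candidate: {best}")
```

The error falls as the gradient approaches the best candidate and rises again on the other side, which is the single dip we see on the chart.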
There are lots of algorithms available to find the minimum value of a function. For example, imagine an algorithm that makes small changes to the gradient, and iteratively gets closer to the best result.
Machine learning systems often use this type of algorithm to find the best parameters to fit a model to our data set.
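One way to picture such an algorithm is the sketch below: nudge the gradient up or down by a small step, keep the change whenever the error drops, and shrink the step when neither direction helps. This is a deliberately simple local search written for illustration, not the method any particular library uses. The intercept is assumed to be zero, and the function names, starting point, and step sizes are all assumptions.

```python
# The three sample measurements from earlier: (drop height, bounce height).
measurements = [(1, 0.72), (2, 1.48), (3, 2.26)]


def cost_for_gradient(gradient):
    """Average squared error of a line through the origin with this gradient."""
    return sum(
        (gradient * drop - bounce) ** 2 for drop, bounce in measurements
    ) / len(measurements)


def find_best_gradient(start=0.0, step=0.1, min_step=1e-6):
    """Make small changes to the gradient, keeping only those that reduce the error."""
    gradient = start
    best_cost = cost_for_gradient(gradient)
    while step >= min_step:
        improved = False
        for candidate in (gradient + step, gradient - step):
            candidate_cost = cost_for_gradient(candidate)
            if candidate_cost < best_cost:
                gradient, best_cost = candidate, candidate_cost
                improved = True
                break
        if not improved:
            step /= 2  # neither direction helped, so take smaller steps
    return gradient


best_gradient = find_best_gradient()
print(f"best gradient: {best_gradient:.3f}")
```

Each pass either moves the gradient to a lower-error value or halves the step, so the search settles on the bottom of the dip in the error chart.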
The killer feature we were aiming for was generalisation. So how do we know if our model generalises well? We've seen it can make predictions, but are they right?
This is easy to test: we can collect more data, data we didn't use to develop our model, and check if the predictions are close to what we observe in the real world. We can even re-use our error score calculation to see how well the predictions fit our test data.
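A sketch of that check is below. The `test_measurements` values are invented purely for illustration (they are not real experimental results), and the gradient and intercept stand in for whatever values the fitting step produced. Unlike the earlier `cost()`, this version takes the data set as a parameter so the same calculation can be pointed at unseen data.

```python
# Parameters assumed to have been fitted earlier (illustrative values).
GRADIENT = 0.75
INTERCEPT = 0.0

# Hypothetical held-out data: made-up numbers, not real measurements.
test_measurements = [(1.5, 1.10), (2.5, 1.90), (4.0, 3.05)]


def predict_bounce_height(drop_height):
    """Return the point on the trend line for the given drop height."""
    return GRADIENT * drop_height + INTERCEPT


def cost(measurements):
    """Average squared error of the model's predictions on the given data."""
    errors = [
        predict_bounce_height(drop_height) - measured_bounce_height
        for drop_height, measured_bounce_height in measurements
    ]
    return sum(err ** 2 for err in errors) / len(measurements)


test_error = cost(test_measurements)
print(f"error on unseen data: {test_error:.4f}")
```

If the error on the unseen data is close to the error on the original measurements, that suggests the line generalises; a much larger error suggests it does not.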
Our data was very simple: we only had one input variable and one output variable. We could look at the data on a scatter plot and it was clear we should pick a straight line as our model. In real-world machine learning systems, it's rarely that simple. We may have to try several different models before we find one that fits the data well, and generalises well to new examples.
If you're looking for more information, I'd recommend this book for a good overview of different types of models. While it is somewhat more mathematical than this talk, each equation is accompanied by a clear description and a worked example.
For a more hands-on approach, check out the online course from fast.ai.