How do machines learn?


Let us begin with a one-sentence summary of Part-I of this series, What is Machine Learning: a machine is said to learn a task when its performance, according to some measure, improves after being exposed to some training (or experience).

Here we see how that definition can be turned into an implementable algorithm. The key concept is the error, which is the difference between what the machine predicts and the actual value. The actual value, provided as part of the experience, is also called the ground truth. The overall idea is very simple: adjust the parameters of the solution such that each adjustment reduces the error. Keep adjusting the parameters until you can no longer reduce the error: this is called convergence, and the system is said to have converged to a solution. As an aside, it is good to remember that there is no guarantee that a system will converge, in which case we simply stop after making a fixed number of adjustments.


1. Stochastic Gradient Descent

Pictorially and conceptually, we look at the error as a landscape over the parameters we adjust. If for some values of the parameters the error is high, we are on a hill. If for another set of values the error is low, we are in a valley. Reducing the error by adjusting the parameters is akin to travelling down into a valley. Learning, therefore, amounts to adjusting the parameters in such a way that we find a way down into a valley. Convergence is stopping at the very bottom of the valley.

The same idea is expressed in fancy terms so that it sounds technical and is also impressive :-) The essence of the strategy is that we start at some random point in the landscape (randomly chosen values of the parameters), find the direction of the downward slope, and descend along it. Another word for randomness is stochastic. A slope is mathematically called a gradient. Travelling down is descending. Therefore, this very simple yet powerful strategy is called Stochastic Gradient Descent (SGD).

Behind almost every learning algorithm in AI today lies Stochastic Gradient Descent implemented as a 3-step strategy.

  1. start with randomly chosen values for parameters
  2. calculate error, i.e. how far you are from the real solution
  3. move towards solution, i.e. bring down the error
It is extremely simple and intuitive, so much so that it can be understood by anyone willing to work through a bit of high-school maths.
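The three steps can be sketched in a few lines of code. Here is a minimal sketch on a made-up one-parameter model predicting $y = wx$; the model, the data and the learning rate are all assumptions for illustration, not something from the text above.

```python
import random

def sgd_sketch(data, steps=2000, lr=0.01):
    w = random.uniform(-1.0, 1.0)    # step 1: random starting value
    for _ in range(steps):
        x, y = random.choice(data)   # look at one training example at a time
        error = w * x - y            # step 2: how far from the ground truth
        w -= lr * error * x          # step 3: adjust w to bring down the error
    return w

# Experience: pairs (x, y) generated by the true relationship y = 3x.
data = [(x, 3.0 * x) for x in range(1, 6)]
w = sgd_sketch(data)                 # w ends up close to 3
```

The exact form of the step-3 update comes from the derivative of the error, which is precisely what the mathematical details below work out.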


2. Mathematical Details

As we make particular choices in the three steps of SGD, we get different formulations and sometimes different algorithms. Before doing that, let us give it the mathematical flavour that it deserves so that we can appreciate it in action later.

Consider a relationship between input feature $x$ and the output value $y$ given by $$y = f(a, b; x)$$ where $a, b$ are parameters of the function $f$. For example, $f(a, b; x) = ax^2 + b$ is a function with input $x$ and parameters $a$ and $b$. We did not write the above equation in the simpler way that we learn in school, $y = f(x)$, because the parameters are important in learning. In fact, we say that a machine learns when it gets the correct values of the parameters ($a, b$) from the experience, which in many cases is simply a collection of pairs $(x, y)$. A pair $(x, y)$ means that if the input is $x$, then its corresponding output is $y$.
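Concretely, the function and the experience might look like this (the true parameter values $a = 2$, $b = 1$ are made up for this sketch):

```python
# The example function f(a, b; x) = a*x^2 + b from the text.
def f(a, b, x):
    return a * x ** 2 + b

# The experience: a collection of (x, y) pairs. Here they are generated
# with hypothetical true parameters a = 2, b = 1; learning means
# recovering those two values from the pairs alone.
training_set = [(x, f(2, 1, x)) for x in [0, 1, 2, 3]]
# training_set is [(0, 1), (1, 3), (2, 9), (3, 19)]
```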

What about the error? Error is just the difference between the predicted values and the ground truth. This also needs to be made more precise. There are many ways to calculate differences: the most popular is called Mean Squared Error (MSE), given by $$E = \frac{1}{N}\sum_{i=1}^{N}(\hat{y_i} - y_i)^2$$ where $N$ is the number of samples in the training set, $y_i$ is the ground truth and $\hat{y_i}$ is the predicted value for the training sample $i$. MSE measures the overall error on the training set, but if we want to look at a single example, as in SGD, we use the squared difference for that one sample: $$E_i = (\hat{y_i} - y_i)^2$$
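As a quick illustration, the two error measures can be computed on a small made-up set of predictions:

```python
# Mean Squared Error over a (hypothetical) set of predictions.
def mse(predictions, ground_truth):
    n = len(ground_truth)
    return sum((p - y) ** 2 for p, y in zip(predictions, ground_truth)) / n

y_true = [1.0, 2.0, 3.0]       # ground truth
y_pred = [1.5, 2.0, 2.0]       # predictions
overall = mse(y_pred, y_true)  # (0.25 + 0.0 + 1.0) / 3

# The per-sample error used in SGD, here for the first sample:
e_0 = (y_pred[0] - y_true[0]) ** 2   # 0.25
```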

The last step is actually to move down the slope. This is the most interesting in practice, as you will see! A slope is generally given by a derivative in maths. From first principles, it is approximated, for a small change $\Delta$, as $$\Delta_a = \frac{\partial{f}}{\partial{a}} \approx \frac{f(a+\Delta, b; x) - f(a, b; x)}{\Delta}$$ where we are trying to find the effect of adjusting the parameter $a$ by changing it slightly, with the change given by $\Delta$. We do this for every parameter of the function $f$. Thus, for finding the effect of $b$, we have $$\Delta_b = \frac{\partial{f}}{\partial{b}} \approx \frac{f(a, b+\Delta; x) - f(a, b; x)}{\Delta}$$
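These finite-difference formulas translate directly into code. A small sketch, using the example function $f(a, b; x) = ax^2 + b$ from earlier (the value of `delta` is an arbitrary small number chosen for illustration):

```python
# Finite-difference approximation of the derivatives, mirroring the
# formulas above: change one parameter by a small delta and divide
# the change in f by delta.
def f(a, b, x):
    return a * x ** 2 + b

def slope_a(a, b, x, delta=1e-6):
    return (f(a + delta, b, x) - f(a, b, x)) / delta

def slope_b(a, b, x, delta=1e-6):
    return (f(a, b + delta, x) - f(a, b, x)) / delta

# For f = a*x^2 + b we know df/da = x^2 and df/db = 1,
# so at x = 3 the estimates should be close to 9 and 1.
```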

Finally, how do we use the derivatives to adjust/update the parameters? Note that in both the derivatives above, a positive value implies that you are climbing and a negative value means that you are descending. We wish to reward the parameters where the slope is negative and penalise them where it is positive. Also, remember that we want to descend slowly. Keeping these factors in mind, the adjustment equation is given by $$a_{\mbox{new}} = a_{\mbox{old}} - \eta\nabla{E_i}$$ where $\nabla$ stands for the gradient of the error function $E_i$ for example $i$ in the training set. $\eta$, called the learning rate, controls how fast we adjust the parameters. It is usually kept small, around $10^{-3}$. Too high a value will make the algorithm jump around wildly and too small a value will make it converge very slowly.

The above equation may be written more explicitly to bring out the fact that we are adjusting the parameters: $$a_{\mbox{new}} = a_{\mbox{old}} - \eta\; \frac{\partial{E_i}}{\partial{a}}$$ and $$b_{\mbox{new}} = b_{\mbox{old}} - \eta\; \frac{\partial{E_i}}{\partial{b}}$$ See how the equations increase the parameter value when the derivative is negative and decrease it when the derivative is positive. They are thus rewarding and penalising the parameters appropriately.
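Putting the update equations to work on the example $f(a, b; x) = ax^2 + b$: below is a sketch in which the true values $a = 2$, $b = 1$, the inputs and the learning rate are all assumptions made up for illustration.

```python
import random

def f(a, b, x):
    return a * x ** 2 + b

# Experience generated with hypothetical true parameters a = 2, b = 1.
data = [(x, f(2.0, 1.0, x)) for x in [0.5, 1.0, 1.5, 2.0]]

a = random.uniform(-1.0, 1.0)    # step 1: random starting values
b = random.uniform(-1.0, 1.0)
eta = 0.01                       # the learning rate

for _ in range(20000):
    x, y = random.choice(data)   # one sample at a time, as in SGD
    e = f(a, b, x) - y           # step 2: signed error on this sample
    # For E_i = (f(a,b;x) - y)^2: dE_i/da = 2*e*x^2 and dE_i/db = 2*e.
    a -= eta * 2 * e * x ** 2    # step 3: the update equations
    b -= eta * 2 * e

# a and b end up close to the true values 2 and 1.
```

Here the derivatives are worked out analytically from $E_i$; the finite-difference slopes from the previous section would do the same job, only more slowly.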

The entire mathematics is easily generalisable to arbitrary functions with multiple parameters. In the general case, it is common to write the function $f$ as $$f(\Theta;\mathbf{x})$$ where $\Theta$ represents the multiple parameters and $\mathbf{x}$ represents a vector of all the input variables. We are not interested in such complex beasts right now, but you will encounter them when you do any research in Deep Learning!

We are now done, even with the mathematical details. All that is left is to see SGD in action. Let us do just that in Part - III (SGD in Action!)


END OF PART - II