Let us begin with a one-sentence summary of Part-I of this series, What is Machine Learning: a machine is said to learn a task when its performance, according to some measure, improves after being exposed to some training (or experience).
Here we see how that definition can be turned into an implementable algorithm. The key concept is that of the *error*, and how to reduce it by adjusting the parameters of the machine.
Pictorially and conceptually, we view the error as a landscape over the parameters we adjust. If for some values of the parameters the error is high, we are on a hill; if for another set of values the error is low, we are in a valley. Reducing the error by adjusting the parameters is akin to travelling down into a valley, so learning amounts to adjusting the parameters in such a way that we find a way down into a valley. Convergence means stopping at the very bottom of the valley.
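If you prefer code to pictures, here is a tiny sketch in Python of the same idea: a made-up error function over two parameters $a$ and $b$, sampled on a coarse grid, so that hills show up as big numbers and valleys as small ones. The function and the grid values are invented purely for illustration.

```python
# A toy "error landscape": a made-up error function over two parameters a and b.
# Its valley (lowest error) is at a = 2, b = 1; everywhere else is uphill.
def error(a, b):
    return (a - 2.0) ** 2 + (b - 1.0) ** 2

# Sample the landscape on a coarse grid: small numbers are valleys, large ones hills.
for a in [0, 1, 2, 3, 4]:
    row = [f"{error(a, b):5.1f}" for b in [-1, 0, 1, 2, 3]]
    print(f"a={a}:", "  ".join(row))
```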
The same idea is expressed in fancy terms so that it sounds technical and also impressive :-) The essence of the strategy is that we start at some random point in the landscape (randomly chosen values of the parameters), find the direction of the downward slope and descend along it. Another word for randomness is *stochastic*, the slope is the *gradient*, and going down is *descent*; that is where the strategy gets its name.
Behind almost every learning algorithm in AI today lies Stochastic Gradient Descent (SGD), implemented as a 3-step strategy: start somewhere at random, find the downward slope, and take a step down; then repeat the last two steps until convergence.
As we make particular choices in the three steps of SGD, we get different formulations and sometimes different algorithms. Before exploring those choices, let us give the strategy the mathematical flavour it deserves, so that we can appreciate it in action later.
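Here is the 3-step strategy as a bare-bones Python sketch. The names `compute_slope`, `LEARNING_RATE` and `NUM_STEPS` are placeholders chosen for this illustration; how the slope is actually computed is exactly what the mathematics below spells out.

```python
import random

# A skeletal sketch of the 3-step strategy; the slope computation is supplied
# from outside, and the constants are arbitrary illustrative values.
LEARNING_RATE = 0.01
NUM_STEPS = 1000

def sgd_sketch(compute_slope):
    # Step 1: start at a random point in the landscape
    #         (randomly chosen values of the parameters a and b).
    a, b = random.uniform(-1, 1), random.uniform(-1, 1)
    for _ in range(NUM_STEPS):
        # Step 2: find the direction of the downward slope at (a, b).
        slope_a, slope_b = compute_slope(a, b)
        # Step 3: take a small step down the slope.
        a -= LEARNING_RATE * slope_a
        b -= LEARNING_RATE * slope_b
    return a, b
```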
Consider a relationship between input feature $x$ and the output value $y$ given by $$y = f(a, b; x)$$ where $a, b$ are parameters of the function $f$. For example, $f(a, b; x) = ax^2 + b$ is a function with input $x$ and parameters $a$ and $b$. We did not write the above equation in the simpler way that we learn in school, $y = f(x)$, because the parameters are important in learning. In fact, we say that a machine learns when it gets the correct values of the parameters ($a, b$) from the experience, which in many cases is simply a collection of pairs $(x, y)$; a pair $(x, y)$ means that if the input is $x$, then its corresponding output is $y$.
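To make this concrete, here is a small running example (in Python) that the later sketches build on: the function $f(a, b; x) = ax^2 + b$ and a tiny made-up "experience" of $(x, y)$ pairs. The numbers are invented so that the correct answer happens to be $a = 2$, $b = 1$.

```python
# The parameterised function from the text: f(a, b; x) = a*x^2 + b.
def f(a, b, x):
    return a * x ** 2 + b

# A tiny made-up "experience": (x, y) pairs that happen to follow y = 2*x^2 + 1.
# Learning means recovering values close to a = 2 and b = 1 from these pairs alone.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 9.0), (3.0, 19.0)]
```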
What about the error? Error is just the difference between the predicted values and the ground truth. This also needs to be made more precise, because there are many ways to calculate differences; the most popular is called the *squared error*. For the $i$-th training pair $(x_i, y_i)$, it is $$E_i = \left(f(a, b; x_i) - y_i\right)^2$$ so that large mistakes are penalised far more heavily than small ones.
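Continuing the running sketch, the per-example squared error $E_i$ might be written as:

```python
# Squared error for a single training pair (x_i, y_i): E_i = (f(a, b; x_i) - y_i)^2.
def squared_error(a, b, x_i, y_i):
    return (f(a, b, x_i) - y_i) ** 2

# Example: with a = 1, b = 0 the prediction for x = 2 is 4, but the ground truth is 9.
print(squared_error(1.0, 0.0, 2.0, 9.0))  # (4 - 9)^2 = 25.0
```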
The last step is actually to move down the slope. This is the most interesting part in practice, as you will see! Slope is generally given by a derivative in maths. From first principles, it is computed as $$\Delta_a = \frac{\partial{f}}{\partial{a}} = \frac{f(a+\Delta, b; x) - f(a, b; x)}{\Delta}$$ where we are trying to find the effect of adjusting the parameter $a$ by changing it slightly, with the change given by $\Delta$ (strictly, the derivative is the limit of this ratio as $\Delta \to 0$; a small $\Delta$ gives a good approximation). We do this for every parameter of the function $f$. Thus, for finding the effect of $b$, we have $$\Delta_b = \frac{\partial{f}}{\partial{b}} = \frac{f(a, b+\Delta; x) - f(a, b; x)}{\Delta}$$ In our case, the function whose slope we need is the error $E_i$, but the recipe is the same for any function.
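These finite-difference formulas translate almost word-for-word into code. The sketch below estimates the slopes of the per-example error $E_i$ from the running example; `DELTA` is an arbitrary small value chosen for the illustration.

```python
DELTA = 1e-6  # the small change Delta from the formulas above (an arbitrary small value)

# Slope of the per-example error E_i with respect to each parameter, estimated
# exactly as in the formulas: (value after a tiny nudge - value before) / nudge.
def slopes(a, b, x_i, y_i):
    e = squared_error(a, b, x_i, y_i)
    slope_a = (squared_error(a + DELTA, b, x_i, y_i) - e) / DELTA
    slope_b = (squared_error(a, b + DELTA, x_i, y_i) - e) / DELTA
    return slope_a, slope_b
```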
Finally, how do we use the derivatives to adjust/update the parameters? Note that in both the derivatives above, a positive value implies that you are heading uphill as the parameter increases, so the parameter should be decreased; a negative value implies the opposite. In other words, every parameter is nudged in the direction opposite to its derivative, by an amount controlled by a small positive number $\eta$ called the learning rate.
Written out explicitly for our two parameters, the update rule brings out the fact that we are adjusting them: $$a_{\mbox{new}} = a_{\mbox{old}} - \eta\; \frac{\partial{E_i}}{\partial{a}}$$ and $$b_{\mbox{new}} = b_{\mbox{old}} - \eta\; \frac{\partial{E_i}}{\partial{b}}$$ See how each equation increases the parameter value when the derivative is -ve and decreases it when the derivative is +ve. It is thus rewarding and penalising the parameters appropriately.
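Putting the pieces of the running sketch together, a minimal (and deliberately naive) SGD loop might look like the following: pick one training pair at random, estimate the slopes of $E_i$, and subtract $\eta$ times each slope from the corresponding parameter. The learning rate and step count are arbitrary choices that happen to work for the toy data.

```python
import random

# Repeat the update rule, one training pair at a time (the per-example error E_i).
# eta and num_steps are arbitrary illustrative choices, not recommended values.
def fit(data, eta=0.005, num_steps=20000):
    a, b = random.uniform(-1, 1), random.uniform(-1, 1)  # random starting point
    for _ in range(num_steps):
        x_i, y_i = random.choice(data)        # one randomly chosen pair
        slope_a, slope_b = slopes(a, b, x_i, y_i)
        a = a - eta * slope_a                 # a_new = a_old - eta * dE_i/da
        b = b - eta * slope_b                 # b_new = b_old - eta * dE_i/db
    return a, b

a, b = fit(data)
print(f"learned a = {a:.2f}, b = {b:.2f}")    # should come out close to a = 2, b = 1
```

Because the starting point is random, the path differs from run to run, but the learned values should settle close to $a = 2$, $b = 1$ for the toy data above.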
The entire mathematics is easily generalisable to arbitrary functions with multiple parameters. In the general case, it is common to write the function $f$ as $$f(\Theta;\mathbf{x})$$ where $\Theta$ represents the multiple parameters and $\mathbf{x}$ represents a vector of all the input variables. We are not interested in such complex beasts right now, but you will encounter them when you do any research in Deep Learning!
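For the curious, here is a sketch of the same recipe written for an arbitrary list of parameters $\Theta$; it still relies on the finite-difference approximation above, which is fine for a handful of parameters but is not how real Deep Learning libraries compute their gradients.

```python
DELTA = 1e-6   # small nudge for the finite-difference slope
ETA = 0.005    # learning rate (an arbitrary illustrative value)

# One SGD update for a function with arbitrarily many parameters theta = [t1, t2, ...].
# `error_fn(theta, x, y)` can be any per-example error; nothing else changes.
def sgd_step(error_fn, theta, x, y):
    e = error_fn(theta, x, y)
    new_theta = list(theta)
    for k in range(len(theta)):
        nudged = list(theta)
        nudged[k] += DELTA                       # nudge only the k-th parameter
        slope_k = (error_fn(nudged, x, y) - e) / DELTA
        new_theta[k] = theta[k] - ETA * slope_k  # theta_new = theta_old - eta * slope
    return new_theta
```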
We are now done, even with the mathematical details. All that is left is to see SGD in action. Let us do just that in Part-III (SGD in Action!)