Stochastic Gradient Descent in Action!


In the previous two articles - What is Machine Learning? and How do Machines Learn? - we discussed the basics of Machine Learning and the importance of the Stochastic Gradient Descent (SGD) algorithm. It is now time to implement SGD and show how it works.

As we get into the implementation details, it is worth remembering that understanding SGD is the first big step into the AI/ML/Deep Learning fields. A straightforward extension of SGD to neural networks with multiple layers gives the popular backpropagation algorithm, which has remained the workhorse learning algorithm for neural networks for several decades. The ideas we learn here - from preparing and scaling the data to using a test set and making predictions on unknown samples - are used almost unchanged in most research today. With a little experience and some mathematical background, you will also be able to appreciate how researchers work around the pitfalls and shortcomings of SGD and backpropagation.


1. Setting up the problem

Let us define our task, performance measure and experience. The task is to predict the cost of a book. The feature that we want to use is the number of pages in the book. The experience part is given by data for 55 different books, each with a certain number of pages and for which the costs are known.

We use a very common trick to get the performance measure: divide the experience into two parts. The first part is used as the experience, i.e. the examples from which the system learns; this part is called the training set. The second part, for which we also know the number of pages and the costs, is used for measuring performance. We pretend that for these books only the number of pages is given and ask the system to predict the costs. We make these predictions twice: before the training set is used for adjusting the parameters, and after the parameters have been adjusted to reduce the error. If the predictions after training are closer to the actual values (remember that we have the costs in the data) than the ones at the beginning, we say that the system has learnt to predict the cost from the number of pages. This second part of the data, used to measure the performance of the system, is called the test set.

We begin the implementation (in Python) and understand the details along the way. The data is present in a file named books-training.txt. We read it into an array $D$; the first column is the number of pages and the second, the cost. As there are 55 samples in the data, $D$ is a $55\times2$ array. Let us take the first 40 examples as the training set and the last 15 as the test set. The code is shown in Cells 1 and 2 below. We also print the first 8 items and plot the data. You can see from the printed data that a book with 210 pages costs ₹569.10 (first entry) while a book with 335 pages costs ₹1667.85 (last of the entries shown).

The plot shows the number of pages on the X-axis and the cost on the Y-axis. The first entry appears as a black dot at the position $X = 210, Y = 569.10$. There is a clear pattern in the dots, which means that the cost is related to the number of pages; had there been no relationship, the points would be scattered randomly. Since the pattern clearly exists, it makes sense for us to try to learn it!

In [1]:  1  # Just the usual imports of numpy and matplotlib!
         2  import numpy as np
         3  from math import exp, comb
         4  from matplotlib import pyplot as plt
         5  np.set_printoptions(precision=2)
In [2]:  1  # Read data from file and show 8 samples for understanding
         2  D = np.loadtxt('books-training.txt')
         3  print('Some Examples from the data - No. of pages and price')
         4  print(D[:8,:])
         5
         6  # Plot the data to get an idea of the relationship between
         7  # number of pages and cost (if any!)
         8  plt.grid(True)
         9  plt.plot(D[:40, 0], D[:40,1], 'k.')
        10  plt.show()
        11
        12  # Take the first 40 books for training, i.e. as experience
        13  x = D[:40, 0]
        14  y = D[:40, 1]
        15
        16  # Keep the last 15 examples for testing
        17  tx = D[40:, 0]
        18  ty = D[40:, 1]
Some Examples from the data - No. of pages and price
[[ 210.    569.1 ]
 [ 175.    362.25]
 [ 355.   1895.85]
 [ 295.   1255.05]
 [ 230.    707.1 ]
 [ 260.    941.1 ]
 [ 240.    781.5 ]
 [ 335.   1667.85]]


2. Mathematical Formulation

After setting up the problem qualitatively, it is time to give it a more precise definition. The pattern in the plot showed us that there is indeed a relationship between the cost of the book and the number of pages. Mathematically, let $x$ denote the number of pages and $y$, the cost of the book. We express the relationship between them as $$y=f(x)$$ where $f$ is a function. $f$ can take many forms, but given the simple pattern, we can restrict it to be a polynomial, i.e. something of the form $$a_0 + a_1x + a_2x^2 + \ldots + a_kx^k$$ where $a_0, a_1, a_2, \ldots, a_k$ are the parameters that we need to learn.

Of course, an immediate question is, "what should $k$ be?" In other words, what should be the degree of the polynomial? The pattern in the plot shows clearly that it is not a straight line; so, we need at least a second degree polynomial. Let us assume that $$y = a_0 + a_1x + a_2x^2$$ We need to find $a_0, a_1$ and $a_2$ from the training data given.
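As an aside, the code that follows evaluates such polynomials with NumPy's np.polyval, which expects the coefficients ordered from the highest degree down. A quick check with made-up numbers (not from our book data; np is already imported in Cell 1) shows the convention:

    # np.polyval takes [a2, a1, a0] to mean a2*x^2 + a1*x + a0
    print(np.polyval([2.0, -3.0, 1.0], 4.0))   # 2*16 - 3*4 + 1 = 21.0

Keeping this ordering in mind will also make the scale_params helper introduced later easier to follow.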

We find the parameters using our 3-step learning strategy. The code is given in Cell 3 below as the Python function sgd_learn. The function takes as inputs

  1. train: training data inputs (number of pages). It is usually a good idea to scale the inputs to the range $(-1,+1)$
  2. labels: training data outputs (cost of the books). Again, it is a good idea to scale the outputs (if numerical) into the range $(0,+1)$.
  3. params: initial values (randomly chosen) of the parameters; in our case randomly chosen values for $a_0, a_1, a_2$
  4. nparams: number of parameters, which defaults to $2$ but is $3$ in our case.
  5. delta: the size of the small step used to estimate the slope; defaults to $0.0001$
  6. eta: the learning rate; defaults to 0.05
  7. nepochs: the maximum number of times our learning algorithm runs; it is also the number of steps we take down the error hill, so to speak! The default value is 1000.
It returns the learnt values of the parameters given as input.
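For orientation, a call to sgd_learn with these inputs would look roughly like the sketch below; p0 is just a stand-in name for the randomly initialised parameters, and X and Y stand for the scaled training inputs and outputs (the actual call, with the real variables, appears later in Cell 5).

    p0 = np.random.random(3) * 2.0 - 1.0          # three random initial parameters
    learnt = sgd_learn(X, Y, p0, nparams=3, eta=0.05, nepochs=1000)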


3. sgd_learn Function in Detail

Let us write the learning function one step at a time. The function has the following steps.

  1. Initialise the necessary variables: The first step is to find out how many training samples we have. Line 5 does just that - it initialises the variable size as the number of training examples. For us, this is $40$.

    Second, remember that we need to find the adjustment that reduces the error at each step. We think of the error as a landscape where high values are hills and low values are valleys. We can then adjust the parameters to travel down a slope into a valley. Therefore, we keep a variable which tracks the slope with respect to each parameter: gradient_j is just that. It is an nparams-sized vector initialised to 0s (Line 8).

  2. Travel down the error slope: In our simple code, we travel down the slope taking nepochs steps (Lines 11-36). In other words, we adjust the parameters nepochs times.
    • Start at a random point, i.e. take one training sample at random from the training set (Line 13).
    • Predict the value for this selected point. That is, use the current values of $a_0, a_1$ and $a_2$, calculate $y = a_0 + a_1x + a_2x^2$ and store it in value_j (Line 16). The np.polyval function evaluates the polynomial whose parameters are given by params at the point train[j] (our randomly chosen sample).
    • Calculate the squared error as (value_j - labels[j])$^2$ (Line 19).
    • Take several small steps to see which direction reduces the error, i.e. goes down into the error valley (Lines 22 to 33).
    • Take a delta-sized step along each parameter in turn (Line 23).
    • After taking the step, calculate the value of $y$ and the error again. The new error is stored in error_delta (Lines 27-28), and the parameter is then restored to its original value (Line 29).
    • Calculate the slope as the difference between the error after taking the small step and the error at the starting point, divided by the step size delta (Line 33). For a complete understanding, think of the mathematical derivative.
    • Adjust the parameters by going in the direction opposite to the calculated slope. If the slope is positive, taking the step made you go up, and therefore you need to reduce the value of the parameter because increasing it increases the error. Otherwise, increase the parameter value because that takes us down the slope. This is done in Line 36, which implements the update equation from Part - II (restated just after this list).
  3. That's it! After nepochs adjustments, return the values of the adjusted parameters.
We are done!
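For reference, here is the computation that the inner loop (Lines 22-33) and the update (Line 36) implement, written as a finite-difference approximation of the derivative followed by the gradient descent step: $$\frac{\partial E}{\partial a_k} \approx \frac{E(\ldots, a_k + \delta, \ldots) - E(\ldots, a_k, \ldots)}{\delta}, \qquad a_k \leftarrow a_k - \eta\,\frac{\partial E}{\partial a_k}$$ where $E$ is the squared error on the randomly chosen training sample, $\delta$ is delta and $\eta$ is eta.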

There is also another useful function, scale_params, which we need because, in practice, we scale the inputs and outputs into the ranges $(-1,+1)$ or $(0,+1)$ before training; it converts the learnt parameters back to the original scale. Ignore its internals for now and just see how it is used (Cell 6, Line 15).

In [3]:  1  def sgd_learn(train, labels, params, nparams=2, delta=1e-4,
         2                eta=5e-2, nepochs=1000) :
         3
         4      # Determine how many samples we have for training
         5      size = train.shape[0]
         6
         7      # Initialise everything to 0s
         8      gradient_j = np.zeros(nparams)             # Slopes
         9
        10      # Learn by repeating many times
        11      for i in range(nepochs) :
        12          # Start somewhere at random!
        13          j = int(np.random.random() * size)
        14
        15          # Predict a value at the starting point
        16          value_j = np.polyval(params, train[j])
        17
        18          # Error = squared difference between predicted and actual
        19          error = (value_j - labels[j]) ** 2
        20
        21          # Take several small steps around the starting point
        22          for k in range(nparams) :
        23              params[k] += delta
        24
        25              # Compute new value after taking the small step
        26              # and return to the starting point
        27              error_delta = (np.polyval(params, train[j]) -
        28                             labels[j]) ** 2
        29              params[k] -= delta
        30
        31              # Find slope as the change in error divided by
        32              # the size of the small step
        33              gradient_j[k] = (error_delta - error) / delta
        34
        35          # Walk down the slope (but slowly!)
        36          params = params - eta * gradient_j
        37
        38      # Return whatever we learnt
        39      return params
        40
        41  def scale_params(unscaled, p=1.0, q=1.0, rev=True) :
        42      sz = unscaled.shape[0]
        43      scaled = np.zeros(sz)
        44      # np.polyval keeps the highest degree first, so work with
        45      # the lowest degree first and reverse back at the end
        46      rlearnt = unscaled[::-1] if rev else unscaled
        47      for k in range(sz) :
        48          for j in range(k, sz) :
        49              scaled[k] += comb(j, k) * (p ** k) * (q ** (j - k)) * rlearnt[j]
        50      if rev :
        51          scaled = scaled[::-1]        # Right side up again!
        52
        53      return scaled


4. Running the Code and Seeing it in Action

Let us now put our newly written sgd_learn through its paces on our book-price problem. We first prepare the inputs to the function. Our training and test data are already in place as the arrays x, y (training) and tx, ty (test). The number of parameters is $3$ as we want to fit the simplest polynomial that is not a straight line (Line 2). Remember that our parameters have to be initialised to random values - let us do that now. Line 5 in the code below (Cell 4) does just that! We create three random values in the range $-5.0$ to $+5.0$ and assign them to $a_0, a_1$ and $a_2$. In practice, it is good to keep these values small; we could also initialise them in the range $-1$ to $+1$.

Finally, we plot the test data (no. of pages vs. cost) against the predicted values with randomly initialised parameters. The yellow dots are the ground truth and the green circles are the predicted values. You can see that they are nowhere close!

In [4]:  1  # Choose number of parameters
         2  pars = 3
         3
         4  # Give randomly chosen values to the parameters initially
         5  init_params = np.random.random(pars) * 10.0 - 5.0
         6
         7  # Give helpful outputs to the user
         8  print('No. of Parameters Chosen: ', pars)
         9  print('Initial Values of Parameters: ', init_params)
        10
        11  # Let us just see what the system predicts at the start
        12  # without having learnt anything!
        13  plt.grid(True)
        14  plt.plot(tx, np.polyval(init_params, tx/10.), 'go', tx, ty, 'y.')
        15  plt.show()
No. of Parameters Chosen:  3
Initial Values of Parameters:  [-4.7   4.04  3.  ]

Now for one of those things which makes practice different from theory! In our data, $x$ is the number of pages and it ranges from $90$ to $360$. These numbers give us grief if we use them directly; the general rule of thumb is that the inputs should be small values, preferably in the range $(-1, +1)$. We do similarly with the costs, reducing them to the range $(0, +1)$ by dividing them by the maximum cost in the data. Without this scaling, you will find that you need to play around quite a lot with the learning rate, the number of epochs and even the step size (eta, nepochs, delta).
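As an aside, the hard-coded scaling below is just a linear map from the observed range of pages, $[90, 360]$, to $(-1, +1)$. Written generically (scale_to_range is a hypothetical helper, not part of the notebook), it would look like:

    def scale_to_range(v, lo, hi, a=-1.0, b=1.0):
        # Linearly map values lying in [lo, hi] to the range [a, b]
        return a + (b - a) * (v - lo) / (hi - lo)

    # scale_to_range(x, 90.0, 360.0) reproduces X = -1.0 + 2.0 * (x - 90.0) / 270.0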

Line 5 scales all the $x$ values into the range $(-1, +1)$ and Line 6 scales the costs into $(0, +1)$ before they are given as inputs to sgd_learn (Line 10). The learnt parameters returned by the function are stored in learnt; these are scaled back to the original range of costs in Line 11.

Like we did earlier, we plot the predicted values on the test data, now after learning. Again, the yellow dots are the ground truth and the green circles are the predicted values. You can see that they now match very closely, showing that the system has learnt to predict the book costs from the number of pages.

Note that we scaled the test data also into the range $(-1, +1)$ (Line 14) before giving them to the np.polyval() function with the learnt parameters learnt (Line 16).

In [5]:  1  # This is an implementation detail
         2  # We need to scale or normalise inputs to the system
         3  # so that it can handle any value. What we do is
         4  # scale the pages into (-1, +1) and the costs into (0, +1)
         5  X = -1.0 + 2.0 * (x - 90.0) / 270.0
         6  Y = y / y.max()               # Scale into (0, 1)
         7
         8  # Give the number of parameters and their initial values,
         9  # training data (X, Y) to the learning algorithm
        10  learnt = sgd_learn(X, Y, init_params, nparams=pars, eta=1e-1)
        11  learnt *= y.max()             # Scale back to the original
        12
        13  # Let us just see what the system predicts after learning
        14  TX = -1.0 + 2.0 * (tx - 90.0) / 270.0
        15  plt.grid(True)
        16  plt.plot(tx, np.polyval(learnt, TX), 'go', tx, ty, 'y.')
        17  plt.show()

Yay, let us see how good the learning is in numbers. The original book data was generated with the following function: $$y = 1.8x^2 - 10.2x - 10.5$$ where $x = P/10$ and $P$ = No. of pages.
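As a quick sanity check against the data printed earlier: the first entry has $P = 210$ pages, so $x = 21$ and $1.8 \times 21^2 - 10.2 \times 21 - 10.5 = 793.8 - 214.2 - 10.5 = 569.1$, exactly the ₹569.10 we saw for that book.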

What did the system learn? We look at the values in learnt. These values need to be converted before we compare them with the ground truth, because the ground truth polynomial is in terms of $x = P/10$ while the learnt polynomial is in terms of the scaled input. This is quite easy if we write the scaled input as $X = mx + l$. We can compute $m$ and $l$ from the scaling that we did and then convert the learnt parameter values. The values of $m$ and $l$ are set in Lines 11 and 12, and the conversion is done by scale_params in Line 15. After conversion, the values are printed in Line 20 with the ground truth values in Line 21. You can see that they are close, showing that the system has learnt the parameters quite well! You can also see what happens on changing the number of epochs.
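For the curious, this is the algebra that scale_params carries out: if the learnt polynomial is $\sum_j b_j X^j$ and $X = mx + l$, then expanding with the binomial theorem gives $$\sum_j b_j (mx + l)^j = \sum_k \Big[ \sum_{j \ge k} \binom{j}{k} m^k\, l^{\,j-k}\, b_j \Big] x^k$$ so the coefficient of $x^k$ in the converted polynomial is the inner sum - exactly what the double loop in Cell 3 accumulates into scaled[k] (with p = m and q = l). The reversals at the start and end are only there because np.polyval stores coefficients with the highest degree first.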

In [6]:  1  # Let us now print the parameters learnt and
         2  # compare them against the ground truth!
         3  actuals = np.array([1.8, -10.2, -10.5])
         4
         5  # As we scaled the x values into the range (-1,+1),
         6  # we need to scale the learnt parameters into
         7  # the correct range. Done using simple linear
         8  # scaling by taking X = mx + l
         9  # From X = -1.0 + 2.0 * (P - 90) / 270 and x = P / 10,
        10  # we get the following m and l values
        11  m = 2.0 / 27.0
        12  l = -45.0 / 27.0
        13
        14  # Scale the learnt parameters using m and l
        15  plearnt = scale_params(learnt, m, l)
        16
        17  # Now for printing the actual and learnt
        18  # parameters
        19  print('No. of Parameters Chosen: ', pars)
        20  print('Learnt Parameters: ', plearnt)
        21  print('Ground Truth: ', actuals)
No. of Parameters Chosen:  3
Learnt Parameters:  [  1.8  -10.25  -9.95]
Ground Truth:  [  1.8 -10.2 -10.5]

Finally, let us print the actual costs read in from the data file books-training.txt and the values predicted on the test data. These are shown below: the first column is the number of pages, the second the ground truth and the third the predicted value. The predicted values are quite close to the ground truth (they differ by a few paise here and there), showing that the system is predicting very well.

We can also use it to predict the cost of a 400-page book (present neither in the training nor in the test data) and see that it turns out to be ₹2461.51.
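This agrees with the generating function as well: for $x = 400/10 = 40$, we get $1.8 \times 1600 - 10.2 \times 40 - 10.5 = 2880 - 408 - 10.5 = 2461.50$, so the prediction is off by just one paisa.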

In [7]:  1  X = -1.0 + 2.0 * (tx - 90.0) / 270.0
         2  py = np.polyval(learnt, X)
         3
         4  # Printing a table of predicted and ground truth values for
         5  # test data
         6  print('{0:15s}   {1:15s}   {2:15s}'.format('No. of Pages',
         7                                             'Ground Truth',
         8                                             'Predicted Cost'))
         9  for i in range(tx.shape[0]) :
        10      print('{0:8.0f}        '.format(tx[i]),
        11            '{0:10.2f}        '.format(ty[i]),
        12            '{0:10.2f}'.format(py[i]))
        13
        14  pgs = 400
        15  spgs = -1.0 + 2.0 * (pgs - 90.0) / 270.0
        16  print('Price predicted for a {0:d} page book: {1:.2f}'.format(
        17      pgs, np.polyval(learnt, spgs)))
No. of Pages      Ground Truth      Predicted Cost 
     250             859.50             859.35
     190             445.50             445.41
     165             311.25             311.20
     180             389.10             389.03
     235             743.85             743.71
     220             636.30             636.17
     275            1070.25            1070.10
     105              80.85              80.96
     310            1403.10            1402.96
     305            1352.85            1352.71
     125             143.25             143.30
     285            1160.85            1160.70
     115             110.25             110.33
     150             241.50             241.49
     255             899.85             899.70
Price predicted for a 400 page book: 2461.51


5. Playing with the Code

You can use the code above to see its inner workings in even more detail. Try the following exercises.

  1. Set the nepochs parameter to 25 when calling the function sgd_learn in Line 10 of Cell 5. Increase this value in steps of 25 until it reaches about 300. What happens to the learning in the system? (A sketch of such modified calls appears after this list.)
  2. Now set nepochs back to its default value of 1000. Try increasing and decreasing the learning rate eta in the same Line 10 of Cell 5. Again, see what happens.
  3. Change the number of parameters to a larger value (say pars = 7) on Line 2 in Cell 4. What happens? Do we need to change eta or nepochs?
  4. Once you gain confidence in running the programs given here, see what happens when you do not scale the training data before calling sgd_learn(). Try with a very small eta = 1e-11 and very large nepochs=80000.
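As a rough guide, the modified calls for exercises 1 and 4 might look like the sketch below (it assumes the same variables x, y, X, Y, init_params and pars defined in Cells 2, 4 and 5):

    # Exercise 1: far fewer epochs than the default
    learnt_few = sgd_learn(X, Y, init_params, nparams=pars, eta=1e-1, nepochs=25)

    # Exercise 4: unscaled data, a tiny learning rate and many more epochs
    learnt_raw = sgd_learn(x, y, init_params, nparams=pars, eta=1e-11, nepochs=80000)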


THE END