In the previous two articles, What is Machine Learning? and How do Machines Learn?, we discussed the basics of Machine Learning and the importance of the Stochastic Gradient Descent (SGD) algorithm. It is now time to implement SGD and show how it works.
As we get into the implementation details, it is also motivating to remember that understanding SGD is a great first step into the AI/ML/Deep Learning fields. A simple and quite natural way of extending SGD to neural networks with multiple layers results in the popular backpropagation algorithm.
Let us define our task, performance measure and experience. The task is to predict the cost of a book. The feature that we want to use is the number of pages in the book. The experience is given by data for 55 different books; for each book, we know the number of pages and the cost.
We use a very common trick to get the performance measure: divide the experience into two parts. The first part is used as experience, i.e. examples from which the system learns; this part is called the training set. The second part is kept aside and used only to check how well the system predicts costs it has never seen; this part is called the test set.
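One natural performance measure, consistent with the squared error our learning code minimises later, is the mean squared error over the $N$ test examples:
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2$$
where $\hat{y}_i$ is the predicted cost and $y_i$ the actual cost of the $i$-th test book; the smaller it is, the better the system has learnt.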
We begin the implementation (in Python) and understand the details along the way. The data is present in a file named books-training.txt.
The plot shows the number of pages on the X-axis and the cost on the Y-axis. The first entry is shown as a black dot at the position $X = 210, Y = 569.10$. There is a clear pattern in the dots, which means that the cost is indeed related to the number of pages.
# The usual imports: numpy, matplotlib and comb from math
import numpy as np
from math import comb
from matplotlib import pyplot as plt
np.set_printoptions(precision=2)
# Read data from file and show 8 samples for understanding
D = np.loadtxt('books-training.txt')
print('Some Examples from the data - No. of pages and price')
print(D[:8,:])
# Plot the first 40 data points to get an idea of the
# relationship between number of pages and cost (if any!)
plt.grid(True)
plt.plot(D[:40, 0], D[:40, 1], 'k.')
plt.show()
# Take the first 40 books for training, i.e. as experience
x = D[:40, 0]
y = D[:40, 1]
# Keep the last 15 examples for testing
tx = D[40:, 0]
ty = D[40:, 1]
After setting up the problem qualitatively, it is time to give
it a more precise definition. The pattern in the plot showed us
that there is indeed a relationship between the cost of the book
and the number of pages. Mathematically, let $x$ denote the
number of pages and $y$, the cost of the book. We express the
relationship between them as
$$y=f(x)$$
where $f$ is a function. $f$ can take many forms, but given the
simple pattern, we can restrict it to be a polynomial of degree $k$.
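For example, a polynomial of degree $k$ has the form
$$y = a_k x^k + a_{k-1} x^{k-1} + \cdots + a_1 x + a_0$$
so choosing $f$ boils down to choosing the $k + 1$ numbers $a_0, a_1, \ldots, a_k$.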
Of course, an immediate question is, "what should $k$ be?" In other words, how many parameters should our model have?
We find the parameters using our 3-step learning strategy. The code is given in Cell 3 below as the Python function sgd_learn.
Let us write the learning function one step at a time. The function has the following steps. First, we determine how many samples we have for training (the size variable). Second, remember that we need to find the adjustment that reduces the error at each step. We think of the error as a landscape where high values are hills and low values are valleys. We can then adjust the parameters to travel down a slope into a valley; a toy version of this idea is sketched below. Therefore, we have a variable, gradient_j, which keeps track of the slopes for each parameter.
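To see the idea in isolation before reading the full function, here is a tiny, self-contained sketch with a single made-up parameter (the function toy_error and all the values here are purely for illustration, not part of our book problem):

# A made-up one-parameter error landscape whose valley is at w = 3
def toy_error(w) :
    return (w - 3.0) ** 2

w, delta, eta = 0.0, 1e-4, 0.1   # start, step size, learning rate
for i in range(100) :
    # Estimate the slope from the change after a small step
    slope = (toy_error(w + delta) - toy_error(w)) / delta
    # Walk down the slope (but slowly!)
    w -= eta * slope
print(w)   # ends up close to 3.0, the bottom of the valley

The full sgd_learn function below does exactly this, but with several parameters at once and with the error measured on a randomly picked training example at each step.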
There is also another useful function, scale_params, which converts parameters learnt on scaled inputs back to the original scale; we will need it later when comparing against the ground truth.
def sgd_learn(train, labels, params, nparams=2, delta=1e-4,
              eta=5e-2, nepochs=1000) :
    # Determine how many samples we have for training
    size = train.shape[0]
    # Initialise everything to 0s
    gradient_j = np.zeros(nparams)   # Slopes
    # Learn by repeating many times
    for i in range(nepochs) :
        # Pick a training example at random!
        j = int(np.random.random() * size)
        # Predict a value for the chosen example
        value_j = np.polyval(params, train[j])
        # Error = squared difference between predicted and actual values
        error = (value_j - labels[j]) ** 2
        # Take a small step along each parameter in turn
        for k in range(nparams) :
            params[k] += delta
            # Compute the new error after taking the small step
            # and return to the starting point
            error_delta = (np.polyval(params, train[j]) -
                           labels[j]) ** 2
            params[k] -= delta
            # Find slope as difference between errors
            # before and after taking a small step
            gradient_j[k] = (error_delta - error) / delta
        # Walk down the slope (but slowly!)
        params = params - eta * gradient_j
    # Return whatever we learnt
    return params
def scale_params(unscaled, p=1.0, q=1.0, rev=True) :
    # Convert the coefficients of a polynomial in X into the
    # coefficients of the same polynomial in x, where X = p * x + q
    sz = unscaled.shape[0]
    scaled = np.zeros(sz)
    if rev :
        rlearnt = unscaled[::-1]   # Reverse them (np.polyval issue!)
    else :
        rlearnt = unscaled
    for k in range(sz) :
        for j in range(k, sz) :
            scaled[k] += comb(j, k) * (p ** k) * (q ** (j - k)) * rlearnt[j]
    if rev :
        scaled = scaled[::-1]   # Right side up again!
    return scaled
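Neither check below appears in the original cells; it is a quick sanity test of the two functions on made-up data (the arrays xs, ys and the point 5.0 are arbitrary choices for this test):

# Sanity check 1: sgd_learn should recover the line y = 2x + 1
xs = np.linspace(-1.0, 1.0, 20)
ys = 2.0 * xs + 1.0
print(sgd_learn(xs, ys, np.random.random(2), nepochs=2000))
# expected: approximately [2.0, 1.0]

# Sanity check 2: scale_params(c, p, q) should satisfy
# np.polyval(scaled, x) == np.polyval(c, p * x + q)
c = np.array([1.8, -10.2, -10.5])
p, q = 2.0 / 27.0, -45.0 / 27.0
s = scale_params(c, p, q)
print(np.polyval(s, 5.0), np.polyval(c, p * 5.0 + q))   # should match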
Let us now put our newly written sgd_learn function to work. The cell below chooses the number of parameters and gives them random initial values. Finally, it plots the test data (no. of pages vs. cost) against the values predicted with these randomly initialised parameters. The yellow dots are the ground truth and the green circles are the predicted values. You can see that they are nowhere close!
# Choose number of parameters
pars = 3
# Give randomly chosen values to the parameters initially
init_params = np.random.random(pars) * 10.0 - 5.0
# Give helpful outputs to the user
print('No. of Parameters Chosen: ', pars)
print('Initial Values of Parameters: ', init_params)
# Let us just see what the system predicts at the start
# without having learnt anything!
plt.grid(True)
plt.plot(tx, np.polyval(init_params, tx/10.), 'go', tx, ty, 'y.')
plt.show()
Now for one of those things which makes practice different from theory! In our data, $x$ is the number of pages and it ranges from $90$ to $360$. These numbers give us grief if we use them directly; again, the general rule of thumb is that inputs should be small values, preferably in the range $(-1, +1)$. We do the same with the costs and reduce them to the range $(0, +1)$ by dividing them by the maximum cost in the data. Otherwise, you will find that you need to play quite a lot with the learning rate, the number of epochs and even the step size (the delta parameter of sgd_learn).
The statement X = -1.0 + 2.0 * (x - 90.0) / 270.0 in the cell below scales all $x$ values into the range $(-1, +1)$ before giving them as input to sgd_learn.
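More generally, any feature lying in a known range can be mapped into $(-1, +1)$ with the same min-max recipe; here is a small helper sketch (the function name scale_to_unit is ours, purely for illustration):

def scale_to_unit(v, lo=90.0, hi=360.0) :
    # Map values from [lo, hi] linearly into [-1, +1];
    # the defaults are the page counts in our data
    return -1.0 + 2.0 * (v - lo) / (hi - lo)

With the defaults, scale_to_unit(x) reproduces the scaling expression used in the cell below.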
As we did earlier, we plot the predicted values on the test data, now after learning. Again, the yellow dots are the ground truth and the green circles are the predicted values. You can see that they now match almost perfectly, showing that the system has learnt to predict book costs from the number of pages.
Note that we scaled the test data too into the range $(-1, +1)$ (the statement TX = -1.0 + 2.0 * (tx - 90.0) / 270.0) before giving it to np.polyval with the learnt parameters.
# This is an implementation detail
# We need to scale or normalise inputs to the system
# so that it can handle any value. What we do is
# scale the page counts into the range -1 to +1
# and the costs into the range 0 to +1
X = -1.0 + 2.0 * (x - 90.0) / 270.0
Y = y / y.max()   # Scale into (0, 1)
# Give the number of parameters and their initial values,
# and the training data (X, Y) to the learning algorithm
learnt = sgd_learn(X, Y, init_params, nparams=pars, eta=1e-1)
learnt *= y.max() # Scale back to the original
# Let us just see what the system predicts after learning
TX = -1.0 + 2.0 * (tx - 90.0) / 270.0
plt.grid(True)
plt.plot(tx, np.polyval(learnt, TX), 'go', tx, ty, 'y.')
plt.show()
Yay! Let us see how good the learning is in numbers. The original book data was generated with the following function: $$y = 1.8x^2 - 10.2x - 10.5$$ where $x = P/10$ and $P$ is the number of pages.
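As a quick check, the first book we plotted had $210$ pages and cost $569.10$: with $x = 21$, the formula gives $1.8(441) - 10.2(21) - 10.5 = 793.8 - 214.2 - 10.5 = 569.10$, exactly matching the data.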
What did the system learn? We look at the values in learnt, converted back to the scale of the ground-truth equation using scale_params, and compare them with the actual coefficients.
# Let us now print the parameters learnt and
# compare them against the ground truth!
actuals = np.array([1.8, -10.2, -10.5])
# As we scaled the page counts into the range (-1, +1),
# we need to convert the learnt parameters back to the
# scale of the ground truth, where x = P / 10. Done using
# simple linear scaling by taking X = m * x + l
# From our scaling X = -1.0 + 2.0 * (P - 90) / 270
# and P = 10 * x, we get the following m and l values
m = 2.0 / 27.0
l = -45.0 / 27.0
# Convert the learnt parameters using m and l
plearnt = scale_params(learnt, m, l)
# Now for printing the actual and learnt
# parameters
print('No. of Parameters Chosen: ', pars)
print('Learnt Parameters: ', plearnt)
print('Ground Truth: ', actuals)
Finally, let us print the actual costs that we read in from the data file alongside the costs predicted by the learnt parameters. We can also use the learnt model to predict the cost of a 400-page book (present in neither the training nor the test data) and see that it turns out to be ₹2461.50.
X = -1.0 + 2.0 * (tx - 90.0) / 270.0
py = np.polyval(learnt, X)
# Printing a table of predicted and ground truth values for
# test data
print('{0:15s} {1:15s} {2:15s}'.format('No. of Pages',
                                       'Ground Truth',
                                       'Predicted Cost'))
for i in range(tx.shape[0]) :
    print('{0:8.0f} '.format(tx[i]),
          '{0:10.2f} '.format(ty[i]),
          '{0:10.2f}'.format(py[i]))
pgs = 400
spgs = -1.0 + 2.0 * (pgs - 90.0) / 270.0
print('Price predicted for a {0:d} page book: {1:.2f}'.format(
pgs, np.polyval(learnt, spgs)))
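As a cross-check against the ground truth: a $400$-page book means $x = 40$, and $1.8(1600) - 10.2(40) - 10.5 = 2880 - 408 - 10.5 = 2461.50$, which agrees with the predicted price.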
You can experiment with the code above to understand its inner workings in even more detail. Try the following exercises.