Lab 06: Linear Regression - Least Squares Approximation


In this lab, we look at two methods to fit a curve through experimental data. Experimental data is expected to have noise, i.e. variability around the true value. For example, let us say that we experimentally vary the voltage in a simple resistor circuit and find the current for each value of voltage. There is a simple linear relationship between them given by Ohm's law ($V = IR$). However, the experimental values need not conform strictly to the law but show some variability which arises due to a number of reasons including human limitations and instrument quality.

The idea is to fit an appropriate $V = IR$ line with the measured $V$ and $I$ values. How do we do this? The primary idea is to define an error as the squared difference between the experimental value and the expected value from Ohm's law. The total error in our experiment is then the sum of individual errors in each measurement. We find the appropriate value of $R$ that minimises the total error.


Fitting a straight line

Fitting a straight line $y = px + q$ through the data points is particularly simple. Let the experimental data be in the form of $(x_i, y_i)$ pairs where $x_i$ is the $i^{th} x-$value and $y_i$ is the $y$ value corresponding to $x_i$.

  1. Create an $A$ matrix $$ A = \left(\begin{array}{cc} \sum_{i=1}^Nx_i^2 & \sum_{i=1}^Nx_i \\ \sum_{i=1}^Nx_i & N \end{array}\right) $$
  2. Create a $B$ matrix $$ B = \left(\begin{array}{c} \sum_{i=1}^Nx_iy_i \\ \sum_{i=1}^Ny_i \end{array}\right) $$
  3. Use Gauss Elimination method from the previous labs to solve $AX = B$ where $X$ is a $2\times1$ vector of the two unknowns $p$ and $q$.

The code skeletons

In [ ]:
import numpy as np
from matplotlib import pyplot as plt         # For plotting data
np.set_printoptions(precision=2, suppress=True)
In [ ]:
######### COPY GAUSS ELIMINATE AND BACK SUBSTITUTE METHODS
######### FROM EARLIER LABS INTO THIS CELL

We assume that $x_i$ and $y_i$ values are given as two arrays $X$ and $Y$. Test data samples are given to you in the cell below.

In [ ]:
X = np.arange(10)
Y = np.array([-1.12, 0.7, 3.53, 5.47, 7.08, 9.65,\
              12.95, 16.65, 18.13, 20.82])

Generally, it is a good idea to plot the data and see if there are any patterns. When we plot the data using the code given in the cell below, you can see that the points appear to be along a straight line and therefore we can try to fit a line through them.

In [ ]:
########### YOU RUN THIS CODE!
plt.grid(True)             # Gives a background grid
plt.plot(X, Y, 'bo')       # Plot data using blue dots for points
plt.show()

Let us now create the $A$ and $B$ matrices.

In [ ]:
def create_A(X) :
    A = np.zeros((2,2))
    
    # Fill your code here as per the steps given above.
    
    return A

def create_B(X, Y) :
    B = np.zeros((2,1))
    
    # Fill your code here as per the steps given above.
    
    return B

Call gauss_eliminate() and back_substitute() to solve the system of equations $AX = B$. Remember that you need to concatenate $A$ and $B$ matrices before passing them to gauss_eliminate(). You should have the answers $p$ and $q$ in the cell below.

In [ ]:
############ SOLVE THE EQUATION AX = B TO GET
############ p and q

We can verify if our answer is correct by plotting the data and the line we now have as Y = p * X + q. If the answer is correct, the line that we plotted must be very close to all the data points and, equally importantly, it should look correct: never underestimate the power of the human eye!

In [ ]:
## This cell is only for plotting the data so that you can
## visually see whether the answer you have is correct.
plt.grid(True)                  # Draws a grid on the plot
plt.plot(X, Y, 'bo')            # Plot the original data as points
plt.plot(X, X * p + q, 'g--')   # Draw a green dotted line as 
plt.show()                      #   your answer


Problems

Fit a line through the data given below.
x 01234 56789
y 7.17.7311.14 10.1511.7412.59 15.1914.7618.55 17.44


Fitting to a linear combination of functions

The above method can be generalised to fit any curve of the form $$ y = c_0 + c_1f_1(x) + c_2f_2(x) + \ldots + c_mf_m(x) $$ Examples of such functions are $y = 5 + x + x^2$, $y = 2\sin(x) - 0.25\sin(3x)$ etc.

The derivations in this case are harder and beyond the scope of the current class. The solution is obtained as follows.

Assume that there are $N$ data points $(x_i, y_i), i = 1, 2, \ldots, N$.

  1. Create a $N\times1$ vector $Y$ $$ Y = \left(\begin{array}{c} y_1 \\ y_2 \\ \vdots\\ y_N \end{array}\right) $$
  2. Create a $(m+1)\times1$ vector $C$ $$ C = \left(\begin{array}{c} c_0 \\ c_1 \\ \vdots \\ c_m \end{array}\right) $$
  3. Create a $N\times(m+1)$ matrix $A$ $$ A = \left(\begin{array}{ccccc} 1 & f_1(x_1) & f_2(x_1) & \ldots & f_m(x_1) \\ 1 & f_1(x_2) & f_2(x_2) & \ldots & f_m(x_2) \\ 1 & f_1(x_3) & f_2(x_3) & \ldots & f_m(x_3) \\ \vdots&\vdots& \vdots & \ddots & \vdots \\ 1 & f_1(x_N) & f_2(x_N) & \ldots & f_m(x_N) \end{array}\right) $$ Note that this definition of $A$ is the transpose of the one taught in class.
  4. Solve the following equation for $C$ using gauss_eliminate() and back_substitute() functions to get the answer. $$ A^TAC = A^TY $$

Code skeletons are given in the cells below. Fill the rest of the code.

In [ ]:
def create_comb_A(X, m) :
    A = np.zeros((X.shape[0], m))            # m is no. of functions
    ##
    ## Fill your code here
    ##
    
    return A

Write the function that actually solves the equation given above and returns the $C$ matrix.

In [ ]:
def solve_comb(X, Y, m) :                   # m is no. of functions
    A = create_comb_A(X, m)
    ##
    ## Fill code here to create a matrix G = Transpose(A) time A
    ##
    
    ## Fill code here to create a matrix H = Transpose(A) times Y
    
    ## Use gauss_eliminate and back_substitute to solve the eqns
    RM = gauss_eliminate()                  # Fill appropriate
    C = back_substitute()                   # matrices as arguments
    
    return C

Worked out example

Let us now try and solve a problem with the code that we wrote. Fit the curve given by $y = c_0 + c_1x + c_2x^2 + c_3x^3$ through the data given below. $$ \begin{array}{||c||c|c|c|c|c|c|c|c|c|c||} \hline \textbf{x} & -3 & -2 & -1 & 0 & 1 & 2 & 3 & 4 & 5 & 6 \\\hline \textbf{y} & -71.67 & -20.93 & 4.95 & 11.8 & 10.87 & 4.65 & 0.69 & 9.29 & 33.45 & 79.75 \\\hline \end{array} $$ After reading the data into two arrays $X$ and $Y$, plot the data.

For this example, you can run the code in each of the cells below and see what happens.

In [ ]:
X = np.arange(-3,7)
Y = np.array([-71.67, -20.93, 4.95, 11.8, 10.87,\
             4.65, 0.69, 9.29, 33.45, 79.75])                            # Fill the data
plt.grid(True)
plt.plot(X, Y, 'b.')
plt.show()

Clearly we cannot fit a straight line through the data.

The pattern is quite different. The curve we are trying to fit is $y = c_0 + c_2x + c_2x^2 + c_3x^3$. Thus, $f_O(x) = 1, f_1(x) = x, f_2(x) = x^2$ and $f_3(x) = x^3$. So, define the functions in the cell below.

In [ ]:
def f0(x) :
    return 1

def f1(x) :
    return x

def f2(x) :
    return x**2

def f3(x) :
    return x**3

We already have the $Y$ matrix because it contains the $Y$ values. The only thing to do is to create the $A$ matrix. See how we do it with the code skeleton in the cell below. The way we are defining $Y_0, Y_1, Y_2$ and $Y_3$, they automatically are $N\times1$ vectors where $N$ is the number of data points. In this case, $N = 10$. You can use this as a template for solving the problem given as an exercise.

In [ ]:
Y0 = np.array([[f0(x) for x in X]])
Y1 = np.array([[f1(x) for x in X]])
Y2 = np.array([[f2(x) for x in X]])
Y3 = np.array([[f3(x) for x in X]])
A = np.concatenate((Y0.T, Y1.T, Y2.T, Y3.T), axis=1)
print(A)

Once we have the $A$ matrix, we need two other matrices before doing gauss_eliminate() and back_substitute(). They are $G = A^TA$ and $H = A^TY$.

In [ ]:
G = A.T.dot(A)
H = A.T.dot(Y).reshape(A.shape[1], 1)      # To get a proper vector
print(G)
print(H)
In [ ]:
RA = gauss_eliminate(np.concatenate((G, H), axis=1))
ans = back_substitute(RA)
print(ans.T)

The curve that fits the data with the least error is thus $$ y = 12.71 + 1.91x -5.28x^2 + 1.14x^3 $$ How good is this? Let us plot the original data points, i.e. $X$ and $Y$ values along side the curve. Look at the plot and make your own judgement!

In [ ]:
plt.grid(True)
plt.plot(X, Y, 'bo')
plt.plot(X,12.71 + 1.91*X - 5.28*X**2 + 1.14*X**3, 'g-')
plt.show()


Problem

Do the following similar problem but with different functions $f_1(x) \ldots f_3(x)$.

A rubber band is plucked and left to vibrate. A snap-shot of the vibrating rubber band is taken with a high-speed camera and its positions are measured. The data is given as $(x_i, y_i)$ pairs in the arrays below. As it is a vibrating motion, we try and fit sine waves through the data. Such a curve is given mathematically by the equation $$ y = c_0 + c_1\sin(x) + c_2\sin(2x) + c_3\sin(3x) $$ Do exactly the same things we did in the worked out example and show the obtained curve.

Code cells are given below to help you.

In [ ]:
# The data points are given below in X and Y vectors
X = np.array([0.,    0.2,   0.41,  0.61,  0.81,  1.02,\
              1.22,  1.42,  1.63,  1.83,  2.03,  2.24,\
              2.44,  2.64,  2.85,  3.05,  3.25,  3.46,\
              3.66,  3.86,  4.07,  4.27,  4.47,  4.68,\
              4.88,  5.08,  5.29,  5.49,  5.69,  5.9, \
              6.1,   6.31,  6.51,  6.71,  6.92,  7.12,\
              7.32,  7.53,  7.73,  7.93,  8.14,  8.34,\
              8.54,  8.75,  8.95,  9.15,  9.36,  9.56,\
              9.76,  9.97, 10.17, 10.37, 10.58, 10.78,\
              10.98, 11.19, 11.39, 11.59, 11.8,  12.])

Y = np.array([11.89, 14.94, 16.17, 15.11, 17.19, 16.28,\
              13.9,  15.16, 14.96, 15.25, 17.79, 19.24,\
              17.77, 18.61, 17.06, 13.55, 11.15,  8.88,\
              6.28,  7.12,  7.77,  7.1,  10.34, 11.08,\
              9.43, 10.74, 10.22,  8.22,  9.43, 10.08,\
              10.39, 12.99, 14.99, 15.25, 16.72, 16.89,\
              14.57, 14.72, 15.,   14.19, 16.46, 17.71,\
              17.47, 19.5,  18.99, 15.62, 14.44, 11.4,\
              7.28,  7.63,  7.06,  6.81,  8.4,   9.83,\
              9.44, 11.59, 10.33,  8.11,  9.47,  8.94])
In [ ]:
# Use this cell to plot the data
In [ ]:
# Define the functions f0, f1, f2 and f3 in this cell
from math import sin         # Needed for the functions
In [ ]:
# Use this cell to create the A matrix
# First create four vectors Y0, Y1, Y2 and Y3
# Concatenate them to form Y
In [ ]:
# Define G and H matrices in this cell
In [ ]:
# Use the G and H matrices to do gaussian elimination and
# get the coefficients (the answer)
In [ ]:
# Plot the original data points and the curve from your answer
# in this cell to see if your answer appears correct


THE END