Lab 6 (6/09/2017)


In this lab, we will implement the code for

  • Simple Linear Regression
  • General Linear Regression

Preliminaries

Use the code below to read data from a file. The data files prob-1.dat and prob-2.dat may be downloaded. Each file contains X coordinates in the first column and Y coordinates in the second column; the columns are separated by a ','.

In [5]:
import numpy as np
from matplotlib import pyplot as plt     # Optional! Only for plotting data

np.set_printoptions(precision=2, suppress=True)

# Let us read our data from the file "prob-1.dat"
# The file has two columns separated by commas:
# first contains x-coordinates
# second contains y-coordinates
ip = np.genfromtxt('prob-1.dat', delimiter=',')
X = ip[:,0]
Y = ip[:,1]
print('X Coordinate Data: ')
print(X)
print('Y Coordinate Data: ')
print(Y)
X Coordinate Data: 
[-5.  -4.9 -4.8 -4.7 -4.6 -4.5 -4.4 -4.3 -4.2 -4.1 -4.  -3.9 -3.8 -3.7 -3.6
 -3.5 -3.4 -3.3 -3.2 -3.1 -3.  -2.9 -2.8 -2.7 -2.6 -2.5 -2.4 -2.3 -2.2 -2.1
 -2.  -1.9 -1.8 -1.7 -1.6 -1.5 -1.4 -1.3 -1.2 -1.1 -1.  -0.9 -0.8 -0.7 -0.6
 -0.5 -0.4 -0.3 -0.2 -0.1 -0.   0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
  1.   1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.   2.1  2.2  2.3  2.4
  2.5  2.6  2.7  2.8  2.9  3.   3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
  4.   4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9]
Y Coordinate Data: 
[-47.25 -44.46 -34.72 -36.48 -35.62 -35.7  -35.15 -36.67 -29.4  -36.25
 -36.26 -20.79 -35.97 -28.98 -29.01 -29.83 -31.92 -32.09 -18.5  -28.84
 -30.52 -19.08 -33.94 -21.77 -16.9  -25.59 -24.63 -17.52 -16.21 -20.05
 -20.47 -17.19 -15.41 -14.56 -17.28 -14.69  -4.41 -10.9   -6.39 -12.88
  -9.08 -11.51  -7.84  -0.33  -4.38  -1.65  -4.6   -3.92  -5.79  -3.9
   2.15   7.     0.84   4.91   2.62  -2.51  -5.4   14.17  13.17   1.61
   6.56   6.99   8.57   4.17  11.96  11.41   4.93  16.25  13.57  13.58
   7.94   6.18  19.5   16.39  22.58  16.04  11.62  19.74  20.86  14.8
  22.31  24.13  21.73  29.37  17.39  33.31  19.73  26.45  30.64  34.26
  35.79  33.66  35.29  34.84  28.6   33.87  23.97  29.16  37.63  39.16]

Problem 1: Simple Linear Regression

Simple Linear Regression is used to fit a line ($y = ax + b$) through a set of data points $(x_i, y_i), i = 1, \ldots, N$. The unknown line parameters $a$ and $b$ are found by solving the linear system $$ \left(\begin{array}{cc} \sum_{i=1}^N x_i^2 & \sum_{i=1}^N x_i \\ \sum_{i=1}^N x_i & N \end{array}\right) ~ \left(\begin{array}{c} a \\ b \end{array}\right) = \left(\begin{array}{c} \sum_{i=1}^N x_iy_i \\ \sum_{i=1}^N y_i \end{array}\right) $$

In [6]:
# This will plot your data - if matplotlib is not available
# on your machine, you may ignore this piece of code
plt.plot(X, Y, 'g.')
plt.show()

Fit a straight line ($y = ax + b$) through the data using the method of Simple Linear Regression and compare your answers with the true values $a = 7.75$ and $b = -1.5$. If matplotlib is available, use the following code to plot your best-fit line.
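If you want to check your work, here is a minimal sketch of one way to assemble and solve the 2×2 system above with NumPy (the names a_fit and b_fit are illustrative choices, not part of the lab template):

In [ ]:
# Assemble the 2x2 normal-equation system from the sums in the formula above.
N = len(X)
M = np.array([[np.sum(X**2), np.sum(X)],
              [np.sum(X),    N]])
rhs = np.array([np.sum(X*Y), np.sum(Y)])

# Solve for the slope a and intercept b.
a_fit, b_fit = np.linalg.solve(M, rhs)
print('a =', a_fit, ', b =', b_fit)    # should come out close to a = 7.75, b = -1.5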

In [8]:
C1 = np.array([7.75, -1.5])     # Use your actual solution instead!
L = np.polyval(C1, X)           # np.polyval takes coefficients in decreasing degree: [a, b] -> a*x + b
plt.plot(X, Y, 'g.')
plt.plot(X, L, 'r-')
plt.show()

Problem 2: General Linear Regression

General linear regression allows us to fit a linear combination of any set of basis functions to given data. One important family is the polynomials $a_0 + a_1x + a_2x^2 + \ldots + a_kx^k$. The fit is computed by forming the matrices $$ A = \left(\begin{array}{ccccc} 1 & x_1 & x_1^2 & \cdots & x_1^k \\ 1 & x_2 & x_2^2 & \cdots & x_2^k \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_N & x_N^2 & \cdots & x_N^k \end{array}\right), \qquad Y = \left(\begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array}\right) $$

We then solve the linear system of equations given by $$ A^TAc = A^TY $$ for $c$, the set of coefficients $a_0, \ldots, a_k$.

Use the data set given in the file prob-2.dat and fit a degree 4 polynomial using general linear regression. That is, the polynomial is of the form $$ y = a_0 + a_1x + a_2x^2 + a_3x^3 + a_4x^4 $$

In [ ]:
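If you get stuck, here is one possible sketch for the cell above, assuming prob-2.dat has the same comma-separated two-column layout as prob-1.dat (the names X2, Y2, A, and c are illustrative; the plot at the end is optional):

In [ ]:
# Read the data (assumed to be in the same format as prob-1.dat).
ip2 = np.genfromtxt('prob-2.dat', delimiter=',')
X2 = ip2[:, 0]
Y2 = ip2[:, 1]

# Build A with columns 1, x, x^2, x^3, x^4 (a Vandermonde matrix).
k = 4
A = np.vander(X2, k + 1, increasing=True)

# Solve the normal equations A^T A c = A^T Y for the coefficients a_0, ..., a_4.
c = np.linalg.solve(A.T.dot(A), A.T.dot(Y2))
print('Coefficients a_0 ... a_4:', c)

# Optional: plot the data and the fitted polynomial.
plt.plot(X2, Y2, 'g.')
plt.plot(X2, A.dot(c), 'r-')
plt.show()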