Lab-IV (09/09/2016):

DATA ANALYSIS


In this lab, we will see how to analyse experimental data using Python and its supporting packages. We shall use NumPy, Matplotlib (pyplot) and SciPy.

Featured Functions/Methods

Preliminaries

Initialisation

In [11]:
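import numpy as np                     # The rest of the lab refers to numpy as np
from matplotlib import pyplot as plt
from numpy import polyfit              # Older SciPy versions re-exported this as scipy.polyfit
from scipy.stats import pearsonr
np.set_printoptions(precision=2)       # Show numbers to 2 decimal places only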
Reading data from a file into a numpy array

X = np.genfromtxt('data filename', delimiter=',')
This reads comma-separated data from the given file into the NumPy array X. A file with several columns gives a two-dimensional array with one row per line of the file.
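
As a small illustration of the call above, the sketch below writes a tiny CSV file and reads it back. The file contents and the temporary file are made up for this example only; they are not part of the lab data.

```python
import os
import tempfile
import numpy as np

# Hypothetical file contents, used only for this illustration
csv_text = "1,2.5\n2,3.5\n3,4.5\n"

# Write the text to a temporary .csv file so that genfromtxt has something to read
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as fh:
    fh.write(csv_text)
    path = fh.name

X = np.genfromtxt(path, delimiter=",")   # one row per line, one column per field
os.remove(path)                          # clean up the temporary file

print(X.shape)     # number of rows and columns
print(X[:, 0])     # first column
print(X[:, 1])     # second column
```

Columns are selected with X[:, 0], X[:, 1] and so on, which is how the two-column files later in this lab are handled.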

Sample Problems with Solutions

1. This problem illustrates the use of a histogram
A dice is tossed 2000 times and the values are recorded in a file named dice-data.csv. Verify if the dice is unbiased.

Soln:
An unbiased dice should give an almost equal number of throws for each value '1' to '6'. As the dice is thrown 2000 times, each value should occur approximately 333 times (2000/6). By computing a histogram of the data, we get the number of times each value is thrown, and it should be easy to see whether all the numbers are roughly equal. A histogram shows the frequency of values, that is, how many times each value occurs in a given dataset. Usually, the values are put into bins, that is, groups of values. This will become clear as you work your way through this example.

In [4]:
X = np.genfromtxt('dice-data.csv', delimiter=',')
X = np.array(X, dtype='int')          # Dice values are whole numbers
print("X: ", X)
(f, b) = np.histogram(X, bins=6)      # f: count in each bin, b: bin edges
print("Bins: ", b)
print("Frequencies: ", f)
X:  [6 2 5 ..., 1 6 5]
Bins:  [ 1.          1.83333333  2.66666667  3.5         4.33333333  5.16666667
  6.        ]
Frequencies:  [355 326 310 335 330 344]

The histogram function generates a histogram of the data. The argument bins=6 says that the data should be grouped into 6 bins (because a dice has numbers '1' to '6'). The variable b shows how the data is grouped: it holds the 7 edges of the 6 bins. The first bin has values between 1 and 1.83. Look now at f. The first value is 355. It means that there are 355 items in the data with values between 1 and 1.83. As we know that the data is from throwing a dice, it means that '1' occurred 355 times.
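
The idea of bins may be easier to see when the bin edges are written out by hand. The sketch below is only an illustration: the throws are made-up values, and the choice of edges at 0.5, 1.5, ..., 6.5 (so that each whole number sits in the middle of its own bin) is an alternative to bins=6, not what the solution above uses.

```python
import numpy as np

# Hypothetical throws, used only to illustrate binning
throws = np.array([1, 3, 3, 6, 2, 6, 6, 4, 5, 1])

# Seven edges give six bins: 0.5-1.5, 1.5-2.5, ..., 5.5-6.5
edges = np.arange(0.5, 7.0, 1.0)
counts, used_edges = np.histogram(throws, bins=edges)

# Each bin now corresponds to exactly one face of the dice
for face, count in zip(range(1, 7), counts):
    print("Face %d occurred %d times" % (face, count))
```

Passing an array of edges instead of a number of bins makes the grouping explicit, which can help when the values are whole numbers.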

Now, let us look at a way in which both the histogram computation and plotting the resulting frequencies can be done together using the hist method in pyplot.

In [5]:
plt.hist(X, bins=6)
Out[5]:
(array([ 355.,  326.,  310.,  335.,  330.,  344.]),
 array([ 1.        ,  1.83333333,  2.66666667,  3.5       ,  4.33333333,
        5.16666667,  6.        ]),
 <a list of 6 Patch objects>)

As seen above, the histogram shows that '1' occurred 355 times, '2' 326 times, '3' 310 times, '4' 335 times, '5' 330 times and '6' 344 times. As every one of these counts is within about 25 of the expected 333, the data gives us no reason to think that the dice is biased.
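
The comparison with the expected count can also be done in code. This is a minimal sketch that reuses the frequencies printed above; the variable names expected and deviation are introduced here for illustration.

```python
import numpy as np

# Frequencies reported by the histogram above
f = np.array([355, 326, 310, 335, 330, 344])

expected = f.sum() / 6.0          # 2000 throws spread evenly over 6 faces
deviation = f - expected          # how far each count is from the expected count
largest = np.abs(deviation).max()

print("Expected count per face: %.1f" % expected)
print("Deviations: ", np.round(deviation, 1))
print("Largest deviation: %.1f" % largest)
print("Largest relative deviation: %.1f%%" % (100 * largest / expected))
```

This only quantifies "roughly equal"; it is not a formal statistical test of bias.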

2. This problem shows the use of numpy for solving a system of linear equations
A mathematician is fond of describing interesting car number plates that he sees. He gives the following information about a car number plate. Find the number.

The sum of the first two digits is the number formed by the last two digits.
The third digit is obtained by subtracting the second from the first digit.
The fourth digit is obtained by subtracting the third from the second digit.
The sum of all the digits is 25.

Soln:
We can convert the above into equations. Let the four digits be w, x, y and z
w + x = 10y + z, i.e., w + x - 10y - z = 0
w - x = y, i.e., w - x - y = 0
x - y = z, i.e., x - y - z = 0
w + x + y + z = 25

These are written in matrix form as AX = B, where A is the matrix containing the coefficients of w, x, y and z, X is the vector containing the unknowns, and B is the vector containing the right-hand-side values of the equations. So, we define A and B appropriately and solve for X.

In [6]:
A = np.array([[1, 1, -10, -1],
              [1, -1, -1, 0],
              [0, 1, -1, -1],
              [1, 1, 1, 1]])
B = np.array([[0],
              [0],
              [0],
              [25]])
print("A: ", A)
print("B: ", B)
print("Car Number: ", np.linalg.solve(A, B).transpose())
A:  [[  1   1 -10  -1]
 [  1  -1  -1   0]
 [  0   1  -1  -1]
 [  1   1   1   1]]
B:  [[ 0]
 [ 0]
 [ 0]
 [25]]
Car Number:  [[ 9.  8.  1.  7.]]

The car number is therefore 9817.

There is also another way to solve it. As AX = B, we have X = inv(A) B, that is, the inverse of A multiplied by B.

In [7]:
print("Car Number is ", np.linalg.inv(A).dot(B))
Car Number is  [[ 9.]
 [ 8.]
 [ 1.]
 [ 7.]]

Once again, we get the correct answer. The first method is considered superior - can you find out why? You can also verify that A.dot(inv(A)) gives the Identity matrix.
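
The check suggested above can be sketched as follows. Because of floating-point round-off, the product is only approximately the identity matrix, so the sketch compares with a tolerance using np.allclose instead of testing for exact equality. The names A_inv, is_identity and solves_system are introduced here for illustration.

```python
import numpy as np

A = np.array([[1, 1, -10, -1],
              [1, -1, -1, 0],
              [0, 1, -1, -1],
              [1, 1, 1, 1]])
B = np.array([[0],
              [0],
              [0],
              [25]])

A_inv = np.linalg.inv(A)

# A times its inverse should be the 4x4 identity matrix, up to round-off
is_identity = np.allclose(A.dot(A_inv), np.eye(4))
print("A.dot(inv(A)) is the identity: ", is_identity)

# Substituting the solution back into AX = B should reproduce B
X = np.linalg.solve(A, B)
solves_system = np.allclose(A.dot(X), B)
print("Solution satisfies AX = B: ", solves_system)
```

Substituting the answer back into the equations is a useful habit for any system solved numerically.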

3. This problem is about what happens when we have more equations than unknowns
A student writes a computer program that finds the maximum of a given set of N numbers. She runs the program for different values of N and notes down the time taken to find the maximum each time. Given her data in the file max-data.csv, how much time does it take to find the maximum of 200,000 numbers? The data in the file is in two columns: the first column is N and the second column is the time taken in milliseconds (ms).

Soln: Having more equations than unknowns is the most commonly encountered case in data analysis and helps handle noise in data. That is, the values in the equations are not absolutely correct but approximate. The answer is then finally one that minimises the overall error across all equations. The method we use is regression.

Data analysis using regression has three steps:

  1. Compute the correlation to determine if a relationship exists at all. The correlation coefficient, which lies between -1 and +1, should have a magnitude of at least 0.8 (that is, at most -0.8 or at least +0.8). It is also a good idea to plot the data.
  2. If the correlation is high, then determine if the relationship is linear.
  3. Determine the relationship by using the polyfit function (imported in the Initialisation cell). For a linear relationship, the equation fitted to the data is y = ax + b, and the function returns the values of a and b.

In [12]:
maxdata = np.genfromtxt('max-data.csv', delimiter=',')
print("Times for finding Maximum: ")
print(maxdata)
print("Correlation Coefficient: ", pearsonr(maxdata[:,0],
                                            maxdata[:,1])[0])
plt.plot(maxdata[:,0], maxdata[:,1], 'bo')
plt.show()
Times for finding Maximum: 
[[  1.00e+04   4.50e+02]
 [  2.00e+04   4.44e+02]
 [  3.00e+04   4.43e+02]
 [  5.00e+04   5.43e+02]
 [  7.50e+04   5.06e+02]
 [  1.00e+05   5.97e+02]
 [  2.50e+05   7.96e+02]
 [  5.00e+05   1.02e+03]]
Correlation Coefficient:  0.985400928358

In [15]:
(sl, icept) = polyfit(maxdata[:,0], maxdata[:,1], 1)
print("Slope: ", sl, " Intercept: ", icept)
print("Equation of the line: y = ", sl, "x + ", icept)
maxfn = np.polyval([sl, icept], maxdata[:,0])
plt.plot(maxdata[:,0], maxdata[:,1], 'bo')
plt.plot(maxdata[:,0], maxfn, 'r--')
plt.show()
Slope:  0.00120869966403  Intercept:  443.513230966
Equation of the line: y =  0.00120869966403 x +  443.513230966

In [17]:
print("Time for 200000 Items: ", sl*200000 + icept)
Time for 200000 Items:  685.253163772
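
The "more equations than unknowns" view can be made explicit. Each data point gives one equation a*N + b = time, so the 8 points give 8 equations in the 2 unknowns a and b. The sketch below solves them in the least-squares sense with np.linalg.lstsq. It is an alternative illustration, not the method used above, and it types in the values as printed above (rounded to the displayed precision), so the results should agree with the polyfit output only approximately.

```python
import numpy as np

# Values as printed above, rounded to the displayed precision
N = np.array([1e4, 2e4, 3e4, 5e4, 7.5e4, 1e5, 2.5e5, 5e5])
t = np.array([450.0, 444.0, 443.0, 543.0, 506.0, 597.0, 796.0, 1020.0])

# Each row is one equation a*N_i + b*1 = t_i: 8 equations, 2 unknowns
A = np.column_stack([N, np.ones_like(N)])

# lstsq finds the a and b that minimise the sum of squared errors
(a, b), residuals, rank, sv = np.linalg.lstsq(A, t, rcond=None)

print("Slope: %.6f  Intercept: %.1f" % (a, b))
print("Time for 200000 items: %.0f ms" % (a * 200000 + b))
```

No single pair (a, b) satisfies all 8 equations exactly; the least-squares answer is the one with the smallest overall error, which is the same criterion that a straight-line polyfit uses.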


Lab Problems

1. A cricket fan posts a very strange observation on social media. He claims that a certain top batsman's performance is related to how much it rains in Melbourne, Australia. He says that the number of 50s scored by the batsman in a year (in ODIs) is determined by the amount of rain (in cm/year) received in Melbourne. To support his claim, he posted data (which is in the file Lab04-01.csv) in two columns: the first is the amount of rain in cm and the second is the number of 50s. Can you analyse the data and determine whether the claim is true?

2. A space explorer wants to determine the acceleration due to gravity (g) on an alien planet by firing a bullet and measuring its path. The equation to use is
             S = ut + 0.5gt**2
where S is the distance, u is the initial velocity and t is the time. He records the data in the file Lab04-2.csv. Read the data, fit the equation above and find g. The first column is the time in seconds and the second column is the distance in metres.

3. Find the 5-digit number given by the following information:

4. Two data files containing roughly similar text on the human digestive system are given to you. The first is in English and the second is in Spanish. Plot the word length histograms (plot the number of words of a particular length against length) and discuss any differences between the two languages.