In this lab, we will see how to analyse data obtained from experiments using Python and its supporting packages. The functions and methods we shall use are described below.

Featured Functions/Methods

```
d = np.genfromtxt(filename, delimiter=',')
```
This function reads data from a file directly into a numpy array. The first argument is the name of the file; the second, the delimiter used in the file. For example, if the file is in CSV format, with commas separating the data values, then the delimiter is ','. It returns a Numpy array.
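As a minimal sketch of the call, the snippet below reads comma-separated values from an in-memory string instead of a file on disk. The CSV content and the name csv_text are made up for illustration; with a real file you would pass its name as the first argument.

```python
import io
import numpy as np

# Hypothetical CSV content standing in for a real file on disk.
csv_text = "1.0,2.0,3.0\n4.0,5.0,6.0\n"

# genfromtxt accepts a file name or a file-like object such as StringIO.
d = np.genfromtxt(io.StringIO(csv_text), delimiter=',')

print(d.shape)   # two rows, three columns
print(d[1, 2])   # last value of the second row
```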
```
(f, b) = np.histogram(data, bins=n_bins)
```
This function computes a histogram, or frequency distribution, of the data. It returns the frequencies f and the bin limits b as numpy arrays. The array b has one more element than the number of bins because it holds the lower limit of every bin followed by the upper limit of the last bin. That is, if the number of bins = 4, b will contain 5 values, bl_1, bl_2, bl_3, bl_4 and bh_4 where 'l' indicates lower limit and 'h' indicates higher or upper limit.
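The relation between the two returned arrays can be checked on a small example. The sample values below are made up for illustration; the point is only that b has one more element than f.

```python
import numpy as np

# Made-up sample values, used only to show the shape of the result.
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 4, 4])

f, b = np.histogram(data, bins=3)

print(f)                 # one frequency per bin
print(b)                 # bin limits: one more value than the number of bins
print(len(b) - len(f))   # 1
```

Every bin includes its lower limit and excludes its upper limit, except the last bin, which includes both. That is why the four 4s are counted in the last bin here.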
```
matplotlib.pyplot.hist(data, bins=n_bins)
```
This function plots a bar graph, with the specified number of bins, of the histogram computed from the values in the data array.
```
sol = np.linalg.solve(A, B)
```
This function solves the linear system of equations given in matrix form as AX = B and returns the solution as a Numpy array.
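A small sketch with two equations in two unknowns shows the calling convention. The system below is made up for illustration.

```python
import numpy as np

# The system:  2x + y = 5
#               x - y = 1
A = np.array([[2.0, 1.0],
              [1.0, -1.0]])
B = np.array([5.0, 1.0])

sol = np.linalg.solve(A, B)

print(sol)                           # x = 2, y = 1
print(np.allclose(A.dot(sol), B))    # substituting back reproduces B
```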
```
(c, p) = scipy.stats.pearsonr(A, B)
```
This function computes the Pearson correlation coefficient between two lists of values, A and B. It returns the Pearson correlation coefficient c and a probability value p. p is the probability of seeing a correlation at least as strong as c purely by chance, that is, if there were no actual relationship between the data items. For believable results, |c| should be greater than about 0.7 to 0.8 and p should be less than 1/N, where N is the number of items in each list.
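To get a feel for the two return values, the sketch below compares a perfectly linear data set with a scattered one. Both data sets are made up for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_line = 3.0 * x + 2.0                                 # exactly linear in x
y_scatter = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])   # rising, with scatter

c1, p1 = pearsonr(x, y_line)
c2, p2 = pearsonr(x, y_scatter)

print("Exact line:     c =", c1)
print("Scattered data: c =", c2, " p =", p2)
```

The exactly linear data gives c = 1, while the scattered data gives a smaller c together with a p value that should be checked against the rule of thumb above.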
```
(sn, ..., s2, s1, c) = np.polyfit(A, B, n)
```
This function performs regression on (x,y) data given in A, B where A is an array of 'x' coordinates and B is an array of 'y' coordinates. n is the degree of the polynomial that is fit through the points. If n is 1, then a line c + s1*x is fit through the points. Otherwise, the polynomial c + s1*x + s2*x**2 + s3*x**3 + ... + sn*x**n is fit through the points. The return values are the coefficients of the polynomial, ordered from the highest power down to the constant term, so for a line the slope s1 comes first and the intercept c second.
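The order of the returned coefficients is easy to get wrong, so here is a small check on points generated from a known quadratic. The data is made up for illustration, and np.polyfit is used here on the assumption that it behaves the same as the polyfit imported in this lab.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x**2 - 3.0 * x + 1.0      # points on an exact quadratic

# Degree 2: coefficients come back highest power first.
s2, s1, c = np.polyfit(x, y, 2)
print(s2, s1, c)                    # close to 2, -3 and 1

# Degree 1 on exactly linear data: slope first, then intercept.
slope, intercept = np.polyfit(x, 5.0 * x + 7.0, 1)
print(slope, intercept)             # close to 5 and 7
```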
```
f = np.polyval(A, X)
```
This function evaluates the polynomial whose coefficients are given in a numpy array A at the points given in X (which is also a numpy array) and returns A(X) as a numpy array. The coefficients are ordered from the highest power down to the constant term: if A is of length N, the value returned for a point x is A[0]*x**(N-1) + A[1]*x**(N-2) + ... + A[N-2]*x + A[N-1]. If X is a sequence, this value is returned for each element of X. If X is another polynomial (a poly1d object), then the composite polynomial A(X(t)) is returned.
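As a quick illustration, the sketch below evaluates 2x**2 - 3x + 1 at three points. The polynomial is made up for illustration.

```python
import numpy as np

A = [2.0, -3.0, 1.0]              # coefficients of 2x**2 - 3x + 1, highest power first
X = np.array([0.0, 1.0, 2.0])

f = np.polyval(A, X)
print(f)                          # the values 1, 0 and 3
```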

In [11]:

import numpy as np
from matplotlib import pyplot as plt
from numpy import polyfit
from scipy.stats import pearsonr
np.set_printoptions(precision=2)      # Show 2 digits after the decimal point

Reading data from a file into a numpy array

X = np.genfromtxt('data filename', delimiter=',')
This will read data (separated by ',' (comma)) from a given file into the numpy array X.

Sample Problems with Solutions

1. This problem illustrates the use of a histogram
A dice is tossed 2000 times and the values are recorded in a file named dice-data.csv. Verify if the dice is unbiased.

Soln:
An unbiased dice should result in an almost equal number of throws for each value '1' to '6'. As the dice is thrown 2000 times, each value should occur approximately 333 times. By computing a histogram of the data, we get the number of times each value is thrown, and it should be easy to see whether all the numbers are roughly equal. A histogram shows the frequency, that is, how many times a value occurs in a given dataset. Usually, the values are put into bins, that is, groups of values. It will become clear as you work your way through this example.

In [4]:

X = np.genfromtxt('dice-data.csv', delimiter=',')
X = np.array(X, dtype='int')
print("X: ", X)
(f, b) = np.histogram(X, bins=6)
print("Bins: ", b)
print("Frequencies: ", f)

X:  [6 2 5 ..., 1 6 5]
Bins:  [ 1.          1.83333333  2.66666667  3.5         4.33333333  5.16666667
  6.        ]
Frequencies:  [355 326 310 335 330 344]

The histogram function generates a histogram of the data. The argument bins=6 says that the data should be grouped into 6 bins (because a dice has the numbers '1' to '6'). The variable b shows how the data is grouped: the first bin holds values from 1 up to, but not including, 1.83. Now look at f. Its first value is 355, which means that there are 355 items in the data with values between 1 and 1.83. As we know that the data comes from throwing a dice, it means that '1' occurred 355 times.
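With bins=6 the bin limits fall at fractional values such as 1.83. One possible alternative, sketched below, is to pass explicit bin limits placed halfway between the integers so that each bin is centred on one face of the dice. The throws here are simulated with a seeded random generator and are not the contents of dice-data.csv; the names throws and edges are chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)             # seeded so that the run is repeatable
throws = rng.integers(1, 7, size=2000)     # simulated throws, values 1 to 6

edges = np.arange(0.5, 7.0, 1.0)           # 0.5, 1.5, ..., 6.5
f, b = np.histogram(throws, bins=edges)

# For integer data, np.bincount gives the same counts directly.
counts = np.bincount(throws, minlength=7)[1:]

print("Bins: ", b)
print("Frequencies: ", f)
print("Same as bincount: ", np.array_equal(f, counts))
```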

Now, let us look at a way in which both the histogram computation and plotting the resulting frequencies can be done together using the hist method in pyplot.

In [5]:

plt.hist(X, bins=6)

Out[5]:

(array([ 355.,  326.,  310.,  335.,  330.,  344.]),
 array([ 1.        ,  1.83333333,  2.66666667,  3.5       ,  4.33333333,
        5.16666667,  6.        ]),
 <a list of 6 Patch objects>)

As seen above, the histogram shows that '1' occurred 355 times, '2', 326 times, '3', 310 times, '4', 335 times, '5', 330 times and '6', 344 times. As these numbers differ from 333 only by about 20, we can assume that the dice is unbiased.
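Judging by eye whether a difference of about 20 is small enough is informal. A chi-square goodness-of-fit test is one standard way to put a number on it. This test is not part of the lab's featured functions, so the sketch below is an optional extra, using scipy.stats.chisquare on the frequencies printed above.

```python
import numpy as np
from scipy.stats import chisquare

# Frequencies of '1' to '6' taken from the histogram output above.
observed = np.array([355, 326, 310, 335, 330, 344])

# With no expected frequencies given, chisquare assumes all categories
# are equally likely, which is exactly the unbiased-dice hypothesis.
stat, p = chisquare(observed)

print("Chi-square statistic: ", stat)
print("p-value: ", p)
```

A p-value well above the usual 0.05 threshold means the observed counts are consistent with an unbiased dice.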

2. This problem shows the use of numpy for solving a system of linear equations. A mathematician is fond of describing interesting car number plates that he sees. He gives the following information about a car number plate. Find the number.

The sum of the first two digits is the number formed by the last two digits.
The third digit is obtained by subtracting the second from the first digit.
The fourth digit is obtained by subtracting the third from the second digit.
The sum of all the digits is 25.

Soln:
We can convert the above into equations. Let the four digits be w, x, y and z. Then:
w + x = 10y + z, i.e., w + x - 10y - z = 0
w - x = y, i.e., w - x - y = 0
x - y = z, i.e., x - y - z = 0
w + x + y + z = 25

These are written in matrix form as AX = B, where A is the matrix containing the coefficients of w, x, y and z, X is a vector containing the unknowns, and B is the vector containing the right side values of the equations. So, we define A and B appropriately and solve for X.

In [6]:

A = np.array([[1, 1, -10, -1],
              [1, -1, -1, 0],
              [0, 1, -1, -1],
              [1, 1, 1, 1]])
B = np.array([[0],
              [0],
              [0],
              [25]])
print("A: ", A)
print("B: ", B)
print("Car Number: ", np.linalg.solve(A, B).transpose())

A:  [[  1   1 -10  -1]
 [  1  -1  -1   0]
 [  0   1  -1  -1]
 [  1   1   1   1]]
B:  [[ 0]
 [ 0]
 [ 0]
 [25]]
Car Number:  [[ 9.  8.  1.  7.]]

The car number is therefore 9817.

There is also another way to solve it. As AX = B, X = inv(A).dot(B), that is, the inverse of A multiplied by B.

In [7]:

print("Car Number is ", np.linalg.inv(A).dot(B))

Car Number is  [[ 9.]
 [ 8.]
 [ 1.]
 [ 7.]]

Once again, we get the correct answer. The first method is considered superior - can you find out why? You can also verify that A.dot(inv(A)) gives the Identity matrix.
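The verification suggested above can be sketched as follows. The matrix and vector are the ones defined for the car number problem, with B written as a flat array for brevity. np.allclose is used instead of an exact comparison because the computed inverse is only accurate up to rounding error.

```python
import numpy as np

A = np.array([[1, 1, -10, -1],
              [1, -1, -1, 0],
              [0, 1, -1, -1],
              [1, 1, 1, 1]], dtype=float)
B = np.array([0, 0, 0, 25], dtype=float)

A_inv = np.linalg.inv(A)
x_solve = np.linalg.solve(A, B)
x_inv = A_inv.dot(B)

print("A.dot(inv(A)) is the identity: ", np.allclose(A.dot(A_inv), np.eye(4)))
print("Both methods agree: ", np.allclose(x_solve, x_inv))
```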

3. This problem is about what happens when we have more equations than unknowns. A student writes a computer program that finds the maximum of a given set of N numbers. She runs the program for different values of N and notes down the time taken to find the maximum each time. Given her data in the file max-data.csv, how much time does it take to find the maximum of 200,000 numbers? The data in the file is in two columns: the first column is N and the second column is the time taken in milliseconds (ms).

Soln: Having more equations than unknowns is the most commonly encountered case in data analysis and helps handle noise in data. That is, the values in the equations are not absolutely correct but approximate. The answer is then finally one that minimises the overall error across all equations. The method we use is regression.
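To make the idea of "more equations than unknowns" concrete, the sketch below writes five made-up (x, y) points as five equations a*x + b = y in the two unknowns a and b, and asks np.linalg.lstsq for the values that minimise the sum of squared errors. The data and the names M, a and b are chosen for this sketch. The last line checks that the result matches a degree-1 polyfit on the same points.

```python
import numpy as np

# Made-up noisy measurements, roughly following y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# One row per equation a*x_i + b*1 = y_i: five equations, two unknowns.
M = np.column_stack([x, np.ones_like(x)])

coeffs, residuals, rank, sv = np.linalg.lstsq(M, y, rcond=None)
a, b = coeffs

print("a: ", a, " b: ", b)
print("Same as polyfit: ", np.allclose(coeffs, np.polyfit(x, y, 1)))
```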

Data analysis using regression has three steps:

Compute the correlation to determine whether a relationship exists at all. The correlation coefficient, which varies between -1 and +1, should have a magnitude of at least 0.8, that is, it should be at most -0.8 or at least +0.8. It is also a good idea to plot the data.
If the correlation is high, then determine whether the relationship is linear.
Determine the relationship by using the polyfit function. For a linear relationship, the equation fit to the data is y = ax + b, and the function returns the values of a and b.

In [12]:

maxdata = np.genfromtxt('max-data.csv', delimiter=',')
print("Times for finding Maximum: ")
print(maxdata)
print("Correlation Coefficient: ", pearsonr(maxdata[:,0],
                                             maxdata[:,1])[0])
plt.plot(maxdata[:,0], maxdata[:,1], 'bo')
plt.show()

Times for finding Maximum: 
[[  1.00e+04   4.50e+02]
 [  2.00e+04   4.44e+02]
 [  3.00e+04   4.43e+02]
 [  5.00e+04   5.43e+02]
 [  7.50e+04   5.06e+02]
 [  1.00e+05   5.97e+02]
 [  2.50e+05   7.96e+02]
 [  5.00e+05   1.02e+03]]
Correlation Coefficient:  0.985400928358

In [15]:

(sl, icept) = polyfit(maxdata[:,0], maxdata[:,1], 1)
print("Slope: ", sl, " Intercept: ", icept)
print("Equation of the line: y = ", sl, "x + ", icept)
maxfn = np.polyval([sl, icept], maxdata[:,0])
plt.plot(maxdata[:,0], maxdata[:,1], 'bo')
plt.plot(maxdata[:,0], maxfn, 'r--')
plt.show()

Slope:  0.00120869966403  Intercept:  443.513230966
Equation of the line: y =  0.00120869966403 x +  443.513230966

In [17]:

print("Time for 200000 Items: ", sl*200000 + icept)

Time for 200000 Items:  685.253163772
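The second step, checking whether the relationship is really linear, was done above only by looking at the plot. One possible numerical check, sketched below, is to look at the residuals (measured minus fitted values) and at the fraction of the variation that the line explains. The data is typed in from the table printed above, which shows rounded values, so the numbers differ slightly from those obtained from max-data.csv. The names residuals and r_squared are chosen for this sketch.

```python
import numpy as np

# N and time (ms) as printed in the table above (rounded values).
N = np.array([1e4, 2e4, 3e4, 5e4, 7.5e4, 1e5, 2.5e5, 5e5])
t = np.array([450.0, 444.0, 443.0, 543.0, 506.0, 597.0, 796.0, 1020.0])

sl, icept = np.polyfit(N, t, 1)
fitted = np.polyval([sl, icept], N)

# Residuals: how far each measurement lies from the fitted line.
residuals = t - fitted

# Fraction of the variation in t that the line accounts for.
r_squared = 1.0 - np.sum(residuals**2) / np.sum((t - t.mean())**2)

print("Residuals (ms): ", residuals)
print("R squared: ", r_squared)
print("Time for 200000 Items: ", np.polyval([sl, icept], 200000))
```

Residuals that scatter around zero with no obvious trend, together with an R squared close to 1, support the choice of a straight line.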

Lab Problems

1. A cricket fan posts a very strange observation on social media. He claims that a certain top batsman's performance is related to how much it rains in Melbourne, Australia. He says that the number of 50s scored by the batsman in a year (in ODIs) is determined by the amount of rain (in cm/year) received in Melbourne. To support his claim, he posted data (which is in file Lab04-01.csv) in two columns: the first is the amount of rain in cm and the second is the number of 50s. Can you analyse the data and determine whether the claim is true?

2. A space explorer wants to determine the acceleration due to gravity (g) on an alien planet by firing a bullet and measuring its path. The equation to use is
S = u*t + 0.5*g*t**2
where S is the distance, u the initial velocity and t the time. He records the data in the file Lab04-2.csv. Read the data, fit the equation above and find g. The first column is the time in seconds and the second column is the distance in metres.

3. Find the 5-digit number given by the following information:

The number formed by the last three digits is 4 less than twice the number formed by the first two digits.
The third digit is obtained if we subtract the first digit from the second.
The sum of the 1st, 3rd and 5th digits is 8 more than the sum of the 2nd and 4th digits.
The number formed by the third and fourth digits is twice the first digit.
The sum of all the digits is 20.

4. Two data files containing roughly similar text on the human digestive system are given to you. The first is in English and the second is in Spanish. Plot the word length histograms (plot the number of words of a particular length against length) and discuss any differences between the two languages.

Lab-IV (09/09/2016):

DATA ANALYSIS
