d = np.genfromtxt(filename, delimiter=',')
This function reads data from a file directly into a numpy array and returns that array. The first argument is the name of the file; the second is the delimiter used in the file. For example, if the file is in CSV format, with commas separating the data values, then the delimiter is ','.
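As a minimal sketch of the call described above, the snippet below reads comma-separated values into an array. The in-memory text `csv_text` is a hypothetical stand-in for a real file; passing a file name works the same way.

```python
import io
import numpy as np

# Hypothetical CSV content standing in for a file on disk.
csv_text = "1,2,3\n4,5,6\n"

# genfromtxt accepts a file name or any file-like object.
d = np.genfromtxt(io.StringIO(csv_text), delimiter=',')

print(d.shape)  # (2, 3): two rows, three columns
print(d.dtype)  # float64: values are read as floats by default
```

Because the values arrive as floats, integer data may need an explicit conversion afterwards.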
(f, b) = np.histogram(data array, bins=No. of bins)
This function creates a histogram or a frequency distribution of the data. It returns the frequencies and the bin lower limits as numpy arrays, f and b. The size of the bin limits array b is one greater than the number of bins because it contains the upper limit of the last bin at the end. That is, if the number of bins = 4, b will contain 5 values, bl_1, bl_2, bl_3, bl_4 and bh_4 where 'l' indicates lower limit and 'h' indicates higher or upper limit.
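The relationship between the two returned arrays can be checked with a small sketch. The data values here are made up purely for illustration.

```python
import numpy as np

# Hypothetical data: eight values between 0 and 4.
data = np.array([0.0, 0.5, 1.2, 1.8, 2.1, 2.9, 3.5, 4.0])

(f, b) = np.histogram(data, bins=4)

print(f)               # frequencies, one per bin
print(b)               # 4 lower limits plus the upper limit of the last bin
print(len(f), len(b))  # b has one more entry than f
```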
matplotlib.pyplot.hist(data array, bins=No. of bins)
This function plots a bar graph, with the specified number of bins, of the histogram computed from the values in the data array.
sol = np.linalg.solve(A, B)
This function solves the linear system of equations given in matrix form as AX = B and returns the solution as a Numpy array.
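A small sketch of this call, using a hypothetical two-equation system (2x + y = 5 and x + 3y = 10) chosen only for illustration:

```python
import numpy as np

# Coefficient matrix and right-hand side of the hypothetical system.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
B = np.array([5.0, 10.0])

sol = np.linalg.solve(A, B)
print(sol)  # x = 1, y = 3

# Substituting the solution back should reproduce B.
print(np.allclose(A.dot(sol), B))
```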
(c, p) = scipy.stats.pearsonr(A, B)
This function computes the Pearson correlation coefficient between two lists of values, A and B. It returns the Pearson correlation coefficient c and also a probability value p. p is the probability of seeing a correlation at least this strong purely by chance, that is, when there is no actual relationship between the data items. For believable results, |c| should be greater than about 0.7 or 0.8 and p should be less than 1/N, where N is the number of items in the lists.
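The rule of thumb above can be tried on a small sketch. The two arrays are hypothetical, with B chosen to be roughly twice A so that a strong correlation is expected.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical data: B is approximately 2*A with a little noise.
A = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
B = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

(c, p) = pearsonr(A, B)
print("c =", c, " p =", p)

# Apply the rule of thumb from the text: |c| > 0.8 and p < 1/N.
N = len(A)
print("Believable:", abs(c) > 0.8 and p < 1.0 / N)
```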
(s1, s2, ..., sn, c) = polyfit(A, B, n)
This function performs regression on (x, y) data given in A, B where A is an array of 'x' coordinates and B is an array of 'y' coordinates. n is the degree of the polynomial that is fit through the points. If n is 1, then a line c + s1*x is fit through the points. Otherwise, the polynomial s1*x**n + s2*x**(n-1) + ... + sn*x + c is fit, with the coefficients returned in order of decreasing power.
f = np.polyval(A, X)
This function evaluates the polynomial whose coefficients are given in a numpy array A at the points given in X (which is also a numpy array) and returns A(X) as a numpy array. If A is of length N, this function returns the value A[0]*X**(N-1) + A[1]*X**(N-2) + ... + A[N-2]*X + A[N-1].
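As a sketch of how polyfit and polyval work together, the snippet below fits a degree-2 polynomial to hypothetical points that lie exactly on y = 2x**2 - 3x + 1, then evaluates the fitted polynomial at new points. It uses numpy's own polyfit.

```python
import numpy as np

# Hypothetical (x, y) data lying exactly on y = 2x^2 - 3x + 1.
A = np.array([0.0, 1.0, 2.0, 3.0])
B = 2.0 * A**2 - 3.0 * A + 1.0

# Coefficients come back highest power first: approximately [2, -3, 1].
coeffs = np.polyfit(A, B, 2)
print(coeffs)

# Evaluate the fitted polynomial at new x values.
X = np.array([4.0, 5.0])
f = np.polyval(coeffs, X)
print(f)  # approximately [21, 36]
```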
Initialisation
import numpy as np                     # the code below refers to numpy as np
from matplotlib import pyplot as plt
from numpy import polyfit              # polyfit is provided by numpy
from scipy.stats import pearsonr
np.set_printoptions(precision=2) # Show numbers to 2 decimal places only
This will read data (separated by ',' (comma)) from a given file into the numpy array X.
1. This problem illustrates the use of a histogram
A dice is tossed 2000 times and the values are recorded in a file named dice-data.csv. Verify if the dice is unbiased.
Soln:
An unbiased dice should result in almost equal number of throws for each value '1' to '6'. As the dice is thrown 2000 times, each value should occur approximately 333 times each. By computing a histogram of the data, we will get the number of times each value is thrown and it should be easy to see if all the numbers are roughly equal. A histogram shows the frequency or how many times a value occurs in a given dataset. Usually, the values are put into bins, that is groups of values. It will become clear as you work your way through this example.
X = np.genfromtxt('dice-data.csv', delimiter=',')
X = np.array(X, dtype='int')
print("X: ", X)
(f, b) = np.histogram(X, bins=6)
print("Bins: ", b)
print("Frequencies: ", f)
Now, let us look at a way in which both the histogram computation and plotting the resulting frequencies can be done together using the hist method in pyplot.
plt.hist(X, bins=6)
plt.show()
As seen above, the histogram shows that '1' occurred 355 times, '2', 326 times, '3', 310 times, '4', 335 times, '5', 330 times and '6', 344 times. As these numbers differ from 333 only by about 20, we can assume that the dice is unbiased.
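The comparison with the expected count can be made explicit with a short sketch. The frequencies are the ones quoted above; whether a given deviation is small enough is a judgment call rather than a formal test.

```python
import numpy as np

# Frequencies for the values 1 to 6, as quoted in the text above.
f = np.array([355, 326, 310, 335, 330, 344])

# With an unbiased dice, each value is expected total/6 times.
expected = f.sum() / 6.0
deviation = f - expected

print("Expected count per value:", expected)
print("Deviations:", deviation)
print("Largest deviation:", np.abs(deviation).max())
```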
2. This problem shows the use of numpy for solving a system of linear equations. A mathematician is fond of describing interesting car number plates that he sees. He gives the following information about a car number plate. Find the number.
The third digit is obtained by subtracting the second from the first digit.
The fourth digit is obtained by subtracting the third from the second digit.
The sum of all the digits is 25.
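These are written in matrix form as AX = B where A is the matrix containing the coefficients of w, x, y and z (the four digits of the number, from first to last); X is a vector containing the unknowns; and B is the vector containing the right side values of the equations. Read row by row, the matrix A below encodes w + x - 10y - z = 0, w - x - y = 0, x - y - z = 0 and w + x + y + z = 25; the first of these states that the sum of the first two digits equals the two-digit number formed by the last two. So, we define A, B and X appropriately.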
A = np.array([[1, 1, -10, -1],
[1, -1, -1, 0],
[0, 1, -1, -1],
[1, 1, 1, 1]])
B = np.array([[0],
[0],
[0],
[25]])
print("A: ", A)
print("B: ", B)
print("Car Number: ", np.linalg.solve(A, B).transpose())
There is also another way to solve it. As AX = B, X = inv(A)B, where inv(A) is the inverse of A.
print("Car Number is ", np.linalg.inv(A).dot(B))
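Either way, it is worth checking the answer by substituting it back into AX = B. The sketch below repeats A and B from above so that it runs on its own, and rounds the floating-point solution to whole digits.

```python
import numpy as np

# A and B exactly as defined in the worked example above.
A = np.array([[1, 1, -10, -1],
              [1, -1, -1, 0],
              [0, 1, -1, -1],
              [1, 1, 1, 1]])
B = np.array([[0],
              [0],
              [0],
              [25]])

X = np.linalg.solve(A, B)

# The solution should reproduce the right-hand side.
print("A.X equals B:", np.allclose(A.dot(X), B))

# Round to the nearest integers to read off the digits.
digits = np.rint(X).astype(int).ravel()
print("Digits: ", digits)
```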
3. This problem is about what happens when we have more equations than unknowns. A student writes a computer program that finds the maximum of a given set of N numbers. She runs the program for different values of N and notes down the time taken to find the maximum each time. Given her data in the file max-data.csv, how much time does it take to find the maximum of 200,000 numbers? The data in the file is in two columns: the first column is N and the second column is the time taken in milliseconds (ms).
Soln: Having more equations than unknowns is the most commonly encountered case in data analysis and helps handle noise in data. That is, the values in the equations are not exact but approximate. The answer is then the one that minimises the overall error across all equations. The method we use is regression. Data analysis using regression has three steps, which the code below follows in order: check that the two quantities are correlated, fit a function through the data, and use the fitted function to predict new values.
maxdata = np.genfromtxt('max-data.csv', delimiter=',')
print("Times for finding Maximum: ")
print(maxdata)
print("Correlation Coefficient: ", pearsonr(maxdata[:,0],
                                             maxdata[:,1])[0])
plt.plot(maxdata[:,0], maxdata[:,1], 'bo')
plt.show()
(sl, icept) = polyfit(maxdata[:,0], maxdata[:,1], 1)
print("Slope: ", sl, " Intercept: ", icept)
print("Equation of the line: y = ", sl, "x + ", icept)
maxfn = np.polyval([sl, icept], maxdata[:,0])
plt.plot(maxdata[:,0], maxdata[:,1], 'bo')
plt.plot(maxdata[:,0], maxfn, 'r--')
plt.show()
print("Time for 200000 Items: ", sl*200000 + icept)
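The "more equations than unknowns" view of this problem can be sketched directly. Each measurement gives one equation sl*N + icept = t, and a least-squares solver finds the slope and intercept that minimise the total squared error. The measurements below are hypothetical and are not the contents of max-data.csv; np.linalg.lstsq is used here only to illustrate the idea and should agree with a degree-1 polyfit.

```python
import numpy as np

# Hypothetical noisy measurements: N items, t milliseconds.
N = np.array([1000.0, 2000.0, 3000.0, 4000.0, 5000.0])
t = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

# Five equations in two unknowns: sl*N[i] + icept*1 = t[i].
M = np.column_stack([N, np.ones_like(N)])
(sl, icept), residuals, rank, sv = np.linalg.lstsq(M, t, rcond=None)
print("Least squares: slope =", sl, " intercept =", icept)

# A degree-1 polyfit minimises the same squared error.
(sl2, icept2) = np.polyfit(N, t, 1)
print("polyfit:       slope =", sl2, " intercept =", icept2)
```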
1. A cricket fan posts a very strange observation on social media. He claims that a certain top batsman's performance is related to how much it rains in Melbourne, Australia. He says that the number of 50s scored by the batsman in a year (in ODIs) is determined by the amount of rain (in cm/year) received in Melbourne. To support his claim, he posted data (which is in file Lab04-01.csv) in two columns: the first is the amount of rain in cm and the second is the number of 50s. Can you analyse the data and determine whether the claim is True?!
2. A space explorer wants to determine the acceleration due to gravity (g) on an alien planet by firing a bullet and measuring its path. The equation to use is
S = ut + 0.5gt**2
where S is the distance, u the initial velocity and t the time. He records the data in the file Lab04-2.csv. Read the data, fit the equation above and find g. The first column is the time in seconds, the second column is the distance in metres.
3. Find the 5-digit number given by the following information:
4. Two data files containing roughly similar text on the human digestive system are given to you. The first is in English and the second is in Spanish. Plot the word length histograms (plot the number of words of a particular length against length) and discuss any differences between the two languages.