Lab VII (13/10/2016)

STATISTICAL DATA ANALYSIS


In this lab, we will look at some statistical tests for analysing experimental data and drawing statistically valid conclusions.

Preliminaries

The most important question one asks after performing any experiment is: "does the data from the experiment support or refute the hypothesis which formed the basis of the experiment?"

Example 1: Let us hypothesise that Hyderabad was warmer in May this year than during last year. Can we validate this claim?

We get temperature measurements for May of last year and of this year. We can find the mean temperature for the two years and see which is greater. This is a beginning, but is the observed difference real, or is it due to random variation in the data or in the measuring technique? Statistics lets us answer this question.

The key tool is the t-test, which can be used to test whether the means of two distributions are significantly different. Significance depends on several factors, including the number of samples obtained from the experiments. For example, a difference in means of, say, 4 may not be significant if the number of samples is only 5, while a difference of only 0.5 can be significant if the number of samples is 50. You will see all this in the following example.
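The effect of sample size can be seen with a small simulation. This is a sketch with synthetic data (the means, spreads and seed are illustrative, not taken from the lab files):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.RandomState(0)  # fixed seed so the sketch is reproducible

# Two populations whose true means differ by only 0.5 degrees
a_small = rng.normal(40.0, 5.0, 5)
b_small = rng.normal(40.5, 5.0, 5)
a_large = rng.normal(40.0, 5.0, 5000)
b_large = rng.normal(40.5, 5.0, 5000)

# With only 5 samples the difference is drowned out by noise...
print('n=5    p-value:', stats.ttest_ind(a_small, b_small).pvalue)
# ...but with 5000 samples the same 0.5-degree difference is detectable
print('n=5000 p-value:', stats.ttest_ind(a_large, b_large).pvalue)
```

The same underlying difference in means gives a large p-value with 5 samples and a tiny one with 5000: more data makes a small effect detectable.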

The temperature data is given in the files May15_temp.dat and May16_temp.dat.

In [2]:
import numpy as np
import scipy.stats as stats
from matplotlib import pyplot as plt

May15 = np.loadtxt("May15_temp.dat", dtype='int32')
May16 = np.loadtxt("May16_temp.dat", dtype='int32')

# Let us compute the means of the two datasets
print('Mean temperature in May 2015 is ', np.mean(May15))
print('Mean temperature in May 2016 is ', np.mean(May16))
Mean temperature in May 2015 is  41.0967741935
Mean temperature in May 2016 is  40.5806451613

The difference is only about 0.5 degrees and, in fact, May 2015 was the hotter one! The question is: is the difference really significant?

In [3]:
r = stats.ttest_ind(May15, May16)
print(r.statistic, r.pvalue, r.statistic/r.pvalue)
2.06215704891 0.0435310848171 47.3720574061

The standard way to interpret these numbers is via r.pvalue: it is the probability of seeing a difference in means at least this large purely by chance, so the smaller it is, the stronger the evidence that the difference is real. A common convention takes p < 0.05 as significant, and a stricter p < 10^{-3} as strong evidence. Here the p-value is about 0.044, which is at best marginal. As a crude rule of thumb for this lab, we will demand both a large statistic and a very small p-value, summarised as r.statistic/r.pvalue > 100. For our temperature data the ratio is only about 47, so the criterion is not met and we cannot claim that May 2015 was hotter than May 2016, or anything else for that matter! There is no strong evidence that May 2015 and May 2016 were different.

Anyway, it is always a good idea to plot the data. So, plot the temperature data for May 2015 and May 2016 and see how they look. You can make your own observations about the result.

In [4]:
plt.plot(May15, 'ro-')
plt.plot(May16, 'bd-')
plt.show()

Example 2: We do an experiment to determine whether drinking affects the speed of responses to a set of simple language tasks. A set of 500 words is presented, and the speed with which the subject responds to each is measured. The response times in milliseconds, recorded when the subject is sober and when drunk, are given in the files sober.dat and drunk.dat. The test subject claimed that drinking does not affect the response time. What are your conclusions?

In [5]:
sober = np.loadtxt('sober.dat', dtype='int32')
drunk = np.loadtxt('drunk.dat', dtype='int32')
print('Average Response Times (in ms)')
print('When Sober: ', np.mean(sober), '   When Drunk: ', np.mean(drunk))
print('Result of t-test:')
# equal_var=False gives Welch's t-test, which does not assume equal variances
r = stats.ttest_ind(sober, drunk, equal_var=False)
print(r.statistic, r.pvalue, r.statistic/r.pvalue)
Average Response Times (in ms)
When Sober:  122.196    When Drunk:  120.726
Result of t-test:
2.27871835993 0.0230671341426 98.7863661712

This is surprising: the mean response time goes down after drinking. So, is drinking good for responses? Not so fast: the t-test p-value is only about 0.023, marginal at best, and by our rule of thumb (statistic/p-value > 100; here the ratio is about 99) the difference in means does not qualify as significant.

Let us look at the data.

In [6]:
plt.plot(sober, 'go')
plt.plot(drunk, 'rd')
plt.show()

What is the difference? When drunk, the response times are not at all consistent; they vary wildly. Can statistics help us here? Yes, it can! What we need now is a test of whether two sets of data have the same variance, and whether any difference in variance is significant. (A caution on terminology: the stats.f_oneway method used below performs one-way ANOVA, an F-test whose statistic compares group means; with two groups it is the square of the pooled t statistic. SciPy's dedicated equal-variance tests are Levene's test, stats.levene, and Bartlett's test, stats.bartlett.)

Let us try this on our data.

In [7]:
print('Variances in the data: Sober - ', np.var(sober), 'Drunk - ', np.var(drunk))
print('Result of F-test:')
f = stats.f_oneway(sober, drunk)
print(f.statistic, f.pvalue, f.statistic/f.pvalue)
Variances in the data: Sober -  9.889584 Drunk -  197.770924
Result of F-test:
5.19255736387 0.0228943606378 226.805082965

The F-test tells us that f.statistic and f.pvalue are not, individually, very high or very low, but their ratio is above 100. Together with the huge disparity in the sample variances, about 9.9 versus 197.8, we can conclude that the difference in variances is significant: drinking makes the person unreliable and highly variable when it comes to response times, although the mean response time does not appear to be affected.
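Since stats.f_oneway actually compares means, a dedicated equal-variance test is the safer tool here. A minimal sketch using Levene's test on synthetic data shaped like ours (same mean, very different spread; the numbers are illustrative, not the lab files):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.RandomState(1)  # synthetic stand-ins for sober.dat / drunk.dat

# Same mean response time (~121 ms) but very different spread
sober_sim = rng.normal(121.0, 3.0, 500)
drunk_sim = rng.normal(121.0, 14.0, 500)

# Levene's test: the null hypothesis is that the two variances are equal,
# so a tiny p-value means the difference in variances is significant
w, p = stats.levene(sober_sim, drunk_sim)
print('Levene statistic:', w, ' p-value:', p)
```

Levene's test is robust to departures from normality; stats.bartlett is an alternative that assumes normal data.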

Example 3: In this example, we see how to verify whether experimental data follows a hypothesised model. We want to check whether a given die is unbiased, that is, whether all six values are equally likely. The data for 1000 rolls of the die is given in the file dice.dat. Is the die unbiased?

In [8]:
dice = np.loadtxt('dice.dat', dtype='int32')
df, db = np.histogram(dice, bins=6)
print('Frequency Distribution for Dice Data: ', df)
plt.hist(dice,bins=[0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5])
plt.show()
Frequency Distribution for Dice Data:  [159 180 175 171 154 161]

An unbiased die should give each number from '1' to '6' with equal frequency, about 1000/6 ≈ 166.67 times on average over 1000 rolls. We see that '2' occurs 180 times while '5' occurs only 154 times. But is this difference significant?

We now use the last of our statistical tests, the chi-squared test. It checks whether the distribution obtained from our experimental data (that is, the frequencies shown above) is consistent with a hypothesised distribution, in this case the uniform distribution in which each value from '1' to '6' is expected about 166.67 times.

The method stats.chisquare(expt_hist, theory_hist) takes two arguments. The first, expt_hist, is the histogram of frequencies obtained from the experiment. The second, theory_hist, is the histogram of frequencies we hypothesise or expect to get. (In SciPy's documentation these arguments are called f_obs and f_exp.) The result of the test tells us whether expt_hist is consistent with theory_hist.
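As a quick sanity check of the call signature, here is a hypothetical coin-toss example (not part of the lab data):

```python
import scipy.stats as stats

# 100 tosses of a coin: 54 heads and 46 tails observed, 50/50 expected
observed = [54, 46]
expected = [50, 50]

cs = stats.chisquare(observed, expected)
# statistic = (54-50)^2/50 + (46-50)^2/50 = 0.64
print(cs.statistic, cs.pvalue)
```

The p-value here is large (around 0.42), so a 54/46 split gives no evidence that the coin is biased.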

In [11]:
cs = stats.chisquare(df, [166.67, 166.67, 166.67, 166.67, 166.67, 166.67])
print(cs.statistic, cs.pvalue)
3.10393832123 0.683965317354

For consistency with the hypothesised distribution, the p-value should be reasonably large (close to '1') and the statistic small (close to '0'). The values above are reasonable; when the two distributions are genuinely different, the statistic becomes large and the p-value collapses towards zero. We can omit the second argument entirely if we want to test for a uniform distribution, as in the case of die rolls or coin tosses.

Just for the fun of it, we can test the data against a normal distribution. We generate samples using the np.random.normal() method from the numpy package. To mimic die rolls, we centre the distribution at '3.5' with standard deviation '1.5', draw 1000 values, and truncate them to integers. (Note that a normal distribution is unbounded, so some truncated values fall outside '1' to '6' and land outside the histogram bins; Nf therefore need not sum to 1000, and recent SciPy versions may raise an error when the observed and expected totals disagree.) We then compute the histogram and use it as our theory_hist in the stats.chisquare method.
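A defensive way around the mismatched-totals pitfall is to rescale the expected histogram before calling stats.chisquare. A sketch with synthetic data (the observed counts reuse the die frequencies from above purely for illustration):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.RandomState(2)

# Observed: a near-uniform die histogram summing to 1000
df_sim = np.array([159, 180, 175, 171, 154, 161])

# Expected: truncated normal samples, binned the same way; values outside
# the bins are dropped, so Nf generally sums to less than 1000
N = np.int32(rng.normal(3.5, 1.5, 1000))
Nf, _ = np.histogram(N, bins=[0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5])

# Rescale the expected histogram so both totals match before the test
Nf_scaled = Nf * df_sim.sum() / Nf.sum()
ns = stats.chisquare(df_sim, Nf_scaled)
print(ns.statistic, ns.pvalue)
```

Because the observed counts are nearly uniform and the expected ones are bell-shaped, the p-value comes out vanishingly small, as in the lab's run.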

In [10]:
print('Checking for Uniform Distribution')
cs = stats.chisquare(df)  # with no second argument, a uniform distribution is assumed
print(cs.statistic, cs.pvalue)

print('Checking for Normal Distribution')
N = np.int32(np.random.normal(3.5, 1.5, 1000))
Nf, Nb = np.histogram(N, bins=[0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5])
plt.hist(N, bins=[0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5])
plt.show()
print(Nf, Nb)
ns = stats.chisquare(df, Nf)
print(ns.statistic, ns.pvalue)
Checking for Uniform Distribution
3.104 0.683955816072
Checking for Normal Distribution

[104 226 270 202  96  33] [ 0.5  1.5  2.5  3.5  4.5  5.5  6.5]
608.15923714 3.49047824931e-129

Now you see what happens when we try to match an experimental distribution against the wrong theoretical distribution. The p-value is about 10^{-129}; if that is not close to '0', nothing is! The normal distribution is decisively rejected.

A better way to test whether two datasets are drawn from the same distribution is what I call the Vodka Test! Officially, of course, it is called the KS test, or Kolmogorov-Smirnov test. It is provided in the scipy.stats package as the ks_2samp method; two datasets ds1 and ds2 are compared as stats.ks_2samp(ds1, ds2). It too returns a statistic and a p-value: a p-value close to 1 means there is no evidence that the two datasets come from different distributions, while a value close to 0 implies that the differences between them are significant and that they are dissimilar.
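A small sketch of ks_2samp in action, with synthetic data (the names, seed and distributions are illustrative):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.RandomState(3)
same_a = rng.normal(0.0, 1.0, 300)
same_b = rng.normal(0.0, 1.0, 300)   # drawn from the same distribution
shifted = rng.normal(1.0, 1.0, 300)  # mean shifted by one standard deviation

# Same distribution: typically a large p-value, no evidence of a difference
print(stats.ks_2samp(same_a, same_b).pvalue)
# Different distributions: a tiny p-value, the difference is significant
print(stats.ks_2samp(same_a, shifted).pvalue)
```

Unlike the chi-squared test, the KS test works directly on the raw samples, so no binning into histograms is needed.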

In summary, what we have seen is that analysing data is quite easy in Python. We use plot together with ttest_ind, f_oneway, chisquare and the KS test to check whether our experimental data supports or refutes our initial hypotheses. To test whether two sets of values have the same mean, we use the t-test; to test whether two sets of values with similar means have the same variance, we use a variance test such as Levene's; and to check whether two sets of values follow the same distribution, or whether experimental values follow a theoretical distribution, we use the chi-squared test or the KS test. In each case we get a value computed from the data (the statistic) and a p-value. If the p-value is very small, the data is inconsistent with the null hypothesis of "no difference"; if it is large, our initial hypothesis of a difference cannot be justified.

Problems

Note: You may refer to the previous lab sessions to solve these problems. Write the Python code for each problem in a separate file named Lab07-QuestionNo.py. Email these files as attachments in a single email to itlab2.2016@gmail.com
  1. Daily maximum and minimum temperatures for 120 days are given in the file max-min.dat. The first column contains maximum and the second column, the minimum temperatures. Do you think there is a relationship between the maximum and minimum temperatures? If so, find the equation relating maximum and minimum temperatures. Use your equation to predict the minimum temperature when the maximum temperature on a particular day is 36 degrees. Also, plot the maximum and minimum temperatures in a single plot.

    Hint: You should use linear regression and correlation coefficients.
  2. The course IT for Engineers with 100 students is taught one year by one teacher and the next year by another teacher. The marks given by the first teacher are in the file teacher_1.dat and those given by the second teacher are in the file teacher_2.dat. An enterprising student decided to use statistical tests to decide which teacher's class to join! Naturally, the student computed the average marks given by each teacher and wished to join the class with the higher average marks. What do you think? Which teacher is easier with the marks?

    Hint: Compare the distribution of marks given by the two teachers - not just the averages. Also, plot the histograms.
  3. A researcher claimed that people in cities live, on average, almost 4 years longer than those who primarily live in villages. This conclusion is based on the data given in the file cereal.dat. The first column gives how many years a person in a village lives and the second column gives how many years a person in the city lives. Do you agree with the researcher's claim?

  4. Another researcher used the data given in the file big_cereal.dat to claim that there is no difference in the average life spans of people living in cities or villages. Again, the data in the file is in two columns. The first column is the number of years lived by a person in the city and the second, by a person in a village. Do you agree with this claim?


HAVE FUN!