Lab VII (13/10/2016)

STATISTICAL DATA ANALYSIS


In this lab, we will look at some statistical tests for analysing experimental data and drawing statistically valid conclusions.

Preliminaries

The most important question one asks after performing any experiment is: "does the data from the experiment support or refute the hypothesis which formed the basis of the experiment?"

Example 1: Let us hypothesise that Hyderabad was warmer in May this year than during last year. Can we validate this claim?

We get temperature measurements for May of last year and of this year. We can find the mean temperature for the two years and see which is greater. This is a beginning, but is the observed difference real, or is it due to random variation in the data or in the measuring technique? Statistics lets us answer this question.

The key tool is the t-test, which can be used to test whether the means of two distributions are significantly different. Significance depends on several factors, including the number of samples obtained from the experiments. For example, a difference in means of, say, 4 may not be significant if the number of samples is only 5, while a difference of only 0.5 can be significant if the number of samples is 50. You will see all this in the following example.
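The effect of sample size can be seen with a small simulation. This is a sketch with synthetic data (the means, spreads and seed are illustrative, not taken from the lab files):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.RandomState(0)  # fixed seed so the sketch is reproducible

# Two populations whose true means differ by only 0.5 degrees
a_small = rng.normal(40.0, 5.0, 5)
b_small = rng.normal(40.5, 5.0, 5)
a_large = rng.normal(40.0, 5.0, 5000)
b_large = rng.normal(40.5, 5.0, 5000)

# With only 5 samples the difference is drowned out by noise...
print('n=5    p-value:', stats.ttest_ind(a_small, b_small).pvalue)
# ...but with 5000 samples the same 0.5-degree difference is detectable
print('n=5000 p-value:', stats.ttest_ind(a_large, b_large).pvalue)
```

The same underlying difference in means gives a large p-value with 5 samples and a tiny one with 5000: more data makes a small effect detectable.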

The temperature data is given in the files May15_temp.dat and May16_temp.dat.

In [2]:
import numpy as np
import scipy.stats as stats
from matplotlib import pyplot as plt

May15 = np.loadtxt("May15_temp.dat", dtype='int32')
May16 = np.loadtxt("May16_temp.dat", dtype='int32')

# Let us compute the means of the two datasets
print('Mean temperature in May 2015 is ', np.mean(May15))
print('Mean temperature in May 2016 is ', np.mean(May16))
Mean temperature in May 2015 is  41.0967741935
Mean temperature in May 2016 is  40.5806451613

The difference is only about 0.5 degrees and, in fact, May 2015 was the hotter one! The question is: is the difference really significant?

In [3]:
r = stats.ttest_ind(May15, May16)
print(r.statistic, r.pvalue, r.statistic/r.pvalue)
2.06215704891 0.0435310848171 47.3720574061

The standard way to interpret these numbers is via r.pvalue: it is the probability of seeing a difference in means at least this large purely by chance, so the smaller it is, the stronger the evidence that the difference is real. A common convention takes p < 0.05 as significant, and a stricter p < 10^{-3} as strong evidence. Here the p-value is about 0.044, which is at best marginal. As a crude rule of thumb for this lab, we will demand both a large statistic and a very small p-value, summarised as r.statistic/r.pvalue > 100. For our temperature data the ratio is only about 47, so the criterion is not met and we cannot claim that May 2015 was hotter than May 2016, or anything else for that matter! There is no strong evidence that May 2015 and May 2016 were different.

Anyway, it is always a good idea to plot the data. So, plot the temperature data for May 2015 and May 2016 and see how they look. You can make your own observations about the result.

In [4]:
plt.plot(May15, 'ro-')
plt.plot(May16, 'bd-')
plt.show()

Example 2: We do an experiment to determine whether drinking affects the speed of responses to a set of simple language tasks. A set of 500 words is presented, and the speed with which the subject responds to each is measured. The response times in milliseconds, recorded when the subject is sober and when drunk, are given in the files sober.dat and drunk.dat. The test subject claimed that drinking does not affect the response time. What are your conclusions?

In [5]:
sober = np.loadtxt('sober.dat', dtype='int32')
drunk = np.loadtxt('drunk.dat', dtype='int32')
print('Average Response Times (in ms)')
print('When Sober: ', np.mean(sober), '   When Drunk: ', np.mean(drunk))
print('Result of t-test:')
# equal_var=False gives Welch's t-test, which does not assume equal variances
r = stats.ttest_ind(sober, drunk, equal_var=False)
print(r.statistic, r.pvalue, r.statistic/r.pvalue)
Average Response Times (in ms)
When Sober:  122.196    When Drunk:  120.726
Result of t-test:
2.27871835993 0.0230671341426 98.7863661712

This is surprising: the mean response time goes down after drinking. So, is drinking good for responses? Not so fast: the t-test p-value is only about 0.023, marginal at best, and by our rule of thumb (statistic/p-value > 100; here the ratio is about 99) the difference in means does not qualify as significant.

Let us look at the data.

In [6]:
plt.plot(sober, 'go')
plt.plot(drunk, 'rd')
plt.show()

What is the difference? When drunk, the response times are not at all consistent; they vary wildly. Can statistics help us here? Yes, it can! What we need now is a test of whether two sets of data have the same variance, and whether any difference in variance is significant. (A caution on terminology: the stats.f_oneway method used below performs one-way ANOVA, an F-test whose statistic compares group means; with two groups it is the square of the pooled t statistic. SciPy's dedicated equal-variance tests are Levene's test, stats.levene, and Bartlett's test, stats.bartlett.)

Let us try this on our data.

In [7]:
print('Variances in the data: Sober - ', np.var(sober), 'Drunk - ', np.var(drunk))
print('Result of F-test:')
f = stats.f_oneway(sober, drunk)
print(f.statistic, f.pvalue, f.statistic/f.pvalue)
Variances in the data: Sober -  9.889584 Drunk -  197.770924
Result of F-test:
5.19255736387 0.0228943606378 226.805082965

The F-test tells us that f.statistic and f.pvalue are not, individually, very high or very low, but their ratio is above 100. Together with the huge disparity in the sample variances, about 9.9 versus 197.8, we can conclude that the difference in variances is significant: drinking makes the person unreliable and highly variable when it comes to response times, although the mean response time does not appear to be affected.
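Since stats.f_oneway actually compares means, a dedicated equal-variance test is the safer tool here. A minimal sketch using Levene's test on synthetic data shaped like ours (same mean, very different spread; the numbers are illustrative, not the lab files):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.RandomState(1)  # synthetic stand-ins for sober.dat / drunk.dat

# Same mean response time (~121 ms) but very different spread
sober_sim = rng.normal(121.0, 3.0, 500)
drunk_sim = rng.normal(121.0, 14.0, 500)

# Levene's test: the null hypothesis is that the two variances are equal,
# so a tiny p-value means the difference in variances is significant
w, p = stats.levene(sober_sim, drunk_sim)
print('Levene statistic:', w, ' p-value:', p)
```

Levene's test is robust to departures from normality; stats.bartlett is an alternative that assumes normal data.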

Example 3: In this example, we see how to verify whether experimental data follows a hypothesised model. We want to check whether a given die is unbiased, that is, whether all six values are equally likely. The data for 1000 rolls of the die is given in the file dice.dat. Is the die unbiased?

In [8]:
dice = np.loadtxt('dice.dat', dtype='int32')
df, db = np.histogram(dice, bins=6)
print('Frequency Distribution for Dice Data: ', df)
plt.hist(dice,bins=[0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5])
plt.show()
Frequency Distribution for Dice Data:  [159 180 175 171 154 161]

An unbiased die should give each number from '1' to '6' with equal frequency, about 1000/6 ≈ 166.67 times on average over 1000 rolls. We see that '2' occurs 180 times while '5' occurs only 154 times. But is this difference significant?

We now use the last of our statistical tests, the chi-squared test. It checks whether the distribution obtained from our experimental data (that is, the frequencies shown above) is consistent with a hypothesised distribution, in this case the uniform distribution in which each value from '1' to '6' is expected about 166.67 times.

The method stats.chisquare(expt_hist, theory_hist) takes two arguments. The first, expt_hist, is the histogram of frequencies obtained from the experiment. The second, theory_hist, is the histogram of frequencies we hypothesise or expect to get. (In SciPy's documentation these arguments are called f_obs and f_exp.) The result of the test tells us whether expt_hist is consistent with theory_hist.
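As a quick sanity check of the call signature, here is a hypothetical coin-toss example (not part of the lab data):

```python
import scipy.stats as stats

# 100 tosses of a coin: 54 heads and 46 tails observed, 50/50 expected
observed = [54, 46]
expected = [50, 50]

cs = stats.chisquare(observed, expected)
# statistic = (54-50)^2/50 + (46-50)^2/50 = 0.64
print(cs.statistic, cs.pvalue)
```

The p-value here is large (around 0.42), so a 54/46 split gives no evidence that the coin is biased.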

In [11]:
cs = stats.chisquare(df, [166.67, 166.67, 166.67, 166.67, 166.67, 166.67])
print(cs.statistic, cs.pvalue)
3.10393832123 0.683965317354

For consistency with the hypothesised distribution, the p-value should be reasonably large (close to '1') and the statistic small (close to '0'). The values above are reasonable; when the two distributions are genuinely different, the statistic becomes large and the p-value collapses towards zero. We can omit the second argument entirely if we want to test for a uniform distribution, as in the case of die rolls or coin tosses.

Just for the fun of it, we can test the data against a normal distribution. We generate samples using the np.random.normal() method from the numpy package. To mimic die rolls, we centre the distribution at '3.5' with standard deviation '1.5', draw 1000 values, and truncate them to integers. (Note that a normal distribution is unbounded, so some truncated values fall outside '1' to '6' and land outside the histogram bins; Nf therefore need not sum to 1000, and recent SciPy versions may raise an error when the observed and expected totals disagree.) We then compute the histogram and use it as our theory_hist in the stats.chisquare method.
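A defensive way around the mismatched-totals pitfall is to rescale the expected histogram before calling stats.chisquare. A sketch with synthetic data (the observed counts reuse the die frequencies from above purely for illustration):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.RandomState(2)

# Observed: a near-uniform die histogram summing to 1000
df_sim = np.array([159, 180, 175, 171, 154, 161])

# Expected: truncated normal samples, binned the same way; values outside
# the bins are dropped, so Nf generally sums to less than 1000
N = np.int32(rng.normal(3.5, 1.5, 1000))
Nf, _ = np.histogram(N, bins=[0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5])

# Rescale the expected histogram so both totals match before the test
Nf_scaled = Nf * df_sim.sum() / Nf.sum()
ns = stats.chisquare(df_sim, Nf_scaled)
print(ns.statistic, ns.pvalue)
```

Because the observed counts are nearly uniform and the expected ones are bell-shaped, the p-value comes out vanishingly small, as in the lab's run.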

In [10]:
print('Checking for Uniform Distribution')
cs = stats.chisquare(df)  # with no second argument, a uniform distribution is assumed
print(cs.statistic, cs.pvalue)

print('Checking for Normal Distribution')
N = np.int32(np.random.normal(3.5, 1.5, 1000))
Nf, Nb = np.histogram(N, bins=[0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5])
plt.hist(N, bins=[0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5])
plt.show()
print(Nf, Nb)
ns = stats.chisquare(df, Nf)
print(ns.statistic, ns.pvalue)
Checking for Uniform Distribution
3.104 0.683955816072
Checking for Normal Distribution

[104 226 270 202  96  33] [ 0.5  1.5  2.5  3.5  4.5  5.5  6.5]
608.15923714 3.49047824931e-129

Now you see what happens when we try to match an experimental distribution against the wrong theoretical distribution. The p-value is about 10^{-129}; if that is not close to '0', nothing is! The normal distribution is decisively rejected.

A better way to test whether two datasets are drawn from the same distribution is what I call the Vodka Test! Officially, of course, it is called the KS test, or Kolmogorov-Smirnov test. It is provided in the scipy.stats package as the ks_2samp method; two datasets ds1 and ds2 are compared as stats.ks_2samp(ds1, ds2). It too returns a statistic and a p-value: a p-value close to 1 means there is no evidence that the two datasets come from different distributions, while a value close to 0 implies that the differences between them are significant and that they are dissimilar.
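A small sketch of ks_2samp in action, with synthetic data (the names, seed and distributions are illustrative):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.RandomState(3)
same_a = rng.normal(0.0, 1.0, 300)
same_b = rng.normal(0.0, 1.0, 300)   # drawn from the same distribution
shifted = rng.normal(1.0, 1.0, 300)  # mean shifted by one standard deviation

# Same distribution: typically a large p-value, no evidence of a difference
print(stats.ks_2samp(same_a, same_b).pvalue)
# Different distributions: a tiny p-value, the difference is significant
print(stats.ks_2samp(same_a, shifted).pvalue)
```

Unlike the chi-squared test, the KS test works directly on the raw samples, so no binning into histograms is needed.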

In summary, what we have seen is that analysing data is quite easy in Python. We use plot together with ttest_ind, f_oneway, chisquare and the KS test to check whether our experimental data supports or refutes our initial hypotheses. To test whether two sets of values have the same mean, we use the t-test; to test whether two sets of values with similar means have the same variance, we use a variance test such as Levene's; and to check whether two sets of values follow the same distribution, or whether experimental values follow a theoretical distribution, we use the chi-squared test or the KS test. In each case we get a value computed from the data (the statistic) and a p-value. If the p-value is very small, the data is inconsistent with the null hypothesis of "no difference"; if it is large, our initial hypothesis of a difference cannot be justified.

Problems

Note: You may refer to the previous lab sessions to solve these problems. Write the Python code for each problem in a separate file named Lab07-QuestionNo.py. Email these files as attachments in a single email to itlab2.2016@gmail.com
  1. Daily maximum and minimum temperatures for 120 days are given in the file max-min.dat. The first column contains maximum and the second column, the minimum temperatures. Do you think there is a relationship between the maximum and minimum temperatures? If so, find the equation relating maximum and minimum temperatures. Use your equation to predict the minimum temperature when the maximum temperature on a particular day is 36 degrees. Also, plot the maximum and minimum temperatures in a single plot.

    Hint: You should use linear regression and correlation coefficients.
  2. The course IT for Engineers with 100 students is taught one year by one teacher and the next year by another teacher. The marks given by the first teacher are in the file teacher_1.dat and those given by the second teacher are in the file teacher_2.dat. An enterprising student decided to use statistical tests to decide which teacher's class to join! Naturally, the student computed the average marks given by each teacher and wished to join the class with the higher average marks. What do you think? Which teacher is easier with the marks?

    Hint: Compare the distribution of marks given by the two teachers - not just the averages. Also, plot the histograms.
  3. A researcher claimed that people in cities live, on average, almost 4 years longer than those who primarily live in villages. This conclusion is based on the data given in the file cereal.dat. The first column gives how many years a person in a village lives and the second column gives how many years a person in the city lives. Do you agree with the researcher's claim?

  4. Another researcher used the data given in the file big_cereal.dat to claim that there is no difference in the average life spans of people living in cities or villages. Again, the data in the file is in two columns. The first column is the number of years lived by a person in the city and the second, by a person in a village. Do you agree with this claim?


HAVE FUN!