A Story about Data, Part 1: The shape of the data

Note about the visualisations: All of the plotting was done with Basis-Processing. You’ll find its source here.

The dataset that I’m currently working with comes from the education domain. There are roughly 29000 records, and each record lists the following:

  • Location of the student’s school
  • Language of the student
  • Student’s score before intervention
  • Student’s score after intervention


The score is not a single number; it is a set of 56 responses, each marked 0 or 1. Generally, a 1 may be treated as a favourable answer, so adding the responses up gives a single aggregate score with a natural ordering: a sense of who did better.
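A minimal sketch of that aggregation, assuming each score arrives as a list of 56 integers that are either 0 or 1. The function name and the example responses are made up for illustration; they are not taken from the dataset.

```python
from typing import Sequence

N_ITEMS = 56  # number of 0/1 responses per score, as described above


def aggregate_score(responses: Sequence[int]) -> int:
    """Sum the 0/1 responses into a single total score."""
    if len(responses) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} responses, got {len(responses)}")
    if any(r not in (0, 1) for r in responses):
        raise ValueError("every response must be 0 or 1")
    return sum(responses)


# Hypothetical student: 40 favourable answers followed by 16 unfavourable ones.
example_responses = [1] * 40 + [0] * 16
example_total = aggregate_score(example_responses)
print(example_total)
```

The total always lies between 0 and 56, which is what makes the ordering between students meaningful.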
Now, 29000 records is not a lot, but analysing them is not exactly trivial either. The following paragraphs summarise the visualisations and analyses I’ve run on the data so far. Note that this is a work in progress.

Visualisation

Before any sort of data transformation or dimension reduction, I considered the sorts of visualisation which would help me in one of three ways:

  • Raw data exploration
  • Raw data summary
  • Raw data distribution

Raw data exploration

There are several ways of exploring raw data in n dimensions. I chose the one which seemed most intuitive: Parallel Coordinates. Parallel coordinates simplify the issue of data representation by plotting n parallel axes, one for each of the n dimensions. For a single data point, a point is marked on each axis according to the corresponding value, and adjacent points are joined by straight line segments. The result is that a single sample becomes a series of line segments connected end-to-end.

The picture below shows an example; you can find more details here and here.
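To make the construction concrete, here is a small sketch that computes the vertices of the polyline for each record, without drawing anything. It assumes every axis is rescaled to the range 0 to 1 using the column minimum and maximum, which is one common choice and not necessarily what Basis-Processing does. The three records are invented.

```python
from typing import List, Sequence, Tuple


def parallel_coordinates_polyline(
    record: Sequence[float],
    mins: Sequence[float],
    maxs: Sequence[float],
) -> List[Tuple[float, float]]:
    """Return (axis_index, scaled_height) vertices for one record."""
    points = []
    for axis, (value, lo, hi) in enumerate(zip(record, mins, maxs)):
        # A constant column has no spread, so place it mid-axis.
        height = 0.5 if hi == lo else (value - lo) / (hi - lo)
        points.append((float(axis), height))
    return points


# Made-up records with three dimensions each.
records = [[10, 3, 40], [20, 1, 35], [15, 2, 56]]
mins = [min(column) for column in zip(*records)]
maxs = [max(column) for column in zip(*records)]

polylines = [parallel_coordinates_polyline(r, mins, maxs) for r in records]
for line in polylines:
    print(line)
```

Joining consecutive vertices of one list with straight segments gives the broken line for that sample.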


Raw data summary

I’ve heard this repeated endlessly: measures based on invalid assumptions are…invalid. No measure is affected as greatly by this idea as the summary of data. Because we are looking for a shorthand to characterise the data, it is important to decide on the correct shorthand. Nothing too complicated, but good summaries (graphical or numerical) are resistant to outliers, and tell some story about the data.
Here is a box plot of the data set, broken down by language. The top row shows the pre-intervention score, and the one below shows the post-intervention results. The languages which do not show up (Bengali, English, Gujarathi, MultiLng, Nepali, NotKnown and Oriya) had too few samples to accurately determine any of the quartiles, and thus had to be discarded for this summary.
You can read more about box plots here.
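For reference, the numbers behind a box plot can be sketched as follows. This assumes the "inclusive" quantile method from Python's `statistics` module and the conventional 1.5 × IQR fences for flagging outliers; the plotting library used for the figures may compute quartiles differently. The sample scores are invented.

```python
import statistics
from typing import Dict, List, Sequence


def box_plot_stats(scores: Sequence[float]) -> Dict[str, object]:
    """Quartiles, IQR fences and outliers for one group of scores."""
    # statistics.quantiles needs at least two data points, which mirrors
    # the problem of groups with too few samples.
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers: List[float] = [
        s for s in scores if s < lower_fence or s > upper_fence
    ]
    return {
        "min": min(scores),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(scores),
        "outliers": outliers,
    }


# Invented scores; the final value is deliberately extreme.
sample_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]
summary = box_plot_stats(sample_scores)
print(summary)
```

Because the median and quartiles depend on rank rather than magnitude, the extreme value barely moves them, which is the resistance to outliers mentioned above.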


Raw data distribution

This is where things start to get interesting. I was initially interested in seeing if any of the data came close to the Normal distribution. I selected the pre-intervention and the post-intervention total scores first. They are reproduced below.

Right, so it looks like they are nowhere close to a Normal distribution. A couple of interesting things to note: the lowest and the highest bins of the histogram have uncommonly high counts. This _might_ be because of bad data, but further cross-checking revealed no discrepancies. It bears further investigation anyway.

But appearances might be deceptive, so I ran two more tests: one numerical, and one graphical (a Q-Q plot). Note that at this point, I’m still essentially dealing with univariate data.

The test I chose for deviation from normality is the Jarque-Bera test, which uses the skewness and the kurtosis of the sample as intermediate values to calculate the Jarque-Bera (JB) statistic. This is then compared against a chi-square distribution with two degrees of freedom to test the null hypothesis that the data comes from a Normal distribution.
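The two intermediate values can be sketched as ratios of central moments. This is the textbook population-moment definition, with kurtosis reported in its raw form (3 for a Normal distribution); it is an assumption that the tool used for this post follows the same convention. The sample is invented.

```python
from typing import Sequence, Tuple


def skewness_and_kurtosis(sample: Sequence[float]) -> Tuple[float, float]:
    """Return (skewness, raw kurtosis) from population central moments."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((x - mean) ** 2 for x in sample) / n  # variance
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    if m2 == 0:
        raise ValueError("sample has zero variance")
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # raw kurtosis; subtract 3 for excess kurtosis
    return skewness, kurtosis


# Invented, perfectly symmetric sample.
demo_skewness, demo_kurtosis = skewness_and_kurtosis([1, 2, 3, 4, 5])
print(demo_skewness, demo_kurtosis)
```

A symmetric sample has zero skewness, and a flat one like this has a raw kurtosis well below the Normal value of 3.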

I used a handy chi-square calculator found here.

Anyway, the details of the Jarque-Bera test for the pre-intervention and post-intervention data are listed below:

Pre-intervention Score
n = 28535.0
Skewness = -1.0001504234198
Kurtosis = -1.99959305171352
JB statistic = 34476.3843030411

Post-intervention Score
n = 28535.0
Skewness = -1.00010106352368
Kurtosis = -1.9997273050813
JB statistic = 34477.5108572449

Well, the JB statistic is so high as to be laughable. Using a significance level (alpha) of 0.05 in the chi-square calculator, with two degrees of freedom, gives a critical value of 5.9914, and our JB statistic in both cases is much, much higher than that.
I wonder, though, whether this is because the probabilities of most bins are less than 0.05 to begin with.
For the moment, I rejected the null hypothesis for both the pre- and the post-intervention scores.
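The comparison can be reproduced without a lookup table. The sketch below uses two facts: a chi-square distribution with two degrees of freedom has CDF 1 - exp(-x/2), so its critical value is -2 ln(alpha); and the usual JB formula is n/6 × (S² + (K - 3)²/4). Plugging the skewness and kurtosis listed above into that formula, with the kurtosis treated as raw kurtosis, gives back the listed JB values. That suggests, but does not confirm, that this is the formula behind the figures.

```python
import math

ALPHA = 0.05  # significance level used in the text


def chi_square_critical_df2(alpha: float) -> float:
    """Upper-tail critical value of chi-square with 2 degrees of freedom."""
    return -2.0 * math.log(alpha)


def jb_statistic(n: float, skewness: float, kurtosis: float) -> float:
    """JB statistic; assumes `kurtosis` is raw (Normal reference value 3)."""
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)


critical = chi_square_critical_df2(ALPHA)
pre_jb = jb_statistic(28535, -1.0001504234198, -1.99959305171352)
post_jb = jb_statistic(28535, -1.00010106352368, -1.9997273050813)

for label, jb in (("pre", pre_jb), ("post", post_jb)):
    verdict = "reject" if jb > critical else "do not reject"
    print(f"{label}: JB = {jb:.2f}, critical = {critical:.4f}, {verdict}")
```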


One final, graphical test, the Quantile-Quantile plot, remains.
This gives a graphical sense of the deviation of the distribution from the Normal distribution.
The pre-intervention and post-intervention Q-Q plots are shown, in their respective order.

If either of those datasets were roughly Normal, its points would follow the green curve. As it turns out, neither does.
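For the curious, the points of a Normal Q-Q plot can be computed as below. This is a sketch under two assumptions: the reference Normal uses the sample's own mean and standard deviation, and the plotting positions are (i - 0.5)/n, one of several conventions in use. The plots above may have been produced differently. The sample is invented.

```python
from statistics import NormalDist, mean, pstdev
from typing import List, Sequence, Tuple


def normal_qq_points(sample: Sequence[float]) -> List[Tuple[float, float]]:
    """Pair each ordered observation with its theoretical Normal quantile."""
    ordered = sorted(sample)
    n = len(ordered)
    # Raises StatisticsError later if the sample has zero spread.
    reference = NormalDist(mean(ordered), pstdev(ordered))
    points = []
    for i, observed in enumerate(ordered, start=1):
        p = (i - 0.5) / n  # plotting position, strictly between 0 and 1
        points.append((reference.inv_cdf(p), observed))
    return points


# Invented sample; real scores would be used in its place.
qq_points = normal_qq_points([3, 1, 4, 2, 5])
for theoretical, observed in qq_points:
    print(round(theoretical, 3), observed)
```

If the data were Normal, plotting observed against theoretical values would give points close to the line y = x.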
Before moving on to the next stage of analysis, I decided to try another metric, namely, the improvement in score. This is calculated very simply by subtracting the pre-intervention score from the post-intervention score.
Let’s take a look at the shape of the data.
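As a sketch, the improvement metric is just an element-wise difference between paired scores. The scores here are invented, and the pairing assumes the two lists are in the same student order.

```python
from statistics import mean
from typing import List, Sequence


def score_improvements(pre: Sequence[int], post: Sequence[int]) -> List[int]:
    """Post-intervention minus pre-intervention score, per student."""
    if len(pre) != len(post):
        raise ValueError("pre and post must have the same number of students")
    return [after - before for before, after in zip(pre, post)]


# Invented scores for three students; negative values mean a drop.
pre_scores = [20, 35, 50]
post_scores = [30, 33, 56]
gains = score_improvements(pre_scores, post_scores)
print(gains, mean(gains))
```

Since each total lies between 0 and 56, the improvement is bounded between -56 and 56.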

Oh, hmm…that’s pretty interesting; this does look like a Normal distribution, or at least one of its variants. Maybe a Cauchy, I don’t know. But there is definitely a pattern here. In this case, it even makes sense to calculate the mean improvement, which comes out to 7.22.

I spent some time trying to fit different distributions to the data, without much success. For example, here is the best Normal distribution I could fit without it becoming grossly distorted.
This particular Normal curve has a mean of 7.22 and a variance of 245.47.

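I tried fitting the Cauchy distribution as well. There was some agreement, but only after a log-transform of the improvement data. This particular Cauchy curve had a location parameter of 4.13 and a scale parameter of 0.14. (A Cauchy distribution has no defined mean, so the location parameter plays the role of the centre.)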
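The two candidate curves can be written down directly from their textbook density formulas. The sketch below plugs in the parameters quoted above, treating the 4.13 as the Cauchy location parameter (a Cauchy distribution has no defined mean). Note that the Normal parameters refer to the raw improvement data, while the Cauchy parameters refer to the log-transformed data, so the two curves live on different scales.

```python
import math


def normal_pdf(x: float, mean: float, variance: float) -> float:
    """Density of a Normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(
        2.0 * math.pi * variance
    )


def cauchy_pdf(x: float, location: float, scale: float) -> float:
    """Density of a Cauchy distribution with the given location and scale."""
    z = (x - location) / scale
    return 1.0 / (math.pi * scale * (1.0 + z * z))


# Parameters quoted in the text.
NORMAL_MEAN, NORMAL_VARIANCE = 7.22, 245.47
CAUCHY_LOCATION, CAUCHY_SCALE = 4.13, 0.14

normal_peak = normal_pdf(NORMAL_MEAN, NORMAL_MEAN, NORMAL_VARIANCE)
cauchy_peak = cauchy_pdf(CAUCHY_LOCATION, CAUCHY_LOCATION, CAUCHY_SCALE)
print(normal_peak, cauchy_peak)
```

One scale unit away from its centre, the Cauchy density has fallen to exactly half its peak, and far from the centre its tails decay much more slowly than those of a Normal curve.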

Let’s see what the Jarque-Bera test has to say for the untransformed data:

Score Improvement
n = 28535.0
Skewness = -4.24685767755943
Kurtosis = 37.1765663807246
JB statistic = 1474523.40413686

UPDATE: This supposed ‘discrepancy’ puzzled me for a while till the StackExchange hivemind told me that the distribution is nowhere near Normal. I guess I should have squinted at the data a little less 🙂

Again, it says that the data is not normally distributed. This is an interesting case where the data ‘looks’ normal to the eye, but a statistic implies otherwise. Let’s do the Q-Q plot to see how it comes out.

There does seem to be some trend in the deviation: the data tracks the Normal probability plot somewhat, but departs from it consistently. There is some agreement, but on the whole, and sadly, I conclude that this data (while following some trend) is not Normal.

Having said that, the Cauchy distribution fits the transformed data a lot better, though all transformations I’ve read about are carried out with the aim of fitting the Normal distribution to the data, not any other.

In the next post, I’ll delve into some more visualisations, as well as start establishing (or rejecting) hypotheses regarding dependency between multiple variables.
Also, some more interpretation of all the information I’ve presented above is in order; expect some of that too.