Continuing on with my work, I was just about to conclude that the score data is not normally distributed. However, I remembered reading about transformations that can be applied to data to make it more normal. Are any such transformations likely to have an effect on the normality (or the lack thereof) of the score data?

I’d read about the Box-Cox family of transformations: essentially proceeding through powers and their inverses in the quest to improve normality. I decided to try it, using the Jarque-Bera statistic as a measure of the normality of the data.
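The idea can be sketched with SciPy, which picks the Box-Cox exponent by maximum likelihood rather than by stepping through powers manually. The lognormal sample below is a stand-in for the actual score data, which isn't reproduced here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical right-skewed data standing in for the real scores.
scores = rng.lognormal(mean=3.0, sigma=0.5, size=1000)

# Jarque-Bera: larger statistic => stronger departure from normality.
jb_before, _ = stats.jarque_bera(scores)

# Box-Cox requires strictly positive data; SciPy chooses lambda by MLE.
transformed, lam = stats.boxcox(scores)
jb_after, _ = stats.jarque_bera(transformed)

print(f"lambda={lam:.3f}, JB before={jb_before:.1f}, JB after={jb_after:.1f}")
```

For a lognormal sample the fitted lambda lands near zero (the log transform), and the Jarque-Bera statistic drops accordingly.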

Continue reading A Story about Data, Part 2: Abandoning the notion of normality

# Tag Archives: statistics

# A Story about Data, Part 1: The shape of the data

**Note about the visualisations**: All of the plotting was done with Basis-Processing. You’ll find its source here.

The current dataset that I’m working with comes from the education domain. Roughly, there are 29000 records; each record lists the following:

- Location of the student’s school
- Language of the student
- Student’s score before intervention
- Student’s score after intervention
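A record with those four fields might be modelled like this (the field names are my own; the post doesn't give the actual schema):

```python
from dataclasses import dataclass

@dataclass
class Record:
    school_location: str   # location of the student's school
    language: str          # language of the student
    score_before: float    # score before intervention
    score_after: float     # score after intervention

# Hypothetical example record; the gain is the natural derived quantity.
r = Record("District A", "English", 42.0, 57.5)
gain = r.score_after - r.score_before
print(gain)
```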

Continue reading A Story about Data, Part 1: The shape of the data

# Interacting with Graphs : Mouse-over and lambda-queuer

In the previous post, I described how I’d put together a basic system to drive data selection/exploration through a queue. While generating more graphs, it became evident that the code for mouse-over interaction followed a specific pattern. More importantly, using Basis to plot meant that I had to look at the inverse problem: determining the original data point from the point under the mouse pointer. In this case, it was pretty simple, since I’m only dealing with 2D points. Here’s a video of how it looks. The example shows the exploration of a covariance matrix.
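The inverse problem is simple in 2D because plotting typically applies a per-axis affine map from data space to screen space, which inverts directly. A minimal sketch of the idea (not Basis's actual API):

```python
def data_to_screen(p, scale, offset):
    # Forward map: per-axis scale then translate (data -> pixels).
    x, y = p
    sx, sy = scale
    ox, oy = offset
    return (x * sx + ox, y * sy + oy)

def screen_to_data(q, scale, offset):
    # Inverse map: undo the translation, then the scaling.
    x, y = q
    sx, sy = scale
    ox, oy = offset
    return ((x - ox) / sx, (y - oy) / sy)

# Negative y-scale because screen y usually grows downward.
pt = (2.5, -1.0)
screen = data_to_screen(pt, scale=(100, -100), offset=(50, 400))
back = screen_to_data(screen, scale=(100, -100), offset=(50, 400))
print(back)  # recovers (2.5, -1.0)
```

With the affine map inverted exactly, hit-testing reduces to mapping the mouse position back into data space and finding the nearest data point there.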

Continue reading Interacting with Graphs : Mouse-over and lambda-queuer

# Playing around with Self Organising Maps

(Click the image to see the evolution of the SOM)

The image above was generated from 200 samples of a large dataset. The sample vectors were 56-dimensional bit strings, and the similarity measure used was the Hamming distance. Brighter green represents values at a greater Hamming distance from zero.
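One step of such a SOM can be sketched as follows. This is my own toy reconstruction (smaller grid and vectors than the 56-bit originals; the actual code linked below may differ): find the best-matching unit by Hamming distance, then pull neighbouring units toward the sample by probabilistically flipping disagreeing bits.

```python
import numpy as np

rng = np.random.default_rng(1)

GRID, DIM = 6, 16
# Units initialised to random bit strings.
som = rng.integers(0, 2, size=(GRID, GRID, DIM))

def hamming(a, b):
    return int(np.count_nonzero(a != b))

def best_matching_unit(som, x):
    # Exhaustive search for the unit closest to x in Hamming distance.
    d = np.count_nonzero(som != x, axis=2)
    return np.unravel_index(np.argmin(d), d.shape)

def update(som, x, bmu, radius=1, rate=0.5):
    # For units within `radius` (Chebyshev) of the BMU, flip each
    # disagreeing bit toward the sample with probability `rate`.
    bi, bj = bmu
    for i in range(GRID):
        for j in range(GRID):
            if max(abs(i - bi), abs(j - bj)) <= radius:
                mask = (som[i, j] != x) & (rng.random(DIM) < rate)
                som[i, j][mask] = x[mask]

x = rng.integers(0, 2, size=DIM)
bmu = best_matching_unit(som, x)
before = hamming(som[bmu], x)
update(som, x, bmu)
after = hamming(som[bmu], x)
print(before, "->", after)
```

Since the update only ever flips bits toward the sample, the BMU's Hamming distance to it can only shrink or stay put; iterating over many samples with a decaying radius and rate produces the kind of evolving map shown in the image.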

The (very dirty) code is up at Github here.

Unrelated: I’ve been watching Leonard Susskind’s lectures on Statistical Mechanics; they’re a *tour de force*.