# A Story about Data, Part 2: Abandoning the notion of normality

Continuing on with my work, I was just about to conclude that the data were not normally distributed. However, I remembered reading about transformations that can be applied to data to make it more normal. Would any such transformation affect the normality (or lack thereof) of the score data?
I'd read about the Box-Cox family of transformations: essentially stepping through powers and their inverses in the quest to improve normality. I decided to try it, using the Jarque-Bera statistic as a measure of the normality of the data.
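To make this concrete, here's a small Ruby sketch of both pieces, assuming strictly positive score values; the method names and the lambda grid are mine, not from any particular library:

```ruby
# Box-Cox transform: powers for lam != 0, log for lam == 0.
# Assumes strictly positive data.
def box_cox(xs, lam)
  if lam.zero?
    xs.map { |x| Math.log(x) }
  else
    xs.map { |x| (x**lam - 1.0) / lam }
  end
end

# Jarque-Bera statistic: lower means closer to normal (it measures
# departure of skewness from 0 and kurtosis from 3).
def jarque_bera(xs)
  n    = xs.size.to_f
  mean = xs.sum / n
  m2   = xs.sum { |x| (x - mean)**2 } / n
  m3   = xs.sum { |x| (x - mean)**3 } / n
  m4   = xs.sum { |x| (x - mean)**4 } / n
  skew = m3 / m2**1.5
  kurt = m4 / m2**2
  (n / 6.0) * (skew**2 + ((kurt - 3.0)**2) / 4.0)
end

# Scan a grid of lambdas and keep the one minimising the JB statistic.
def best_lambda(xs, lams = (-2.0).step(2.0, 0.25).to_a)
  lams.min_by { |l| jarque_bera(box_cox(xs, l)) }
end
```

Note that lambda = 1 is just a shift by one, so it leaves the shape of the distribution, and hence the JB statistic, unchanged; that makes it a useful sanity check for the scan.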

# A Story about Data, Part 1: The shape of the data

Note about the visualisations: All of the plotting was done with Basis-Processing. You’ll find its source here.

The current dataset that I'm working with comes from the education domain. Roughly, there are 29,000 records; each record lists the following:

• Location of the student’s school
• Language of the student
• Student’s score before intervention
• Student’s score after intervention

# Interacting with Graphs : Mouse-over and lambda-queuer

In the previous post, I described how I'd put together a basic system to drive data selection/exploration through a queue. While generating more graphs, it became evident that the code for mouseover interaction followed a specific pattern. More importantly, using Basis to plot things mandated that I look at the inverse problem: determining the original point from the point under the mouse pointer. In this case, it was pretty simple, since I'm only dealing with 2D points. Here's a video of what it looks like; the example shows the exploration of a covariance matrix.
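The inverse problem reduces to inverting the plotting transform. Here's a minimal sketch, assuming a plain scale-and-offset mapping (the class and its API are illustrative, not Basis's actual interface):

```ruby
# Forward and inverse mappings for a 2D plot. On screen, y grows
# downward, so the y mapping is flipped relative to the data space.
class ScreenMap
  def initialize(x_scale:, y_scale:, x_offset:, y_offset:)
    @sx, @sy, @ox, @oy = x_scale, y_scale, x_offset, y_offset
  end

  # Data point -> pixel coordinates.
  def to_screen(x, y)
    [@ox + x * @sx, @oy - y * @sy]
  end

  # Pixel coordinates (e.g. the mouse position) -> original data point.
  def to_data(px, py)
    [(px - @ox) / @sx.to_f, (@oy - py) / @sy.to_f]
  end
end
```

The round trip `to_data(*to_screen(x, y))` recovers the original point, which is exactly the property the mouseover code needs.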

# Driving data visualisation over a queue using RabbitMQ and lambda-queuer

One of the things which has bothered me ever since I took the dive into visualisation is the problem of interactivity. The aim of interacting with a visualisation is to drill down or explore areas of the visualisation which are (or seem) interesting. Put another way, we are essentially filtering the data from a visual standpoint. In most cases, mouse interactions may be sufficient. But what if I wanted to be able to filter the data programmatically and have the result reflected in the visualisation?

One way is to simply re-run the code which generates the visualisation each time we use a different filter. This is the simplest approach and, in many cases, enough; here, the modification to the code is made offline. But what if we wanted to do the same while the program is running? This post describes my attempt at one such implementation. It's still somewhat primitive, but we'll see where it ends up. For the purposes of demonstration, I used the Parallel Coordinates visualisation, which is available on GitHub. I'll continue using Processing through Ruby-Processing for this description.
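The shape of the idea can be sketched with Ruby's stdlib Thread::Queue standing in for RabbitMQ (the broker wiring is omitted and the names are illustrative): filters arrive on a queue as lambdas, and a consumer re-derives the subset of data the visualisation should draw.

```ruby
require "thread"

data    = [{ score: 10 }, { score: 55 }, { score: 90 }]
filters = Queue.new
results = Queue.new

# A consumer thread that applies each incoming filter lambda to the
# data; a nil on the queue signals shutdown.
consumer = Thread.new do
  while (f = filters.pop)
    results.push(data.select(&f))
  end
end

# "Send" a filter, as a remote client would via the message broker.
filters.push ->(record) { record[:score] > 50 }
selected = results.pop

filters.push nil
consumer.join
```

In the real setup, the lambda arrives as a message over RabbitMQ and `selected` feeds straight into the draw loop instead of being popped here.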

# Installing the Basis gem (updated for v0.5.1+)

You can use Ruby-Processing in two ways:

• Use the jruby-complete.jar that Ruby-Processing ships (the Gems-in-a-Jar approach). In this mode, all gems you install will be packaged as part of the JAR.
• Use a conventional JRuby installation.

If you're following the first approach, head to the location of jruby-complete.jar in your Ruby-Processing installation. There, do this:

java -jar jruby-complete.jar -S gem install basis-processing --user-install


Alternatively, if you’re using a conventional JRuby installation, do this:

sudo jruby -S gem install basis-processing


# A guide to using Basis (updated for v0.6.0+)

This is a quick tour of Basis. Find the source for Basis on GitHub. Installing Basis is pretty simple; just grab it as a gem for your JRuby installation. Brief notes on the installation can be found here.

UPDATE: Starting from version 0.6.0, Basis allows you to specify axis labels. Additionally, you can specify arrays of points instead of plotting points one at a time. When you do this, you can also specify a corresponding legend string, which will show up in a legend guide. See below for more details.

UPDATE: Starting from version 0.5.9, you can turn grid lines on or off. Additionally, the matrix operations implementation has been ported to use the Matrix class in Ruby’s stdlib.

UPDATE: Starting from version 0.5.8, you can customise axis labels, draw arbitrary shapes/text/plot custom graphics at any point in your coordinate system. See below for more details.

UPDATE: With version 0.5.7, experimental support has been added for drawing objects which aren't points. Interactions with such objects are currently not supported. Additional support for drawing markers/highlighting in custom bases is now in.

UPDATE: Starting from version 0.5.1, Basis has been ported to Ruby 1.9.2, because of the kd-tree library dependency. Currently, there are no plans to maintain Basis compatibility with Ruby 1.8.x. As an aside, I personally recommend using RVM to manage the mess of Ruby/JRuby installations that you're likely to have on your machine.

UPDATE: Basis has hit version 0.5.0 with experimental support for mouseover interactivity. More work is incoming, but the demo code below is up-to-date, for now. The code below should be the same as demo.rb on GitHub.

# Basis: Plotting arbitrary coordinate systems in Ruby-Processing

One of the first things I realised while working on visualisations in Processing is that a lot of the work required in setting up coordinate systems and plotting them is somewhat of a chore. Specifically, for things like parallel coordinates, multiple axes, each with its own scaling, I initially ended up with some pretty ugly custom code for each case. I did look around in the Libraries section of the Processing website, but didn’t find anything specific to manipulating and plotting coordinate axes.

# Data interactions in parallel coordinates: 40x-60x speedup

This is an update on the visualisation post on parallel coordinates. Understanding the Processing model made me realise that it probably wasn't a good idea to redraw all the samples each time draw() was called. Since a call of draw() does not clear away the previous frame's graphics, this was easy to fix: in the end, I explicitly drew only the samples which were under the current mouse position.

The speedup is obvious and massive: whereas the previous version worked well with only 300 samples, the current one processes 18,000 samples without breaking a sweat. At 29,000 samples, there is a bit of a slowdown, but only just: you wait 1 second for the highlighting instead of 6-7.
Here’s the new video, using 18k samples. Notice how much denser the mesh is.
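The strategy reduces to a pure filtering step, which can be sketched like this (the tolerance and names are illustrative): find only the samples near the mouse, then draw those on top of the persistent frame.

```ruby
# Given samples as polylines (arrays of [x, y] screen points), return
# only those passing near the mouse position. Only these get redrawn;
# everything else persists from earlier frames.
def samples_under_mouse(samples, mx, my, tolerance: 4)
  samples.select do |polyline|
    polyline.any? do |(px, py)|
      (px - mx).abs <= tolerance && (py - my).abs <= tolerance
    end
  end
end
```

The cost per frame drops from "all samples" to "samples near the mouse", which is why the sample count can grow by two orders of magnitude.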

# Data interactions in parallel coordinates

Processing is growing on me. Inspired by the different and (very) interesting data visualisation examples I've seen, I decided to take a shot at interacting with the parallel coordinates that I generated here. Of course, I had to reduce the number of samples for this demonstration; it'd slow to an unholy crawl otherwise. For this video, I've taken 300 samples. The interaction is essentially a mouse-hover highlighting of any sample(s) under it. I fiddled with the colors a bit, but decided that a white-on-greyscale scheme would show up better.
Of course, I still haven't gotten around to labeling the axes. This I'll probably pick up next. But as the video demonstrates, there's more to Processing than meets the eye.

PS: By the way, the actual demonstration ends around the halfway mark; I was trying to figure out how to stop the bloody recorder.

# Parallel Coordinate visualisation of 28k, 5-dimensional data

This is the visualisation of the same dataset that I’ve been working on for a while, for exploring different data mining and visualisation techniques. Currently, the axes aren’t labelled, and the color coding is for different categories. Looks like a really interesting way to explore the data.

# Decision Trees

Continuing on with my exploration of the data mining landscape, I extracted a decision tree from the data under scrutiny. It is the same data (20,000 samples, 56 dimensions), but the dimensions I've chosen for partitioning are a bit different from the raw attributes. I've conflated the 56 dimensions into a single number, since this is a test score we're talking about, and I'm not sure modelling the individual responses would be the best idea for constructing the tree. I'm not really looking for close fits; buckets or bins should be adequate for discretising the response space.
Accordingly, I've partitioned the pre-scores as EXCELLENT, GOOD, AVERAGE, etc. The attribute that I'm attempting to predict is also a score: the post-score. Well, not quite the score itself; the improvement in score is a more sensible metric to attempt to guess.
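The discretisation step might look something like this; the bucket names follow the post, but the cut-offs and method names are made up for illustration:

```ruby
# Map a conflated score onto coarse bins; the thresholds here are
# illustrative placeholders, not the ones used in the actual analysis.
def score_bucket(score)
  case score
  when 45..Float::INFINITY then :EXCELLENT
  when 35...45             then :GOOD
  when 20...35             then :AVERAGE
  else                          :POOR
  end
end

# The predicted attribute: improvement between pre- and post-scores,
# itself coarsened into a handful of labels.
def improvement_bucket(pre, post)
  return :SAME if score_bucket(post) == score_bucket(pre)
  post > pre ? :IMPROVED : :DECLINED
end
```

Each record then contributes a (partition-attributes, improvement label) pair to the tree-building step.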

I had a problem with trying to visualise the data, but I've been able to make do with indenting the different levels of decision nodes; this should be fine till I really need to use a visualisation library. I think the code could use a bit of work – probability notation does not lend itself easily to elegant variable naming. I'll probably write a lot more on this topic once I'm into Bayesian Nets.

# Matrix Theory: Diagonalisation and Eigenvector Computation

$A = \left( \begin{array}{cc} 2 & 0 \\ 0 & 5 \end{array} \right)$

The operation it performed on basis vectors of the standard basis S was one of scaling, and scaling only. When operated on by a linear transformation matrix, if a vector is only scaled as a consequence, that vector is an eigenvector of the matrix, and the scalar is the corresponding eigenvalue. This is just the definition of an eigenvector, which I rewrite below:

$A x = \lambda x$
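This can be checked numerically with the Matrix class in Ruby's stdlib; a small sketch for the diagonal matrix above:

```ruby
require "matrix"

# For the diagonal matrix above, the standard basis vectors are
# eigenvectors: A merely scales them, by 2 and 5 respectively.
a  = Matrix[[2, 0], [0, 5]]
e1 = Vector[1, 0]
e2 = Vector[0, 1]

raise "not an eigenvector" unless a * e1 == e1 * 2  # A e1 = 2 e1
raise "not an eigenvector" unless a * e2 == e2 * 5  # A e2 = 5 e2

# The stdlib can also recover the eigenvalues directly.
eigenvalues = a.eigensystem.eigenvalues.map(&:round).sort
raise unless eigenvalues == [2, 5]
```

For a diagonal matrix the eigenvalues are just the diagonal entries, which is what makes it the cleanest illustration of the definition.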

For my graduation project, I'd written a machine vision/2D image algorithms system called IRIS. We'd used it to drive robots around, integrating sonar and visual data. However, that's not what reawakened my interest in taking another look at the IRIS code. Currently, playing around with data sets has had me rifling through books and equations I liked looking at in college. It is almost like a second education, and I think it only right that I get IRIS up and running, if only to steal some code from it (even though it's in C++, and I'm currently doing my investigations in Ruby).

With that said, I dug into my old SourceForge account, where (to my somewhat irrational surprise) the code was still untouched. However, that code will probably not compile as-is. Even though it had been compiled under Linux, it had dependencies on drivers for hardware like the sonar systems and the webcam. I'm still not exactly sure whether I want to get all those dependencies resolved; they aren't my primary focus at this point. So I stripped out whatever wasn't required and pushed the clean, compiling source to GitHub here.

# Playing around with Self Organising Maps

(Click the image to see the evolution of the SOM)

The image above was generated off 200 samples of a large data set. Sample vectors were 56-dimensional bit strings. The similarity measure used was the Hamming Distance. Brighter green represents values at a higher Hamming Distance with respect to zero.
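For reference, the Hamming distance over bit strings is just a count of differing positions; a minimal sketch:

```ruby
# Hamming distance between two equal-length bit strings, as used for
# the SOM's similarity measure.
def hamming(a, b)
  raise ArgumentError, "lengths differ" unless a.length == b.length
  a.chars.zip(b.chars).count { |x, y| x != y }
end
```

The brightness mapping then follows from the distance to the all-zeros string, which is simply the number of 1-bits in the sample vector.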
The (very dirty) code is up at Github here.
Unrelated: I’ve been watching Leonard Susskind’s lectures on Statistical Mechanics; they’re a tour de force.

# A pipeline for adaptive bitrate video encoding

I’ve been working on something unusual lately, namely, building a pipeline for encoding video files into formats suitable for HTTP Live Streaming. The actual job of encoding into different formats at different bit rates and resolutions is done using a combination of ffmpeg and x264. To me, the interesting part lies in how we have tried to speed up the process, using the venerable Map-Reduce approach. Before I dive into the details, here’s a quick review of the basic idea of HLS.

Put very simply, adaptive streaming serves video content in multiple qualities, letting the streaming client select which quality to use based on the bandwidth available on the consumer side. This is not a one-time choice: depending upon the segment duration of the encode, the client can switch to higher or lower resolutions dynamically throughout the entire playback of the video stream.
How is this accomplished?
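In outline: the server exposes a master playlist that points at one media playlist per quality level, and the client measures its throughput and picks the variant it can sustain. A minimal illustrative master playlist (the paths, bandwidths, and resolutions here are made up):

```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
360p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1400000,RESOLUTION=842x480
480p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2800000,RESOLUTION=1280x720
720p/playlist.m3u8
```

Each media playlist then lists the short segments for its rendition; because every rendition is cut at the same boundaries, the client can switch variants between segments without a visible seam.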

# Filesystems: my current reading list

Stuff I'm reading now, specific to filesystems… Reading Linux kernel source requires a stout heart if you've never done it before, and a bit of a shift in mindset: it's not all objects anymore.

# iOS AppDev Patterns: Linked Content Cursors

I’ve already talked about the Content Cursor pattern. This post is an extension of that idea to increase the flexibility of layout across sections.

To understand the problem, let us revisit a page from our hypothetical iPad magazine.
Here’s the layout of the page in portrait mode.

…and here is the same page in landscape mode.

The first important thing to notice here is that the two Politics sections have changed in position and/or size. More specifically, the upper Politics section has morphed into a tall rectangle, while the lower one has stretched horizontally.

# iOS AppDev Patterns: Content Cursor

iOS offers only the most barebones approach to placing content in a view: specifying absolute coordinates. Of course, one can use autoresizing to make sure the positions of these contents are modified proportionally, but the initial positioning of a content block needs the exact x- and y-coordinates of the top left of its 'box'. This can render the layout inflexible, tedious and brittle: every small change in the position of a block has ripple effects on the positions of succeeding blocks.

Content Cursor solves this problem.
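The pattern itself is simple enough to sketch outside of iOS. Here is a minimal Ruby sketch (the class and method names are mine, not from any iOS API): a cursor hands out frames and advances past each placed block, so a change in one block's height automatically shifts everything after it.

```ruby
# A cursor that lays out blocks vertically: callers ask for a frame of
# a given size, and the cursor advances past it plus some padding.
class ContentCursor
  attr_reader :x, :y

  def initialize(x: 0, y: 0, padding: 8)
    @x, @y, @padding = x, y, padding
  end

  # Returns the frame for a block of the given size at the cursor
  # position, then advances the cursor below the placed block.
  def place(width, height)
    frame = { x: @x, y: @y, width: width, height: height }
    @y += height + @padding
    frame
  end
end
```

With this, no block carries an absolute y-coordinate; resizing one block reflows everything below it for free, which is precisely the brittleness the absolute-coordinates approach suffers from.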