This is a visualisation of the same dataset I’ve been working on for a while, for exploring different data mining and visualisation techniques. The axes aren’t labelled yet, and the colour coding marks different categories. It looks like a really interesting way to explore the data.
I should have seen it coming. Text communicates well – up to a point. All the current analyses I’ve been working on, starting from Self-Organising Maps to Decision Trees, are very well served by good, solid visualisation. My current need is a way to visualise data structures effectively, even if it is merely a bunch of nodes which can be expanded/collapsed to show more information. Additionally, it would be nice (not necessary) for this to happen interactively, but I don’t mind a command line-driven approach. In fact, I prefer the command line; makes it easier to drive it through a scripting language like Ruby.
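As a sketch of what I mean, here is roughly the kind of command-line-driven, expandable/collapsible node structure I have in mind, in Ruby. All the names here are hypothetical, not from any library:

```ruby
# A minimal sketch: nodes that can be expanded or collapsed, rendered
# entirely from the command line so it can be driven from a Ruby script.
class Node
  attr_reader :label, :children
  attr_accessor :expanded

  def initialize(label, children = [])
    @label    = label
    @children = children
    @expanded = false
  end

  # Print this node; recurse into children only if the node is expanded.
  def render(depth = 0)
    puts "#{'  ' * depth}#{expanded ? '-' : '+'} #{label}"
    children.each { |c| c.render(depth + 1) } if expanded
  end
end

root = Node.new('dataset', [Node.new('cluster-1'), Node.new('cluster-2')])
root.render          # collapsed: only the root is shown
root.expanded = true
root.render          # expanded: root plus both clusters
```

Nothing interactive yet, but toggling `expanded` from an REPL or script already gives the drill-down behaviour I’m after.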
So far, I’ve looked at Processing, d3.js and ProtoVis. I like the idea of d3.js: the data-driven approach makes a lot of sense, but I think I need to refresh quite a bit of CSS-fu to take advantage of its capabilities. Apart from visualisation derived from data mining algorithms, showing the data as-is in an aesthetic manner is also a worthy goal at this point. In particular, the parallel coordinates visualisation caught my eye.
Oh well, at least I know what I’ll be doing for the next few days.
Continuing with my exploration of the data mining landscape, I extracted a decision tree from the data under scrutiny. It is the same dataset (20,000 samples, 56 dimensions), but the dimensions I’ve chosen for partitioning are a bit different from the raw attributes. I’ve conflated the 56 dimensions into a single number, since this is a test score we’re talking about, and I’m not sure modelling the individual responses would be the best way to construct the tree. I’m not really looking for close fits; buckets or bins should be adequate for discretising the response space.
Accordingly, I’ve partitioned the pre-scores as EXCELLENT, GOOD, AVERAGE, and so on. The attribute I’m attempting to predict is also a score, namely the post-score. Well, not quite the score itself: the improvement in score is a more sensible metric to try to predict.
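In code, the binning might look something like this. The cut-off values below are invented purely for illustration:

```ruby
# Conflate the 56 per-question responses into one pre-score, then bucket it.
# The thresholds are hypothetical, not the ones used on the real dataset.
def bucket(score)
  case score
  when 85..100 then :EXCELLENT
  when 70...85 then :GOOD
  when 50...70 then :AVERAGE
  else              :POOR
  end
end

responses = Array.new(56) { rand(0..1) }   # say, 56 binary responses
pre_score = responses.sum * 100.0 / 56     # conflated into a single number
label     = bucket(pre_score)
```

The same function applied to pre- and post-scores gives the discrete labels the tree partitions on.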
I had a problem with trying to visualise the data, but I’ve been able to make do with indenting the different levels of decision nodes; this should be fine until I really need to use a visualisation library. I think the code could use a bit of work – probability notation does not lend itself to elegant variable naming. I’ll probably write a lot more on this topic once I’m into Bayesian Nets.
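The indentation trick itself is simple enough to show in a few lines of Ruby; the node labels and splits below are made up:

```ruby
# Each decision node prints its test, then its children one level deeper.
DecisionNode = Struct.new(:test, :children) do
  def show(depth = 0)
    puts '    ' * depth + test
    (children || []).each { |c| c.show(depth + 1) }
  end
end

tree = DecisionNode.new('pre-score?',
  [DecisionNode.new('EXCELLENT -> small improvement'),
   DecisionNode.new('AVERAGE -> large improvement')])
tree.show
```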
I’ve avoided proofs unless absolutely necessary, but the relation between the same eigenvector expressed in two different bases is important.
Given that A_S is the linear transformation matrix in the standard basis S, and A_B is its counterpart in basis B, we can write the relation between them as:

A_B = C⁻¹ A_S C

where C is the similarity transformation. We’ve seen this relation already; check here if you’ve forgotten about it.
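The relation A_B = C⁻¹ A_S C is easy to check numerically with Ruby’s `Matrix` class; the matrices below are arbitrary examples of mine, not taken from the earlier post:

```ruby
require 'matrix'

# C's columns are the basis B vectors written in standard coordinates.
a_s = Matrix[[2, 1], [0, 3]]    # transformation in the standard basis S
c   = Matrix[[1, 1], [0, 1]]    # change-of-basis (similarity) matrix
a_b = c.inverse * a_s * c       # the same transformation, in basis B

# Transforming in B-coordinates and mapping back to S agrees with
# doing everything in S-coordinates directly.
x_b = Vector[1, 2]              # some vector, expressed in basis B
x_s = c * x_b                   # the same vector in standard coordinates
raise unless c * (a_b * x_b) == a_s * x_s
```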
Continue reading Matrix Theory: An essential proof for eigenvector computations
I return to the first example about basis vectors, when I spoke of linear transformations. The linear transformation we had was this:
The operation it performed on the basis vectors of the standard basis S was one of scaling, and scaling only. If a vector is only scaled when operated on by a linear transformation matrix, that vector is an eigenvector of the matrix, and the scaling factor is the corresponding eigenvalue. This is just the definition of an eigenvector, which I rewrite below:

A v = λ v
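The definition A v = λ v is easy to verify with Ruby’s `Matrix` class on a pure-scaling example. The matrix below is an arbitrary diagonal one of mine, not necessarily the example from the earlier post:

```ruby
require 'matrix'

# This matrix scales the first standard basis vector by 3 and the
# second by 2, so each is an eigenvector of the matrix.
a  = Matrix[[3, 0], [0, 2]]
v1 = Vector[1, 0]
v2 = Vector[0, 1]
raise unless a * v1 == v1 * 3   # A v = lambda v, with lambda = 3
raise unless a * v2 == v2 * 2   # A v = lambda v, with lambda = 2
```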
Understand that there is nothing extremely special about the standard basis vectors [1,0] and [0,1]. All 2D vectors may be represented as linear combinations of these vectors. Thus, the vector [7,24] may be written as:

[7,24] = 7·[1,0] + 24·[0,1]
A symmetric matrix looks like this:
Notice how the values are reflected across the diagonal a-b-c-d; this holds true for any symmetric matrix.
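Equivalently, a symmetric matrix is one that equals its own transpose, which is a one-line check in Ruby (the entries below are arbitrary; note the mirrored off-diagonal pairs):

```ruby
require 'matrix'

# Symmetry: m[i, j] == m[j, i] for every pair of indices,
# i.e. the matrix equals its own transpose.
m = Matrix[[1, 5, 9],
           [5, 2, 7],
           [9, 7, 3]]
raise unless m == m.transpose
```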
Continue reading Matrix Theory: Linear transformations and Basis vectors
My main aim in this series of posts is to describe the kernel — or the essential idea — behind some of the simple (and not-so-simple) eigenvector algorithms. If you’re manipulating or mining datasets, chances are you’ll be dealing with matrices a lot. In fact, if you’re starting out with matrix operations of any sort, I highly recommend following Professor Gilbert Strang’s lectures on Linear Algebra, particularly if your math is a bit rusty.
I have several reasons for writing this series. My chief motivation behind trying to understand these algorithms has stemmed from trying to do PCA (Principal Components Analysis) on a medium-sized dataset (20,000 samples, 56 dimensions). I felt (and still feel) pretty uncomfortable about calling LAPACK routines and walking away with the output without trying to understand what goes on inside the code that I just called. Of course, one cannot really dive into the thick of things without understanding some of the basics: in my case, after watching a couple of the lectures, I began to wish I’d had better mathematics teachers in school.
Continue reading Eigenvector algorithms for symmetric matrices: Introduction
I’ve been following the Linear Algebra series of lectures from MIT’s OpenCourseWare site. While watching Lecture 3 (I’m at Lecture 6 now), Professor Strang enumerates five methods of matrix multiplication. Two of those offered insights I wish my school teachers had given me, but it was the fifth method which got me thinking.
The method is really a meta-method: a way of breaking down the multiplication of large matrices recursively. To demonstrate the idea, here’s a simple multiplication between two 2×2 matrices.
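As a sketch of the recursion (my own illustration, not Professor Strang’s notation): split each matrix into four quadrants, combine the eight half-size products block-wise, and recurse until the blocks are small enough to multiply directly. A real MapReduce version would farm each quadrant product out to a separate worker.

```ruby
require 'matrix'

# Recursive block multiplication for square matrices whose size is a
# power of two. c11 = a11*b11 + a12*b21, and so on for the other blocks.
def block_multiply(a, b, cutoff = 2)
  n = a.row_count
  return a * b if n <= cutoff   # small enough: multiply directly
  h = n / 2
  a11, a12, a21, a22 = quadrants(a, h)
  b11, b12, b21, b22 = quadrants(b, h)
  c11 = block_multiply(a11, b11, cutoff) + block_multiply(a12, b21, cutoff)
  c12 = block_multiply(a11, b12, cutoff) + block_multiply(a12, b22, cutoff)
  c21 = block_multiply(a21, b11, cutoff) + block_multiply(a22, b21, cutoff)
  c22 = block_multiply(a21, b12, cutoff) + block_multiply(a22, b22, cutoff)
  Matrix.vstack(Matrix.hstack(c11, c12), Matrix.hstack(c21, c22))
end

# Split a matrix into its four h-by-h quadrants.
def quadrants(m, h)
  [m.minor(0...h, 0...h),           m.minor(0...h, h...m.column_count),
   m.minor(h...m.row_count, 0...h), m.minor(h...m.row_count, h...m.column_count)]
end

a = Matrix.build(4) { rand(10) }
b = Matrix.build(4) { rand(10) }
raise unless block_multiply(a, b) == a * b   # agrees with direct multiplication
```

The recursion naturally induces the quadtree segmentation of the title: each level of the tree corresponds to one round of block splitting.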
Continue reading Guiding MapReduce-based matrix multiplications with Quadtree Segmentation
For my graduation project, I’d written a machine vision/2D image algorithms system called IRIS. We’d used it to drive robots around, integrating sonar and visual data. However, that’s not what reawakened my interest in taking another look at the IRIS code. Playing around with datasets lately has had me rifling through books and equations I liked looking at in college. It is almost like a second education, and I think it only right that I get IRIS up and running, if only to steal some code from it (even though it is in C++, and I’m currently doing my investigations in Ruby).
With that said, I dug into my old SourceForge account, where (to my somewhat irrational surprise) the code was still untouched. However, that code will probably not compile as-is: even though it had been compiled under Linux, it had dependencies on drivers for hardware like the sonar systems and the webcam. I’m still not exactly sure I want to get all those dependencies resolved; they aren’t my primary focus at this point. So I stripped out whatever was not required and pushed the clean, compiling source to GitHub here.
Continue reading IRIS, Reloaded