Continuing on with my exploration of the data mining landscape, I extracted out a decision tree out of the data under scrutiny. It is the same data (20,000 samples, 56 samples), but the dimensions I’ve chosen for partitioning are a bit different from the raw attributes. I’ve conflated the 56 dimensions into a single number, since this is a test score we’re talking about, and I’m not sure modelling the indivdual responses for constructing the tree would be the best idea. I’m not really looking for close fits, buckets or bins should be adequate for discretising the response space.
Accordingly, I’ve partitioned the pre-scores as EXCELLENT, GOOD, AVERAGE, etc. The attribute that I’m attempting to predict is also a score, but this is the post-score. Well, not really the score itself, the improvement in score is a more sensible metric to attempt to guess.
I had a problem with trying to visualise the data, but I’ve been able to make do with indenting the different levels of decision nodes; this should be fine till I really need to use a visualisation library. I think the code could use a bit of work – probability notation does not lend itself easy to elegant variable naming. I’ll probably write a lot more on this topic once I’m into Bayesian Nets.