My take on Agile Analytics from the ThoughtWorks Technology Radar, which I presented at the Sheraton Bangalore today, is based on the following document.
Patient: Will I survive this risky operation?
Surgeon: Yes, I’m absolutely sure that you will survive the operation.
Patient: How can you be so sure?
Surgeon: Well, 9 out of 10 patients die in this operation, and yesterday my ninth patient died.
Andrew Lang, a Scottish writer and collector of folk tales, once remarked that many people use statistics as a drunken man uses lamp-posts…for support rather than illumination. Even so, we have come a long way from the 9th century, when Al Kindi used statistics to decipher encrypted messages and developed the first code breaking algorithm, in Baghdad – incidentally, he was instrumental in introducing the base 10 Indian numeral system to the Islamic and the Christian world.
1654 – Pascal and Fermat create the mathematical theory of probability,
1763 – Thomas Bayes’ proof of Bayes’ theorem is published, two years after his death,
1948 – Shannon’s Mathematical Theory of Communication defines capacity of communication channels in terms of probabilities. Bit of a game changer, that one. All our designs of communication networks and error-correction algorithms stem from insights found in that work.
Today, we realise that the pace at which we collect data far exceeds our capability to make sense of it. Data is everywhere, *literally*. The blood cells in your body trying to determine whether that molecule is an oxygen molecule or not? That is data. Your build breaking? That is data. You’re running a static analysis tool to check your test coverage? Yeah, that is data analysis.
Unfortunately, we are at the point where our opinions about whether a piece of data is relevant to analysis form far too slowly. How slowly? Well, human reflexes take milliseconds, while CPUs and GPUs function on the order of nanoseconds. That is six orders of magnitude. And that is how slow we are.
This, we cannot afford to be. In the past century, data collection was the bottleneck. Datasets larger than a few kilobytes were unheard of. Now, we are playing in gigabyte territory. When I was consulting with a telecommunications company a few months back, the calls through their network generated upwards of 600 MB of data per day.
Volume is not the only dimension of this deluge of data. The rate of flow of incoming data gives us pause too. Think of the stock markets: imagine having to make decisions based on data which, within a few minutes (or even a few seconds), will become obsolete. Analytics is not a goal in itself. It is merely an aid to decision-making. Given the speed at which new data is collected, and the speed at which old data fades into obsolescence, we must be prepared to deal with incomplete, fast-flowing data.
Think of it as a stream from which you scoop a handful of water to determine the level of bacteria in the water. You only have limited information from a single sample, but if you sample from multiple points upstream and downstream, you’ll eventually get a reasonably accurate answer to your question.
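To make the scooping analogy a little more concrete, here is a minimal sketch in Python. It uses reservoir sampling (Algorithm R), which is my choice for illustration rather than anything the talk prescribed: it keeps a fixed-size uniform random sample from a stream whose length you do not know in advance. The names `reservoir_sample` and `estimate_mean`, and the simulated "bacteria readings", are all hypothetical.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample(stream: Iterable[float], k: int,
                     rng: Optional[random.Random] = None) -> List[float]:
    """Keep a uniform random sample of up to k items from a stream.

    The stream is read once and never stored in full, so this works
    even when the data is too large (or too fast) to keep around.
    """
    rng = rng or random.Random()
    reservoir: List[float] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item survives with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def estimate_mean(sample: List[float]) -> float:
    """Estimate the stream's average from the sample alone."""
    return sum(sample) / len(sample)


if __name__ == "__main__":
    rng = random.Random(42)
    # Simulated readings: an assumption purely for illustration.
    readings = (rng.gauss(100.0, 15.0) for _ in range(100_000))
    sample = reservoir_sample(readings, k=500, rng=rng)
    print(f"Estimated mean from {len(sample)} samples: {estimate_mean(sample):.1f}")
```

The sample is only a few hundred values, yet its mean should land close to the mean of the whole stream, which is the point of the analogy: you rarely need all the water to answer the question.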
Agile Analytics conjures up images of iterations, collaborating with customers, and fast feedback when working on DW/BI projects. Indeed, this is what Ken Collier talks about in his book Agile Analytics. However, I wish to tackle a different angle. Hal Varian, Chief Economist at Google, believes that the dream job of this decade is that of a statistician. Everyone has data. It’s harder to get opinions about the data. It’s harder to, as he says, “tell a story about this data”.
We’re at a moment in the software industry where lots of things have begun to intersect with our field of interest. Statistics is one of them. Assume you are a software engineer, and have more than a peripheral interest in this field. What do you do?
Learn classical statistics. Learn Bayesian statistics. You probably hated those textbooks, so don’t use them; there are tons of more useful educational resources on the Web. Get into machine learning. Understand that machine learning is not some super-exotic field of study. I’ll go out on a limb and say that Machine Learning is just More Statistics under a trendy name.
Get acquainted with a few languages and libraries: R, NumPy, Julia. In fact, I’m super-excited by Julia because it offers native building blocks for distributed computation. Read a few papers on real-world distributed systems.
I do not talk about this because you’ll be building a distributed analytics engine from scratch (though you could). You will, through study of the subjects above, gain a much deeper understanding of why you should be analysing something, and also how such systems are built. You’re all, regardless of your previous background, engineers.
You will also encounter a lot of literature concerning visualisation while doing this. Visualisation is one of those things we don’t really pay much attention to until we really need it. Bars, graphs, colours: anything in lieu of numbers that can give us some visual indication of what’s going on. Health check pages, for example, are a useful way of integrating diagnostic information about a system.