Monday, June 28, 2010

of lines and graphs

i made a comment on brian krebs' recent blog post "Anti-virus is a Poor Substitute for Common Sense" that seems to have gotten a number of negative reactions from other readers. i thought perhaps i should expand on my comment here so as to demonstrate why i wasn't just nitpicking.

first we need an example of a line graph like the one in brian's post:
unlike the graph from the NSS Labs report that brian posted about, i've made it clear where my actual data points are so that you can see clearly what interpolation can do to a graph. while i have absolutely no data points above 70 the line still goes up above 70 and then comes back down, just like the graph in the report in question.
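for anyone who wants to reproduce the effect, here is a minimal sketch in python. the measurements are made up for illustration, and i'm assuming a catmull-rom spline as the smoothing method. i don't know what smoothing the charting tool behind the report's graph actually used, so treat this as one common example of the technique rather than a reconstruction of that graph.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """value of a uniform catmull-rom spline between p1 and p2, for t in [0, 1]."""
    return 0.5 * (
        2 * p1
        + (-p0 + p2) * t
        + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
        + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3
    )


def smooth_curve(values, steps=10):
    """sample a smooth curve that passes through every value in order."""
    # repeat the end points so the first and last segments have neighbours
    padded = [values[0]] + list(values) + [values[-1]]
    curve = []
    for i in range(1, len(padded) - 2):
        for s in range(steps):
            curve.append(
                catmull_rom(padded[i - 1], padded[i], padded[i + 1], padded[i + 2], s / steps)
            )
    curve.append(values[-1])
    return curve


# hypothetical measurements: nothing here is above 70
measurements = [10, 20, 70, 70, 40]
curve = smooth_curve(measurements)

print("highest measured value:", max(measurements))
print("highest value on the smoothed line:", max(curve))
```

the smoothed line peaks above 70 between the two measurements of 70, even though no measurement is that high. drawing straight segments between the points, or holding each value until the next measurement, would not overshoot like this, although either one still implies values between the measurements.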

now, if you couldn't see where the data points actually were you might easily think that the data actually showed the value went above 70 and then came back down again. in fact one could easily make the mistake that every point on the graph represented what really happened, even the points for which there is no actual data, because a line graph without the data points clearly denoted implies that there is continuous data along the entire graph. but the reality is our data is not continuous, it's discrete. we take a finite number of measurements at fixed points in time.

even with the data points clearly marked on my own example graph above there is the implication that had i actually measured at some point in between i would have gotten a value on the line, even though that isn't necessarily true, and in fact there are many, many points where that is most certainly false. let's say for example that my graph shows how the detection rate changes over time on a fixed set of 10 items. the only possible detection rates are then multiples of 10%, so the detection rate can never be 15%. it's simply not numerically possible even though the line implies it is.
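the arithmetic is easy to check. this is a small sketch using the same hypothetical set of 10 items, with two measurements (10% and 20%) that i've picked purely for illustration:

```python
SAMPLE_COUNT = 10  # the fixed set of 10 items from the example

# every detection rate that is actually achievable: 0 of 10, 1 of 10, ... 10 of 10
possible_rates = [100 * detected // SAMPLE_COUNT for detected in range(SAMPLE_COUNT + 1)]


def lerp(a, b, t):
    """straight-line interpolation between a and b, for t in [0, 1]."""
    return a + (b - a) * t


# halfway along a line drawn between a 10% measurement and a 20% measurement
midpoint = lerp(10, 20, 0.5)

print("achievable rates:", possible_rates)
print("value the line shows halfway between:", midpoint)
print("is that value achievable?", midpoint in possible_rates)
```

the line passes through 15, but 15 is not in the list of achievable rates, so no measurement could ever have produced it.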

these are some of the problems you encounter using a continuous data visualization for discrete data. i'm not trying to suggest these are huge problems but they are problems because they mislead the reader. these subtle sorts of things are exactly what it means to lie with numbers/statistics. it makes no sense for there to have been periods of time during which the detection rate on a fixed set of malware went down instead of up (go on, pull my other leg) and that was almost certainly the result of interpolation rather than something reflected by actual data.

the misapplication of continuous data visualizations for discrete data is a hallmark of junk science. i don't know if that is representative of the work NSS Labs actually does (i've yet to successfully penetrate their reg-wall in order to see for myself) or if it was just a simple mistake - greater transparency on their part (as apparently discussed by david harley) would allow better peer review and help eliminate uncertainty about such things. as it stands, however, i have an admittedly small reason to be skeptical of them. their own marketing (PDF) bills them as being scientific and expert but a scientist ought to understand his/her data better than this.