Tuesday, July 06, 2010

testing testing 1 2 3

last month ed moyle published a pair of posts about an incident in which a particular piece of open source IRC server software was found to have a backdoor planted (intentionally, by an unknown party) in the source code archive on the software's official site. the backdoor went undiscovered for over half a year, and in that time the trojanized copy of the software apparently made it into the gentoo linux distribution (which has since been corrected, but if you're a gentoo user/admin and you didn't hear about this then you'll want to go check some things real fast).

all of which calls into question the accuracy of linus' law, which states (more or less) that 'given enough eyeballs, all bugs are shallow'.

so the question is: how can you know if your flaw finders (be they auditing code to find security flaws, testing the product to find regular bugs, or something else entirely) are doing a good job? how can we measure it? how can we test our testers or our code auditors or whatever other flaw finders we might have? as a software developer myself, i'm sensitive to the issue of software flaws, so this question interested me, and almost immediately a thought popped into my head - introduce your own flaws and see how good your flaw finders are at finding them. so long as they're fully documented you should be able to remove them before the final release of the code, and by measuring how many of these particular flaws get found you can get an estimate of how good a job your flaw finders are doing.
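to make that concrete, here's a rough python sketch of how you might score your flaw finders against a set of seeded flaws (the flaw ids and numbers below are entirely made up for illustration):

```python
def seeded_detection_rate(seeded_ids, reported_ids):
    """fraction of the deliberately seeded flaws that the finders reported."""
    seeded = set(seeded_ids)
    found = seeded & set(reported_ids)
    return len(found) / len(seeded)

# say 10 flaws were seeded and documented; the finders' reports
# include 7 of them (plus some naturally occurring flaws)
seeded = [f"seed-{i}" for i in range(10)]
reports = seeded[:7] + ["nat-1", "nat-2"]
print(seeded_detection_rate(seeded, reports))  # 0.7
```

since every seeded flaw is documented, checking reports against that list also tells you which seeded flaws were missed - exactly the ones you need to remember to remove before release.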

additionally, by comparing the number of artificially introduced flaws found to the total number of flaws being found you can even get an estimate of the size of the total flaw population. animal population sizes are often estimated this way, with one exception - usually it doesn't involve making new animals. that underscores one of the biggest problems with the idea: the flaws you artificially introduce may have little in common with natural flaws, and as such finding them may not be of comparable difficulty.
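continuing with hypothetical numbers, the population estimate might be sketched like this in python (it assumes seeded and natural flaws are equally easy to find, which, as noted, may well not hold):

```python
def estimate_total_flaws(seeded_total, seeded_found, natural_found):
    """estimate the natural flaw population from the seeded-flaw detection rate.

    if the finders caught seeded_found of seeded_total seeded flaws, assume
    they caught the same fraction of the natural flaws.
    """
    if seeded_found == 0:
        raise ValueError("no seeded flaws found - cannot estimate a detection rate")
    # natural_found / (seeded_found / seeded_total), kept as one expression
    # to avoid floating point surprises
    return natural_found * seeded_total / seeded_found

# finders caught 7 of 10 seeded flaws and 21 natural flaws,
# suggesting roughly 30 natural flaws exist in total
print(estimate_total_flaws(10, 7, 21))  # 30.0
```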

when estimating wild animal populations it's more common to capture some, tag them, release them, and then go on a second round of capturing to see how many of the tagged animals are captured a second time. doing this with software flaws would necessitate having 2 groups of flaw finders (or separating the ones you have into 2 groups) so that the flaws found and tagged by one group in the first pass are used to evaluate the other group in the second pass.
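in wildlife terms this is (if i have it right) the classic mark-recapture estimate: population ≈ (first-pass catch × second-pass catch) / overlap. a hypothetical sketch with software flaws, ids invented for illustration:

```python
def mark_recapture_estimate(first_pass_ids, second_pass_ids):
    """estimate the total flaw population from two independent passes.

    flaws found in the first pass are the 'tagged' population; the overlap
    with the second pass plays the role of recaptured animals.
    """
    tagged = set(first_pass_ids)
    second = set(second_pass_ids)
    recaptured = tagged & second
    if not recaptured:
        raise ValueError("no overlap between passes - estimate is undefined")
    return len(tagged) * len(second) / len(recaptured)

# group 1 tags 6 flaws; group 2 finds 8, of which 3 were already tagged,
# suggesting a total population of around 16 flaws
group1 = {"a", "b", "c", "d", "e", "f"}
group2 = {"d", "e", "f", "g", "h", "i", "j", "k"}
print(mark_recapture_estimate(group1, group2))  # 16.0
```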

2 groups are necessary because, unlike animals, flaws don't move, so the group who found and tagged the flaws would be a little too good at finding them the second time. ideally they should also not discuss the flaws they tagged with the other group, or they could wind up giving hints that skew the score. keeping the 2 groups truly separate is what complicates this approach, and it's where the artificially introduced flaws would have an advantage, since those would be easier to keep secret.

a compromise combining the two approaches could also be possible, and if the naturally occurring flaws are adequately classified it should be possible to use that information to draft more realistic artificial flaws. in addition this would enable finer grained metrics to be collected, so as to find out whether there are some types of flaws your flaw hunters have more trouble with than others.
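those finer grained metrics might look something like this hypothetical sketch, assuming each seeded flaw has been labeled with a category (the categories and ids here are invented for illustration):

```python
from collections import defaultdict

def detection_rates_by_category(seeded, reported_ids):
    """per-category detection rates for seeded flaws.

    seeded: mapping of flaw id -> category label.
    returns a mapping of category -> fraction of that category's flaws found.
    """
    found = set(reported_ids)
    totals = defaultdict(int)
    hits = defaultdict(int)
    for flaw_id, category in seeded.items():
        totals[category] += 1
        if flaw_id in found:
            hits[category] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}

seeded = {
    "s1": "buffer overflow",
    "s2": "buffer overflow",
    "s3": "logic error",
    "s4": "logic error",
}
print(detection_rates_by_category(seeded, ["s1", "s2", "s3"]))
# {'buffer overflow': 1.0, 'logic error': 0.5}
```

a result like that would suggest the flaw hunters handle buffer overflows well but only catch half the logic errors, which tells you where to focus their training (or your seeding).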

of course, this has probably all been thought of before, but just in case it hasn't i just thought i'd throw it out there.