Saturday, April 21, 2007

when is a bad test not a bad test?

when it's FUD'ing snake oil...

there have been a number of posts about anti-virus/anti-malware testing recently... even i posted about testing in response to what has become a series of posts about anti-virus testing over on anton chuvakin's blog (1, 2, 3, 4)... well this post is a follow-up because anton has managed to post the original test paper that his series of posts were based on...

to say that i was unimpressed would be an understatement... let's start with the number of samples - you may recall from my previous post that i said that the minimum number of samples needed to account for the 2% detection rate that was being claimed was 50... according to the actual paper
Of the 35 malware files, three invalid files were removed from the sample set, leaving 32 malware binaries used in the final tests and performance calculations
so if only 32 samples were used, how is it that the lowest scoring product only detected 2% of the samples? detecting just a single sample gives a detection rate of 3%, not 2%, and all products tested detected more than just one sample... it can't be blamed on anton misremembering the figure he was told either, since the actual test paper states
the lowest was tied between ClamAV and FileAdvisor with a 2% detection rate
in one place and
two products tied for the lowest detection rate at 2%
thankfully the chart with their results clears this up - it's 2 raw detections (not a 2% detection rate) which means a 6% detection rate (which was also correctly reported in that same chart)... now, you'll have to forgive me for calling a spade a spade, but this level of mathematical incompetence (recognizing that you can't have a 2% detection rate with only 32 samples is grade school math and simply reading the column marked "percent" in a chart takes even less skill than that) is inexcusable for people who wish to have their test taken seriously... given such a complete lack of mathematical acumen, it's almost understandable that they failed to realize that a test bed of 32 samples isn't anywhere near large enough to give statistically significant results...
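the arithmetic is easy enough to check for yourself - here's a quick python sketch (the confidence interval at the end is my own back-of-the-envelope illustration of the sample size problem, not anything from the paper):

```python
import math

samples = 32
detections = 2

# 2 raw detections out of 32 samples is 6.25%, nowhere near 2%
rate = detections / samples
print(f"detection rate: {rate:.2%}")  # detection rate: 6.25%

# a rough 95% wald interval for a proportion with n = 32 shows just how
# imprecise a 32-sample test is: the margin of error is larger than the
# measured rate itself
margin = 1.96 * math.sqrt(rate * (1 - rate) / samples)
low, high = max(0.0, rate - margin), rate + margin
print(f"95% interval: {low:.1%} .. {high:.1%}")
```

in other words, even taking the corrected figure at face value, a test this small can't distinguish a 6% product from a 0% product or a 14% product...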

bias also figured heavily in the test... not only because they only used samples that got through the layers of protection already present on the systems they culled the samples from (thereby missing a potentially huge chunk of what's really posing a threat to users and irrevocably compromising the results and the integrity of the test itself), but on a deeper level the test was written from the perspective of an incident response technician... what is the perspective of an incident response technician? well these are the people who spend their days dealing with the after effects of the failure of security software and/or preparing for the next failure so as to make cleaning up after that event easier than cleaning up after the previous one... all they see are the security product's failures because that's their job, and if it weren't for the immutable fact that all security products fail they wouldn't have that job and they'd have to find some other form of employment (such as performing and publishing dubious tests)... just as police (who deal with criminals on a daily basis) are prone to developing an imbalanced view of society if they're not careful, so too are incident response technicians prone to developing an imbalanced view of the efficacy of security products if they aren't exposed to the security product's successes (which are generally invisible by design)... this acute form of perceptual bias taints the entire test at fundamental levels - including the design and methodology of the test (as evidenced by their belief that they need only collect samples that successfully compromised production systems that already had protective measures in place)...

so far these problems would seemingly be attributable to the testers simply being inexperienced and/or suffering from false authority syndrome... it's time to shine a light on a part of the test that can't easily be attributed to that... there was one product in the test that didn't do too badly - in fact it did better than all other products, it detected 50% more than the next best product, and it was one of the few products in the test that weren't used through virustotal... that product was asarium by proventsure (no link for reasons that are about to be made clear)... go and read the paper carefully (it's only 4 pages) and tell me if you can see something a little off about it... yes, that's right - the product that did the best, the product that was better than all others by a wide margin was the product made by the company whose president helped write the paper and is the contact listed in the abstract... the test, which is titled "Antiviral shortcomings with respect to 'real' malware" by gary golomb, jonathan gross, and rich walchuck, is NOT independent... one or more of its authors has a clear vested interest in making one product look good at the expense of all others... this puts all the other problems with this test into a new and decidedly unfavourable light... the bad math, the sample selection bias, the insignificant sample size, etc. - in light of this revelation they all point to a cooked test designed to make all products other than asarium look worse than they really are (FUD) in order to make asarium look better than it is in comparison to them (snake oil)... the test, therefore, becomes little more than a marketing stunt by a disreputable company whose product should probably be given a wide berth...

and poor anton chuvakin - though widely regarded as a security expert, not only is he clearly not an authority on malware himself but apparently he also can't recognize a fraud/pretender when he sees one... that doesn't bode well for average folks' ability to do the same, does it...

2 comments:

Gary said...

Hey there Kurt-

I've been watching your other posts for a while, and probably should have addressed them sooner. The problem was they looked a lot more like a troller looking for a flame war than appealing dialogue (all the posts I made about intrusion detection in early 2000 sounded a lot like that too), however it wouldn't be fair to either of us to not address them either. While it sounds like you have already made up your mind, for the sake of completeness I'll reply. (There's not much question as to the one-sidedness of your post since you left out every point in the paper that starts to address many of your concerns. Ironic that you start your post talking about FUD, *especially* by your thorough definition of it, but pointing out more only underscores the obvious here.) Feel free to tear this all apart, but honestly, this will be my only post here on the subject...

You list a number of things, and I'll try to address them all, but many of them seem to revolve around the validity of the sample data. As with any test, sample data is always the biggest challenge and the biggest point where people start to disagree about what the test means. (Another important point is what this test means, which we'll talk about that in a bit...) You're correct that 32 is ultimately a small sample size. I would have loved to seen it several hundred in size, but the test called for real binaries taken from live systems. I know you don't like that either, but I'll address that in a second too. After collecting these samples for several months, we had over a hundred binaries, but most of them ended up being the same (based on hashes), so we had to filter it down to the 30's for unique binaries. We could have kept collecting for a year or so to get that number up much higher, but that would only introduced other potential problems, such as time-based validity of the sample. There are several other work-arounds to this one problem, and each of them introduces more questions of validity than they compensate for. This is something we all were unhappy about, but at the end of the day, was the most defendable sample set possible for the point of the test. Like I said, we'll talk about the point of the test in a sec.
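For what it's worth, the de-duplication step was nothing exotic - just keying the collected files on a content hash and keeping the first of each. A generic sketch (illustrative code, not our actual tooling):

```python
import hashlib
from pathlib import Path

def unique_samples(paths):
    """Reduce a pile of collected binaries to one file per unique content hash."""
    seen = set()
    unique = []
    for p in paths:
        # two files with identical bytes hash identically, so the
        # second (and later) copies get dropped
        digest = hashlib.sha256(Path(p).read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(p)
    return unique
```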

In a purely laboratory environment, I think you're right that only pulling live binaries off live systems could have been questionable, but the entire point of the paper was *not* to establish who detects the most out of thousands of [typically custom-engineered] binaries. There are too many tests out there like that, and I can only imagine what you think of those. The point of the paper, as reflected in it's very title, is the performance of products in "the real world," but more specifically the impact to end-users. I put that in quotes because I know everyone reading this blog could spend the next few years discussing how to test "real world" scenarios and not get anywhere. It's the oldest argument in the book and has spanned every technology in the tech sector, not just security-related products. I imagine that you did lots of research on Proventsure before making your post, and you may have seen that the company is only a few months old. We are all coming from end-user operational environments that have been fed a lot of FUD by vendors. The problem with what we've been hearing all these years, is that it closely resembles much of what you posted. (Valid or not, everyone has problems with everyone else's tests. It's just a tiring cycle...)

I seriously don't say that to continue your flame war... I say it because you need to think about this from an end-users perspective. That is, the folks who are actually ***liable and accountable*** for the information on these computers that are continually being compromised and that we've spent years cleaning... What matters is that end-users are completely accustomed to cleaning systems that are fully compromised with fully updated AV. That's a major point in the paper that was quite conveniently left out of your rant. For the average global enterprise, AV is positioned as a product that keeps backdoors off computers. Here, we all know the reality of the situation. Every single person here would agree that layers of "defenses" are needed, and AV does an incredible job in certain areas. The problem is not inside of AV per se, but that if you spoke to anyone at the executive IT level of many major corporations, most would think you're crazy for suggesting they need more than a firewall and AV on end-point systems to deal with malware-related problems. I know this because the exact situation happens several times a week for us, and these are seriously global enterprises. It's shocking. No one can put blame on AV products for that. The challenge then becomes, how do you (as a technical person) bring light to a problem that many corporations are suffering from due to self-inflicted blindness, and in such a way they will understand it? One way is to show the reality of the performance on full-fledged backdoor binaries (also defined in the paper, and since you didn't rant about it, I'm guessing you didn't have too many problems with, unless I missed that one) from *their* environments. Then contrast them to the single product *they're* using to address a massively complex problem -- when, as you point out, layers are needed. This entire situation is also examined two paragraphs below, and probably makes more sense there - at least, I hope it does.

The way we ran the test was to use binaries from compromised computers that were being detected via IDS's and similar network devices. The binaries came from over 100 systems with and without host-based protection mechanisms (but, again, the sample had to be significantly reduced since scanning the same binaries over and over is meaningless). A majority of these systems had little protection given the environments they were coming from (several different universities) and the compromises were detected through due-diligence in monitoring. Since some of those systems had an installed AV, but were missing the backdoors, the best way we could compensate for that was to scan the binary against as many AV's as possible. Additionally, another layer was added by waiting a couple months for AV products to get updated before scanning the collected binaries. Most people are quite comfortable with the validity of Virustotal. You ranted against the point that Virustotal uses a couple beta products, but along the lines of FUD in this point is that you forgot to mention that VT only uses what vendors tell them to use (or, in several cases, not to use). It is still certainly a more defendable test than using only the small fraction of AV products we could get legal access to.

You're point about the view of a IR technician is excellent and accurate, but again, only half the story (which at the end of the day... Well, you know your own definition of FUD...). People that helped write the paper (and others who collaborated, but were not in the title line) are in environments that deal with no less than a dozen compromised systems a week. (No, Anton was not a collaborator, which I'm sure you've wondered about.) As in, the types of compromises you talk about where all other "normally deployed" security technologies have failed. From the view of a production environment spending millions a year on IT security, and knowing it only takes one wrongly-placed compromise to cause you to be a headline in the newspaper (I'm sure you're familiar with this http://www.privacyrights.org/ar/ChronDataBreaches.htm#CP), a dozen a week is a very disconcerting environment to work in. Speaking first hand, IR technicians don't get the luxury of being to sit around patting ourselves on the back and telling each other what studs we are because we stopped 100's the week before and everything worked like it should. (Something you seem to start implying, although not in the same sense at the preceding sentence.) After reading most your posts this whole time (you seem to have a lot of time to write this stuff, so I'm sure I missed some - if so, and I missed some points because of that, I apologize), it's obvious you've never had to deal with the prospect of losing your job when briefing the CIO and General Council's office when some administrative assist's computer with the "key to the kingdom" is hacked. I mean, after *they* spend millions of dollars per year, it's not *their* fault! In fact, having been subjected to that for a few years is compelling enough to make some of those IR technicians start a company trying to help others in the same situation.

In terms of the numbers and percentages.... The draft was written before the table was finalized, and the cut-and-paste error of not propagating the 6% over the 2% is a good catch. I thank you for that, and will get the paper updated. (Actually, I plan on updating the discussion section with many of the points you bring up, and not in the sense of point/counterpoint - but rather as completely valid points on their own merits. They may or may not relate to the test at hand, but that should be left to the reader to decide and by not including them, the paper could be more slanted than any one of us would like it to be.) After spending so much time focused on dynamic programming/graphing algorithms and programming state machines, your comments about elementary math definitely brought a smile to us. And such comments by you are definitely not FUD either, right? As for Asarium not being in the Virustotal test, hopefully someday we get it there. I would love to see that become a reality in the not-so-distant future. The direct feed of malware that VT vendors get would certainly be an awesome resource to have.

In terms of using Asarium in the test versus not using it.... (Your main point for why the paper is so terribly one-sided.) If one of the major points of the paper (at a technical level) is that products should be alerting to indications that binaries look like malware based on structural binary markers (and not based on "signatures" for doing so), but there's only one commercial product that does that, what do you suggest we do? I ask with complete geniality, and of course am not suggesting that et tu suffers from false authority syndrome. (Although you sound like you have the ability to design perfect tests, so I'm interested in your thoughts.) Let's say you wanted to write about crossing dynamic graphing algorithms with "hidden state" analysis through Hidden Markov Models detecting malware binaries more generally (yes, that's yet another point you left out of your rant but was mentioned in the paper - that AV is better in terms of being more specific, which gives IR technicians more information to work from), yet there was only one tool available to you to show this, how would you go about contrasting that tool to others?

Really long story short, the exercise began as an effort in validating the product and we were surprised by the results. Even if Asarium was removed from the picture, and as much as you might like to think otherwise, the information warranted sharing as it reflects common occurrences faced by IR engineers and common misconceptions by users who believe that they have nothing to worry about.

Anyways, although your rant was extremely slanted and seemed to suffer badly from all the same terms you throw so copiously at everyone else, with an open mind it was also interesting.

kurt wismer said...

now that's a comment!

"I've been watching your other posts for a while, and probably should have addressed them sooner. The problem was they looked a lot more like a troller looking for a flame war than appealing dialogue"

gee, i'm sorry i couldn't make calling you out on your wrong-doings more appealing to you...

"You list a number of things, and I'll try to address them all, but many of them seem to revolve around the validity of the sample data."

the technical concerns revolve around the validity of the data AND the design of the test itself...

the ethical concerns revolve around you, president of a company with anti-malware aspirations, performing anti-malware comparative reviews...

"You're correct that 32 is ultimately a small sample size."

i'm glad we can agree on something at least...

"I would have loved to seen it several hundred in size, but the test called for real binaries taken from live systems. I know you don't like that either, but I'll address that in a second too."

actually, i don't really have a problem with the binaries being taken from live systems, my problem is that they were taken from PRODUCTION systems... systems which inherently filter out the malware that anti-virus products are best at dealing with precisely because they are using the aforementioned anti-virus products...

because you limited your sampling in this way, your sample cannot represent in-the-wild malware in general but rather a particular subset of in-the-wild malware... in the real world that you allude to, people are affected by the entire set of in-the-wild malware, not just the narrowly defined subset used in your cooked test...

"After collecting these samples for several months, we had over a hundred binaries, but most of them ended up being the same (based on hashes), so we had to filter it down to the 30's for unique binaries. We could have kept collecting for a year or so to get that number up much higher, but that would only introduced other potential problems, such as time-based validity of the sample."

you already introduced time-based validity problems... if you wanted more samples, the solution is not to collect for a longer period of time, it's to collect from a broader pool of machines... if you can't collect a reasonable number of samples then perhaps you shouldn't be doing these sorts of tests in the first place...

"There are several other work-arounds to this one problem, and each of them introduces more questions of validity than they compensate for. This is something we all were unhappy about, but at the end of the day, was the most defendable sample set possible for the point of the test."

samples from a honeypot that had no anti-malware filters in place would have been more 'defendable' as they wouldn't have the selection bias that production systems impose, but that wouldn't have supported your preconceptions nearly as well as the samples you ultimately chose did...

"The point of the paper, as reflected in it's very title, is the performance of products in "the real world," but more specifically the impact to end-users. I put that in quotes because I know everyone reading this blog could spend the next few years discussing how to test "real world" scenarios and not get anywhere."

in a system that discards known malware in the wild, how can what's left be considered a representation of the set of real world threats that end users face?

"We are all coming from end-user operational environments that have been fed a lot of FUD by vendors."

you've been fed FUD by their marketing departments and created your own marketing FUD in response...

my instincts always tell me to distrust marketing (at least conventional marketing) so i know better than to listen to those FUD-laden messages and i would hope other people are learning to do the same...

"The problem with what we've been hearing all these years, is that it closely resembles much of what you posted."

as far as tests go, av companies have had their feet held to the fire more times than i can count... they've generally fallen in line, which is why companies like symantec and mcafee aren't performing their own comparative reviews like you did...

"I say it because you need to think about this from an end-users perspective."

if you think i'm not thinking about this from an end user's perspective then you definitely don't know me very well... i AM thinking about it from an end-user's perspective... as an end user i see no reason to trust a test that throws out a whack of in-the-wild malware and then claims what's left represents the real world threat i face... as an end user i see no reason to trust a test performed by one of the companies vying for my money... as an end user i see no reason why i should trust testers who can say 2% of 32 with a straight face...

"What matters is that end-users are completely accustomed to cleaning systems that are fully compromised with fully updated AV. That's a major point in the paper that was quite conveniently left out of your rant."

because i've already spent a fair bit of time covering the fact that all prevention methods fail... that should be a no-brainer, that's why i advise the use of multiple layers (and have advised that for many years)...


"For the average global enterprise, AV is positioned as a product that keeps backdoors off computers. Here, we all know the reality of the situation. Every single person here would agree that layers of "defenses" are needed, and AV does an incredible job in certain areas."

everyone does not agree with that... anton chuvakin himself used your test to help support the now popular dogma that anti-virus is dead... that is not an agreement with the need for defenses and most certainly not an agreement that av does an incredible job in any area...

"The problem is not inside of AV per se, but that if you spoke to anyone at the executive IT level of many major corporations, most would think you're crazy for suggesting they need more than a firewall and AV on end-point systems to deal with malware-related problems."

and the easy way to show them they're wrong is to point to retrospective tests like those performed by av-comparatives.org... there are well established testing methodologies that demonstrate just how bad anti-virus products are when it comes to their problem area (unknown malware)...

"The challenge then becomes, how do you (as a technical person) bring light to a problem that many corporations are suffering from due to self-inflicted blindness, and in such a way they will understand it? One way is to show the reality of the performance on full-fledged backdoor binaries (also defined in the paper, and since you didn't rant about it, I'm guessing you didn't have too many problems with, unless I missed that one) from *their* environments."

if you're trying to get traction with business people then you need to make business arguments... they already have compromises in their own systems, they already know how much that is costing them, provide them with an additional layer that saves them more money than it costs and a smart business person should bite...

"The binaries came from over 100 systems with and without host-based protection mechanisms"

whether it had host-based protection mechanisms or not seems like a non-sequitur - even those without host-based protection would still be protected by network-based protection mechanisms like gateway scanners... nobody puts up a production system and then leaves it completely bare...

"A majority of these systems had little protection given the environments they were coming from (several different universities)"

universities are a bit of a black swan as far as malware goes, you know that right?

"Most people are quite comfortable with the validity of Virustotal. You ranted against the point that Virustotal uses a couple beta products,"

the only mention i made to virustotal was to point out that the people who run virustotal are not comfortable with the validity of tests that make use of virustotal (which should speak volumes all by itself)... and that wasn't even something i said here, it was in the comments on anton's blog...

"but along the lines of FUD in this point is that you forgot to mention that VT only uses what vendors tell them to use (or, in several cases, not to use)."

vt uses components and settings that end users don't use, that's one of the things the people who run it warn about... tests based on it are not representative of the protection one gets from the actual product(s)...

"(No, Anton was not a collaborator, which I'm sure you've wondered about.)"

nope, he made it pretty clear it was just something a contact of his told him (maybe over drinks or something, who knows)...

"You're point about the view of a IR technician is excellent and accurate, but again, only half the story (which at the end of the day... Well, you know your own definition of FUD...). People that helped write the paper (and others who collaborated, but were not in the title line) are in environments that deal with no less than a dozen compromised systems a week. [...] As in, the types of compromises you talk about where all other "normally deployed" security technologies have failed."

which supports what i said about them spending their days looking at failures...

"From the view of a production environment spending millions a year on IT security, and knowing it only takes one wrongly-placed compromise to cause you to be a headline in the newspaper (I'm sure you're familiar with this http://www.privacyrights.org/ar/ChronDataBreaches.htm#CP), a dozen a week is a very disconcerting environment to work in. Speaking first hand, IR technicians don't get the luxury of being to sit around patting ourselves on the back and telling each other what studs we are because we stopped 100's the week before and everything worked like it should."

which supports what i said about the effect their experiences have on their perspective... they don't see the successes, they don't have time to look at that or to think about that... as true and as real as all that is it doesn't make it any less biased... and that bias affected the test design in ways i've already described...

"it's obvious you've never had to deal with the prospect of losing your job when briefing the CIO and General Council's office when some administrative assist's computer with the "key to the kingdom" is hacked. I mean, after *they* spend millions of dollars per year, it's not *their* fault!"

just because i don't share your outlook doesn't mean i haven't had similar experiences... the fact of the matter is, however, i don't deal well with authority figures and if my boss winds up being someone who can't figure out that sometimes bad things happen without it being anybody's fault and in spite of doing nothing wrong, then good riddance to him...

"In terms of the numbers and percentages.... The draft was written before the table was finalized, and the cut-and-paste error of not propagating the 6% over the 2% is a good catch. I thank you for that, and will get the paper updated. (Actually, I plan on updating the discussion section with many of the points you bring up, and not in the sense of point/counterpoint - but rather as completely valid points on their own merits. They may or may not relate to the test at hand, but that should be left to the reader to decide and by not including them, the paper could be more slanted than any one of us would like it to be.) After spending so much time focused on dynamic programming/graphing algorithms and programming state machines, your comments about elementary math definitely brought a smile to us. And such comments by you are definitely not FUD either, right?"

nope, definitely not FUD... whatever else you may have had on your plate, something as elementary as realizing that 2 out of 32 is not 2% (-> 2 percent -> 2 per 'cent' -> 2 out of 100) is something you folks should have been able to do in your head in the blink of an eye and the fact that you failed to do so twice in the paper and once while communicating with anton must invariably reflect poorly on you...

"In terms of using Asarium in the test versus not using it.... (Your main point for why the paper is so terribly one-sided.)"

you've totally misunderstood this... you, as a (new) member of the anti-malware industry have no place performing and presenting anti-malware comparative reviews (just as ford has no place performing and presenting comparative reviews of automobiles)... regardless of whether your product is included, you are biased by virtue of your position in the industry... the fact that you included asarium helped clue me into the nature of that bias - if you'd left it out i might have overlooked the fact that you're in the industry because the company is still new and relatively obscure...

so thank you for leaving that trail of bread crumbs...

"If one of the major points of the paper (at a technical level) is that products should be alerting to indications that binaries look like malware based on structural binary markers (and not based on "signatures" for doing so), but there's only one commercial product that does that, what do you suggest we do? I ask with complete geniality, and of course am not suggesting that et tu suffers from false authority syndrome."

i try my best to remember not to present myself as an authority, actually... but in answer to your question, instead of comparing apples and oranges as you're talking about here, compare apples to apples...

what you're describing sounds an awful lot like it could be classified as a type of heuristic technology so pit it against the heuristics of other anti-malware products - i.e. in a retrospective test...

oh, but of course since you're in the industry you shouldn't be doing the test yourself so go and try to convince a respected independent testing body to do the test for you...

"Really long story short, the exercise began as an effort in validating the product and we were surprised by the results."

unfortunately it's self-validation (which is of little use to the public at large because of the "self-" part) and it's already being misinterpreted as something other than that...