Wednesday, February 06, 2008

samples, variants, and signatures - oh my

last week anton chuvakin characterized the new av-test.org results as "widely ridiculed"... this struck me as odd because i hadn't seen any ridicule so i asked for links... that didn't get me anywhere so i took to googling for the answer and trying to guess what keywords might match this nebulous ridicule anton was speaking of...

now, while i didn't find ridicule, per se, i did find some significant criticism in a thread at wilderssecurity.com by none other than paul wilders himself (i really need to find a way to aggregate web forums so i can actually follow these sorts of discussions instead of finding out about them after the fact)... the criticisms weren't totally unreasonable, in fact i even agree with one (that a good test needs thorough supporting documentation on things like test methodology, sample acquisition, test bed maintenance, etc - how i wish we could see documentation as thorough as the vtc used to put out) but some i think were a little off...

one such criticism had to do with comparing apples to oranges, or in this case the number of samples used in the test to the number of signatures in a product's database - which, when taken with some of the confusion that led to, makes me think a discussion of what these things are might be in order...

a sample is just a file containing a piece of malware... you distinguish one sample from the next by simple binary comparison - if the files are different then the samples are different, otherwise they're just copies of the same sample...
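as a rough illustration of that binary comparison, here's a minimal python sketch... it assumes a sha-256 digest is an acceptable stand-in for a byte-for-byte comparison (collisions are ignored), and the function names and byte strings are hypothetical, made up purely for the example...

```python
import hashlib


def sample_id(data):
    """identify a sample by a digest of its exact bytes (assumption: sha-256
    stands in for a full byte-for-byte comparison)"""
    return hashlib.sha256(data).hexdigest()


def count_distinct_samples(files):
    """copies of the same file collapse into one sample"""
    return len({sample_id(f) for f in files})


# hypothetical stand-ins for the contents of three files
original = b"MZ\x90\x00 pretend malware body"
copy_of_original = bytes(original)       # byte-for-byte copy
one_byte_off = original[:-1] + b"Y"      # differs in a single byte

distinct = count_distinct_samples([original, copy_of_original, one_byte_off])
```

three files go in but only two samples come out, since the copy doesn't count as a new sample while the file with a single changed byte does...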

a variant is a member of a malware family... one variant is distinguished from another by being sufficiently different from that other variant that it makes sense to consider it a different thing, but that is a subjective determination usually made in the context of what makes sense for a particular detection technology... there is no objective, technology-neutral standard by which we can say sample X represents a different variant than sample Y, therefore the distinction between variants is a technology-dependent logical construct... as an example, for a scanning engine that looks at every single byte of a piece of malware it would make sense to be a lot more granular in drawing distinctions between variants than for one that only looks at no more than N bytes somewhere within the body of the malware... more generally, technologies that use fundamentally different approaches to malware recognition can likewise result in different delineations between variants... this is one of the reasons why they say each vendor counts viruses/malware differently...
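to make the granularity point concrete, here's a toy python sketch of two imaginary engines counting "variants" in the same set of malware bodies... everything in it is an assumption for illustration (the view functions, the offset and window size, the byte strings) and no real engine is claimed to work this way...

```python
def full_body_view(body):
    """hypothetical engine that considers every byte of the malware"""
    return body


def windowed_view(body, offset=4, n=8):
    """hypothetical engine that only considers n bytes at a fixed offset"""
    return body[offset:offset + n]


def count_variants(bodies, view):
    """bodies that look identical through an engine's view count as one variant"""
    return len({view(b) for b in bodies})


# made-up malware bodies; x and y differ only outside the 8-byte window
x = b"HEADdecrypt!payload-one"
y = b"HEADdecrypt!payload-two"
z = b"HEADdifferent-body-here"

fine_grained = count_variants([x, y, z], full_body_view)
coarse_grained = count_variants([x, y, z], windowed_view)
```

the engine that sees every byte draws three distinctions while the windowed one only draws two, even though the input is identical... neither count is wrong, they just reflect what each technology is able to tell apart...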

signatures are chunks of data that tell a scanner what a particular piece of malware looks like... they can be a selection of bytes from the actual malware (ie. a scan string), a hash value (for non-parasitic, non-polymorphic malware), a regular expression, or something written in a proprietary virus(malware) description language... the conceptual ideal is that there is a 1-to-1 correlation between signatures and variants but in practice multiple signatures may be required for a single variant (as has sometimes been the case for handling polymorphism) or alternatively a single signature might handle multiple variants or even an entire family...
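here's a small python sketch of three of those signature types (scan string, hash, and regular expression)... the byte strings and signature values are invented for the example and real scanners are considerably more involved, so treat it as an illustration of the idea rather than how any product does it...

```python
import hashlib
import re


def matches_scan_string(data, scan_string):
    """scan string: a selection of bytes expected somewhere in the malware"""
    return scan_string in data


def matches_hash(data, known_digest):
    """hash: only matches a file whose bytes are exactly the known ones"""
    return hashlib.sha256(data).hexdigest() == known_digest


def matches_regex(data, pattern):
    """regular expression: can tolerate some variation in the bytes"""
    return re.search(pattern, data, re.DOTALL) is not None


# made-up malware bodies for two hypothetical variants of one family
variant_1 = b"\x90\x90EVIL-v1\x00payload"
variant_2 = b"\x90\x90EVIL-v2\x00payload"

scan_sig = b"EVIL-v1"                              # covers variant 1 only
hash_sig = hashlib.sha256(variant_1).hexdigest()   # covers variant 1 only
family_sig = rb"EVIL-v\d"                          # covers both variants

family_hits = [matches_regex(v, family_sig) for v in (variant_1, variant_2)]
```

the scan string and hash each cover just the one variant, while the single regular expression covers both, which is the "one signature for multiple variants" case...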

now obviously, given the above descriptions, comparing samples to signatures or variants doesn't make a lot of sense, but a variant-centric focus when evaluating a test of anti-malware products is not altogether unfamiliar... in a detection test it's important to make sure you include all (or as many as possible) significant members of the malware population and also to make sure that no member is included more than any other so that detection of each is given equal weight/importance... back in the days when viruses ruled the earth it was actually necessary to try and look at things from a variant-centric perspective (subjective as it was) because of how viruses worked... the way file infecting viruses operated, it was trivial to imagine many samples containing not just the same variant of a virus but instances that were all offspring of a common parent... this made it necessary to try to tag samples with variant id's in an effort to enforce good coverage and uniformity...

these days the kind of complications that parasitic self-replication creates in malware detection tests are much less of an issue because malware exhibiting those properties makes up a much smaller proportion of the malware population... malware that exists as a stand-alone file is now the norm so the simpler sample-centric approach (at least for the non-viral samples) becomes more reasonable...

one potential complication that might still make a variant-centric approach necessary is packed malware or more generally malware incorporating server-side polymorphism... at least that's an argument one might bring up (and i think packed malware was mentioned in the thread) but i think there's a stronger argument to be made for server-side polymorphism in the general case producing distinctly new variants in spite of the common root malware that they all came from... this is because the resulting malware is the product of both relatively old malware (though the root malware may not technically qualify as known) and unknown and unguessable transformations... likewise with packing in particular, not all scanning engines deal with repacked malware equally well (even though we know many packers, the packers themselves can be altered to produce something we couldn't have guessed) so arguing that many samples created from the same root piece of malware are all the same variant may not be reasonable in light of the subjectivity of the variant concept...

given all this, from the information available in the test results in question, there really doesn't seem to be anything all that wrong... that said, without the thorough supporting documentation, we can't really verify that things are all that right either so those who really can and do read more than just the executive summary are stuck taking the results of this test with a grain of salt (in addition to the grain of salt that goes with any detection test taken in isolation, since the relative effectiveness of various products can change pretty quickly over a short time)...

(and anton will just have to believe what he wants to believe about whether the results that seem worthy of ridicule point to a cherry-picked test bed or not - which would be some feat, considering its size)

4 comments:

Anonymous said...

What you're saying about variants isn't quite correct. There is an objective definition of what constitutes a different variant. If two malware programs differ in at least one bit in the non-modifiable parts of their bodies, then they are different variants. Otherwise they are not. The fact that most scanners choose not to distinguish between very similar variants (either for convenience or due to ineptitude) is an entirely different issue.

kurt wismer said...

hmmm... for clarity's sake i take it to mean the parts of their bodies that the malware itself doesn't modify (as any part is modifiable by an outside force)...

that works, i like it... and it makes the question of whether packed malware constitutes a new variant a trivial "yes"...

but this brings up a more troubling question because i'm pretty sure what's used in practice deviates from the definition you've just given... a truly objective definition of variant would provide a basis on which vendors would be able to objectively and consistently count the number of distinct malware entities they detect across the industry, obviating the need for 3rd party testing except as a check to keep the vendors honest...

we know this isn't happening (and has perhaps never happened except maybe in the very early days) so i'm left wondering why the industry has collectively chosen to ignore the benefits that an available objective definition like that provides...

Matt said...

But, is that objective definition worth anything? Repacking a binary is trivial at best and would result in near infinite variants if the "objective definition" were to be used.

Also, consider the scenario shown in the image links below.

Pretend malware source code:
SRC (img)

Resultant binary after compilation command: gcc -O2 variant.c -o variant.exe
O2 disassembly (img)



Resultant malware binary after compilation command: gcc variant.c -o variant.exe
Default disassembly (img)

Should these two binaries be considered variants? The behavior is identical, the source is identical, the only difference is the presence of -O2 during compilation.

kurt wismer said...

@mb:
"Repacking a binary is trivial at best and would result in near infinite variants if the "objective definition" were to be used."

infinite variants are possible even without repacking... bit flipping is just as trivial and it too produces variants but no one argues about whether they are all the same variant or different...

"Should these two binaries be considered variants? The behavior is identical, the source is identical, the only difference is the presence of -O2 during compilation."

the fact that they're functionally identical shouldn't matter when determining if they should be considered different variants as what makes something a variant is in its executable/interpretable code, not its behaviour (and necessarily so since there are an infinite number of ways to implement the same function)...

likewise, having the same source shouldn't matter as the source is only one of the inputs in the process that creates the executable/interpretable code (compiling with different compilers, linking with different libraries, etc. all have an effect on the final binary)...

any anti-malware technology where variants are a significant issue (meaning any that tries to identify what piece of malware it's looking at) works by looking at the malware itself, not the inputs that might have gone into making the malware and not the outputs from the malware...