Thursday, May 04, 2006

automated malware classification? how cool is that?

i was quite impressed when i read halvar flake's blog post about the automated malware classification they've developed at sabre security... the basic premise is that you have a binary difference engine (some technology that can analyze two programs and determine how similar or different they are) and a large corpus (a body of work, a set of reference samples, like a kind of library) of existing binaries that have already been analyzed, classified, and named and you use the binary difference engine to compare new samples against those in the corpus to determine which one(s) in the corpus the new sample is most similar to - which in turn allows you to say with some confidence that the new sample is the same type, family, or even the same malware as the one in the corpus, depending on the accuracy of the match...
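
to make the premise a little more concrete, here's a minimal sketch of the "compare against the corpus and take the closest match" idea... to be clear, this is not how bindiff2 works (it compares disassembled code, not raw bytes) - the byte n-gram jaccard score below is just a crude stand-in for a real binary difference engine, and the names `ngrams`, `similarity`, `classify` and the 0.5 threshold are all made up for illustration...

```python
def ngrams(data: bytes, n: int = 4) -> set:
    """Return the set of all n-byte windows found in a sample."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def similarity(a: bytes, b: bytes) -> float:
    """Jaccard similarity of the two n-gram sets, from 0.0 to 1.0.

    Stand-in for a real binary difference engine (an assumption,
    not the technique halvar describes).
    """
    grams_a, grams_b = ngrams(a), ngrams(b)
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def classify(sample: bytes, corpus: dict, threshold: float = 0.5):
    """Compare a new sample against every named reference in the corpus.

    Returns (name, score) for the closest reference, or (None, score)
    when even the best match falls below the (hypothetical) threshold.
    """
    best_name, best_score = None, 0.0
    for name, reference in corpus.items():
        score = similarity(sample, reference)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


if __name__ == "__main__":
    # toy "binaries" - real samples would be read from files
    corpus = {
        "familyA": b"ABCDEFGHIJKLMNOP",
        "familyB": b"0123456789abcdef",
    }
    print(classify(b"ABCDEFGHIJKLMNOQ", corpus))
```

the score is what lets you say how confident the match is... a very high score suggests the same malware, a middling one suggests the same family, and a low one means the corpus has nothing like it...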

the idea of binary comparison technology isn't particularly new... over a decade ago zvi netiv was using a kind of binary comparison technology to identify infected executables based on a suspect sample... of course that technology is in no way comparable to the bindiff2 that halvar wrote about... zvi's ivx program would compare programs suspected of being infected by a particular virus with one known (or at least believed) to be infected by the same virus - different programs that all had a very similar chunk of code in them were thought likely to be infected by the same virus and so the end user was supposed to be able to use this to help detect files infected with previously unknown viruses... bindiff2, on the other hand, apparently leverages the reverse engineering prowess of the ida disassembler... (i kind of expected to hear about binary comparison technology like this when i read about how f-secure generates those call graphs of theirs because visual comparison seems like it could be pretty inaccurate in some cases)...

automated classification isn't particularly new either... one of the classes i took getting my undergraduate computer science degree had a project where we were to perform automated classification of natural language documents (by making comparisons with a representative corpus as above but using vectors to represent the documents and judging their similarity based on how close the vectors were to each other) and we were graded on our accuracy (it probably sounds more complicated than it really was)...
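
for the curious, the document classification approach looks roughly like the sketch below... i'm assuming plain word counts as the vectors and cosine similarity as the measure of "closeness" (the actual class project may well have used something different), and the function names and toy corpus are made up for illustration...

```python
import math
from collections import Counter


def vectorize(text: str) -> Counter:
    """Represent a document as a vector of word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_document(text: str, labelled_corpus: list) -> str:
    """Return the label of the corpus document whose vector is closest.

    labelled_corpus is a list of (label, text) pairs.
    """
    vector = vectorize(text)
    best = max(labelled_corpus,
               key=lambda item: cosine(vector, vectorize(item[1])))
    return best[0]


if __name__ == "__main__":
    corpus = [
        ("sports", "the team won the match"),
        ("cooking", "stir the soup and add salt"),
    ]
    print(classify_document("the match was won by the home team", corpus))
```

swap the word counts for features pulled out of a disassembled binary and you have the same shape of problem as the malware case...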

it's the combination of the two ideas that's really interesting and i think it's got some great potential, especially when combined with some of the other automated technologies that have been developed over the years... imagine, a new sample comes into the virus lab and the first thing that happens is it's run through this automated classification system that compares it to every other sample the company has on record... if it turns out to be similar to something already known it's fed into an automatic signature extraction system and given to a researcher to double check the findings... further, if it was automatically classified as being related to a known virus it could be run through an automatic and controlled virus execution system to determine whether or not it was a real virus or just an intended virus... something similar could also be done if it was classified as a worm...
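
the imagined workflow above could be sketched as a simple routing function... everything here is hypothetical (the step names, the `triage` function, the threshold), and the fallback to manual analysis for samples that don't match anything is my own assumption about what a lab would do...

```python
def triage(malware_type, score, threshold=0.5):
    """Decide which automated steps a freshly classified sample goes through.

    malware_type: type reported by the classifier for the closest known
                  sample (e.g. "virus", "worm", "trojan"), or None
    score:        similarity to that closest known sample, 0.0 to 1.0
    """
    # nothing similar on record: assume a human has to start from scratch
    if malware_type is None or score < threshold:
        return ["manual analysis"]

    steps = ["automatic signature extraction"]

    # self-replicating types can be run in a controlled environment to see
    # whether the sample really replicates or is just an intended virus
    if malware_type in ("virus", "worm"):
        steps.append("controlled execution test")

    # a researcher double checks whatever the automation found
    steps.append("researcher review")
    return steps


if __name__ == "__main__":
    print(triage("virus", 0.9))
    print(triage("trojan", 0.8))
    print(triage(None, 0.1))
```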

all of which makes the virus analyst's job more efficient and less tedious (because who wants to look at 1,000 different samples that all come from the same 5 families?)... virus analysts probably aren't in danger of becoming redundant anytime soon, of course, but the more efficient they get the faster the companies can react to new threats and that's good for everyone...

i also suspect the technology could aid in the CME's deconfliction process...

some people think this technology could help solve the naming problem, but i really don't agree... this technology alone will not address the real issues behind the naming problem - it will not tell a group of vendors which of them discovered a particular piece of malware first and it won't tell them what name that discoverer has already given the malware... without that information each vendor is forced to make up a name so they can make signatures available to their customers as fast as possible... that's why the naming problem exists...
