Wednesday, January 20, 2010

the myth of in-the-wild prevalence

upon reading this article at ghacks.net about scanning linux systems for viruses i became aware that there are some misunderstandings over the meaning of the term 'in the wild'.

the article in question is not the only place i've seen these misunderstandings and i don't want to knock it too hard because it does advise scanning your linux systems, but the statement that
Linux is immune to viruses right? Well…mostly. Even though a proof of concept virus has been discussed, and nothing has actually made it into the wild…you still have email on your system.
fairly clearly indicates both a lack of awareness of the threat linux faces and a lack of understanding of what constitutes 'in the wild'.

so let's get this out of the way early in the discussion. 'in the wild' literally means that the malware in question is active and victimizing someone or some group, somewhere in the real world. that seems like an obvious and natural definition, but what isn't obvious is the implication it has for most people. you see, many people equate 'in the wild' with epidemic. they think that if something were really in the wild it would have affected a lot of people and they would have seen it personally or known someone who had seen it. they think that they can use their own experience as a measure of whether something is 'in the wild' or not. the reality is that something being 'in the wild' does not mean that it is common enough for you to have stumbled across it - there is a wide spectrum of prevalence possibilities for 'in the wild' malware.

to that end, there have of course been linux viruses in the wild. are there still some in the wild? well, given that old viruses never really die, i'm going to have to say yes. remember, rare and 'in the wild' are not mutually exclusive concepts - something can be both at the same time. once something goes into the wild it's subsequently very difficult to conclusively show it has left the wild. in fact you could say it's equivalent to proving a negative (which, as we all know, is impossible).

(note: just to be clear, i'm not talking about the wildlist from wildlist.org. things that are on the wildlist are definitely 'in the wild' but not everything 'in the wild' gets to go on the wildlist. the wildlist is a much more narrowly defined set than what's 'in the wild')

Sunday, January 10, 2010

what's in a malware name

...that which we call conficker by any other name would taste as sour.

david harley, tom kelchner, and mary landesman have all posted their responses to an infosecurity article questioning the apparent lack of consistency in malware naming.

they all say more or less the same thing about the deluge of modern malware making harmonization of names impossible and to a certain extent they're right, but to a certain extent they're also wrong - not so much in the technical details of their answer but more in the way they're framing the problem that the infosecurity article was underlining.

now the truth is i had actually planned on writing about malware naming some time ago in response to another of david harley's articles in which he basically says malware names are irrelevant. i can see where he's coming from with that, and probably you can too. a malware detector doesn't care what the name of the malware is, only whether it's there or not - and the consumer of the malware detector generally won't care that much about the name either (certainly not whether it's the same name that all the other vendors use). in the consumer's worst case scenario all they really need is some sort of unique identifier, be it a number, a GUID, or some made up nonsense word (oh, wait, that's what they get now) in the event that they need to call up the vendor for support.

but there's a problem with this line of thinking and i'll demonstrate it with a little thought experiment. let's take all the bones in the human body and replace their current identifiers (such as scapula, ulna, radius, etc) with numbers, or GUIDs, or made up nonsense words. now try having an intelligible discussion about bones you've broken over your lifetime with someone. can you imagine how much more difficult that would be? obviously replacing their current names with the made up nonsense words would just pose difficulty in adjusting to new names but GUIDs would be far too unwieldy for people to use, and numbers would have numerical relationship baggage that would confuse the issues. let's take one more step in this thought experiment, however. let's say there are 50 different people, each with their own different set of replacement identifiers for the bones in the human body, and let's say that they collectively are trying to advise people on bone health. how well is that really going to work? not very well, obviously.

while it is true that malware today is far too numerous to harmonize the naming for each and every instance, we can't let the great become the enemy of the good. if the anti-malware world revolved exclusively around the production and consumption of malware detectors then names really would be unimportant and irrelevant, but the fact is in such a world people like david harley and tom kelchner and mary landesman wouldn't be blogging about such things because those blogs would also be irrelevant.

the thought experiment above demonstrates when names are important and why consistent names are important. names are important when you're dealing with people rather than just technology. they are important when you are trying to communicate information about threats, trends, etc. to people. people need names for things, and frankly they need to be fairly simple names - that's why storm, loveletter, and code red catch on while waledac, virut, and sality wallow in obscurity, and why people keep misspelling conficker. heck, it's why meteorologists name significant weather formations like hurricanes using human given names like harry or katrina. people also need for multiple authorities to agree on the names for things or else they can't integrate data from multiple sources and are left disoriented and confused.

again, we can't let the great (harmonizing the naming of all malware instances) become the enemy of the good (harmonizing the naming of the relative handful of malware instances the industry considers significant enough to write about in things like year-end threat reports). it may be impossible to coordinate names for each malware instance in existence and entirely pointless even if it were possible, but the same does not hold true for the small set of malware that vendors write about by name. just so we're clear, i'm not suggesting that such coordination need take place before releasing detection for the aforementioned malware. what i have in mind is something not unlike the now defunct common malware enumeration with the exception of using names instead of numbers - a post hoc harmonized second name (a common name or layman's name) for those few pieces of malware that the industry feels they need to communicate to the masses about.
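to make the idea of a post hoc common name a little more concrete, here's a minimal sketch of what such a lookup might look like. everything in it is an assumption for illustration: the table, the function name, and the handful of vendor-style detection names (simplified, with variant suffixes left off) are mine, not any real registry or vendor api.

```python
# hypothetical table: vendor-specific detection name -> agreed common name.
# the entries are illustrative only; a real list would be maintained by the
# industry after the fact, not hard-coded like this.
COMMON_NAMES = {
    "worm:win32/conficker": "conficker",
    "w32.downadup": "conficker",
    "net-worm.win32.kido": "conficker",
}


def common_name(vendor_name):
    """return the harmonized common name, or None if none was ever assigned."""
    # normalize case and surrounding whitespace so lookups are forgiving
    return COMMON_NAMES.get(vendor_name.strip().lower())


print(common_name("W32.Downadup"))
print(common_name("Trojan.Generic.12345"))
```

the point of returning None for unknown names is that most malware would never get a common name at all - only the relative handful that vendors actually write about.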

of course, after all that is said and done, even if naming were consistent i fully realize that different vendors' reports would list different sets of malware, and people still need to understand that such reports reflect not the actual threat landscape but what the vendor has seen of the threat landscape. that said, there should still be overlap between the sets of malware used by different vendors in their reports, and if there isn't, that suggests sampling bias pronounced enough to render those models of the threat landscape irrelevant.
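if names were harmonized, checking for that overlap would be simple. as a rough sketch (the function name and the sample reports are made up for illustration, and the jaccard index is just one reasonable way to measure overlap, not something any vendor is known to use):

```python
def report_overlap(report_a, report_b):
    """jaccard index of two sets of common malware names (0.0 = disjoint, 1.0 = identical)."""
    a, b = set(report_a), set(report_b)
    if not a and not b:
        return 0.0  # two empty reports: treat as no measurable overlap
    return len(a & b) / len(a | b)


# hypothetical year-end top lists from two vendors
vendor_one = {"conficker", "storm", "virut"}
vendor_two = {"conficker", "sality", "virut"}

print(report_overlap(vendor_one, vendor_two))
```

a score near zero across many pairs of vendors would be the kind of sign of sampling bias described above; without consistent names the comparison can't even be attempted.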