Monday, September 03, 2012

on exploit detection

data is code and code is data. i know that people like to think of data and code as being inherently different and separate from each other but in the end it's all just symbols in various languages on a real-world analog to the turing machine's infinite tape. fetching the next instruction, decoding it, and performing an operation based on what resulted from that decoding is not intrinsically different from fetching the next chunk of data, parsing it, and performing an operation based on what came out of that parsing. the ability to treat code as data is what allows us to distribute software, and the ability to treat data as code is what allows us to add new functionality to our general purpose computers that they weren't able to do before.

the distinction between programming languages and other input languages is simply that a programming language is intended to be used by programmers to create programs that are used by other people and other input languages are intended to be used by everybody else for (ostensibly) less complex purposes. it's really a matter of complexity, more than anything else, but it turns out that there are many input languages that aren't thought of as programming languages but are turing-complete and so are just as complex as any programming language. further, it's not just that there are two groups of languages (trivial and complex), but rather an entire spectrum.

with that in mind it's little wonder that data in the form of exploits can be just as bad as more traditional malware, but the implications go beyond just that. since data and code are not intrinsically different, since exploits are essentially malicious programs written in the input language for a vulnerable piece of software or hardware, some of the things we know to be true about malware should also hold for exploits.

specifically, the problem of deciding if an arbitrary input exploits a vulnerability seems like it should be reducible to the halting problem. i can certainly imagine trivial input languages where the presence or absence of an exploit is easily decidable, such as one where there are only 2 possible inputs, a legitimate one and one that triggers a vulnerability. however, in general, and certainly in no small part due to the existence of turing-complete languages, i'm fairly confident in saying that this problem is analogous to the virus detection problem. and THAT means that the problem of exploit detection, regardless of how it's approached in practice, is ultimately subject to the same limitations as the problem of virus detection.

to that end, testing exploit detection can run into the same methodological problems that testing virus detection can. for example, creating one's own exploits and testing against those instead of drawing exploit samples from the wild presents the same kind of problem that creating one's own viruses and testing against those would. namely that what is created in the lab does not necessarily correspond to what exists in the wild and so a product's ability to detect what was created in the lab doesn't necessarily correspond to it's ability to detect what's in the wild. certainly it's trivial to see how that would be true for products that detect known exploits using a method similar to known-malware scanning, but even for products that attempt to parse suspected exploits exactly the same as a vulnerable application would this would be the equivalent of emulating suspected viral code which we already know can't work all the time either (otherwise the virus problem would be decidable and we could then use it to solve the halting problem) so there could be exploits in the wild that elude such detection. as such we really should be regarding exploit detection testing that uses in-house exploits on equal footing as virus detection testing that uses lab-made viruses.