in response to a certain discussion on twitter, i found some ideas floating around in my head that just don't fit in a tweet, so i thought i'd share them here instead.
one of the ongoing problems on facebook is the phenomenon of viral scams - scams that spread in a viral manner across the facebook userbase and trick users into doing various things (whether it be installing a rogue facebook app, clicking an invisible link, or copy-n-pasting javascript into the URL bar).
there are at least 2 contentious aspects to this phenomenon. the first is whether facebook has things under control. there are certainly those who argue that it's not under control, that the numbers are ever-increasing. there are also those who argue that, on the whole, facebook is acting in a timely manner to deal with these threats to its users. my own experience is that i rarely encounter these viral scams, which could support the argument that they're getting killed quickly - quickly enough that they die before they make their way to me. on the other hand, when i do encounter these scams it doesn't appear to me that facebook is dealing with them expeditiously at all. a day or more to kill a viral scam campaign? really? i think they could do better - in fact, i have some specific ideas that i intend to share a little further on.
the second contentious aspect is what's the best way to deal with these scams: whether it's better for facebook to police its network more effectively or, alternatively, to go after the industry whose gray areas are responsible for the lion's share of the scamming (i.e. the cost-per-action / CPA marketing industry). technical defenses employed by facebook will always be a bit of a game of whack-a-mole, but they're relatively quick and easy to implement without involving a lot of other parties. investigation and enforcement of legal authority against CPA firms can certainly have some long-lasting effects, but there are problems: it takes a lot of time and coordination from the law enforcement community, and CPA is only the current low-hanging fruit from a malicious business model perspective. that means, ultimately, going after CPA firms or even entire industries will also wind up being a game of whack-a-mole - it'll just be much slower, and it will be law enforcement playing the game instead of a technology company.
i believe both technical defenses and legal authority are appropriate tactics. technical defenses are useful for dealing in the near term with that which has not yet been dealt with in the long term, while exercising legal authority tends to have more of a long term effect and creates a much bigger disruption to malicious business models (potentially requiring entirely new malicious business models to be developed to compensate). right now we don't yet have the benefits that legal authority can provide so we need to use technical defenses as a stop-gap at the same time as we pursue legal avenues. if/when legal authority manages to take out most of the CPA abuse channels that the scammers are currently exploiting, those scammers will monetize something else so we'll continue to need those technical defenses as we adjust our legal tactics to their new business models. in essence, technical defenses and legal authority represent compensating controls in a multi-layered approach to the problem.
to that end, i have some specific technical ideas for facebook.
all viral scams must exploit one of facebook's many communications channels. facebook needs to monitor these channels and apply some heuristics to help identify the viral scams.
to start out with, they should apply a k-nearest-neighbor algorithm or some other suitable similarity measure to the communications (do not use hashing - i hardly ever see the scams, but even in the few i have seen, i've seen hash busters being used) in a sliding window of time (to limit the size of the corpus facebook would need to analyze). messages, wall posts, events, etc. that cluster together as being highly similar are likely all part of the same viral campaign and should be classified as such. being part of a large cluster should cause the message (or rather the entire set of messages) to be flagged for additional review at the very least. if that review is manual, one could prioritize the review based on the size of the cluster. not all viral communications are bad, mind you - it could just be a really good joke, or a political activist campaign, or something else legitimate - that's why virality alone isn't enough to classify something as bad, but it is a good start for narrowing the scope of the analysis.
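to make that concrete, here's a minimal sketch of similarity-based clustering (greedy single-link clustering over word shingles with jaccard similarity - the shingle size and similarity threshold are illustrative assumptions, and a real deployment at facebook's scale would need something far more efficient, like locality-sensitive hashing over minhash signatures):

```python
def shingles(text, k=3):
    """break a message into overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    """jaccard similarity between two shingle sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def cluster_messages(messages, threshold=0.6):
    """greedy single-link clustering: a message joins the first cluster
    containing something sufficiently similar to it. hash busters only
    perturb a few shingles, so near-duplicates still cluster together."""
    clusters = []  # list of lists of (index, shingle set)
    for i, msg in enumerate(messages):
        s = shingles(msg)
        for cluster in clusters:
            if any(jaccard(s, t) >= threshold for _, t in cluster):
                cluster.append((i, s))
                break
        else:
            clusters.append([(i, s)])
    return [[i for i, _ in c] for c in clusters]
```

note that two copies of a scam message differing only in a trailing hash buster still share most of their shingles, so they land in the same cluster even though their hashes would differ completely.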
now, even if facebook stopped at this point, they'd still have something very useful for killing viral scams. identifying the set of nearly identical messages that a particular message belongs to means that you can aggregate data and judgments about those messages. an abuse report for one message is an abuse report for all, a flag to disable display of one message is a flag to disable display of all, a flag indicating one message was verified clean is a flag indicating all are verified clean, etc.
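the aggregation step is simple enough to sketch in a few lines, assuming each message has already been assigned a cluster id by some clustering step like the one above:

```python
def apply_verdict(message_to_cluster, cluster_verdicts, message_id, verdict):
    """an abuse report (or a clean bill of health) for one message
    becomes a verdict for its entire cluster."""
    cluster_verdicts[message_to_cluster[message_id]] = verdict

def lookup_verdict(message_to_cluster, cluster_verdicts, message_id):
    """any message inherits whatever verdict its cluster has received."""
    return cluster_verdicts.get(message_to_cluster[message_id])
```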
if facebook were to go further, though, the next thing they could do as a simple static heuristic is to check to see if the message contains javascript code that is displayed to the user. most users cannot read or understand javascript. for most, the only reason they'd see that in a viral message is if they're supposed to copy and paste it in order to bypass facebook's existing defenses against malicious javascript. that makes it a pretty good indicator of malicious intent.
another simple heuristic would be the presence of a link whose destination is obscured by a URL shortener. a much more contentious heuristic, of course, but i'm not really suggesting any of these on their own should be taken as proof of malice - only that they each incrementally increase the level of suspicion.
a third simple heuristic would be links to facebook applications where the app developers haven't been registered very long. it's not impossible to hit it big on facebook right out of the gate, but there are nuances that aid in growing an application's popularity that you wouldn't really expect a newbie to know right off the bat. a really big viral cluster for a really new app developer should definitely raise some eyebrows.
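the three static heuristics above could be combined into a single suspicion score along these lines (the javascript pattern, the shortener list, the weights, and the 30-day cutoff are all illustrative assumptions on my part, not anything facebook actually uses):

```python
import re

# visible javascript in a message body is a strong signal on its own
JS_PATTERN = re.compile(r"javascript:|document\.|eval\(", re.IGNORECASE)

# illustrative list of url shortener hosts
SHORTENERS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co"}

def suspicion_score(message_text, link_hosts, developer_age_days=None):
    """each heuristic adds to a suspicion score; none alone is proof
    of malice, but together they narrow the field for review."""
    score = 0
    if JS_PATTERN.search(message_text):
        score += 3  # users are likely being told to copy-n-paste it
    if any(host in SHORTENERS for host in link_hosts):
        score += 1  # the destination is being obscured
    if developer_age_days is not None and developer_age_days < 30:
        score += 2  # very new app developer behind a viral cluster
    return score
```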
next, as an active heuristic, an automated process attached to a dummy facebook account could be sent to click its way through the trail laid out in any message that contains a link, in order to see if any path results in the dummy account sending out a message belonging to the same viral campaign it started from. this means adding applications where requested, following links to outside pages, even pasting the contents of the clipboard into the URL bar when the trail leads back to facebook's domain.
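the active heuristic can be sketched as a simulation: here the trail is modeled as a simple lookup table of steps (in reality each step would be a real click, app install, or clipboard paste performed by a sandboxed dummy account), and the question is which campaigns the account ends up re-broadcasting:

```python
def follow_trail(start_link, actions, classify_campaign, max_steps=10):
    """simulate a dummy account walking the trail behind a message link.
    `actions` maps each step to (next step, outbound message or None);
    `classify_campaign` maps a message back to a campaign id, e.g. via
    the clustering sketched earlier. returns the campaigns the dummy
    account was tricked into spreading."""
    step, sent = start_link, []
    for _ in range(max_steps):  # bound the walk so loops can't run away
        nxt = actions.get(step)
        if nxt is None:
            break
        next_step, outbound = nxt
        if outbound is not None:
            sent.append(classify_campaign(outbound))
        step = next_step
    return sent
```

if the returned list contains the campaign the starting link belonged to, the trail is self-propagating and the whole cluster can be flagged.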
finally, recognizing that different viral clusters could be related, there needs to be a mapping between the inputs and outputs of those dummy accounts so that facebook can catch the condition where an incoming message from viral campaign A leads to an outgoing message from viral campaign B, which in turn goes through 0-N other viral campaigns as inputs and outputs before arriving back at campaign A.
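that input/output mapping is just a directed graph over campaigns, and catching the A-to-B-and-back condition is cycle detection, which could be sketched like so (assuming the edges have been observed by the dummy accounts):

```python
def find_campaign_cycle(edges, start):
    """depth-first search from `start` over the campaign graph.
    `edges` maps a campaign id to the set of campaign ids its trail
    was observed to emit. returns a cycle back to `start` as a list
    of campaign ids if one exists, else None."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt == start:
                return path + [start]  # closed the loop back to start
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return None
```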
i don't know what methods facebook is using right now, but if they were aggregating nearly identical messages it shouldn't have taken a day or more to kill the last viral scam i encountered and it shouldn't still be possible to easily find something like this half a month later:
those may be neutered to the extent that they can't contribute to viral spread in facebook's environment anymore, but who knows what the pages those point to could be changed to do to people's PCs now that the main campaign is dead. those events should have been deleted a long time ago. leaving remnants of viral scams in people's accounts is a little like leaving remnants of viral code in disinfected programs.
2 comments:
Hi Kurt,
Great post. Good follow up to our discussion.
My first reaction is that many such heuristics could yield numerous false positives.
As you note, there's (plenty of) legitimate content that spreads virally on Facebook; activist status messages, jokes, memes, marketing campaigns, et cetera. All of them can spread “virally” across social graphs. That's kind of the point. Also, it evolves over time, which makes defining good heuristics a moving target.
But... I do like the “active” heuristic idea. It makes me think of something.
Rather than some bot account/automation actively clicking through trails of posted links looking for patterns (reactive), Facebook could instead proactively crawl/upstream (sandbox) links as they are posted to a user's Wall. If a link exits Facebook and then redirects back in, it could be blocked. In my research, only spammers use short URLs and redirectors to share Facebook pages and applications.
Legitimate users share with their friends using Facebook's “Share” feature or else a direct link to the Facebook page.
I can think of ways to get around this limitation, such as hosting a page which social-engineers the user into manually returning to Facebook... but any such additional overhead is likely to have a strong impact on the spam's viral potential.
When I've looked into the statistics, only about 10% of people do what the spammers request, and only about 10% of those actually fill out a CPA survey. One additional layer could effectively disincentivize the entire process.
"My first reaction is that many such heuristics could yield numerous false positives."
individually, yes. the more you combine them, the fewer false positives there should be.
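as a back-of-the-envelope illustration (the rates are made up, and real heuristics won't be fully independent, so treat this as an upper bound on the improvement):

```python
# illustrative per-heuristic false-positive rates on clean traffic
fp_rates = [0.05, 0.10, 0.02]

# if the heuristics were independent, requiring all of them to fire
# multiplies the individual rates together
combined = 1.0
for rate in fp_rates:
    combined *= rate

print(combined)  # 5% x 10% x 2% -> roughly 0.01% of clean messages
```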
more than that, however, i wasn't envisioning 100% automated system for eliminating the scams but rather an automated system for presenting a human analyst with suspicious viral campaigns.
that's why i called them heuristics - heuristics don't identify badness, they're just criteria upon which we base suspicion. some systems may treat that suspicion over-zealously, but that's particular to those systems.
"Facebook could instead proactively crawl/upstream (sandbox) links as they are posted to a user's Wall."
well they COULD do that, but i think doing it for all messages as they're posted might be too taxing on the system.