IT'S A SMALL (SPAM) WORLD, AFTER ALL
Terry Sullivan, QAQD.com
One of the most hotly contested pieces of conventional wisdom regarding spam centres on the number of unique spam sources. Some authorities assert that "90 per cent of spam comes from 200 spam operations" [1], while other authors decry this as the "#1 Myth" regarding spam [2].
Since few spammers 'sign' their messages, it is impossible to achieve an accurate count of how many spammers are active at any given moment. However, the number of active spammers is at least broadly estimable based on approximately valid spammer-surrogates, such as the URLs contained within spam messages. Although such an estimate is necessarily indirect, and therefore somewhat imprecise, it is sufficient basis for a serviceable approximation.
Somewhat more precise, and potentially much more interesting, is the question of how spam is distributed among an arbitrary number of unique sources. An uneven distribution pattern would imply that, even if the number of spammers were 'large' (arbitrarily defined), then a disproportionately large amount of spam must necessarily originate from a disproportionately small number of sources.
Thus it's possible, in some sense, to idealize the problem, and imagine two extreme scenarios regarding the origins of spam. At one extreme, the distribution of spam is approximately uniform, with each spammer accounting for a roughly equal percentage of total spam sent. In this scenario, having a precise estimate regarding the number of spammers is critical to any analysis. At the other extreme, a comparatively small number of highly prolific spammers accounts for the bulk of spam received, thus mitigating the need for a precise estimate regarding their exact number.
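The two extreme scenarios can be made concrete with a small sketch. The figures below are illustrative assumptions, not data from the article: 300 hypothetical sources, with the 'skewed' case assuming that each source's volume is proportional to 1/rank.

```python
def top_share(weights, k):
    """Fraction of total volume contributed by the k largest weights."""
    ordered = sorted(weights, reverse=True)
    return sum(ordered[:k]) / sum(ordered)

n_sources = 300  # hypothetical number of spam sources (an assumption)

# Extreme 1: every source sends the same amount of spam.
uniform = [1.0] * n_sources

# Extreme 2: volume falls off as 1/rank (an assumed, idealized skew).
skewed = [1.0 / rank for rank in range(1, n_sources + 1)]

for label, weights in (("uniform", uniform), ("skewed", skewed)):
    share = top_share(weights, 30)
    print(f"{label}: the top 30 sources send {share:.0%} of all spam")
```

In the uniform case the top tenth of sources sends a tenth of the spam, so the total head-count matters a great deal. In the skewed case the same thirty sources send well over half of it, and the length of the tail matters far less.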
Ultimately, three distinct lines of evidence all strongly support the conclusion that the number of unique spam sources is indeed relatively small (numbering in the hundreds, not thousands) and that only a few dozen spam sources account for the vast majority of spam worldwide.
URL FREQUENCY
One of the more intriguing recent entries into the anti-spam movement is Jeff Chan's Spam URI Realtime Block List (SURBL) [3]. Among its various data sources, SURBL extracts domain names from URLs contained in messages submitted by SpamCop users. Any domain with at least 20 individual reports is included in the SURBL block list. The inherent diversity of reporting sources helps both to ensure breadth of coverage and to minimize systematic sampling bias. Recently, SURBL has initiated a 'rollup' procedure that all but eliminates the effects of spammers' inclusion of random subdomains within the URL.
For analytical purposes, some 326 domain names meeting the SURBL inclusion criteria during a four-day 'window' in the second week of June 2004 were examined. The most striking thing about the distribution of domains is that it is profoundly nonuniform. The distribution of spam domains exhibits a marked power-law characteristic. Power-law distributions (most often known as Zipf or Pareto distributions) share one feature in common: the product of frequency (in this case, spam volume) and rank (most-prolific to least-prolific) is approximately constant. A log-log plot of frequency-by-rank describes an approximately straight line.
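One way to check for the straight-line signature described above is to fit a least-squares line to log(frequency) against log(rank). The sketch below is a minimal illustration, not the analysis actually used in the article. The counts are synthetic (frequency = 6000 / rank for 326 ranks), and the function name `loglog_slope` is a hypothetical helper.

```python
import math

def loglog_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    Counts must be positive; they are ranked from most to least frequent.
    """
    ordered = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(count) for count in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic report counts, NOT SURBL data: rank * frequency is constant.
synthetic = [6000 / rank for rank in range(1, 327)]
slope = loglog_slope(synthetic)
print(f"fitted log-log slope: {slope:.3f}")
```

When the product of rank and frequency is exactly constant, the fitted slope is -1. Real report counts would scatter around a line, and a slope far from -1 or a visibly curved plot would argue against the simple rank-times-frequency rule.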
'SMALL-WORLD' PATTERNS
It is not surprising that 'small world' patterns are commonplace in spammer behaviour and activities. Consider as an example: domains A through M are all bulk registered on the same day, through the same registrar, by the same registrant, and all are subsequently used to advertise a single product. The obvious (and almost certainly correct) inference is that all of this activity originates from a single spammer. Just a few days later, domains N through Z are similarly bulk registered together (perhaps via a different registrar) and all are used to advertise a different product. Taken at face value, this pattern suggests a maximum of two spammers.
But now imagine that the 'whois' data for the registrant of domains N-Z points to an email address in domain_A. Even if the address itself is bogus, the use of domain_A requires knowledge of the existence of domain_A. Thus, two highly clustered but seemingly unconnected sub-nets 'join' to form a single, interconnected whole. Such examples of connectedness between clustered nodes are common, even ubiquitous, among domain names used in spam.
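The 'joining' of sub-nets in this example amounts to finding connected components in a graph of domains. The sketch below uses a union-find structure. The domain names and links are hypothetical stand-ins for the example above, and how links would be extracted from real registration data is left as an assumption.

```python
from collections import defaultdict

def cluster_domains(links):
    """Group domains into connected components from (domain, domain) links."""
    parent = {}

    def find(domain):
        parent.setdefault(domain, domain)
        while parent[domain] != domain:
            parent[domain] = parent[parent[domain]]  # path halving
            domain = parent[domain]
        return domain

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    groups = defaultdict(set)
    for domain in list(parent):
        groups[find(domain)].add(domain)
    return list(groups.values())

# Hypothetical links: domains registered together in the same batch.
batch_links = [
    ("domain_a", "domain_b"), ("domain_b", "domain_c"),  # first batch
    ("domain_n", "domain_o"), ("domain_o", "domain_p"),  # second batch
]
# Hypothetical cross-link: whois data for domain_n cites an address at domain_a.
whois_link = [("domain_n", "domain_a")]

print(len(cluster_domains(batch_links)))               # clusters before the whois link
print(len(cluster_domains(batch_links + whois_link)))  # clusters after the whois link
```

Taken alone, the batch registrations give two clusters. The single whois cross-link is enough to merge them into one, which is the effect the example describes.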
FEATURE EVOLUTION
A sudden, wholesale shift in spam features is utterly incompatible with a large number of spam sources. It is difficult to imagine how a large number of spam sources would come to independently and simultaneously alter their tactics in virtually identical ways. However, such a wholesale shift is entirely consistent with a relatively small number of spammers, and the observed power-law distribution of spam origin. Stated differently, if a few spammers are responsible for the majority of spam, then a shift in tactics among a few individuals will inevitably result in a sudden, large variance in spam features.
CONCLUSION
These results have specific implications for the fight against spam. The uneven distribution characteristic of spam suggests that great benefits may be obtainable from tightly focused anti-spam efforts that specifically target the most prolific sources of spam. Technologically, these results help to explain the disproportionate success of a computationally simple project such as SURBL, and suggest that similarly focused legal remedies, whether civil or criminal in nature, may also prove effective.
Finally, these results illuminate potentially fruitful avenues for technology R&D efforts. In particular, robust author-identification technologies have the potential to provide broad support to both technical and forensic efforts in the fight against spam.
References
(Terry Sullivan, It's A Small (Spam) World, After All, Virus Bulletin, July 2004) Copyright is held by Virus Bulletin Ltd.; made available on this site for personal use free of charge by permission of Virus Bulletin. This work may not be reproduced or redistributed without express permission from the copyright holder.