IT'S A SMALL (SPAM) WORLD, AFTER ALL
Terry Sullivan, QAQD.com

One of the most hotly contested pieces of conventional wisdom regarding spam centres on the number of unique spam sources. Some authorities assert that "90 per cent of spam comes from 200 spam operations" [1], while other authors decry this as the "#1 Myth" regarding spam [2].

Since few spammers 'sign' their messages, it is impossible to achieve an accurate count of how many spammers are active at any given moment. However, the number of active spammers is at least broadly estimable based on approximately valid spammer-surrogates, such as the URLs contained within spam messages. Although such an estimate is necessarily indirect, and therefore somewhat imprecise, it is a sufficient basis for a serviceable approximation.

Somewhat more precise, and potentially much more interesting, is the question of how spam is distributed among an arbitrary number of unique sources. An uneven distribution pattern would imply that, even if the number of spammers were 'large' (arbitrarily defined), a disproportionately large amount of spam must necessarily originate from a disproportionately small number of sources.

Thus it's possible, in some sense, to idealize the problem, and imagine two extreme scenarios regarding the origins of spam. At one extreme, the distribution of spam is approximately uniform, with each spammer accounting for a roughly equal percentage of total spam sent. In this scenario, having a precise estimate regarding the number of spammers is critical to any analysis. At the other extreme, a comparatively small number of highly prolific spammers accounts for the bulk of spam received, thus mitigating the need for a precise estimate regarding their exact number.

Ultimately, three distinct lines of evidence all strongly support the conclusion that the number of unique spam sources is indeed relatively small (numbering in the hundreds, not thousands) and that only a few dozen spam sources account for the vast majority of spam worldwide.

URL FREQUENCY
Most spam messages are directed at motivating the recipient to visit a website to place an order. Thus, the URLs advertised in those messages, and the domain names in particular, serve as a potentially fruitful and approximately valid surrogate for spam source. While no one expects an exact one-to-one correspondence between domain name and spammer, it is reasonable to expect the distribution of 'spamvertised' URLs to mirror approximately the distribution of spammers.

One of the more intriguing recent entries into the anti-spam movement is Jeff Chan's Spam URI Realtime Block List (SURBL) [3]. Among its various data sources, SURBL extracts domain names from URLs contained in messages submitted by SpamCop users. Any domain with at least 20 individual reports is included in the SURBL block list. The inherent diversity of reporting sources helps both to ensure breadth of coverage and to minimize systematic sampling bias. Recently, SURBL has initiated a 'rollup' procedure that all but eliminates the effects of spammers' inclusion of random subdomains within the URL.
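The counting procedure described above can be sketched roughly as follows. This is an illustrative sketch only, not SURBL's actual implementation: the URL pattern, the function names, and the two-label 'rollup' rule are all assumptions made for the example (a real rollup would need to handle suffixes such as .co.uk), and each message is assumed to count as one report per domain. Only the threshold of 20 reports comes from the text.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical URL pattern for the sketch; real spam needs far more care.
URL_PATTERN = re.compile(r"https?://[^\s<>\"'\[\]]+", re.IGNORECASE)
REPORT_THRESHOLD = 20  # inclusion threshold quoted in the article


def rollup(hostname):
    """Naive rollup (assumption): keep only the last two labels of the host."""
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def domains_in_message(body):
    """Return the set of rolled-up domains advertised in one message body."""
    found = set()
    for url in URL_PATTERN.findall(body):
        host = urlparse(url.rstrip(".,;:!?)")).hostname
        if host:
            found.add(rollup(host))
    return found


def build_block_list(messages, threshold=REPORT_THRESHOLD):
    """Count each domain once per reported message; keep those at or above threshold."""
    counts = Counter()
    for body in messages:
        counts.update(domains_in_message(body))
    return {domain: n for domain, n in counts.items() if n >= threshold}
```

Because random subdomains collapse onto a single rolled-up name, a spammer who varies the host portion of the URL in every message still accumulates reports against one entry.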

For analytical purposes, some 326 domain names meeting the SURBL inclusion criteria during a four-day 'window' in the second week of June 2004 were examined. The most striking thing about the distribution of domains is that it is profoundly nonuniform. The distribution of spam domains exhibits a marked power-law characteristic. Power-law distributions (often known as Zipf or Pareto distributions) share a common feature: frequency (in this case, spam volume) falls off as a power of rank (most-prolific to least-prolific), so that in the classic Zipf case the product of frequency and rank is approximately constant. A log-log plot of frequency-by-rank therefore describes an approximately straight line.
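One way to check for the straight-line signature is to fit a least-squares slope to log(frequency) against log(rank). The sketch below is a minimal illustration, assuming strictly positive counts. The two data sets are synthetic and are NOT the SURBL sample; only the figure of 326 domains is borrowed from the text, and the function name is hypothetical.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Assumes every frequency is strictly positive.
    """
    ordered = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(freq) for freq in ordered]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic illustrations only (not the SURBL data):
zipf_counts = [10000 / rank for rank in range(1, 327)]  # rank * frequency is constant
uniform_counts = [100.0] * 326                          # every source sends the same amount

zipf_slope = loglog_slope(zipf_counts)        # close to -1
uniform_slope = loglog_slope(uniform_counts)  # close to 0
```

A slope near -1 corresponds to the classic Zipf case, while a slope near zero corresponds to the hypothetical uniform distribution.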

Figure: Log-log plot of SURBL data, showing a roughly linear relationship between frequency and rank

The accompanying figure shows just such a log-log plot of the SURBL data, confirming an approximately linear relationship between frequency and rank. While that relationship is not perfectly linear, the distribution of spam URLs in no way resembles a hypothetical uniform distribution (shown in the figure as a dotted line). In this sample, barely five per cent of the URLs account for over 25 per cent of spam messages reported, and less than 20 per cent of the URLs, numbering just a few dozen in all, account for over half the total spam. To the extent that many of these domain names are simple variations on a single 'root' name, the number of actual spammers is almost certainly smaller still.
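Concentration figures of this kind (the share of reports attributable to the top few per cent of domains) can be computed along the following lines. This is a hedged sketch: the function name is hypothetical, and the sample counts are synthetic stand-ins, not the SURBL figures, so the shares they produce will not match the percentages quoted above.

```python
def share_of_top(frequencies, top_fraction):
    """Fraction of total volume accounted for by the most prolific sources.

    top_fraction is the proportion of sources to include (e.g. 0.05 for 5%);
    at least one source is always included.
    """
    ordered = sorted(frequencies, reverse=True)
    k = max(1, round(top_fraction * len(ordered)))
    return sum(ordered[:k]) / sum(ordered)


# Synthetic illustrations only (not the SURBL data):
skewed = [10000 / rank for rank in range(1, 327)]  # idealized Zipf-like counts
uniform = [100.0] * 326                            # every domain reported equally

skewed_top5 = share_of_top(skewed, 0.05)    # far larger than 5 per cent
uniform_top5 = share_of_top(uniform, 0.05)  # roughly 5 per cent
```

Under a uniform distribution, the top five per cent of domains can never account for more than about five per cent of reports, which is why the observed concentration is informative.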

'SMALL-WORLD' PATTERNS
When two strangers meet at a party, and discover during the course of conversation that they share a friend in common, it is commonplace for one or both to exclaim, "Wow, small world!" In reality, there is a less exotic explanation: the two 'strangers' are members of a common extended social network (which explains how they both came to be invited to the same party). Discovering 'small world' connections within such locally-organized networks is dramatically more likely than within a network of truly random connections.

It is not surprising that 'small world' patterns are commonplace in spammer behaviour and activities. Consider as an example: domains A through M are all bulk registered on the same day, through the same registrar, by the same registrant, and all are subsequently used to advertise a single product. The obvious (and almost certainly correct) inference is that all of this activity originates from a single spammer. Just a few days later, domains N through Z are similarly bulk registered together (perhaps via a different registrar) and all are used to advertise a different product. Taken at face value, this pattern suggests a maximum of two spammers.

But now imagine that the 'whois' data for the registrant of domains N-Z points to an email address in domain_A. Even if the address itself is bogus, the use of domain_A requires knowledge of the existence of domain_A. Thus, two highly clustered but seemingly unconnected sub-nets 'join' to form a single, interconnected whole. Such examples of connectedness between clustered nodes are common, even ubiquitous, among domain names used in spam.
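The 'joining' of sub-nets described above can be modelled with a simple union-find (disjoint-set) structure. The sketch below is an illustration under stated assumptions rather than a description of any tool used in the study: the record fields ('domain', 'registrant', 'contact_email') are hypothetical, and only two linking rules are applied, namely a shared registrant and a contact address hosted at another domain in the set.

```python
def find(parent, item):
    """Return the representative of item's cluster (with path halving)."""
    while parent[item] != item:
        parent[item] = parent[parent[item]]
        item = parent[item]
    return item


def union(parent, a, b):
    """Merge the clusters containing a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def cluster_domains(records):
    """Group domains into clusters; records use hypothetical whois-like fields."""
    parent = {r["domain"]: r["domain"] for r in records}
    first_by_registrant = {}
    for r in records:
        # Rule 1: domains registered by the same registrant belong together.
        key = r["registrant"]
        if key in first_by_registrant:
            union(parent, first_by_registrant[key], r["domain"])
        else:
            first_by_registrant[key] = r["domain"]
        # Rule 2: a contact address at another known domain links the two,
        # even if the address itself is bogus.
        email_domain = r["contact_email"].rsplit("@", 1)[-1].lower()
        if email_domain in parent:
            union(parent, email_domain, r["domain"])
    clusters = {}
    for domain in parent:
        clusters.setdefault(find(parent, domain), set()).add(domain)
    return list(clusters.values())
```

With only the registrant rule in play, two bulk registrations look like two spammers; a single contact address pointing from one batch into the other is enough to merge them into one cluster.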

FEATURE EVOLUTION
The 2004 MIT Spam Conference included an empirical study of the evolution of spam features [4]. Perhaps the most striking result of that study is that changes to spam features strongly resemble 'punctuated equilibrium' from evolutionary biology. When analysed in aggregate, spam features remain remarkably consistent for months at a time, but then are subject to sudden and dramatic change. (These results have since been replicated multiple times, with different feature sets, different time windows, and different test corpora. Although small fluctuations in the exact values of the observations inevitably occur, the 'punctuated equilibrium' phenomenon remains unchanged.)

A sudden, wholesale shift in spam features is utterly incompatible with a large number of spam sources. It is difficult to imagine how a large number of spam sources would come to independently and simultaneously alter their tactics in virtually identical ways. However, such a wholesale shift is entirely consistent with a relatively small number of spammers, and the observed power-law distribution of spam origin. Stated differently, if a few spammers are responsible for the majority of spam, then a shift in tactics among a few individuals will inevitably result in a sudden, large variance in spam features.
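The argument can be illustrated with a toy model. The sketch below is not the method of the study cited above; it simply assumes 300 hypothetical sources that each adopt a new tactic on a different day, and compares the largest one-day change in the share of spam carrying that tactic when volumes are uniform and when they follow an idealized Zipf-like pattern. All names and numbers are assumptions chosen for the example.

```python
def daily_feature_share(volumes, switch_days, n_days):
    """Share of total spam carrying the new feature on each day.

    volumes[i] is the daily volume of source i; switch_days[i] is the
    first day on which source i uses the new feature.
    """
    total = sum(volumes)
    shares = []
    for day in range(n_days):
        switched = sum(v for v, d in zip(volumes, switch_days) if d <= day)
        shares.append(switched / total)
    return shares


def largest_daily_jump(shares):
    """Largest day-to-day increase in the feature share."""
    return max(later - earlier for earlier, later in zip(shares, shares[1:]))


N_SOURCES = 300                                # hypothetical number of sources
switch_days = list(range(1, N_SOURCES + 1))    # one source switches per day

uniform_volumes = [1.0] * N_SOURCES
zipf_volumes = [1.0 / rank for rank in range(1, N_SOURCES + 1)]

uniform_jump = largest_daily_jump(
    daily_feature_share(uniform_volumes, switch_days, N_SOURCES + 1))
zipf_jump = largest_daily_jump(
    daily_feature_share(zipf_volumes, switch_days, N_SOURCES + 1))
```

In this toy setting the uniform case can only drift by one three-hundredth of total volume per day, whereas in the skewed case a single prolific source changing tactics moves the aggregate by a visibly large step.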

CONCLUSION
Arguably, none of these lines of evidence alone is sufficient to support a definitive inference regarding the number of spammers. When taken together, however, these three sets of data converge on a single conclusion: that the number of spammers worldwide is at most a few hundred, and most spam originates from a maximum of a few dozen highly prolific sources. For the observed results to reflect a uniform distribution among a large number of sources, it would be necessary to posit extraordinary coordination (and amazingly effective enforcement) among a diverse, diffuse, globally distributed group. These results are consistent with the patterns predicted by an uneven distribution of spam, originating from a relatively small number of unique sources.

These results have specific implications for the fight against spam. The uneven distribution characteristic of spam suggests that great benefits may be obtainable from tightly focused anti-spam efforts that specifically target the most prolific sources of spam. Technologically, these results help to explain the disproportionate success of a computationally simple project such as SURBL--and suggest that similarly focused legal remedies, whether civil or criminal in nature, may also prove effective. Finally, these results illuminate potentially fruitful avenues for technology R&D efforts. In particular, robust author-identification technologies have the potential to provide broad support to both technical and forensic efforts in the fight against spam.

References

  1. Registry of Known Spam Operations
    http://www.spamhaus.org/rokso/

  2. The 10 Biggest Spam Myths
    http://www.clickz.com/experts/brand/buzz/article.php/3112021

  3. Spam URI Realtime Block List
    http://www.surbl.org

  4. The Myth of Spam Volatility
    http://www.qaqd.com/research/mit04sum.html


(Terry Sullivan, It's A Small (Spam) World, After All, Virus Bulletin, July 2004)
Copyright is held by Virus Bulletin Ltd.; made available on this site for personal use free of charge by permission of Virus Bulletin. This work may not be reproduced or redistributed without express permission from the copyright holder.