The Myth of Spam Volatility
One of the most commonly held beliefs about spam is that it is volatile--that spam and the tactics used by spammers are subject to rapid, unpredictable change. Understanding spam volatility is crucial to creating and evaluating effective anti-spam solutions. If spam is truly volatile, then robust spam identification technologies need to respond and adapt to that rapid, random change. But to the extent that spam is actually relatively stable, constantly-shifting solutions are just as likely to result in less accurate, not more accurate, spam identification. Coming up with a rigorous and reliable answer to the question of spam volatility presents several challenges, some mechanical, others methodological. It's not enough to simply "eyeball" a collection of spam and say, "Boy, that sure looks volatile to me!" Specifically, a substantive and meaningful answer to questions regarding spam volatility requires:
Requirement 1: Historical spam data
Because historical spam collections tend to be profoundly personal (even idiosyncratic) in nature, the task of finding a "noise-free" historical spam archive is a formidable challenge. The data analyzed in the present study consist of almost 2,500 spam messages sampled from 8 different domains over a period of 2.5 years (from April 2001 through September 2003, inclusive).
Requirement 2: A meaningful measure of spamminess
In order to measure changes in "spamminess," it's critical to have a precise, consistent measurement for deciding what constitutes "spamminess" in the first place. One way to measure spamminess is to represent the messages as a set of spam "features." (A feature is simply a fancy way of saying "characteristic," and a feature set is the list of all of the characteristics associated with all the spams in the test set.)

So, examples of spam features might include things like "discusses organ enlargement," or "falsely claims to come from Yahoo." Every individual spam (and all the spams together) can be represented solely in terms of these features. (In this particular instance, the feature set used to represent spam data consisted of approximately 300 of the filtering rules from SpamAssassin 2.4. However, any "good" set of spam-sensitive rules would probably work reasonably well.)
Requirement 3: A consistent time window
Once the spams have been reduced to features, they can be grouped into some sort of time window. The exact size of that time window is necessarily arbitrary, though it's possible to compare different time windows and see which one(s) work the best. (In this analysis, the time window used is calendar quarters. Thus, 2.5 years of spam aggregated into quarterly time windows yields 10 sets of quarterly data.)

Aggregating the spam data is simply a matter of counting how many times a particular feature occurs in each time window. The resulting frequency counts can be represented in a tabular format, like this:
            Period1   Period2   Period3   ...
Feature A        40        50        76
Feature B        24        25        17
...
and so on.
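
The aggregation step described above can be sketched in a few lines of Python. This is only an illustration, not the code used in the study: the function names (`quarter_of`, `aggregate_by_quarter`, `format_table`) and the three sample messages are hypothetical, and each message is assumed to have already been reduced to a date plus the set of features it exhibits.

```python
from collections import Counter, defaultdict
from datetime import date


def quarter_of(day):
    """Map a date to its calendar quarter as a sortable (year, quarter) pair."""
    return (day.year, (day.month - 1) // 3 + 1)


def aggregate_by_quarter(messages):
    """Count how many messages exhibit each feature in each quarter.

    `messages` is an iterable of (date, features) pairs.
    Returns {feature: Counter({(year, quarter): count})}.
    """
    table = defaultdict(Counter)
    for received, features in messages:
        period = quarter_of(received)
        for feature in set(features):  # count a feature once per message
            table[feature][period] += 1
    return table


def format_table(table):
    """Render the counts as a plain-text table, one row per feature."""
    periods = sorted({p for counts in table.values() for p in counts})
    header = "Feature".ljust(36) + "".join(
        f"Q{q}'{y % 100:02d}".rjust(8) for y, q in periods
    )
    rows = [header]
    for feature in sorted(table):
        cells = "".join(str(table[feature][p]).rjust(8) for p in periods)
        rows.append(feature.ljust(36) + cells)
    return "\n".join(rows)


# Made-up messages, purely for illustration (not data from the study).
sample = [
    (date(2001, 4, 12), {"falsely claims to come from Yahoo",
                         "discusses organ enlargement"}),
    (date(2001, 5, 3), {"falsely claims to come from Yahoo"}),
    (date(2001, 8, 20), {"discusses organ enlargement"}),
]
counts = aggregate_by_quarter(sample)
print(format_table(counts))
```

Keying each period by a (year, quarter) pair rather than a text label keeps the columns in chronological order when they are sorted.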
Requirement 4: Tracking change over time
Now, it'd be possible (at least in theory) to analyze spam volatility by doing a lot of laborious calculations directly on the tabular data. But there are two problems with that approach:
Size Matters
Analyzing over 300 features in 10 different time periods means examining over 3,000 individual data points, and literally millions of potential interactions among those data points. To complicate matters further, each time period is completely independent of all the other periods; that is, a feature that occurs a lot in one time period could occur infrequently (or not at all) in another. So, each time period can be thought of as a separate "dimension" in a 10-dimensional space. Humans are very good at identifying and understanding patterns in relatively small "spaces," typically composed of between 1 and 3 dimensions. But trying to comprehend hundreds of points in a 10-dimensional space is instantly overwhelming.
Redundancy
Even more problematic is the possibility of an (initially unknown) level of redundancy in the experimental data. To understand how redundancy affects analytical results, consider an example based on something more familiar to most folks than spam features: money.

Imagine that there were a lot of data available pertaining to various economic variables, covering a range of socio-economic groups. It comes as no surprise that people with larger annual incomes tend to live in more expensive (and typically larger) houses, and may very well own several different houses. They also usually own more cars (and more expensive cars, to boot), travel more, dine out more (and in more expensive restaurants), etc. Now, it would be possible to measure all of these different variables (income, number of cars, leisure travel, etc.) individually. But those individual measurements probably represent a lot of wasted effort, because many of these variables are essentially redundant measures of a single phenomenon.

What would be handy is to find a way to identify (and to remove) the redundancy from the data. In some sense, removing redundancy would permit a more pure (though admittedly a bit more abstract) measurement of the single phenomenon that's responsible for that redundancy. Then too, to the extent that it were possible to remove the redundancy in the original variables, it might just be possible to reduce all of the redundant variables to a smaller set of nonredundant "composite" variables. (It might even be possible to reduce all of the original variables to a single composite variable. We could even come up with a catchy name for it: wealth.)

Happily, there are statistical modeling techniques that are very good at identifying and removing redundancies in analytical data. One such method is called principal components analysis (PCA, for short). PCA is ideal for identifying composite variables when the original variables contain a high degree of redundancy.
In fact, PCA doesn't just identify a set of composite variables, it identifies the statistically best set of composite variables, based on all of the redundant information contained in all of the original data. So, if all of the individual measurements pertaining to income, number of houses, number of cars, entertainment expenses, etc. were fed into PCA, it's a pretty sure bet that PCA would end up identifying a "common denominator" variable, which a human analyst could recognize as "wealth." This common denominator is ultimately responsible for all of the correlations (redundancies) among all of these (wealth-related) variables. There are admittedly some drawbacks to doing PCA, the biggest of which is that the composite variables identified by the analysis will capture only a portion of the information contained in the original data. So, performing a PCA will almost invariably result in the loss of some degree of measurement accuracy. But there are benefits to PCA, too. Not only is the "data space" potentially smaller (perhaps much smaller), but it also becomes possible to see exactly how each of the original variables relates to the "artificial" composite variables. So, for example, PCA would allow a precise, quantitative examination of the relative "contribution" of income, entertainment expenses, etc. to the composite variable, "wealth." It's also possible to quantify precisely how much measurement accuracy is lost during PCA modeling, and thus to know a lot about the "fidelity" of the resulting statistical model. (Statistical types use the phrase "percentage of variance accounted for" when talking about the measurement accuracy of the composite variables. But it's probably easier to think of it as "percentage of original information content preserved.")
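
To make the "wealth" example concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the data are synthetic (five made-up variables, each equal to a hidden "wealth" score plus random noise), and the helper names are hypothetical. To stay dependency-free it uses power iteration, which recovers only the first (strongest) composite variable of the correlation matrix; a statistics package's PCA routine would return all of them.

```python
import math
import random


def standardize(columns):
    """Rescale each column to mean 0 and (population) variance 1."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / len(col))
        out.append([(x - mean) / sd for x in col])
    return out


def correlation_matrix(columns):
    """Pairwise correlations between the columns."""
    z = standardize(columns)
    n = len(z[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in z] for ci in z]


def matvec(matrix, vector):
    """Multiply a square matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def leading_component(matrix, iterations=500):
    """Power iteration: largest eigenvalue and its unit-length eigenvector."""
    p = len(matrix)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = matvec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(matrix, v)))
    return eigenvalue, v


# Synthetic data: one hidden "wealth" score drives five observed variables.
random.seed(0)
names = ["income", "houses", "cars", "travel", "dining out"]
wealth = [random.gauss(0, 1) for _ in range(500)]
columns = [[w + random.gauss(0, 0.5) for w in wealth] for _ in names]

eigenvalue, loadings = leading_component(correlation_matrix(columns))
# With standardized variables the total variance equals the variable count.
share = eigenvalue / len(names)

print(f"first composite variable preserves {share:.0%} of the information")
for name, loading in zip(names, loadings):
    print(f"  {name:<11} contributes with weight {loading:.2f}")
```

Because all five made-up variables are driven by the same hidden score, the first composite variable preserves most of the information, and each original variable contributes to it with a similar weight. That pattern is what the text describes as a "common denominator."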
Finally... some results!
So what does this have to do with spam volatility?

Just as with the "wealth" data in the earlier example, it's possible to feed historical spam data into PCA, and to obtain a set of composite variables. It's also possible to see how the original variables (time periods) are related to the composite variables identified by PCA. When historical spam data are fed into PCA, the analysis returns two composite variables (which stat-wonks like to call dimensions). Initially, these two dimensions can be referred to by the utterly uninspired names dim1 and dim2. It turns out that these two dimensions (variables) capture over 86% of the information content of the original data. The single best composite variable identified by PCA recaps the original data with a remarkable 70% accuracy; the second-best variable captures a much more modest, but still respectable, 16%.

Now, right away, this is an interesting finding. Under the most statistically rigorous (and also most technically correct) definition of volatility, each dimension would account for no more than 10%-11% of the variance in the original data, because 10 time periods with nothing in common would spread the variance roughly evenly across 10 dimensions. So the model is very high-fidelity, and, by inference, there is a lot of redundancy in the original data.

More importantly, PCA makes it possible to examine and understand how each of the 10 time periods contributes to each of the two composite variables. It's even possible to visualize these contributions, by creating a biplot of the original 10 variables "as seen by" the composite variables. In such a plot, the greater the similarity between spam from any two (or more) time periods, the closer those time periods will appear on the graph. An examination of the contribution of each time period (or all of the time periods) to each composite variable will permit a richer and more accurate understanding of exactly what each composite variable is measuring.
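
The variance figures quoted above are easy to put side by side. The short Python sketch below uses only the percentages reported in the text. The "even split" baseline is an assumption about what total volatility would look like (10 unrelated, standardized time periods sharing the variance equally), and the variable names are hypothetical.

```python
# Percentages reported in the text for the two composite variables.
N_PERIODS = 10
reported_share = {"dim1": 70.0, "dim2": 16.0}

# Assumed baseline: 10 unrelated periods split the variance evenly.
even_share = 100.0 / N_PERIODS

# Information preserved by the two dimensions together.
combined = sum(reported_share.values())

# How many times larger each dimension is than the even-split baseline.
multiples = {dim: share / even_share for dim, share in reported_share.items()}

# Size of dim1 relative to dim2.
dim1_vs_dim2 = reported_share["dim1"] / reported_share["dim2"]

print(f"even split per dimension: {even_share:.1f}%")
print(f"dim1 + dim2 combined:     {combined:.1f}%")
for dim, multiple in multiples.items():
    print(f"{dim} is {multiple:.1f}x the even-split baseline")
print(f"dim1 is {dim1_vs_dim2:.3f}x the size of dim2")
```

On these rounded figures, dim1 alone is several times larger than any single dimension could be if the 10 periods had nothing in common.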
The second dimension (dim2, shown on the vertical axis) is by far the easier to interpret, at least initially. Specifically, it's easy to see that dim2 captures time almost perfectly. Spam from early 2001 appears at the top of the plot, while spam from late 2003 is at the bottom. Simply "counting down" from top to bottom on dim2, it becomes obvious that only one time period (Q4'02) is markedly out of order. So, either dim2 is measuring change-over-time in spam features, or the nearly perfect chronological ordering of time periods on dim2 is literally a million-to-1 statistical fluke. It's reasonable to expect time to have some sort of measurable impact on spam features, so it's a very safe bet that dim2=change-over-time. The second striking thing is that, by and large, that change-over-time progresses very slowly. There are two periods where obvious "gaps" occur, indicating a sudden flurry of change in spam features. But on either side of those gaps, change to spam proceeds at a gradual, even glacial pace. Far from being volatile, change in spam looks sort of like "punctuated equilibrium" from evolutionary biology. There are similarly two striking things about the first composite variable (dim1, shown on the horizontal axis). The first is that no chronological ordering is possible based on the relative position of the points. That is, moving progressively from left to right finds spam from early 2001 mixing together with spam from late 2003. So, whatever dim1 is measuring, it is virtually certain that it is completely unrelated to time. The second striking thing about dim1 is that all 10 time periods cluster together in a tight little band, with almost no dispersion. Stated differently, dim1 is measuring something that consists of roughly "equal parts" of all 10 of the original variables, and all 10 time periods contribute about equally to whatever is being measured by dim1. So, it's reasonable to ask what all of the data from all 10 time periods have in common. 
One obvious answer, of course, is that they are all spam.
Note: It's also true that all the data are email messages. And there is probably a small (though measurable) "email-ness" effect on dim1. But, conversely, to the extent that most of the 300 features are much more applicable to spam than to legitimate email, those features primarily measure "spamminess." (It's possible to separate the "email-ness" from the "spamminess" statistically. But it's also fairly complicated, and far beyond the scope of the present discussion.)

So, in the final analysis, it becomes clear that PCA has identified two distinct "composite" variables: a time-invariant core spamminess, and change-over-time. (PCA actually identifies additional composite variables, but they are all statistically indistinguishable from "noise.") Now that the two composite variables have been identified and characterized, it's important to remember that core spamminess "captures" 70% of the information content of the original data, while change-over-time captures just 16%. Thus, a clear and potentially surprising answer to the question of spam volatility emerges:
Time-invariant spamminess is over four times better at characterizing spam--regardless of time period--than 2.5 years of change-over-time.
Implications
There are several intriguing technical implications of these results. The first is the notion that it ought to be possible, at any given moment in time, to use the results of a PCA to identify a relatively small set of time-stable spam features that would characterize a substantial percentage of contemporaneous spam. And because spam features seem to mutate gradually, those features will, generally speaking, exhibit extraordinary longevity (measurable in months, not minutes).

Then too, because sweeping changes in spam features are rare, even the time-specific variability in those features is broadly predictable, in much the same way that weather is broadly predictable. By implication, even that time-specific variance is potentially exploitable by statistically-based spam identification technologies. Lastly, PCA has been shown to be a method that can extract both time-stability and time-sensitivity from at least one set of spam features. By implication, PCA may provide a systematic basis for quantitatively comparing competing feature sets, in search of an optimal set of time-stable spam features.

Beyond these technical implications, the results of this experiment have some important theoretical implications as well. In particular, these results ultimately and unequivocally challenge the conventional wisdom that spam is "volatile." Instead, these results suggest a fundamentally different approach to thinking about spam--one that sees spam as a very-nearly-stationary target. Adaptive spam identification technologies may sound sexy, but these results suggest that a carefully selected set of fixed features can identify a very large percentage of spam, at a much lower computational cost. The net result is lower cost of ownership, improved scalability, and increased longevity of virtually any spam identification technology.

There is still a lot more research that needs to be done in the area of spam characterization.
In particular, there is still much to learn about the "lifespan" of an otherwise typical time-stable spam feature. Then too, should it prove possible to identify systematic source(s) of predictability in spam features, each successive battle in the fight against spam may prove dramatically shorter than its predecessors. Perhaps most heartening of all, the only real "countermeasure" to a time-stable approach to spam characterization is for spammers to succeed in making spam truly volatile. And to the extent that volatility has been their goal all along, actually achieving true volatility is probably much harder than it sounds.

Copyright © Terry Sullivan. All Rights Reserved. This work may not be reproduced, in whole or in part, without the express consent of the author.