QUOTE(Jeff G. @ Aug 9 2005, 03:27 AM)
However, it is rather inconsistent with "the world's email message volume (approximately 10 billion messages/day)" as stated on both
http://www.senderbase.org/?page=help and
http://www.senderbase.org/search?page=help_magnitude.
To clarify further - the 10 billion messages/day figure cannot be reconciled with the magnitudes currently quoted alongside matching message counts on the SenderBase entry page. It was, I am sure, simply a convenient number for explanatory purposes (or maybe - Lord help us, because the difference would be mostly spam - it was a good approximation a few months ago).
I did say the rounding error on the magnitude figures is "appreciable though minor". I was forgetting these are exponentiated. A reported 7.9 can actually be anything between 7.85 and 7.94999 .... Consequently the maximum error from this source, on the matching number of messages, is (very nearly) 10^0.1 - 1, or 25.89% (which applies to all magnitude numbers) - this is a bit more than minor! As a result, the previous figures quoted could all (just barely) be attributed to a "real" total message count of 14.55 billion. Over 1½ days, the range of values for magnitude 7.9, based on that total count and allowing for maximum rounding error on both the magnitude number and the matching message count, is 11.78%, well within the scope of rounding error. However, at the bottom end of the scale, at magnitude 6.1 (the most consistent minimum in the period), the reported values vary by a maximum of 30.3% with a median of 23.5%.
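The 25.89% figure can be checked with a few lines of Python. This is only a sketch: it assumes the magnitude is a base-10 logarithm of message volume rounded to one decimal place, and the function names are made up for illustration.

```python
def magnitude_bounds(reported, step=0.1):
    """Range of unrounded magnitudes that would display as `reported`.

    ASSUMPTION: magnitudes are rounded to the nearest `step` (0.1 here).
    """
    half = step / 2
    return reported - half, reported + half


def max_relative_spread(step=0.1):
    """Largest possible ratio (minus 1) between two message counts that
    share the same rounded magnitude, i.e. 10^step - 1."""
    return 10 ** step - 1


low, high = magnitude_bounds(7.9)
ratio = 10 ** high / 10 ** low  # count at the top of the band / count at the bottom

print(f"magnitude 7.9 covers {low:.2f} to just under {high:.2f}")
print(f"top/bottom count ratio: {ratio:.4f}")
print(f"maximum spread: {max_relative_spread():.2%}")
```

The spread depends only on the rounding step, not on the magnitude itself, which is why the same percentage applies to every magnitude number.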
After just 1½ days, it is looking very unlikely that *any* static number is used and certainly not 10 billion. It was looking to me like the count is dynamic (like the real world) and further analysis is not persuading me to the contrary view. The actual volatility may be a little less than is indicated by the available "deconstruction" methods (because of the rounding errors) but I remain of the view that the treatment is useful.
[Update] Incidental - won't bother with a new post. Further analysis appears to confirm the SenderBase volumes are indeed dynamic, even in the short run.
CODE
SENDERBASE - DECONSTRUCTION TO TOTAL EMAIL VOLUME            ---------- PAIRED DIFFERENCES (AS STANDARD ERRORS) ----------
CASE  DATE         TIME        ESTIMATE (LR)    PROB. ERROR             1             2             3             4             5             6
1     07-Aug-2005  Late GMT   15,501,324,250  ±  88,361,842             0  -2.290258953  -10.25276948  -18.45891555  -8.775687532  -12.79216103
2     08-Aug-2005  Early GMT  15,353,642,311  ±  95,602,102   2.477919946             0  -8.272567858  -15.91817697  -7.022663427  -10.25261346
3     08-Aug-2005  20:30 GMT  14,736,680,458  ± 110,571,352   12.82977538   9.567875519             0  -5.303887968   0.300838747   0.356699937
4     09-Aug-2005  00:30 GMT  14,428,388,820  ±  86,177,135   18.00252708   14.34887779    4.13374585             0   3.960342935   5.658101628
5     09-Aug-2005  07:30 GMT  14,762,024,348  ± 124,900,490   12.40453631     9.1748412  -0.339824976   -5.73990742             0  -0.079115125
6     10-Aug-2005  03:10 GMT  14,757,423,578  ±  86,217,550   12.48173154   9.246190178  -0.278135287  -5.660755191   0.054612374             0
SenderBase data (a sample of a population) is used to estimate SenderBase total email volume (the population) by the correlation method mentioned previously. Probable errors give the range within which 50% of the estimates are expected to fall. Probable error is a fraction (0.67449 ...) of the standard error. Each paired difference is calculated as in this example for Case 2, column "1": (Case 1 estimate - Case 2 estimate)/(Case 1 Standard Error). This is a shortcut, not totally rigorous, but it should be close enough to be useful since really fine discrimination is not required. Where the difference, expressed in standard errors, is less than -3 or more than +3, it is unlikely that the "real" (population) volumes represented by these estimates could be the same - the odds against are about 370 to 1 at that point and lengthen rapidly thereafter. Accordingly, it seems the actual volumes behind the SenderBase data are changing (fluctuating) rapidly, if not continuously.
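For anyone wanting to check the paired differences, here is a rough Python sketch using the estimates and ± figures from the table above. The function names are invented for illustration. The divisor is an assumption inferred from the numbers, not something stated in the post: the tabulated values appear to be reproduced when the difference is divided by 0.67449 times the ± figure of the column case, so that is the default here. Pass a different `scale` if the ± column should be treated differently.

```python
import math
from statistics import NormalDist

# Estimates and "±" figures copied from the table (cases 1 to 6).
ESTIMATES = [15_501_324_250, 15_353_642_311, 14_736_680_458,
             14_428_388_820, 14_762_024_348, 14_757_423_578]
PLUS_MINUS = [88_361_842, 95_602_102, 110_571_352,
              86_177_135, 124_900_490, 86_217_550]

# Probable error as a fraction of the standard error: the 75th percentile
# of the standard normal distribution (0.67449...).
PE_FRACTION = NormalDist().inv_cdf(0.75)


def paired_difference(row, col, scale=PE_FRACTION):
    """Difference between two cases (1-based), in units of the column
    case's error term.

    ASSUMPTION: the divisor is scale * PLUS_MINUS[col]; with the default
    scale this matches the tabulated figures that were spot-checked.
    """
    i, j = row - 1, col - 1
    return (ESTIMATES[j] - ESTIMATES[i]) / (scale * PLUS_MINUS[j])


def two_sided_odds(z):
    """N such that a normal deviate beyond +/- z turns up about 1 time in N."""
    return 1.0 / math.erfc(abs(z) / math.sqrt(2))


print(f"fraction of standard error: {PE_FRACTION:.5f}")
print(f"case 2, column 1: {paired_difference(2, 1):.4f}")
print(f"case 3, column 4: {paired_difference(3, 4):.4f}")
print(f"beyond 3 standard errors: about 1 in {two_sided_odds(3):.0f}")
```

The last line gives the "1 in 370" figure for a two-sided test at 3 standard errors under a normal approximation. It says nothing about whether that approximation holds for these estimates.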
There are a number of unknowns (particularly how well the SenderBase total volume mirrors what is actually happening in the world) but, as supposed, the (very accessible) SenderBase figures most probably can be converted to give a useful indication of email numbers from specific IP addresses - and of the short-term changes to that traffic.
It might even be useful to try to relate total volume estimates to peaks and troughs in SpamCop reporting.