
Block spam based on URLs they contain


jeffc


This idea has probably already come up, though I could not find it with a quick scan of the existing topics. Even if it has been mentioned before, I'd like to state my strong support for creating a mail-blocking technology that blocks based on the URLs contained in messages.

Note that this is not the same as "URL blocking", which traditionally means preventing Web browser access to certain sites. It's also not the same as most current realtime blocklist (RBL) approaches, which block access from certain mail servers, usually based on their IP address. Blocking mail based on the URLs it contains would require a mail agent that can see, parse, and deobfuscate the content of the message body, which is something many mailers such as sendmail are not designed to do today, but which others such as Postfix appear to support.
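As a rough illustration (not tied to any particular MTA), such a content-level check might look like the Python sketch below. The blacklist set and the message-walking logic are purely illustrative assumptions; the point is that the body has to be decoded before any URL in it can be seen:

```python
import email
import re

# Hypothetical blacklist of spam-advertised domains (illustrative only).
BLACKLISTED_DOMAINS = {"example-spamvertised.invalid"}

# Capture the host portion of any http(s) URL in the text.
URL_RE = re.compile(r'https?://([^/\s"\'>]+)', re.IGNORECASE)

def message_references_blacklisted_url(raw_message: str) -> bool:
    """Parse a raw RFC 822 message and scan its text parts for
    URLs whose host appears on the domain blacklist."""
    msg = email.message_from_string(raw_message)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # undoes base64/quoted-printable
        if payload is None:
            continue
        text = payload.decode(part.get_content_charset() or "latin-1", "replace")
        for host in URL_RE.findall(text):
            host = host.split(":")[0].lower()  # strip any :port suffix
            if host in BLACKLISTED_DOMAINS:
                return True
    return False
```

The decode step is what separates this from header-only RBL checks: obfuscated (base64 or quoted-printable) bodies have to be unwrapped before the URLs are visible.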

Like many other RBLs, SpamCop's RBL blocks messages from certain servers once someone has reported a spam coming from them. This is useful in that it prevents much spam from the same mail server from reaching beyond the first few people. But spammers have already evolved strategies around this by using distributed trojan horse viruses, in essence stealing Internet services from many unsuspecting computers throughout the world in order to send spam in a broadly distributed way, which is difficult to stop precisely because it's decentralized. That's in addition to simply exploiting existing open relay mail servers for as long as they remain open. (Hundreds of thousands of spams can typically be sent through an open relay before it is closed.)

However, what most of the spams have in common is that they attempt to drive web traffic to spam sites, for example selling drugs or software. From reporting spams that get through the many RBLs our mail servers already use, it seems to me that many or even most of those spam sites are hosted at ISPs in China. The spams come from all over the world, but web hosting providers in China seem especially likely to be the destinations of the URLs in the spams.

What I and presumably others propose is to build a blacklist of those sites and block messages that reference those URLs. At the same time, a whitelist of the many common legitimate sites would need to be created to prevent spammers from getting legitimate sites blacklisted. A first pass that would probably be very successful would be to blacklist the sites or IP blocks in China (or at other spam-friendly ISPs) and whitelist the rest. Further refinement could be made from there, but this would probably stop 90% of the spam that currently makes it through existing RBLs.

I believe this may be a useful and productive solution to spam and would like to encourage its development.

I understand there is discussion in the SpamAssassin community about working on things like this. SpamCop builds a great database of spam-referenced URLs now. That database could be used in a URL blacklist. Is anyone in the SpamCop community working on this idea?


Yes, I would like to see a system that blocks based on the IP address that the site is hosted on, much the same way that you would block mail from a particular IP address, regardless of the domain in the mail.

I think that type of solution would make the issue of spammers purchasing a billion domains moot.


OK well hopefully some other antispam effort will work on it if SpamCop won't. Based on the latest spams, it may be a good way to stop them. I mainly wanted to try to drum up some support for the idea. Good to hear other people already think it's a good idea. :)

Name resolution (i.e. blocking by the resolved IP address of the URL) may not even be necessary; just block based on the URL or second-level domain. It would be quicker and would not require any DNS lookup resources. On the other hand, an initial lookup of the IP address could be useful in some circumstances, such as establishing a whitelist/blacklist score for a particular URL.
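For what it's worth, extracting that "blockable" second-level domain without any DNS traffic might be sketched like this in Python. The tiny suffix set is a stand-in assumption; real registrar boundaries need a full public-suffix list, since "co.uk"-style suffixes cannot be guessed from the dot count alone:

```python
from urllib.parse import urlparse

# A tiny stand-in for a real public-suffix list (e.g. publicsuffix.org).
# Illustrative only; a production list would be far larger.
TWO_PART_SUFFIXES = {"co.uk", "com.cn", "com.tw"}

def registered_domain(url: str) -> str:
    """Return the 'blockable' second-level domain of a URL,
    with no DNS lookup involved."""
    host = urlparse(url).hostname or ""
    labels = host.lower().split(".")
    # If the last two labels form a known two-part suffix, keep three labels.
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_PART_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:]) if len(labels) >= 2 else host
```

This is pure string work, so it can run on every message without touching a resolver.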

JeffG, if your proposal is still available in one of these forums, can you provide a URL? I'd like to see what you were thinking.



The largest issue in what you're asking for is that the use of such a BL would be based on the e-mail app doing a content evaluation of the spam. Yes, you did mention the need for deobfuscating things, decoding others, etc. ... but take all that to the issue of the horsepower needed to do this in real-time. End-users, it's a possibility, but with how many complaints / crashes / etc.?

That you'd need to have pulled the spam down for processing kills off any idea of bandwidth savings. And no way to do the bounce-at-delivery-time ....

Doing a lookup on an IPA isn't that much traffic; the field size is the same for both the submittal and the response .... but trying to do a lookup on a URL? What field limits would one set on the submittal? What protocols would be allowed? Do you nail the specific site or the ISP, or is this another decision point?

And, of course, what's the logic flow when the "dedicated server for the database" isn't available?



I agree looking at the content of messages would require more processing power than MTAs that currently just look at headers.

On the other hand if spam that mentioned specific URLs were blocked, then spam would become far less effective and there would be less of it eventually. In other words the processing power question solves itself if the approach works in the first place. I suppose one could argue it wouldn't work in the first place because mail would slow to a crawl. Kind of a chicken and egg problem.

However I still believe being able to deny delivery of messages with spam URLs could largely solve most spam as we know it today. It's not so much about saving bandwidth or saving CPU cycles as it is about stopping spam. If the spammers can't reach most people then spam will cease to be a useful if unethical marketing tool. If the tool is less useful, fewer unethical people will use it and spam would decrease.

Regarding the parsing of URLs, SpamCop among others seems to have fine algorithms for it. That said, extracting only the second- or third-level domain name may be enough. The full URL may not be necessary, since most of the spammers seem to use custom domains. Blocking the domains is quick and easy and does not even require resolution of the URL: just block the custom spam domain. No legitimate domain owner would permit spam sites under their main domain, so that's probably not an issue. For efficiency, pre-whitelist all the big legitimate domains with well-enforced AUPs.
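The check order described here (whitelist consulted first, so a whitelisted domain can neither block mail nor be added to the blacklist by a report) could be sketched as:

```python
def classify_domain(domain, whitelist, blacklist):
    """Whitelist wins outright: a whitelisted domain neither blocks
    mail nor counts toward blocking in any way."""
    domain = domain.lower()
    if domain in whitelist:
        return "allow"
    if domain in blacklist:
        return "block"
    return "unknown"

def report_spam_domain(domain, whitelist, blacklist):
    """Add a reported domain to the blacklist unless it is protected
    by the whitelist (defeating 'poisoning' spam with legit URLs)."""
    if domain.lower() not in whitelist:
        blacklist.add(domain.lower())
```

The ordering is the whole defense against the poisoning attack: a spammer who stuffs amazon.com-style URLs into a spam cannot get them blacklisted, because the whitelist is checked before anything is added.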

Regarding databases and other engineering issues, that's what engineers are for. ;)


Don't pick on engineers :D

Wazoo (Cougar?) hit on the biggest drawback with comparing this to what RBLs currently do: you actually have to receive the entire E-mail to do this check. This puts it out of the realm of MTA effort anyway. I'd be interested in your reference to Postfix, though; I didn't know it could do anything like that.

Something like this would probably be done with a call from procmail to a perl script or SpamAssassin. I understand you are proposing a DNSBL-style lookup, but again, that could be done by SpamAssassin.

Regarding the Asian spamhauses, they have a very comprehensive list here: http://www.okean.com/asianspamblocks.html

Anyway, I guess I would have to think about the overall concept some more. Seems like it would be o.k.....

Would there be a vulnerability to joe-jobbing legitimate sites? What would you do if there was a reference to a legit site and a black site in the same e-mail? How would you guard against people simply talking about black sites? Like when your buddy says, "Hey, Jeff, have you been getting spam from www.chinaspammer.com?" I guess you could make it only look for actual HTML code, not just the plain-text reference.
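The "only count actual HTML code" idea could be sketched with Python's standard HTML parser: only href targets of anchor tags are collected, so a plain-text mention of a domain in the body would not trigger the check. This is an illustration of the concept, not anyone's actual filter:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags only, so a plain-text
    mention of a domain does not count as a link."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def html_links(html_body: str):
    """Return only the clickable link targets in an HTML body."""
    parser = LinkExtractor()
    parser.feed(html_body)
    return parser.links
```

Only the URLs returned here would be run through the blacklist, leaving conversational mentions untouched.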


You do not have to do this processing on every e-mail coming in. The obvious spam can be discarded.

If you are going to do content filtering on the e-mail that passes the DNSbl checks, you probably also want a whitelist of I.P. addresses of trusted senders, and you want to make sure that you do not filter out valid abuse reports.

You may also want to trigger the content filtering by using a more aggressive DNSbl as a reference.

And as far as watching the URLs, I think that you will find that the I.P. addresses of almost all of them will be found in sbl-xbl.spamhaus.org or one of the SPEWS listings.
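A DNSBL query for a URL's host follows the same reversed-octet convention used for sender checks. The sketch below assumes the zone named in the post (sbl-xbl.spamhaus.org, as it existed at the time) and IPv4 only; any DNS answer conventionally means "listed" and NXDOMAIN means "not listed":

```python
import socket

def dnsbl_query(ip: str, zone: str = "sbl-xbl.spamhaus.org") -> str:
    """Build the standard DNSBL query name: reversed octets + zone."""
    return ".".join(reversed(ip.split("."))) + "." + zone

def url_host_listed(url_host: str, zone: str = "sbl-xbl.spamhaus.org") -> bool:
    """Resolve the spamvertised host, then look its IP up in the zone.
    An answer means 'listed'; NXDOMAIN means 'not listed'."""
    try:
        ip = socket.gethostbyname(url_host)          # A record of the spam site
        socket.gethostbyname(dnsbl_query(ip, zone))  # DNSBL lookup
        return True
    except socket.gaierror:                          # host or query did not resolve
        return False
```

So a spam advertising http://spamsite.example/ would first cost one A-record lookup, then one DNSBL lookup, both cacheable by the local resolver.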

-John

Personal Opinion Only



I wouldn't pick on engineers; I r one of them too. :) I was more suggesting engineers could "make it so".

I'm told Postfix can see and act on message bodies.

procmail, a perl or shell script, or SpamAssassin could also do this. I'm told the SA developer community has already been discussing it.

Regarding existing RBLs of China and Korea spam mail servers, those no longer seem effective since many people are blocking mail from China and Korea already. That seems to be why the spammers have gone to using relays or zombied personal computers not in those countries. Remember we're talking about blocking URLs in spams, and not the spam servers. Spammers have already effectively figured out how to get around spam mail server RBLs IMO. If they want to advertise a spam URL, they can be defeated by blocking mails containing the URL or related domain(s).

The way to prevent spammers from adding legitimate sites to their spams to try to block mails referencing them is to have a whitelist of legitimate domains. Any domain or URL in the whitelist does not count against blocking, nor would those domains or URLs get added to the blacklist. You're probably right about legitimate messages mentioning spam sites, but that would be a second-order effect.


Perhaps related to this discussion is that many Bayesian (or machine learning) processes are decoding URLs and probably most of them handle IPs, so that a particular IP address could end up with a high "spam value" and thus be filtered.

I admit it is not blocking per se, but it has promise if one is not as concerned with bandwidth. It will be interesting to see how blocklists and machine learning concepts team up from here into the future to stop spam...

As a SpamCop email user via IMAP, I have not toyed with Bayesian-based filtering, even though I am using Thunderbird (now 0.5).

I follow the development of POPFile pretty closely as I am waiting for IMAP support and then I will probably give it a try.

EDIT: I was just thinking that a relatively young software project named "TarProxy" may be of interest to "jeffc", coupled with a corpus containing decoded URL information...


Sorry if this is a bit off topic, but PeterJ mentioned TarProxy. Another similar work is here: http://www.spamcannibal.org/cannibal.cgi

I've manually tarpitted MTAs before, but I don't have a lot of data. It seems like sendmail will tarpit for about 10 minutes before it gives up. I wonder if others tarpit longer or even indefinitely. Can you imagine if all the spammers walked into tarpits and had to sit there for 10 minutes?


Yes, SpamAssassin uses Bayesian rules.

I like the idea of tarpits, but as others have rightly pointed out, their use has not slowed down spammers very much.... The amount of resources they can steal is too vast it seems.

Which is why blocking based on contained URLs would probably be the most effective way of ending spam, even if it has some (solvable) technical hurdles.


I created something to do this a couple of months ago and integrated it into SpamAssassin. I should probably shoot them an email, though it's kind of kludgy since it's scraping the spamvertised sites off of spamcop.net periodically. If anyone is interested, let me know and I will post the code. I basically use WWW::Mechanize to grab http://www.spamcop.net/w3m?action=inprogress&type=www every so often, pull all the links, and output a file called spamcop_url.cf which lives under /etc/mail/spamassassin/. This gets picked up when a mail is checked for spam. An entry in the conf file looks something like:

uri SPAMCOP_NET_URI_5 /http\:\/\/www\.1callerid\.com\/rm\/index\.html/i

describe SPAMCOP_NET_URI_5 Matches SPAMCOP URI http://www.1callerid.com/rm/index.html (generated: Thu Feb 12 18:00:01 PST 2004)

score SPAMCOP_NET_URI_5 1.80

I keep one day's worth of URLs back in a serialized file using Storable. This has worked pretty well and has definitely increased the number of points that spams receive.
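A generator for rule entries in that shape might look like the following sketch. The rule names, the fixed score, and the escaping details are assumptions for illustration; the real spamcopuri code is Perl and may differ:

```python
import re

def make_sa_rules(urls, base_score=1.8):
    """Emit SpamAssassin 'uri' rules shaped like the spamcop_url.cf
    entries above: a literal-URL regex, a describe line, and a score."""
    lines = []
    for n, url in enumerate(urls):
        name = "SPAMCOP_NET_URI_%d" % n
        # Escape regex metacharacters, plus "/" because the SA rule
        # uses /.../ delimiters around the pattern.
        pattern = re.escape(url).replace("/", r"\/")
        lines.append("uri %s /%s/i" % (name, pattern))
        lines.append("describe %s Matches SPAMCOP URI %s" % (name, url))
        lines.append("score %s %.2f" % (name, base_score))
    return "\n".join(lines)
```

Regenerating the .cf file from the scraped URL list on a cron schedule would keep the ruleset roughly one scrape-interval behind the reports.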

I think a better approach would be one where SpamCop or some other service were to provide a URL that you could pass a URI to, and it would tell you whether it's been banned, how many have reported it, for how long, etc. The spam filter would then decide how many points to give it based on those params. I think this could be effective not only with links but also with images, since that is how some spams seep through.

The only downside to this is that the service may need to get intelligent, since spammers would use wildcard A DNS records, so it would have to resolve the IP of the host instead of just doing a string comparison against what was passed in.

If someone is willing to give me hosting space, I would be happy to consider working on the service.

--eric


You certainly are working in a good space, grabbing spam URLs from SpamCop and integrating them with SpamAssassin scoring. That's a general direction I was interested in also, though I agree with you that your first pass could use more integration.

I don't think wildcard A records would be too much of a problem, since they presumably would only be used for spam domains. In the case of wildly varying host names in spams made possible by wildcards in the DNS, we could just use the non-wildcarded part of the domain name. In other words ignore the wildcarded part and just block the whole spam domain (up to but not including the wildcard). That could be automatically detected by a relatively simple URL comparison matching from the TLD down towards the host name. A tree-shaped data structure could make a reasonable representation.
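That tree idea might be sketched as a nested dict keyed from the TLD inward, so blocking a registered domain automatically covers any wildcard-generated host beneath it. The sentinel key is an implementation assumption, not anything from the thread:

```python
class DomainTree:
    """Store blocked domains as a tree keyed from the TLD inward, so
    'spamdomain.cn' also matches 'anything.spamdomain.cn' produced
    by a wildcard A record."""
    _BLOCKED = "*blocked*"  # sentinel; '*' cannot appear in a real label

    def __init__(self):
        self.root = {}

    def block(self, domain):
        """Mark a registered domain (and implicitly all its subdomains)."""
        node = self.root
        for label in reversed(domain.lower().split(".")):
            node = node.setdefault(label, {})
        node[self._BLOCKED] = True

    def is_blocked(self, host):
        """Walk from the TLD toward the host name; any blocked
        ancestor along the way blocks the whole subtree."""
        node = self.root
        for label in reversed(host.lower().split(".")):
            if self._BLOCKED in node:
                return True
            node = node.get(label)
            if node is None:
                return False
        return self._BLOCKED in node
```

Lookup cost is proportional to the number of labels in the host name, independent of how many wildcard variants the spammer generates.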

Thinking about it, perhaps the most logical approach would be to ask the SpamCop folks if they'd be willing to give access to their existing spam URI database for this purpose, in other words to do something useful with it. I don't think spam hosting reports to China are doing a whole lot of good at this point.


The realtime thing is the hangup -- until a new URL has been identified as a spamvertised website, it will slide through. But it's also true that a few services host a large percentage of websites. An example is Chinanet.

If a service that filters emails for its paying subscribers (like SpamCop) has humans who identify a service as consistent hosts for such sites, they could have a web crawler periodically go to each IP number in that service's range, update the list of URL aliases currently registered, and blacklist all of them. Then the spams could be rapidly blocked without looking up each URL in every email. It would be a valuable service that I think people would line up to pay for, even entire ISP's, since those same ISP's now block all email from the same ISP's IP addresses.
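Mechanically, the range walk could look like the sketch below. Note that a reverse (PTR) lookup only reveals one name per address, not every virtual host served from it, so this is at best an approximation of the crawling idea; the injectable resolver is an assumption added so the function can run without network access:

```python
import ipaddress
import socket

def names_in_range(cidr, resolve=None):
    """Walk every host address in a provider's block and record its
    reverse (PTR) name, skipping addresses with no PTR record."""
    resolve = resolve or (lambda ip: socket.gethostbyaddr(ip)[0])
    found = {}
    for ip in ipaddress.ip_network(cidr).hosts():
        try:
            found[str(ip)] = resolve(str(ip))
        except OSError:
            continue  # no PTR record for this address
    return found
```

The resulting name list would then be fed into the URL blacklist, so individual emails never need per-URL lookups at delivery time.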


Notwithstanding that there is no task force like that available at SpamCop these days <g> ... it sure sounds like there'd be all kinds of nasty fall-out from a plan like that. Thinking of all the end-users these days who complain of being on an e-mail block-list because they're using "the only local ISP", I can't imagine the spew of legal threats coming from the same sort of "business" web-site owners. Yeah, I know from this end that's a silly argument, but to the naive, all they'll see is blockage for "no apparent reason" ...



It appears that Eric Kolve is not the only one interested in utilizing SpamCop data on URLs. Bumped into this website today:

http://spamcheck.freeapp.net/

and here is a recent brief thread from the SpamAssassin general mailing list regarding SA's beta "URIDNSBL" plug-in:

http://thread.gmane.org/gmane.mail.spam.sp...n.general/45246

If using SpamCop-reported URLs in an RBL manner continues to grow, I wonder if SpamCop may be forced to act...

Either get involved and create an official RBL or prevent programs/people from automatically harvesting the data.

Maybe I am exaggerating the potential, but it would be a shame for SpamCop to start getting "blamed" for blocking email based on URLs because they are the source of the data.


At the top of the first link there is this statement:

SpamCop creates a database of these "Spamvertised" spam-advertised sites. That data is grabbed periodically and served up to a SpamAssassin plugin which Eric Kolve developed.

Is this correct? I was under the impression that the only database maintained was the one for spam sources. If this is true, I will stop full reporting immediately, as I cannot always check whether a reported site is innocent or not. I leave that up to the ISP and the site to decide. If that information is being used for more than warning the ISP of the site, I will stop sending reports altogether.



SpamCop does not supply a database of "spammy" URLs for people to directly access; however, at least a few people have been scraping this data from http://www.spamcop.net/w3m?action=inprogress&type=www. I personally do not have a problem with this for now, but I do think that SpamCop will need to address it in the future if it becomes more common.

As far as I know there are two implementations that are retrieving reported SpamCop URL data in some manner:

1) Eric Kolve's spamcopuri

(who has posted earlier in this conversation)

2) SURBL -- spam URI Realtime Blocklist

(I do not know who is behind this, but this link and the one in my previous post mention Jeff Chan at the bottom of each page)

Both of these can integrate with SpamAssassin.

I doubt many (if any) are using the SpamCop URL data at this point; the SURBL page mentions "proof of concept." Does SpamCop Administration have a position on this, I wonder?

Eric Kolve -- Care to jump back in for more discussion of this? Have you ever spoken with any SpamCop admins regarding this?


Archived

This topic is now archived and is closed to further replies.

