Block spam based on URLs they contain

ekolve · March 30, 2004

I have emailed both julian and jeffg, without any success (Jeff G pointed me to julian). I have been using spamcopuri on my own machine where I received 100 spams per day. Around 50 - 60% of these have URLs listed on spamcop's list.

Jeff Chan is behind the SURBL and I am looking in to helping him out a bit with this effort as well.

One user commented that they cannot be sure whether the URLs they report are 100% innocent or not, which raises some concerns. I do think this is a very reliable way of blocking spam and if we cannot do it through spamcop, another site should be started to allow people to report URLs in a similar fashion.

--eric

StevenUnderwood · March 30, 2004

One user commented that they cannot be sure whether the URLs they report are 100% innocent or not, which raises some concerns.

That was my comment and what I am talking about is the next page (like yahoo finance is now) to be referred to repeatedly until it is marked as an IB in spamcop. Since I will not follow links in any spam message I receive, I wil not know what that page is and whether it is trying to sell the stock or is only for reference. This may not be a big problem as it will probably be marked as IB well before it reaches the list, but it is possible it won't. And since this is not a primary goal of the SpamCop BL, it is likely not to be such a high priority to keep that list accurate.

Miss Betsy · March 30, 2004

It would seem to me that it would be the responsibility of the people who scrape the list to allow IB's to be removed if they have been collected before spamcop marks them as IB's. And to screen new spamcop scrapings against their own list before they use them.

There probably are not too many IB's left. Though spammers might start throwing in little known ones in just to make the list inaccurate. It might even be good to start pre-registering IB's (requiring a lot of documentation to be sure that spammers don't pre-register).

Miss Betsy

ekolve · March 30, 2004

What does 'IB' stand for?

--eric

StevenUnderwood · March 30, 2004

Innocent Bystander as in the http://finance.yahoo.com/q?d=t&s=XXXX links that are placed in the Pump & Dump spams.

eaolson · March 31, 2004

... I'd like to state my strong support for creating a mail blocking technology that blocks based on ULRs contained in the messages.

SpamPal with the URL-body plug in will do this. Yes, it runs client-side and is Windows-only, but may be adequate for your needs. I started using it a few months ago to catch the 10% or so my ISP and various other DNSBLs don't, and have become a convert.

PeterJ · March 31, 2004

I found an interesting discussion in the SpamAssassin developers mailing list inlcuding Jeff Chan and others on SA, SURBL, and URIDNSBL:

http://thread.gmane.org/gmane.mail.spam.sp...sin.devel/23334

Two other very small threads:

http://thread.gmane.org/gmane.mail.spam.sp...sin.devel/23456

http://thread.gmane.org/gmane.mail.spam.sp...sin.devel/23452

After reading this stuff I have a clearer picture of what is going on and I am curious about the whitelisting. Jeff Chan is using whitelists for domains with his implementation of SURBL, but I am more curious about what happens on the SpamCop side with whitelisting of domains.

Miss Betsy wrote:

before spamcop marks them as IB's

Miss Betsy or others --

Do you know more about the process that SpamCop uses to whitelist IB domains that are reported from URLs? Do deputies regularly do this?

In one of the short threads linked aobve Jeff Chan states:

BTW SpamCop appears to also whitelist the domains or URIs they get in their users' reports, which offers some additional protection to downstream users
of their data such as our SURBL effort.

Given that whitelisting is occuring both on SpamCop's side and also on SURBL's side of things hopefully FPs will remain low with this concept. Also given that this is meant to be plugged into SA, if one were to get a hit on this test, it ALONE would not result in the message being determined as spam.

Eric --

In your current use of the spamcopuri plugin you created for SA can you comment on what kind of scores you are attributing to a a messsage based on spammy URL hits and how they relate to the threshold you are using to block/filter?

PeterJ · March 31, 2004

Perhaps Duh...

I just realized that the OP of this topic is "jeffc" perhaps Jeff Chan? If so and you come back to this thread, have you ever spoken with Julian regarding SURBL? Just curious if it has been given any thumbs up, ignored, or whatever....

Wazoo · March 31, 2004

Miss Betsy or others --
Do you know more about the process that SpamCop uses to whitelist IB domains that are reported from URLs? Do deputies regularly do this?

Process? Technically, the receivers of the spam complaints/reports (used to?) have this as one of their options in responding to the spam report. Deputies have been given the authority and means of adding URL's to a (probably massive) database used somewhere in the spam parsing tool to decide whether or not to generate a report. Part of the same tools used to re-direct some of the parser results to another reporting address.

PeterJ · March 31, 2004

Technically, the receivers of the spam complaints/reports (used to?) have this as one of their options in responding to the spam report

Assuming that this still is the case, then the response goes to a deputy who approves of the whitelisting presumably?

Wazoo · March 31, 2004

Technically, the receivers of the spam complaints/reports (used to?) have this as one of their options in responding to the spam report

Assuming that this still is the case, then the response goes to a deputy who approves of the whitelisting presumably?

No, it's an "automatic" thing. Which then feeds back into a paid-subscriber "benefit" .. one of having the "opportunity" to challenge this type of occurance ... and this is when the Deputies would get involved.

ekolve · March 31, 2004

Here is scoring I do for spamcop urls in spamassassin:

score SPAMCOP_HOST_URI 1.25

score SPAMCOP_URI 2.5

So if the full url appears in the list, it gets a score of 3.75, which is still below the default threshold of 5.0 of spam. If just the host appears, then it gets 1.25 points. You could probably raise these values without worry of FPs, but I am pretty happy with the results I have been getting.

--eric

WB8TYW · April 1, 2004

It appears that a spammer's URL or domain has an effective lifetime of about 72 hours before it is abandoned (at least for a while).

So with your system, there is a gap where the spam will get through before it gets reported.

Now if instead of relying on a database of URLs, of which have only a transative time value, try this instead:

Resove the URL to it's I.P. address.

If the I.P. address is listed in a DNSbl that you would not accept e-mail from, you have your answer, and there really is nothing that the spammer can do to get around that.

If you are using a scoring system, you can use this check with more aggressive DNSbls to eliminate the cases where they may list a real mail server that you want to try to receive real e-mail from.

If the URL does not resolve to an I.P. address, then you either have a typo, or a spam run where the spammer does not realize that their domain has been zapped.

For instance, a bad rDNS match indicates about an 80% chance of spam.

A dul.dnsbl.sorbs.net match indicates a high chance of spam, but is not absolute, and even for a real mail server may have a bad rDNS.

The multi-hop and SORBS spamtrap zones frequently list real mail servers.

A bl.spamcop.net listing indicates a high chance of spam, but will also list real mail servers.

All of those above can be used to decided if it is worth the overhead of doing the URL scan and translating it to an I.P. for your check.

You will likely find that the amount of spam that passes all of the above checks, and also does not already appear in an open proxy/open relay/spamhuas list is almost NIL, and not worth generating the additional tests.

If you will compare this algorithm to yours, I think you will find it detecting spam runs faster and have lower maintenance.

-John

Personal Opinion Only

AlphaCentauri · April 1, 2004

A whitelist can work for sites that are routinely used by spammers because of their content, like the yahoo stock data sites.

But spammers are now also inserting completely fake URL's in their messages. They are not linked to any visible type (you can pick them out by looking for the <a href...></a> with nothing in between that could be clicked on in the message.) They just make the names of the sites up. But they will randomly choose the names of some real sites. Those sites wouldn't show up more than once, so no whitelist can be prepared. And they foil spamcop's parser, as it stops checking after a certain number of URL's and so misses the real one.

I definitely use URL filtering in my own MailWasher mail filters, and it picks up about 50% of the stuff that gets through FirstAlert's filters. But that's 48% by recent filters and pattern filters (\d\d\dhosting for instance, for all the 20000hosting, 30000 hosting, etc. sites) and 2% by filters more than a couple months old.

I also tried harvesting from SpamCop's spamvertised URL page, but since it only covers a short period of time, and spam comes in spurts, there are a lot of references to a few sites that had just sent out spam, and none for a site that had spammed say 4 hours ago.

A.J.Mechelynck · April 1, 2004

Hopefully this idea has already come up, but if so I could not find it with a quick scan of the existing topics. But even if it has been mentioned before I'd like to state my strong support for creating a mail blocking technology that blocks based on ULRs contained in the messages.

It's available for Windows as the "URL-Body" plugin for the spampal mail filter.

PeterJ · April 8, 2004

Jeff Chan recently posted an announcement regarding SURBL to the SpamAssassin general mailing list for those who might be interested:

http://thread.gmane.org/gmane.mail.spam.sp...n.general/46349

If JT reads this, any chance of trying this with SpamCop's SpamAssassin setup?

dra007 · April 8, 2004

This is not a technical comment, I think targeting url's would be great. I do see a pitfall however, a lot of url's in the spams I reported seemed to be broken, or not doirect to a real site...Seems to me the spammers have already thought about this...If this is true of many spams, then it seems that a lot of resources would be wasted just sorting junk from real url... I am still puzzled why spammers would do such thing if they indeed seek to have some monetary gains from their abuse..

turetzsr · April 8, 2004

This is not a technical comment, I think targeting url's would be great. I do see a pitfall however, a lot of url's in the spams I reported seemed to be broken, or not doirect to a real site...Seems to me the spammers have already thought about this...If this is true of many spams, then it seems that a lot of resources would be wasted just sorting junk from real url... I am still puzzled why spammers would do such thing if they indeed seek to have some monetary gains from their abuse..

...See Pinned: Spammer Rules rule # 3, especially Russell's Corollary.

Jeff G. · April 9, 2004

I suggested that SpamCop "create blocklists for the spamvertised IP Addresses and URLs" on December 28th 2003, to no avail.

JeffG, if your proposal is still available in one of these forums, can you provide a URL? I'd like to see what you were thinking.

I think it may have predated these forums. It was more of a followup to weAponX's thinking. It was in this post to Spamcop, which Google can't seem to find at present.

That Thread started with http://news.spamcop.net/pipermail/spamcop-...ber/066512.html but Frank Ellermann chose a posting method that made a new Thread (from Pipermail's point of view) with http://news.spamcop.net/pipermail/spamcop-...ber/066537.html.

dra007 · April 11, 2004

can someone enlighten me about this tracking business?

Finding links in message body
Mailing list traffic detected. Removing footer.

Recurse multipart:

Parsing HTML part

Reducing redundant links for smilingdoctor.net

Resolving link obfuscation

http://smilingdoctor.net/gv/chair.php

host 202.104.242.137 (getting name) no name

Tracking link: http://smilingdoctor.net/gv/chair.php

Resolves to 202.104.242.137

WB8TYW · April 11, 2004

can someone enlighten me about this tracking business?

Resolving link obfuscation

http://smilingdoctor.net/gv/chair.php

host 202.104.242.137 (getting name) no name

Tracking link: http://smilingdoctor.net/gv/chair.php

Resolves to 202.104.242.137

First you get the I.P. addresses associated with the URL.

PYTHON> tcpip show host smilingdoctor.net

BIND database

Server: 192.168.0.2 EAGLE

Host address Host name

202.104.242.137 SMILINGDOCTOR.NET

Then you check to see if there is an rDNS

PYTHON> tcpip show host 202.104.242.137

%TCPIP-W-NORECORD, information not found

-RMS-E-RNF, record not found

Of course somepeople in this thread are trying to track the actual URLs used in the past by spammers. However, if you take the I.P. address for the URL and do a DNSbl lookup on it:

PYTHON> tcpip show host 127.242.104.202.sbl-xbl.spamhaus.org

BIND database

Server: 192.168.0.2 EAGLE

Host address Host name

127.0.0.2 127.242.104.202.SBL-XBL.SPAMHAUS.ORG

See, even if the URL is not in the spamvertised URL database yet, it still shows up as being a positive indication of spam, unless this is an incoming abuse report, or a request for assistence in translating spam.

PYTHON> tcpip show host 127.242.104.202.l1.spews.dnsbl.sorbs.net

%TCPIP-W-NORECORD, information not found

-RMS-E-RNF, record not found

PYTHON> tcpip show host 127.242.104.202.l2.spews.dnsbl.sorbs.net

BIND database

Server: 192.168.0.2 EAGLE

Host address Host name

127.0.0.2 127.242.104.202.L2.SPEWS.DNSBL.SORBS.NET

The idea is to combine this with other scoring and reject at the mail server level.

Other scoring would be a more aggressive DNSbl that would otherwise reject real e-maili, such as a multi-hop list, rfc-ignorant, spamcop.net, spews, etc.

Having an invalid rDNS would make it a candidate for a link check.

This is what I pointed out in my previous post in this thread.

-John

Personal Opinion Only

StevenUnderwood · April 12, 2004

OT: John:

Is that VMS you are using for the tcpip commands?

WB8TYW · April 12, 2004

Yes, to be precise,

PYTHON> tcpip show version

Compaq TCP/IP Services for OpenVMS VAX Version V5.1

on a VAX 4000-500 running OpenVMS V7.3

-John

Personal Opinion Only

PeterJ · June 1, 2004

In the hopes that this topic will not die it is worthwhile to note that Jeff Chan's SURBL has become a popular and accepted rule/test for use with SpamAssassin.

Since SpamCop users are the ones taking the time to report links in spam, it only seems appropriate to allow them to reap the benefits of this by implementing URI checking in SpamCop's current SA setup. Granted this will not benefit all SpamCop users, only the ones who have mail accounts with SpamCop.

Brief and recent replies from two knowledgable SA people regarding the effectiveness of the SURBL and other URI checking is here:

http://thread.gmane.org/gmane.mail.spam.sp...n.general/49921

exposed88 · August 30, 2004

I can see it'll work with the URLs in plain text. However, I have seen two new types of spam that have evolved to outsmart this type of filtering. One is to show the URL in graphical format such at gif. This would require the recipeint to manually enter the URL to go to the support website. The second one is to use the unicode to write out the URL. Here's an example:

<html><A hREf="https://web.da-us.citibank.com/signin/scripts/login/user_setup.jsp"><map'>https://web.da-us.citibank.com/signin/scripts/login/user_setup.jsp"><map name="FPMap0"><area coords="0, 0, 610, 395" shape="rect" href="http://%36%34%2E%31%36%32%2E%32%33%33%2E%38%31:%38%37/%63%69%74/%69%6E%64%65%78%2E%68%74%6D"></map><img SRC="cid:part1.00020306.00060307[at]supprefnum584210191[at]citibank.com" border="0" usemap="#FPMap0"></A></a>Grinch in 1934 in 1948 The Simpsons Scooters USA Personals Black History Month don't go Get your News let me add we get on well Mariah Carey Warner Bross Majora's Mask The Beatles Zelda in 1870 Try to connect you settled How are you? Capital Punishment Ford I advise you </html>

Notetice the URL "http://%36%34%2E%31%36%32%2E%32%33%33%2E%38%31:%38%37" is hidden but if you click on the legitimate "https://web.da-us.citibank.com/signin/scripts/login/user_setup.jsp" you will go to http://64.162.233.81:87. The IP is on the RBL, but is the URI filtering able to block it?

Both techniques were actually used by the websites to hide their email addresses from the Spambot. How interesting!

Hopefully this idea has already come up, but if so I could not find it with a quick scan of the existing topics. But even if it has been mentioned before I'd like to state my strong support for creating a mail blocking technology that blocks based on ULRs contained in the messages.
Note that this is not the same as "URL blocking" which traditionally means preventing Web browser access to certain sites, and it's also not the same as most current realtime blocklist (RBL) approaches, which block access from certain mail servers, usually based on their IP address. Blocking mail based on URLs they contain would require a mail agent that can see, parse, and deobfuscate the content of the message body, which is something many mailers such as sendmail are not designed to do today, but which others such as Postfix appear to support.

Like many other RBLs, SpamCop's RBL blocks messages from certain servers once someone has reported a spam coming from them. This is useful in that it successfully prevents much spam from the same mail server from reaching beyond the first few people, but spammers have already evolved strategies around this by using distributed trojan horse viruses, in essence stealing Internet services from many unsuspecting computers throughout the world in order to send spam in a broadly distributed way which is therefore difficult to stop since it's decentralized. That's in addition to simply exploiting existing open relay mail servers for as long as they remain open. (Certainly hundreds of thousands of spams can typically be sent through open relays before they are closed.)

However what most of the spams have in common is that they attempt to drive web traffic to spam sites, for example selling drugs or software. From reporting spams that get through the many RBLs our mail servers already use, it seems to me that many or even most of those spam sites are hosted at ISPs in China. The spams come from all over the world, but web hosting providers in China seem especially likely as destinations as the URLs in spams.

What I and presumably others propose is to build a blacklist of those sites and block messages that reference those URLs. At the same time a whitelist of the many common legitimate sites would need to be created to prevent spammers from getting legitimate sites blacklisted. A probably very successful first pass would be to blacklist the sites or IP blocks in China (or other spam friendly ISPs) and whitelist the rest. Further refinement could be made from there, but this would probably successfully stop 90% of spam that currently makes it through existing RBLs.

I believe this may be a useful and productive solution to spam and would like to encourage it's development.

I understand there is discussion in the SpamAssassin community for working on things like this. SpamCop builds a great database of spam-referenced URLs now. That databse could be used in a URL blacklist. Is anyone in the SpamCop community working on this idea?

1559[/snapback]

Block spam based on URLs they contain

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Archived