does a robots.txt file stop spam bots?


abyrne

I run a website for a non-profit. In an effort to prevent spambots from collecting the "mailto:" email addresses in the HTML, I created a form to send email to staff from the website. Unfortunately, the web server (IIS) email feature is notoriously unreliable: some of the emails are lost entirely, many of them are delayed for weeks, and because they come from a website, our mail host tends to treat them as spam.

So my question: if I use regular mailto: HTML links to make it easy to email us from the website, and exclude the page containing those links in the robots.txt file, will this protect our email addresses from being harvested for adverse purposes?
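For reference, the exclusion I have in mind would look something like this in our robots.txt (contact.html is just a stand-in for our actual contact page):

User-agent: *
Disallow: /contact.html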

-Amanda


...So my question: if I use regular mailto: HTML links to make it easy to email us from the website, and exclude the page containing those links in the robots.txt file, will this protect our email addresses from being harvested for adverse purposes?
No. Only regular robots will heed your robots.txt file. Publishing an email address as text on a webpage (or in a mailto: link) will generally result in substantial and enduring spam to that address, because less well-mannered (and, to all intents, illegal) robots/spiders skim pages regardless. Perhaps others here will have some suggestions for a more effective contact form.

  • 4 weeks later...
So my question: if I use regular mailto: HTML links to make it easy to email us from the website, and exclude the page containing those links in the robots.txt file, will this protect our email addresses from being harvested for adverse purposes?

As Farelf said, the robots.txt file is only honored by those robots/spiders that choose to follow it. I've seen sites that use several methods, but all require manual entry of something that allows the message to be valid.

What I mean is that a person must either type in the correct email address, edit a published address to remove deliberate errors, or type in a correct interpretation of text shown in an image to complete the transaction. Here are some examples:

1) I've seen web pages that have pictures of email addresses instead of text. This requires the person to manually type the email address into their mail program.

2) I've seen people use things like blocks of repeated characters in their host or user name and then specifically request that the character blocks be removed to form the correct address, like bobbyXXXsocks[at]mydomainXXX.com (remove the XXX blocks; see the sketch after this list).

3) There are services that show you a picture of text that is difficult for a machine to parse and will only perform the action (send the email) if you type in the correct transcription of the text.
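As a rough illustration of method 2, the markup might look like this (the address is made up, and the point is that there is no mailto: link at all for a harvester to grab):

<!-- no mailto: link, just text the visitor must repair by hand -->
<p>Email us at: bobbyXXXsocks[at]mydomainXXX.com (remove the XXX blocks and replace [at] with @)</p>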

Those are my suggestions.


As Farelf said, the robots.txt file is only honored by those robots/spiders that choose to follow it. I've seen sites that use several methods, but all require manual entry of something that allows the message to be valid.

Example 2 would be a bit of a pain for everyone; however, examples 1 and 3 adversely impact visually impaired and/or otherwise disabled visitors.


As said, only well-behaved bots follow the robots.txt file, but this can also be used to your advantage. Here is what I do:

I have a robots.txt file that specifically disallows access to a file called bt.asp (my bot trap). Next, in the header and footer of each web page, I have the following:

<!-- <a href="bt.asp">bt.asp</a> -->
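The matching robots.txt entry is simply (assuming bt.asp sits at the site root):

User-agent: *
Disallow: /bt.asp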

Poorly behaved bots are generally also poorly written and don't understand HTML comment tags; all they see is the hyperlink tag, and they add it to their list of links to follow. Most bots work that list either top to bottom or bottom to top, since those are the easiest ways to code the algorithm. Putting the hidden link as both the first and last link on the page ensures that nine times out of ten the trap will be the next page the bad bot visits.

The bt.asp page is fairly simple. It returns a list of randomly generated email addresses, as well as an address that I use to monitor just how much spam those email addresses receive. The randomly generated addresses are designed to be unlikely to actually exist, and are primarily there to "poison" the harvested list. Next, it stores the visitor's IP address in a database for later use.
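A minimal sketch of what such a trap page might look like, assuming a connection string kept in Application("connString") and a made-up BadBots table (my real page differs in the details):

<%
' bt.asp - bot trap: feed the harvester junk, then log the visitor
Option Explicit
Dim i, addr, conn

Response.Write "<html><body>"

' One real, monitored address, to measure how much spam the trap feeds
Response.Write "<a href=""mailto:trap-monitor@example.org"">trap-monitor@example.org</a><br>"

' Randomly generated addresses that are unlikely to exist, to poison the list
Randomize
For i = 1 To 20
    addr = "user" & Int(Rnd * 10000000) & "@example.org"
    Response.Write "<a href=""mailto:" & addr & """>" & addr & "</a><br>"
Next

Response.Write "</body></html>"

' Record the visitor's IP address; every other page checks this table
Set conn = Server.CreateObject("ADODB.Connection")
conn.Open Application("connString")
conn.Execute "INSERT INTO BadBots (IP) VALUES ('" & _
    Request.ServerVariables("REMOTE_ADDR") & "')"
conn.Close
Set conn = Nothing
%>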

Now, also in the header of each of my pages is a bit of code that checks to see if the visitor's IP address has been listed in the database as a "bad bot". If it has, the page returns a "403 Forbidden" error, along with the reason for exclusion: "Bot does not obey robots.txt". This keeps the poorly behaved and downright malicious bots from accessing anything other than the first few landing pages of the site, and seems to do an excellent job of preventing address harvesting.
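The per-page check might look roughly like this (same assumed schema as above):

<%
' Top of every page: refuse service to any IP caught by the bot trap
Dim banConn, banRS
Set banConn = Server.CreateObject("ADODB.Connection")
banConn.Open Application("connString")
Set banRS = banConn.Execute("SELECT IP FROM BadBots WHERE IP = '" & _
    Request.ServerVariables("REMOTE_ADDR") & "'")
If Not banRS.EOF Then
    Response.Status = "403 Forbidden"
    Response.Write "Access denied: bot does not obey robots.txt"
    Response.End
End If
banRS.Close
banConn.Close
%>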

I have a hidden address on a couple of pages of the site that I monitor alongside the monitor address in the bt.asp file. By comparing the amount of spam received at these two addresses, I can get a good idea of how effective this strategy has been: the address in bt.asp receives about 200 spams per day, while the protected test address receives fewer than 5 per week. This is after almost 2 years of exposure for both addresses.

As to your problem with IIS, I would talk to your web host about resolving it. Using IIS to send emails should be nearly 100% reliable; if you are losing emails, there is either a problem with the configuration of the web server or a problem with a mail server on one end or the other.
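For what it's worth, one common configuration fix on IIS is to send through CDO and relay via your mail host's SMTP server rather than the local pickup directory; a sketch along these lines (the server name, addresses, and form field are placeholders):

<%
' Sketch: send the contact-form mail through the mail host's SMTP server
Dim msg, cfg
Set msg = Server.CreateObject("CDO.Message")
Set cfg = msg.Configuration

' 2 = cdoSendUsingPort: deliver over the network instead of local pickup
cfg.Fields("http://schemas.microsoft.com/cdo/configuration/sendusing") = 2
cfg.Fields("http://schemas.microsoft.com/cdo/configuration/smtpserver") = "mail.example.org"
cfg.Fields("http://schemas.microsoft.com/cdo/configuration/smtpserverport") = 25
cfg.Fields.Update

msg.From = "webform@example.org"   ' a real mailbox on your own domain helps
msg.To = "staff@example.org"
msg.Subject = "Message from the website contact form"
msg.TextBody = Request.Form("message")
msg.Send
Set msg = Nothing
%>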


Archived

This topic is now archived and is closed to further replies.
