Could we improve the link parser?

slb · October 7, 2004

Below you'll find an example of text and link content crafted to elude not only content filters but also the SC link parser in the message body.

----08761910132950075070
Content-Type: text/html;
Content-Transfer-Encoding: 7Bit
<html>
<p>30 2mg Xana<input type="hidden" value="">x for $119<br>
30 10mg V<input type="hidden" value="">alium for $119<br>
You don't need a prior prescription! Quick Shipping!</p>
<br>
<p>http://www.cutler.mangs<input type="hidden" value="">sis.c<input type="hidden" value="">om</p>
</html>
----08761910132950075070--

Although there is no HREF tag for SC to find, the link in the mail is perfectly clear when one reads it in HTML mode, and the user can simply cut and paste the link into a browser: it works!

Would it be possible to first render the mail as HTML and then search the resulting text for URL syntax such as "www." or "http://"?
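
As a rough illustration of that idea (and not a description of how the SpamCop parser actually works), the Python sketch below drops every tag, keeps only the text a reader would see, and then searches that text for URL syntax. The class `VisibleText`, the function `visible_urls`, the regular expression, and the `example.com` address are all made up for this example.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern: anything starting with "http://", "https://" or "www."
URL_PATTERN = re.compile(r"(?:https?://|www\.)[^\s<>\"']+")


class VisibleText(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_urls(html_body):
    """Return URL-like strings found in the text left after removing tags."""
    parser = VisibleText()
    parser.feed(html_body)
    parser.close()
    # Fragments are joined with no separator, so a host name that was
    # split by hidden <input> tags is glued back together.
    text = "".join(parser.chunks)
    return URL_PATTERN.findall(text)


# Placeholder body that mimics the trick used in the spam above.
sample = (
    '<p>http://www.exam<input type="hidden" value="">ple.c'
    '<input type="hidden" value="">om</p>'
)
print(visible_urls(sample))  # ['http://www.example.com']
```

Joining the fragments with nothing between them is what lets the split-up host name come back together. The same choice could also glue unrelated words together, so a real parser would need more care than this.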

P.S. Of course, this one didn't get through, because I edited the spam myself and put the URL in an HREF tag >:-)

GraemeL · October 7, 2004

Please note that making material changes to the body to force the parser to find links that it would otherwise ignore is against the rules.

From the web site: Rules - everybody read!

"SpamCop does what it does and doesn't do for a reason. Do not make any material changes to spam before submitting or parsing which may cause SpamCop to find a link, address or URL it normally would not, by design, find."

qjvgpuryy · October 8, 2004

This "links/URLs in text not found" seems to come up a lot, especially when people first start reporting. Should it be in the FAQ (maybe under "Regarding specific reporting problems:")?

GraemeL · October 8, 2004

The OP sent me a message saying that the link I supplied does not work. I accidentally sent a link to the mailsc server, which is for email account users. Replacing mailsc with www will fix the problem.

This is the world-readable FAQ entry.

bigchiefbc · October 8, 2004

SpamCop doesn't seem to be picking up ANY of my links in the body. Here is an example of what I pasted in, which came back "No links found". Shouldn't links that are in normal <a> tags be caught by the parser?

<pre><tt>Rated NO.1 Penis Enlargement Pill on the Market!

Gain Up To 3+ Full Inches In Length

Increase Your Penis Width (Girth) By 20%

Stop Premature Ejaculation

Produce Stronger and Rock Hard Erections

<a href="http://www.herbalicious.net/pgf/track.php?id=23" target=_blank>http://www.herbalicious.net/pgf/track.php?id=23</a>

No More Receiving offers

<a href="http://www.herbalicious.net/r/" target=_blank>http://www.herbalicious.net/r/</a>

</tt></pre>

StevenUnderwood · October 8, 2004

Posting part of the body does nobody any good. Whether the parser searches for <A HREF...> (or anything else) also depends on the headers of the message. If the headers say the body is text only, it will not look for HTML code.
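
To make the header dependence concrete, here is a minimal Python sketch, assuming a parser that looks for `<a href>` only when the message declares itself as `text/html`. It illustrates the behaviour described above and is not SpamCop's code; `HrefCollector`, `links_by_declared_type`, and the `example.com` URL are invented for the example, and multipart messages are deliberately ignored.

```python
from email import message_from_string
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_by_declared_type(raw_message):
    """Return <a href> targets, but only if the headers declare text/html."""
    msg = message_from_string(raw_message)
    if msg.is_multipart():
        return []  # multipart handling is left out of this sketch
    if msg.get_content_type() != "text/html":
        return []  # body is declared as plain text: tags are just text
    collector = HrefCollector()
    collector.feed(msg.get_payload())
    collector.close()
    return collector.hrefs


BODY = '<a href="http://example.com/offer" target=_blank>click</a>\n'

print(links_by_declared_type("Content-Type: text/plain\n\n" + BODY))  # []
print(links_by_declared_type("Content-Type: text/html\n\n" + BODY))
# ['http://example.com/offer']
```

The same body yields different results depending only on the `Content-Type` header, which is why a pasted body without its headers cannot explain a parse.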

If you post a tracking URL for one of your problem parses, we will be better able to tell you why the parser is doing what it is doing.

Wazoo · October 8, 2004

That same FAQ entry is incorporated within the FAQ here ... though saying that now might catch me in a lie, as I haven't yet gone through item by item to see whether Courtney's work on the "latest new look" and the FAQ cleanup has actually moved pointers around.

bigchiefbc · October 11, 2004

Here is the tracking URL for a post which totally missed the link:

http://www.spamcop.net/sc?id=z681403107z70...35e1969545c4cez

Wazoo · October 11, 2004

Those missing links are due to a badly constructed message. The headers contain the line:

Content-Type: multipart/alternative; boundary="--04017693903892090"

but the body contains no boundary lines. This could be an intentional screw-up by the spammer, or it could be caused by the tools used in handling this spam; I don't believe that has been established yet. Earlier you said "pasted in", but no software or applications were mentioned. Are you using Outlook or Eudora, for example?
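
A mismatch like this can be checked for with a few lines of Python. The sketch below uses the standard `email` module to read the boundary declared in the headers and then looks for a matching delimiter line (`--` followed by the boundary) anywhere in the message. The function name `declared_boundary_is_used` and the sample messages are hypothetical; only the boundary value comes from the headers quoted above.

```python
from email import message_from_string


def declared_boundary_is_used(raw_message: str) -> bool:
    """Return False when the headers declare a boundary that never appears."""
    msg = message_from_string(raw_message)
    boundary = msg.get_boundary()
    if boundary is None:
        # No boundary declared, so there is nothing to mismatch.
        return True
    delimiter = "--" + boundary
    # Header lines do not normally start with the delimiter, so scanning
    # every line of the message is good enough for this sketch.
    return any(line.startswith(delimiter) for line in raw_message.splitlines())


HEADERS = 'Content-Type: multipart/alternative; boundary="--04017693903892090"\n'

# Hypothetical bodies: one without any delimiter line, one with them.
broken = HEADERS + '\n<a href="http://example.com/">offer</a>\n'
intact = (
    HEADERS
    + "\n----04017693903892090\n"
    + "Content-Type: text/html\n\n"
    + '<a href="http://example.com/">offer</a>\n'
    + "----04017693903892090--\n"
)

print(declared_boundary_is_used(broken))  # False
print(declared_boundary_is_used(intact))  # True
```

When the check returns False, a parser that trusts the headers has no parts to look into, which would be consistent with a "No links found" result.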

bigchiefbc · October 11, 2004

No, I am using Yahoo mail and this is the complete message that comes up in Yahoo. The headers pasted in are what is displayed in Yahoo (I have display full headers selected), and the message body is the source, copied and pasted in whole.

bigchiefbc · October 11, 2004

Here is another tracking URL of another email with links that were missed.

http://www.spamcop.net/sc?id=z681439389z79...56d6c7b9eddcc1z

I understand that the spammer may intentionally format the source or headers incorrectly, but how can I overcome this and have my emails parsed correctly?

Wazoo · October 11, 2004

I have to key on your words "parse correctly" ... it is being "parsed correctly". The only methods available to force the parsing engine to additionally generate reports for these links would put you in violation of the "material modification" rule, thus putting your account in jeopardy. A paid member could track down the contact points for those links and add those addresses as additional notifications (with a note of explanation). A free reporter could generate a manual report to the appropriate contact address.

bigchiefbc · October 15, 2004

I have another spam in which the parser missed a URL in the body, and this time, the content type is listed as text/html. Why did the parser totally miss the link?

http://www.spamcop.net/sc?id=z682528879zba...72d1b796431283z

Wazoo · October 15, 2004

Don't take this as gospel, but I'll point out that the parser doesn't do graphics, frames, redirections, etc. The "target" code carries the possibility that an innocent third party is being pulled into play, so we're back to erring on the side of caution.
