
HTML Decimal "。" throws off link parser





Browsers read this as a functional link and take the user to the spammer's site.


SC sees and parses as:

scumbag.scumbags_site is not a routeable IP address

Cannot resolve http://scumbag.scumbags_site/。biz/GXz0ycHpZ

I've looked through the following, and this situation isn't mentioned (other than to say SC isn't a URL parser by trade), and I'm seeing more of it.




That's funny, the forum converted it too.

The HTML entity (decimal) "&#12290;" is what does the concatenating.
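For what it's worth, the trick is reproducible with Python's standard library alone. The hostname below is a placeholder, not the actual spamvertised domain:

```python
import html

# The spam carries the HTML decimal entity &#12290; in the message body.
# Browsers render it as U+3002 (ideographic full stop).
sep = html.unescape("&#12290;")
print(repr(sep))  # '。'

# IDNA processing (RFC 3490) treats U+3002 as a label separator, so a
# browser resolves "example。biz" exactly like "example.biz". Python's
# stdlib idna codec applies the same dot mapping:
host = "example\u3002biz"  # placeholder domain
print(host.encode("idna"))  # b'example.biz'
```

A parser that takes the character literally instead sees a single unresolvable label, which matches the "is not a routeable IP address" output above.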


...The reply from Don referenced by Wazoo in the first article to which you provided a link seems to me to be the nub of the answer to your point, especially this part:

There are many reasons why a URL that a sentient being can see in a spam may not be seen by the parser. The reasons range from problems with boundaries, mime parts, how the URL is presented in the spam when it is seen by the parser (i.e. does it agree with the content specification) through creative attempts by spammers to avoid programmatic recognition of a link. Obviously the latter leads us to try to make code changes to accommodate those when possible.

Specifically, the parser has code to attempt to avoid non-resolution by nameservers which recognize that the query is coming from SpamCop servers, and this code may work better at some times than others. It would not be beneficial to go into more detail on this.... The code tries to strike a balance between hanging on a query for an unacceptably long period of time and the failure to resolve, but when you think about the fact that the parser is trying to handle huge numbers of spam with a limited amount of resource, then it becomes understandable that the parser cannot wait interminably for an answer to a query. Most browsers will wait longer than the parser. Some browsers also accept malformed URLs which should never bring up a page.

It is frustrating to everyone when there is not consistent parsing of URLs, but I am not entirely sure that there is a way to solve this problem for every URL and every instance of parsing. Of course we do try to improve the overall situation within the limitations that we have.

As an aside -- the number of things that go into whether a URL can be "seen" in the spam, whether SpamCop can get it to resolve, whether some browser will display a page, etc, is so complex that you would probably overrun the capability of the forum disk space trying to explain it all :-) I know it frustrates people, but I am afraid that there isn't a whole lot that can be done. In the ideal world we would have limitless resources which would allow for no restrictions on parses, DNS servers, sneaky code to poke at this and to poke at that. In the real world we do what we can. And we are skewed towards injection sources rather than URLs. Not saying that URLs are less important but it is what it is ...


Yes, that was what I was afraid of and why I gave the hat tip caveat about "SC isn't a URL parser by trade." General policy has been to take what you get, and manually report the rest, or use other tools.

I understand this scenario is a bit off SC's primary mission objectives, and iterations could get complex to account for all scenarios.

I guess my hope was that, in all the changes from 2005 (when Wazoo wrote that) to 2014, perhaps the new Cisco overlords might fancy throwing in a couple of quick data-validation routines to catch simple tricks like this. Perhaps the code is more amenable to alteration in its current form; hardware has certainly gotten cheap and fast enough in the intervening years to handle a bit more work.

And if it isn't documented, then they won't know what to throw into their next system requirements development meetings.
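The kind of quick validation routine suggested above could be as simple as mapping the code points that IDNA treats as label separators to an ASCII dot before the hostname is parsed. A minimal sketch in Python; this is illustrative only, not SpamCop's actual code, and `normalize_host` is a hypothetical helper:

```python
# Code points that IDNA (RFC 3490 / UTS #46) treats as equivalent to ".":
# U+3002 ideographic full stop, U+FF0E fullwidth full stop,
# U+FF61 halfwidth ideographic full stop.
IDNA_DOTS = {0x3002: ".", 0xFF0E: ".", 0xFF61: "."}

def normalize_host(host: str) -> str:
    """Replace IDNA dot-equivalents with '.' so a plain URL parser
    sees the same hostname a browser would resolve."""
    return host.translate(str.maketrans(IDNA_DOTS))

print(normalize_host("example\u3002biz"))  # example.biz
```

Run before DNS lookup, this would let the parser resolve the same hostname the browser does instead of failing on the literal character.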



This topic is now archived and is closed to further replies.
