Parsing Failure due to mangled <HEAD>


bdurrett

It looks as if some clever spammer has figured out a way to keep SpamCop from parsing a message body for URLs by mangling the opening tags of an HTML file. The most successful example so far has been a spam for a cable TV filter. I end up getting an error saying that SpamCop needs a full and complete copy of the message body, although I have supplied exactly that ("View Source" -> "Select All" -> "Copy" -> paste into the report form; yes, I am using Outlook, so I have to use the Outlook hack).

The sender appears to be a Comcast zombie. Here is a copy of the header and the associated body text / HTML:

Received: from c-24-7-52-94.client.comcast.net ([24.7.52.94])

by prserv.net (in9) with SMTP

id <20040319074032109026lo4ke>; Fri, 19 Mar 2004 07:40:58 +0000

X-Originating-IP: [24.7.52.94]

X-AntiVirus: Checked by Dr.Web (http://www.drweb.net)

Received: from 228.59.148.78 by 24.7.52.94; Thu, 18 Mar 2004 10:39:16 +0300

Message-ID: <FDEHBKWJXHDWHXPQPYECI[at]yes.com>

From: "Brent Self" <bpkyff[at]yes.com>

Reply-To: "Brent Self" <bpkyff[at]yes.com>

To: <...snip...>

Subject: Re: WXKX, voice that yeshua

Date: Thu, 18 Mar 2004 13:39:16 +0600

X-Mailer: AOL 1.0 for Windows US sub 215

MIME-Version: 1.0

Content-Type: multipart/alternative;

boundary="--984907409932252537"

X-Priority: 3

X-MSMail-Priority: Normal

X-IP:9.97.228.180

----984907409932252537

Content-Type: text/html;

Content-Transfer-Encoding: 7Bit

<HTML><HEAD>

<BODY>

<p>Fr</maldistribute>ee Ca</surge>ble%RND_SYB TV</p>

<a href="http://www.8001hosting.com/cable/">

<img border="0" src="http://www.8001hosting.com/fiter1.jpg"></a>

taxonomy primordial buttery intransigent shortage nitrogenous freedom banks gnaw louisa comedy farnsworth ho flaky glycogen alliance diamond musket individualism easternmost viceroy passband beaux regulate cantilever almighty campsite <BR>

chevalier deja dumbbell monocular mammoth utah moen daimler thunderbolt turnstone atrocity infest pinball decrease heartfelt began amoco nebulous alexandre bagley aforementioned budd d seder academician saloonkeeper apocrypha <BR>

</BODY>

</HTML>

----984907409932252537--

and here is the error that gets produced:

Finding links in message body

Parsing text part

error: couldn't parse head

Message body parser requires full, accurate copy of message

More information on this error..

no links found

As you can see, there IS a link in the body but something with the malformed <HEAD> is preventing SpamCop from being able to parse it. When one clicks on the "More information on this error" link, you get the page that says something about headers not being complete. It has nothing to do with the error that is reported.

Yeah, it's the header all right. You should find that if the part

X-IP:9.97.228.180
----984907409932252537

is like

X-IP: 9.97.228.180

----984907409932252537

(definitely line return before the declared boundary, maybe space after the colon in the preceding X-line too), the parser should then handle it - presumably to get reporting details (if any) for manual (your own) reporting rather than through SpamCop.
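
The edit described above can be sketched in Python. This is only an illustration of the manual fix, not anything SpamCop itself does; the function name `insert_blank_before_boundary` is made up for this example, and the sample is trimmed to the last few lines of the posted header.

```python
def insert_blank_before_boundary(raw: str, boundary: str) -> str:
    """Insert an empty line before the first boundary delimiter when that
    delimiter sits directly under a header line (no blank line between)."""
    delimiter = "--" + boundary  # MIME delimiter lines are "--" plus the boundary value
    lines = raw.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "":
            # A blank line already ends the header block before any
            # delimiter was seen, so there is nothing to repair.
            return raw
        if line.rstrip() == delimiter:
            return "\n".join(lines[:i] + [""] + lines[i:])
    return raw


# Trimmed copy of the posted message (hypothetical test input).
mangled = "\n".join([
    "X-MSMail-Priority: Normal",
    "X-IP:9.97.228.180",
    "----984907409932252537",
    "Content-Type: text/html;",
    "Content-Transfer-Encoding: 7Bit",
    "",
    "<HTML><HEAD>",
])

repaired = insert_blank_before_boundary(mangled, "--984907409932252537")
print(repaired)
```

Note that the boundary value declared in the header already starts with two dashes, so the delimiter line carries four. Whether editing a submission this way is acceptable under SpamCop's reporting rules is a separate question from whether it parses.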

That's all I can see. The learning curve is certainly a bit steep for us ordinary end-users and, as Sylvester said, "What a revolting turn of events this is!", when we have to get involved in all this esotery. Better than being a total victim though.

Anyway, there are usually some helpful and more clued-up people trawling through from time to time if that doesn't get you over the line.

No need for confirmation. Farelf pointed to the problem: the blank line missing between the header and the body. Now the question is ... "use the Outlook hack" ... does that mean the two-part paste-in window? Was it just a slip of the mouse during the copy part of the manipulation? The confusing part is that the "Outlook hack" is needed because these boundary lines are destroyed by Outlook's handling, yet here they are showing in your example, which would support your theory that the spammer is definitely doing a direct manipulation of the header ... Interesting ...

OK, just a little clarification seems in order.....

===============================

.

.

.

.

X-MSMail-Priority: Normal

X-IP:9.97.228.180

----984907409932252537

Content-Type: text/html;

Content-Transfer-Encoding: 7Bit

------------- Header is above - Body below this line -----------

<HTML><HEAD>

<BODY>

<p>Fr</maldistribute>ee Ca</surge>ble%RND_SYB TV</p>

.

.

.

.

===================

This is how it looks when I copy it in. In response to your question, yes, I am using the 2-part submission form: I select all in the header (under "Options"), then select all in the body (after using "View Source"), and copy and paste each into the appropriate box. I would say that this is DEFINITELY purposeful manipulation. In addition, the parsing fails NOT in the header but in parsing the BODY of the message for URLs, as shown by the following error message:

+++++++++++++++++++++

Finding links in message body

Parsing text part

error: couldn't parse head

Message body parser requires full, accurate copy of message

++++++++++++++++++

The spam header is parsed correctly. Where it appears to fail is where it tries to parse the <HEAD> of the HTML block in the body. From poking around a bit, I have found that a) there is no </HEAD> tag in the text body and b) the ALL CAPS seems to be somewhat of a problem. Other spams that are missing the </head> but use lower case don't draw a complaint at all.
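
For what it's worth, the HTML theory can be checked outside SpamCop. The sketch below runs the posted body through Python's standard `html.parser`, which is an assumption made purely for illustration and is not the SpamCop parser. The `LinkCollector` class is a made-up name. If a tolerant parser can pull the link out despite the unclosed upper-case <HEAD>, tag case and a missing </HEAD> are probably not what stops link extraction.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href values of <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names before this call,
        # so <A HREF=...> and <a href=...> are treated alike.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Body as posted, minus the random filler words.
body = """<HTML><HEAD>
<BODY>
<p>Fr</maldistribute>ee Ca</surge>ble%RND_SYB TV</p>
<a href="http://www.8001hosting.com/cable/">
<img border="0" src="http://www.8001hosting.com/fiter1.jpg"></a>
</BODY>
</HTML>"""

collector = LinkCollector()
collector.feed(body)
collector.close()
print(collector.links)
```

The image source is deliberately ignored here; only the anchor target is collected.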

?? we're agreeing with you <g> In the header, the critical lines are:

Content-Type: multipart/alternative;

boundary="--984907409932252537"

This indicates the 'need' for the following lines (in this case) to be found in the body:

Content-Type: text/html;

----984907409932252537

We agree that it does appear that the spammer moved these lines from the body to the header (and reversed the order) ... The body parse fails because of the above data, combined with the remnants of the 'real' original, like the bottom boundary line still being in existence, which again, based on the "Outlook hack", shouldn't be there. And yes, we also agree that, as displayed, the entire crud would be treated as part of the <HEAD> portion of an HTML document. Once again, not much clarification needed; I think we've all agreed that the spammer did this intentionally.
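
To make the link between the Content-Type header and the boundary lines concrete, here is a small sketch using Python's standard `email` package (an illustration only; how SpamCop does this internally is not known here). It reads the declared boundary and derives the delimiter lines a MIME parser would then expect in the body.

```python
from email.parser import HeaderParser

# The two critical header lines, with the continuation line indented.
header_text = (
    "MIME-Version: 1.0\n"
    "Content-Type: multipart/alternative;\n"
    '     boundary="--984907409932252537"\n'
    "\n"
)

msg = HeaderParser().parsestr(header_text)
boundary = msg.get_boundary()      # value of the boundary parameter, unquoted

# MIME delimiters are built from the boundary value:
opening = "--" + boundary          # starts each body part
closing = "--" + boundary + "--"   # ends the last body part

print(msg.get_content_type())
print(opening)
print(closing)
```

Because the declared value itself begins with two dashes, the delimiter lines in the body carry four, which matches the `----984907409932252537` lines in the sample.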

Wazoo (extraordinarily helpful person) has covered it but to dot the i and cross the t because the devil is in the ... and I know what it is like to come de novo to this stuff.

It only looks confusing because it is. Parsing is presumably a two-part process: the headers are parsed first, no problem, but the body part fails, having been told to look for a boundary ----984907409932252537 and not finding it (until the end) because it is scrunched into the header. The message "error: couldn't parse head" is, no doubt, technically correct but utterly confusing to you and me. It really has nothing to do with the HTML <head> or <HEAD> (exactly the same) and </head> tags; any browser can get by without these and so can the parser. Possibly the only thing that will throw the parser is an <html> without a closing </html>, though browsers don't seem to mind.
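
The two-part process guessed at above can be mimicked with a few lines of Python. This is a simplified model under the assumption that the header block ends at the first empty line; the function names are invented for the example and nothing here is taken from SpamCop's code.

```python
def split_header_body(raw: str):
    """Split at the first empty line, as a header-then-body parser would."""
    head, _, body = raw.partition("\n\n")
    return head.splitlines(), body.splitlines()


def delimiter_positions(raw: str, boundary: str) -> dict:
    """Report where the opening and closing delimiter lines end up."""
    delimiter = "--" + boundary
    header, body = split_header_body(raw)
    return {
        "opening_in_header": delimiter in header,
        "opening_in_body": delimiter in body,
        "closing_in_body": (delimiter + "--") in body,
    }


BOUNDARY = "--984907409932252537"

# Trimmed version of the posted spam: no empty line above the delimiter.
mangled = "\n".join([
    "Content-Type: multipart/alternative;",
    '     boundary="--984907409932252537"',
    "X-IP:9.97.228.180",
    "----984907409932252537",
    "Content-Type: text/html;",
    "Content-Transfer-Encoding: 7Bit",
    "",
    "<HTML><HEAD>",
    "<BODY>",
    "</BODY>",
    "</HTML>",
    "----984907409932252537--",
])

# Same message with the empty line restored above the delimiter.
repaired = mangled.replace(
    "X-IP:9.97.228.180\n", "X-IP: 9.97.228.180\n\n", 1
)

print(delimiter_positions(mangled, BOUNDARY))
print(delimiter_positions(repaired, BOUNDARY))
```

In the mangled copy the opening delimiter is swallowed by the header block and only the closing one is left in the body; in the repaired copy both are where a multipart parser would look for them.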

Trust me, try parsing a text version through the single part submission box and it will work with the modification I described. But you will first (for this box) need to indent the continuation lines like

Received: from c-24-7-52-94.client.comcast.net ([24.7.52.94])

          by prserv.net (in9) with SMTP

          id <20040319074032109026lo4ke>; Fri, 19 Mar 2004 07:40:58 +0000

.. and

Received: from 228.59.148.78 by 24.7.52.94; Thu, 18 Mar 2004 10:39:16

          +0300

and

(To: line needs to have continuation indented as well - won't repeat here and you may wish to edit your post to munge them like SpamCop does, you probably all get more than enough spam already ;-)

and

Content-Type: multipart/alternative;

     boundary="--984907409932252537"

and of course, make it (including indented continuation)

X-IP: 9.97.228.180

          

----984907409932252537

Content-Type: text/html;

     Content-Transfer-Encoding: 7Bit

and the rest is okay, needs to all be pasted in, down to and including the final

----984907409932252537--

If you do this, you will get

Finding links in message body

Recurse multipart:

   Parsing HTML part

and all the stuff about the links which follows. I did this already and you can find it at

Your report

QED
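
The indenting of continuation lines described above is tedious by hand, so here is a rough Python sketch of it. It is a heuristic, not a full header parser: it assumes that any non-empty header line that neither starts with whitespace nor looks like `Name:` is a continuation of the line before it. The names `HEADER_START` and `indent_continuations` are made up, and the function should only be given the true header lines (everything above the boundary), since a boundary line would otherwise be indented too.

```python
import re

# A header field start: a name made of letters, digits and hyphens, then a colon.
HEADER_START = re.compile(r"^[A-Za-z][A-Za-z0-9-]*:")


def indent_continuations(header_lines, indent="     "):
    """Indent lines that appear to continue the previous header field."""
    fixed = []
    for line in header_lines:
        is_continuation = bool(
            fixed                          # the first line cannot continue anything
            and line.strip()               # leave empty lines alone
            and not line[0].isspace()      # already indented
            and not HEADER_START.match(line)
        )
        fixed.append(indent + line if is_continuation else line)
    return fixed


# Header lines as they come out of the mail client (continuations flush left).
pasted = [
    "Received: from c-24-7-52-94.client.comcast.net ([24.7.52.94])",
    "by prserv.net (in9) with SMTP",
    "id <20040319074032109026lo4ke>; Fri, 19 Mar 2004 07:40:58 +0000",
    "X-Originating-IP: [24.7.52.94]",
    "Content-Type: multipart/alternative;",
    'boundary="--984907409932252537"',
]

for line in indent_continuations(pasted):
    print(line)
```

A continuation that happens to begin with something like `http:` would be mistaken for a new header by this pattern, so the result still deserves a quick look before pasting.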

Keep at it, I doubt *anyone* (apart from the terminally confused) ever pretends to have mastered it all.

Farelf, .."to dot the i and cross the t " ... whew! you forgot to mention the glass and stained oak presentation case <g> .. great job at the play by play!

bdurrett ... shooting the spammer's PC .. some would say you can't aim worth a hoot <g>

bdurrett ... shooting the spammer's PC .. some would say you can't aim worth a hoot <g>

I modified the original sentence, just in case someone out there might have seen it as condoning murder. Actually, I was rated "Expert" with a .45 in the Navy <weg>

One better though would be a way to remotely "charge" their keyboard. "Send a spam, Get a jolt!" You could always tell the spammers then. Just listen for the voice of the little child (who has NO filter between mouth and brain).... "Hey Mom, why is that guy's hair curly and smoking?" :lol:

Okay - you still have some "yourgroup" addresses in your posting. You sure you want to leave them there?

Meantime, I've become nastier (since some cheeky sod sent me spam from "myself"). Nothing less than permanent stimulation of the trigeminal nerve will do, just a few implementation issues to sort out, then maybe SpamCop can help with the deployment.

Okay - you still have some attglobal addresses in your posting.  You sure you want to leave them there?

Whoops! Thanks for pointing that out, especially since one of them was mine.....

Meantime, I've become nastier (since some cheeky sod sent me spam from "myself").  Nothing less than permanent stimulation of the trigeminal nerve will do, just a few implementation issues to sort out, then maybe SpamCop can help with the deployment.

And this nerve would be connected to .... :huh: what? (Aside from 220vAC at about 30 Amps)

Strange, I get spam from myself regularly, as well as bounce messages for viruses I never sent. See, I don't really USE this account but it is the one I "advertise" or give out to people. My real account is... well .... that is a secret! :D

The other members of your "group" will thank you (or not, depending ...)

Trigeminal neuralgia is supposed to be the worst pain that can be experienced, apart from that I don't know (implementation issues ;-) - I don't even want to think about how they know that unless they use spammers for medical experiments (I can think about that).

Have a good weekend ...

  • 2 weeks later...

I've seen about 30% of my emails with this same error. My email client parses the spam properly and displays the message, recognizes attachments, etc. in spite of the missing blank line.

The purists can certainly say that the message does not follow the standards, but that's not the real question here. Our goal is to identify the websites referenced in the body (etc.) and report them. Will the spamcop experts fix the parser so that it ignores this error, or do we let the spammers get away with this bit of obfuscation?

Sample:

. . . {snip}
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="--031389564895004611"
X-Priority: 3
X-MSMail-Priority: Normal
X-IP:16.176.176.232
X-ContentStamp: 1:1:3376879774
----031389564895004611
Content-Type: text/html;
Content-Transfer-Encoding: quoted-printable

<html><head><title></title>
 </head>
 <body background=3D"http://lksjdni.info/ads2/whtbg.gif">
. . . {snip}

It's relatively simple to fix this when I paste the email into spamcop's web page, but I usually don't parse all the headers manually until after I get the "can't parse head" message so I end up reporting it twice (well, sort of).
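
A quick pre-check before pasting would avoid the double submission. The sketch below is a rough heuristic in Python, not a SpamCop feature: it flags a message when a line beginning with `--` shows up before the first empty line, which is the pattern in the sample above. The function name is made up for this example.

```python
def boundary_glued_to_header(raw: str) -> bool:
    """True when a delimiter-like line ("--...") appears before the first
    empty line, i.e. inside what a parser will treat as the header block."""
    for line in raw.splitlines():
        if not line.strip():
            return False      # header block ended cleanly at an empty line
        if line.startswith("--"):
            return True       # delimiter-like line inside the header block
    return False


# The sample above, header portion only (snipped parts left out).
sample = "\n".join([
    "MIME-Version: 1.0",
    "Content-Type: multipart/alternative;",
    '\tboundary="--031389564895004611"',
    "X-Priority: 3",
    "X-MSMail-Priority: Normal",
    "X-IP:16.176.176.232",
    "X-ContentStamp: 1:1:3376879774",
    "----031389564895004611",
    "Content-Type: text/html;",
    "Content-Transfer-Encoding: quoted-printable",
    "",
    "<html><head><title></title>",
])

print(boundary_glued_to_header(sample))
```

Header lines never legitimately start with two dashes, so false alarms should be rare, but the check says nothing about other kinds of mangling.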

The only item I saw in your sample ... the SpamCop parsing tool doesn't do images. There are way too many occurrences of spammers stealing bandwidth from innocent victims to stuff their spam. And, yes, this causes much bitching when (haven't seen one in a long while) the spammer incorporates the entire spam within a page-sized graphic image ... but again, SpamCop doesn't bark at image files.

... SpamCop parsing tool doesn't do images.

Red herring, I'm afraid. The problem is that seen throughout this topic - lack of a blank line above the boundary open, causing the dreaded

error: couldn't parse head

Message body parser requires full, accurate copy of message

My email client parses the spam properly and displays the message, recognizes attachments, etc. in spite of the missing blank line.
(b_borden Apr 2 2004, 10:36 AM)

This is the bugbear - there is usually no indication without doing a manual "parse" that the SpamCop parser is going to have difficulty before the event, then follows the rigmarole of cancelling, going back and inserting the line (if such doesn't breach the "material alterations" guideline) and doing it all again.

The frustration is, if the mail client can handle it, why can't the parser? The same goes for the "too many links" error, except in that case Ellen has confirmed the spam must not be "amended" to accommodate the parser - in fact there is a general prohibition against "tricking" the parser to see what usually it can't (except when a minor change to content type is all it needs and ...)

In the case of too many links (at least the no object <a href = "bogus site"></a> type), last I looked, one can recall the reports confirmation and find, as if by magic, that the parser actually did resolve the real links and so get the details to manually contact the hosts (as if they care). I'm not sure in how many other cases (if any) the parser might actually produce useable results but is just too bashful to put them forward for routine reporting.

A long-running request (relative to my recent viewing of these pages) is along the lines of "please make the parser see what my mail client sees" and the inquiring/complaining posters here are effectively doing the same. I don't doubt that many low-volume reporters just "fix" their submissions in quiet rebellion against the prohibition anyway.

... SpamCop parsing tool doesn't do images.

Red herring, I'm afraid. The problem is that seen throughout this topic - lack of a blank line above the boundary open, causing the dreaded

For the parsing error, you're absolutely correct.

Our goal is to identify the websites referenced in the body (etc.) and report them

We've been beating "format" to death lately, so I chose to go with the specific item remarked on and the only URL posted in the sample.

I chose to go with the specific item remarked on and the only URL posted
As you wish, b_borden can speak for him/her self but I would have thought you've missed the point of that post (and we don't go after image sources anyway). But I've been wrong before. Incidentally, if format issues have been "beaten to death lately" then that might be a significant datum.

I chose to go with the specific item remarked on and the only URL posted
As you wish, b_borden can speak for him/her self but I would have thought you've missed the point of that post (and we don't go after image sources anyway). But I've been wrong before. Incidentally, if format issues have been "beaten to death lately" then that might be a significant datum.

Well, in the sample I provided, there were more links after the image, so for two reasons, the image is a red herring. I just clipped out the significant lines where it was missing the blank line. I didn't really intend for it to find the image.

My point was (as some have and others have not) that the missing blank line seems to be a useful trick to defeat the spamcop parser and still get the spam message through to the addressees.

Fix the parser so it understands "improperly formatted" messages - - at least for this one type of (intentional) error.

Fix the parser so it understands "improperly formatted" messages

From Julian's end, he's pretty much chosen to go with the "ease" of having the SpamCop tool set work within the framework of a "proper" spam submittal. This became painfully clear a year or so ago when he tightened things up a bit, which put a major crimp in the reporting that was being done by Outlook and Eudora users in particular. From the SpamCop parsing tool's viewpoint, there are already so many decisions to try to sort out ... did the reporter screw up, did the submittal get hosed during transmission, did the spammer do something ... so the first imposed "design limit" is ... spam submittal must have a valid format .... only then can it start looking at the actual data involved.

There are a number of discussions ongoing over in the newsgroups dealing specifically with mangled headers ... and yet another series that's now focusing on blank BCC: lines .... Another perspective is that there's Julian doing the codebase at one end as compared to spammers around the world trying to get around any kind of filtering / tracking ..

Another perspective is that there's Julian doing the codebase at one end as compared to spammers around the world trying to get around any kind of filtering / tracking ..
Good answer Wazoo, in fact there must/ought to be a law about it. Something along the lines of:

"Software development time is proportional to the inverse of the usefulness of the software minus one." [corrected, (1/x)-1 *not* 1/(1-x) I've always found arithmetic to be a challenge.]

Thus about an infinite number of years of "virus time" occur in each year of real time, dozens of years of "spam time", zero years of "deamware time" and, on that scale, we contemplate something less than unity for the less than critical features of SpamCop. Fair enough.

Notwithstanding, those wanting their say on what they consider critical/desirable features and enhancements in terms of their use of SpamCop's functionality can contribute via the current survey (from my "Welcome registered user/paste-here" page, presumably the same for others). Sorry guys (bdurrett, b_borden, et al), as a mole who is satisfied with the current functions and facilities of that role, email body parsing doesn't affect my own reporting function and I shouldn't and haven't rated it in my own survey response; I certainly understand your points though.

Archived

This topic is now archived and is closed to further replies.
