jongrose Posted December 18, 2007

First off, I want to state that I realize that the main objective of SpamCop is to report the source of spam, and that identifying and reporting the spamvertized URLs contained within email is secondary. Nonetheless, because this feature is included in SpamCop, I believe that every attempt should be made to keep up with the methods spammers employ, so that their criminal activities can be stopped. SpamCop is an automated tool used to fight spam not only for its members but for the good of the worldwide internet community. In keeping with this goal, a simple modification should be made to the SpamCop parsing engine to allow it to detect URLs that are currently being missed. So, for the sake of this discussion, I would appreciate it if arguments about SpamCop's URL detection philosophy were left out of this topic.

The Problem

For quite some time, spammers have been abusing MIME (Multipurpose Internet Mail Extensions) email headers in an attempt to bypass detection and avoid anti-spam techniques. By adding a malformed MIME header line to an email, the spammer effectively breaks the encoding of the email. The MIME type used by spammers for this purpose is known as the "Alternative Subtype". In the headers of the email, the spammer adds an incomplete Content-Type and boundary line of the kind commonly used to send a message in both plain text and HTML format when it is unknown which format the recipient's email client supports.

It is my understanding, based on this thread, that SpamCop cannot properly parse URLs contained in emails that include malformed and incomplete MIME headers. In this thread I will attempt to explain MIME to the best of my knowledge and put forth the argument that SpamCop should modify the parsing engine so that it detects and reports URLs that currently slip past it through this technique.

What is MIME?
Below is an example of the correct implementation of the MIME alternative subtype. I have prefixed the lines with numbers so that I can explain their usage below.

     1 MIME-Version: 1.0
     2 Content-Type: multipart/alternative;
     3     boundary="=_ba87f495fb100f8dc950f0cef0ffa800"
     4
     5 --=_ba87f495fb100f8dc950f0cef0ffa800
     6 Content-Type: text/plain; charset="ISO-8859-1"
     7 Content-Transfer-Encoding: 7bit
     8
     9 --=_ba87f495fb100f8dc950f0cef0ffa800
    10 Content-Type: text/html; charset="ISO-8859-1"
    11 Content-Transfer-Encoding: quoted-printable
    12

Lines 1-3 are what is included in the headers of the email. Line 1 declares that the email uses MIME. Lines 2 and 3 then state that the content is multipart and will include more than one version of the message. The boundary is a string of random characters, and may include a timestamp or other information; it tells the email client where to find and identify each MIME part.

In line 5 we see the boundary code again, prefixed by two hyphens. Lines 6-7 inform the email client that this section is made up of plain text, and give its character set and encoding. The plain text version of the message follows these lines in the body of the email, and that is what the end user sees in a plain text client.

In line 9 we again see the boundary code, and in lines 10-11 the content is now HTML. This would normally be followed by the same message in HTML, for HTML-capable email clients.

As you can see, the purpose of this MIME encoding is to let the sender deliver an email without knowing whether the recipient's client will display (or prefer) the message in HTML or plain text.

How is MIME abused?

When a spammer uses MIME incorrectly, it is similar to using broken or incomplete syntax. For example, when writing the code to create a link in HTML, the correct syntax would be <a href="http://www.website.xyz/">Click here</a>.
However, using a malformed MIME Content-Type is like leaving the trailing "</a>" off the end of the HTML a href code. The email client first sees that the message is MIME encoded and then looks for the follow-up boundary code in order to display the email message in its preferred format for the reader. If it does not find one, what it does next depends on how it is configured. In most instances, it will simply display the email message without difficulty.

An example of an invalid MIME alternative subtype looks like the following:

    1 MIME-Version: 1.0
    2 Content-Type: multipart/alternative; boundary="0-1466100096-1197442086=:47221"
    3 Content-Transfer-Encoding: 8bit

Lines 1-3 are included in the headers of the spam email. As you can see, the header lines themselves are correctly formed. However, nowhere in the body of the email is there a follow-up boundary line to tell the email client where to find the content, whether in plain text, HTML, or any other format. This could be caused by a poorly written email program or some other type of error, but in this case it is simply a malicious attempt to keep the email client's spam filters from checking the body of the email or any URLs it may contain.

Where does SpamCop come in?

SpamCop trusts the MIME Content-Type/boundary, and when the bogus lines are added to the headers they foul up the parsing engine, causing it to bypass or ignore any URLs, no matter how obvious they are to the reader. Why or how this happens, I do not know, as I am not familiar with the specific workings of the SpamCop parsing engine. When an email with a bogus MIME Content-Type is passed through the parsing engine, the message "Finding links in message body no links found" will show up, indicating that SpamCop has missed the URL(s) in the email.
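To make the failure mode concrete, here is a minimal Python sketch. It is not SpamCop's code, whose internals are unknown to this thread; it only uses Python's standard email package to show how a parser that trusts the declared MIME structure can come up empty on a message like the one above, and how a fallback scan of the raw body would still find the link. The function names, the URL pattern, and the message body are assumptions made up for the illustration; the three header lines are the ones from the spam example.

```python
import email
import re

# Simple URL pattern for the illustration; real-world matching is messier.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

# Headers copied from the spam example above; the body line is invented.
MALFORMED = "\n".join([
    "MIME-Version: 1.0",
    'Content-Type: multipart/alternative; boundary="0-1466100096-1197442086=:47221"',
    "Content-Transfer-Encoding: 8bit",
    "",
    "Click here: http://www.website.xyz/ today",
    "",
])


def urls_from_mime_parts(raw):
    """Trust the MIME structure: only look inside text/* parts."""
    msg = email.message_from_string(raw)
    found = []
    for part in msg.walk():
        # The top-level part claims to be multipart, so it is skipped here,
        # and without boundary lines in the body there are no sub-parts.
        if part.get_content_maintype() == "text":
            found.extend(URL_RE.findall(str(part.get_payload())))
    return found


def urls_from_raw_body(raw):
    """Fallback: ignore the declared structure and scan everything after the headers."""
    _, _, body = raw.partition("\n\n")
    return URL_RE.findall(body)


def find_urls(raw):
    """Use the MIME parts when they yield links, otherwise fall back to the raw body."""
    return urls_from_mime_parts(raw) or urls_from_raw_body(raw)
```

With the sample message, urls_from_mime_parts(MALFORMED) returns an empty list, while find_urls(MALFORMED) falls through to the raw scan and returns the one URL in the body. A fallback of this kind is the sort of change being proposed, though whether it fits the real parsing engine is for its maintainers to judge.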
Here are some examples of parses in which SpamCop missed the links:

http://www.spamcop.net/sc?id=z1570835087z5...49feaba719fe77z
http://www.spamcop.net/sc?id=z1561617737z1...814be496d226adz
http://www.spamcop.net/sc?id=z1570175158z8...843f2459f4a92dz
http://www.spamcop.net/sc?id=z1561673499zb...de9f8cca4bc21dz

The third and fourth links are both phishing emails, which is all the more reason that these URLs need to be reported.

Here is a previous discussion on this topic: Parsing: Spamcop not finding links in email when there are links

Resources & References

MIME - Wikipedia
RFC 2387: The MIME Multipart/Related Content-type
RFC 2046: MIME Part Two: Media Types - 5.1.4. Alternative Subtype
Content boundary - Wikipedia
StevenUnderwood Posted December 18, 2007

Good luck. I would not mind this change, but I hold little hope of seeing it implemented. There have been very few changes made to the parser over the last few years beyond fixing source locating issues, especially since Julian is not visible any more and a corporation "owns" the code.
Farelf Posted December 18, 2007

Very thoughtful, jongrose, well done. I guess, in the beginning, the intention was that the parse would see what a "real" mailreader would see (or act upon), because it would actually be counter-productive to complain about content that had no real-world effect. So it was made "standards compliant". Of course it is now apparent that certain applications will happily ignore standards (or try to lead/force variation) in order to maximize the viewing, listening and feeling pleasure of the user - and one's olfactory and gustatory organs are aquiver in anticipation of those further breakthroughs.

I really can see no sense in the SC parsing process following the MIME standards, except to the extent that some (unknowable number of) reports might become irrelevant if they are based on content that an ordinary application cannot "see". Anyway, I think you propose a change worth flagging for consideration, for sure. I think some, if not most, of the header and body mangling - including MIME declarations and boundaries - is inadvertent (because there are easier ways to bomb the body-read part of the process), but that wouldn't detract from the case put.
Wazoo Posted December 18, 2007

Fantastic write-up that should end up in the Wiki. Just noting that the description of the MIME lines/Boundaries didn't include the "closing" line.

"SpamCop trusts the MIME Content-Type/boundary and when the bogus lines are added in the headers it fouls up the parsing engine causing it to bypass or ignore any URLs, no matter how obvious they are to the reader. Why or how this happens, I do not know, as I am not familiar with the specific workings of the SpamCop parsing engine. When an email with bogus MIME Content-Type is passed through the parsing engine, the message 'Finding links in message body no links found' will show up, indicating that SpamCop has missed the URL(s) in the email."

Just a bit of history, philosophy, detail .... the RFC-Compliant structure of a submitted spam e-mail was checked/tested to try to help ensure that the user actually submitted something the parser could use and not stumble over. There came a day when Julian really tightened that part of the code up. The problem that next arrived was the fact that so many people were impacted, as their submittals failed that 'structure' test every time. At issue was that Outlook, Eudora, and a few other e-mail clients didn't 'keep' all that MIME structure stuff in a (SpamCop.net Reporting and/or RFC-Compliant) usable format. Thus begat the Eudora/Outlook work-around screens on the web-form submittal page, an attempt to allow Eudora/Outlook users to continue to report their spam.

So it should be seen that there are at least four separate/major paths in the Parsing & Reporting code at present. Most start with checking for the correct 'structure' ... as has been noted in some other recent traffic here, even that has been somewhat modified to allow certain garbage to flow through. And it has been noted that those four major paths have some different sub-paths involved.
Anyway, the 'connection point' between your query and this post is that not all users handle things properly, not all e-mail clients handle things properly, it is a known fact that spammers abuse all kinds of standards and policies ..... and the Parsing & Reporting code has to work with/around/decipher any and all of the above, with the emphasis on working at developing technically-correct values for suggested Report recipients.
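For readers following the boundary discussion, the "closing" line is the boundary prefixed by two hyphens and also followed by two hyphens. The Python sketch below illustrates the kind of structure check being talked about; it is not the SpamCop parser, and the function name delimiter_status and the message bodies are made up for the example. It reads the declared boundary with the standard email package and reports whether the body contains an opening boundary line and the closing one.

```python
import email
from typing import Tuple

# Boundary taken from the well-formed example earlier in the thread.
BOUNDARY = "=_ba87f495fb100f8dc950f0cef0ffa800"

HEADERS = [
    "MIME-Version: 1.0",
    f'Content-Type: multipart/alternative; boundary="{BOUNDARY}"',
    "",
]

# Two alternative parts; the message text is invented for the illustration.
PARTS = [
    f"--{BOUNDARY}",
    'Content-Type: text/plain; charset="ISO-8859-1"',
    "Content-Transfer-Encoding: 7bit",
    "",
    "Plain text version of the message.",
    f"--{BOUNDARY}",
    'Content-Type: text/html; charset="ISO-8859-1"',
    "Content-Transfer-Encoding: quoted-printable",
    "",
    "<p>HTML version of the message.</p>",
]

COMPLETE = "\n".join(HEADERS + PARTS + [f"--{BOUNDARY}--", ""])
UNCLOSED = "\n".join(HEADERS + PARTS + [""])
NO_DELIMITERS = "\n".join(HEADERS + ["Just a body with no boundary lines.", ""])


def delimiter_status(raw: str) -> Tuple[bool, bool]:
    """Return (has_opening_line, has_closing_line) for the declared boundary."""
    boundary = email.message_from_string(raw).get_boundary()
    if boundary is None:
        return (False, False)
    # Everything after the first blank line is treated as the body.
    _, _, body = raw.partition("\n\n")
    lines = [line.rstrip() for line in body.splitlines()]
    opening = "--" + boundary
    closing = opening + "--"
    return (opening in lines, closing in lines)
```

Here delimiter_status(COMPLETE) gives (True, True), a message with both parts but no closing line gives (True, False), and a message that only declares the boundary in its headers gives (False, False). That last case is the one the spam examples in this topic fall into.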
This topic is now archived and is closed to further replies.