
Feature: Parser to OCR Attached Images for URLs


Jeff G.


the meat of the spam is a picture of text. So when you report to spamcop you just report the nonsense, not the included picture with the actual spam text.

30173[/snapback]

If, instead, it was a picture of a URL or hints of a URL, there is nothing the Parser can currently do about that.  Sorry.

30176[/snapback]

Please have the Parser do something about that by throwing fast hardware configured to do OCR (Optical Character Recognition) at the problem, so that pictures of URLs could be reported. If not enough fast hardware is available, please add a new Parsing Preference Checkbox like "Take additional time to OCR attached images" that defaults to "Unchecked". Thanks!
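As an illustration of the request above, here is a minimal sketch of an opt-in OCR pass over image attachments. It is not how the SpamCop Parser works; the function name `urls_from_images`, the `ocr_enabled` flag (standing in for the proposed preference checkbox, off by default), and the `ocr` callable are all assumptions made for the example. Any OCR engine could be plugged in as `ocr`, provided it takes image bytes and returns text.

```python
import re
from email import message_from_bytes
from typing import Callable, List

# Loose pattern for things that look like URLs in OCR output (an assumption,
# not the Parser's real link-detection rules).
URL_PATTERN = re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)


def urls_from_images(
    raw_spam: bytes,
    ocr: Callable[[bytes], str],
    ocr_enabled: bool = False,
) -> List[str]:
    """Return URL-like strings found by running `ocr` over image attachments.

    `ocr_enabled` mirrors the suggested "Take additional time to OCR attached
    images" preference and defaults to unchecked.
    """
    if not ocr_enabled:
        return []

    message = message_from_bytes(raw_spam)
    found: List[str] = []
    for part in message.walk():
        if part.get_content_maintype() != "image":
            continue
        image_bytes = part.get_payload(decode=True)  # undoes base64 if present
        if not image_bytes:
            continue
        text = ocr(image_bytes)  # hypothetical OCR backend supplied by caller
        for url in URL_PATTERN.findall(text):
            if url not in found:
                found.append(url)
    return found
```

Keeping the OCR engine behind a plain callable means the expensive step can be swapped out or disabled without touching the MIME handling.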

by throwing fast hardware configured to do OCR (Optical Character Recognition) at the problem

30261[/snapback]

Hmmm... server load would be my first concern. Seems like it would be a big cost to solve a 'secondary' problem.

That and you're now storing image attachments, not just plain text. (or, parser could just store OCR'd text and delete image after ocr.) And what about multiple email submissions... which image attachment to which email... could only accept as base64...

Anyway, other than how to implement an auto-OCR on the server, it all seems within the realm of possibility. But I don't see it getting too high on the priority list.


That and you're now storing image attachments, not just plain text.

30278[/snapback]

The Parser is already storing image attachments (up to 50KB total size of encoded spam).
(or, parser could just store OCR'd text and delete image after ocr.)

30278[/snapback]

Good idea.
And what about multiple email submissions... which image attachment to which email...  could only accept as base64...

30278[/snapback]

This already works as designed, using nested MIME encoding.
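To show what nested MIME encoding buys here, the following is a rough sketch, assuming each spam is forwarded as its own `message/rfc822` attachment inside one submission email. Because every image lives inside the attachment it arrived in, there is no ambiguity about which image belongs to which spam. The function name `images_by_attached_spam` is made up for the example, and this is not the Parser's actual code.

```python
from email import message_from_bytes
from typing import List, Tuple


def images_by_attached_spam(raw_submission: bytes) -> List[Tuple[str, List[str]]]:
    """Pair each attached spam (by Subject) with the image types it contains."""
    submission = message_from_bytes(raw_submission)
    parts = submission.get_payload() if submission.is_multipart() else []

    pairs: List[Tuple[str, List[str]]] = []
    for part in parts:
        if part.get_content_type() != "message/rfc822":
            continue
        spam = part.get_payload(0)  # the forwarded message itself
        images = [
            inner.get_content_type()
            for inner in spam.walk()
            if inner.get_content_maintype() == "image"
        ]
        pairs.append((spam.get("Subject", ""), images))
    return pairs
```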
Anyway, other than how to implement an auto-OCR on the server, it all seems within the realm of possibility. But I don't see it getting too high on the priority list.

30278[/snapback]

Neither do I, unless Julian/Ironport have some spare hardware they haven't told us about and/or haven't decided what to do with yet. :)

  • 4 months later...
Please have the Parser do something about that by throwing fast hardware configured to do OCR (Optical Character Recognition) at the problem, so that pictures of URLs could be reported. If not enough fast hardware is available, please add a new Parsing Preference Checkbox like "Take additional time to OCR attached images" that defaults to "Unchecked". Thanks!

30261[/snapback]

WOW!

This is a good answer (and idea) for a helpful SpamCop moderator.

thanks, efa


  • 1 month later...

I think Jeff has a point. About 10%-20% of the spam I am receiving is now messages on graphics. You can't even tell they're graphics until you try to cut-and-paste them. I don't think this is a new desktop publishing concept. It obviously was cleverly conceived to cripple SpamCop, and any other similar anti-spam sites. None of the message gets picked up, but gibberish is typed in under the graphic, and I am now seeing the recipient's email address included with the gibberish in the hope the recipient ends up reporting himself.

The critical issue is to keep with the original mission of SpamCop, and try to figure out how to address this problem. If it is immediately dismissed as I see in this thread, SpamCop will only be able to handle a diminishing number of spam complaints as this practice increases and new tactics develop. To be effective, you have to focus on the problem, not on the difficulties encountered in addressing the problem.


The critical issue is to keep with the original mission of SpamCop, and try to figure out how to address this problem. If it is immediately dismissed as I see in this thread, SpamCop will only be able to handle a diminishing number of spam complaints as this practice increases and new tactics develop. To be effective, you have to focus on the problem, not on the difficulties encountered in addressing the problem.

39393[/snapback]

Of course that presumes that reporting spamvertised URLs is the original mission of SpamCop.

Since the prime objective is to identify the source IP addresses of those mail servers that forward spam, this OCR idea would look likely to be exceedingly low priority. Especially since at this point it will not assist SpamCop's main purpose. See FAQ - Philosophy on reporting spamvertised websites.
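For readers unfamiliar with what "identify the source IP addresses" involves, here is a deliberately simplified sketch that lists the bracketed IPv4 addresses in a message's `Received:` headers, newest hop first. It only collects candidates; deciding which hops can be trusted is the hard part and is not attempted here. The function name `received_ips` and the regular expression are assumptions for the example, not SpamCop's logic.

```python
import re
from email import message_from_string
from typing import List

# Matches an IPv4 address written in square brackets, as commonly seen in
# Received headers, e.g. "(mx.example.net [198.51.100.7])".
BRACKETED_IPV4 = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def received_ips(raw_message: str) -> List[str]:
    """List bracketed IPv4 addresses from Received headers, top header first."""
    message = message_from_string(raw_message)
    ips: List[str] = []
    for header in message.get_all("Received", []):
        ips.extend(BRACKETED_IPV4.findall(header))
    return ips
```

Nothing in this step depends on the body of the message, which is why image-only spam can still be traced to a source.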

Andrew


Of course that presumes that reporting spamvertised URLs is the original mission of SpamCop.  Since the prime objective is to identify the source IP addresses of those mail servers that forward spam, this OCR idea would look likely to be exceedingly low priority.  Especially since at this point it will not assist SpamCop's main purpose.  See FAQ - Philosophy on reporting spamvertised websites.

Andrew

39394[/snapback]

:huh:

I never referred to the reporting of spamvertised URLs. I was addressing the issue of getting the spam reported at all. Spam I've submitted with messages on graphics was rejected as having no body to the spam.

I also never referred to any OCR idea. How this matter is handled is best addressed by more technical people. I don't care if they decide to use OCR or any other option.

I see the primary mission of any anti-spam program or website to be diminishing spam. If your position is that SpamCop's objective is not to diminish spam, but just to identify the source IP addresses of those mail servers that forward spam, what do you think the plan called for after the source IP address was ascertained? Do you think it might have been to use that information to diminish spam?


I never referred to the reporting of spamvertised URLs. I was addressing the issue of getting the spam reported at all. Spam I've submitted with messages on graphics was rejected as having no body to the spam.

39424[/snapback]

Maybe you have not, but that is what this topic is about. You should have no issue reporting the source of any message. If there is no body presented, you are allowed to modify the body section with something like <NO BODY RECEIVED> as long as the abuse desk does not refuse modified reports.
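A small sketch of the body substitution described above, for anyone preparing submissions by hand or by script. It adds the placeholder only when nothing but whitespace follows the headers and leaves every other message untouched. The function name `ensure_body` is made up for the example, and whether a given abuse desk accepts the modified report is outside what code can check.

```python
PLACEHOLDER = "<NO BODY RECEIVED>"


def ensure_body(raw_message: str, placeholder: str = PLACEHOLDER) -> str:
    """Append a placeholder body if the message has headers but no body."""
    normalized = raw_message.replace("\r\n", "\n")
    # Headers end at the first blank line; everything after it is the body.
    headers, _, body = normalized.partition("\n\n")
    if body.strip():
        return raw_message  # a real body exists, so change nothing
    return headers.rstrip("\n") + "\n\n" + placeholder + "\n"
```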

Please provide more information because even an email with a gif message needs to have a source to present that image to your email program.

What email application are you using? How are you submitting the message? etc.


About 10%-20% of the spam I am receiving is now messages on graphics.  You can't even tell they're graphics until you try to cut-and-paste them.

39393[/snapback]

I was addressing the issue of getting the spam reported at all. Spam I've submitted with messages on graphics was rejected as having no body to the spam.

39424[/snapback]

Actually, the above seems to indicate an issue with some of the handling or reporting steps, procedures, or tools?

I also never referred to any OCR idea.

But you posted into a Topic titled Parser to OCR Attached Images for URLs


I never referred to the reporting of spamvertised URLs. I was addressing the issue of getting the spam reported at all. Spam I've submitted with messages on graphics was rejected as having no body to the spam.

I also never referred to any OCR idea.  How this matter is handled is best addressed by more technical people.  I don't care if they decide to use OCR or any other option.

39424[/snapback]

Hi LouMessina!

It's sometimes tough being new to a set of forums. As others have noted, you posted into a discussion of OCR for the content of gif images. The only reason for wanting to OCR the gifs is to obtain the links within the image, which would be the spamvertised URLs I referred to. So I don't feel I was unreasonable in interpreting your contribution as I did.

However, if you're making a general point then perhaps we can link the post to a different thread.

Andrew


You should have no issue reporting the source of any message.  If there is no body presented, you are allowed to modify the body section with something like <NO BODY RECEIVED> as long as the abuse desk does not refuse modified reports.  Please provide more information because even an email with a gif message needs to have a source to present that image to your email program.  What email application are you using?  How are you submitting the message? etc.

When registering for SpamCop, I recall very emphatic instructions about not altering spam except to munge your email address.

I use Outlook, and submit spam using both SpamCop's website (Outlook workaround form) and via email. When spam with a graphic message is submitted by either method, the return message is "SpamCop encountered errors while saving spam for processing: SpamCop could not find your spam message in this email." However, I just encountered a method that appears to work. I entered the headers as usual at the website, and obtained the HTML codes for the body from Outlook. I entered that at the SpamCop website, and it worked. However, I assume SpamCop is not accessing any information or URLs in the graphic message. More important is the fact that the spam is getting reported, which was my primary concern.

Hi LouMessina! It's sometimes tough being new to a set of forums. As others have noted, you posted into a discussion of OCR for the content of gif images. The only reason for wanting to OCR the gifs is to obtain the links within the image, which would be the spamvertised URLs I referred to. So I don't feel I was unreasonable in interpreting your contribution as I did. However, if you're making a general point then perhaps we can link the post to a different thread. Andrew

Not having had the experience where spam with graphics was reported, but only dismissed as being unable to report, my fault lay in jumping the gun, and I interpreted the initial message as addressing what I had experienced.

However, although Jeff suggested OCR parsing, it was obviously for the purpose of accessing messages in the graphics to better report spam. I found the thread amply loaded with attacks on the suggestion, but woefully lacking in any alternate suggestions to address the same problem.


When registering for SpamCop, I recall very emphatic instructions about not altering spam except to munge your email address.

39465[/snapback]

I assume you are referring to: http://www.spamcop.net/fom-serve/cache/283.html

It has been discussed here and generally accepted that this line (Do not make any material changes to spam before submitting or parsing which may cause SpamCop to find a link, address or URL it normally would not, by design, find.) allows the <NO MESSAGE BODY> body modification because nothing has been added to change where reports would go if the body were there properly. I also thought this was in the FAQ somewhere, but I have not been able to locate it specifically.


I also thought this was in the FAQ somewhere, but I have not been able to locate it specifically.

39471[/snapback]

I guess this is it at http://forum.spamcop.net/forums/index.php?showtopic=122

Miss Betsy suggested some alternate/additional title links in http://forum.spamcop.net/forums/index.php?...findpost&p=5271 - clearly her advice should be heeded!


When registering for SpamCop, I recall very emphatic instructions about not altering spam except to munge your email address.

I use Outlook, and submit spam using both SpamCop's website (Outlook workaround form) and via email.  When spam with a graphic message is submitted by either method, the return message is "SpamCop encountered errors while saving spam for processing: SpamCop could not find your spam message in this email."  However, I just encountered a method that appears to work.  I entered the headers as usual at the website, and obtained the HTML codes for the body from Outlook. 

39465[/snapback]

Microsoft Outlook (all versions)

That FAQ entry is years old ....


  • 8 months later...

It's now time to add to this request similar OCR scanning for SpamCop Email System Customers before SpamAssassin, so that SpamAssassin (and perhaps later content filtering apps) can read and recognize all the P&D buzzwords, and score accordingly.

Keywords: Optical Character Recognition Pump&Dump Pump and Dump
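To make the idea concrete, here is a toy sketch of scoring OCR'd text against pump-and-dump phrases before any further filtering. It is plain Python, not SpamAssassin's rule language or API, and the phrase list, weights, and the function name `score_ocr_text` are invented for illustration only.

```python
import re
from typing import Dict

# Hypothetical phrases and weights; a real filter would maintain its own.
BUZZWORD_SCORES: Dict[str, float] = {
    "strong buy": 2.0,
    "target price": 1.5,
    "stock alert": 1.5,
}


def score_ocr_text(ocr_text: str, weights: Dict[str, float] = BUZZWORD_SCORES) -> float:
    """Sum the weights of every listed phrase found in the OCR'd text."""
    # Lowercase and collapse whitespace so line breaks in the image layout
    # do not hide a phrase.
    normalized = re.sub(r"\s+", " ", ocr_text.lower())
    return sum(
        (weight for phrase, weight in weights.items() if phrase in normalized),
        0.0,
    )
```

A score like this would only be one more input to the existing content filter; exact substring matching is also easy to defeat with OCR noise.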


  • 4 months later...

It's now time to add to this request similar OCR scanning for SpamCop Email System Customers before SpamAssassin, so that SpamAssassin (and perhaps later content filtering apps) can read and recognize all the P&D buzzwords, and score accordingly.

Image spam OCR is a lost cause. In the image spams that I actually get to see (few, thanks to SpamCop), I've noticed that spammers have already started to overlay the text with funny pixel patterns that are designed to make OCR difficult, or use exotic fonts with the same intention. They have the counter-measures already in place, well before SpamCop even got that feature...


Image spam OCR is a lost cause. In the image spams that I actually get to see (few, thanks to SpamCop), I've noticed that spammers have already started to overlay the text with funny pixel patterns that are designed to make OCR difficult, or use exotic fonts with the same intention. They have the counter-measures already in place, well before SpamCop even got that feature...

Yeah, that seems to be the conclusion a few of us came to a while back when looking at the P&D spam images. They are almost always base64 encoded, and now they are using techniques to try to scatter the text so it's unreadable to OCR.

See here:

http://forum.spamcop.net/forums/index.php?...=7467&st=20

I could have sworn I posted a news article about the subject too, but I found a white paper by a company claiming to be able to decipher image spam, so that might shed some light on the subject (although I didn't read the paper):

http://whitepapers.zdnet.com/abstract.aspx...amp;kw=ocr+spam


Archived

This topic is now archived and is closed to further replies.
