Update - System Failure etc


The point of failure was identified and we are working on getting the system back up. We do have lots of redundancy but this was a single point failure. We are, and have been, onsite in the colo.

To answer a couple of the questions that have been asked:

1) There have been no updates during the day because there was nothing to say other than 'working on it'. As those of you who have shepherded hardware (or software) know, it's almost always impossible to give an ETA for resolution until you have finished with replacement/restore/installation/checkout. Frustrating for everyone!

2) The last problem a couple of weeks ago was due to a spammer assault using the compromised credentials of a user. The spammer pumped a huge number of spams through the system and that resulted in over 500,000 bounces - or I should say we cleared 500,000 bounces out of the inbound queues after the queues became clogged. The problem was not immediately obvious, as some inbound mail was still being processed before the queues became completely blocked. Then the legitimate backed-up mail had to be processed. As you know, that mail is run through a variety of checks - SpamAssassin, blocklists, personal whitelists and blacklists. It took many hours to work through all the queued mail, and at the same time new mail was showing up. This was a form of DoS, although probably not intended as one.

Again, I urge you all to change your passwords to strong passwords, check your systems for infection, and be circumspect about where you log into webmail.

3) A bit of history - back when SpamCop was a one-man system, Julian asked Jeff to create an email system because of user demand. So CESmail came into being. When SpamCop was bought by IronPort, they did not buy the email system. Then IronPort was bought by Cisco. So the reporting system is owned by Cisco and the email system is owned by CESmail. The historically close integration of reporting with the email system has been maintained so that users can report spam easily.

4) and we would rather be fishing ...


Sorry, but "we're working on it," every couple of hours, is more informative than nothing.

"Nothing" means you're not working on it. :(


Okay, did the NSA break in trying to collect metadata on all its users? Wait, let me take off my tin foil hat for this post. Reality check..... :o:o

The best way to make a strong password is by using base 64 encoding. So if you take a password like "123456789101112" (don't use this) and encode it to base 64, then you get "MTIzNDU2Nzg5MTAxMTEy". It would take millions of years to crack that password. If you forget it, then take the password that you can remember and just encode it again, and there you have it. I do this with many of my passwords, including the one for Cesmail.net. No, I don't use "MTIzNDU2Nzg5MTAxMTEy" :rolleyes:
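For anyone who wants to try the encoding step described above, here is a minimal Python sketch using the standard-library `base64` module. It only reproduces the example string from this post. Keep in mind that Base64 is a reversible encoding with no secret key, so anyone who guesses the original password can derive the encoded form the same way, and the result is only as hard to guess as what you started with.

```python
import base64

# The example password from the post above; do not use it for anything real.
password = "123456789101112"

# Encode the password bytes to Base64 text.
encoded = base64.b64encode(password.encode("ascii")).decode("ascii")

# Decoding needs no key and recovers the original exactly.
decoded = base64.b64decode(encoded).decode("ascii")

print(encoded)
print(decoded == password)
```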

At any rate, it looks like a failed hard drive due to the stress of too many junk emails at once. Clearly it took two weeks before the hard drive gave up and then tapped out. Are you telling us, Jeff, that you in fact didn't have any hard drive monitors that would show when it was developing too many read/write errors or overheating due to stress? What gets me is why you are not using RAID drives, or some type of redundancy with daily clones for backups in case of total failure.

I had a Dell laptop with Vista installed which had the hard drive fail. About a week and a half before it gave up, the SMART hard drive software gave a warning to back up my data because the end was near. Why you wouldn't have some type of software on this particular server to warn of pending hard drive failures is beyond me. Now people are without email for over 24 hours because no warning signs occurred, or you just are not telling people the whole truth, and nothing but the truth. I think you are more pre-occupied with Phishing, real :--) All in fun here...... :D:D


Clearly it took two weeks before the hard drive gave up and then tapped out.

Where did they say it was a hard drive? And are you absolutely sure Jeff is posting as Email_Support? You know what they say about "assume"--it makes an...

DT


I did have a team that shepherded large data systems, and when customers went down, I did have to give ETAs for when the systems would be back up and had to live with my estimates. I missed some but hit most. Plus I kept the customers up to date on what was happening. In a datacenter environment I would have been out of a job with a 24-plus-hour downtime. Sorry but true.

Since you identified the failure point, what is the prediction of return to normal operations?


The point of failure was identified and we are working on getting the system back up. We do have lots of redundancy but this was a single point failure. We are, and have been, onsite in the colo.

So, what steps has CESmail taken to ensure this "single point failure" won't happen again? Do you have redundant servers? Are you running RAID 5 (minimum, possibly RAID 10)? Do you have redundant NICs on your redundant servers going to redundant switches??? C'mon... I realize there's only so much redundancy that can be put in, but let's get real -- people depend on email these days. Whether or not it was designed to be instantaneous, it needs to "just f-ing work!" We have come to rely on 5-nines of uptime. CESmail has definitely not lived up to that in recent history.

I'm seriously thinking of bailing out and just going to reporting only what doesn't get sent to the "spam" folders in my various emails. I have to say I appreciate having ONE email to concentrate all my various email addresses out there, but if push comes to shove, I can do that with Fastmail or something similar.

CESmail needs to step up and a) make sure there is sufficient redundancy for 5-nines of uptime or b) sell the system to someone (Cisco?) who CAN and WILL ensure 5-nines of uptime.

FYI, I, too, have been an IT professional having to live with giving ETAs on fixing stuff.
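To put the "5-nines" figure in perspective, here is a rough arithmetic sketch in Python. It assumes a 365-day year and treats the outage as a flat 24 hours, which is the approximate duration mentioned in this thread and not a measured value. The function names are just for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Yearly downtime budget, in minutes, for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def availability_after_outage(outage_hours: float) -> float:
    """Yearly availability (percent) if the only downtime is one outage."""
    return 100 * (1 - outage_hours * 60 / MINUTES_PER_YEAR)


five_nines_budget = allowed_downtime_minutes(99.999)
after_24h = availability_after_outage(24)

print(f"5-nines budget: {five_nines_budget:.2f} minutes per year")
print(f"Availability after one 24-hour outage: {after_24h:.3f}%")
```

Under those assumptions, five nines allows a little over five minutes of downtime per year, while a single 24-hour outage already puts the year at about 99.73%.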


There have been no updates during the day because there was nothing to say other than 'working on it'.

You seem to be the only one on the planet who believes progress reports aren't necessary for paying customers.

The last problem a couple of weeks ago was due to a spammer assault using the compromised credentials of a user. The spammer pumped a huge number of spams through the system and that resulted in over 500,000 bounces

I'm no expert, but I can't believe there is nothing that can rate-limit such things.
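For illustration, one common way to rate-limit a sender is a token bucket. The Python sketch below is a generic example under stated assumptions, not a description of how the CESmail system works; the class name, the burst size of 3 messages, and the refill rate of 1 message per second are all made up for the example.

```python
class TokenBucket:
    """Per-account limiter: allows short bursts, then a steady sustained rate."""

    def __init__(self, capacity: float, refill_per_sec: float) -> None:
        self.capacity = capacity              # maximum burst size, in messages
        self.refill_per_sec = refill_per_sec  # sustained rate, messages per second
        self.tokens = capacity                # start with a full bucket
        self.last = 0.0                       # timestamp of the previous check

    def allow(self, now: float) -> bool:
        """Return True if one message may be sent at time `now` (in seconds)."""
        elapsed = max(0.0, now - self.last)
        # Refill for the time that has passed, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical limits: a burst of 3 messages, then 1 message per second.
bucket = TokenBucket(capacity=3, refill_per_sec=1.0)
burst = [bucket.allow(now=0.0) for _ in range(4)]   # fourth message is refused
later = [bucket.allow(now=2.0) for _ in range(3)]   # two tokens refilled by t=2
print(burst, later)
```

A compromised account trying to push thousands of messages would be held to the sustained rate once the burst allowance is used up, while normal users would rarely notice the limit.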

the reporting system is owned by Cisco and the email system is owned by CESmail.

So who "owns" my email address, which ends with 'spamcop.net?'

and we would rather be fishing ...

I believe there are a number of people here who are going to do everything in their power to ensure you have more time for fishing.


3) A bit of history - back when Spamcop was a one man system, Julian asked Jeff to create an email system because of user demand. So CESmail came into being. When Spamcop was bought by Ironport they did not buy the email system. Then Ironport was bought by Cisco. So the reporting system is owned by Cisco and the email system is owned by CESmail. The historically close integration of reporting with the email system has been maintained so that users can report spam easily.

4) and we would rather be fishing ...

Is the email system being run as a business, or as a hobby? It certainly appears that it's a hobby.


Is the email system being run as a business, or as a hobby? It certainly appears that it's a hobby.

I wonder if it's that, or if this is simply a legacy email system with not that many users. How many people actually use SpamCop email at $30/yr? I wish they would either support it 100% or just let us know they are going to kill it, as the current status isn't good for anyone.


Is the email system being run as a business, or as a hobby? It certainly appears that it's a hobby.

Then find another email system with reliable spam reporting!

Unless you can prove you have knowledge of how the email system is set up, you have no way to tell what happened! Even redundant systems go down. Even a RAID array, for example, can have a single point of failure (like a power supply).

At one point, any system in the spamcop.net domain refused connections. That wasn't just CESmail's system, but Cisco's system as well! How about complaining at them?


At one point, any system in the spamcop.net domain refused connections. That wasn't just CESmail's system, but Cisco's system as well! How about complaining at them?

I don't know about you, but I had no problems getting to "mailsc.spamcop.net". I just couldn't report held email.


At one point, any system in the spamcop.net domain refused connections. That wasn't just CESmail's system, but Cisco's system as well!

I'm not sure that's correct--please supply links to forum posts backing that assertion. AFAIK, this outage was limited to the email system's resources in Georgia.

DT


Even redundant systems go down. Even a RAID array, for example, can have a single point of failure (like a power supply).

Then -- if you are committed to reliability -- you have backup power.

A number of people in the last couple of days have told these forums they would be happy to pay more IF they could trust that it was being invested in improved reliability.


Then -- if you are committed to reliability -- you have backup power.

A number of people in the last couple of days have told these forums they would be happy to pay more IF they could trust that it was being invested in improved reliability.

Exactly. I even said I'd be willing to pay double what we pay now. However, I'd want some assurance that it was not going to line Jeff's pockets, but to add more redundancy and reliability, possibly including physically separate duplicate servers.

Unless you can prove you have knowledge of how the email system is set up, you have no way to tell what happened! Even redundant systems go down. Even a RAID array, for example, can have a single point of failure (like a power supply).

Ah, there's the rub. Jeff won't TELL us what happened, just that there was a "major system failure" that they fixed after nearly 24 hours. And most SAN/NAS devices have redundant power supplies, and if it's on-board RAID, most server-class machines have redundant power supplies. The point is you have redundant EVERYTHING so that in case one thing dies, you're covered until you get it fixed. If Jeff/CESmail can't afford duplicate hardware, then he ought to come to the subscribers and say "I'm raising the rates to pay for upgraded/redundant hardware."
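The single-point-of-failure argument can be sketched with some simple availability arithmetic in Python. The 99% per-component figures below are invented for illustration and say nothing about the actual hardware; the sketch also assumes components fail independently, which real systems only approximate.

```python
import math


def series(*parts: float) -> float:
    """Everything must work: availabilities multiply."""
    return math.prod(parts)


def parallel(*parts: float) -> float:
    """Any one working is enough: unavailabilities multiply."""
    return 1 - math.prod(1 - a for a in parts)


DISK = 0.99  # hypothetical availability of one disk
PSU = 0.99   # hypothetical availability of one power supply

mirrored_disks = parallel(DISK, DISK)                   # redundant disks
single_psu = series(mirrored_disks, PSU)                # one shared power supply
dual_psu = series(mirrored_disks, parallel(PSU, PSU))   # redundant power too

print(f"mirrored disks alone:      {mirrored_disks:.6f}")
print(f"mirrored disks + one PSU:  {single_psu:.6f}")
print(f"mirrored disks + dual PSU: {dual_psu:.6f}")
```

With these made-up numbers, the single power supply pulls the mirrored array back down to roughly the availability of one component, which is the effect both sides of this argument are describing.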


The last problem a couple of weeks ago was due to a spammer assault using the compromised credentials of a user. [...] Again I urge you to all change your passwords to strong passwords, check your systems for infection and to be circumspect in where you log into webmail.

I've reported several times that SpamCop email (CESmail side) login IDs and passwords are exposed in the clear by the SpamCop reporting system. The most recent report is http://forum.spamcop.net/forums/index.php?showtopic=13664. Others have posted similar concerns.

But I'm having a hard time getting anyone to understand the risk.


I've reported several times that SpamCop email (CESmail side) login IDs and passwords are exposed in the clear by the SpamCop reporting system. The most recent report is http://forum.spamcop.net/forums/index.php?showtopic=13664. Others have posted similar concerns.

But I'm having a hard time getting anyone to understand the risk.

And as long as the system has avoidable vulnerabilities like that, the users will continue to "bark." Consider us watchdogs -- ignored watchdogs, if you will.

WOOF!


Archived

This topic is now archived and is closed to further replies.
