Those of you who have been following my postings on the recent spate of network outages and data breaches that have disrupted the services of Microsoft, Twitter, Google and Amazon (just to drop a few of the biggies) are aware of my major issue with how such events have been handled. In brief, what all of these events have had in common is a lack of customer interaction and of transparency about what has happened and what a customer should do.
This has been nothing short of astounding. You would have thought that at this point everyone in the tech industry would have learned a simple lesson—Communicate! Communicate! Communicate!
The latest example of what not to do comes from Intermedia. The Mountain View, CA-based company happens to be one of the larger cloud service providers in the U.S. In fact, Intermedia manages more than 500,000 Exchange mailboxes and other Microsoft Exchange services, which makes it arguably the largest Exchange hosting company.
Just after the end of the long Labor Day holiday, it suffered what can politely be called a major service disruption that left its customers unable to access email, sign on to Microsoft Lync, access and synchronize files, or use their VoIP phone and conferencing services. We keep hearing that it is safe to move to the cloud because of its resilience and redundancy. Indeed, a prime selling point of cloud service providers is the assurance that enterprises can trust their mission-critical interactions to the cloud. This failure to demonstrate service resiliency has blown a not insubstantial hole in that assertion. Let’s hope the collateral damage on the technical side of this is short-lived.
Intermedia has slowly brought all of its services back up, but the price it has paid in damage to its brand, along with the black eye this has given the cloud in general, could have a lasting impact because of how poorly it handled the situation.
Here we go again
As we all know, “stuff happens.” No computing and communications capability is 100 percent indestructible. Outages, for whatever reason, do occur. Customers have grown to expect that their service providers are not and cannot be perfect. They can be relatively patient and forgiving but only if they are kept informed.
By all accounts, including a letter of apology posted by CEO Phil Koen on the company’s blog on September 3, the communications company failed badly at communicating.
It would be almost cruel to post the tweets that exploded from partners and customers on the Intermedia feed. This is an instance where a visual is not needed. By all accounts, customers were literally left to fend for themselves as the company dealt with the outage, the second in two weeks, which seems to be the result of routing table issues, although that has yet to be confirmed. Not only did the company fail to use email and social media effectively to keep customers informed, it also appears that sales contacts and support were unreachable. Hence, there was no indication of what was going on or when service might be restored. Yikes!
To his credit, CEO Koen wrote one of the better apologies. It starts out by saying:
“As you know, Intermedia experienced a significant network interruption today. As CEO of Intermedia, I want to offer a profound apology.
I also want to explain what happened, what we’ve learned, and what we’re going to change.
First, you should know that our services are now stable. There has been no data loss and there were no security breaches. And all of the emails that were sent to you during this issue will be delivered as the backlog is processed.”
It goes on to describe, as of the day and time posted, what Intermedia knows (a rarity among companies that have had such issues). Koen says problems actually started on August 28th with a core router issue, which the company thought it had addressed. However, at 7:00 AM on September 3, further anomalies with the core routing were experienced, forcing a reboot of all impacted devices. Network restoration was completed by 6:30 PM EDT, and the support team was able to take incoming calls from customers three and a half hours earlier. Unfortunately, the company website was down for most of the day, and thus was not useful when it needed to be.
Koen goes on to describe what will come next:
“As of now, all our services are back online. We now have three tasks ahead of us.
The first task is to complete our RFO (Reason For Outage) report to fully identify the root causes. We’re working with our network equipment vendor to complete this report, and we will share it with our customers and partners as soon as we can.
Second, we will take the recommendations from this RFO and make any changes necessary to improve our stability and resiliency.
Third, we will also improve the responsiveness and robustness of our customer notification tools and systems. Although we were successful in notifying many of our customers about the issues via alternate email addresses, text messages and HostPilot, not all customers were reached. Going forward, we will make more timely use of our website and social media—especially Twitter and Facebook.”
He then concludes with a nice apology and a promise to do better.
There is an old saying that the definition of insanity is doing the same thing over again but expecting a different result. It seems clear from the past month that our industry is suffering from a rather large case of communal insanity.
What have we learned?
What the past few weeks have demonstrated over and over is that technology will break and, given time, can be fixed, though some providers are better than others at mean time to restoration. In Intermedia’s case, there is little doubt that more redundancy will be deployed, more social media will be used as a crisis breaks, and the website will have a hot standby so customers can get some, if not all, of the answers. We have also learned that when trust is broken, it is difficult to restore.
In previous articles I have cited the words from the hit Peter, Paul and Mary song, Where Have All the Flowers Gone, whose refrain is “When will they ever learn?” It is my sincere hope that the next article I write on this subject is about best, and not worst, practices for dealing with such situations.
Edited by Blaise McNamee