In today’s always-on, always-connected business, something as small as an operational glitch can bring an extended enterprise to its knees. So how IT communicates in the first few moments of even the smallest service outage is crucial. But manual processes, diversified IT infrastructures and dispersed workforces can complicate these communications, increasing downtime and impacting the business.
Here are the best practices for automating the IT incident management communications process to ensure IT incidents are resolved as quickly and effectively as possible:
The hallmark of a major incident is service disruption. In most cases, an operations manager identifies a major incident and escalates to a major incident manager, but some companies automate the routing of major incidents. Establish resolution processes for less critical issues as well, so they don’t go to major incident managers unnecessarily.
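As an illustration of automated routing, here is a minimal sketch in Python. It assumes a simple rule taken from the paragraph above (any service disruption makes an incident major); the `moderate_threshold` parameter, the severity levels and the queue names are hypothetical and would differ in any real deployment.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    LOW = 1
    MODERATE = 2
    MAJOR = 3


@dataclass
class Incident:
    summary: str
    service_disrupted: bool  # the hallmark of a major incident
    users_affected: int


def classify(incident: Incident, moderate_threshold: int = 50) -> Severity:
    """Assumed rule: a service disruption is always major.

    The user-count threshold for MODERATE is an illustrative value only.
    """
    if incident.service_disrupted:
        return Severity.MAJOR
    if incident.users_affected >= moderate_threshold:
        return Severity.MODERATE
    return Severity.LOW


# Hypothetical queue names; only MAJOR reaches the major incident manager.
ROUTES = {
    Severity.MAJOR: "major-incident-manager",
    Severity.MODERATE: "service-desk",
    Severity.LOW: "self-service-queue",
}


def route(incident: Incident) -> str:
    """Return the queue that should receive this incident."""
    return ROUTES[classify(incident)]
```

Keeping the classification rule separate from the routing table means the less critical paths can be tuned without touching the major incident escalation.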
A communication system should be prepopulated with major incident managers’ contact information and on-call schedules, so the operations manager can instantly locate them and target notifications to them. Automating this initial engagement can have huge benefits, cutting a process that can take 20 minutes by up to 90 percent, to as little as two minutes.
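The on-call lookup described above can be sketched roughly as follows. This is an assumption-laden example, not any vendor's API: the `OnCallShift` record, the sample names and the placeholder phone numbers are all hypothetical, and shifts are modeled as simple daily time windows that may wrap past midnight.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class OnCallShift:
    manager: str
    contact: str  # placeholder phone number or chat handle
    start: time   # daily shift start
    end: time     # daily shift end; may be earlier than start (overnight)


def is_on_call(shift: OnCallShift, now: datetime) -> bool:
    """True if `now` falls inside the shift, including overnight shifts."""
    t = now.time()
    if shift.start <= shift.end:
        return shift.start <= t < shift.end
    # Overnight shift, e.g. 20:00 to 08:00
    return t >= shift.start or t < shift.end


def on_call_managers(shifts, now):
    """Return only the shifts whose manager is on call right now."""
    return [s for s in shifts if is_on_call(s, now)]


def build_notifications(shifts, now, summary):
    """Pair each on-call contact with a targeted message."""
    text = f"Major incident: {summary}. Please accept or decline."
    return [(s.contact, text) for s in on_call_managers(shifts, now)]


# Hypothetical prepopulated schedule
SHIFTS = [
    OnCallShift("Avery", "+1-555-0100", time(8, 0), time(20, 0)),
    OnCallShift("Blake", "+1-555-0101", time(20, 0), time(8, 0)),
]
```

Because the schedule is already loaded, the operations manager only supplies the incident summary; the lookup decides who receives the notification.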
The manager who accepts the case determines whether the alert is a false alarm and, if not, what the incident actually is. The 2014 SANS survey (sponsored by AlienVault) revealed that 15 percent of organizations have issues with false positives.
Find and Assemble
Once the major incident manager understands the nature of the incident, he or she assembles the appropriate resolution team members based on the skills required. Common roles include a service desk manager, service-level agreement (SLA) manager, change manager, software developer, quality assurance engineer, operations engineer, infrastructure engineer and problem manager. When an IT outage occurs and a notification is received, these members drop everything else to resolve the issue.
Some companies choose to keep a dedicated SWAT team at the ready, rather than relying on someone from each required department being free to help.
Just assembling people and initiating the conference call can take 45 minutes. Companies that use mass communications during this phase often get far too many people on the conference bridge. Each new person who joins interrupts the flow of the call, and repeating the background information for each one can waste an additional 10 minutes.
With a leading communication platform, the major incident manager can customize the message so resolution team members understand the basics before the call, and can join the bridge with just a button push instead of having to dial in. These messages can often be technical in nature, using IT jargon and specific server names.
Proactively and intelligently informing those affected with a more business-friendly message enables Marketing, PR, and executives to communicate responsibly, effectively and consistently.
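To make the distinction between the two audiences concrete, the sketch below renders a technical message and a business-friendly message from a single incident record. The record fields, message wording, server name and bridge URL are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class IncidentRecord:
    incident_id: str
    affected_service: str     # business-facing service name
    technical_detail: str     # jargon and server names, for responders only
    business_impact: str      # plain-language impact, for stakeholders
    bridge_url: str           # hypothetical one-click conference bridge link
    next_update_minutes: int


def technical_message(rec: IncidentRecord) -> str:
    """Message for resolution team members: full detail plus the bridge link."""
    return (
        f"[{rec.incident_id}] {rec.affected_service}: {rec.technical_detail}. "
        f"Join the bridge: {rec.bridge_url}"
    )


def business_message(rec: IncidentRecord) -> str:
    """Message for Marketing, PR and executives: impact only, no jargon."""
    return (
        f"We are working on an issue affecting {rec.affected_service}. "
        f"{rec.business_impact} "
        f"Next update in {rec.next_update_minutes} minutes."
    )


# Hypothetical example record
EXAMPLE = IncidentRecord(
    incident_id="INC-1001",
    affected_service="online checkout",
    technical_detail="connection pool exhausted on db-prod-03",
    business_impact="Some customers cannot complete purchases.",
    bridge_url="https://bridge.example.com/INC-1001",
    next_update_minutes=30,
)
```

Generating both messages from one record keeps the two audiences consistent with each other while sparing stakeholders the server names and the bridge link.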
Members of the major incident team use all available communication channels integrated with their communication platform, including chat, text, email, phone, Skype, Slack, and more, to identify and resolve the underlying cause of the issue. The same communications enable the major incident manager to keep stakeholders up to date, leaving team members free to focus on the resolution.
Once the underlying issue is resolved, the team members can restore service and end the incident.
Review (Post Mortem)
A review is a fundamental piece of the incident resolution process, and all relevant parties should attend. The incident should have been documented and recorded, so the major incident manager and the problem manager can walk the group through the incident record and assess the resolution process together.
The review can also identify improvements that can prevent a similar incident from occurring again.
If an enterprise’s data, information and processes become compromised, the business can suffer irreparable damage. So when major incidents occur, how the communication is managed is everything. These tips will help enterprises implement the automated communications processes needed to get the business up, running and restored quickly and easily in the event of an IT outage of any size.
About the Author: Troy McAlpin brings more than 20 years of experience to his leadership role at xMatters, with expertise in process automation, strategic initiatives and corporate strategy. His domain experience includes technology strategy and vertical market expertise including high tech, banking, consumer and retail industries. Prior to xMatters he managed marketing, sales, development, M&A and financial aspects at two successful start-up companies and also worked at AT&T Solutions and Andersen.