How to Resolve Critical IT Incidents Fast

By TechZone360 Special Guest
Troy McAlpin, CEO, xMatters, inc.
May 14, 2015

In today’s always-on, always-connected business, something as small as an operational glitch can bring an extended enterprise to its knees. So how IT communicates in the first few moments of even the smallest service outage is crucial. But manual processes, diversified IT infrastructures and dispersed workforces can complicate these communications, increasing downtime and impacting the business.

Here are the best practices for automating the IT incident management communications process to ensure IT incidents are resolved as quickly and effectively as possible:

Identify

The hallmark of a major incident is service disruption. In most cases, an operations manager identifies a major incident and escalates to a major incident manager, but some companies automate the routing of major incidents. Establish resolution processes for less critical issues as well, so they don’t go to major incident managers unnecessarily.
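As an illustration only, automated routing of this kind could be sketched as follows. The severity scale, the threshold and the queue names are assumptions made for the example, not part of any particular product.

```python
from dataclasses import dataclass

# Assumed scale: 1 is most severe, 4 is least. The threshold is hypothetical.
MAJOR_SEVERITY_THRESHOLD = 2


@dataclass
class Incident:
    summary: str
    severity: int            # 1 (critical) to 4 (minor), assumed scale
    service_disrupted: bool  # the hallmark of a major incident


def route(incident: Incident) -> str:
    """Return the (hypothetical) queue that should receive the incident."""
    if incident.service_disrupted and incident.severity <= MAJOR_SEVERITY_THRESHOLD:
        return "major-incident-manager"
    # Less critical issues follow their own resolution process.
    return "service-desk"


print(route(Incident("Checkout service is down", 1, True)))
print(route(Incident("Weekly report renders slowly", 3, False)))
```

Keeping the rule in one small function makes it easy to review and adjust the threshold as the organization learns which issues really need a major incident manager.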

Engage

A communication system should come prepopulated with major incident managers' contact information and on-call schedules, so the operations manager can locate them instantly and target notifications to them. Automating this initial engagement can have huge benefits, cutting a process that can take 20 minutes by up to 90 percent, to as little as two minutes.
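A minimal sketch of such an on-call lookup is shown here. The shift times, names and phone numbers are invented for the example, and a real platform would pull them from its own schedule data.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class OnCallShift:
    manager: str
    phone: str
    start: time
    end: time

    def covers(self, moment: time) -> bool:
        """True if the moment falls inside this shift (start inclusive)."""
        if self.start <= self.end:
            return self.start <= moment < self.end
        # Overnight shift that wraps past midnight.
        return moment >= self.start or moment < self.end


# Hypothetical schedule: a day shift and an overnight shift.
SCHEDULE = [
    OnCallShift("A. Rivera", "+1-555-0100", time(8, 0), time(20, 0)),
    OnCallShift("B. Chen", "+1-555-0101", time(20, 0), time(8, 0)),
]


def on_call_manager(now: datetime, schedule=SCHEDULE) -> OnCallShift:
    """Find the major incident manager who is on call at the given time."""
    for shift in schedule:
        if shift.covers(now.time()):
            return shift
    raise LookupError("no major incident manager is on call")


shift = on_call_manager(datetime(2015, 5, 14, 2, 0))
print(f"Notify {shift.manager} at {shift.phone}")
```

Because the lookup returns both the person and the contact details, the notification can be targeted without anyone searching a spreadsheet or phone list.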

Triage

The manager who accepts the case determines whether the alert is a false alarm and what the incident actually is. The 2014 SANS survey (sponsored by AlienVault) revealed that 15 percent of organizations have issues with false positives.

Find and Assemble

Once the major incident manager understands the nature of the incident, he or she assembles the appropriate resolution team members based on the skills required. Common roles include a service desk manager, service-level agreement (SLA) manager, change manager, software developer, quality assurance engineer, operations engineer, infrastructure engineer and problem manager. When an IT outage occurs and a notification is received, these members drop everything else to resolve the issue.

Some companies choose to keep a SWAT team at the ready, instead of relying on someone from each required department being able to break away and help.
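Matching required skills to available people can be sketched roughly as below. The roster, the names and the "first available person" rule are assumptions for illustration.

```python
# Hypothetical roster: role -> people currently available for that role.
ROSTER = {
    "service desk manager": ["Dana"],
    "change manager": ["Eli"],
    "software developer": ["Farah", "Gus"],
    "infrastructure engineer": [],
}


def assemble_team(required_roles, roster=ROSTER):
    """Pick the first available person per role and report unfilled roles."""
    team, unfilled = {}, []
    for role in required_roles:
        candidates = roster.get(role, [])
        if candidates:
            team[role] = candidates[0]
        else:
            unfilled.append(role)
    return team, unfilled


team, unfilled = assemble_team(["software developer", "infrastructure engineer"])
print("Team:", team)
print("Still needed:", unfilled)
```

Reporting the unfilled roles separately gives the major incident manager an immediate list of gaps to escalate, which is the situation a standing SWAT team is meant to avoid.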

Collaborate Effectively

Just assembling people and initiating the conference call can take 45 minutes. Companies that use mass communications during this phase often end up with far too many people on the conference bridge. Each new person who joins interrupts the flow of the call, and repeating the background information for each one can waste an additional 10 minutes.

With a leading communication platform, the major incident manager can customize the message so resolution team members understand the basics before the call, and can join the bridge with just a button push instead of having to dial in. These messages can often be technical in nature, using IT jargon and specific server names.

Proactively and intelligently informing those affected with a more business-friendly message enables Marketing, PR, and executives to communicate responsibly, effectively and consistently.
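The idea of sending two differently worded messages about the same incident can be sketched with simple templates. The field names, wording and bridge link below are made up for the example.

```python
from string import Template

# Technical message for the resolution team: jargon and server names are fine.
TECHNICAL = Template("[$severity] $service degraded on $host. Join bridge: $bridge")

# Business-friendly message for stakeholders: no server names or jargon.
BUSINESS = Template(
    "We are working on an issue affecting $service. "
    "Next update in $next_update_minutes minutes."
)


def build_messages(incident: dict) -> dict:
    """Render one message per audience from the same incident record."""
    return {
        "resolution_team": TECHNICAL.substitute(incident),
        "stakeholders": BUSINESS.substitute(incident),
    }


incident = {
    "severity": "SEV1",
    "service": "online checkout",
    "host": "db-07",
    "bridge": "https://bridge.example.com/incident-123",
    "next_update_minutes": 30,
}
for audience, message in build_messages(incident).items():
    print(f"{audience}: {message}")
```

Both messages are generated from one incident record, so the technical team and the business side receive consistent facts at a level of detail suited to each.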

Resolve

Members of the major incident team use all available communication channels integrated with their communication platform, including chat, text, email, phone, Skype, Slack and more, to identify and resolve the underlying cause of the issue. The communications also enable the major incident manager to keep stakeholders up to date, leaving team members free to focus on the resolution.

Restore

Once the underlying issue is resolved, the team members can restore service and end the incident.

Review (Post Mortem)

A review is a fundamental piece of the incident resolution process, and all relevant parties should attend. The incident should have been documented and recorded, so the major incident manager and the problem manager can walk the group through the incident record and assess the resolution process together.

The review can also identify improvements that can prevent a similar incident from occurring again.

If an enterprise’s data, information and processes become compromised, the business can suffer irreparable damage. So when major incidents occur, how the communication is managed is everything. These tips will help enterprises implement the automated communications processes needed to restore service quickly in the event of an IT outage of any size.

About the Author: Troy McAlpin brings more than 20 years of experience to his leadership role at xMatters, with expertise in process automation, strategic initiatives and corporate strategy. His domain experience includes technology strategy and vertical market expertise including high tech, banking, consumer and retail industries. Prior to xMatters he managed marketing, sales, development, M&A and financial aspects at two successful start-up companies and also worked at AT&T Solutions and Andersen.

Edited by Dominick Sorrentino