An AI-First Approach to IT Operation Analytics

By

Artificial Intelligence (AI) is finally coming of age after many a false start. The days of runaway robots are still futuristic, but the time has come when the confluence of AI, Big Data and human domain knowledge is happening, with exceptional results. AI is being applied in multiple domains. IT operations is one such domain that is ripe for taking an AI-first approach.

Today’s hybrid cloud environments continue to undergo a massive transformation. These IT infrastructures are increasingly dynamic and agile but at the same time extraordinarily complex. Humans are no longer able to sift through the variety, volume and velocity of Big Data streaming out of IT infrastructures in real time, making AI—especially machine learning—a powerful and necessary tool for automating analysis and decision making. By helping teams bridge the gap between Big Data and humans, and by capturing human domain knowledge, machine learning is able to provide the necessary operational intelligence to significantly relieve this burden of near real-time, informed decision-making. Industry analysts agree. In fact, Gartner named machine learning among the top 10 strategic technologies for 2016, noting, “The explosion of data sources and complexity of information makes manual classification and analysis infeasible and uneconomical.”

In an IT operations environment, there are domain experts—typically IT administrators, IT operators for TechOps and site reliability engineers (SRE) for DevOps—who must manually gather this disparate information and apply their domain expertise in an attempt to make informed decisions. While these professionals are great at what they do, trying to analyze so much data from multiple tools leaves the door wide open for human error. On the other hand, analytics that are based on machine learning are quickly becoming a necessity to ensure the availability, reliability, performance and security of applications in today’s digital, virtualized and hybrid-cloud network environments.

Until now, IT operations teams have relied on siloed monitoring tools that provided information about their network, virtual and physical infrastructure, and application performance. Each of these tools supplies a piece of the puzzle but offers only a narrow view of the IT infrastructure, and monitoring is only one part of the toolchain. The other part is the service desk tools that handle tickets and change management. More often than not, humans bridge the gap between the siloed monitoring tools of yesterday and service desk applications with their domain expertise.

 

The New Age of Analytics

In TechOps and DevOps environments, there is a need to automate, learn and make intelligent, informed decisions based on real-time analysis of Big Data arising out of the entire application infrastructure stack. Following are key analytics for IT operations:

  1. Topology Analysis – This is the understanding of the hierarchical, peer-to-peer and temporal relationships between hybrid cloud elements. Topology is something every IT administrator or SRE should be aware of. This type of analysis should be able to self-learn the inter-relationships of objects and the impact of their performance on one another. Learning those relationships and maintaining that understanding in order to spot trouble in time is extremely important for both TechOps and DevOps environments.
  2. Behavior Profiling – This refers to understanding the behavior profile of every metric, how those profiles roll up into the behavior of an object, and how object behaviors relate to one another across the hybrid cloud environment. It is a multi-dimensional problem, and understanding and adapting to “normal” behavior is extremely important.
  3. Anomaly Detection – Best-of-breed machine learning algorithms should be able to look at contextual, historical and sudden changes in the behavior of objects to detect anomalies. Understanding when there is a real anomaly and, more importantly, when there is not is critical to avoid generating false alarms. This is the bedrock of what is typically referred to as diagnostic analytics.
  4. Root Cause – By pinpointing the cause and impact of an incident, root-cause analysis is able to fast-track the resolution and reduce mean time to repair substantially.
  5. Predictive – These analytics help operators identify early indicators and provide insights into looming problems that may eventually lead to performance degradation and outages. Predictive analytics are also good at providing early insights into anomalies to better plan for what’s ahead.
  6. Prescriptive – These analytics provide intelligent and actionable recommendations to remediate an incident. These recommendations should capture tribal knowledge gathered over the years in the organization as well as industry best practices, and may even be crowd-sourced to capture state-of-the-art knowledge. These analytics provide the opportunity to finally close the loop in automated IT Operations Management.
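
To make behavior profiling and anomaly detection more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the method of any particular product: the `MetricProfile` class, its window size and its threshold are hypothetical, and a simple rolling mean and standard deviation stand in for the richer machine learning models the list describes.

```python
from collections import deque
from statistics import mean, pstdev


class MetricProfile:
    """Rolling baseline for one metric (hypothetical helper for illustration)."""

    def __init__(self, window=30, threshold=3.0, min_samples=10):
        self.values = deque(maxlen=window)  # recent history that defines "normal"
        self.threshold = threshold          # deviations from baseline that count as anomalous
        self.min_samples = min_samples      # warm-up period before any alarm is raised

    def observe(self, value):
        """Return True if value deviates sharply from the learned baseline."""
        anomalous = False
        if len(self.values) >= self.min_samples:
            baseline = mean(self.values)
            spread = pstdev(self.values)
            if spread > 0:
                anomalous = abs(value - baseline) / spread > self.threshold
            else:
                # A perfectly flat history: any change at all is a deviation.
                anomalous = value != baseline
        # Always fold the new point in so the profile keeps adapting over time.
        self.values.append(value)
        return anomalous
```

Feeding the profile a steady CPU reading of around 50 percent and then a sudden 90 would flag only the spike. A real deployment would also need the contextual and historical signals mentioned above (time of day, seasonality, related objects) to keep false alarms down; this sketch covers only sudden changes in a single metric.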

Beyond Monitoring

IT operations teams have been in firefighting mode for a while now, with humans reacting to incidents and trying to resolve them after they have spun out of control. AI instead provides technologies that help automate many of these tasks so that incidents can be handled in advance. Automating IT operational tasks, preventing outages in the first place, and getting to the root cause quickly and automatically together form the next frontier in remediating these issues.
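
One way to picture automated root-cause analysis is to combine a learned topology with the set of objects currently flagged as anomalous. The Python sketch below assumes the topology is already available as a simple dependency map; the function names and the object names in the usage note are hypothetical, and the rule it applies (an anomalous object whose own dependencies look healthy is a likely root cause) is a deliberate simplification.

```python
def root_cause_candidates(depends_on, anomalous):
    """Return anomalous objects none of whose direct dependencies are anomalous."""
    candidates = set()
    for obj in anomalous:
        upstream = depends_on.get(obj, ())
        if not any(dep in anomalous for dep in upstream):
            candidates.add(obj)
    return candidates


def impacted_by(depends_on, root):
    """Return every object that depends on root, directly or transitively."""
    # Invert the dependency map: for each object, who depends on it?
    dependents = {}
    for obj, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(obj)

    seen, stack = set(), [root]
    while stack:
        for parent in dependents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

With a hypothetical chain in which `web-app` depends on `app-server`, which depends on `database`, which depends on `storage-volume`, and with the first three flagged as anomalous, `root_cause_candidates` would single out `database`, and `impacted_by` would report `app-server` and `web-app` as the affected objects.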

It’s become apparent that it’s no longer humanly possible to review monitoring data for the purpose of identifying incidents. In fact, AI is leapfrogging traditional monitoring solutions for DevOps and TechOps teams. These teams are finally beginning to understand how useful it can be to move beyond simple monitoring tools, and how indispensable analytics is for real-time, intelligent decision making.

Dr. Akhil Sahai is an accomplished management and technology leader with 25+ years of experience at large enterprises and at startups. Akhil came to Perspica from HP Enterprise where, as Sr. Director of Product Management, he envisaged, planned and managed the Solutions Program. At Dell, as Director of Products, Akhil led Product Strategy and Management of Dell’s Converged Infrastructure product line. He also led Gale Technologies, as VP of Products, to its successful acquisition by Dell. Prior to that, at Cisco, he undertook business development for VCE Coalition, and at VMware he managed global product strategy and management for vCloud Software, with a focus on applications and the Virtual Appliances product line. He has published 80+ peer-reviewed articles, authored a book, edited another, and chaired multiple international IEEE/IFIP conferences. He has filed 20 technology patents (with 16 granted). He has a Ph.D. from INRIA France and an MBA from the Wharton School.




Edited by Stefania Viscusi