Why Collecting More Metrics and Creating More Graphs Is a Slippery Slope


The rapid adoption of cloud-native and containerized microservices, distributed data processing frameworks, and continuous delivery practices has led to a massive increase in application complexity. This rapid change has quickly rendered current monitoring solutions obsolete. Several monitoring tools have responded by collecting more metrics, and doing so efficiently. However, they simply offer a means to plot these metrics across several hundred graphs, sometimes allowing the user to manually group or tag some graphs, but still leaving the user to manually correlate them and make sense of what is going on. Let's look into this in a little more detail:

Traditional monitoring approaches fall short.

Most current monitoring and troubleshooting products emphasize one standard way of visualizing an application and troubleshooting failures: dashboards. These dashboards are used to visualize metrics and employ simple static threshold-based rules to alert when metrics go beyond their normal operating range.

Troubleshooting workflow using dashboards and alerts.

A dashboards- and alerts-driven troubleshooting workflow in a traditional monitoring product is structured in the following way:

  1. A user tags their hosts, containers, and services with the appropriate tags. For example, the user may tag all their production MongoDB machines with the tags { environment = production, stack = mongodb }.
  2. A user configures static threshold-based alerts on all the important microservices that they would like to monitor. These alerts could be configured on a single metric, or on a class of metrics, across a set of hosts, containers, or services. For example, a rule triggers an alert if the memory-used percentage on any production MongoDB machine goes above 75 percent.
  3. A user sets up various custom dashboards, potentially one for each microservice, plus one for the overall application. The overall dashboard tracks the most critical metrics for the entire application, while each per-service dashboard tracks the critical metrics for that specific service. For a complex microservices architecture, this process ends up creating tens or hundreds of dashboards.
  4. When an alert triggers on a metric, the user logs into the product and opens the overall dashboard. From this point on, the user goes through a series of completely manual steps that include exploring various dashboards, browsing different metrics from each microservice over multiple time ranges, and analyzing events that are overlaid on the metric graphs. Eventually, with enough past experience of the system's behavior during the specific failure, the user is able to home in on the microservice responsible for the root cause.
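The static threshold rule in step 2 can be sketched in a few lines. This is a minimal illustration only: the tags and the 75 percent memory threshold come from the MongoDB example above, while the function names and data shapes are hypothetical, not any product's actual API.

```python
# Minimal sketch of a tag-scoped static threshold alert.
# Tags and the 75% memory threshold mirror the MongoDB example above;
# the data structures are hypothetical, not any product's actual API.

def matches(host_tags, required_tags):
    """True if a host carries all of the required tags."""
    return all(host_tags.get(k) == v for k, v in required_tags.items())

def evaluate_alert(hosts, required_tags, metric, threshold):
    """Return hosts whose latest metric value breaches the threshold."""
    return [
        h["name"]
        for h in hosts
        if matches(h["tags"], required_tags) and h["metrics"][metric] > threshold
    ]

hosts = [
    {"name": "db-1", "tags": {"environment": "production", "stack": "mongodb"},
     "metrics": {"memory_used_pct": 82.0}},
    {"name": "db-2", "tags": {"environment": "staging", "stack": "mongodb"},
     "metrics": {"memory_used_pct": 90.0}},
]

alerting = evaluate_alert(
    hosts, {"environment": "production", "stack": "mongodb"},
    "memory_used_pct", 75.0)
print(alerting)  # only production MongoDB hosts are considered -> ['db-1']
```

Note that the staging host is ignored even though its memory usage is higher; the rule's scope is defined entirely by the tags, which is why getting the initial tagging right matters so much in this workflow.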

The above workflow is the standard process for most monitoring solutions available today. However, there are several issues with this approach, especially when dealing with web-scale applications:

  1. The above workflow can involve a single ops engineer spending a few minutes or entire teams investing several hours poring over complex dashboards. The efficiency of the workflow depends entirely on the experience level of the ops team and their ability to:
    • Start with a correct initial setup by defining the right dashboards and alerts.
    • Ignore the noise generated by false alerts, while knowing which alerts to focus on and troubleshoot.
    • Examine the right dashboards and metrics (out of the tens of dashboards and thousands of metrics) during the troubleshooting process.
  2. This workflow is further complicated by the hyper-change nature of the infrastructure. When hosts, containers, and services are ephemeral, browsing metrics on a service or container that is no longer active is a waste of time.
  3. Finally, the lack of a service or application topology means that the troubleshooting phase is missing a critical element: knowledge and a visual representation of the dependencies between microservices.
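To make the missing-topology point concrete: a service topology is essentially a dependency graph, and with one in hand, troubleshooting can walk downstream from the service that alerted instead of scanning every dashboard. The sketch below is purely illustrative; the service names and graph shape are hypothetical.

```python
# Hypothetical sketch: a service topology as a dependency graph.
# With it, troubleshooting can walk downstream from the service that
# alerted instead of manually scanning every dashboard.

from collections import deque

# service -> services it depends on (illustrative names only)
topology = {
    "web": ["api"],
    "api": ["mongodb", "cache"],
    "cache": [],
    "mongodb": [],
}

def downstream(service, topo):
    """All services the given service transitively depends on."""
    seen, queue = set(), deque([service])
    while queue:
        for dep in topo.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

# If "web" alerts, only its dependency chain needs inspection:
print(sorted(downstream("web", topology)))  # ['api', 'cache', 'mongodb']
```

Without this graph, the user must reconstruct these dependencies from memory, which is exactly the manual correlation step the dashboard-driven workflow leaves to the operator.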

A data-science approach for complex and distributed applications.

The OpsClarity platform has several analytics constructs that are specifically designed to manage the hyper-scale, hyper-change microservices architecture of modern web-scale applications. The platform was built with the specific goal of significantly improving the troubleshooting workflow for these applications. It was designed from the ground up to handle the massive volume of data generated by modern web-scale applications by applying data science and advanced correlation and anomaly algorithms. However, since every application and metric is different, the same algorithms or anomaly detection techniques cannot be applied to all the metrics that are collected. Based on the context and history of the application and metrics, the platform constantly learns system behavior, understands the context, and chooses the appropriate combination of algorithms to apply. This intelligence is built into the platform’s engine, called the Operational Knowledge Graph (OKG).
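The idea of choosing different algorithms for different metrics can be pictured as a simple dispatch on metric context. The categories and strategy names below are hypothetical and purely illustrative; they are not a description of the OKG's internals.

```python
# Purely illustrative: selecting an anomaly-detection strategy from the
# metric's context. Categories and strategy names are hypothetical,
# not a description of the OKG's internals.

def pick_strategy(metric):
    """Map what is known about a metric to a detection strategy."""
    if metric.get("seasonal"):   # e.g. request rates with daily cycles
        return "seasonal-decomposition"
    if metric.get("bounded"):    # e.g. percentages such as memory used
        return "static-range"
    return "rolling-zscore"      # generic fallback for other metrics

print(pick_strategy({"name": "requests_per_sec", "seasonal": True}))
print(pick_strategy({"name": "memory_used_pct", "bounded": True}))
```

The point is simply that one detector cannot fit all metrics: a bounded percentage and a seasonal request rate have very different notions of "anomalous," so the platform's choice must be informed by context and history.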

Powered by the intelligence and knowledge curated by the OKG, the OpsClarity approach to web-scale application monitoring and troubleshooting is summarized as follows:

  1. OpsClarity maintains an always up-to-date Operational Knowledge Graph of the application infrastructure. This includes information about the set of hosts and containers, and the services that run on them. It also includes information about how these services communicate with each other. In short, this is the application or service topology.
  2. The OKG automatically configures agents with the right set of plugins to collect data from the right services. It also determines which metrics are the most critical to collect for each service.
  3. The anomaly detection engine automatically baselines metrics, and identifies regions where a metric behaves in an anomalous manner. The OKG allows customization of baselining algorithms to each service, significantly improving accuracy.
  4. The platform automatically synthesizes various signals from hosts and containers to create an aggregated health model for each service. The signals include port- and HTTP-level network checks, anomalies, and other information. An aggregated health view is critical for quick insight into which microservices are unhealthy.
  5. The events generated by anomalies and health changes flow into an Event Log that is easy to browse in the context of the application topology. It offers various filtering and ranking mechanisms to reduce alert noise, including filtering to events of interest, ranking events by importance, and clustering related events together.
  6. The Event Log in combination with the OKG makes the troubleshooting process faster and more reliable than traditional monitoring solutions.
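The automatic baselining in step 3 can be illustrated with a simple rolling-window z-score. This is a generic textbook technique presented only as a sketch of what "baselining" means; it is not a description of the algorithms OpsClarity's engine actually applies.

```python
# Illustrative rolling-baseline anomaly check: flag a value that
# deviates more than z_threshold standard deviations from a trailing
# window. A generic textbook technique, not OpsClarity's algorithm.

from statistics import mean, stdev

def is_anomalous(history, value, window=20, z_threshold=3.0):
    """True if `value` sits far outside the trailing-window baseline."""
    recent = history[-window:]
    if len(recent) < 2:
        return False  # not enough history to establish a baseline
    mu, sigma = mean(recent), stdev(recent)
    if sigma == 0:
        return value != mu  # flat baseline: any change is anomalous
    return abs(value - mu) / sigma > z_threshold

history = [50.0, 51.0, 49.5, 50.5, 50.0, 49.0, 51.5, 50.2]
print(is_anomalous(history, 50.8))  # within baseline -> False
print(is_anomalous(history, 95.0))  # far outside baseline -> True
```

Even this toy version shows why per-service customization matters: the window length and threshold that work for a steady gauge would misfire badly on a spiky or seasonal metric.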

Amit is co-founder and CTO at OpsClarity. As a seasoned technologist, Amit is adept at finding innovative solutions to hard problems across a diverse set of domains, including operational analytics, web search, and advertising platforms, and he drives the technology roadmap at OpsClarity. Prior to founding OpsClarity, he built large-scale crawling, indexing, and web-graph analysis systems for web search at Google and Yahoo. Amit holds multiple patents and has authored research papers in the areas of web search and machine learning. He holds an MS in Computer Science from Stony Brook University.

Edited by Kyle Piscioniere
