Machine learning can elevate an IT organization’s performance monitoring strategy – and, given the wide array of mature machine learning algorithms and frameworks now available, it’s not as complicated to learn as it used to be.

Machine learning plays a growing role in IT organizations. IT teams can use this type of artificial intelligence for log analysis and anomaly detection. Organizations not ready to adopt machine learning can also consider time-series monitoring.

Let’s review some of the key concepts related to machine learning in IT performance monitoring, in general, and then walk through an example using Apache Mesos and the K-means clustering algorithm.

Collect and define metrics

Most monitoring systems vacuum up logs, parse individual fields and then display them on a dashboard. But to predict or detect an outage, or anticipate a surge in demand, IT teams require metrics from a wide variety of systems (business, technical and external) fed into an algorithm.

However, each application and business is different, so there cannot be one single algorithm built into performance monitoring tools.

Instead, IT admins must write this code themselves. This process isn’t terribly complicated, but does require knowledge of machine learning, as well as programming skills. An organization’s existing monitoring systems provide most of the data it needs.

Think of machine learning algorithms as a black box: You throw a wealth of data into it and then hope something useful comes out the other end. Machine learning models work best when they have hundreds or thousands of data points from which to draw conclusions.

There are metrics that affect IT system performance, such as spikes in traffic volume, and metrics that reflect it, such as web page latency. Machine learning enables admins to use both, which is another improvement over the log-scraping-in-isolation approach to IT monitoring. Some metrics an IT admin could plug into that “black box” include:

  • network traffic volume, by source and target IP address;
  • memory;
  • storage in use;
  • end-to-end app latency;
  • replication latency; and
  • message queue length.
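As a rough illustration of how metrics like these might be prepared for an algorithm, the sketch below packs one observation of each metric into a row and rescales every column to zero mean and unit variance, so that a metric measured in gigabytes does not swamp one measured in milliseconds. The class name, field names and sample values are all hypothetical, and traffic is simplified to a single aggregate figure rather than a per-IP breakdown.

```python
from dataclasses import dataclass, astuple
from statistics import mean, pstdev


@dataclass
class MetricSample:
    """One observation window. Field names are illustrative, not from any tool."""
    traffic_mbps: float
    memory_used_gb: float
    storage_used_gb: float
    app_latency_ms: float
    replication_latency_ms: float
    queue_length: float


def standardize(samples):
    """Rescale each metric column to zero mean and unit variance."""
    columns = list(zip(*(astuple(s) for s in samples)))
    scaled_columns = []
    for col in columns:
        mu, sigma = mean(col), pstdev(col)
        # A constant column carries no signal, so map it to zeros.
        scaled_columns.append([(v - mu) / sigma if sigma else 0.0 for v in col])
    # Transpose back so each row is one sample again.
    return [list(row) for row in zip(*scaled_columns)]


# Made-up readings, for illustration only.
samples = [
    MetricSample(120.0, 6.1, 410.0, 85.0, 12.0, 30),
    MetricSample(340.0, 7.4, 412.0, 190.0, 40.0, 210),
    MetricSample(150.0, 6.3, 415.0, 95.0, 14.0, 45),
]
features = standardize(samples)
```

Each row of `features` is then a fixed-length numeric vector, which is the shape most machine learning algorithms expect as input.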

An admin with a spreadsheet of this data would have to guess which data points might be correlated. Data science eliminates the guesswork: it points out which items actually correlate, and provides the tools to flag anomalies and make predictions regarding system health and demand. And it can do this with hundreds of metrics, whereas a human with a spreadsheet can look at only two or three.
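To make the idea of finding correlated metrics concrete, here is a minimal sketch that computes the Pearson correlation coefficient for every pair of metrics and keeps the pairs above a cutoff. Pearson correlation is just one common measure, chosen here as an assumption; the metric names, the values and the 0.8 threshold are invented for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series has no defined correlation
    return cov / sqrt(vx * vy)


def correlated_pairs(metrics, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(sorted(metrics), 2):
        r = pearson(metrics[a], metrics[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Hypothetical readings taken at the same six points in time.
metrics = {
    "traffic_mbps": [120, 340, 150, 400, 130, 360],
    "app_latency_ms": [85, 190, 95, 230, 88, 200],
    "storage_used_gb": [410, 409, 412, 411, 413, 410],
}
strong_pairs = correlated_pairs(metrics)
```

With this sample data, only traffic and application latency clear the threshold. The same loop works for hundreds of metrics, although the number of pairs grows quadratically.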

The challenge with labelled data

There are two primary kinds of machine learning algorithms: supervised and unsupervised. Supervised machine learning algorithms train predictive models that can, for example, predict system outages before they happen. This is possible because an outage is usually preceded by a cascading series of events.

To support these predictive models, however, IT teams require classified or labelled data. This data captures relationships between a cause and an effect — such as a certain IT metric that results in a certain system status. A lack of comprehensive labelled data sets remains one of the biggest hurdles to the use of machine learning in IT performance monitoring.
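One way labelled data might be produced, sketched here under stated assumptions, is to join metric timestamps against a record of known outages: any sample taken shortly before an outage is labelled 1, and everything else 0. The function name, the timestamps (in seconds) and the 180-second lead window are hypothetical choices for illustration.

```python
def label_samples(sample_times, outage_times, lead_window=300):
    """Label a sample 1 if an outage starts within lead_window seconds after it."""
    labels = []
    for t in sample_times:
        # Strictly before the outage, and no more than lead_window ahead of it.
        precedes_outage = any(
            0 < outage - t <= lead_window for outage in outage_times
        )
        labels.append(1 if precedes_outage else 0)
    return labels


# Hypothetical data: one sample per minute, one outage recorded at t=600.
sample_times = [0, 60, 120, 180, 240, 300, 360, 420, 480, 540, 600]
outage_times = [600]
labels = label_samples(sample_times, outage_times, lead_window=180)
```

The labels line up one-to-one with the metric samples, giving a supervised algorithm the cause-and-effect pairs it needs. The hard part in practice is having an accurate outage record to label against in the first place.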

Let’s look at an example of how to use machine learning for IT performance monitoring, with Apache Mesos and the K-means clustering algorithm for data clustering and analysis. […]
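The walkthrough itself is not reproduced here. Separately from it, for readers unfamiliar with K-means, the sketch below shows the algorithm's two alternating steps in plain Python: assign each point to its nearest centroid, then move each centroid to the mean of its points. It does not involve Apache Mesos, the data points are invented (CPU percent, latency in milliseconds), and the evenly spaced starting centroids are a simplification chosen to keep the result deterministic.

```python
from math import dist


def kmeans(points, k, iterations=20):
    """Plain K-means (Lloyd's algorithm) on a list of equal-length tuples."""
    # Simplified, deterministic start: k points spaced evenly through the list.
    centroids = [points[i * len(points) // k] for i in range(k)]
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return centroids, assignments


# Hypothetical (cpu_percent, latency_ms) readings; real data should be scaled first.
points = [
    (22.0, 90.0), (91.0, 430.0), (25.0, 85.0),
    (88.0, 410.0), (30.0, 100.0), (95.0, 450.0),
]
centroids, assignments = kmeans(points, k=2)
```

On this data the quiet and busy readings end up in separate clusters. A point that sits far from every centroid is a natural candidate for an anomaly flag. A production setup would more likely rely on a maintained library implementation with a smarter initialization.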

