Top 8 APM metrics that IT teams use to monitor their apps
A superior customer experience (CX) is built on accurate and timely application performance monitoring (APM) metrics. You can’t fine-tune your apps or system to improve CX until you know what the problem is or where the opportunities are.
APM solutions typically provide a centralized dashboard to aggregate real-time performance metrics and insights to be analyzed and compared. They also establish baselines to alert system administrators to deviations that indicate actual or potential performance issues. IT teams, DevOps and site reliability engineers can then quickly identify and address application issues.
Application performance monitoring is the initial phase of application performance management. Monitoring tracks app performance, and that data is what makes managing the app possible. An APM solution gives administrators the instrumentation tools they need to quickly gather data and conduct root cause analysis, so they can isolate, troubleshoot and solve the underlying problem.
Key APM metrics to monitor
There are a number of metrics you can choose from, but we recommend focusing on these eight metrics to reap the most benefits within your IT organization.
1. Apdex and SLA scores
Let’s start with application performance index (Apdex) and service level agreement (SLA) scores, since they are the foundation of superior customer experience. The speeds and feeds you’ll measure are the specific aspects that ought to add up to fast performance, but they are the means, not the end. Happy customers are your goal—hopefully leading to increased sales.
Apdex and SLA scores are the most popular ways to track end-user experience. The Apdex score tracks the relative performance of an app against a target time that a web request or transaction should normally take. SLAs are the performance commitments in your customer contract; falling below a defined SLA risks a drop in CX (and possibly predefined penalties).
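As a rough illustration of how an Apdex score is derived, the sketch below applies the standard Apdex formula: requests at or under the target time T count as satisfied, requests between T and 4T count as tolerating (half weight), and anything slower counts as frustrated. The function name, the sample response times and the 0.5-second target are assumptions made for this example, not values from any particular APM product.

```python
def apdex(response_times_s, target_s):
    """Return the Apdex score (0.0 to 1.0) for a list of response times in seconds."""
    if not response_times_s:
        raise ValueError("need at least one response time")
    # Satisfied: at or under the target time T.
    satisfied = sum(1 for t in response_times_s if t <= target_s)
    # Tolerating: slower than T but no slower than 4T; counted at half weight.
    tolerating = sum(1 for t in response_times_s if target_s < t <= 4 * target_s)
    # Frustrated requests (slower than 4T) contribute nothing to the score.
    return (satisfied + tolerating / 2) / len(response_times_s)


# Hypothetical sample: six requests measured against a 0.5 s target.
samples = [0.2, 0.4, 0.5, 1.1, 1.9, 2.5]
score = apdex(samples, target_s=0.5)
print(f"Apdex: {score:.2f}")
```

Here three requests are satisfied, two are tolerating and one is frustrated, so the score is (3 + 2/2) / 6.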
2. Application availability (also known as uptime or web performance monitoring)
This is the most basic metric: Are the lights on? You are monitoring and measuring if your application is online and available. Most companies use this to measure service level agreement (SLA) compliance. Uptime is often a shorthand for assessing overall system reliability and health. Excessive downtime can negatively impact user satisfaction for organizations delivering online services. For a web application, you can verify availability with a simple, regularly scheduled HTTP check.
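A minimal sketch of such an HTTP check, using only the Python standard library, might look like the following. The function names and the "status below 400 means up" rule are assumptions for illustration; a real check would be run on a schedule (for example by cron or by your APM tool) and would usually probe from more than one location.

```python
import urllib.request


def is_up(url, timeout_s=5.0):
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status < 400
    except OSError:
        # URLError, HTTPError (raised for 4xx/5xx) and socket timeouts
        # are all OSError subclasses, so each counts as "down" here.
        return False


def availability_percent(check_results):
    """Return uptime as a percentage of successful checks."""
    if not check_results:
        raise ValueError("need at least one check result")
    return 100.0 * sum(1 for ok in check_results if ok) / len(check_results)


# Hypothetical history: 99 successful checks and 1 failed check.
history = [True] * 99 + [False]
print(f"Availability: {availability_percent(history):.1f}%")
```

Storing each result of `is_up` and passing the history to `availability_percent` gives a figure you can compare against an SLA target.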
3. CPU usage (also known as resource usage)
A high percentage of CPU capacity being used by an application can be a sign of a performance problem. A sudden spike in CPU usage can result in slower response times. Fluctuations in demand for an app might also indicate that you need to add more application instances. A general rule is that if CPU usage exceeds 70% for more than 30% of the time, you could be running out of CPU capacity.
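The 70%/30% rule of thumb can be expressed as a small check over a series of CPU samples. This is only a sketch: the function name and the sample values are hypothetical, and it assumes the samples were taken at regular intervals so that the share of samples approximates the share of time.

```python
def cpu_capacity_warning(samples_pct, usage_limit=70.0, time_fraction_limit=0.30):
    """Return True if CPU usage is above usage_limit for more than
    time_fraction_limit of the samples."""
    if not samples_pct:
        raise ValueError("need at least one CPU sample")
    over_limit = sum(1 for s in samples_pct if s > usage_limit)
    return over_limit / len(samples_pct) > time_fraction_limit


# Hypothetical CPU readings (percent), one per minute.
readings = [50, 60, 75, 80, 90, 40, 30, 20, 85, 65]
print(cpu_capacity_warning(readings))  # 4 of 10 samples are above 70%
```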
Resource usage can also include memory and disk usage. Tracking RAM helps identify memory leaks that could lead to failure or the need for greater memory. Disk usage metrics can help prevent an app from running out of persistent storage, which could cause it to fail. High disk usage could also be a sign of inefficient backend data storage or faulty data retention policies.
4. Error rates
Your APM software should monitor applications to record the percentage of requests that result in failures. This helps you identify and prioritize the resolution of issues that impact the user experience. Application errors in a web app can include server errors, 404 responses and timeouts. You can configure your APM solution to send notifications when an error rate goes above a set threshold. For example, send an alert when more than 2.5% of recent requests have resulted in an error. Choose a window large enough for that percentage to be meaningful: across only 25 requests, a single failure is already a 4% error rate.
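One way to picture a threshold alert like this is a sliding window over the most recent requests. The sketch below is an illustration, not how any specific APM product implements alerting. The class name and the window of 200 requests are assumptions, with the window chosen so that a 2.5% threshold corresponds to a whole number of requests (5).

```python
from collections import deque


class ErrorRateMonitor:
    """Track the error rate over the most recent `window_size` requests."""

    def __init__(self, window_size=200, threshold=0.025):
        # deque with maxlen drops the oldest entry automatically when full.
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def error_rate(self):
        return sum(self.window) / len(self.window) if self.window else 0.0

    def should_alert(self):
        # Wait for a full window so a single early error does not trigger an alert.
        window_full = len(self.window) == self.window.maxlen
        return window_full and self.error_rate() > self.threshold

    def record(self, is_error):
        """Record one request outcome and return True if an alert should fire."""
        self.window.append(bool(is_error))
        return self.should_alert()


monitor = ErrorRateMonitor()
for _ in range(195):
    monitor.record(False)   # successful requests
for _ in range(5):
    monitor.record(True)    # 5 errors in 200 requests: exactly 2.5%, no alert yet
alert = monitor.record(True)  # a sixth error pushes the rate to 3%
print(alert)
```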
5. Garbage collection
Garbage collection (GC) is the automatic memory management used by Java and other managed languages: the runtime reclaims memory held by objects that an application no longer uses. In a generational collector, unused objects are deleted and live objects are copied to a later-generation memory pool. Because collection consumes CPU time and can pause the application, this is a metric you want to keep in the happy middle. If GC runs too often, it adds too much overhead; if it does not run often enough, your system could be left with too little free memory.
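To make "how often and how long GC runs" concrete, here is a small sketch that counts collections and sums their duration. It uses CPython's `gc.callbacks` hook as a stand-in for the GC metrics a JVM or other runtime would expose through its own tooling; the class and function names are hypothetical, and the approach assumes CPython.

```python
import gc
import time


class GcPauseTracker:
    """Count garbage collections and accumulate the time spent in them."""

    def __init__(self):
        self.collections = 0
        self.total_pause_s = 0.0
        self._started = None

    def __call__(self, phase, info):
        # CPython calls each entry in gc.callbacks with phase "start" or "stop".
        if phase == "start":
            self._started = time.perf_counter()
        elif phase == "stop" and self._started is not None:
            self.total_pause_s += time.perf_counter() - self._started
            self.collections += 1
            self._started = None


def measure_gc_overhead(workload):
    """Run workload() and return (tracker, elapsed seconds)."""
    tracker = GcPauseTracker()
    gc.callbacks.append(tracker)
    started = time.perf_counter()
    try:
        workload()
    finally:
        gc.callbacks.remove(tracker)
    return tracker, time.perf_counter() - started


# Hypothetical workload: allocate some objects, then force a collection.
def workload():
    junk = [[i] for i in range(10_000)]
    del junk
    gc.collect()


tracker, elapsed = measure_gc_overhead(workload)
print(f"{tracker.collections} collections, "
      f"{100 * tracker.total_pause_s / elapsed:.1f}% of run time in GC")
```

The ratio of GC time to elapsed time is the overhead figure to watch: a rising share suggests GC is running too often or working too hard.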
6. Number of instances
Tracking instances enables you to scale your application to meet actual user demand, based on how many app or server instances are running at any time. This can be especially important for cloud applications. Auto-scaling can help you ensure modern applications scale to meet demand and save budget during off-peak hours. This can also create infrastructure-monitoring challenges. For example, if your app automatically scales up on CPU usage, you might not ever see your CPU usage rise—instead, you could see the number of server instances rise too far, along with your hosting bill.
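The auto-scaling blind spot described above can be caught by watching instance counts alongside CPU. The sketch below flags samples where CPU looks healthy but the instance count has passed a ceiling you consider normal. The `Sample` type, the function name and the limits of 70% CPU and 10 instances are all assumptions for illustration.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    avg_cpu_pct: float  # average CPU usage across instances
    instances: int      # number of running instances


def runaway_scaling(samples, cpu_limit_pct=70.0, max_instances=10):
    """Return indexes of samples where CPU looks fine but the
    instance count exceeds the expected ceiling."""
    return [
        i for i, s in enumerate(samples)
        if s.avg_cpu_pct <= cpu_limit_pct and s.instances > max_instances
    ]


# Hypothetical readings: CPU stays flat while the auto-scaler keeps adding instances.
history = [Sample(55, 4), Sample(60, 8), Sample(58, 12), Sample(62, 16)]
print(runaway_scaling(history))
```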
7. Request rates
You can measure the traffic received by an application to identify any significant decreases or increases in requests or in concurrent users. Correlating request rates with other application performance metrics will help you understand the scalability of your software applications. APM software can also monitor traffic to identify anomalies. An unexpected increase in requests could indicate a denial of service (DoS) attack. A large number of requests from the same user could indicate a compromised account. Even unusually low request rates could be bad: inactivity or no traffic at all could mean a failure in almost any part of your system.
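A simple way to flag both spikes and drops is to compare the current request rate with a baseline of recent "normal" readings. The sketch below uses a mean and standard deviation band; the function name, the three-standard-deviation tolerance and the sample numbers are assumptions, and production APM tools typically use more sophisticated baselining.

```python
import statistics


def request_rate_anomaly(baseline_rpm, current_rpm, tolerance_stdevs=3.0):
    """Classify the current requests-per-minute value against a baseline.

    Returns "spike", "drop", or None when the value is within the band.
    """
    mean = statistics.mean(baseline_rpm)
    stdev = statistics.stdev(baseline_rpm)  # needs at least two baseline values
    if current_rpm > mean + tolerance_stdevs * stdev:
        return "spike"
    if current_rpm < mean - tolerance_stdevs * stdev:
        return "drop"
    return None


# Hypothetical baseline: requests per minute during normal operation.
baseline = [100, 110, 95, 105, 90, 100]
print(request_rate_anomaly(baseline, 400))  # far above the band
print(request_rate_anomaly(baseline, 0))    # no traffic at all
print(request_rate_anomaly(baseline, 105))  # within the band
```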
8. Response times (also known as duration)
By tracking the average response time, that is, how long it takes an application to respond to a request for resources, you can assess app performance. These requests can be transactions initiated by end users, such as a request to load a web page, or internal requests from one portion of your application to another, such as a process or microservice requesting data from disk or memory. The total response time includes server response time (the time it takes your server to process a request) plus network latency (the time it takes the request and response to move across the network).
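The arithmetic behind total and average response time is straightforward, as the sketch below shows. The measurements are made-up values in milliseconds, and the variable names are illustrative only.

```python
import statistics

# Hypothetical measurements per request: (server processing ms, network latency ms).
requests = [(120.0, 30.0), (80.0, 20.0), (200.0, 50.0)]

# Total response time = server response time + network latency.
totals_ms = [server_ms + network_ms for server_ms, network_ms in requests]

average_ms = statistics.mean(totals_ms)
slowest_ms = max(totals_ms)

print(f"totals: {totals_ms}")
print(f"average: {average_ms:.1f} ms, slowest: {slowest_ms:.1f} ms")
```

Averages can hide outliers, so it is often worth looking at the slowest requests alongside the mean.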
A related metric is page load time, which measures the time it takes a webpage to load into a browser. Tracking page load times enables your application performance monitoring tools to identify the issues causing slow-loading pages so you can improve the digital experience. Slow page loads can mean page abandonment and lost business. You can set a performance baseline for this metric in your APM solution and be alerted when that benchmark is not met.
Additional application metrics
For those who are looking for a more comprehensive set of metrics related to application performance monitoring, you might want to consider the following metrics:
- Database queries: Measures the number of queries requested from a database by an application. Your APM tools can then help identify slow or inefficient queries that may be slowing overall performance of your application.
- I/O (Input/output): I/O shows the rate at which apps read or write data. You can track the performance of persistent storage media (such as HDD or SSD) and I/O rates for memory or virtual disks.
- Network usage: Network usage represents the total network bandwidth used by an application. Increased network usage might indicate performance problems slowing the application’s response time or creating bottlenecks.
- Node availability: Node availability is similar to the number of instances, but it applies to clustered cloud deployments. When you deploy apps to a Kubernetes cluster, the number of nodes available and responding (out of the total nodes in the cluster) can help identify problems within your infrastructure. Cloud spend metrics can also be important, giving you real-time visibility into cloud costs by tracking API calls, running time for cloud-based virtual machines (VMs) and total data egress rates.
- Throughput: Throughput is the volume of data that can be transferred between an app and users or other systems. It can be used to determine if an app is able to handle the expected traffic volume.
- Transaction tracing: This gives you a picture of single transactions carried out by an application. Data captured can include database calls, external calls and function calls—monitoring the transaction request from start to finish.
- Transaction volume: Transaction volume measures the number of transactions processed by an application. This enables APM tools to identify issues with scalability and capacity planning.
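To illustrate the transaction tracing item in the list above, the sketch below records how long each step of a single transaction takes. It is a toy example: the `Trace` class, the span names and the `handle_checkout` function are hypothetical, and the `time.sleep` calls stand in for real database and external calls. APM agents normally capture this data automatically through instrumentation.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collect timed spans for one transaction."""

    def __init__(self, name):
        self.name = name
        self.spans = []  # (span name, duration in seconds), in completion order

    @contextmanager
    def span(self, span_name):
        started = time.perf_counter()
        try:
            yield
        finally:
            # Record the span even if the wrapped code raises.
            self.spans.append((span_name, time.perf_counter() - started))


def handle_checkout(trace):
    with trace.span("db.load_cart"):
        time.sleep(0.01)   # stand-in for a database call
    with trace.span("http.payment_api"):
        time.sleep(0.02)   # stand-in for an external call
    return "ok"


trace = Trace("checkout")
with trace.span("total"):
    result = handle_checkout(trace)

for span_name, duration_s in trace.spans:
    print(f"{trace.name}: {span_name} took {duration_s * 1000:.1f} ms")
```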
Get started with choosing your APM solution
IBM Instana Observability provides real-time observability that everyone—and anyone—can use. It delivers quick time to value while ensuring your observability strategy can keep up with the dynamic complexity of today’s environments and tomorrow’s. From mobile to mainframe, Instana supports over 250 technologies and growing.