FACT: Performance slows when a critical resource is exhausted
Initial problem triage for performance issues frequently starts and ends with observations of CPU utilisation. CPU is one resource among many in a technology stack; when any resource is exhausted, application response time and stability are adversely impacted, and users experience slow response times. Because CPU is one of the most visible infrastructure metrics, it is frequently associated with slowing and erratic performance, but it is often blamed unnecessarily.
A myth has evolved over recent decades that slow performance is generally due to insufficient CPU capacity, and that increasing CPU will solve most performance problems. This may be correct in some instances, but without a comprehensive investigation to identify the root of the problem it usually amounts to jumping to conclusions.
Take, for example, a J2EE application that processes sales orders. Imagine that the Java code runs a number of validations, calculations and stock checks, and then updates a number of database records each time an order is placed. The Java code may require significant heap memory to perform these functions, but when the processing is complete, the JVM reclaims the memory. Imagine now that the database server slows down significantly during the peak sales period due to a disk storage issue. This has the side effect of causing order processing to wait much longer than normal for the database to respond to requests. If the slowdown causes order processing to take several times longer than normal, then several times more heap memory will be held by in-flight orders than during normal operation. That leaves much less heap available for other processing, which may drive the JVM into aggressive garbage collection as it continuously attempts to free up heap space. The CPU consumed by JVM garbage collection might typically be less than 5%, but can climb to 70-80% when the JVM is aggressively trying to reclaim memory.

The application server can therefore run its CPUs at nearly 100% at about the same time that users complain of slow response times. Without proper investigation, the proposed solution may be to add more CPU resource to the application server tier, when the real focus should be on the disk storage subsystem that the database relies on. Adding CPU cores to the application server in this instance would not really help (for long). Increasing the JVM heap would help, but would have other side effects. The only real solution is to isolate the slow storage as the root cause of the problem and fix it.
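The mechanism can be sketched in a few lines of Java. This is a minimal, hypothetical illustration (the Order class, the 200 worker threads, the per-order payload size and the simulated database delay are all assumptions, not taken from any real application): each in-flight order holds heap until the simulated database call returns, so the slower the database, the more heap is occupied at any moment and the harder the JVM has to work to reclaim memory. Running it with a small heap (for example -Xmx256m) and GC logging enabled (-Xlog:gc on JDK 9+, or -verbose:gc on older JDKs) lets you watch collection activity rise as the latency argument grows.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: heap held per in-flight order grows with downstream (database) latency.
public class SlowDependencyHeapDemo {

    // Hypothetical order payload: validations, calculations and stock checks
    // are represented here simply as a block of working data.
    static final class Order {
        final byte[] workingData = new byte[512 * 1024]; // ~0.5 MB per order
    }

    public static void main(String[] args) throws InterruptedException {
        long dbLatencyMs = args.length > 0 ? Long.parseLong(args[0]) : 50;
        List<Thread> workers = new ArrayList<>();

        // 200 concurrent order-processing threads, each holding its order's
        // heap for as long as the "database" takes to respond.
        for (int i = 0; i < 200; i++) {
            Thread t = new Thread(() -> {
                for (int n = 0; n < 100; n++) {
                    Order order = new Order();          // heap allocated per order
                    simulateDatabaseCall(dbLatencyMs);  // heap retained while waiting
                    consume(order);                     // order becomes unreachable afterwards
                }
            });
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            t.join();
        }
    }

    static void simulateDatabaseCall(long latencyMs) {
        try {
            Thread.sleep(latencyMs); // stand-in for a slow, storage-bound query
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    static void consume(Order order) {
        // No-op: keeps the order reachable until the database call completes.
    }
}
```

Comparing a run with a 50 ms delay against one with 500 ms makes the point: the per-order work is identical, but the longer wait multiplies the heap held concurrently, and the GC log shows the collector working far harder even though nothing on the application server itself has changed.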
All performance problems have at their core a requirement for a resource that is not currently available. That resource could be CPU cycles, database connections or available processing threads. If any critical resource is consumed to its limit, performance will suffer. Identifying such issues is a key driver for conducting load tests with realistic workloads on production-like configurations, as such tests are the most likely to expose resource limits that would otherwise impact performance in a production environment.
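The same saturation effect is easy to reproduce with any bounded resource. The sketch below uses a fixed pool of ten worker threads as the critical resource; the pool size, request count and 100 ms service time are illustrative assumptions only. Once all workers are busy, further requests queue, and their end-to-end response time grows even though each unit of work still takes exactly the same time, which is precisely the behaviour a realistic load test is designed to surface before production does.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: 200 requests contending for a pool of 10 threads (the bounded resource).
public class ResourceLimitDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(10); // the critical resource
        AtomicLong worstResponseMs = new AtomicLong();

        for (int i = 0; i < 200; i++) {
            final long submittedAt = System.nanoTime();
            pool.submit(() -> {
                try {
                    Thread.sleep(100); // fixed service time per request
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                long responseMs = (System.nanoTime() - submittedAt) / 1_000_000;
                worstResponseMs.accumulateAndGet(responseMs, Math::max);
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        // Queueing behind the exhausted pool pushes the worst case well past 100 ms.
        System.out.println("Worst response time: " + worstResponseMs.get() + " ms");
    }
}
```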