System Performance Analysis
By analyzing patterns of important log events generated by systems from across our Community, we have identified five areas that people tend to look into when using log data for performance analysis. Using our Tags and Alerts capabilities, you can create notifications based on the sample patterns provided below to help analyze your own system from a performance perspective.
1. Slow Response Time
Response times are among the most common and useful performance measures available from your log data. They give you an immediate understanding of how long a request takes to be returned. For example, web server logs can show how long a request takes to return a response to a client device. This can include the time taken by the components behind your web server (application servers, databases) to process the request, so it gives an immediate view of how well your application is performing. Recording response times from the client device or browser can give you an even more complete picture, since it also captures page load time in the app or browser as well as network latency. The following search pattern can be used to assist with this type of issue:
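As a rough illustration of tracking response times from web server logs, the Python sketch below pulls the request duration out of each line and reports requests slower than a threshold. The log layout (a combined-style access log with the duration in seconds appended as the last field), the function name slow_requests and the sample lines are all assumptions made for this example, so adjust the regular expression to your own format.

```python
import re

# Assumed layout: combined-style access log with the request duration
# (in seconds) appended as the final field. This is a hypothetical format.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ (?P<seconds>\d+(?:\.\d+)?)$'
)

def slow_requests(lines, threshold_seconds):
    """Return (path, seconds) for each request slower than the threshold."""
    slow = []
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        seconds = float(match.group("seconds"))
        if seconds > threshold_seconds:
            slow.append((match.group("path"), seconds))
    return slow

# Made-up sample lines for illustration only.
sample = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /home HTTP/1.1" 200 512 0.120',
    '203.0.113.5 - - [10/Oct/2023:13:55:40 +0000] "GET /report HTTP/1.1" 200 9041 2.750',
]
print(slow_requests(sample, threshold_seconds=1.0))
```

The threshold is passed in as a parameter because an acceptable response time depends on the application.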
2. Memory Issues and Garbage Collection
Although modern programming languages hide a lot of the complexity of memory management, you can still run into memory-related issues in your applications.
For example, out-of-memory errors can be catastrophic when they occur, as they often result in the application crashing due to lack of resources. You want to know about these as soon as they happen, so creating tags and generating notifications via alerts for these events is always recommended.
However, a leading indicator of out-of-memory issues can be your garbage collection behavior. Tracking it, and getting notified if heap usage relative to free heap space crosses a particular threshold or if garbage collection is taking a long time, can be particularly useful and can often point you in the direction of memory leaks. Identifying a memory leak before an out-of-memory exception occurs can be the difference between a major system outage and a simple server restart until the issue is patched.
Furthermore, slow or long garbage collection can also be one of the reasons users experience slow application behavior, because during garbage collection your system can slow down or, in some situations, block until garbage collection is complete. The following search patterns can be used to assist with this type of issue:
- Out of memory
- exceeds memory limit
- memory leak detected
- memwatch:leak: Ended heapDiff
- GC AND stats
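To experiment with the patterns above outside of a log management tool, the following Python sketch tags log lines that match any of them. The matching rules are assumptions made for this example (case-insensitive substring matching, with AND meaning that every term must appear somewhere in the line); they are not a description of how any particular search engine evaluates these patterns. The function names and sample lines are hypothetical.

```python
PATTERNS = [
    "Out of memory",
    "exceeds memory limit",
    "memory leak detected",
    "memwatch:leak: Ended heapDiff",
    "GC AND stats",
]

def matches(pattern, line):
    """Case-insensitive match; 'A AND B' requires every term to appear.

    Simplification: terms are matched as plain substrings, so 'GC' would
    also match inside a longer word.
    """
    text = line.lower()
    terms = [term.strip().lower() for term in pattern.split(" AND ")]
    return all(term in text for term in terms)

def tag_line(line):
    """Return the list of patterns that the given log line matches."""
    return [pattern for pattern in PATTERNS if matches(pattern, line)]

# Made-up sample lines for illustration only.
for line in [
    "kernel: Out of memory: Kill process 1234 (java)",
    "GC pause stats: 120ms",
    "request ok",
]:
    print(line, "->", tag_line(line))
```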
3. Deadlocks and Threading Issues
In short, a deadlock is a situation in which two or more competing actions are each waiting for the other to finish, and thus neither ever does. For example, a set of processes or threads is deadlocked when each one is waiting for an event that only another member of the set can cause. The impact of deadlocks can range from degraded system performance to a major outage.
Not surprisingly, deadlocks feature among the top 5 performance-related issues that our users write patterns to detect in their systems. The following search patterns can be used to assist with this type of issue:
- ‘Deadlock found when trying to get lock’
- ‘Unexpected error while processing request: deadlock;’
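To make the definition of a deadlock more concrete, the Python sketch below models "who is waiting on whom" as a dictionary and looks for a cycle, which is the situation where every member of a set is waiting on another member of the same set. The function name find_deadlock and the thread names are hypothetical, and this is only an illustration of the concept, not how any specific database or runtime detects deadlocks.

```python
def find_deadlock(wait_for):
    """Return the threads forming a wait cycle, or None if there is none.

    wait_for maps each thread to the single thread it is waiting on.
    """
    for start in wait_for:
        seen = []
        current = start
        # Follow the chain of waits until it ends or revisits a thread.
        while current in wait_for and current not in seen:
            seen.append(current)
            current = wait_for[current]
        if current in seen:
            # The chain looped back on itself: these threads are deadlocked.
            return seen[seen.index(current):]
    return None

# T1 waits on T2 and T2 waits on T1, so neither can ever proceed.
print(find_deadlock({"T1": "T2", "T2": "T1", "T3": "T1"}))
# T3 is not waiting on anything, so the chain ends and no cycle exists.
print(find_deadlock({"T1": "T2", "T2": "T3"}))
```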
4. High Resource Usage
In many cases, a slowdown in system performance is not the result of a major software flaw but simply of the load on your system increasing without additional resources being available to handle it. Tracking resource usage lets you see when you need additional capacity so that you can, for example, start more server instances. The following search patterns can be used to assist with this type of issue:
- metric=/CPUUtilization/ AND minimum>X
- disk is at or near capacity
- not enough space on the disk
- java.io.IOException: No space left on device
- insufficient bandwidth
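The first pattern above combines a metric name with a numeric threshold. As a minimal sketch of that kind of check, the Python below parses key=value pairs from a log line and reports whether the minimum CPU utilization exceeds a chosen value. The key=value line layout, the field names other than those in the pattern, and the function names are assumptions made for this example.

```python
import re

# Assumed layout: space-separated key=value pairs (hypothetical format).
PAIR_RE = re.compile(r"(\w+)=(\S+)")

def parse_fields(line):
    """Turn 'a=1 b=2' into {'a': '1', 'b': '2'}."""
    return dict(PAIR_RE.findall(line))

def cpu_over_threshold(line, threshold):
    """True when the line is a CPUUtilization metric with minimum > threshold."""
    fields = parse_fields(line)
    if fields.get("metric") != "CPUUtilization":
        return False
    try:
        return float(fields["minimum"]) > threshold
    except (KeyError, ValueError):
        return False  # field missing or not numeric

# Made-up sample line for illustration only.
line = "metric=CPUUtilization minimum=82.5 maximum=97.1"
print(cpu_over_threshold(line, threshold=80))
```

Checking the minimum rather than the maximum means the alert only fires when utilization stays high for the whole reporting interval, which is less sensitive to short spikes.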
5. Database Issues
Knowing when a query failed is useful because it lets you identify situations where a request may have returned without the relevant data, and thus when users are not getting the data they need. A more subtle issue arises when a user gets the correct results but the results take a long time to return: technically the system may be fine and bug free, but a slow user experience may be hurting your top line.
Tracking slow queries shows you how your DB queries are performing. Setting acceptable thresholds for query time and reporting on anything that exceeds them can help you quickly identify when your users' experience is being affected. The following search patterns can be used to assist with this type of issue:
- SQL Timeout
- Long query
- Slow query
- WARNING: Query took longer than X
- Query_time > X
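As an illustration of the "Query_time > X" idea, the Python sketch below extracts the Query_time value from lines in the style of a MySQL slow query log header and returns the values above a threshold. The function name, the threshold and the sample lines are assumptions made for this example; other databases log query duration differently.

```python
import re

# Matches the duration in lines such as "# Query_time: 12.345678  Lock_time: ...".
QUERY_TIME_RE = re.compile(r"Query_time:\s*(\d+(?:\.\d+)?)")

def slow_query_times(lines, threshold_seconds):
    """Return the Query_time values (in seconds) that exceed the threshold."""
    times = []
    for line in lines:
        match = QUERY_TIME_RE.search(line)
        if match is None:
            continue  # not a query timing line
        seconds = float(match.group(1))
        if seconds > threshold_seconds:
            times.append(seconds)
    return times

# Made-up sample lines for illustration only.
sample_log = [
    "# Query_time: 0.250000  Lock_time: 0.000100 Rows_sent: 1  Rows_examined: 10",
    "# Query_time: 12.500000  Lock_time: 0.000200 Rows_sent: 10  Rows_examined: 500000",
    "SELECT * FROM orders;",
]
print(slow_query_times(sample_log, threshold_seconds=2.0))
```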