In this article series by Marcus Hirt and Marcus Lagergren, authors of Oracle JRockit: The Definitive Guide, you will learn:
- How to hunt down latency-related problems
- How to detect indications of a memory leak in an application
- How to use the operative set in the JRA latency analyzer component
In the earlier part of this article series, Working with JRockit Runtime Analyzer: A Sequel, we covered the following:
- Different ways to create a JRA recording
- How to find the hot spots in your application
- How to interpret memory-related information in JRA
The Code tab group contains information from the code generator and the method sampler. It consists of three tabs: the Overview, Hot Methods, and Optimizations tabs.
This tab aggregates information from the code generator with sample information from the code optimizer. This allows us to see which methods the Java program spends the most time executing. Again, this information is available virtually "for free", as the code generation system needs it anyway.
For CPU-bound applications, this tab is a good place to start looking for opportunities to optimize your application code. By CPU-bound, we mean an application for which the CPU is the limiting factor; with a faster CPU, the application would have a higher throughput.
In the first section, the number of exceptions thrown per second is shown. This number depends both on the hardware and on the application; faster hardware may execute an application more quickly, and consequently throw more exceptions. However, on identical setups, a higher value is always worse than a lower one. Recall that exceptions should be just that: exceptional, rare corner cases. As we have explained, the JVM typically gambles that they aren't occurring too frequently. If an application throws hundreds of thousands of exceptions per second, you should investigate why. Someone may be using exceptions for control flow, or there may be a configuration error. Either way, performance will suffer.
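To illustrate the control-flow anti-pattern mentioned above, here is a hypothetical sketch (the price-lookup scenario and method names are our own, not from any recording) contrasting exception-driven control flow with a plain check:

```java
import java.util.HashMap;
import java.util.Map;

public class ExceptionControlFlow {
    private static final Map<String, Integer> PRICES = new HashMap<>();
    static {
        PRICES.put("apple", 3);
    }

    // Anti-pattern: a missing key is an expected case, yet we pay for
    // constructing and unwinding an exception every time it occurs.
    static int priceViaException(String item) {
        try {
            return PRICES.get(item).intValue(); // NPE when the key is absent
        } catch (NullPointerException e) {
            return 0;
        }
    }

    // Better: handle the expected case with an ordinary check.
    static int priceViaCheck(String item) {
        Integer p = PRICES.get(item);
        return p != null ? p : 0;
    }

    public static void main(String[] args) {
        System.out.println(priceViaException("pear")); // slow path for a common case
        System.out.println(priceViaCheck("pear"));     // cheap path, same answer
    }
}
```

Both methods return the same results, but under load the first one would show up in JRA as a steady stream of thrown exceptions.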
In JRockit Mission Control 3.1, the recording will only provide information about how many exceptions were thrown. The only way to find out where the exceptions originated is unfortunately by changing the verbosity of the log.
An overview of where the JVM spends most of the time executing Java code can be found in the Hot Packages and Hot Classes sections. The only difference between them is the way the sample data from the JVM code optimizer is aggregated. In Hot Packages, hot executing code is sorted on a per-package basis and in Hot Classes on a per-class basis. For more fine-grained information, use the Hot Methods tab.
As shown in the example screenshot, most of the time is spent executing code in the weblogic.servlet.internal package. There is also a fair amount of exceptions being thrown.
This tab provides a detailed view of the information provided by the JVM code optimizer. If the objective is to find a good candidate method for optimizing the application, this is the place to look. If a lot of the method samples are from one particular method, and a lot of the method traces through that method share the same origin, much can potentially be gained by either manually optimizing that method or by reducing the amount of calls along that call chain.
In the following example, much of the time is spent in the method com.bea.wlrt.adapter.defaultprovider.internal.CSVPacketReceiver.parseL2Packet(). It seems likely that the best way to improve the performance of this particular application would be to optimize a method internal to the application container (WebLogic Event Server) and not the code in the application itself, running inside the container. This illustrates both the power of the JRockit Mission Control tools and a dilemma that the resulting analysis may reveal—the answers provided sometimes require solutions beyond your immediate control.
Sometimes, the information provided may cause us to reconsider the way we use data structures. In the next example, the program frequently checks if an object is in a java.util.LinkedList. This is a rather slow operation that is proportional to the size of the list (time complexity O(n)), as it potentially involves traversing the entire list, looking for the element. Changing to another data structure, such as a HashSet, would almost certainly speed up the check, making the time complexity constant (O(1)) on average, given that the hash function is good enough and the set is large enough.
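A minimal sketch of that difference (the data and helper method names are our own illustration):

```java
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

public class ContainsCheck {
    // O(n): LinkedList.contains may traverse the entire list per lookup.
    static boolean inLinkedList(List<Integer> list, int value) {
        return list.contains(value);
    }

    // O(1) on average: HashSet.contains is a single hash probe,
    // assuming a reasonable hash function and load factor.
    static boolean inHashSet(Set<Integer> set, int value) {
        return set.contains(value);
    }

    public static void main(String[] args) {
        List<Integer> list = new LinkedList<>();
        for (int i = 0; i < 100_000; i++) {
            list.add(i);
        }
        Set<Integer> set = new HashSet<>(list); // one-time conversion cost

        // Same answer, very different cost per lookup.
        System.out.println(inLinkedList(list, 99_999)); // worst case: full traversal
        System.out.println(inHashSet(set, 99_999));     // single probe
    }
}
```

The one-time cost of building the set is quickly amortized when membership checks dominate, which is exactly the situation the method sampler exposed here.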
This tab shows various statistics from the JIT-compiler. The information in this tab is mostly of interest when hunting down optimization-related bugs in JRockit. It shows how much time was spent doing optimizations as well as how much time was spent JIT-compiling code at the beginning and at the end of the recording. For each method optimized during the recording, native code size before and after optimization is shown, as well as how long it took to optimize the particular method.
The Thread/Locks tab group contains tabs that visualize thread- and lock-related data. There are five such tabs in JRA: the Overview, Thread, Java Locks, JVM Locks, and Thread Dumps tabs.
The Overview tab shows fundamental thread and hardware-related information, such as the number of hardware threads available on the system and the number of context switches per second.
A dual-core CPU has two hardware threads, and a hyperthreaded core also counts as two hardware threads. That is, a dual-core CPU with hyperthreading will be displayed as having four hardware threads.
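If you want to check this count from within your own Java code, the standard library exposes the number of logical processors visible to the JVM:

```java
public class HardwareThreads {
    public static void main(String[] args) {
        // Returns the number of logical processors ("hardware threads")
        // available to the JVM. A dual-core CPU with hyperthreading
        // would typically report 4 here.
        int hwThreads = Runtime.getRuntime().availableProcessors();
        System.out.println("Hardware threads: " + hwThreads);
    }
}
```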
A high number of context switches per second may not be a real problem, but better synchronization behavior may lead to better total throughput in the system.
There is a CPU graph showing both the total CPU load on the system and the CPU load generated by the JVM. A saturated CPU is usually a good thing; you are fully utilizing the hardware on which you spent a lot of money. As previously mentioned, for some CPU-bound applications, for example batch jobs, it is normally a good thing for the system to be completely saturated during the run. However, for a standard server-side application, it is probably more beneficial if the system is able to handle some extra load in addition to the expected one.
The hardware provisioning problem is not simple, but normally server-side systems should have some spare computational power for when things get hairy. This is usually referred to as overprovisioning, and has traditionally just involved buying faster hardware. Virtualization has given us exciting new ways to handle the provisioning problem.
This tab shows a table where each row corresponds to a thread. The tab has more to offer than first meets the eye. By default, only the start time, the thread duration, and the Java thread ID are shown for each thread. More columns can be made visible by changing the table properties. This can be done either by clicking on the Table Settings icon, or by using the context menu in the table.
As can be seen in the example screenshot, information such as the thread group that the thread belongs to, allocation-related information, and the platform thread ID can also be displayed. The platform thread ID is the ID assigned to the thread by the operating system, in case we are working with native threads. This information can be useful if you are using operating system-specific tools together with JRA.
This tab displays information on how Java locks have been used during the recording. The information is aggregated per type (class) of monitor object.
This tab is normally empty. You need to start JRockit with the system property jrockit.lockprofiling set to true, for the lock profiling information to be recorded. This is because lock profiling may cause anything from a small to a considerable overhead, especially if there is a lot of synchronization.
With recent changes to the JRockit thread and locking model, it would be possible to dynamically enable lock profiling. This is unfortunately not the case yet, not even in JRockit Flight Recorder.
For R28, the system property jrockit.lockprofiling has been deprecated and replaced with the flag -XX:UseLockProfiling.
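As a sketch of how these options are passed at startup (the application jar name is a placeholder, and the exact flag syntax may vary between JRockit releases, so consult the documentation for your release):

```shell
# Pre-R28: enable lock profiling via the system property.
java -Djrockit.lockprofiling=true -jar myapp.jar

# R28 and later: the system property is deprecated in favor of the -XX flag.
java -XX:UseLockProfiling -jar myapp.jar
```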
This tab contains information on JVM internal native locks. This is normally useful for the JRockit JVM developers and for JRockit support.
An example of a native lock would be the code buffer lock that the JVM acquires in order to emit compiled methods into a native code buffer. This is done to ensure that no other code generation threads interfere with that particular code emission.
The JRA recordings normally contain thread dumps from the beginning and the end of the recording. By changing the Thread dump interval parameter in the JRA recording wizard, more thread dumps can be made available at regular intervals throughout the recording.
The latency tools were introduced as a companion to JRockit Real Time. The predictable garbage collection pause times offered by JRockit Real Time created the need for a tool to help developers hunt down latencies introduced by the Java application itself, as opposed to those introduced by the JVM. It is not enough to guarantee that the GC does not halt execution for more than a millisecond if the application itself blocks for hundreds of milliseconds when, for example, waiting for I/O.
When working with the tabs in the Latency tab group, we strongly recommend switching to the Latency perspective. The switch can be made from the menu Window | Show Perspective. In the Latency perspective, two new views are available to the left: the Event Types view and the Properties view. The Event Types view can be used to select which latency events to examine, and the Properties view shows detailed information on any selected event.
As in the GCs tab, at the top of all the latency tabs is the Range Selector. The range selector allows you to temporarily select a subset of the events and examine only that subset. Any changes to either the Event Types view or the range selector will be instantly reflected in the tab. The graph in the backdrop of the range selector, normally colored red, shows the normalized number of events at each point in time. The range selector also shows the CPU load as a black graph rendered over the event bars. What is displayed can be configured using the range selector context menu.
The operative set is an important concept to understand when examining latencies and subsets of latencies. It is a set of events that can be added to and removed from using the different tabs in the Latency tab group. Think of it as a collection of events that can be brought from one tab to another and modified using the views in the different tabs. Understanding and using the operative set is very important for getting the most out of the latency analysis tool.
The Overview tab provides an aggregated view of the latency events in the recording. At the top, just under the range selector, the Latency Threshold used during the recording can be seen. As potentially very large volumes of data can be recorded, the Latency Threshold is normally used to limit the amount of events so that only the ones longer than a certain threshold are actually recorded. The default is to only record latency events that are longer than 20 milliseconds.
A latency event is any time slice longer than a preset value that the JVM spends not executing Java code. For example, it may instead be waiting for data on a socket or performing a phase of garbage collection that stops the world.
The Event Types histogram and accompanying pie chart show breakdowns per event type. This is useful to quickly get a feel for what kind of events were recorded and their proportions. Note that the pie chart shows the number of events logged, not the total time that the events took. A tip for finding out where certain types of events are occurring is to use the context menu in the Event Types histogram to add the events of a certain type to the operative set. This causes the events of that type to be colored turquoise in the range selector.
Another tip is to go to the Traces view and click on Show only Operative Set, to see the stack traces that led up to the events of that particular type.
The latency Log tab shows all the available events in a table. The table can be used for filtering events, for sorting them, and for showing the events currently in the operative set. It is mostly used to quickly sort the events on duration to find the longest ones. Sometimes, the longest latencies may be due to socket accepts or other blocking calls that would actually be expected to take a long time. In such cases, this table can be used to quickly find and remove such events from the operative set, concentrating only on the problematic latencies.
The latency Graph tab contains a graph where the events are displayed, aggregated per thread. In the following example, we only show the events from the Java and Garbage Collector event types. The normal rendering procedure for the graph is that each event producer gets its own per-thread lane where the events for that particular producer are displayed. For each thread, only one event can occur in a lane at any given time. Garbage collection events are drawn in a slightly different way—they are always shown at the top of the latency event graph, and each GC is highlighted as a backdrop across all the other threads. This is useful for seeing if a certain latency event was caused by a garbage collection. If part of a lane in the latency graph is green, it means that the thread is happily executing Java code. Any other color means that a latency event is taking place at that point in time. To find out what the color means, either hover the mouse over an event and read the tooltip, or check the colors in the Event Types view.
The latency Graph can also show thread transitions. The application in the next screenshot has an exaggerated, pathological behavior: all threads in the application share a logger through a static field, and the logger uses a synchronized resource for writing log data. As can be seen, the worker threads are all waiting on the shared resource. The worker threads, which are there to execute parts of the work in parallel, all rely on the shared logger, and this actually causes the entire application to execute the work sequentially instead.
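That pattern can be sketched as follows. This is our own reduction, not the actual application from the screenshot: worker threads whose only "work" funnels through one synchronized logger, so the threads effectively run one at a time.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SharedLoggerBottleneck {
    static final Object LOG_LOCK = new Object();
    static final AtomicInteger LOG_CALLS = new AtomicInteger();

    // Simulates a slow write to a shared, synchronized log resource.
    static void log(String msg) {
        synchronized (LOG_LOCK) {
            LOG_CALLS.incrementAndGet();
            try { Thread.sleep(2); } catch (InterruptedException ignored) { }
        }
    }

    // Starts nThreads workers that each log `steps` times. Although the
    // workers run on separate threads, every step funnels through LOG_LOCK,
    // so the work is serialized; in JRA this shows up as worker lanes
    // dominated by blocked (non-green) time.
    static int runWorkers(int nThreads, int steps) throws InterruptedException {
        LOG_CALLS.set(0);
        Thread[] threads = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            threads[i] = new Thread(() -> {
                for (int s = 0; s < steps; s++) {
                    log("step " + s); // every step contends for the logger
                }
            }, "worker-" + i);
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return LOG_CALLS.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("log calls: " + runWorkers(4, 10));
    }
}
```

A per-thread log buffer, or a logging framework with asynchronous appenders, would remove the shared lock and let the workers actually run in parallel.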
If you see an event that you want to get more information on, you can either select it in the graph and study it in the properties view, hover the mouse pointer over it and read the tooltip that pops up, or add it to the operative set for closer examination in one of the other views.
To summarize, the Graph view is good for providing an overview of the latency and thread behavior of the application.
The Threads tab is basically a table containing the threads in the system, much like the threads tab in the Management Console. The most useful properties in this table tend to be the event Count and the Allocation Rate per thread.
This view is mainly used to select a specific thread, either based on an attribute, such as the one with the highest allocation rate, or for finding a specific thread by name. The operative set is then used to select the events for that particular thread, so that the events can be studied in the other views. Sets of threads can also be selected.
This is where the stack traces for sets of events get aggregated. For instance, if the operative set contains the set of allocation events for String arrays, this is the tab to go to for finding out where those object allocations take place. In the next example, only the traces for the operative set are visible. As can be seen, the Count and Total Latency columns have bar chart backdrops that show how large the value in each cell is relative to the maximum value among the children on the same level. Everything is normalized with respect to that maximum value.
Normally, while analyzing a recording, you end up in this tab once you have used the other tabs to filter out the set of events that you are interested in. This tab reveals where in the code the events are originating from.
This very powerful tab is used to build histograms of the events. The histogram can be built around any event value. A few examples of useful ones are:
- Object Allocation | Class Name: For finding out what types of allocations are most common
- Java Blocked | Lock Class: For finding out which locks are most frequently blocking in the application
- Java Wait | Lock Class: For finding out where the application spends the most time in wait()
Once the interesting events have been found, they are usually added to the operative set and brought to another tab for further investigation.
Using the Operative Set
Sadly, one of the most overlooked features in JRA is the operative set. This example will explain how the operative set can be used to filter out a very specific set of events to solve a specific problem. In this example, we have been given a JRA recording from another team. The team is unhappy with the garbage collections that take place in an application. Garbage collections seem to be performed too frequently and take too long to complete. We start by looking at the GCs tab in the Memory section.
The initial reaction from examining the garbage collections is that they are indeed quite long, tens of milliseconds, but they do not happen all that often. For the duration of the recording only five GCs took place. We could probably ask the clients to switch to deterministic GC. This would cause more frequent but much shorter garbage collections. However, as we are curious, we would like to get back to them with more than just a GC strategy recommendation. We would still like to know where most of the pressure on the memory system is being created.
As this is an example, we'll make this slightly more complicated than necessary, just to hint at the power of JRA. We switch to the Threads tab in the Latency tab group and look for the most allocation-intensive threads. Then we add the events for these threads to the Operative Set. It seems that almost all allocation is done in the top three threads.
We have now added all the events for the three threads to our Operative Set, but we are only interested in studying the allocation behavior. In order to do this, we move to the Histogram tab and build a histogram for the allocation events in our Operative Set.
Note that object allocations are weird latency events, as we are normally not interested in their duration, but rather in the number of occurrences. We therefore sort them on event count and see that most of the object allocation events in our three threads are caused by instantiating Strings. As we're eager to find out where these events take place, we limit the operative set to the String allocation events.
We proceed to the Traces view with our trimmed Operative Set, and once again check the Show only Operative Set button.
We quickly see that almost all the String instances in our three most allocation-intensive threads are allocated by the same method: readPaddedASCIIString(char[], int). If we also add char arrays, we can see that the char arrays are mostly created as a result of allocating the String objects in the same method. This makes sense, as each String wraps one char array.
We can now recommend that the team not only change GC strategies, but we can also say for sure that one way to drastically reduce the pressure on the memory system in this particular application would be to either create fewer strings in the method at the top of the Traces view, or to reduce the number of calls along the path to that method.
Sometimes, there will be no data in the Code | Hot methods tab, even if the method sampling was enabled. The most common reason is that the application did not generate enough load to provide the method sampler with data. The method sampler only samples threads that are actively running Java code. If the recording is short and the threads are spending most of their time waiting and doing nothing, chances are that no samples will be recorded at all. If you aren't profiling a production system, try to stress your application during profiling, if possible in a manner that is close to the real deployment scenario.
When using native method sampling, all samples will be stored. Normally, only the samples for threads executing Java will be stored. Don't be surprised if you find almost all of the native samples to be in some native method, such as ntdll.dll#KiFastSystemCallRet, even when the system is mostly idle.
In this article series, we showed how to use the JRockit Runtime Analyzer to analyze the behavior of the JRockit runtime and the application running in JRockit.
We showed various different ways to create recordings. We also discussed the information available in JRA recordings with examples.
Several use cases were demonstrated, such as:
- How to find methods that would make good candidates for manual optimization, also known as hot spots
- How to interpret memory-related information, such as the live set, garbage collection-related information, fragmentation, the object summary, and the heap histogram
- How to hunt down latency-related problems
We also provided an example on how to use the Operative Set in the latency analyzer to quickly narrow down the amount of data being studied and to focus on the data of interest.
About the Authors:
Marcus Hirt is one of the founders of Appeal Virtual Machines, the company that created the JRockit Java Virtual Machine. He is currently working as Team Lead, Engineering Manager, and Architect for the JRockit Mission Control team. In his spare time, he enjoys coding on one of his many pet projects, composing music, and scuba diving. Marcus has been an appreciated speaker at Oracle Open World, eWorld, BEAWorld, EclipseCon, Nordev, and the Expert Zone Developer Summit, and has contributed JRockit-related articles and webinars to the JRockit community. He holds an M.Sc. in computer science from the Royal Institute of Technology in Stockholm.
Marcus Lagergren holds an M.Sc. in computer science from the Royal Institute of Technology in Stockholm, Sweden. He has a background in computer security but has worked with runtimes since 1999. He was one of the founding members of Appeal Virtual Machines, the company that developed JRockit. Marcus has been team lead and architect for the JRockit code generators and has been involved in pretty much every other aspect of the JRockit JVM internals. Since 2007, he has worked for Oracle on fast virtualization technology. Marcus lives in Stockholm with his wife and two daughters. He likes power tools, heavy metal, and scuba diving.