Oracle JRockit: The Definitive Guide — Save 50%
Develop and manage robust Java applications with Oracle's high-performance JRockit Java Virtual Machine with this Oracle book and eBook
In this article series by Marcus Hirt and Marcus Lagergren, authors of Oracle JRockit: The Definitive Guide, you will learn:
- Different ways to create a JRA recording
- How to find the hot spots in your application
- How to interpret memory-related information in JRA
In the sequel, "Working with JRockit Runtime Analyzer - A Sequel", we will cover the following:
- How to hunt down latency-related problems
- How to detect indications of a memory leak in an application
- How to use the operative set in the JRA latency analyzer component
The JRockit Runtime Analyzer, or JRA for short, is a JRockit-specific profiling tool that provides information about both the JRockit runtime and the application running in JRockit. JRA was the main profiling tool for JRockit R27 and earlier, but has been superseded in later versions by the JRockit Flight Recorder. Because of its extremely low overhead, JRA is suitable for use in production.
This article is mainly targeted at R27.x/3.x versions of JRockit and Mission Control.
The need for feedback
In order to make JRockit an industry-leading JVM, there has been a great need for customer collaboration. As the focus for JRockit consistently has been on performance and scalability in server-side applications, the closest collaboration has been with customers with large server installations. An example is the financial industry. The birth of the JRockit Runtime Analyzer, or JRA, originally came from the need for gathering profiling information on how well JRockit performed at customer sites.
One can easily understand that customers were rather reluctant to send us, for example, their latest proprietary trading applications to play with in our labs. And, of course, allowing us to poke around in a customer's mission critical application in production was completely out of the question. Some of these applications shuffle around billions of dollars per week. We found ourselves in a situation where we needed a tool to gather as much information as possible on how JRockit, and the application running on JRockit, behaved together; both to find opportunities to improve JRockit and to find erratic behavior in the customer application. This was a bit of a challenge, as we needed to get high quality data. If the information was not accurate, we would not know how to improve JRockit in the areas most needed by customers or perhaps at all. At the same time, we needed to keep the overhead down to a minimum. If the profiling itself incurred significant overhead, we would no longer get a true representation of the system. Also, with anything but near-zero overhead, the customer would not let us perform recordings on their mission critical systems in production.
JRA was invented as a method of recording information in a way that the customer could feel confident with, while still providing us with the data needed to improve JRockit. The tool was eventually widely used within our support organization to both diagnose problems and as a tuning companion for JRockit.
In the beginning, a simple XML format was used for our runtime recordings. A human-readable format made it simple to debug, and the customer could easily see what data was being recorded. Later, the format was upgraded to include data from a new recording engine for latency-related data. When the latency data came along, the data format for JRA was split into two parts, the human-readable XML and a binary file containing the latency events. The latency data was put into JRockit internal memory buffers during the recording, and to avoid introducing unnecessary latencies and performance penalties that would surely be incurred by translating the buffers to XML, it was decided that the least intrusive way was to simply dump the buffers to disk.
To summarize, recordings come in two different flavors having either the .jra extension (recordings prior to JRockit R28/JRockit Mission Control 4.0) or the .jfr (JRockit Flight Recorder) extension (R28 or later). Prior to the R28 version of JRockit, the recording files mainly consisted of XML without a coherent data model. As of R28, the recording files are binaries where all data adheres to an event model, making it much easier to analyze the data.
To open a JRA recording, JRockit Mission Control version 3.x must be used. To open a Flight Recorder recording, JRockit Mission Control version 4.0 or later must be used.
The recording engine that starts and stops recordings can be controlled in several different ways:
- By using the JRCMD command-line tool.
- By using the JVM command-line parameters. For more information on this, see the -XXjra parameter in the JRockit documentation.
- From within the JRA GUI in JRockit Mission Control.
The easiest way to control recordings is to use the JRA/JFR wizard from within the JRockit Mission Control GUI. Simply select the JVM on which to perform a JRA recording in the JVM Browser and click on the JRA button in the JVM Browser toolbar. You can also click on Start JRA Recording from the context menu. Usually, one of the pre-defined templates will do just fine, but under special circumstances it may be necessary to adjust them. The pre-defined templates in JRockit Mission Control 3.x are:
- Full Recording: This is the standard use case. By default, it is configured to do a five minute recording that contains most data of interest.
- Minimal Overhead Recording: This template can be used for very latency-sensitive applications. It will, for example, not record heap statistics, as the gathering of heap statistics will, in effect, cause an extra garbage collection at the beginning and at the end of the recording.
- Real Time Recording: This template is useful when hunting latency-related problems, for instance when tuning a system that is running on JRockit Real Time. This template provides an additional text field for setting the latency threshold. The latency threshold is explained later in the article in the section on the latency analyzer. The threshold is by default lowered to 5 milliseconds for this type of recording, from the default 20 milliseconds, and the default recording time is longer.
- Classic Recording: This resembles a classic JRA recording from earlier versions of Mission Control. Most notably, it will not contain any latency data. Use this template with JRockit versions prior to R27.3 or if there is no interest in recording latency data.
All recording templates can be customized by checking the Show advanced options check box. This is usually not needed, but let's go through the options and why you may want to change them:
- Enable GC sampling: This option selects whether or not GC-related information should be recorded. It can be turned off if you know that you will not be interested in GC-related information. It is on by default, and it is a good idea to keep it enabled.
- Enable method sampling: This option enables or disables method sampling. Method sampling is implemented by using sample data from the JRockit code optimizer. If profiling overhead is a concern (it is usually very low, but still), it is usually a good idea to use the Method sample interval option to control how much method sampling information to record.
- Enable native sampling: This option determines whether or not to attempt to sample time spent executing native code as a part of the method sampling. This feature is disabled by default, as it is mostly used by JRockit developers and support. Most Java developers probably do fine without it.
- Hardware method sampling: On some hardware architectures, JRockit can make use of special hardware counters in the CPU to provide higher resolution for the method sampling. This option only makes sense on such architectures.
- Stack traces: Use this option to not only get sample counts but also stack traces from method samples. If this is disabled, no call traces are available for sample points in the methods that show up in the Hot Methods list.
- Trace depth: This setting determines how many stack frames to retrieve for each stack trace. For JRockit Mission Control versions prior to 4.0, this defaulted to the rather limited depth of 16. For applications running in application containers or using large frameworks, this is usually way too low to generate data from which any useful conclusions can be drawn. A tip, when profiling such an application, would be to bump this to 30 or more.
- Method sampling interval: This setting controls how often thread samples should be taken. JRockit will stop a subset of the threads every Method sample interval milliseconds in a round robin fashion. Only threads that are executing when the sample is taken will be counted, not blocked threads. Use this to find out where the computational load in an application takes place. See the Hot Methods section for more information.
- Thread dumps: When enabled, JRockit will record a thread stack dump at the beginning and the end of the recording. If the Thread dump interval setting is also specified, thread dumps will be recorded at regular intervals for the duration of the recording.
- Thread dump interval: This setting controls how often, in seconds, to record the thread stack dumps mentioned earlier.
- Latencies: If this setting is enabled, the JRA recording will contain latency data. For more information on latencies, please refer to the section Latency later in this article.
- Latency threshold: To limit the amount of data in the recording, it is possible to set a threshold for the minimum latency (duration) required for an event to actually be recorded. This is normally set to 20 milliseconds. It is usually safe to lower this to around 1 millisecond without incurring too much profiling overhead. Less than that and there is a risk that the profiling overhead will become unacceptably high and/or that the file size of the recording becomes unmanageably large. Latency thresholds can be set as low as nanosecond values by changing the unit in the unit combo box.
- Enable CPU sampling: When this setting is enabled, JRockit will record the CPU load at regular intervals.
- Heap statistics: This setting causes JRockit to do a heap analysis pass at the beginning and at the end of the recording. As heap analysis involves forcing extra garbage collections at these points in order to collect information, it is disabled in the low overhead template.
- Delay before starting a recording: This option can be used to schedule the recording to start at a later time. The delay is normally defined in minutes, but the unit combo box can be used to specify the time in a more appropriate unit — everything from seconds to days is supported.
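The effect of the Latency threshold option described above can be illustrated with a small sketch. It is not JRockit code: the LatencyEvent class, the event names, and the choice to keep events whose duration equals the threshold are all assumptions made for illustration. It only shows why lowering the threshold from 20 milliseconds to 5 milliseconds increases the amount of recorded data.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical event type used only for this illustration; not a JRockit class. */
final class LatencyEvent {
    final String name;
    final long durationNanos;

    LatencyEvent(String name, long durationNanos) {
        this.name = name;
        this.durationNanos = durationNanos;
    }
}

final class LatencyThresholdSketch {
    /** Keeps only the events that lasted at least thresholdNanos (assumed inclusive). */
    static List<LatencyEvent> applyThreshold(List<LatencyEvent> events, long thresholdNanos) {
        List<LatencyEvent> kept = new ArrayList<LatencyEvent>();
        for (LatencyEvent e : events) {
            if (e.durationNanos >= thresholdNanos) {
                kept.add(e);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<LatencyEvent> events = new ArrayList<LatencyEvent>();
        events.add(new LatencyEvent("hypothetical lock wait", 30000000L));  // 30 ms
        events.add(new LatencyEvent("hypothetical socket read", 7000000L)); // 7 ms
        events.add(new LatencyEvent("hypothetical short wait", 500000L));   // 0.5 ms

        // Default threshold of 20 ms versus the 5 ms used by the Real Time template.
        System.out.println("kept at 20 ms: " + applyThreshold(events, 20000000L).size());
        System.out.println("kept at 5 ms:  " + applyThreshold(events, 5000000L).size());
    }
}
```

The lower the threshold, the more events pass the filter, which is why both profiling overhead and recording size grow as the threshold shrinks.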
Before starting the recording, a location to which the finished recording is to be downloaded must be specified. Once the JRA recording is started, an editor will open up showing the options with which the recording was started and a progress bar. When the recording is completed, it is downloaded and the editor input is changed to show the contents of the recording.
Analyzing JRA recordings
Analyzing JRA recordings may easily seem like black magic to the uninitiated, so just like we did with the Management Console, we will go through each tab of the JRA editor to explain the information in that particular tab, with examples on when it is useful.
Just like in the console, there are several tabs in different tab groups.
The tabs in the General tab group provide views of key characteristics and recording metadata. In JRA, there are three tabs—Overview, Recording, and System.
The first tab in the General tab group is the Overview tab. This tab contains an overview of selected key information from the recording. The information is useful for checking the system health at a glance.
The first section in the tab is a dial dashboard that contains CPU usage, heap, and pause time statistics.
What to look for depends on the system. Ideally the system should be well utilized, but not saturated. A good rule of thumb for most setups would be to keep the Occupied Heap (Live Set + Fragmentation) to half or less than half of the max heap. This keeps the garbage collection ratio down.
All this, of course, depends on the type of application. For an application with very low allocation rates, the occupied heap can be allowed to be much larger. An application that does batch calculations, concerned with throughput only, would want the CPU to be fully saturated while garbage collection pause times may not be a concern at all.
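As a rough way to check the half-of-max-heap rule of thumb from inside a running application, the standard java.lang.management API can be used. The following is a minimal sketch, not what JRA itself does. The class and method names are ours, and the used heap value reported by the JVM includes garbage that has not yet been collected, so it tends to overestimate Live Set + Fragmentation as measured after a garbage collection.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

final class HeapOccupancySketch {
    /** Fraction of the maximum heap in use, or -1 if the maximum is undefined. */
    static double occupiedFraction(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            return -1.0; // MemoryUsage.getMax() returns -1 when no maximum is defined
        }
        return (double) usedBytes / (double) maxBytes;
    }

    /** The rule of thumb: occupied heap at half of the max heap or less. */
    static boolean withinRuleOfThumb(long usedBytes, long maxBytes) {
        double fraction = occupiedFraction(usedBytes, maxBytes);
        return fraction >= 0.0 && fraction <= 0.5;
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println("used bytes: " + heap.getUsed());
        System.out.println("max bytes:  " + heap.getMax());
        System.out.println("within rule of thumb: "
                + withinRuleOfThumb(heap.getUsed(), heap.getMax()));
    }
}
```

As noted above, an application with a very low allocation rate can tolerate a much higher fraction, so the 0.5 limit is a starting point rather than a hard requirement.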
The Trends section shows charts for the CPU usage and occupied heap over time so that trends can be spotted. Next to the Trends section is a pie chart showing heap usage at the end of the recording. If more than about a third of the memory is fragmented, some time should probably be spent tuning the JRockit garbage collector (see Benchmarking and Tuning and the JRockit Diagnostics Guide on the Internet). It may also be the case that the allocation behavior of the application needs to be investigated. See the Histogram section for more information.
At the bottom of the page is some general information about the recording, such as the version information for the recorded JVM. Version information is necessary when filing support requests.
In our example, we can see that the trend for Live Set + Fragmentation is constantly increasing. This basically means that after each garbage collection, there is less free memory left on the heap. It is very likely that we have a memory leak, and that, if we continue to let this application run, we will end up with an OutOfMemoryError.
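A constantly increasing trend can also be checked numerically. The sketch below is an illustration with made-up sample values, not part of JRA. It fits a least-squares line through the occupied heap size measured after each garbage collection. A clearly positive slope over a long enough period is an indication of a leak, although it is not proof of one.

```java
final class LiveSetTrendSketch {
    /**
     * Least-squares slope of y over x. Here x is the time of each garbage
     * collection in seconds and y is the occupied heap in MB after it.
     */
    static double slope(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two matching samples");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double numerator = 0.0;
        double denominator = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            numerator += dx * (y[i] - meanY);
            denominator += dx * dx;
        }
        if (denominator == 0.0) {
            throw new IllegalArgumentException("all sample times are identical");
        }
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // Made-up samples: occupied heap grows by 20 MB every 10 seconds.
        double[] seconds = {0, 10, 20, 30};
        double[] occupiedMb = {100, 120, 140, 160};
        System.out.println("growth in MB per second: " + slope(seconds, occupiedMb));
    }
}
```

Dividing the remaining free heap by the slope gives a crude estimate of how long the application can run before the heap is exhausted, assuming the growth stays linear.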
The Recording tab contains meta information about the recording, such as its duration and the values of all the recording parameters used. This information can, among other things, be used to check if information is missing from a recording, or if that particular piece of information had simply been disabled for the recording.
The System tab contains information about the system the JRockit JVM was running on, such as the OS. The JVM arguments used to start the JVM can also be viewed here.
The Memory tab group contains tabs that deal with memory-related information, such as heap usage and garbage collections. In JRA there are six such tabs: Overview, GCs, GC Statistics, Allocation, Heap Contents, and Object Statistics.
The first tab in the Memory tab group is the Overview tab. It shows an overview of the key memory statistics, such as the physical memory available on the hardware on which the JVM was running. It also shows the GC pause ratio, i.e. the time spent paused in GC in relation to the duration of the entire recording.
If the GC pause ratio is higher than 15-20%, it usually means that there is significant allocation pressure on the JVM.
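The GC pause ratio is simple arithmetic: the sum of all pause times divided by the duration of the recording. A minimal sketch with made-up numbers follows. The class and method names are ours and not part of any JRockit API.

```java
final class GcPauseRatioSketch {
    /** Total time spent paused in GC divided by the recording duration. */
    static double pauseRatio(long[] pauseMillis, long recordingMillis) {
        if (recordingMillis <= 0) {
            throw new IllegalArgumentException("recording duration must be positive");
        }
        long totalPause = 0;
        for (long pause : pauseMillis) {
            totalPause += pause;
        }
        return (double) totalPause / (double) recordingMillis;
    }

    public static void main(String[] args) {
        // Made-up values: three pauses during a 10 second recording.
        long[] pauses = {200, 300, 500};
        double ratio = pauseRatio(pauses, 10000);
        System.out.println("GC pause ratio: " + (ratio * 100.0) + "%");
        System.out.println("above 15%: " + (ratio > 0.15));
    }
}
```

With these numbers the application was paused for one second out of ten, which is below the 15-20% range mentioned above.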
At the bottom of the Overview tab, there is a listing of the different garbage collection strategy changes that have occurred during recording.
In the GCs tab, you can find all the information you would ever want to know about the garbage collections that occurred during the recording, and probably more.
With the GCs tab, it is usually a good idea to sort the Garbage Collections table on the attribute Longest Pause, unless you know exactly at what time from the start of the JVM you want to drill down. You might know this from reading the application log or from the information in some other tab in JRA. In the following example, the longest pause also happens to be the first one.
It is sometimes a good idea to leave out the first and last garbage collections from the analysis, depending on the recording settings. Some settings will force the first and last GC in the recording to be full garbage collections with exceptional compaction, to gather extra data. This may very well break the pause time target for deterministic GC. This is also true for JRockit Flight Recorder.
At the top of the screen is the Range Selector. The Range Selector is used to temporally select a set of events in the recording. In this case, we have zoomed in on a few of the events at the beginning of the recording. We can see that throughout this range, the size of the occupied heap (the lowest line, which shows up in green) is around half the committed heap size (the flat topmost line, which shows up in blue), with some small deviations.
In an application with a very high pause-to-run ratio, the occupied heap would have been close to the max heap. In that case, increasing the heap would probably be a good start to increase performance. There are various ways of increasing the heap size, but the easiest is simply setting a maximum heap size using the -Xmx flag on the command line. In the example, however, everything concerning the heap usage seems to be fine.
In the Details section, there are various tabs with detailed information about a selected garbage collection. A specific garbage collection can be examined more closely, either by clicking in the GC chart or by selecting it in the table.
Information about the reason for a particular GC, reference queue sizes, and heap usage information is included in the tabs in the Details section. Verbose heap information about the state before and after the recording, the stack trace for the allocation that caused the GC to happen, if available, and detailed information about every single pause part can also be found here.
In the previous screenshot, a very large portion of the GC pause is spent handling the reference queues. Switching to the References and Finalizers chart will reveal that the finalizer queue is the one with most objects in it.
One way to improve the memory performance for this particular application would be to rely less heavily on finalizers.
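One common way to rely less on finalizers is to release resources explicitly instead of in finalize(). The following is a minimal sketch, assuming a hypothetical PooledResourceSketch class. It is not taken from the recorded application, whose code is not shown here. Because the class declares no finalizer, its instances never have to pass through the finalizer queue during garbage collection.

```java
import java.io.Closeable;

/** Hypothetical resource holder that is cleaned up explicitly, not by finalize(). */
final class PooledResourceSketch implements Closeable {
    private boolean open = true;

    /** Releases the resource deterministically. Safe to call more than once. */
    public void close() {
        open = false; // a real implementation would free the underlying resource here
    }

    boolean isOpen() {
        return open;
    }

    /** Uses a resource and always releases it; returns whether it is still open. */
    static boolean useAndRelease() {
        PooledResourceSketch resource = new PooledResourceSketch();
        try {
            // ... work with the resource ...
        } finally {
            resource.close(); // runs even if the work above throws
        }
        return resource.isOpen();
    }

    public static void main(String[] args) {
        System.out.println("still open after use: " + useAndRelease());
    }
}
```

The try/finally form works on all Java versions. On Java 7 and later, the same class could be used in a try-with-resources statement instead.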
The recordings shown in the GCs tab examples earlier were created with JRockit R27.1, but are quite good examples anyway, as they are based on real-life data that was actually used to improve a product. As can be seen from the screenshot, there is no information about the start time of the individual pause parts. Recordings made using a more recent version of JRockit would contain such information. We are continuously improving the data set and adding new events to recordings.
Following is a more recent recording with an obvious finalizer problem. The pause parts differ from the previous examples both because we are now using a different GC strategy and because more recent recordings contain more detail. The finalizer problem stands out quite clearly in the next screenshot.
The data in the screenshot is from a different application, but it nicely illustrates how a large portion of the garbage collection pause is spent following references in the finalizers. Handling the finalizers is taking even more time than the notorious synchronized external compaction. Finalizers are an obvious bottleneck.
To make fewer GCs happen altogether, we need to find out what is actually causing the GCs to occur. This means that we have to identify the points in the program where the most object allocation takes place. One good place to start looking is the GC Call Trees table introduced in the next section. If more specific allocation-related information is required, go to the Object Allocation events in the Latency tab group.
For some applications, we can lower the garbage collection pause times by tuning the JRockit memory system.
The GC Statistics tab contains some general statistics about the garbage collections that took place during the recording. One of the most interesting parts is the GC Call Trees table that shows an aggregated view of the stack traces for any garbage collection. Unfortunately, it shows JRockit-specific internal code frames as well, which means that you may have to dig down a few stack frames until the frames of interest, that is, code you can affect, are found.
Prior to version R27.6 of JRockit, this was one of the better ways of getting an idea of where allocation pressure originated. In more recent versions, there is a much more powerful way of doing allocation profiling, which will be described in the Histogram section.
In the interest of conserving space, only the JRockit internal frames up to the first non-internal frame have been expanded in the following screenshot. The information should be interpreted as follows: most of the GCs are caused by calls to Arrays.copyOf(char[], int) in the Java program.
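Which code calls Arrays.copyOf(char[], int) in the recorded program cannot be told from the screenshot alone. One common caller in JDK 6 class libraries is a StringBuilder growing its backing array, so the sketch below assumes that case purely for illustration. It shows how pre-sizing the builder avoids the repeated copies. The class and method names are ours.

```java
final class PresizedBuilderSketch {
    /** Starts with the default capacity, so the backing array is copied as it grows. */
    static String joinGrowing(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    /** Computes the final length first, so the backing array is allocated only once. */
    static String joinPresized(String[] parts) {
        int totalLength = 0;
        for (String part : parts) {
            totalLength += part.length();
        }
        StringBuilder sb = new StringBuilder(totalLength);
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] parts = {"GET ", "/index.html", " HTTP/1.1"};
        System.out.println(joinGrowing(parts));
        System.out.println(joinPresized(parts));
    }
}
```

Both methods produce the same string. The difference is only in how many intermediate arrays are allocated and later collected.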
The Allocation tab contains information that can be used mainly for tuning the JRockit memory system. Here, relative allocation rates of large and small objects are displayed, which affects the choice of the Thread Local Area (TLA) size. Allocation can also be viewed on a per-thread basis, which can help find out where to start tuning the Java program in order to make it stress the memory system less.
Again, a more powerful way of finding out where to start tuning the allocation behavior of a Java program is usually to work with the Latency | Histogram tab, described later in this article.
Information on the memory distribution of the heap can be found under the Heap Contents tab. The snapshot for this information is taken at the end of the recording. If you find that your heap is heavily fragmented, there are two choices: either try to tune JRockit to take better care of the fragmentation or try to change the allocation behavior of your Java application. The JVM combats fragmentation by doing compaction. In extreme cases, with large allocation pressure and high performance demands, you may have to change the allocation patterns of your application to get the performance you want.
The Object Statistics tab shows a histogram of what was on the heap at the beginning and at the end of the recording. Here you can find out what types (classes) of objects are using the most memory on the heap. If there is a large positive delta between the snapshots at the beginning and at the end of the recording, it means that there either is a memory leak or that the application was merely executing some large operation that required a lot of memory.
In the previous example, there is actually a memory leak that causes instances of Double to be held on to forever by the program. This will eventually cause an OutOfMemoryError.
The best way to find out where these instances are being created is to either check for Object Allocation events of Double (see the second example in the Histogram section) or to turn on allocation profiling in the Memory Leak Detector.
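The comparison that the Object Statistics tab presents can be sketched as a per-class difference between two histograms. The code below is an illustration only: the class names and byte counts are made up, and the maps stand in for the heap snapshots taken at the beginning and at the end of a recording.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class HistogramDeltaSketch {
    /** Bytes used per class at the end minus bytes used at the start. */
    static Map<String, Long> deltas(Map<String, Long> start, Map<String, Long> end) {
        Set<String> classNames = new HashSet<String>(start.keySet());
        classNames.addAll(end.keySet());

        Map<String, Long> result = new TreeMap<String, Long>();
        for (String name : classNames) {
            long before = start.containsKey(name) ? start.get(name) : 0L;
            long after = end.containsKey(name) ? end.get(name) : 0L;
            result.put(name, after - before);
        }
        return result;
    }

    /** Name of the class whose memory use grew the most, or null if there are none. */
    static String largestGrowth(Map<String, Long> start, Map<String, Long> end) {
        String best = null;
        long bestDelta = Long.MIN_VALUE;
        for (Map.Entry<String, Long> entry : deltas(start, end).entrySet()) {
            if (entry.getValue() > bestDelta) {
                bestDelta = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Long> start = new HashMap<String, Long>();
        start.put("java.lang.Double", 1000L);
        start.put("java.lang.String", 5000L);

        Map<String, Long> end = new HashMap<String, Long>();
        end.put("java.lang.Double", 900000L);
        end.put("java.lang.String", 5200L);

        System.out.println("deltas: " + deltas(start, end));
        System.out.println("largest growth: " + largestGrowth(start, end));
    }
}
```

A large delta only identifies the type that is growing. Finding the allocation site still requires the Object Allocation events or the Memory Leak Detector mentioned above.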
About the Authors:
Marcus Hirt is one of the founders of Appeal Virtual Machines, the company that created the Java Virtual Machine JRockit. He is currently working as Team Lead, Engineering Manager and Architect for the JRockit Mission Control team. In his spare time he enjoys coding on one of his many pet projects, composing music and scuba diving. Marcus has been an appreciated speaker at Oracle Open World, eWorld, BEAWorld, EclipseCon, Nordev and Expert Zone Developer Summit, and has contributed JRockit-related articles and webinars to the JRockit community. Marcus Hirt received his M.Sc. in computer science from the Royal Institute of Technology in Stockholm.
Marcus Lagergren holds an M.Sc. in computer science from the Royal Institute of Technology in Stockholm, Sweden. He has a background in computer security but has worked with runtimes since 1999. Marcus Lagergren was one of the founding members of Appeal Virtual Machines, the company that developed JRockit. Marcus has been team lead and architect for the JRockit code generators and has been involved in pretty much every other aspect of the JRockit JVM internals. Since 2007, Marcus has worked for Oracle on fast virtualization technology. Marcus lives in Stockholm with his wife and two daughters. He likes power tools, heavy metal and scuba diving.