# Using JDK FlightRecorder and JDK Mission Control

JDK Flight Recorder is the profiler you can use in production, continuously. Flight Recorder has been available in the JDK for a while, e.g. it shipped as part of JDK 8, but using it required setting specific commercial VM flags to unlock it; this is no longer necessary with Java 11.

This page assumes JDK 11, which means there is no need to set the `-XX:+UnlockCommercialFeatures` and `-XX:+FlightRecorder` flags, or, more specifically for JFR, `-XX:+UnlockDiagnosticVMOptions` and `-XX:+DebugNonSafepoints`.

## Brief introduction to JDK Flight Recorder

Flight Recorder is a feature baked into the JVM that traces what’s happening inside the JVM with events. Flight Recorder development started a long time ago, in the early 2000s, on a JVM named JRockit.

This JVM was first developed by a Swedish company named Appeal, then Appeal was acquired by BEA. Oracle acquired BEA, then acquired Sun, and merged the JRockit source into HotSpot.

At some point BEA/Appeal engineers needed insight into their JVM in production; logging was not an option as it had too much overhead, so they designed an event-based system to work reliably in production.

JDK Mission Control, on the other hand, is the software that analyzes these events and builds consolidated views for us to examine.

Together, JFR and JDK Mission Control are similar to a system architectured around Event Sourcing.

JFR big picture

I won’t dive in, but most profilers have a bias when profiling and don’t report accurate results. The issue lies in how profilers handle JVM safepoints; most don’t properly account for the time spent in safepoints. Some profilers, like the great async-profiler, are not subject to these biases; JFR is not subject to them either.

Here’s great material provided by Nitsan Wakart:

## Using JFR

### How to start a recording?

One of the first interesting things you can do is to start a recording at runtime; `jcmd` is the tool for this job, the Swiss Army knife for servicing your Java app. If you don’t know it, use it, it’s really great!

Prior to Java 11, Flight Recorder had to be enabled before use. Now that this is no longer necessary, it’s possible to start a JFR session right away.

FlightRecorder is on by default
```
$ jcmd $(pgrep java) VM.flags -all | grep FlightRecorder
bool FlightRecorder                           = true                                      {product} {management}
ccstr FlightRecorderOptions                    =                                           {product} {default}
```

Just run `JFR.start` on the running java pid.

```
$ jcmd $(pidof java) JFR.start
6:
Started recording 2. No limit specified, using maxsize=250MB as default.

Use jcmd 6 JFR.dump name=2 filename=FILEPATH to copy recording data to file.
```

The output guides the user to the next useful commands, in particular `JFR.dump`. This command also tells you that a recording is limited to 250 MB by default. `jcmd` provides a `help` command that documents the options of each command.

JFR.start example
```
$ jcmd $(pgrep java) JFR.start \
name=app-profile \
duration=300s \
filename=/tmp/app-profile-$(date +%FT%H-%M-%S).jfr \
settings=profile
6:
Started recording 2. The result will be written to:

/tmp/app-profile-2020-03-26T16-41-48.jfr
```

JFR.start help
```
$ jcmd $(pgrep java) help JFR.start
6:
JFR.start
Starts a new JFR recording

Impact: Medium: Depending on the settings for a recording, the impact can range from low to high.

Permission: java.lang.management.ManagementPermission(monitor)

Syntax : JFR.start [options]

Options: (options must be specified using the <key> or <key>=<value> syntax)
name : [optional] Name that can be used to identify recording, e.g. \"My Recording\" (STRING, no default value)
settings : [optional] Settings file(s), e.g. profile or default. See JRE_HOME/lib/jfr (STRING SET, no default value)
delay : [optional] Delay recording start with (s)econds, (m)inutes), (h)ours), or (d)ays, e.g. 5h. (NANOTIME, 0)
duration : [optional] Duration of recording in (s)econds, (m)inutes, (h)ours, or (d)ays, e.g. 300s. (NANOTIME, 0)
disk : [optional] Recording should be persisted to disk (BOOLEAN, no default value)
filename : [optional] Resulting recording filename, e.g. \"/home/user/My Recording.jfr\" (STRING, no default value)
maxage : [optional] Maximum time to keep recorded data (on disk) in (s)econds, (m)inutes, (h)ours, or (d)ays, e.g. 60m, or 0 for no limit (NANOTIME, 0)
maxsize : [optional] Maximum amount of bytes to keep (on disk) in (k)B, (M)B or (G)B, e.g. 500M, or 0 for no limit (MEMORY SIZE, 0)
dumponexit : [optional] Dump running recording when JVM shuts down (BOOLEAN, no default value)
path-to-gc-roots : [optional] Collect path to GC roots (BOOLEAN, false)
```

In production, you’ll most likely be using the `duration`, `maxsize`, `filename` and `settings` options. We’ll briefly look at the other JFR commands after discussing `settings`. Later in this article we’ll see how to start a recording from the command line using `-XX:StartFlightRecording`.
### Settings files

The `settings` option refers to a Flight Recorder configuration; the JDK ships with two: `default` and `profile`. A configuration is an XML file with `event` elements that describe how JFR handles events in the JVM: whether they are enabled, their threshold, whether stack traces are recorded, etc. There is also a `control` element that is used by JDK Mission Control.

Here are the first few lines of the `default` settings. As its name implies, this is the configuration used when the command is run without a `settings` option.

$JAVA_HOME/lib/jfr/default.jfc
```
<?xml version="1.0" encoding="UTF-8"?>

<!--
Recommended way to edit .jfc files is to use Java Mission Control,
see Window -> Flight Recorder Template Manager.
-->

<configuration version="2.0"
label="Continuous"
description="Low overhead configuration safe for continuous use in production environments, typically less than 1 % overhead."
provider="Oracle">

<event name="jdk.ThreadAllocationStatistics">
<setting name="enabled">true</setting>
<setting name="period">everyChunk</setting>
</event>

<!-- a lot more events -->

<!-- then the control element -->
</configuration>
```
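To give an idea of what these `event` elements look like beyond the excerpt above, here is the shape of a typical entry combining a threshold with stack-trace capture; the exact event and values vary across JDK versions, so treat this as a sketch rather than a verbatim copy of the shipped file:

```
<event name="jdk.JavaMonitorEnter">
  <setting name="enabled">true</setting>
  <setting name="stackTrace">true</setting>
  <setting name="threshold">20 ms</setting>
</event>
```

Events shorter than the threshold are simply dropped, which is one of the levers that keeps the `default` configuration cheap.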

To give an order of magnitude for file size: on a pretty busy web application server, using the `default` settings and a duration of 5 minutes, the resulting dumped file weighs 15 MiB. With this profile you get more than basic information: IO, GC events, locking behavior, thread events, method profiling, etc.

The announced overhead is at most 1%!

$JAVA_HOME/lib/jfr/profile.jfc
```
<!--
Recommended way to edit .jfc files is to use Java Mission Control,
see Window -> Flight Recorder Template Manager.
-->

<configuration version="2.0"
label="Profiling"
description="Low overhead configuration for profiling, typically around 2 % overhead."
provider="Oracle">

<event name="jdk.ThreadAllocationStatistics">
<setting name="enabled">true</setting>
<setting name="period">everyChunk</setting>
</event>

<!-- a lot more events -->
</configuration>
```

With the `profile` settings, the dumped file takes around 35 MiB for a 5-minute duration. And it gives access to additional events, like the `OldObjectSample` stack traces, TLS events like TLS handshakes and X509 validation, class loading events, etc. It has a tad more overhead, around 2%. For most workloads this should be OK.

The value of the `settings` option is the file name of these files, `default` or `profile`. In addition, it’s also possible to pass an absolute file path; in other words, it’s possible to use a configuration of our own stored elsewhere.

### Dumping a recording

If it’s needed to acquire the recording, it’s possible to dump it at any time.

JFR.dump example
```
$ jcmd $(pidof java) JFR.dump filename=/tmp/app-profile-$(date +%FT%H-%M-%S).jfr
6:
Dumped recording, 239.5 MB written to:

/tmp/app-profile-2020-06-26T15-16-57.jfr
```
JFR.dump help
```
$ jcmd $(pgrep java) help JFR.dump
6:
JFR.dump
Copies contents of a JFR recording to file. Either the name or the recording id must be specified.

Impact: Low

Permission: java.lang.management.ManagementPermission(monitor)

Syntax : JFR.dump [options]

Options: (options must be specified using the <key> or <key>=<value> syntax)
name : [optional] Recording name, e.g. \"My Recording\" (STRING, no default value)
filename : [optional] Copy recording data to file, e.g. \"/home/user/My Recording.jfr\" (STRING, no default value)
maxage : [optional] Maximum duration to dump, in (s)econds, (m)inutes, (h)ours, or (d)ays, e.g. 60m, or 0 for no limit (NANOTIME, 0)
maxsize : [optional] Maximum amount of bytes to dump, in (M)B or (G)B, e.g. 500M, or 0 for no limit (MEMORY SIZE, 0)
begin : [optional] Point in time to dump data from, e.g. 09:00, 21:35:00, 2018-06-03T18:12:56.827Z, 2018-06-03T20:13:46.832, -10m, -3h, or -1d (STRING, no default value)
end : [optional] Point in time to dump data to, e.g. 09:00, 21:35:00, 2018-06-03T18:12:56.827Z, 2018-06-03T20:13:46.832, -10m, -3h, or -1d (STRING, no default value)
path-to-gc-roots : [optional] Collect path to GC roots (BOOLEAN, false)
```

If there is a single recording at the time, it’s possible to just use `JFR.dump`, but JFR is powerful enough to support multiple concurrent recordings; in this case you need to specify which recording to dump. Some options, like `filename` or `maxage`, override those defined in the start command for the current dump in particular. The other options are certainly interesting, but I found them a bit less useful in practice.

### Details of the active recording(s)

If there are multiple active recordings, or if it’s necessary to check the event configuration of the active recording, `jcmd` comes with `JFR.check`.

JFR.check example
```
$ jcmd $(pgrep java) JFR.check
6:
Recording 2: name=2 maxsize=250.0MB (running)
```
JFR.check help
```
$ jcmd $(pgrep java) help JFR.check
6:
JFR.check
Checks running JFR recording(s)

Impact: Low

Permission: java.lang.management.ManagementPermission(monitor)

Syntax : JFR.check [options]

Options: (options must be specified using the <key> or <key>=<value> syntax)
name : [optional] Recording name, e.g. \"My Recording\" or omit to see all recordings (STRING, no default value)
verbose : [optional] Print event settings for the recording(s) (BOOLEAN, false)
```

The `verbose` option allows examining which events are enabled for a recording.

### Stopping an active recording

When the recording session is deemed over, one can stop it, possibly providing a different file name than the one set in the start command.

JFR.stop example
```
$ jcmd $(pgrep java) JFR.stop \
name=app-profile \
filename=/tmp/app-profile-$(date +%FT%H-%M-%S).jfr
```

JFR.stop help
```
$ jcmd $(pgrep java) help JFR.stop
6:
JFR.stop
Stops a JFR recording

Impact: Low

Permission: java.lang.management.ManagementPermission(monitor)

Syntax : JFR.stop [options]

Options: (options must be specified using the <key> or <key>=<value> syntax)
name : Recording text,.e.g \"My Recording\" (STRING, no default value)
filename : [optional] Copy recording data to file, e.g. \"/home/user/My Recording.jfr\" (STRING, no default value)
```

### Global Flight Recorder configuration

What we saw before is how to start a recording and how to configure that specific recording. But there is another class of options that modify aspects of the JFR internals. Keep in mind that those affect all recordings in some way.

JFR.configure example
```
$ jcmd $(pidof java) JFR.configure \
stackdepth=96 \
repositorypath=/tmp/jfr-repo
6:
Repository path: /tmp/jfr-repo/2020_06_26_16_01_58_6

Dump path: /gclogs

Stack depth: 96

$ jcmd $(pidof java) JFR.configure
6:
Current configuration:

Repository path: /tmp/jfr-repo/2020_06_26_16_03_41_6

Stack depth: 96

Global buffer count: 20

Global buffer size: 512.0 kB

Thread buffer size: 8.0 kB

Memory size: 10.0 MB

Max chunk size: 12.0 MB

Sample threads: true
```

JFR.configure help
```
$ jcmd $(pidof java) help JFR.configure
80657:
JFR.configure
Configure JFR

Impact: Low

Permission: java.lang.management.ManagementPermission(monitor)

Syntax : JFR.configure [options]

Options: (options must be specified using the <key> or <key>=<value> syntax)
repositorypath : [optional] Path to repository,.e.g \"My Repository\" (STRING, no default value)
dumppath : [optional] Path to dump,.e.g \"My Dump path\" (STRING, no default value)
stackdepth : [optional] Stack Depth (JULONG, 64)
globalbuffercount : [optional] Number of global buffers, (JULONG, 20)
globalbuffersize : [optional] Size of a global buffers, (MEMORY SIZE, 512k)
thread_buffer_size : [optional] Size of a thread buffer (MEMORY SIZE, 8k)
memorysize : [optional] Overall memory size, (MEMORY SIZE, 10m)
maxchunksize : [optional] Size of an individual disk chunk (MEMORY SIZE, 12m)
samplethreads : [optional] Activate Thread sampling (BOOLEAN, true)
```

Here I’m increasing the `stackdepth`; this may be useful to generate more accurate flamegraphs, or for other analyses like those based on the `OldObjectSample` event. The `repositorypath` is where JFR regularly dumps slices, or chunks, of JFR events; each chunk has a maximum size of `maxchunksize`. These files behave like a rolling log appender. By default, these chunks are stored in the temporary directory, in a subfolder with a timestamp.

JFR repository
```
$ ls -lah /tmp/jfr-repo/2020_06_26_16_03_41_6/
total 71M
drwxr-xr-x 2 43514 root 4.0K Jun 26 16:21 .
drwxr-xr-x 3 43514 root 4.0K Jun 26 16:03 ..
-rw-r--r-- 1 43514 root 2.4M Jun 26 16:04 2020_06_26_16_04_02.jfr
-rw-r--r-- 1 43514 root 3.6M Jun 26 16:04 2020_06_26_16_04_12.jfr
-rw-r--r-- 1 43514 root  18M Jun 26 16:10 2020_06_26_16_04_47.jfr
-rw-r--r-- 1 43514 root 2.5M Jun 26 16:10 2020_06_26_16_10_18.jfr
-rw-r--r-- 1 43514 root  19M Jun 26 16:16 2020_06_26_16_10_26.jfr
-rw-r--r-- 1 43514 root  18M Jun 26 16:21 2020_06_26_16_16_16.jfr
-rw-r--r-- 1 43514 root    0 Jun 26 16:21 2020_06_26_16_21_50.jfr
-rw-r--r-- 1 43514 root 8.7M Jun 26 16:25 2020_06_26_16_21_50.part
```

I’m not sure why some chunks are over 12M (the default chunk size) at this time.

Be careful, however: some of these options are not well documented and may not do what we’d expect; e.g. `dumppath` only affects the dump created when the app crashes, and only if the `dumponexit` recording option is true.

These options are also available at startup via `-XX:FlightRecorderOptions`; we’ll see later how to use this option.

### JFR logs

Thanks to unified logging, it’s easy to look under the hood of any JVM runtime feature. To follow JFR, it’s possible to enable logging for the `jfr` component, to understand how it works and how options affect recordings.

``-Xlog:jfr``
```
[0.337s][info][jfr] Flight Recorder initialized
[0.338s][info][jfr] Created repository /tmp/2020_06_19_13_08_28_6
[0.367s][info][jfr] Creating thread sampler for java:20 ms, native 0 ms
[0.367s][info][jfr] Updated thread sampler for java: 20  ms, native 0 ms
[0.367s][info][jfr] Updated thread sampler for java: 20  ms, native 0 ms
[0.367s][info][jfr] Updated thread sampler for java: 20  ms, native 20 ms
[0.373s][info][jfr] Started recording "startup" (1) {maxsize=200.0MB, dumponexit=true, duration=6m, filename=/var/log/jfr/startup.jfr}
...
[0.847s][info][jfr] Updated thread sampler for java: 0  ms, native 20 ms
[0.848s][info][jfr] Stopped recording "1" (1). Reason "Dump on exit".
[0.862s][info][jfr] Wrote recording "1" (1) to /var/log/jfr/startup.jfr
[0.864s][info][jfr] Closed recording "1" (1)
[0.866s][info][jfr] Removed repository /tmp/2020_06_19_13_08_28_6
```

Increasing the log level may reveal additional details.
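A sketch of what that could look like; I’m using the unified logging wildcard syntax, which selects every tag combination that includes `jfr`:

```
-Xlog:jfr*=debug
```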

### Memory usage in the process

The section above describes where the actual data is saved for long or large (`duration`, `maxage`, `maxsize`) profiling sessions: the repository, on disk, will grow within these constraints. JFR is safe to enable in production, but there’s an overhead in memory as well; it’s usually minimal compared to the heap or other native memory sections, but it’s worth mentioning.
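As a reminder, NMT (Native Memory Tracking) is off by default and has to be enabled at JVM startup, which has its own small overhead, for example with:

```
-XX:NativeMemoryTracking=summary
```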

If NMT is enabled, you can just display the summary:

```
$ jcmd $(pidof java) VM.native_memory
6:

Native Memory Tracking:

Total: reserved=5324939KB, committed=3600539KB
-                 Java Heap (reserved=2793472KB, committed=2793472KB)
(mmap: reserved=2793472KB, committed=2793472KB)

...

-                   Tracing (reserved=75866KB, committed=75866KB) (1)
(malloc=75866KB #85438)

...
```
1. JFR’s `Tracing` memory zone uses ~74 MB

This was taken on a very active application with custom events; your mileage may vary.

The next output shows the committed memory for tracing after a 6-minute recording, which shows that JFR keeps a memory zone anyway. That is the minimal JFR footprint I experienced.

```
-                   Tracing (reserved=21041KB, committed=21041KB)
(malloc=21041KB #2783)
```

## Analyze with `jfr`

Now, to exploit the recording, there’s a tool named `jfr` that ships with the JDK. On Linux, the `alternatives` mechanism that manages JDK installations may not be aware of `jfr`, which means you may need to use the full path to this executable.

The first interesting thing to do is to get an overview of the recording; the `summary` sub-command displays a histogram of the events.

event type histogram (summary)
```
$ jfr summary /tmp/app-profile-2020-03-26T16-57-14.jfr

 Version: 2.0
 Chunks: 1
 Start: 2020-03-26 16:57:14 (UTC)
 Duration: 303 s

 Event Type                              Count  Size (bytes)
===========================================================
 jdk.ThreadPark                         130278       5868710
 jdk.SocketRead                          38804       1934842
 jdk.JavaMonitorWait                     38722       1378513
 jdk.NativeMethodSample                  14702        263403
 jdk.ThreadCPULoad                       11821        271763
 jdk.ExecutionSample                      3010         54177
 jdk.ModuleExport                         2505         40187
 jdk.ClassLoaderStatistics                2344         72694
 jdk.ThreadAllocationStatistics            878         16962
 jdk.ModuleRequire                         754         11964
 jdk.BooleanFlag                           648         23106
 jdk.CPULoad                               298          7450
 jdk.JavaThreadStatistics                  298          6258
 jdk.ClassLoadingStatistics                298          5066
 jdk.CompilerStatistics                    298         11324
 jdk.ExceptionStatistics                   298          6258
 jdk.ActiveSetting                         285         10497
 jdk.BiasedLockRevocation                  275          7831
 ...
 jdk.GCPhasePauseLevel1                     20           965
 jdk.CheckPoint                             17       1631868
 jdk.ExecuteVMOperation                     15           391
 jdk.DoubleFlag                             13           618
 jdk.BiasedLockClassRevocation              10           275
 jdk.GCHeapSummary                          10           475
 jdk.MetaspaceSummary                       10           580
 jdk.G1HeapSummary                          10           300
 jdk.OldObjectSample                        10           367
 ...
 jdk.BiasedLockSelfRevocation                2            45
 jdk.PhysicalMemory                          2            46
 jdk.ThreadDump                              2       1389568
 jdk.CodeSweeperStatistics                   2            64
 jdk.GCConfiguration                         2            60
 jdk.ThreadEnd                               1            17
 jdk.Metadata                                1         74738
 jdk.JavaMonitorEnter                        1            33
 jdk.SafepointBegin                          1            24
 jdk.JVMInformation                          1           898
 jdk.OSInformation                           1           367
 jdk.VirtualizationInformation               1            33
 jdk.CPUInformation                          1          1432
 jdk.CPUTimeStampCounter                     1            25
 jdk.CompilerConfiguration                   1            15
 jdk.CodeCacheConfiguration                  1            51
 ...
 jdk.X509Certificate                         0             0
 jdk.TLSHandshake                            0             0
```

Other interesting things can be done with this tool. The `print` sub-command can extract these events, in XML or in JSON; from there it’s possible to perform other types of aggregation using other tools.

extract data using `jfr` and `jq`
```
$ jfr print \
--json \
/gclogs/startup.jfr \
| jq '.recording.events[] | .values.duration'
```

It’s also possible to assemble `jfr` files or break them into smaller parts. As a side note, the files in the repository can be exploited this way. Keep in mind these files may be removed as soon as they expire or as soon as every recording stops.

summary on a chunk in the repository
```
$ jfr summary /tmp/jfr-repo/2020_06_26_16_03_41_6/2020_06_26_16_04_12.jfr

 Version: 2.0
 Chunks: 1
 Start: 2020-06-26 16:04:12 (UTC)
 Duration: 35 s

 Event Type                        Count  Size (bytes)
===========================================================
 jdk.ThreadPark                    19853        918218
 jdk.SocketRead                     6459        325796
 jdk.JavaMonitorWait                5581        200005
 jdk.ClassLoaderStatistics          2620         81098
 ...
```

## Controlling JFR programmatically

While JDK 14 allows consuming events on the fly, previous JDK versions (from JDK 11) offer a public API useful enough to control Flight Recorder programmatically or to read events from a JFR file.

Start JFR programmatically
```
Configuration c = Configuration.getConfiguration("profile"); // (1)
Recording r = new Recording(c);
r.setName("monitor jvm");
r.enable("jdk.*"); // (2)
r.setMaxAge(java.time.Duration.ofMinutes(4)); // (3)
r.start(); // (4)
// code to be profiled
r.stop(); // (5)
r.dump(Files.createFile(Paths.get("/var/log/jfr/app-initiated.jfr"))); // (6)
```

1. As shown above, choose the JFR configuration.
2. Choose which events the recording should be interested in. Another signature accepts classes; it’s unlikely to be helpful for JDK events, but it may get interesting for custom events, i.e. your classes.
3. Optionally set recording constraints, like the maximum age of the records.
4. Hit record.
5. When the recording session is over, stop JFR.
6. Then store the results in the location of your choosing.

The above snippet creates a continuous profiling session with a 4-minute window.

The API also allows reading emitted `.jfr` files. It represents what’s actually in a file: a schema of the events, and the events themselves.
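As an aside on the custom events mentioned in (2): a custom event is a plain class extending `jdk.jfr.Event`. The event name and fields below are my own illustration, not taken from an actual application:

```java
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

public class CustomEventExample {

    // Illustrative custom event; the name and fields are assumptions for this sketch.
    @Name("com.example.HttpCall")
    @Label("HTTP Call")
    static class HttpCallEvent extends Event {
        @Label("Path")
        String path;

        @Label("Status")
        int status;
    }

    public static void main(String[] args) {
        HttpCallEvent event = new HttpCallEvent();
        event.begin();               // start timing the measured section
        // ... the code being measured ...
        event.path = "/orders";
        event.status = 200;
        event.commit();              // recorded only if this event type is enabled
    }
}
```

Such events then show up in recordings and in JDK Mission Control like any JDK event, and can be enabled with `r.enable(HttpCallEvent.class)` or by name.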
Read JFR files yourself
```
try (RecordingFile rf = new RecordingFile(Paths.get("/var/log/jfr/app-initiated.jfr"))) { // (1)
    // read the schema
    rf.readEventTypes().forEach((EventType et) -> { // (2)
        System.out.println(et.getName());
        et.getFields()
          .stream()
          .map((ValueDescriptor vd) -> vd.getName())
          .forEach(System.out::println);
    });

    // actual events
    while (rf.hasMoreEvents()) { // (3)
        RecordedEvent e = rf.readEvent();
        System.out.println(e.getEventType().getName()); // (4)
    }
}
```

1. Open the JFR file; it’s a `Closeable` and it reads a file, so be sure to use it in a try-with-resources block.
2. `readEventTypes()` gets you the schema of the events: field names, labels, thresholds, etc.
3. Then there’s this somewhat awkward enumeration-style API to read the events, `hasMoreEvents()` and `readEvent()`.
4. Access details on the event type.

The `RecordingFile` API is a bit awkward to work with; more specifically, parsing each event requires looking at the event descriptor (via `getEventType()` or `getFields()`) and interrogating the event, as field presence may evolve with each JDK revision. The javadoc advises a defensive programming style when reading a JFR file:

snippet from the javadoc
```
if (event.hasField("intValue")) {
    int intValue = event.getValue("intValue");
    System.out.println("Int value: " + intValue);
}
```

This API is complemented in JDK 14 by live event streaming (JEP 349), via a `RecordingStream` API that is a mix of the above; that’s out of scope for this article, but it’s yet another reason to make the effort to upgrade our JDK.

Such API facilities are useful, especially when combined with other technologies like Spring Actuators. Yet when there’s no available integration, or when using these integrations is too late, like for recording startup, the most actionable way to get a recording is from the command line.

## How and when to record in production

The next points will present the command line flags to set up JFR.
On JDK 11 there is the `JDK_JAVA_OPTIONS` environment variable. I found this variable particularly useful to tweak JVM parameters in a containerized environment without having to rebuild the image. This environment variable only affects `java`, while `JAVA_TOOL_OPTIONS` affects any binary in the JDK: `java`, `jps`, `jcmd`, etc.

While it would certainly be useful to record the whole lifetime, this is impractical; even an airplane’s Flight Data Recorder (and Cockpit Voice Recorder) only keeps recent history. Instead, it’s possible to aim at specific time frames where a recording could be useful:

• At startup: the JVM does a lot of things, and so does the application; it generally initializes a lot of long-lived objects, services, threads, etc.

• Continuously at runtime: this is likely a sliding time window in which one can access what happened in the last X <time unit> (limited either by age or by size). In this case the dump can be done when required.

• At shutdown, whether the JVM was killed or crashed. In this case the JFR files are an alternative to heap dumps for the autopsy.

### Record application startup

I mentioned startup as a separate window because it’s useful to inspect a startup recording. During this time the JVM initializes a lot of things, most code has yet to be warmed up, and depending on the workload there may be a lot of allocations. I found this startup recording very useful to tune the readiness of an application, as we’ll see later.

Time-bound recording at JVM startup
``-XX:StartFlightRecording=settings=profile,duration=6m,name=app-startup,filename=/var/log/jfr/app-startup.jfr``

It’s possible to tweak this recording with other parameters, like:

• `path-to-gc-roots=true`, which allows identifying leaks using the `OldObjectSample` event (enabled in the `profile` settings)

• `maxsize`, to set a size threshold on the recording

• `disk=false`, if you want to keep the events in memory only before dumping to the configured `filename`.
Otherwise, the JVM will use its default strategy, which is to evacuate chunks of events to disk, in the JFR `repository` (by default a folder in the temporary folder).

In addition to these recording parameters, it can be useful to set a few JFR-wide options, i.e. options that affect all recordings, e.g. `-XX:FlightRecorderOptions=stackdepth=96`, which augments the size of the captured stacks. Be advised that the bigger the number, the higher the impact.

In the container, checking JFR
```
❯ jcmd $(pgrep java) JFR.check
6:
Recording 1: name=app-startup duration=6m (running) (1)

❯ jcmd $(pgrep java) JFR.check
6:
No available recordings. (2)

Use jcmd 6 JFR.start to start a recording.

❯ ls -lah /var/log/jfr/app-startup.jfr
-rw-r--r-- 1 root root 57M May 6 22:35 /var/log/jfr/app-startup.jfr
```

1. Indicates the configured recording is ongoing.
2. No more recordings once the duration is over.

I’ll show how to use this recording later in this article.

### Record application post-startup or continuous recording

Once startup has been recorded, it’s useful to set up a continuous recording. The good thing is that the JVM allows defining multiple recordings on the command line. Let’s add another `-XX:StartFlightRecording`, with the `delay` parameter.

Delayed continuous recording
``-XX:StartFlightRecording=settings=profile,delay=5m,maxage=10m,name=post-startup,filename=/var/log/jfr/post-startup.jfr``

This registers a continuous profiling session that will start 5 minutes after the JVM starts. It sets a retention of 10 minutes, or a retention of the default maximum size, which is 250 MiB in JDK 11. If this is the only recording, `JFR.check` will output something like this.

In the container, checking JFR
```
❯ jcmd $(pgrep java) JFR.check
6:
Recording 1: name=post-startup maxage=10m (delayed) (1)

❯ jcmd $(pgrep java) JFR.check
6:
Recording 1: name=post-startup maxage=10m (running) (2)
```

1. Indicates there’s a recording that will start at some point in the future.
2. Indicates the configured continuous recording is ongoing.

Note that in the case of the continuous recording it’s necessary to dump the recording via `JFR.dump`.

### Recording for shutdown

The only thing to do is to set the recording parameter `dumponexit=true` (on each recording). The recording will be stored in the configured `filename`; otherwise JFR will create a file like `hotspot-pid-6-id-1-2020_05_03_12_54_14.jfr` in the working directory of the process. Later on, I discussed with Erik Gahlin and he told me that the `filename` attribute accepts a path to an existing folder, so that dumps on exit are written in this folder.

In the following filename, `hotspot-pid-6-id-1-2020_05_03_12_54_14.jfr`:

• `pid-6` is the process identifier, usually it’s the same in a container.

• `id-1` is the recording identifier, the identifier is assigned following the order of `StartFlightRecording` options, new recordings started by `jcmd` will be given an identifier as well.

• `2020_05_03_12_54_14` is the date at which this dump has been created.

The generated names make it hard to correlate recordings when a container is restarted several times, but that can be worked around.

The JVM source code suggests that JFR has a notion of emergency JFR dump, but the mechanism is different, as it seems those are dumped in the working directory of the process, which may not be writable in a container. I don’t think it’s currently possible to change that location. But from what I’ve seen, a `StackOverflowError` or an `OutOfMemoryError` is dumped fine via `dumponexit=true` and `filename=…​`.

• OutOfMemoryError ⇒ `hs_oom_pid.jfr`

• StackOverflowError ⇒ `hs_soe_pid.jfr`

• Other error ⇒ `hs_err_pid.jfr`

In practice, for the stack overflow and the Java heap out of memory errors, I never witnessed dumps named this way; instead the recordings seem to follow the configured location, or the default one, `hotspot-pid-…​.jfr`.

### Putting it all together

Putting it all together, let’s set all of these flags in `JDK_JAVA_OPTIONS`.

record startup then record continuously, and dump on exit
```
-XX:StartFlightRecording=settings=profile,duration=6m,name=app-startup,dumponexit=true,filename=/var/log/jfr/app-startup.jfr
-XX:StartFlightRecording=settings=profile,delay=5m,maxage=10m,name=post-startup,dumponexit=true,filename=/var/log/jfr/post-startup.jfr
-XX:FlightRecorderOptions=stackdepth=96
```

After acquiring the recording files, you are ready to exploit them. We’ve seen `jfr`, on which it’s possible to build; the next section briefly presents the other elephant in the room (in a positive way), JDK Mission Control, which empowers its user with remarkable diagnosis capabilities.

## JDK Mission Control

JDK Mission Control is a tool that makes sense of all these events, and it excels at that.

In an Event Sourcing system, the query side is what empowers the views; JDK Mission Control uses these events to present several consolidated views, each tailored to diagnose a part of the JVM runtime, and more if using custom events.

On macOS you can install it via Homebrew (`brew cask install jdk-mission-control`).

As mentioned earlier, Mission Control was present before in the JDK; however, its user interface has evolved quite a bit, and following the official Oracle troubleshooting guide for Java 11 to inspect a recording may be a bit confusing. Even the help pages shipped with JMC are sometimes outdated, or not explicit about which version of the JDK a help section refers to.

At this time JDK Mission Control 7.1 is available, and 8 snapshots are also available for users on the edge. I’m using the 8 snapshots, but 7.1 does the job just fine.

Enough text, let’s see how to use JDK Mission Control on a concrete issue.

A large screen is really appreciated when looking at JMC. My 13-inch laptop is way too narrow in my opinion; using a 27-inch monitor really adds to usability.

## Real case startup request latencies

At some point we had an issue with application startup performance. For a bit of context, this app runs in a Kubernetes cluster. With k8s it’s possible to define a rollout policy to avoid service disruption, which is what we had: during a rollout, new pods (a pod is composed of the application container) get created and new application instances are started; then Kubernetes has a means to check when the application is ready, called the readiness probe. This probe can be a call to an HTTP endpoint or a script; ours is an HTTP endpoint.

When the probe indicates the application is ready, the application starts receiving traffic; from this moment, latencies increased from 10-20 milliseconds to seconds.

Opening the recording in JMC starts an automated analysis. Sometimes it’s helpful, and warnings should be looked at. In this case, the issue was elsewhere.

To help in our case, I needed to go to the thread view, which has thread lanes: each thread is displayed as a horizontal bar, and JFR events are painted on the bar at the time they happened and for their duration. At the bottom there’s the time axis.

The first things that catch the eye are the vertical patterns, especially the one with yellow and red around 12:51:15. Let's zoom in (you can select and right-click, or use the arrow keys).

Zooming on the problematic area

The vertical pattern is now clearer: it indicates a cross-thread phenomenon. We can also notice that not all threads are affected the same way.

The green lanes at the top indicate threads that are progressing normally; nothing is wrong with them. Further down, however, the pattern materializes around 12:51:11.

Each color indicates a kind of event:

• Gray ⇒ the thread does nothing, it's parked

• Small dark point ⇒ the park event: the thread was scheduled but was eventually parked

• Red ⇒ in this case, a blocked lock

• Yellow ⇒ here the yellow events that interest us are threads waiting on a lock

I'm not well versed in describing colors, so be aware that other events use a slightly different tone or shade of red, yellow, etc. Just click on an event to see what's going on.

monitor blocked

The UI could be improved, but hovering over an event shows a popup with its details; for the Monitor Blocked event, there's the duration the monitor was blocked, which thread held the monitor before, and the monitor's address in the heap.

This monitor was held for ~9s, that's a big bottleneck!

In this case the application had ~200 threads competing for this particular lock. Finding the thread that acquired it is not automated and a bit tedious, and in the end it didn't reveal anything.
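As an aside, the kind of monitor contention JMC surfaces here is easy to reproduce: one thread holds an intrinsic lock while others block trying to enter it, which is exactly what the Monitor Blocked event records. A minimal, self-contained sketch (the timings below are illustrative, not taken from the recording):

```java
public class MonitorContention {
    private static final Object LOCK = new Object();

    // Measures roughly how long a second thread stays blocked on LOCK
    // while another thread holds it for holdMillis.
    static long measureBlockedMillis(long holdMillis) throws InterruptedException {
        Thread holder = new Thread(() -> {
            synchronized (LOCK) {
                try {
                    Thread.sleep(holdMillis);
                } catch (InterruptedException ignored) {
                }
            }
        });
        holder.start();
        Thread.sleep(50); // give the holder time to acquire the lock first
        long start = System.nanoTime();
        synchronized (LOCK) {
            // blocked here until the holder releases the monitor
        }
        long blockedMillis = (System.nanoTime() - start) / 1_000_000;
        holder.join();
        return blockedMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("blocked for ~" + measureBlockedMillis(300) + " ms");
    }
}
```

With a recording running, the second thread's wait shows up as a red Monitor Blocked event in the thread lanes, with the holder thread and the duration in its details.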

However, the Monitor Blocked event has another interesting element: the stack trace. In the lower pane of JMC, when the event is selected, we can see the stack trace where this event was emitted.

Here it's the JDK classloader. I would never have thought that JDK code would cause ~9s of contention. But looking at the stack trace, a Jedis (a Redis client) connection pool is involved. Maybe the class has a static initializer that does IO, or maybe it's the agent that runs in production that introduces latencies during this specific classloading.

But all the threads start working again around 12:51:22, and the Redis bottleneck is no longer an issue afterward. Let's look at the story on the IO side by opening the Socket IO view.

Identify which connection is related to Redis, either via the IP or via the port (Redis uses `6379`).

Socket IO with redis port selected

While we see a total time of 30s, that's the accumulated time for all threads over the recording duration (6 min), so nothing fancy here. However, we see that the actual IO happens later, around 12:51:22.

Socket IO actual Redis read / write

I'm not sure of the cause; we can definitely rule out Redis itself. The problem is either in the network, during connection establishment, or actually during classloading.

To remediate that, we noticed this code is called once the application is ready. So to avoid the contention, the code was changed to pre-connect to Redis during startup, before the application is considered ready. And it worked: this specific issue disappeared.
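The fix can be sketched as follows: a warm-up step that runs before the readiness flag flips. `preConnect()` here is a hypothetical stand-in for the real Jedis pool initialization, which the article doesn't show:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class ReadinessGate {
    private static final AtomicBoolean READY = new AtomicBoolean(false);

    // Hypothetical stand-in for the real warm-up: establish the Redis
    // connections (e.g. borrow one from the pool and issue a PING) so the
    // cost is paid before the first user request arrives.
    static void preConnect() {
        // jedisPool.getResource().ping();  // illustrative, not the actual code
    }

    // What the HTTP readiness endpoint reports back to Kubernetes.
    static boolean isReady() {
        return READY.get();
    }

    public static void main(String[] args) {
        preConnect();     // warm up first...
        READY.set(true);  // ...then report ready and start receiving traffic
        System.out.println("ready=" + isReady());
    }
}
```

The point of the design is ordering: the readiness probe only turns green once the expensive one-time work is already done, so traffic never observes it.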

The issue identified in the first recording showed contention around Redis; that is now gone. In this second iteration, however, hovering the mouse pointer over the yellow and red events reveals multiple other issues.

Incriminating stack traces:

1. Jersey's `ListMultiMap` classloading
2. Jackson's `SerializerCache` initialization
3. Jersey's `ImprovedAbstractMap` classloading
4. Jackson's `PropertySerializerMap` initialization
5. Jackson's `TypeFactory` initialization
6. Jackson's `DeserializerCache` initialization
7. Guava's `hash` function involving classloading
8. Newrelic's `SynchronizedCollection.size()`
9. Jackson's `PropertySerializerMap` initialization
10. Datastax cassandra driver's `ChainedResultSetFuture` involving classloading
11. Jersey's `AbstractMapBasedMultimap` involving classloading

All the stack traces above have one thing in common: contention on a lock. As threads were blocked and new requests arrived, queuing increased, which made Tomcat create new HTTP NIO worker threads until the maximum was reached (200 by default). I applied the same trick, i.e. before readiness, exercise the code paths that are lazily initialized. And it worked too: most contention disappeared afterward.
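One cheap way to exercise lazily initialized code paths is to force classloading (and thus static initialization) up front. A sketch of the idea, where the class names are stand-ins, not the application's actual list:

```java
import java.util.List;

public class Warmup {
    // Stand-in list; in the real application this would come from the
    // stack traces JMC showed (Jackson caches, Jersey multimaps, ...).
    static final List<String> LAZY_CLASSES = List.of(
            "java.time.format.DateTimeFormatter",
            "java.util.concurrent.ConcurrentHashMap");

    // Trigger classloading and static initialization before readiness,
    // instead of under concurrent user traffic.
    static int preload() {
        int loaded = 0;
        for (String name : LAZY_CLASSES) {
            try {
                Class.forName(name);
                loaded++;
            } catch (ClassNotFoundException e) {
                // optional dependency missing: ignore and move on
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        System.out.println("preloaded " + preload() + " classes");
    }
}
```

Note that classloading alone may not be enough for everything; some caches (like Jackson's serializer caches) only fill when a real serialization is performed, so a full warm-up may also issue a few dummy requests through those paths.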

Better startup / C2 compiler event

There was still some potential gain, however. I noticed a few C2 compiler related events. If you don't know C1 and C2, those are the JIT compilers that transform bytecode into assembly. C1 compiles quickly with only light optimizations; its role is to get compiled code running fast, not to produce highly optimized machine code. In a second phase, C2 may kick in for hot code paths, that is code executed very often, and its job is to output highly optimized code at the cost of more computing resources.

Note that these compilation events do not block application threads!

As an experiment I used the Graal compiler, which I believe is the new generation of compiler that has a chance to replace C2. The Graal compiler is part of the GraalVM project. It ships as an experimental option: `-XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler`.

And here is the result for this particular workload, with the only change between these two successive rollouts being the activation of the Graal compiler.

Improved P99

This graph falls under the percentile, but the improvement is still visible.

After that experiment, when I analyzed the new profile, I noticed that the JVM was spending a lot of time revoking biased locks. Note the events in fuchsia are custom JFR events that track HTTP requests (Gunnar Morling has an inspiring write-up about this).

This VM operation is performed at a safepoint, during which the JVM is paused. A lot of bias revocation events appeared at this time, so we turned biased locking off, and that worked too.

Revoke Bias locking

The graphic below shows the difference in startup time. Previously, other consuming backend services had to time out and eventually retry during rollouts, which increased the pressure on the application.

Before ⇒ after

Other smaller issues could be discovered and tackled; for example, I noticed a few gaps, 200ms wide, at 30s to 1min intervals, that were related to GC pauses.

GC gaps

For this application the GC is working within 20ms, except for a few larger pauses during the early minutes after startup:

GC pauses causes new HTTP worker threads

Tuning GC is out of scope for this article, but JMC presents a nice view that can help identify GC causes. In my opinion it's currently a bit less powerful than GC logs, because GC events lack useful information only found in GC logs with the appropriate tags and levels.

Java Application view
Garbage Collection view

Anyway, from there we could imagine a settings file in the JDK dedicated to GC events. Instead of parsing log files whose string format may change a bit over releases (even with unified logging), GC events could be analyzed by leveraging the included JFR schema.
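To illustrate, the `jdk.jfr` API (JDK 11+) already lets one consume typed GC events from a recording without any log parsing. A minimal sketch that records in-process, provokes a collection, and reads the events back:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

public class GcEvents {
    // Records GC events in-process and returns one line per collection.
    static List<String> recordGc() throws Exception {
        Path file = Files.createTempFile("gc", ".jfr");
        try (Recording recording = new Recording()) {
            recording.enable("jdk.GarbageCollection"); // GC pause summary event
            recording.start();
            System.gc(); // provoke at least one collection
            recording.stop();
            recording.dump(file);
        }
        List<String> lines = new ArrayList<>();
        try (RecordingFile rf = new RecordingFile(file)) {
            while (rf.hasMoreEvents()) {
                RecordedEvent event = rf.readEvent();
                if (event.getEventType().getName().equals("jdk.GarbageCollection")) {
                    lines.add(event.getString("cause") + ": "
                            + event.getDuration().toMillis() + " ms");
                }
            }
        } finally {
            Files.deleteIfExists(file);
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        recordGc().forEach(System.out::println);
    }
}
```

The event fields (`cause`, duration, etc.) are typed and stable, which is exactly what makes them more robust to consume than log lines.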

Also, with JFR it's even possible to track down memory leaks using the `profile` settings file. I never tried that in a real situation, but if it avoids the need to take a heap dump and parse it, then I'm all in.
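For reference, starting such a recording from the command line could look like this (`1234` is a placeholder PID; `path-to-gc-roots` is what makes leak candidates traceable back to their GC roots):

```shell
# The 'profile' settings enable costlier events, including the
# old-object sampling used for leak hunting.
jcmd 1234 JFR.start settings=profile path-to-gc-roots=true duration=2m filename=leak.jfr

# Later, check the state of ongoing recordings:
jcmd 1234 JFR.check
```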

## Wrap up

Use JFR in production, use JDK Mission Control !

I'm amazed by the professionalism and the quality of the work BEA put into their JVM (JRockit) in the early 2000s. Thanks to these talented engineers, we have JFR and `jcmd` (which was named `jrcmd` back then). And thanks to Oracle for making these available to all!

You may also want to follow Marcus Hirt's blog; he regularly updates it with useful information on JMC and JFR. As it happens, Marcus was the co-founder of Appeal (the company that developed JRockit), and then continued his career at BEA, then Oracle.

Finally, a quick thanks to David Delabasse for his comments, and to Erik Gahlin for his recent feedback on JFR.

EDIT 2020-09-15: Since this article on JFR and JMC, and with JDK Mission Control gaining traction, many more articles have popped up:

Also, be sure to follow Erik Gahlin, who works on JFR as well.