Rediscovering JVM ergonomics with containers
Ergonomics tunes internal JVM values, such as the thread pool sizes used by the GC; it may even choose a GC algorithm for you.
Ergonomics for servers was first introduced in Java SE 5.0.
Ergonomics greatly reduced the time spent tuning server applications (at the time the JVM had two modes, client and server), in particular heap sizing and other advanced GC flag tuning. Other components were affected as well, such as the compiler threads.
Since Java 7, the client mode is inactive on a 64-bit JDK:

-client
Selects the Java HotSpot Client VM. A 64-bit capable JDK currently ignores this option and instead uses the Java HotSpot Server VM.

which effectively eliminates the two modes -client and -server. However, the concept behind server-class machines remains and may actually cause you some surprises.
In the next sections I'll take a look at the following items:

- GC threads
- GC region sizes (though these are usually driven by Xms/Xmx)
- HotSpot compilation thread count
- Code cache sizes
- ForkJoinPool (and incidentally CompletableFuture)
Unless specified on the command line via -XX:ActiveProcessorCount, the os::initialize_initial_active_processor_count() method on Linux will either use a syscall to get the number of processors, or derive the value from the cgroup's CPU shares and CPU quota files (see OSContainer::active_processor_count()).
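To double-check what the JVM actually sees inside a container, a tiny program can print the active processor count the ergonomics will be based on. This is a minimal sketch of my own, relying on the fact that on JDK 11 Runtime.availableProcessors() reflects both -XX:ActiveProcessorCount and the detected cgroup limits:

// A minimal sketch: print the processor count the JVM ergonomics will work from.
// On JDK 11, Runtime.availableProcessors() reflects -XX:ActiveProcessorCount as
// well as the cgroup CPU limits detected by the container support.
public class Cpus {
    public static void main(String[] args) {
        System.out.println("active processors: " + Runtime.getRuntime().availableProcessors());
    }
}

For example, running this class in a container started with --cpus=1.2 should print 2 with the JDK 11 image used later in this article (the CPU quota is rounded up), and that is the value the GC and compiler ergonomics are derived from.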
Using the flag -XX:+PrintFlagsFinal to see these default values is immediately useful. For a live application, one can use jcmd $(pidof java) VM.flags -all.
GC
The JDK 11 GC tuning and ergonomics page mentions these points:

- Garbage-First (G1) collector ⇐ Actually that's not always the default GC; the server-class mechanism is still in place.
- The maximum number of GC threads is limited by heap size and available CPU resources ⇐ The CPU count does matter, but the heap size doesn't; it's the available memory.
- Initial heap size of 1/64 of physical memory ⇐ ✔
- Maximum heap size of 1/4 of physical memory ⇐ ✔ (both heap fractions are illustrated in the sketch below)
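As a quick sanity check of the last two points, here is a small sketch of my own (not JVM code) that applies the 1/64 and 1/4 fractions to a given amount of physical memory. The real ergonomics also align and clamp these values, so treat the output as an approximation:

// Apply the documented heap-sizing fractions: 1/64 for the initial heap,
// 1/4 for the maximum heap. Purely illustrative arithmetic.
public class HeapDefaults {
    static final long MIB = 1024L * 1024L;

    public static void main(String[] args) {
        long physicalMemory = 2048 * MIB; // e.g. a container started with --memory=2g
        System.out.println("InitialHeapSize ~ " + (physicalMemory / 64) / MIB + " MiB");
        System.out.println("MaxHeapSize     ~ " + (physicalMemory / 4) / MIB + " MiB");
    }
}

For a 2 GiB container this gives roughly a 32 MiB initial heap and a 512 MiB maximum heap, which is easy to confirm with -XX:+PrintFlagsFinal.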
This section will focus on worker thread ergonomics; heap sizing, heap spaces (eden, survivor, old), and regions are other ergonomics that this article won't dive into.
Let's take a quick look at GC threads:
$ java -XX:+PrintFlagsFinal --version | grep -E "GCThreads|Use.*GC\b"
uint ConcGCThreads = 2 {product} {ergonomic} (2)
uint ParallelGCThreads = 8 {product} {default} (3)
bool UseAdaptiveSizePolicyWithSystemGC = false {product} {default}
bool UseConcMarkSweepGC = false {product} {default}
bool UseDynamicNumberOfGCThreads = true {product} {default}
bool UseG1GC = true {product} {ergonomic} (1)
bool UseMaximumCompactionOnSystemGC = true {product} {default}
bool UseParallelGC = false {product} {default}
bool UseParallelOldGC = false {product} {default}
bool UseSerialGC = false {product} {default}
bool UseShenandoahGC = false {product} {default}
1 | On this machine G1GC has been chosen ergonomically |
2 | ConcGCThreads is configured ergonomically |
3 | ParallelGCThreads appears to be the default, but as we'll see the default value is chosen as a function of the available processors |
The math behind these flags is different for each GC and could change in a new JDK release.
For G1GC's ParallelGCThreads and ConcGCThreads, this Oracle Tech Network article (by Monica Beckwith) mentions (emphasis is mine):
The value of n is the same as the number of logical processors up to a value of 8. If there are more than eight logical processors, sets the value of n to approximately 5/8 of the logical processors.
Sets n to approximately 1/4 of the number of parallel garbage collection threads (ParallelGCThreads).
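To make the quoted rules concrete, here is a small sketch of my own transcribing the formulas above (this is not the HotSpot code) and computing the G1 defaults for a few processor counts:

// ParallelGCThreads follows the "5/8 above 8 CPUs" rule, ConcGCThreads is
// derived from it. Integer arithmetic, as in the formulas quoted above.
public class G1ThreadDefaults {
    static int parallelGcThreads(int ncpus) {
        return ncpus <= 8 ? ncpus : 8 + ((ncpus - 8) * 5) / 8;
    }

    static int concGcThreads(int parallelGcThreads) {
        return Math.max((parallelGcThreads + 2) / 4, 1);
    }

    public static void main(String[] args) {
        for (int ncpus : new int[] {1, 2, 4, 8, 16, 32}) {
            int parallel = parallelGcThreads(ncpus);
            System.out.printf("ncpus=%d -> ParallelGCThreads=%d, ConcGCThreads=%d%n",
                    ncpus, parallel, concGcThreads(parallel));
        }
    }
}

On an 8-CPU machine this gives ParallelGCThreads=8 and ConcGCThreads=2, which matches the -XX:+PrintFlagsFinal output above.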
Let's play with Docker to show these values in action within a constrained cgroup (note the use of --cpus, which sets a CPU quota for a given CPU period).
$ docker container run --rm -it --cpus=0.8 adoptopenjdk/openjdk11:latest java -XX:+PrintFlagsFinal --version | grep -E "GCThreads|Use.*GC\b"
uint ConcGCThreads = 0 {product} {default} (2)
uint ParallelGCThreads = 0 {product} {default} (2)
bool UseConcMarkSweepGC = false {product} {default}
bool UseG1GC = false {product} {default}
bool UseParallelGC = false {product} {default}
bool UseParallelOldGC = false {product} {default}
bool UseSerialGC = true {product} {ergonomic} (1)
bool UseShenandoahGC = false {product} {default}
$ docker container run --rm -it --cpus=1.2 adoptopenjdk/openjdk11:latest java -XX:+PrintFlagsFinal --version | grep -E "GCThreads|Use.*GC\b"
uint ConcGCThreads = 1 {product} {ergonomic}
uint ParallelGCThreads = 2 {product} {default}
bool UseConcMarkSweepGC = false {product} {default}
bool UseG1GC = true {product} {ergonomic} (3)
bool UseParallelGC = false {product} {default}
bool UseParallelOldGC = false {product} {default}
bool UseSerialGC = false {product} {default}
bool UseShenandoahGC = false {product} {default}
$ docker container run --rm -it --cpus=1.2 --memory=1.5g adoptopenjdk/openjdk11:latest java -XX:+PrintFlagsFinal --version | grep -E "GCThreads|Use.*GC\b"
uint ConcGCThreads = 0 {product} {default}
uint ParallelGCThreads = 0 {product} {default}
bool UseConcMarkSweepGC = false {product} {default}
bool UseG1GC = false {product} {default}
bool UseParallelGC = false {product} {default}
bool UseParallelOldGC = false {product} {default}
bool UseSerialGC = true {product} {ergonomic} (4)
bool UseShenandoahGC = false {product} {default}
$ docker container run --rm -it --cpus=6 --memory=2g adoptopenjdk/openjdk11:latest java -XX:+PrintFlagsFinal --version | grep -E "GCThreads|Use.*GC\b"
uint ConcGCThreads = 2 {product} {ergonomic}
uint ParallelGCThreads = 6 {product} {default}
bool UseConcMarkSweepGC = false {product} {default}
bool UseG1GC = true {product} {ergonomic} (5)
bool UseParallelGC = false {product} {default}
bool UseParallelOldGC = false {product} {default}
bool UseSerialGC = false {product} {default}
bool UseShenandoahGC = false {product} {default}
1 | The CPU quota is less than 1, so the JVM sees at most one hardware thread, which makes SerialGC the default GC |
2 | Consequently there is no need for GC threads |
3 | With more than 1 CPU (the 1.2 quota rounds up to 2 active processors) and enough memory, G1GC is chosen as the default this time |
4 | The CPU count is still 2, but the 1.5 GiB memory limit is below the 1792 MiB threshold, so SerialGC is again the default |
5 | With at least 2 CPUs and 2 GiB of memory, G1GC is the default |
If the application is running with fewer than 2 CPUs or less than 1792 MiB of memory, the JVM heuristics consider that the app is not running on a server, which makes SerialGC the default GC.
Running the SerialGC may not be an issue for some workloads, but it can be for others. In that case it might be useful to select the GC explicitly (e.g. with -XX:+UseG1GC).
Let’s look at the source to get a precise picture of what’s going on.
// This is the working definition of a server class machine:
// >= 2 physical CPU's and >=2GB of memory, with some fuzz
// because the graphics memory (?) sometimes masks physical memory.
// If you want to change the definition of a server class machine
// on some OS or platform, e.g., >=4GB on Windows platforms,
// then you'll have to parameterize this method based on that state,
// as was done for logical processors here, or replicate and
// specialize this method for each platform. (Or fix os to have
// some inheritance structure and use subclassing. Sigh.)
// If you want some platform to always or never behave as a server
// class machine, change the setting of AlwaysActAsServerClassMachine
// and NeverActAsServerClassMachine in globals*.hpp.
bool os::is_server_class_machine() {
// First check for the early returns
if (NeverActAsServerClassMachine) { (1)
return false;
}
if (AlwaysActAsServerClassMachine) { (2)
return true;
}
// Then actually look at the machine
bool result = false;
const unsigned int server_processors = 2;
const julong server_memory = 2UL * G;
// We seem not to get our full complement of memory.
// We allow some part (1/8?) of the memory to be "missing",
// based on the sizes of DIMMs, and maybe graphics cards.
const julong missing_memory = 256UL * M;
/* Is this a server class machine? */
if ((os::active_processor_count() >= (int)server_processors) && (3)
(os::physical_memory() >= (server_memory - missing_memory))) {
1 | Tell the JVM to never act as a server |
2 | Tell the JVM to always act as a server |
3 | The server-class ergonomic check: at least 2 active processors and at least 1792 MiB of physical memory |
void GCConfig::select_gc_ergonomically() {
if (os::is_server_class_machine()) {
#if INCLUDE_G1GC
FLAG_SET_ERGO_IF_DEFAULT(bool, UseG1GC, true);
#elif INCLUDE_PARALLELGC
FLAG_SET_ERGO_IF_DEFAULT(bool, UseParallelGC, true);
#elif INCLUDE_SERIALGC
FLAG_SET_ERGO_IF_DEFAULT(bool, UseSerialGC, true);
#endif
} else {
#if INCLUDE_SERIALGC
FLAG_SET_ERGO_IF_DEFAULT(bool, UseSerialGC, true);
#endif
}
}
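Putting the two snippets together, the decision can be summarized by this little Java transcription (a sketch for illustration only; it assumes a JDK build that includes G1, the usual case for JDK 11):

// A condensed Java transcription of the two HotSpot snippets above.
public class GcSelectionSketch {
    static final long MIB = 1024L * 1024L;

    static boolean isServerClassMachine(int activeProcessors, long physicalMemoryBytes) {
        // >= 2 active processors and >= 2 GiB minus the 256 MiB "missing memory" fuzz (1792 MiB)
        return activeProcessors >= 2 && physicalMemoryBytes >= (2048L - 256L) * MIB;
    }

    static String selectGcErgonomically(int activeProcessors, long physicalMemoryBytes) {
        return isServerClassMachine(activeProcessors, physicalMemoryBytes) ? "G1GC" : "SerialGC";
    }

    public static void main(String[] args) {
        System.out.println(selectGcErgonomically(1, 4096 * MIB)); // SerialGC: only one CPU
        System.out.println(selectGcErgonomically(2, 1536 * MIB)); // SerialGC: memory below 1792 MiB
        System.out.println(selectGcErgonomically(6, 2048 * MIB)); // G1GC: server-class
    }
}

These three cases mirror the Docker experiments shown earlier.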
Inspecting the source code, it is possible to guide the JVM heuristics on this matter:

- Use -XX:+AlwaysActAsServerClassMachine, which consequently lets the server GC be used,
- Enable a GC algorithm explicitly,
- Use -XX:ActiveProcessorCount=<number>, but keep in mind the memory check still applies.

In my opinion, enabling a particular GC algorithm is the superior choice, as it is explicit with regard to the GC parameters.
Now let's focus on the worker threads of the different GCs. The table below summarizes the worker thread options for the GCs you can find in the JDK (starting from JDK 11u).
ZGC is experimental from JDK 11 up to (but not including) JDK 15, and as such requires unlocking experimental options (-XX:+UnlockExperimentalVMOptions) to be used.
Garbage Collector | Worker thread options |
---|---|
Serial | not applicable, of course |
Parallel | ParallelGCThreads |
CMS | ParallelGCThreads, ConcGCThreads |
G1 | ParallelGCThreads, ConcGCThreads, G1ConcRefinementThreads |
Shenandoah | ParallelGCThreads, ConcGCThreads |
ZGC | ParallelGCThreads, ConcGCThreads |
In general one can say that:

- The parallel threads are threads that perform work while the world is paused
- The concurrent threads are threads that perform work concurrently with the application
The GC thread count is based on the number of processors reported by the system and differs for each GC. In the subsequent sections I'll take a closer look at G1, Shenandoah and ZGC. I will skip Parallel GC and CMS, as G1 has been the default since JDK 9, but the rationale is the same.
In order to tweak those threads we need to understand what they are supposed to do, and for that it is essential to have a basic understanding of how each GC works.
I will use the term ncpus for the active processor count.
G1 Threads
Pool | Controlled by | Default |
---|---|---|
Stop-the-world threads, used for the parallel operations: evacuation, remark and cleanup | ParallelGCThreads | \$= {(ncpus, if ncpus <= 8), (8 + ((ncpus - 8) * 5) / 8, if ncpus > 8):}\$ source: vm_version.cpp |
Concurrent threads, used for object marking and region liveness | ConcGCThreads | \$= max((text{ParallelGCThreads} + 2) / 4, 1)\$ source: g1ConcurrentMark.cpp |
Concurrent remembered set processing, i.e. processing the RSet buffers | G1ConcRefinementThreads | \$= text{ParallelGCThreads}\$ source: g1Arguments.cpp |
The source code indicates that ParallelGCThreads for G1GC (and CMS) is not marked as being set ergonomically (it uses FLAG_SET_DEFAULT), but in my opinion this flag is somewhat ergonomic in nature.
To understand exactly what these ergonomics affect, it's possible to rely on the official G1GC documentation; in particular there is this schema that pictures the G1 collection cycle. Otherwise, there's the great documentation by the people of plumbr.io.
image::/assets/rediscovering-jvm-ergonomics/g1-cycle.png
Blue dots are young collections, where young evacuation pauses happen; orange dots are the remark and cleanup pauses respectively; red dots are part of the space-reclamation phase (mixed collections, full GC)
ParallelGCThreads
- These threads are employed during stop-the-world pauses; in particular they do the following jobs:
- Evacuation: this pause moves live objects to other regions, and the references to moved objects need to be updated.
- Remark: this pause finalizes the concurrent marking (of live objects) itself; additionally this pause may be an opportunity to unload classes.
- Cleanup: this pause is the step where G1GC determines whether a space-reclamation phase needs to follow. The cleanup phase is also when G1GC can collect dead humongous objects. Finally, if G1GC deems space reclamation necessary, i.e. collecting old regions, this phase prepares for mixed young collections.

ConcGCThreads
- In the original design of G1GC, the marking of live objects is performed concurrently, so the main job is:
- Marking: marking live objects, which is also used to determine the liveness of regions.

G1ConcRefinementThreads
- This is a G1-specific pool of threads that work concurrently and update remembered sets. For reference, remembered sets (aka RSets) are per-region entries that are used by G1GC to track inbound object references into a heap region. The RSets avoid scanning the whole heap to track references; this is particularly helpful to keep evacuation pauses "reasonably" short, as G1 just needs to scan the region's RSet.
The RSet updates are in fact logged in buffers, and the expectation is that these entries are processed by the refinement threads only. However, if there are too many updates or too many cross-region references, the refinement threads may not keep up, and the application threads will take over.
Personally I never had to tweak the concurrent marking of G1 (ConcGCThreads) or the RSets (G1ConcRefinementThreads); with containers however I really had to tweak the stop-the-world workers (ParallelGCThreads).
The JVM has a lot of flags here and there that can be used to alter the default ergonomics, e.g. UseDynamicNumberOfGCThreads; if you use them you enter tuning terrain!
Shenandoah
Shenandoah is a next-generation low-pause collector, and like all GCs in OpenJDK it defines pools of threads for certain tasks:
Pool | Controlled by | Default |
---|---|---|
Concurrent GC threads, where most of the work is done, from updating references to moving objects | ConcGCThreads | \$max(1, "ncpus" / 4)\$ source: shenandoahArguments.cpp |
Parallel GC threads, used when the concurrent GC didn't keep up and an allocation failure occurred, in which case Shenandoah starts a degenerated GC | ParallelGCThreads | \$max(1, "ncpus" / 2)\$ source: shenandoahArguments.cpp |

The comment in shenandoahArguments.cpp explains the choice of 1/2:

// Set up default number of parallel threads. We want to have decent pauses performance
// which would use parallel threads, but we also do not want to do too many threads
// that will overwhelm the OS scheduler. Using 1/2 of available threads seems to be a fair
// compromise here. Due to implementation constraints, it should not be lower than
// the number of concurrent threads.
bool ergo_parallel = FLAG_IS_DEFAULT(ParallelGCThreads);
if (ergo_parallel) {
  FLAG_SET_DEFAULT(ParallelGCThreads, MAX2(1, os::initial_active_processor_count() / 2));
}
The source code indicates these two pools of threads are not marked as being set ergonomically (they use FLAG_SET_DEFAULT), but in my opinion these flags are somewhat ergonomic in their definition.
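As with G1, a quick sketch of my own (again just the formulas above, not the Shenandoah code) shows how the two pools scale with the processor count:

// Shenandoah defaults: 1/4 of the CPUs for concurrent work, 1/2 for parallel
// (pause / degenerated GC) work, never less than one thread.
public class ShenandoahThreadDefaults {
    static int concGcThreads(int ncpus)     { return Math.max(1, ncpus / 4); }
    static int parallelGcThreads(int ncpus) { return Math.max(1, ncpus / 2); }

    public static void main(String[] args) {
        for (int ncpus : new int[] {1, 2, 4, 8, 16}) {
            System.out.printf("ncpus=%d -> ConcGCThreads=%d, ParallelGCThreads=%d%n",
                    ncpus, concGcThreads(ncpus), parallelGcThreads(ncpus));
        }
    }
}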
ZGC
Pool | Controlled by | Default |
---|---|---|
Parallel GC threads | ParallelGCThreads | \$min(|~ 60% of "ncpus" ~|, "2% of MaxHeapSize" / (2 MiB))\$ source: zHeuristics |
Concurrent GC threads | ConcGCThreads | \$min(|~ 12.5% of "ncpus" ~|, "2% of MaxHeapSize" / (2 MiB))\$ source: zHeuristics |

The worker counts come from the following heuristic in zHeuristics:

MIN2(nworkers_based_on_ncpus(cpu_share_in_percent), nworkers_based_on_heap_size(2.0));

static uint nworkers_based_on_ncpus(double cpu_share_in_percent) {
  return ceil(os::initial_active_processor_count() * cpu_share_in_percent / 100.0);
}

static uint nworkers_based_on_heap_size(double reserve_share_in_percent) {
  const int nworkers = (MaxHeapSize * (reserve_share_in_percent / 100.0)) / ZPageSizeSmall;
  return MAX2(nworkers, 1);
}
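To get a feeling for this heuristic, here is a sketch that plugs concrete numbers into the formulas above. Assumptions of mine: the small page size (ZPageSizeSmall) is 2 MiB, and the CPU shares are 60% for the parallel workers and 12.5% for the concurrent workers, as in the table.

// Plug numbers into the ZGC worker heuristic quoted above.
public class ZgcWorkerDefaults {
    static final long MIB = 1024L * 1024L;
    static final long Z_PAGE_SIZE_SMALL = 2 * MIB; // assumed small page size

    static int workersBasedOnNcpus(int ncpus, double cpuShareInPercent) {
        return (int) Math.ceil(ncpus * cpuShareInPercent / 100.0);
    }

    static int workersBasedOnHeapSize(long maxHeapBytes, double reserveShareInPercent) {
        int nworkers = (int) (maxHeapBytes * (reserveShareInPercent / 100.0) / Z_PAGE_SIZE_SMALL);
        return Math.max(nworkers, 1);
    }

    public static void main(String[] args) {
        int ncpus = 8;
        long maxHeap = 4096 * MIB;
        int parallel   = Math.min(workersBasedOnNcpus(ncpus, 60.0),
                                  workersBasedOnHeapSize(maxHeap, 2.0));
        int concurrent = Math.min(workersBasedOnNcpus(ncpus, 12.5),
                                  workersBasedOnHeapSize(maxHeap, 2.0));
        System.out.println("ParallelGCThreads=" + parallel + ", ConcGCThreads=" + concurrent);
    }
}

With 8 CPUs and a 4 GiB maximum heap this yields 5 parallel and 1 concurrent worker.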
Compiler
The JDK 11 GC tuning and ergonomics page also mentions this point:

- Tiered compiler, using both C1 and C2 ⇐ ✔
Ergonomic | Controlled by | Default |
---|---|---|
Compiler threads | CICompilerCount | \$max(log_2(ncpus) - 1, 1)\$ (which evaluates to 2 on the 8-CPU machine used above) |
Other ergonomics
How to discover ergonomics
- FLAG_SET_ERGO (in the HotSpot source code)
- -Xlog:gc+ergo (at runtime)
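Besides grepping the source for FLAG_SET_ERGO and enabling -Xlog:gc+ergo, one can also ask a running JVM about a flag's origin through the HotSpotDiagnosticMXBean. A small sketch of my own (JDK 11, jdk.management module):

import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;

// Print the value and origin (DEFAULT, ERGONOMIC, ...) of a few flags.
public class ShowFlagOrigins {
    public static void main(String[] args) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        for (String flag : new String[] {"UseG1GC", "ConcGCThreads", "ParallelGCThreads",
                                         "CICompilerCount", "UseCompressedOops"}) {
            VMOption option = bean.getVMOption(flag);
            System.out.println(option.getName() + " = " + option.getValue()
                    + " (" + option.getOrigin() + ")");
        }
    }
}

Flags whose origin is reported as ERGONOMIC are exactly the ones this article is about.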
Ergonomic | Controlled by | Default |
---|---|---|
Compressed pointers | UseCompressedOops | true on 64-bit JVMs (when the maximum heap is below 32 GiB) |
Compressed class pointers | UseCompressedClassPointers | true on 64-bit JVMs |
Non-Uniform Memory Access (NUMA) interleaving of old and eden spaces | UseNUMAInterleaving | true if UseNUMA = true |
End words
In the source code it should be easy to find ergonomic flags, as they are usually set via FLAG_SET_ERGO, but it happens that GC authors have chosen to use FLAG_SET_DEFAULT instead, for reasons I have not yet investigated.