# Off-Heap memory reconnaissance

This entry has been marinating for most of the year 2020. I rewrote it multiple times to make it more digestible; I left out some content, but this is still a big piece to read.

tl;dr
• JVM Native Memory Tracking values are different from the RSS

• NMT is useful to identify the JVM runtime memory usage

• NMT does not track memory mapped files

• NMT does not track allocations performed by native libraries

• More threads means more memory usage (times the stack depth of the application)

• `pmap` or `/proc/{pid}/maps` (or `smaps`) completes the picture about the effective RSS of every memory segment of the process

• In containers, evaluate the usage of `-XX:+AlwaysPreTouch`; it makes off-heap usage easier to reason about

Read on to get the details. Some sections are collapsed, typically the snippets written in C.

## Motivation

I have been running applications in containers for a while now, and I have noted that this shift in deployment resulted in tighter constraints. And the closer the walls are the more we, software developers (or anyone involved in production), should pay attention to how memory is consumed.

Indeed, getting the right memory limit for Java applications is sometimes an intangible task, and I think the `MaxRAMPercentage` flag is certainly not the right tool for this job.

Moreover, with a given limit, if an application gets OOM Killed then one has to ask if it is the limit that needs adjustment or if it is the application that is misbehaving (memory leak in particular, but not always).

To answer questions about memory usage there are various things to look at: Java heap, Metaspace, other JVM components, etc. I faced a few incidents where the JVM settings and the Kubernetes memory limit were seemingly appropriate, yet the apps were constantly OOM Killed because the RSS kept growing toward this limit. Some of these issues were solved just by raising the memory limit. However, in some other cases it wasn’t that crystal clear.

This served as an excuse to go down the rabbit hole.

This entry will hopefully help to understand the basics of how a `java` process uses native memory and remind some rudiments of OS memory management.

 Most of the time, figures will use the IEC binary notation (`1 KiB = 1024 Byte`); it matches the JVM, our Kubernetes usage, and Linux’s tools (`/proc/{pid}/stat` or `/proc/{pid}/maps`; although I couldn’t find a reference link stating this). Some charts may however use the SI metric notation (`1 KB = 1000 Byte`).

> Actually, 227,893 KB is only 222 MB. For ease of discussion, I’ll truncate the KBs part by 1,000 in this chapter; pretend I’m a disk manufacturer.
> — Java Performance: The Definitive Guide: Getting the Most Out of Your Code (1st Edition)

 Thanks to this tweet.
 Also, all Java snippets and commands have been run with Java 11.
 Last one, this writing assumes cgroups v1; v2 has slight variations that I haven’t studied yet.
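To make the gap between the two notations concrete, here is the book quote's arithmetic sketched in Java (`227_893` is the figure quoted above; the class name is mine):

```java
public class Units {
    public static void main(String[] args) {
        long kib = 227_893;             // the book's figure, reported as "KB" (really KiB)
        System.out.println(kib / 1024); // proper IEC conversion to MiB -> 222
        System.out.println(kib / 1000); // the "disk manufacturer" shortcut, dividing by 1000 -> 227
    }
}
```

Integer division truncates in both cases, just like the quote does.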

Getting comfortable with the memory of a (JVM) process is a tedious task for most of us who have written Java code for their entire professional life. However, it is a rewarding endeavor, and it’s possible to extract useful findings.

## Exploration begins

### The JVM flags as a starting point

When assessing Java memory, one of the first things to look at are the Java heap parameters.

It’s likely anyone that reads this article is familiar with `Xms` or `Xmx`, but there are other ways to define the boundaries of the Java heap, in particular if the process is started with the `*RAMPercentage` flags. With these, the JVM computes the actual values from the `cgroup` limits; in this case it’s possible to access the actual runtime values with `jcmd`.

In short, it’s possible to look at the command line options, but using the diagnostic command `jcmd {pid} VM.flags` lets you peek at the actual values that the JVM used.

For example, with a memory limit of `5 GiB`, if a process is started with `-XX:InitialRAMPercentage=85.0 -XX:MaxRAMPercentage=85.0`, the `VM.flags` diagnostic command will output this:

JVM current flags in a Kubernetes container
``````$ jcmd $(pidof java) VM.flags | tr ' ' '\n'
6:
...
-XX:InitialHeapSize=4563402752 (3)
-XX:InitialRAMPercentage=85.000000 (1)
-XX:MarkStackSize=4194304
-XX:MaxHeapSize=4563402752 (4)
-XX:MaxNewSize=2736783360
-XX:MaxRAMPercentage=85.000000 (2)
-XX:MinHeapDeltaBytes=2097152
-XX:NativeMemoryTracking=summary
...``````
1. Initial RAM at 85%
2. Max RAM at 85%
3. Initial heap size ~`4.25 GiB`
4. Max heap size ~`4.25 GiB`
 Do not confuse the `VM.flags` command which will output parameters calculated from the command line and `VM.command_line` which will print the raw command line.
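As a sanity check of these numbers: with a `5 GiB` limit, 85% should give exactly the `4563402752` bytes reported above. A minimal sketch of the arithmetic (the `5 GiB` value is the limit from this example; the JVM may further align the result, though here it matches exactly):

```java
public class RamPercentage {
    public static void main(String[] args) {
        long limit = 5L * 1024 * 1024 * 1024;  // 5 GiB container memory limit
        double maxRamPercentage = 85.0;        // -XX:MaxRAMPercentage=85.0
        long maxHeap = (long) (limit * maxRamPercentage / 100.0);
        System.out.println(maxHeap);           // matches -XX:MaxHeapSize above
    }
}
```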

The other HotSpot flag values are JVM defaults, which may either be static values or computed from internal heuristics.
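Incidentally, when `jcmd` is not available in the container, the same diagnostic command can be invoked in-process through the `com.sun.management:type=DiagnosticCommand` MBean; a sketch, assuming a HotSpot JVM (diagnostic command names are camel-cased, `VM.flags` becomes `vmFlags`):

```java
import java.lang.management.ManagementFactory;
import javax.management.ObjectName;

public class VmFlags {
    public static void main(String[] args) throws Exception {
        var server = ManagementFactory.getPlatformMBeanServer();
        // Diagnostic command operations take an optional String[] of arguments
        var flags = (String) server.invoke(
                new ObjectName("com.sun.management:type=DiagnosticCommand"),
                "vmFlags",
                new Object[] { null },
                new String[] { String[].class.getName() });
        System.out.println(flags); // same content as `jcmd <pid> VM.flags`
    }
}
```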

As we tend to regularly dismiss, the Java heap is only a part of the process memory usage. So now let’s dig into how memory is consumed. The values and snippets come from an application running inside a container.

### The real memory footprint of the java process in the container

The JVM does everything to keep software developers from having to care about memory, and before containers the bigger systems helped sustain this comfortable way of programming. Sometimes there’s a Java heap memory leak, but it doesn’t happen every day, and even more rarely is there a problem with the process memory.

There’s a better chance we get hit by GC pauses.

With containers, one of the most critical things to look at is the resident set size, i.e. the physical memory actually used by the process. It can be obtained in various ways, using `ps`, `top`, or reading the `/proc` filesystem. E.g. on the same application from which I got the flags above:

`ps`
``````$ ps o pid,rss -p $(pidof java)
6 4701120``````
`/proc/{pid}/status`
``````$ cat /proc/$(pgrep java)/status | grep VmRSS
VmRSS:     4701120 kB``````

The RSS is ~`4.5 GiB` while its Java heap size is `4.25 GiB`, indicating this process uses roughly `0.25 GiB` of non-Java-heap memory; I’ll refer to this memory as native memory.

I’d like to dig a bit to understand what the reported number, `4701120 KiB`, actually measures.

#### The JVM component memory

In order to understand how the Java process memory is consumed, we need to use Native Memory Tracking (`-XX:NativeMemoryTracking=summary`) which produces an overview of the memory usage by the components of the JVM. It actually gives a pretty good picture of the "cost" of having a JVM.

 Enabling detailed native memory tracking (NMT) causes a 5% to 10% performance overhead. The summary mode only has a minor impact on memory usage, as shown below, and is usually enough.
 It is necessary to note that while the command below indicates a scale of `KB`, for the JVM it really means `KiB`.
JVM native memory tracking report
``````$ jcmd $(pidof java) VM.native_memory
6:

Native Memory Tracking:

Total: reserved=7168324KB, committed=5380868KB                               (1)
-                 Java Heap (reserved=4456448KB, committed=4456448KB)        (2)
(mmap: reserved=4456448KB, committed=4456448KB)

-                     Class (reserved=1195628KB, committed=165788KB)         (3)
(classes #28431)                                 (4)
(  instance classes #26792, array classes #1639)
(malloc=5740KB #87822)
(mmap: reserved=1189888KB, committed=160048KB)
(  Metadata:   )
(    reserved=141312KB, committed=139876KB)
(    used=135945KB)
(    free=3931KB)
(    waste=0KB =0.00%)
(  Class space:)
(    reserved=1048576KB, committed=20172KB)
(    used=17864KB)
(    free=2308KB)
(    waste=0KB =0.00%)

-                    Thread (reserved=696394KB, committed=85454KB)
(thread #674)
(stack: reserved=692812KB, committed=81872KB)    (5)
(malloc=2432KB #4046)
(arena=1150KB #1347)

-                      Code (reserved=251877KB, committed=105201KB)          (6)
(malloc=4189KB #11718)
(mmap: reserved=247688KB, committed=101012KB)

-                        GC (reserved=230739KB, committed=230739KB)          (7)
(malloc=32031KB #63631)
(mmap: reserved=198708KB, committed=198708KB)

-                  Compiler (reserved=5914KB, committed=5914KB)              (8)
(malloc=6143KB #3281)
(arena=180KB #5)

-                  Internal (reserved=24460KB, committed=24460KB)           (10)
(malloc=24460KB #13140)

-                     Other (reserved=267034KB, committed=267034KB)         (11)
(malloc=267034KB #631)

-                    Symbol (reserved=28915KB, committed=28915KB)            (9)
(malloc=25423KB #330973)
(arena=3492KB #1)

-    Native Memory Tracking (reserved=8433KB, committed=8433KB)
(malloc=117KB #1498)

-               Arena Chunk (reserved=217KB, committed=217KB)
(malloc=217KB)

-                   Logging (reserved=7KB, committed=7KB)
(malloc=7KB #266)

-                 Arguments (reserved=19KB, committed=19KB)
(malloc=19KB #521)

-                    Module (reserved=1362KB, committed=1362KB)
(malloc=1362KB #6320)

-              Synchronizer (reserved=837KB, committed=837KB)
(malloc=837KB #6877)

-                 Safepoint (reserved=8KB, committed=8KB)
(mmap: reserved=8KB, committed=8KB)

-                   Unknown (reserved=32KB, committed=32KB)
(mmap: reserved=32KB, committed=32KB)``````
1. This shows a `reserved` value (`7168324 KiB`, ~`6.84 GiB`), the amount of addressable memory for that process, and a `committed` value (`5380868 KiB`, ~`5.13 GiB`) that represents what the JVM actually asked the OS to allocate.
2. `Heap` zone; note that the reserved and committed values are the same `4456448 KiB` here because our `InitialRAMPercentage` is the same as the max. I’m not sure why this number is different from the VM flag `-XX:MaxHeapSize=4563402752` though.
3. ~`162 MiB` of Metaspace.
4. How many classes have been loaded: `28431`.
5. There are 674 threads whose stacks are using ~`80 MiB` at this time.
6. `Code` cache area (assembly of the used methods): ~`102 MiB` out of ~`246 MiB`.
7. This section contains the `GC` algorithm’s internal data structures; this app is using G1GC, which takes ~`225 MiB`.
8. The C1 / C2 compilers (which compile bytecode to assembly) use ~`5.8 MiB`.
9. The `Symbol` section contains many things like interned strings and other internal constants, about `28.2 MiB`.
10. The `Internal` area takes ~`24 MiB`. Before Java 11 this area included `DirectByteBuffer`s, but from Java 11 those are accounted in the `Other` zone.
11. The `Other` section, which from Java 11 includes `DirectByteBuffer`s: ~`261 MiB`.

The remaining areas are much smaller in scale: NMT takes ~`8.2 MiB` itself, the module system uses ~`1.3 MiB`, etc. Also, note that other JVM features may show up if they are activated, like flight recorder.

There’s a lot more to read on the official documentation about NMT and how to Monitor VM Internal Memory. Yet another worthwhile read on native memory tracking by Aleksey Shipilёv.

In the rest of this article, when talking in the context of Native Memory Tracking, I may use the terms memory type or memory zone, but the real definition would be:

> the memory allocation type performed by a JVM component

The different sections are defined in the `MemoryType` enumeration, in the order they appear in the report.

NMT is a great tool to gain insight into the memory usage of the various parts that compose the Java runtime. It has interesting subcommands to compare the memory usage of the JVM components with a baseline (`jcmd $(pidof java) VM.native_memory baseline`, followed at some point by one or several `jcmd $(pidof java) VM.native_memory summary.diff`).

This is very useful for JVM components and a good complement to what I would like to show in this article, because NMT alone does not answer what is actually accounted for in the RSS column of `ps`.

#### Revising OS virtual memory and memory management

I mentioned this acronym already: RSS, or Resident Set Size. What is it? What exactly do the committed and reserved memory reported by NMT mean? How do they relate to each other?

First let’s break down the vocabulary when we talk about memory.

memory vocabulary
**Committed**: Address ranges that have been mapped or `malloc`ed. They may or may not be backed by physical memory or swap, due to lazy allocation and paging. This applies to the JVM and the OS. These ranges are not necessarily contiguous.

**Reserved**: The total address range that has been pre-mapped via `mmap` or `malloc` for a particular memory pool. In other words, reserved memory represents the maximum addressable memory; the parts not yet committed could be referred to as uncommitted.

**Resident**: OS memory pages which are currently in physical RAM. This means code, stacks, part of the committed memory pools, but also portions of `mmap`ed files which have recently been accessed, and allocations outside the control of the JVM.

**Virtual**: The sum of all virtual address mappings. Covers committed and reserved memory pools, but also mapped files or shared memory. This number is rarely informative since the JVM will reserve large address ranges upfront. We can see this number as the pessimistic memory usage.

The above graph mostly displays the relative size of each memory kind within the address space of a process. In order to explain resident memory, it’s necessary to revise how Linux (and other OSes, by the way) manages memory using the concept of paging.

The virtual address space is divided into smaller chunks called pages, usually of `4 KiB`. Other page sizes exist and may even co-exist (e.g. pages of 4 KiB mixed with 2 MiB pages), depending on the capabilities of the processor; working with different page sizes is out of scope for this article. What is interesting is how paging and RSS relate to each other.
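To make the page math concrete, a tiny sketch (the `pagesFor` helper is hypothetical, and `4 KiB` pages are assumed):

```java
public class Pages {
    static final long PAGE_SIZE = 4 * 1024; // common default on Linux x86-64

    // Number of pages needed to back `bytes` of contiguous memory
    static long pagesFor(long bytes) {
        return (bytes + PAGE_SIZE - 1) / PAGE_SIZE; // round up to a whole page
    }

    public static void main(String[] args) {
        System.out.println(pagesFor(1));                 // 1 byte still occupies a full page
        System.out.println(pagesFor(10 * 1024));         // 10 KiB -> 3 pages
        System.out.println(pagesFor(16L * 1024 * 1024)); // 16 MiB -> 4096 pages
    }
}
```

Touching a single byte therefore makes a whole `4 KiB` page resident.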

Virtual memory and paging (for a single process)

The graph above shows the addressable space of a process and its pages. The process can access these pages using the addresses of its virtual space; however, these pages have to be stored physically, usually in RAM, sometimes on disk. When referring to these chunks of memory on the hardware, we use the term frame.

The real memory address is naturally different from this virtual address space for the process. In the CPU there’s a specialized component called MMU (Memory Management Unit) whose role is to translate the virtual addresses to physical addresses.

The incentive behind virtual memory and paging comes from multi-tasking: it allows running multiple programs concurrently, each process having the illusion of a single big block of memory. In practice, it enables useful tricks like lazy allocation, swapping, file mapping, defragmentation, caching, etc.

The OS is hard at work performing these tricks while keeping this illusion for all processes. Since programs run concurrently, not all memory pages are used at the same time.

In practical terms we can observe that:

• A physical memory frame won’t be used if the process didn’t touch a page, or we can say this page doesn’t exist.

• The kernel may choose to move the content of a page to a slower device, usually a disk in a special place called swap if it thinks there won’t be enough physical memory (RAM).

• The kernel may use unused physical frames for caching purposes, or other tasks like defragmentation.

The resident set size means the set of pages of a process that are present in physical memory, i.e. without untouched / unused pages. This contrasts with the virtual size, which covers the total address space of a program; this value is usually way bigger than the RSS.

If you want to dive into how the whole paging thing works, head to system courses or articles (like this masterpiece), where they usually explain in depth how everything interacts.
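On Linux, a process can read both numbers for itself from `/proc/self/status`; a sketch that prints `VmRSS` and `VmSize` (and degrades gracefully where procfs is absent):

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class Rss {
    public static void main(String[] args) throws Exception {
        var status = Path.of("/proc/self/status");
        if (!Files.exists(status)) {
            System.out.println("no procfs on this platform");
            return;
        }
        // VmRSS = resident set size, VmSize = virtual size, both in kB (really KiB)
        Files.readAllLines(status).stream()
             .filter(line -> line.startsWith("VmRSS") || line.startsWith("VmSize"))
             .forEach(System.out::println);
    }
}
```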

##### Reserved and committed memory for NMT

Concretely, for the JVM this means that:

1. the committed memory is immediately usable,

2. and the reserved (uncommitted) memory is put on hold and not immediately usable.

With a better understanding of how memory works let’s look again at the output of the `VM.native_memory` command to make more sense of it:

``````Total: reserved=7168324KB, committed=5380868KB                               (1)
-                 Java Heap (reserved=4456448KB, committed=4456448KB)        (2)
(mmap: reserved=4456448KB, committed=4456448KB)
...
-                     Class (reserved=1195628KB, committed=165788KB)         (3)
...
...
-                      Code (reserved=251877KB, committed=105201KB)
...
-                        GC (reserved=230739KB, committed=230739KB)          (5)
...``````
1. The process addressable memory and what is currently committed.
2. Here NMT also shows the same abstractions of committed and reserved memory; on this process these values are the same because `InitialHeapSize` (`Xms`) and `MaxHeapSize` (`Xmx`) are the same. If these boundaries were different, the heap zone would likely show different values for reserved and committed memory; the JVM will increase the committed memory if necessary, and can even uncommit some of this memory if the GC algorithm allows it.
3. The Class and Code spaces work the same way; specific JVM flags control their reserved and committed memory.
4. Java threads are allocated within the process memory; the JVM flags only control the size of a thread stack. I will expand on this later.
5. Then come the other memory spaces of the JVM, like the GC internal structures, which use a different memory management; these zones usually have the same reserved/committed amount.

Or with a picture :

JVM memory allocations

This graph brings the following definitions:

**Used Heap**: The amount of memory occupied by live objects and, to a certain extent, objects that are unreachable but not yet collected by the GC. This only relates to the JVM Java heap.

**Committed Heap**: The current limit of the writable memory to write objects to. It’s the current workspace of the GC. Upon JVM bootstrap this value should be equal to `Xms`; then the GC may expand it up to the Java heap reserved memory, or in Java terms the heap max size, `Xmx`.

**Heap Max Size**: The maximum amount of memory that the Java heap can occupy. It’s the reserved amount in the Java Heap section of the NMT output. If the application requires more memory than this, it gets an `OutOfMemoryError`.
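These three numbers can be observed from within the JVM through the standard `MemoryMXBean`; a small sketch (for the heap, used ≤ committed always holds, and committed ≤ max whenever max is defined):

```java
import java.lang.management.ManagementFactory;

public class HeapUsage {
    public static void main(String[] args) {
        var heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println("used:      " + heap.getUsed());      // live objects plus not-yet-collected garbage
        System.out.println("committed: " + heap.getCommitted()); // the GC's current workspace
        System.out.println("max:       " + heap.getMax());       // the reserved heap, i.e. -Xmx (-1 if undefined)
        System.out.println(heap.getUsed() <= heap.getCommitted()
                && (heap.getMax() < 0 || heap.getCommitted() <= heap.getMax()));
    }
}
```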

So committed stands for writable memory, and reserved stands for the total addressable space of the memory pool. How does it work concretely?

The JVM starts by reserving the memory; then parts of this "reserve" are made available by modifying the memory mappings, using `malloc`, `mmap`, and in particular `mprotect` calls (on Linux).

##### `malloc` and `mmap`

The `malloc` and `mmap` C calls ask the OS to allocate memory. The OS will then provide the application with the necessary memory, or report an error if that is not possible.

Also, depending on the mapping, in particular for `mmap`, the OS can be asked to make a file accessible as a memory zone; in short, it’s the kernel that performs the IOs, in contrast to the application performing IOs itself via a file descriptor.

Differences between `malloc` and `mmap`
• `malloc` may recycle previously used memory that was released by `free`, and performs a system call to get memory only when required. It’s part of the C standard.

• `malloc` allows you to pass a size, and that’s basically it.

• `mmap` is a system call. It’s not part of the C standard, and may not be available on all platforms.

• `mmap` can map both private memory and shared memory (as in shared with other processes). Mappings not backed by a file are called anonymous mappings, requested with the flag `MAP_ANONYMOUS`.

• `mmap` can also map specific ranges of disk files; once mapped, the file content is accessed without read/write calls on a file descriptor.

• `mmap` can be given various flags that control how the memory mapping behaves.

• Both have their performance characteristics: `malloc` is usually preferred for many small allocations, `mmap` for few but large allocations.

When the JVM bootstraps, it reserves a main memory area of a certain size with the `PROT_NONE` flag to prevent any access. This tells the OS that this mapping should not be backed by physical memory. Then, when memory is needed by the program, the JVM changes the mapping for a sub-range of that main area by removing the `PROT_NONE` protection. When new Java threads are created, the JVM will simply request other memory segments.

Simple C code example

1. that reserves `16 MiB` via a `malloc` call and `16 MiB` via the `mmap` call

2. then this program will invoke `ps` to show its actual memory consumption (RSS)

3. then it will touch/use memory by setting a bit every `1 KiB`

4. then this program will invoke `ps` again to show its actual memory consumption (RSS)

Memory example
``````#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

#define HEAP_SIZE (16 * 1024 * 1024 * sizeof(char))

int main (int argc, char *argv[])
{
    char *heap1 = malloc(HEAP_SIZE);
    char *heap2 = mmap(0,
                       HEAP_SIZE,
                       PROT_NONE | PROT_WRITE,
                       MAP_PRIVATE | MAP_NORESERVE | MAP_ANONYMOUS,
                       -1,
                       0);

    pid_t pid = getpid();
    printf("pid: %d\n", pid);

    char buffer[50];

    sprintf(buffer, "ps -p %d -o rss,vsz,command", pid);
    printf("Executing: '%s'\n", buffer);
    system(buffer);

    printf("Writing to some pages, but not all\n");

    for (char* i = heap1; i < (heap1 + HEAP_SIZE / 16); i += 1024) {
        *i = 0x01;
    }
    for (char* i = heap2; i < (heap2 + HEAP_SIZE / 8); i += 1024) {
        *i = 0x01;
    }

    sprintf(buffer, "ps -p %d -o rss,vsz,command", pid);
    printf("Executing: '%s'\n", buffer);
    system(buffer);

    free(heap1);
    munmap(heap2, HEAP_SIZE);

    return 0;
}``````
Result (Linux / llvm)
``````$ clang -Wall -Wpedantic -o test-alloc test-alloc.c && ./test-alloc
pid: 2904
Executing: 'ps -p 2904 -o rss,vsz,command'
  RSS      VSZ COMMAND
  708  4301956 ./test-mem
Writing to some pages, but not all
Executing: 'ps -p 2904 -o rss,vsz,command'
  RSS      VSZ COMMAND
 3780  4301956 ./test-mem``````

As the `stdout` shows, the RSS of this program stays very low until memory is actually written to. At the same time the virtual memory is much, much higher; it means this simple program could address up to about `4 GiB`. This program ran on a MacBook Pro 2018 with an Intel Core i7 CPU.

Now, after this memory management refresher, let’s go back to the main topic of this blog post.

#### Exploring what NMT does not show

The previous section walked through the reported numbers, which represent the sizes of the different JVM memory zones, but it does not reveal the effective usage. The JVM components can use different types of memory management and as such may have multiple allocation mechanisms. NMT reports the different allocation types, for example:

1. GC based

The `Java heap` and the `Metaspace` (`Class`) are usually the biggest consumers of memory; they both rely on `mmap`.

Java heap and metaspace
``````- Java Heap (reserved=3145728KB, committed=3145728KB)
(mmap: reserved=3145728KB, committed=3145728KB)

-                     Class (reserved=1195111KB, committed=164967KB)
(classes #27354)
(  instance classes #25689, array classes #1665)
(malloc=5223KB #86596)
(mmap: reserved=1189888KB, committed=159744KB)``````

These two memory zones are interesting in that they are managed by the GC algorithm; put another way, the GC is actually the memory manager of these zones, able to arrange the memory according to the options passed on the command line. E.g. with a fixed size heap (`Xms` = `Xmx`), the heap will consist of one large memory segment, in which case the reserved and committed values will be the same as well.

Other options may trigger specific behavior for these memory zones, e.g. make the heap grow or shrink (I never saw that in practice; maybe I’ll see it once I use a JDK 12+ with heap uncommit from JEP-346, although even the JEP mentions it will only happen if there is very low activity, which is unlikely for some workloads).

2. Threads

The Java threads are constructs controlled by the JVM runtime; each thread is allocated in the addressable space, and their allocation size is always the same, controllable via a few JVM parameters. Their usage depends on the application: e.g. if the program requests 1000 threads, then the JVM needs to allocate 1000 thread stacks.

Thread
``````- Thread (reserved=533903KB, committed=70439KB)
(thread #517)
(stack: reserved=531432KB, committed=67968KB)    (1)
(malloc=1866KB #3103)                            (2)
(arena=605KB #1033)                              (3)``````

1. The stack memory is where the JVM puts the thread stacks; it’s the sum of all thread stack memory mappings.
2. The thread sub-system performed 3103 `malloc` calls amounting to `1866 KiB`.
3. The thread local handles required 1033 arenas, amounting to `605 KiB`.

3. Other native zones

The other components reported by NMT use different techniques, sometimes a combination of them. The `GC` zone for example only works with `malloc` and `mmap`, and its size can grow as needed.

GC
``````- GC (reserved=180505KB, committed=180505KB)
(malloc=30589KB #219593)                         (1)
(mmap: reserved=149916KB, committed=149916KB)    (2)``````

1. Here the GC performed 219593 `malloc` calls amounting to `30589 KiB`.
2. Here the GC reserved and committed memory segment(s) amount to `149916 KiB`.

The JVM also implements its own arena based memory management (distinct from the arena memory management of glibc). It is used by some subsystems of the JVM, or when native code uses internal objects that rely on JVM arenas [1] [2]. `Compiler` and `Symbol table` use this memory management for example. Special mention of the thread local handles, which also use JVM arenas.

NMT reports all the memory allocation techniques used by a JVM component, for example the compiler:

compiler
``````- Compiler (reserved=6666KB, committed=6666KB)
(malloc=6533KB #3575)                            (1)
(arena=133KB #5)                                 (2)``````

1. The compiler performed 3575 `malloc` calls amounting to `6533 KiB`.
2. The compiler uses 5 arenas totaling `133 KiB`.

#### Track `DirectByteBuffer` with NMT

Using NMT `baseline` and `summary.diff` modes, it is possible to track the evolution of the JVM components. `DirectByteBuffer`s allow allocating native memory segments. They are not cheap to create, and they are only deallocated when a GC actually finalizes their references. Usually these byte buffers live long and are big. The following snippet of code will show that they are reported in the `Other` section of NMT. Note that in this snippet I’m just invoking the external process `jcmd` for brevity and clarity, but it’s possible to invoke the diagnostic command in pure Java.

Exercise `DirectByteBuffer` and NMT
``````// env -u JDK_JAVA_OPTIONS java -XX:NativeMemoryTracking=summary DBB.java 1 1
import java.nio.*;
import java.lang.ProcessBuilder.*;

public class DBB {
    public static void main(String[] args) throws Exception {
        System.out.printf("nmt baseline: %n");
        new ProcessBuilder("jcmd", Long.toString(ProcessHandle.current().pid()), "VM.native_memory", "baseline")
                .redirectOutput(Redirect.INHERIT)
                .redirectError(Redirect.INHERIT)
                .start()
                .waitFor();

        var bbCount = Integer.parseInt(args[0]);
        var bbSizeMiB = Integer.parseInt(args[1]);
        for (var i = 0; i < bbCount; i++) {
            var byteBuffer = ByteBuffer.allocateDirect(bbSizeMiB * 1024 * 1024)
                                       .putInt(0, 0x01);
        }

        System.out.printf("nmt summary.diff: %n");
        new ProcessBuilder("jcmd", Long.toString(ProcessHandle.current().pid()), "VM.native_memory", "summary.diff")
                .redirectOutput(Redirect.INHERIT)
                .redirectError(Redirect.INHERIT)
                .start()
                .waitFor();
    }
}``````

1 x 1MiB
``````$ env -u JDK_JAVA_OPTIONS java -XX:NativeMemoryTracking=summary DBB.java 1 1
nmt baseline:
779:
Baseline succeeded
nmt summary.diff:
779:

Native Memory Tracking:

Total: reserved=1916470KB +1027KB, committed=113950KB +1031KB

-                 Java Heap (reserved=509952KB, committed=32768KB)
(mmap: reserved=509952KB, committed=32768KB)

...

-                     Other (reserved=1034KB +1024KB, committed=1034KB +1024KB) (1)
(malloc=1034KB +1024KB #3 +1) (2)

...``````
1. The `DirectByteBuffer` of `1 MiB`.
2. `DirectByteBuffer`s use `malloc` underneath.
10 x 1MiB
``````$ env -u JDK_JAVA_OPTIONS java -XX:NativeMemoryTracking=summary DBB.java 10 1
nmt baseline:
839:
Baseline succeeded
nmt summary.diff:
839:

Native Memory Tracking:

Total: reserved=1933553KB +10243KB, committed=132061KB +10247KB

-                 Java Heap (reserved=509952KB, committed=32768KB)
(mmap: reserved=509952KB, committed=32768KB)

...

-                     Other (reserved=10250KB +10240KB, committed=10250KB +10240KB) (1)
(malloc=10250KB +10240KB #12 +10) (2)

...``````

1. The 10 `DirectByteBuffer`s of `1 MiB`.
2. `DirectByteBuffer`s use `malloc` underneath.

20 x 100MiB
``````$ env -u JDK_JAVA_OPTIONS java -XX:NativeMemoryTracking=summary DBB.java 20 100
nmt baseline:
898:
Baseline succeeded
nmt summary.diff:
898:

Native Memory Tracking:

Total: reserved=2323817KB +409608KB, committed=498961KB +386252KB

-                 Java Heap (reserved=509952KB, committed=10240KB -22528KB) (3)
(mmap: reserved=509952KB, committed=10240KB -22528KB)

...

-                     Other (reserved=409610KB +409600KB, committed=409610KB +409600KB) (1)
(malloc=409610KB +409600KB #6 +4) (2)

...``````
1. The 20 `DirectByteBuffer`s of `100 MiB`. Uh wait, `409600 KiB` is nowhere near ~`2 GiB` (`2048000 KiB`); it looks more like 4 buffers of `100 MiB`.
2. `DirectByteBuffer`s use `malloc` underneath.
3. This time there is also a reduction in the Java heap.

As one can see the total reserved and committed memory are actually increased by the amount of allocated memory.

The last exercise, `20 x 100 MiB`, is more captivating: the low amount of memory allocated by `DirectByteBuffer`s is simply explained by the GC kicking in. If you run the last command with `-Xlog:gc*`, you’ll notice 4 Full GCs happening in the middle of the loop.

``[1.671s][info][gc,start       ] GC(4) Pause Full (System.gc())``

The above code doesn’t keep strong references to the wrapping buffers, thus allowing these objects to be GCed. If the references to these byte buffers were kept, this program would have exited with a `java.lang.OutOfMemoryError: Direct buffer memory`.

It’s not part of this article, but it’s well worth understanding how `DirectByteBuffer`s handle their garbage collection (using a `Cleaner`).
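For the curious, the native segment of a `DirectByteBuffer` can also be released eagerly, without waiting for a GC, via the unsupported `sun.misc.Unsafe.invokeCleaner` API (Java 9+); a hedged sketch:

```java
import java.nio.ByteBuffer;
import sun.misc.Unsafe;

public class EagerFree {
    public static void main(String[] args) throws Exception {
        // Grab the Unsafe singleton reflectively (not meant for application code)
        var f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        var unsafe = (Unsafe) f.get(null);

        var buffer = ByteBuffer.allocateDirect(100 * 1024 * 1024); // 100 MiB off-heap
        buffer.putInt(0, 0x01);
        unsafe.invokeCleaner(buffer); // native memory released immediately
        System.out.println("cleaned"); // any further access to `buffer` is undefined behavior
    }
}
```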

Now, I mentioned that there were 4 Full GCs; that should have raised eyebrows. If it didn’t, the Full GC cause should catch your attention: `System.gc()`. Pretending I don’t know where these came from, I’ll search where they are happening.

``````$ env -u JDK_JAVA_OPTIONS java -XX:NativeMemoryTracking=summary \
    -agentpath:async-profiler-1.8.2-linux-x64/build/libasyncProfiler.so=start,event=java.lang.System.gc,traces,file=traces.txt \
    DBB.java 20 100 > /dev/null 2>&1
$ cat traces.txt
--- Execution profile ---
Total samples       : 4

Frame buffer usage  : 0.0012%

--- 4 calls (100.00%), 4 samples
[ 0] java.lang.System.gc
[ 1] java.nio.Bits.reserveMemory
[ 2] java.nio.DirectByteBuffer.<init>
[ 3] java.nio.ByteBuffer.allocateDirect
[ 4] DBB.main
[ 5] jdk.internal.reflect.NativeMethodAccessorImpl.invoke0
[ 6] jdk.internal.reflect.NativeMethodAccessorImpl.invoke
[ 7] jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke
[ 8] java.lang.reflect.Method.invoke
[ 9] com.sun.tools.javac.launcher.Main.execute
[10] com.sun.tools.javac.launcher.Main.run
[11] com.sun.tools.javac.launcher.Main.main``````

By default, the VM limits the total capacity of direct byte buffers to roughly the size of the heap. This can be tuned via `-XX:MaxDirectMemorySize`. The 4 Full GC cycles for 20 `allocateDirect()` calls mean that after 4 successful allocations, the 5th one requires a `System.gc()`; this suggests a max direct memory limit in the range `[419430400;524288000[`, and indeed the reported size of the Java Heap section is `522190848` bytes (`509952 KiB`).
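A back-of-the-envelope check of that reasoning, assuming the direct memory limit defaults to the reported heap max size:

```java
public class DirectLimit {
    public static void main(String[] args) {
        // Assumption: the direct memory limit defaults to the heap max size reported above
        long limit = 522_190_848L;            // 509952 KiB
        long bufferSize = 100L * 1024 * 1024; // each DirectByteBuffer is 100 MiB

        long fitting = limit / bufferSize;    // whole buffers that fit under the limit -> 4
        System.out.println(fitting);          // the next allocation triggers System.gc()

        long remaining = 20 - fitting;        // the loop allocates 20 buffers in total
        System.out.println(remaining / fitting); // collections needed for the rest -> 4
    }
}
```

Four buffers fit, and four collections free room for the remaining sixteen, matching the four `System.gc()` samples in the profile.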

#### Track memory mapped file with NMT

Using NMT `baseline` and `summary.diff` modes, is it possible to track memory mapped file usage? Let’s find out.

Exercise `MappedByteBuffer` and NMT
``````package sandbox;

import java.nio.channels.FileChannel;
import java.nio.file.*;

public class MappedFiles {
    public static void main(String[] args) throws Exception {
        System.out.printf("nmt baseline: %n");
        new ProcessBuilder("jcmd", Long.toString(ProcessHandle.current().pid()), "VM.native_memory", "baseline")
                .start()
                .waitFor();

        Path src = Paths.get("/usr/lib/jvm/java-11-amazon-corretto/lib/src.zip"); (1)
        try (var fileChannel = (FileChannel) Files.newByteChannel(src, StandardOpenOption.READ)) {
            var mappedByteBuffer = fileChannel.map(
                    FileChannel.MapMode.READ_ONLY,
                    0,                   (2)
                    fileChannel.size()); (2)
            mappedByteBuffer.load();     (3)

            System.out.printf("nmt summary.diff: %n");
            new ProcessBuilder("jcmd", Long.toString(ProcessHandle.current().pid()), "VM.native_memory", "summary.diff")
                    .redirectOutput(ProcessBuilder.Redirect.INHERIT)
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start()
                    .waitFor();
        }
    }
}``````
 1 Opens a binary file about `50 MiB` in size.
 2 Range of the memory mapping starts at `0`, up to the total file size.
 3 The `load` method will actually instruct the OS to load the range defined above in resident memory.
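As a side note on callout 3, here is a minimal standalone sketch (the temporary file and its size are made up for the example) of `map` followed by `load`. Note that `isLoaded()` is only a best-effort hint and may return `false` on some platforms even right after `load()`:

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public class MapAndLoad {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("mmap-demo", ".bin");
        Files.write(tmp, new byte[1 << 20]); // a 1 MiB file
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            buf.load(); // ask the OS to fault the whole range into resident memory
            // isLoaded() is a hint, not a guarantee, so don't rely on it
            System.out.println(buf.capacity() + " bytes mapped, loaded hint: " + buf.isLoaded());
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```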

Let’s look at what NMT reports.

``````$ env -u JDK_JAVA_OPTIONS java -XX:NativeMemoryTracking=summary MappedFiles.java
nmt baseline:
nmt summary.diff:
1760:

Native Memory Tracking:

Total: reserved=1929764KB -1028KB, committed=127588KB -44KB

-                 Java Heap (reserved=509952KB, committed=32768KB)
                            (mmap: reserved=509952KB, committed=32768KB)

-                     Class (reserved=1065377KB +1KB, committed=16929KB +1KB)
                            (classes #2650 +17)
                            (  instance classes #2378 +15, array classes #272 +2)
                            (malloc=417KB +1KB #5031 +35)
                            (mmap: reserved=1064960KB, committed=16512KB)
                            (  Metadata:   )
                            (    reserved=16384KB, committed=14592KB)
                            (    used=14167KB +34KB)
                            (    free=425KB -34KB)
                            (    waste=0KB =0.00%)
                            (  Class space:)
                            (    reserved=1048576KB, committed=1920KB)
                            (    used=1720KB +9KB)
                            (    free=200KB -9KB)
                            (    waste=0KB =0.00%)

-                    Thread (reserved=19723KB -1032KB, committed=1027KB -48KB)
                            (thread #20 -1)
                            (stack: reserved=19632KB -1028KB, committed=936KB -44KB)
                            (malloc=69KB -4KB #122 -6)
                            (arena=22KB #38 -1)

-                      Code (reserved=247935KB +1KB, committed=7795KB +1KB)
                            (malloc=247KB +1KB #1692 +9)
                            (mmap: reserved=247688KB, committed=7548KB)

-                        GC (reserved=60330KB, committed=42622KB)
                            (malloc=8570KB #1516 +1)
                            (mmap: reserved=51760KB, committed=34052KB)

-                  Compiler (reserved=154KB -1KB, committed=154KB -1KB)
                            (malloc=21KB #138 -6)
                            (arena=133KB -1 #5 -1)

-                  Internal (reserved=579KB, committed=579KB)
                            (malloc=547KB #1040 -1)
                            (mmap: reserved=32KB, committed=32KB)

-                     Other (reserved=10KB, committed=10KB)
                            (malloc=10KB #2)

-                    Symbol (reserved=4386KB, committed=4386KB)
                            (malloc=3163KB #28643 +18)
                            (arena=1223KB #1)

-    Native Memory Tracking (reserved=650KB +2KB, committed=650KB +2KB)
                            (malloc=7KB +1KB #94 +18)
                            (tracking overhead=643KB +1KB)

-               Arena Chunk (reserved=20529KB +1KB, committed=20529KB +1KB)
                            (malloc=20529KB +1KB)

-                   Logging (reserved=4KB, committed=4KB)
                            (malloc=4KB #191)

-                 Arguments (reserved=18KB, committed=18KB)
                            (malloc=18KB #492)

-                    Module (reserved=60KB, committed=60KB)
                            (malloc=60KB #1041)

-              Synchronizer (reserved=48KB, committed=48KB)
                            (malloc=48KB #404 -2)

-                 Safepoint (reserved=8KB, committed=8KB)
                            (mmap: reserved=8KB, committed=8KB)``````

Nothing. We’ll see in a later section how much memory mapped files can account for in the resident memory. As a side note before switching to OS tooling, the memory segment used for the memory mapping is not freed until the next GC cycle.

#### Inspecting memory mappings

It’s easy to get the RSS of a process, but to understand whether the committed heap actually resides in physical memory you need to use `pmap` or inspect `/proc/{pid}/maps` or `/proc/{pid}/smaps`. The `pmap` binary is part of the `procps` utilities, which also contain tools like `ps`, `pgrep`, `watch` or `vmstat`. It’s likely that no additional installation is required, which is great as a container filesystem should be read-only for security reasons; if `pmap` isn’t there, one can still look at the `/proc` filesystem.

You will notice that one of the first memory zones is quite big, about the size of the committed heap as shown in NMT. The mappings can be distinguished by their access permissions:

• `r`: readable memory mapping
• `w`: writable memory mapping
• `x`: executable memory mapping
• `s` or `p`: shared memory mapping or private mapping (as shown in `/proc/{pid}/maps`)

`pmap` may show another mapping mode which I barely found any reference of (here’s one, and here):

• `R`: if set, the map has no swap space reserved (`MAP_NORESERVE` flag of `mmap`). This means that we can get a segmentation fault by accessing that memory if it has not already been mapped to physical memory, and if the system is out of physical memory.

There’s also the value of the inode column: if it’s greater than `0`, the address range is backed by a file; if it’s `0`, it’s a memory allocation that the application has requested.

Identifying JVM memory components

There are memory segments we can easily guess in the memory mapping reported by `pmap` because we know their size: the Java heap, and the thread stacks.
Some other types of allocations can be figured out, but that’s for another post. The remaining address ranges are too difficult to guess, for two reasons: they usually have unpredictable allocation behavior, and they depend on the `malloc` implementation details (like the arenas in glibc) and on the number of `malloc` calls for a single component.

On a pod running in production, let’s have a quick look at the very first mappings. It’s easier to spot things with `pmap -X` (capital `X`).

`pmap -x {pid}`
``````$ pmap -x $(pidof java) | head -n 20
7: /usr/bin/java -Dfile.encoding=UTF-8 -Duser.timezone=UTC -Djava.security.egd=file:/dev/./urandom -Djava.awt.headless=true -XX:NativeMemoryTracking=summary -jar /app/boot.jar
Address Kbytes RSS Dirty Mode Mapping
0000000740000000 3163648 3163648 3163648 rw--- [ anon ] (1)
0000000801180000 1030656 0 0 ----- [ anon ]
000055bac4461000 4 4 0 r-x-- java
000055bac4662000 4 4 4 r---- java
000055bac4663000 4 4 4 rw--- java
000055bac569c000 455704 438268 438268 rw--- [ anon ] (2)
00007ff9b91e7000 16 0 0 ----- [ anon ]
00007ff9b91eb000 1012 24 24 rw--- [ anon ]
00007ff9b92e8000 16 0 0 ----- [ anon ] (3)
00007ff9b92ec000 1012 92 92 rw--- [ anon ] (4)
00007ff9b93e9000 16 0 0 ----- [ anon ]
00007ff9b93ed000 1012 88 88 rw--- [ anon ]
00007ff9b94ea000 16 0 0 ----- [ anon ]
00007ff9b94ee000 1012 24 24 rw--- [ anon ]
00007ff9b95eb000 16 0 0 ----- [ anon ]
00007ff9b95ef000 1012 28 28 rw--- [ anon ]
00007ff9b96ec000 16 0 0 ----- [ anon ]
00007ff9b96f0000 1012 24 24 rw--- [ anon ]``````

 1 the Java heap
 2 the native heap of the process
 3 a thread’s guard pages
 4 a thread’s stack

`pmap -X {pid}`
``````$ pmap -X $(pidof java) | head -n 20
7: /usr/bin/java -Dfile.encoding=UTF-8 -Duser.timezone=UTC -Djava.security.egd=file:/dev/./urandom -Djava.awt.headless=true -XX:NativeMemoryTracking=summary -javaagent:/newrelic-agent.jar -javaagent:/dd-java-agent.jar -jar /edge-api-boot.jar --spring.config.additional-location=/etc/edge-api/config.yaml --server.port=8080
Address Perm Offset Device Inode Size Rss Pss Referenced Anonymous LazyFree ShmemPmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked THPeligible Mapping
740000000 rw-p 00000000 00:00 0 3163648 3163648 3163648 3163648 3163648 0 0 0 0 0 0 0 0 (1)
801180000 ---p 00000000 00:00 0 1030656 0 0 0 0 0 0 0 0 0 0 0 0
55bac4461000 r-xp 00000000 08:01 5623642 4 4 4 4 0 0 0 0 0 0 0 0 0 java
55bac4662000 r--p 00001000 08:01 5623642 4 4 4 4 4 0 0 0 0 0 0 0 0 java
55bac4663000 rw-p 00002000 08:01 5623642 4 4 4 4 4 0 0 0 0 0 0 0 0 java
55bac569c000 rw-p 00000000 00:00 0 455704 438268 438268 438268 438268 0 0 0 0 0 0 0 0 [heap] (2)
7ff9b91e7000 ---p 00000000 00:00 0 16 0 0 0 0 0 0 0 0 0 0 0 0
7ff9b91eb000 rw-p 00000000 00:00 0 1012 28 28 28 28 0 0 0 0 0 0 0 0
7ff9b92e8000 ---p 00000000 00:00 0 16 0 0 0 0 0 0 0 0 0 0 0 0 (3)
7ff9b92ec000 rw-p 00000000 00:00 0 1012 92 92 92 92 0 0 0 0 0 0 0 0 (4)
7ff9b93e9000 ---p 00000000 00:00 0 16 0 0 0 0 0 0 0 0 0 0 0 0
7ff9b93ed000 rw-p 00000000 00:00 0 1012 88 88 88 88 0 0 0 0 0 0 0 0
7ff9b94ea000 ---p 00000000 00:00 0 16 0 0 0 0 0 0 0 0 0 0 0 0
7ff9b94ee000 rw-p 00000000 00:00 0 1012 24 24 24 24 0 0 0 0 0 0 0 0
7ff9b95eb000 ---p 00000000 00:00 0 16 0 0 0 0 0 0 0 0 0 0 0 0
7ff9b95ef000 rw-p 00000000 00:00 0 1012 28 28 28 28 0 0 0 0 0 0 0 0
7ff9b96ec000 ---p 00000000 00:00 0 16 0 0 0 0 0 0 0 0 0 0 0 0
7ff9b96f0000 rw-p 00000000 00:00 0 1012 24 24 24 24 0 0 0 0 0 0 0 0``````

 1 the Java heap
 2 the native heap of the process
 3 a thread’s guard pages
 4 a thread’s stack

`/proc/{pid}/maps`
``````$ cat /proc/$(pidof java)/maps | head -n 20
740000000-801180000 rw-p 00000000 00:00 0 (1)
801180000-840000000 ---p 00000000 00:00 0
55bac4461000-55bac4462000 r-xp 00000000 08:01 5623642 /usr/lib/jvm/java-11-amazon-corretto/bin/java
55bac4662000-55bac4663000 r--p 00001000 08:01 5623642 /usr/lib/jvm/java-11-amazon-corretto/bin/java
55bac4663000-55bac4664000 rw-p 00002000 08:01 5623642 /usr/lib/jvm/java-11-amazon-corretto/bin/java
55bac569c000-55bae13a2000 rw-p 00000000 00:00 0 [heap] (2)
7ff9b91e7000-7ff9b91eb000 ---p 00000000 00:00 0
7ff9b91eb000-7ff9b92e8000 rw-p 00000000 00:00 0
7ff9b92e8000-7ff9b92ec000 ---p 00000000 00:00 0 (3)
7ff9b92ec000-7ff9b93e9000 rw-p 00000000 00:00 0 (4)
7ff9b93e9000-7ff9b93ed000 ---p 00000000 00:00 0
7ff9b93ed000-7ff9b94ea000 rw-p 00000000 00:00 0
7ff9b94ea000-7ff9b94ee000 ---p 00000000 00:00 0
7ff9b94ee000-7ff9b95eb000 rw-p 00000000 00:00 0
7ff9b95eb000-7ff9b95ef000 ---p 00000000 00:00 0
7ff9b95ef000-7ff9b96ec000 rw-p 00000000 00:00 0
7ff9b96ec000-7ff9b96f0000 ---p 00000000 00:00 0
7ff9b96f0000-7ff9b97ed000 rw-p 00000000 00:00 0
7ff9b97ed000-7ff9b97f1000 ---p 00000000 00:00 0
7ff9b97f1000-7ff9b99ee000 rw-p 00000000 00:00 0``````

 1 the Java heap
 2 the native heap of the process
 3 a thread’s guard pages
 4 a thread’s stack

The first thing to notice is that `pmap` chooses to display the start address and the size of the mapping in separate columns, while the `maps` file uses address ranges. As you might have guessed, the sum of the sizes of these mappings is the value one can see in the `vsz` column of `ps`.

Explanation:

1. `740000000-801180000` (`3163648 KiB`), around `3 GiB` in a single mapping: this looks like the Java heap. Subtracting the addresses gives `3239575552`, which is very close to the actual VM flag for the heap, `-XX:MaxHeapSize=3221225472`; the JVM must map some additional space. We also note that the RSS of this mapping is equal to its size, which means that either the `-XX:+AlwaysPreTouch` flag is active or all pages of the heap have been touched at least once; for this app it is the former. This single address range also indicates that the minimum and maximum heap sizes are equal (`Xmx` = `Xms`). If they weren’t, we would have seen two adjacent segments with different permissions (`rw-p` then `---p`); the JVM can grow the read-and-write segment of the Java heap.
Just under this mapping there’s another one, `801180000-840000000` (`1030656 KiB`), around `1 GiB`. One could think it’s the Metaspace, but it isn’t: looking at the other columns, the permissions and the RSS are respectively `---p` and `0`, which means this memory segment is reserved but not writable. Finding the Metaspace cannot be done this way.
2. `55bac569c000-55bae13a2000`: in the extended `pmap` output this mapping is named `[heap]`; it is the native heap of the Java process. Note that the next mapping address (`7ff9b91e7000`) is not adjacent, which allows the native heap to grow if necessary. The virtual size of this mapping is `~445 MiB` and the touched pages amount to `~428 MiB`.
3. Then there are many mappings with this pattern: first `16 KiB` with no permission (`---p`), immediately followed by a `1012 KiB` segment with read and write permissions (`rw-p`). These are the Java threads. By default the virtual size of a thread stack is `1 MiB`; the `ThreadStackSize` flag controls this maximum stack size. The `16 KiB` are the thread guard pages; the number of pages (`4 KiB` each) is controlled by `StackReservedPages`, `StackYellowPages` and `StackRedPages`, whose defaults are respectively `1`, `2`, and `1`. They are used when a stack overflow happens: normally the guard pages cannot be written to, and their permissions change in order to handle the error; read this explanation from Andrei Pangin to learn more on this topic.

For the keen observer, the combined virtual size of these two memory segments is `1028 KiB`, a bit more than `1 MiB`. I learned a few months ago that glibc, and apparently other allocators too, adds one page to the allocated stack size if the segment size is a multiple of 64 KiB. This is to prevent aliasing in the CPU cache:

A 64K-aliasing conflict occurs when a virtual address memory references a cache line that is modulo 64K bytes apart from another cache line that already resides in the first level cache. Only one cache line with a virtual address modulo 64K bytes can reside in the first level cache at the same time. For example, accessing a byte at virtual addresses 0x10000 and 0x3000F would cause a 64K aliasing conflict. This is because the virtual addresses for the two bytes reside on cache lines that are modulo 64K bytes apart.

In other words, one may see an additional `4 KiB` (a page) for stack sizes like `512 KiB`, `256 KiB`, `128 KiB`, `64 KiB`. That being said, if pages in a mapping are not touched, they do not count as resident memory: the RSS of the `7ff9b92ec000` stack segment tells us that thread’s stack used at most `92 KiB`. Anyway, with more threads there will naturally be more resident memory consumed.

The other JVM components are harder to identify due to the way they are allocated. That being said, `pmap` reveals file-backed memory mappings, and these consume pages too.

#### Inspecting memory mapped files

The `NativeMemoryTracking` output showed the memory usage of the JVM, but it didn’t report `MappedByteBuffer`s, the files that are memory mapped into the virtual memory of a process via the native `mmap` call, as explained above. There are two ways to read a file: via a file descriptor, which generally happens when opening a `FileInputStream`, or via memory mapping with a `FileChannel`. When a file is memory mapped, its content range is divided into pages too; when accessed, those pages are copied into RAM by the OS and accounted in RSS. For this reason they may deserve some attention when RSS usage is high but the application’s own memory doesn’t account for it.

The `Mapping` column of `pmap -x $(pgrep java)` can be parsed to identify file mappings, but this is brittle and unnecessary: one can simply look at the output of `pmap -X $(pgrep java)` (notice the capital `X`), or even at the `/proc/$(pidof java)/maps` content, looking for a non-zero value in the `inode` column, meaning this mapping is file backed.
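As a quick illustration of this inode-based selection, here is a hypothetical helper that parses lines in the `/proc/{pid}/maps` format and sums the virtual size of file-backed mappings; it is fed sample lines from the output above, but in a real run you would feed it the content of `/proc/self/maps`:

```java
import java.util.List;

public class FileBackedMappings {
    // Sums the virtual size (in bytes) of file-backed mappings,
    // i.e. lines whose inode column (5th field) is greater than 0.
    static long fileBackedBytes(List<String> mapsLines) {
        long sum = 0;
        for (String line : mapsLines) {
            String[] f = line.trim().split("\\s+");
            if (f.length >= 5 && Long.parseLong(f[4]) > 0) { // f[4] is the inode column
                String[] range = f[0].split("-");
                sum += Long.parseLong(range[1], 16) - Long.parseLong(range[0], 16);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Sample lines taken from the maps output shown above
        var lines = List.of(
            "740000000-801180000 rw-p 00000000 00:00 0",
            "55bac4461000-55bac4462000 r-xp 00000000 08:01 5623642 /usr/lib/jvm/java-11-amazon-corretto/bin/java",
            "55bac4662000-55bac4663000 r--p 00001000 08:01 5623642 /usr/lib/jvm/java-11-amazon-corretto/bin/java");
        System.out.println(fileBackedBytes(lines)); // two 4 KiB file-backed mappings -> prints 8192
    }
}
```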

Using the output of `pmap -X $(pgrep java)` and selecting the matching lines with `awk`, this is easy.

Shared application memory mapped files
``````$ pmap -X $(pidof java) \
    | head -n -2 \ (4)
    | awk '{ if (NR <= 2 || $5 > 0) \ (1)
        printf "%12s %8s %8s %4s %s\n", \ (2)
            $1, \
            $6, \
            $7, \
            $2, \
            $19 }' (2)
7: -Djava.awt.headless=true -XX:NativeMemoryTracking=summary /usr/bin/java
Address Size Rss Perm Mapping (3)
561ddb94a000 4 4 r-xp java
561ddbb4b000 4 4 r--p java
561ddbb4c000 4 4 rw-p java
7f355521f000 4 4 r--s instrumentation9549273990865322165.jar
7f355964d000 4 4 r--s instrumentation14393425676176063484.jar
7f3559e50000 1160 1160 r--s dd-java-agent.jar
7f355a372000 256 192 r-xp libsunec.so
7f355a3b2000 2048 0 ---p libsunec.so
7f355a5b2000 20 20 r--p libsunec.so
7f355a5b7000 8 8 rw-p libsunec.so
7f355a7b9000 16 16 r--p libresolv-2.28.so
7f355a7bd000 52 52 r-xp libresolv-2.28.so
7f355a7ca000 16 16 r--p libresolv-2.28.so
7f355a7ce000 4 0 ---p libresolv-2.28.so
7f355a7cf000 4 4 r--p libresolv-2.28.so
7f355a7d0000 4 4 rw-p libresolv-2.28.so
7f355a7d3000 4 4 r--p libnss_dns-2.28.so
7f355a7d4000 16 16 r-xp libnss_dns-2.28.so
7f355a7d8000 4 0 r--p libnss_dns-2.28.so
7f355a7d9000 4 4 r--p libnss_dns-2.28.so
7f355a7da000 4 4 rw-p libnss_dns-2.28.so
7f355a7dd000 4 4 r--s instrumentation13129117816180832587.jar
7f355a7de000 8 8 r-xp libextnet.so
7f355a7e0000 2044 0 ---p libextnet.so
7f355a9df000 4 4 r--p libextnet.so
7f355b9e9000 4 4 r--s newrelic-bootstrap1151474907525430822.jar
7f355bfea000 24 24 r-xp libmanagement_ext.so
7f355bff0000 2044 0 ---p libmanagement_ext.so
7f355c1ef000 4 4 r--p libmanagement_ext.so
7f355c1f0000 4 4 rw-p libmanagement_ext.so
7f355c1f1000 16 16 r-xp libmanagement.so
7f355c1f5000 2048 0 ---p libmanagement.so
7f355c3f5000 4 4 r--p libmanagement.so
7f355c5f7000 8 8 r--s newrelic-weaver-api14962018995408739070.jar
7f355c5f9000 12 12 r--s newrelic-api8237374132620194936.jar
7f355c5fc000 4 4 r--s newrelic-opentracing-bridge6621669571490510163.jar
7f355c5fd000 16 16 r--s agent-bridge7978421659510986627.jar
7f355c601000 88 88 r-xp libnet.so
7f355c617000 2048 0 ---p libnet.so
7f355c817000 4 4 r--p libnet.so
7f355c818000 4 4 rw-p libnet.so
7f355c819000 64 64 r-xp libnio.so
7f355c829000 2048 0 ---p libnio.so
7f355ca29000 4 4 r--p libnio.so
7f355ca2a000 4 4 rw-p libnio.so
7f355cf30000 200 128 r--p LC_CTYPE
7f355cf62000 4 4 r--p LC_NUMERIC
7f355cf63000 4 4 r--p LC_TIME
7f355cf64000 1484 156 r--p LC_COLLATE
7f355d0d7000 4 4 r--p LC_MONETARY
7f355d0d8000 4 4 r--p SYS_LC_MESSAGES
7f355d0d9000 4 4 r--p LC_PAPER
7f355d0da000 4 4 r--p LC_NAME
7f355d0db000 28 28 r--s gconv-modules.cache
7f357663b000 138232 30036 r--s modules
7f357ed39000 104 92 r-xp libzip.so
7f357ed53000 2044 0 ---p libzip.so
7f357ef52000 4 4 r--p libzip.so
7f357ef5c000 12 12 r--p libnss_files-2.28.so
7f357ef5f000 28 28 r-xp libnss_files-2.28.so
7f357ef66000 8 8 r--p libnss_files-2.28.so
7f357ef68000 4 0 ---p libnss_files-2.28.so
7f357ef69000 4 4 r--p libnss_files-2.28.so
7f357ef6a000 4 4 rw-p libnss_files-2.28.so
7f357ef71000 4 4 r--p LC_ADDRESS
7f357ef72000 4 4 r--p LC_TELEPHONE
7f357ef73000 4 4 r--p LC_MEASUREMENT
7f357ef74000 40 40 r-xp libinstrument.so
7f357ef7e000 2044 0 ---p libinstrument.so
7f357f17d000 4 4 r--p libinstrument.so
7f357f17e000 4 4 rw-p libinstrument.so
7f357f17f000 108 64 r-xp libjimage.so
7f357f19a000 2048 0 ---p libjimage.so
7f357f39a000 8 8 r--p libjimage.so
7f357f39c000 4 4 rw-p libjimage.so
7f357f39d000 164 164 r-xp libjava.so
7f357f3c6000 2048 0 ---p libjava.so
7f357f5c6000 4 4 r--p libjava.so
7f357f5c7000 4 4 rw-p libjava.so
7f357f5c9000 68 68 r-xp libverify.so
7f357f5da000 2044 0 ---p libverify.so
7f357f7d9000 8 8 r--p libverify.so
7f357f7dc000 8 8 r--p librt-2.28.so
7f357f7de000 16 16 r-xp librt-2.28.so
7f357f7e2000 8 0 r--p librt-2.28.so
7f357f7e4000 4 4 r--p librt-2.28.so
7f357f7e5000 4 4 rw-p librt-2.28.so
7f357f8e7000 17680 15012 r-xp libjvm.so
7f3580a2b000 2044 0 ---p libjvm.so
7f3580c2a000 764 764 r--p libjvm.so
7f3580ce9000 228 228 rw-p libjvm.so
7f3580d7d000 12 12 r--p libgcc_s.so.1
7f3580d80000 68 64 r-xp libgcc_s.so.1
7f3580d91000 12 12 r--p libgcc_s.so.1
7f3580d94000 4 0 ---p libgcc_s.so.1
7f3580d95000 4 4 r--p libgcc_s.so.1
7f3580d96000 4 4 rw-p libgcc_s.so.1
7f3580d97000 52 52 r--p libm-2.28.so
7f3580da4000 636 368 r-xp libm-2.28.so
7f3580e43000 852 128 r--p libm-2.28.so
7f3580f18000 4 4 r--p libm-2.28.so
7f3580f19000 4 4 rw-p libm-2.28.so
7f3580f1a000 548 548 r--p libstdc++.so.6.0.25
7f3580fa3000 688 192 r-xp libstdc++.so.6.0.25
7f358104f000 248 64 r--p libstdc++.so.6.0.25
7f358108d000 4 0 ---p libstdc++.so.6.0.25
7f358108e000 40 40 r--p libstdc++.so.6.0.25
7f3581098000 8 8 rw-p libstdc++.so.6.0.25
7f35810a0000 136 136 r--p libc-2.28.so
7f35810c2000 1312 1208 r-xp libc-2.28.so
7f358120a000 304 152 r--p libc-2.28.so
7f3581256000 4 0 ---p libc-2.28.so
7f3581257000 16 16 r--p libc-2.28.so
7f358125b000 8 8 rw-p libc-2.28.so
7f3581261000 4 4 r--p libdl-2.28.so
7f3581262000 4 4 r-xp libdl-2.28.so
7f3581263000 4 4 r--p libdl-2.28.so
7f3581264000 4 4 r--p libdl-2.28.so
7f3581265000 4 4 rw-p libdl-2.28.so
7f3581266000 100 100 r-xp libjli.so
7f358127f000 2048 0 ---p libjli.so
7f358147f000 4 4 r--p libjli.so
7f3581480000 4 4 rw-p libjli.so
7f3581481000 24 24 r--p libpthread-2.28.so
7f3581487000 60 60 r-xp libpthread-2.28.so
7f3581496000 24 0 r--p libpthread-2.28.so
7f358149c000 4 4 r--p libpthread-2.28.so
7f358149d000 4 4 rw-p libpthread-2.28.so
7f35814a2000 4 4 r--p LC_IDENTIFICATION
7f3581878000 4 4 r--p ld-2.28.so
7f3581879000 120 120 r-xp ld-2.28.so
7f3581897000 32 32 r--p ld-2.28.so
7f358189f000 4 4 r--p ld-2.28.so
7f35818a0000 4 4 rw-p ld-2.28.so``````

 1 Keep the first two lines (command and header), then only lines with an inode value over 0.
 2 Print only some columns; `pmap -X {pid}`’s output is verbose.
 3 The columns are selected to match the output of `pmap -x`; the `Size` column is in `KiB`.
 4 The last two lines (the totals) are filtered out; the actual sums of the size and rss columns of the selected rows are respectively `195336 KiB` and `52316 KiB`.

What may catch the eye is the multiple mappings for native libraries like `libjvm.so`. The reason for these different memory mappings is how dynamic libraries are loaded (with `dlopen`, e.g. here os::Linux::dlopen_helper).
I didn’t have any systems courses, but from what I believe I know, `dlopen` creates multiple memory mappings with different purposes and permissions:

• `r-xp` is an executable segment of the library, i.e. the library’s code
• `r--p` is read-only memory; I believe it holds the library’s constants or symbols
• `rw-p` is writable memory; I think its purpose is for the main process to set global variables of the library
• `---p` is a no-permission segment; I’m not sure about this one, but its location (between the executable and writable segments) makes me think it’s about buffer overflow prevention

Simple C code example that performs a `dlopen`

The program below simply loads the shared dynamic library `libjvm.so`, without even interacting with it. The result shows the 4 mappings with the different modes.

c
``````#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <dlfcn.h>

int main(int argc, char *argv[]) {
    pid_t pid = getpid();
    printf("pid: %d\n", pid);

    void *libjvm_handle = dlopen("lib/server/libjvm.so", RTLD_LAZY);
    if (!libjvm_handle) {
        fputs(dlerror(), stderr);
        exit(1);
    }

    char buffer[50];
    sprintf(buffer, "pmap -x -p %d", pid);
    printf("Executing: '%s'\n", buffer);
    system(buffer);
    return 0;
}``````

result
``````$ env LD_LIBRARY_PATH=$JAVA_HOME/lib/server ./test-dlopen
pid: 608
Executing: 'pmap -x -p 608'
608: ./test-dlopen
Address Kbytes RSS Dirty Mode Mapping
0000000000400000 4 4 0 r-x-- /src/build/exe/dlopen/test-dlopen
0000000000600000 4 4 4 r---- /src/build/exe/dlopen/test-dlopen
0000000000601000 4 4 4 rw--- /src/build/exe/dlopen/test-dlopen
0000000001ba0000 132 16 16 rw--- [ anon ]
00007f3374f11000 92 92 0 r-x-- /usr/lib64/libpthread-2.17.so
00007f3374f28000 2044 0 0 ----- /usr/lib64/libpthread-2.17.so
00007f3375127000 4 4 4 r---- /usr/lib64/libpthread-2.17.so
00007f3375128000 4 4 4 rw--- /usr/lib64/libpthread-2.17.so
00007f3375129000 16 4 4 rw--- [ anon ]
00007f337512d000 18516 5324 0 r-x-- /usr/lib/jvm/java-11-openjdk-11.0.9.11-0.el7_9.x86_64/lib/server/libjvm.so (1)
00007f3376342000 2048 0 0 ----- /usr/lib/jvm/java-11-openjdk-11.0.9.11-0.el7_9.x86_64/lib/server/libjvm.so (2)
00007f3376542000 836 836 836 r---- /usr/lib/jvm/java-11-openjdk-11.0.9.11-0.el7_9.x86_64/lib/server/libjvm.so (3)
00007f3376613000 236 216 216 rw--- /usr/lib/jvm/java-11-openjdk-11.0.9.11-0.el7_9.x86_64/lib/server/libjvm.so (4)
00007f337664e000 360 240 240 rw--- [ anon ]
00007f33766a8000 1808 1184 0 r-x-- /usr/lib64/libc-2.17.so
00007f337686c000 2044 0 0 ----- /usr/lib64/libc-2.17.so
00007f3376a6b000 16 16 16 r---- /usr/lib64/libc-2.17.so
00007f3376a6f000 8 8 8 rw--- /usr/lib64/libc-2.17.so
00007f3376a71000 20 12 12 rw--- [ anon ]
00007f3376a76000 84 64 0 r-x-- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
00007f3376a8b000 2044 0 0 ----- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
00007f3376c8a000 4 4 4 r---- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
00007f3376c8b000 4 4 4 rw--- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
00007f3376c8c000 1028 208 0 r-x-- /usr/lib64/libm-2.17.so
00007f3376d8d000 2044 0 0 ----- /usr/lib64/libm-2.17.so
00007f3376f8c000 4 4 4 r---- /usr/lib64/libm-2.17.so
00007f3376f8d000 4 4 4 rw--- /usr/lib64/libm-2.17.so
00007f3376f8e000 932 520 0 r-x-- /usr/lib64/libstdc++.so.6.0.19
00007f3377077000 2048 0 0 ----- /usr/lib64/libstdc++.so.6.0.19
00007f3377277000 32 32 32 r---- /usr/lib64/libstdc++.so.6.0.19
00007f337727f000 8 8 8 rw--- /usr/lib64/libstdc++.so.6.0.19
00007f3377281000 84 12 12 rw--- [ anon ]
00007f3377296000 8 8 0 r-x-- /usr/lib64/libdl-2.17.so
00007f3377298000 2048 0 0 ----- /usr/lib64/libdl-2.17.so
00007f3377498000 4 4 4 r---- /usr/lib64/libdl-2.17.so
00007f3377499000 4 4 4 rw--- /usr/lib64/libdl-2.17.so
00007f337749a000 136 136 0 r-x-- /usr/lib64/ld-2.17.so
00007f33776af000 24 24 24 rw--- [ anon ]
00007f33776b9000 8 8 8 rw--- [ anon ]
00007f33776bb000 4 4 4 r---- /usr/lib64/ld-2.17.so
00007f33776bc000 4 4 4 rw--- /usr/lib64/ld-2.17.so
00007f33776bd000 4 4 4 rw--- [ anon ]
00007ffc83b1d000 132 12 12 rw--- [ stack ]
00007ffc83b41000 12 0 0 r---- [ anon ]
00007ffc83b44000 4 4 0 r-x-- [ anon ]
ffffffffff600000 4 0 0 r-x-- [ anon ]
---------------- ------- ------- -------
total kB 38912 9040 1496``````

In the above snippet the mapped files represent `195.3 MiB` of the address space, of which `52.3 MiB` is actually resident. This app is definitely OK. Some applications’ workloads require handling a lot of files, suggesting that raising the limit may be the right thing to do. I’ve seen in the past `FileChannel` mappings that were never released, leading to increasing memory consumption that wasn’t easily identifiable in the Java heap (unless you performed a heap dump and knew what to look for).

#### Inspecting the other segments

Going beyond what has been mentioned is a tad more intricate, due to how native code performs allocations. Even identifying direct `ByteBuffer`s is almost impossible. The little program below allocates 16 MiB segments and prints the addresses of these memory segments, as well as the current process mapping.

DirectByteBuffers.main
``````System.out.printf("max: %d%n", Runtime.getRuntime().maxMemory());
new ProcessBuilder("pmap", "-x", Long.toString(ProcessHandle.current().pid()))
        .redirectOutput(Redirect.INHERIT)
        .start()
        .waitFor();

var address = Buffer.class.getDeclaredField("address");
address.setAccessible(true);

System.out.printf("native heap (pmap shows the [heap] mapping)%n");
for (var i = 0; i < 30; i++) {
    var byteBuffer = ByteBuffer.allocateDirect(16 * 1024 * 1024)
            .putInt(0, 0x01);
    System.out.printf("%s%n", Long.toHexString(address.getLong(byteBuffer)));
}

new ProcessBuilder("pmap", "-x", Long.toString(ProcessHandle.current().pid()))
        .redirectOutput(Redirect.INHERIT)
        .start()
        .waitFor();``````

The mapping output after the buffers have been `malloc`ed shows that the direct `ByteBuffer`s do not have their own segments; they are part of a bigger area. Table 1.
result ``````7faa5afff010 7faa59ffe010 7faa58ffd010 7faa52fff010 7faa51ffe010 7faa50ffd010 7faa4fffc010 7faa4effb010 7faa4dffa010 7faa4cff9010 7faa4bff8010 7faa4aff7010 7faa49ff6010 7faa48ff5010 7faa47ff4010 7faa46ff3010 7faa45ff2010 7faa44ff1010 7faa43ff0010 7faa42fef010 7faa41fee010 7faa40fed010 7faa3ffec010 7faa3efeb010 7faa3dfea010 7faa3cfe9010 7faa3bfe8010 7faa3afe7010 7faa39fe6010 7faa38fe5010`````` before ``````485: java --add-opens java.base/java.nio=ALL-UNNAMED DirectByteBuffers.java Address Kbytes RSS Dirty Mode Mapping 00000000e0e00000 32768 19992 19992 rw--- [ anon ] 00000000e2e00000 477184 0 0 ----- [ anon ] 0000000100000000 1792 1776 1776 rw--- [ anon ] 00000001001c0000 1046784 0 0 ----- [ anon ] 000055d4549ed000 4 4 0 r-x-- java 000055d454bee000 4 4 4 r---- java 000055d454bef000 4 4 4 rw--- java 000055d455d9d000 132 28 28 rw--- [ anon ] (1) 00007faa5c000000 132 56 56 rw--- [ anon ] (1) 00007faa5c021000 65404 0 0 ----- [ anon ] (1) 00007faa60000000 132 4 4 rw--- [ anon ] ...``````  1 These segments are here before the creation of the dire `ByteBuffer`s. after ``````485: java --add-opens java.base/java.nio=ALL-UNNAMED DirectByteBuffers.java Address Kbytes RSS Dirty Mode Mapping 00000000e0e00000 32768 19992 19992 rw--- [ anon ] 00000000e2e00000 477184 0 0 ----- [ anon ] 0000000100000000 1920 1792 1792 rw--- [ anon ] 00000001001e0000 1046656 0 0 ----- [ anon ] 000055d4549ed000 4 4 0 r-x-- java 000055d454bee000 4 4 4 r---- java 000055d454bef000 4 4 4 rw--- java 000055d455d9d000 132 28 28 rw--- [ anon ] (1) 00007faa38fe5000 442476 442476 442476 rw--- [ anon ] (3) 00007faa54000000 132 8 8 rw--- [ anon ] (4) 00007faa54021000 65404 0 0 ----- [ anon ] (4) 00007faa58ffd000 49164 49164 49164 rw--- [ anon ] (2) 00007faa5c000000 132 56 56 rw--- [ anon ] (1) 00007faa5c021000 65404 0 0 ----- [ anon ] (1) 00007faa60000000 132 4 4 rw--- [ anon ] ...``````  1 The segments before the creation of the dire `ByteBuffer`s. 
2 New segments, likely direct `ByteBuffer`s
3 New segments, likely direct `ByteBuffer`s
4 Unknown glibc `malloc` arena

Here I can say those are likely our `ByteBuffer`s, because I had the opportunity to diff the `pmap` output around code that specifically created the new direct buffers. Also, I know that direct `ByteBuffer`s are zeroed, i.e. their pages are touched and dirty, so direct byte buffers immediately count toward the RSS.

In a real application it's impossible to identify them with certainty without the address. The possible criteria would be that RSS and size are the same for the segment, that the permissions are read/write, and of course that it's an anonymous mapping; but any other allocation pattern or usage could meet these criteria.

Finally, if `pmap` is run with `-X`, it's likely you'll notice segments named `vsyscall` and `vdso`; these are mechanisms that can accelerate some system calls. `vvar` is used to exchange kernel data without requiring a system call.

#### How many pages are used?

In fact, `ps` uses the proc file system to obtain its information. While not immediately useful, it's interesting that the RSS value displayed by `ps` is in fact the number of pages times the page size. (I suppose the equation is a tad more complex than a single multiplication when huge pages are involved.)

For example, the procfs documentation (the latest, as in `latest` kernel, documentation is there) gives the description of the `statm` file.

statm description

```
Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
..............................................................................
 Field    Content
 size     total program size (pages)      (same as VmSize in status)
 resident size of memory portions (pages) (same as VmRSS in status)
 shared   number of pages that are shared (i.e. backed by a file, same
                                          as RssFile+RssShmem in status)
 trs      number of pages that are 'code' (not including libs; broken,
                                          includes data segment)
 lrs      number of pages of library      (always 0 on 2.6)
 drs      number of pages of data/stack   (including libs; broken,
                                          includes library text)
 dt       number of dirty pages           (always 0 on 2.6)
```

`ps` and `/proc/{pid}/statm`

```
$ ps -o rss,vsz,command $(pidof java)
    RSS     VSZ COMMAND
4346704 6507368 /usr/bin/java -Dfile.encoding=UTF-8 -Duser.timezone=UTC -Djava.security.egd=file:/dev/./urandom -Djava…

$ cat /proc/$(pidof java)/statm | tr ' ' '\n'
1626842 (1)
1086676 (2)
12638 (3)
1
0
1283103
0
```

1 Total size of the addressing space, in pages, i.e. `6507368 KiB`
2 Resident memory, in pages, i.e. `4346704 KiB`
3 Pages backed by a file plus shared memory

Given the page size of `4 KiB`, the following numbers come naturally:

• vsz = `1626842 * 4 = 6507368`
• rss = `1086676 * 4 = 4346704`

For example, let's say there is a Kubernetes memory limit of `6 GiB` (`6442450944` bytes; this limit is in fact a cgroup memory limit), and a `java` process is started with a bigger heap, `-Xmx16g`, than the cgroup limit. We can observe that:

1. a process can over-commit, if Linux is configured to allow this (`/proc/sys/vm/overcommit_memory`); this is not an issue as long as
2. the memory used by the resident pages does not go over the cgroup limit;
3. the process will be oom-killed if it uses more than `6442450944 / 4096 = 1572864` pages (of `4 KiB`).

SelfPs.java

```
import java.io.*;

public class SelfPs {
    public static void main(String[] args) throws Exception {
        var h = new ProcessBuilder("ps",
                                   "--no-header",
                                   "-orss,vsz",
                                   Long.toString(ProcessHandle.current().pid()))
                .start();
        try (var br = new BufferedReader(new InputStreamReader(h.getInputStream()))) {
            System.out.println(br.readLine());
        }
    }
}
```

Result

```
$ env -u JDK_JAVA_OPTIONS java -Xms16g -Xmx16g SelfPs.java
143584 18996472 (1) (2)

$ cat /sys/fs/cgroup/memory/memory.limit_in_bytes
6442450944 (3)

$ echo $((18996472 * 1024))
19452387328 (4)
```

1 RSS in KiB
2 Virtual address space in KiB
3 cgroup limit in bytes
4 Virtual address space in bytes

Another interesting element of `/proc/{pid}/statm` is that it shows how many pages the mapped files take. In the output below, the third line gives the number of pages that are backed by files. Having a look at this information may also be useful when sizing the container.

File backed memory

```
$ cat /proc/$(pidof java)/statm | tr ' ' '\n'
1514761
1009054
11222 (1)
1
0
1164939
0

$ pmap -X $(pidof java) | head -n -2 | awk '{ if (NR > 2 && $5 > 0) sum += $7 } END { print sum }'
46796 (2)
```

1 Number of pages backed by files or shared memory, so `11222 * 4 = 44888` KiB
2 Resident set size of memory mapped files in KiB, not including shared pages that are not file backed

### Paging and the Java heap

Before wrapping up this article I'd like to mention an interesting effect of virtual memory, over-commit and the Java heap. The pages of the Java heap memory segment count toward the RSS once they have been touched at least once during the life of the program; the activity of the program and of the GC will increase the number of touched pages.

For G1GC:

1. New allocations happen in a GC region called Eden, and more specifically in a sub-segment called a TLAB, dedicated to the thread that performs the allocation.
2. After some time the GC will kick in and move (or evacuate) the live objects to a GC region called Survivor.
3. This cycle will go on until the object is considered old enough to be evacuated to a GC region called Old.

After each evacuation, the regions that previously held the objects are cleaned up, and their bits set to `0`, but the pages are still considered dirty and count in the RSS. These regions return to a pool of Free regions.
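The effect of touched pages on the RSS can be observed from Java itself. Below is a minimal sketch of the idea, assuming a Linux host (it reads `/proc/self/statm`, whose second field is the resident page count); the class and method names are mine:

```
import java.nio.file.Files;
import java.nio.file.Path;

public class TouchPages {

    // Second field of /proc/self/statm is the resident set size, in pages (Linux only)
    static long residentPages() throws Exception {
        return Long.parseLong(Files.readString(Path.of("/proc/self/statm")).split(" ")[1]);
    }

    public static void main(String[] args) throws Exception {
        long before = residentPages();
        // Allocating the array zeroes it, so its pages are touched right away
        // and immediately count toward the RSS
        byte[] block = new byte[128 * 1024 * 1024];
        long after = residentPages();
        // Assuming 4 KiB pages; the delta should be roughly 128 MiB
        System.out.printf("resident: %d -> %d pages (+%d KiB)%n",
                          before, after, (after - before) * 4);
        if (block[0] != 0) throw new AssertionError(); // keep `block` reachable
    }
}
```

Running it with a large `-Xms` shows the flip side: the RSS stays far below the reserved heap until the pages are actually written.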
For the evacuation, the live objects can go to existing regions if they have some space; if there's not enough space in the existing regions, G1GC will convert a Free region to either a Survivor or an Old region. If the Free region has been used before then no new pages will get dirty; however, if the Free region has never been in use before then this will touch more pages.

GC regions usage of pages

This can lead to a situation, if the Java heap is large enough, where many never-used-before Free regions exist. In this situation it is not obvious to distinguish from metrics what is consuming the memory.

Memory usage by pool

Some people may have heard of the `-XX:+AlwaysPreTouch` Hotspot option. This option tells the JVM to write a zero to every OS memory page of the heap during the JVM startup. This option also has the effect of avoiding physical memory commit latencies later at runtime; however, it only affects the heap memory zone. Other JVM components that manage areas like thread stacks or the Metaspace work differently. In other words, parts of the committed memory shown in NMT are not resident, and as such the RSS counter may not reflect what is seen in the committed memory.

## Sizing a cgroup / kubernetes `resources.memory.limit`

With containerization getting traction, it is likely that one has to face a memory related issue. Being equipped with the right tooling, and the tool's manual, is a precious help. When a container is oom-killed, either the application has a problem or its configuration is too tight.

Using the information gathered from the JVM's native memory tracking and from the memory mappings, is it possible to build a simple equation to estimate the probable maximum memory usage of a process?

Set the cgroup memory limit high enough, so the application isn't oom-killed; this will let you analyze how the app works, and adjust settings without fear (except from your colleagues that compare everything to Go or Rust). If you don't think it's a Java heap memory leak, i.e. the heap usage isn't alarming, try to inspect the native memory; it's easier with the flag `-XX:+AlwaysPreTouch`, however keep in mind this will bump your RSS right from the start, so anticipate this increase in your cgroup memory limit.

We can already suppose there's every component reported by the JVM, plus the mapped files, which NMT does not track.

```
Total memory = Heap + GC + Metaspace + Code Cache + Symbol tables
             + Compiler + Other JVM structures + Thread stacks
             + Direct buffers + Mapped files
```

Earlier in this writing I noted that NMT is not enough to account for the used memory.

On a loaded application

```
$ jcmd $(pidof java) VM.native_memory \
  | grep -P "Total.*committed=" \
  | grep -o -P "(?<=committed=)[0-9]+(?=KB)"
3841302 (1)

$ ps --no-header -o rss $(pidof java)
4204512 (2)

$ pmap -X $(pidof java) | head -n -2 | awk '{ if (NR > 2 && $5 > 0) sum += $7 } END { print sum }'
52668 (3)

$ echo $((4204512 - 3841302 - 52668))
310542 (4)
```

1 Total committed memory reported by NMT
2 RSS of the JVM process
3 RSS of the JVM process's mapped files
4 The rest of the used memory, possibly `malloc` or `mmap` performed by native libs, plus native allocator overhead

Basically, this shows that one must account for at least this amount of memory when defining the Kubernetes limit.

```Total memory = Heap + GC + Metaspace + Code Cache + Symbol tables
+ Compiler + Other JVM structures + Thread stacks
+ Direct buffers + Mapped files
+ Native libraries allocations + Malloc overhead
+ ...```
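That residual can be estimated by scripting the same three commands shown earlier. Here is a rough sketch, assuming a Linux host with `jcmd`, `ps` and `pmap` on the PATH and a target JVM started with `-XX:NativeMemoryTracking=summary`; the class and helper names are mine:

```
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnaccountedRss {

    // Extract the total committed KiB from `jcmd <pid> VM.native_memory` output
    static long nmtCommittedKiB(String jcmdOutput) {
        Matcher m = Pattern.compile("Total.*committed=(\\d+)KB").matcher(jcmdOutput);
        if (!m.find()) throw new IllegalStateException("Is NMT enabled?");
        return Long.parseLong(m.group(1));
    }

    // Sum the Rss column (7th, in KiB) of the file-backed mappings, i.e. lines of
    // `pmap -X <pid>` whose Inode column (5th) is non-zero; this mirrors the awk
    // one-liner used earlier. The Perm check skips the header and totals lines.
    static long fileBackedRssKiB(String pmapOutput) {
        long sum = 0;
        for (String line : pmapOutput.split("\n")) {
            String[] f = line.trim().split("\\s+");
            if (f.length > 6 && f[1].matches("[rwxsp-]{4}") && Long.parseLong(f[4]) > 0) {
                sum += Long.parseLong(f[6]);
            }
        }
        return sum;
    }

    static String run(String... cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).start();
        String out = new String(p.getInputStream().readAllBytes());
        p.waitFor();
        return out;
    }

    public static void main(String[] args) throws Exception {
        String pid = args[0];
        long nmt = nmtCommittedKiB(run("jcmd", pid, "VM.native_memory"));
        long rss = Long.parseLong(run("ps", "--no-header", "-o", "rss", pid).trim());
        long mapped = fileBackedRssKiB(run("pmap", "-X", pid));
        System.out.printf("rss=%d nmt=%d mapped=%d -> unaccounted=%d KiB%n",
                          rss, nmt, mapped, rss - nmt - mapped);
    }
}
```

Watching this unaccounted number over time is a cheap way to spot a native leak that neither the heap metrics nor NMT will show.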

In my experience, if the application does not exhibit a leaky behavior but just needs a memory limit adjustment, the job is way easier using `AlwaysPreTouch`: it is then simpler to track the evolution of the "off-heap" memory.

Additionally, when sizing the memory limit, it's really important to think about the OS page cache. Linux uses the unused resident memory to cache pages, usually those backed by files.

Workloads that access the filesystem for a living, like Cassandra or ElasticSearch, will benefit from the OS page cache. It's a good bet to increase the memory limit of such a cgroup.

By how much depends on the workload. I believe (major) page faults are a good indicator: if the container experiences page faults, it means the OS doesn't have the wanted pages in resident memory, and the kernel must fetch the backing data, likely from a slower storage device. This is something that will negatively affect the workload latencies.
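Fault counters for a given process can be read from `/proc/{pid}/stat` (fields 10 and 12 in proc(5): `minflt` and `majflt`). A minimal sketch, assuming a Linux host; the class name is mine:

```
import java.nio.file.Files;
import java.nio.file.Path;

public class PageFaults {

    // Parse minflt (10th field) and majflt (12th field) out of a /proc/{pid}/stat line.
    // The command name (2nd field) may contain spaces, so split after the closing
    // parenthesis; the remaining fields then start at field 3 (state).
    static long[] faults(String statLine) {
        String[] f = statLine.substring(statLine.lastIndexOf(')') + 2).split(" ");
        return new long[] { Long.parseLong(f[7]), Long.parseLong(f[9]) };
    }

    public static void main(String[] args) throws Exception {
        String pid = args.length > 0 ? args[0] : "self";
        long[] f = faults(Files.readString(Path.of("/proc/" + pid + "/stat")));
        System.out.println("minflt=" + f[0] + " majflt=" + f[1]);
    }
}
```

In a cgroup v1 setup, the `memory.stat` file of the cgroup also exposes an aggregated `pgmajfault` counter, which is handy when monitoring the container as a whole rather than a single process.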

## Closing words

A cgroup is a double-edged sword: it creates good conditions to isolate a process, and being isolated, it gets easier to provision the necessary resources in production. Defining these boundaries is however a tedious task; limits that are too narrow increase the chance of the container getting oom-killed or performing poorly, while leaving vast headroom reduces the ability of Kubernetes to distribute the workloads, and it costs more.

In this very long writing, I showed two tools to survey the native memory of a process. Java ships with a very interesting tracking mechanism, and I found out that inspecting the Linux `/proc` filesystem, with the help of `pmap`, complements NMT well. Together they help to sort out memory problems, be it a limit that needs adjustment or something to fix in the application.

That being said, it might be cheaper to look at other solutions; e.g. if the production is running on Kubernetes, it would be worth having a look at the auto-scalers provided by the platform.

I still think that understanding how a workload uses memory is a sensible task, if only to make better use of the auto-scalers.

> The JVM is handling it.

Famous last words.

I understand that having to dig down this deep may look superfluous, almost zealous, when coming from the Java world.

In my opinion, containers changed the deal: the tighter constraints that helped to increase the deployment density are now backfiring. On a side note, those that tried cgroups with a CPU limit have likely experienced throttling of their applications. Memory-wise, rediscovering RSS is essential.

## Thanks

The above writing tries to piece together elements from a few things I knew, things I grepped in the JDK codebase, blog posts, Stack Overflow, and things learned from — awesome — people. I hope I didn't forget anyone; if I did, or if I'm wrong, please reach out.

I’d like to thank Pierre Laporte, Olivier Bourgain, Bastien Lemale, and Yohan Legat for their early help in proof-reading and suggestions. And many thanks to Jean-Philippe Bempel and Mathis Raguin for post-production reviews ;)