# Using Linux's memfd_secret syscall from the JVM with JEP-419

Linux 5.14 brought a new system call, `memfd_secret`, to mitigate speculative attacks by preventing even the kernel from peeking at the memory segments created by this system call.

In this article I will leverage the API introduced by JEP-419 (Project Panama), which is still incubating. JEP-419 has been delivered as part of JDK 18.

For those that don’t know, Project Panama is a project that aims to provide an easy, secure and efficient way to call native methods from Java. You can look at my previous articles on the earlier versions of the project, articles on foojay.io by Carl Dea, or those on inside.java.

The Project Panama APIs are the fruit of work started in 2014, from even older ideas. The project is still evolving: the next iteration (JEP-424) in JDK 19 is going to promote these APIs to preview, but API and behavior adjustments are still likely.
The following examples are based on JDK 18 2022-03-22, build 18+37 from Red Hat. The Linux distribution is Fedora release 35 with the kernel 5.16.20-200.fc35.x86_64.

While this article will focus on Linux, the same concepts apply to other OSes (and CPUs, as we’ll see). Also, this article is not an introduction to JEP-419; that is covered in other material. It therefore assumes the right compilation and runtime flags for JEP-419, usually `--add-modules=jdk.incubator.foreign` and `--enable-native-access=ALL-UNNAMED`.

tl;dr

Here’s how one can make a syscall using JEP-419.

`memfd_secret` syscall
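``````var systemCLinker = CLinker.systemCLinker();
var sys_memfd_secret = systemCLinker.downcallHandle(
    systemCLinker.lookup("syscall").get(),
    FunctionDescriptor.of(
        ValueLayout.JAVA_INT, // return value
        ValueLayout.JAVA_INT, // syscall number
        ValueLayout.JAVA_INT  // flags
    )
);

// 447 is the memfd_secret syscall number on x86_64
int secret_fd = (int) sys_memfd_secret.invoke(447, 0);``````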

Now to know more about the whole deal, read on.

## What is a system call (syscall) ?

Before jumping to `memfd_secret`, let’s first understand how to make a system call. And even before that, let’s see what a system call is.

If you are not interested in this part, you can jump to the `memfd_secret` section.

In order to do something useful, a program has to interact with resources: memory, disk, network, terminal, etc. On a computer, these resources are handled by a very complex and critical piece of software, the operating system.

In order to use these resources, a program has to make system calls like `read`, `wait`, `write`, `exit`, etc. The standard `malloc`, the native allocator, has to actually place a request to the OS to get memory via a `mmap` syscall.

As expected the JVM does plenty of syscalls too, e.g. when logging something on `stdout` or persisting a (unified) log file.

Essentially,

a system call is a way of requesting the kernel to do something for the program.

Why do system calls have to live in the kernel and not in user space, like a standard library? As mentioned earlier, the reasoning is that system calls are a way to interact with, or involve, a resource like devices, the file system, the network, processes, etc. These resources are managed by privileged software: the OS, or kernel.

When a system call happens, the program doesn’t simply invoke a function whose code resides at some address. A system call actually makes the CPU switch to kernel mode, because the kernel is privileged software.

On most modern processors there is a security model that limits the scope of what a program can do. In particular on Intel-based CPUs, the model is known as processor protection rings (or hierarchical protection domains).

It seems that Rings 1 and 2 are rarely used because paging (the way that the OS handles memory, see my blog post on [off-heap memory]) only has the concept of privileged and unprivileged, which minimizes the actual benefit of those rings, according to Evan Teran’s answer on SO.

When a processor executes some code (in a thread), the processor knows the current mode. This way the processor is able to gate memory accesses, e.g. a Ring 3 user-land program cannot access memory from Ring 0 (the kernel). This is yet another feature of the virtual memory abstraction. The processor can also restrict some processor instructions and registers to the software running in Ring 0.

Out of scope: there are even negative rings on some CPU architectures, for hypervisors or CPU system management, down to `Ring -3`.

Since these restrictions are enforced by the CPU, a user-land program needs to place a request to the kernel in order to get its work done. This mechanism is called a syscall, and it allows transitioning between rings.

Syscall ring transitions

During mode switches a lot is happening: saving and restoring registers, putting the CPU in a specific mode (user vs kernel), etc. And of course the reverse happens once the request is handled, either with success or with a failure.

Privilege context switches are sufficiently costly that most libraries try to avoid them. For example, reading `8 KiB` at a time instead of `256 bytes` is a good idea, as it drastically reduces the number of syscalls and therefore of mode switches.
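To put numbers on the buffer-size remark, here is a minimal sketch that counts how many `read` calls are needed to drain 1 MiB with each buffer size. It uses an in-memory stream as a stand-in, so no real syscall is made; the assumption is that on an unbuffered file stream each of these calls would map to one `read` syscall. The class and method names (`ReadCallCounter`, `countReads`) are made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadCallCounter {
    /** Counts the read() calls that return data while draining totalBytes with the given buffer size. */
    static int countReads(int totalBytes, int bufferSize) {
        // In-memory stand-in for a file; assumption: one read() here ~ one read syscall on a raw file stream
        InputStream in = new ByteArrayInputStream(new byte[totalBytes]);
        byte[] buffer = new byte[bufferSize];
        int calls = 0;
        try {
            while (in.read(buffer) != -1) {
                calls++;
            }
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for an in-memory stream
        }
        return calls;
    }

    public static void main(String[] args) {
        int oneMiB = 1024 * 1024;
        System.out.println("256 B buffer: " + countReads(oneMiB, 256) + " reads");      // 4096 reads
        System.out.println("8 KiB buffer: " + countReads(oneMiB, 8 * 1024) + " reads"); // 128 reads
    }
}
```

Under that assumption, the larger buffer needs 32 times fewer mode switches for the same amount of data.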

### What does the documentation say about syscalls?

Now let’s get practical.

Looking at `man 2 syscall`, the manpage sheds some light on how to make the call, specifically in the Architecture calling conventions section. Those details are in assembly, e.g.

• processor interrupt `0x80` for i386 processors (32 bits), then specific registers

• `syscall` instruction for x86_64 processors (64 bits), then specific registers

The calling conventions of other architectures are also described, e.g. on ARM processors the system call is performed by a `swi 0x0` instruction, on aarch64 by `svc #0`.

If you are not sure what exactly a calling convention is, you should at least read this Wikipedia article on x86 calling conventions. In short, a calling convention defines how and where parameters should be placed in order to call the code: how parameters are passed (registers and/or stack), how values are returned, etc.

This manual page also points out an important difference with regular functions: while we look up system calls by their names (`write`, `read`, `execve`, `exit`, `mmap`, `memfd_create`, etc.), programs and the kernel actually know them by numbers.

Why numbers? The reason is that syscalls are like messages that are passed down, and these numbers are somewhat like enum ordinals indicating the type of message. These numbers are part of the syscall ABI (Application Binary Interface), and as such they are stable for a given CPU architecture, although the list is unbounded (new syscalls can be added).

Out of scope for this article: not all syscalls are created equal nowadays. Some syscalls, usually the most frequently used ones, are exported to user-space memory to avoid the cost of switching to kernel mode. In practice, the vDSO (virtual dynamic shared object) is like a small library; it is loaded in memory so that it can be accessed from the program memory (glibc knows about this memory region and will use it).

`pmap -X {pid}`
``````# pmap -X 1
1:   java ...
Address Perm   Offset Device   Inode Size  Rss  Pss Referenced Anonymous LazyFree ShmemPmdMapped FilePmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked THPeligible ProtectionKey Mapping
...
7ffe78f4c000 rw-p 00000000  00:00       0  132  112  112        112       112        0              0             0              0               0    0       0      0           0             0 [stack]
7ffe78fad000 r--p 00000000  00:00       0   16    0    0          0         0        0              0             0              0               0    0       0      0           0             0 [vvar]
7ffe78fb1000 r-xp 00000000  00:00       0    8    4    0          4         0        0              0             0              0               0    0       0      0           0             0 [vdso]  (1)
ffffffffff600000 r-xp 00000000  00:00       0    4    0    0          0         0        0              0             0              0               0    0       0      0           0             0 [vsyscall]
...``````
 1 The vDSO 8 KiB segment

To read more about it, one should read the relevant manual page (`man 7 vdso`). Typically, this page lists the exported syscalls.

E.g. `__vdso_clock_gettime`, which is called by `clock_gettime` defined in the standard libc (`man 3 clock_gettime`).

 The syscall numbers are different between architectures! On Linux one can look at their definition in the `/include/asm-/unistd-.h` files.
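To make the per-architecture difference concrete, here is a small sketch of a lookup table for three syscalls. The numbers are the ones defined in `unistd_64.h` (x86_64) and `unistd_32.h` (i386) on Linux; the `SyscallNumbers` class and its `of` method are hypothetical helpers made up for this illustration, not part of any library.

```java
import java.util.Map;

class SyscallNumbers {
    // Linux syscall numbers, keyed by architecture then by syscall name
    private static final Map<String, Map<String, Integer>> TABLE = Map.of(
            "x86_64", Map.of("write", 1, "getpid", 39, "exit", 60),
            "i386", Map.of("write", 4, "getpid", 20, "exit", 1));

    /** Returns the syscall number for the given architecture, or fails loudly if unknown. */
    static int of(String arch, String name) {
        Map<String, Integer> numbers = TABLE.get(arch);
        if (numbers == null || !numbers.containsKey(name)) {
            throw new IllegalArgumentException("unknown syscall " + name + " for " + arch);
        }
        return numbers.get(name);
    }

    public static void main(String[] args) {
        // The same name resolves to a different number on each architecture
        System.out.println("write on x86_64: " + of("x86_64", "write")); // 1
        System.out.println("write on i386:   " + of("i386", "write"));   // 4
    }
}
```

Hard-coding a number therefore ties the program to one architecture (and one OS), which is worth keeping in mind for the Java snippets later in this article.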

From the syscall manpage the Intel CPUs syscall calling convention is:

Example 1. 64-bit programs
Set the registers
1. `rax` ← System Call number

2. `rdi` ← First argument

3. `rsi` ← Second argument

4. `rdx` ← Third argument

Make the syscall
• execute `syscall` processor instruction

The actual syscall numbers (for 64-bit programs) are usually defined in `/usr/include/asm/unistd_64.h`.

Example 2. 32-bit programs
Set the registers
1. `eax` ← System Call Number

2. `ebx` ← First Argument

3. `ecx` ← Second Argument

4. `edx` ← Third Argument

Make the syscall
• Place a processor interrupt `int 0x80`

The actual syscall numbers (for 32-bit programs) are usually defined in `/usr/include/asm/unistd_32.h`.

### My first syscall

In order to quickly practice a syscall, let’s do a very simple hello world. The example will be in assembler, I promise this is the only source snippet in assembly and after that I’ll be back with Java and Panama.

• `/usr/include/asm/unistd_64.h`

Example 3. 64-bits (with `syscall` instruction)
hello_syscall.asm (x86_64)
``````global _start       ; define entrypoint
section .text
_start:
mov rax, 0x1    ; syscall number for write (1)
mov rdi, 0x1    ; int fd                   (2)
mov rsi, msg    ; const void* buf
mov rdx, mlen   ; size_t count
syscall         ; make the call            (3)

mov rax, 0x3c   ; syscall number for exit  (1)
mov rdi, 0x1    ; int status               (2)
syscall         ; make the call            (3)

section .rodata
msg: db "Hello Linux syscalls!",0x0a, 0x0d  ; message string, terminated by a new line (0A, 0D)
mlen: equ $-msg ; calculate the length of the message``````
 1 At this place this register will hold the selected syscall (a number). Note the number comes from `/usr/include/asm/unistd_64.h`. 2 Syscall arguments are placed in the next registers. 3 Make the syscall with the `syscall` instruction.
``````nasm -w+all -f elf64 -o hello_syscall.o hello_syscall.asm (1)
ld -o hello_syscall hello_syscall.o
./hello_syscall``````
 1 Note the `elf64` format for 64 bits.

Example 4. 32-bits (with an interrupt)
hello_syscall_via_int80.asm (x86, ie won’t work on ARM)
``````global _start           ; define entrypoint
section .text
_start:
mov eax, 4              ; syscall number: write (1)
mov ebx, 1              ; stdout                (2)
mov ecx, str            ; buffer address
mov edx, str_len        ; buffer length
int 0x80                ; make the call         (3)

mov eax, 1              ; syscall number: exit  (1)
mov ebx, 0              ; exit status           (2)
int 0x80                ; make the call         (3)

section .rodata
str: db "Hello Linux!", 0Ah ; message string, terminated by a new line (0A)
str_len: equ $ - str        ; calculate the length of the message``````
 1 At this place this register will hold the selected syscall (a number). Note the number comes from `/usr/include/asm/unistd_32.h`. 2 Syscall arguments are placed in the next registers. 3 Make the syscall with interrupt `0x80`.
compile and run
``````nasm -w+all -f elf32 -o hello_syscall_via_int80.o hello_syscall_via_int80.asm (1)
ld -m elf_i386 -o hello_syscall_via_int80 hello_syscall_via_int80.o (2)
./hello_syscall_via_int80``````
 1 Note the `elf32` format for 32 bits. 2 Note the linker emulation option for `i386`

When looking at this very simplistic code, something immediately stands out: from the application’s point of view (user land), a syscall is just like an atomic pseudo machine instruction. I believe this example is more striking than the figure above on syscall ring transitions.

We saw what exactly a syscall is and how to make one using assembly. In general though, it’s rare to invoke a syscall directly, as the standard library exposes wrappers that handle everything for most of the syscalls.

syscall wrappers in the standard library

Because the `memfd_secret` syscall was introduced only recently, there’s no wrapper function in the standard library, hence we’ll need to make the system call ourselves.

## Making syscalls from the JVM

The work of the Panama project doesn’t allow us to directly write assembly code and execute it. Fortunately!

And the libc already exposes a syscall function that takes care of the calling convention as mentioned in `man 2 syscall`, ie it will place the arguments in the right CPU registers.

syscall manual example (omitting headers)
``````int main(int argc, char *argv[])
{
    pid_t pid;

    pid = syscall(SYS_getpid);        /* raw syscall through the libc helper */
    printf("pid: %ld\n", (long) pid);
}``````

So, basically, to make a syscall using JEP-419 I only have to look up the `syscall` function, and since it’s part of the standard libc, this just needs `CLinker.systemCLinker()`.

syscall manual example with Panama
``````/*
On linux (Intel x86_64) in
- /usr/include/asm/unistd_64.h

#define __NR_getpid 39

On macOs (Intel x86_64) in either :
- /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/sys/syscall.h
- /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/syscall.h

#define	SYS_getpid         20
*/
final static int SYS_getpid = 39; // Linux x86_64 value, use 20 on macOS (1)

MethodHandle syscall = systemCLinker.downcallHandle(
systemCLinker.lookup("syscall").get(),
FunctionDescriptor.of(
ValueLayout.JAVA_INT, (2)
ValueLayout.JAVA_INT  (3)
)
);

int pid = (int) syscall.invoke(SYS_getpid); (4)
System.out.println("pid: " + pid);``````
 1 The syscall number. 2 The return type of the syscall function. 3 The first argument is the syscall number. 4 Making the syscall.
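A quick way to sanity-check the value returned by the raw syscall is to compare it with what the JDK itself reports. The sketch below only uses the standard `ProcessHandle` API (Java 9+); the `PidCheck` class name is made up for this illustration, and the comparison with the `pid` variable from the snippet above is left as a comment since it depends on that code.

```java
class PidCheck {
    /** Returns the process id as seen by the JDK, without any native call of our own. */
    static long currentPid() {
        return ProcessHandle.current().pid();
    }

    public static void main(String[] args) {
        long jdkPid = currentPid();
        System.out.println("pid (ProcessHandle): " + jdkPid);
        // Assumption: 'pid' is the int obtained from syscall.invoke(SYS_getpid) in the snippet above.
        // If the syscall number is right for the platform, (long) pid == jdkPid should hold.
    }
}
```

If the two values differ, the most likely culprit is a syscall number taken from the wrong platform's header.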

That’s it, we’ve made our first direct syscall using Panama (and JEP-419). Simple, right? Let’s try to use that knowledge for the `memfd_secret` syscall.

## `memfd_secret`

The `memfd_secret` syscall was introduced in this commit. Fortunately Linux has good commit messages, so we can read it and learn more about how to create "secret" memory areas.

The following example demonstrates creation of a secret mapping (error handling is omitted):

``````fd = memfd_secret(0);
ftruncate(fd, MAP_SIZE);
ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);``````

Basically we need to create the secret file descriptor, truncate it to the desired size, and then memory map it.

1. First get a file descriptor with `memfd_secret`

memfd_secret syscall
``````/*
On linux (Intel x86_64) in /usr/include/asm/unistd_64.h

#define __NR_memfd_secret 447
*/
final static int SYS_memfd_secret = 447; (1)

MethodHandle syscall = systemCLinker.downcallHandle(
systemCLinker.lookup("syscall").get(),
FunctionDescriptor.of(
ValueLayout.JAVA_INT, (2)
ValueLayout.JAVA_INT, (3)
ValueLayout.JAVA_INT  (4)
)
);

int secret_fd = (int) syscall.invoke(SYS_memfd_secret, 0); (5)``````
 1 The `memfd_secret` number. 2 The return type of the syscall function. 3 The first argument is the syscall number. 4 The flags passed to `memfd_secret`, currently the only supported flag is `O_CLOEXEC` according to this LWN article by Jonathan Corbet. 5 Making the syscall, not using any flags, the returned value is a file descriptor.

We can proceed with the rest of the process.

2. Then set the desired size

``````// int ftruncate(int fd, off_t length);
MethodHandle ftruncate = systemCLinker.downcallHandle(
systemCLinker.lookup("ftruncate").get(),
FunctionDescriptor.of(
ValueLayout.JAVA_INT,
ValueLayout.JAVA_INT, // fd
ValueLayout.JAVA_LONG // length
)
);

var res = (int) ftruncate.invoke( (1)
secret_fd,
secret.length()
);``````
 1 Invoke the `ftruncate` from the libc on the file descriptor with the wanted size.
3. Finally, memory-map this file descriptor. This operation has the effect of unmapping this memory segment from the kernel pages (in Ring 0), so only the user process can read these memory pages.

``````// in /usr/include/bits/mman-linux.h
// #define PROT_READ       0x1             /* Page can be read.  */
// #define PROT_WRITE      0x2             /* Page can be written.  */
final int PROT_READ = 1;
final int PROT_WRITE = 2;
// #define MAP_SHARED      0x01            /* Share changes.  */
final int MAP_SHARED = 1;

// in /usr/include/sys/mman.h
// extern void *mmap (void *__addr, size_t __len, int __prot,
//                   int __flags, int __fd, __off_t __offset) __THROW;
MethodHandle mmap = systemCLinker.downcallHandle(
systemCLinker.lookup("mmap").get(),
FunctionDescriptor.of(
ValueLayout.ADDRESS, // returned address of the mapping
ValueLayout.ADDRESS, // addr
ValueLayout.JAVA_LONG, // size
ValueLayout.JAVA_INT, // protection modes
ValueLayout.JAVA_INT, // flags
ValueLayout.JAVA_INT, // fd
ValueLayout.JAVA_LONG // offset
)
);

var segmentAddress = (MemoryAddress) mmap.invoke( (1)
MemoryAddress.NULL,
secret.length(),
PROT_READ | PROT_WRITE,
MAP_SHARED,
secret_fd,
0
);``````
 1 Memory-map the file descriptor, using the same wanted size, and use the right protection modes (read & write), and flags.
4. Once the memory segment is mapped, we can actually get access to it via the `MemorySegment` API.

``````var secretSegment = MemorySegment.ofAddress(segmentAddress, length, scope); (1)
secretSegment.copyFrom(MemorySegment.ofArray(secretBytes)); (2)
var roSecretSegment = secretSegment.asReadOnly(); (3)``````
 1 Create a `MemorySegment` from the memory segment address, also using the same size, and the current `ResourceScope`. 2 Since `secretSegment` is actually an off-heap `MemorySegment`, the secret array has to be transformed first into an on-heap `MemorySegment` before being copied to the secret memory mapping. 3 Eventually make the segment read-only.

And to read the secret, just extract the byte array from the memory segment.

``var bytes = secretSegment.toArray(ValueLayout.JAVA_BYTE);``
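Note that `toArray` returns a fresh on-heap copy, so the secret now also lives on the Java heap, outside the protected mapping. One common mitigation is to overwrite such temporary copies as soon as they have been used. Below is a minimal sketch of that idea; the `SecretCopies` class and `useAndWipe` method are hypothetical helpers, and wiping is best effort only, since the garbage collector may already have moved the array and left stale bytes behind.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.function.Consumer;

class SecretCopies {
    /** Hands a temporary on-heap copy to the consumer, then overwrites it with zeros. */
    static void useAndWipe(byte[] copy, Consumer<byte[]> consumer) {
        try {
            consumer.accept(copy);
        } finally {
            Arrays.fill(copy, (byte) 0); // best effort: the GC may have relocated the array earlier
        }
    }

    public static void main(String[] args) {
        // Stand-in for secretSegment.toArray(ValueLayout.JAVA_BYTE)
        byte[] bytes = "super secret decryption key".getBytes(StandardCharsets.UTF_8);
        useAndWipe(bytes, b -> System.out.println("using a secret of " + b.length + " bytes"));
        System.out.println("first byte after wipe: " + bytes[0]); // 0
    }
}
```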

With this you have a complete working example of how to use the `memfd_secret` from Java using Panama (JEP-419).

…or not!

Indeed, running this will make the JVM seg-fault!

stdout
``````#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f561919ffd7, pid=4798, tid=4799
#
# JRE version: OpenJDK Runtime Environment 22.3 (18.0+37) (build 18+37)
# Java VM: OpenJDK 64-Bit Server VM 22.3 (18+37, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# v  ~StubRoutines::jbyte_disjoint_arraycopy
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /home/bob/opensource/core.4798)
#
# An error report file with more information is saved as:
# /home/bob/opensource/hs_err_pid4798.log
#
# If you would like to submit a bug report, please visit:
#   https://bugzilla.redhat.com/enter_bug.cgi?product=Fedora&component=java-latest-openjdk&version=35
#``````

So, what happened? The problematic frame isn’t helpful if you’re not familiar with JVM internals. Opening `hs_err_pid4798.log` is more helpful.

hs_err_pid4798.log
``````...

Stack: [0x00007f734ae3d000,0x00007f734af3e000],  sp=0x00007f734af3c430,  free space=1021k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
v  ~StubRoutines::jbyte_disjoint_arraycopy
V  [libjvm.so+0xe66d70]  Unsafe_CopyMemory0+0xd0
j  jdk.internal.misc.Unsafe.copyMemory0(Ljava/lang/Object;JLjava/lang/Object;JJ)V+0 java.base@18
j  jdk.internal.misc.Unsafe.copyMemory(Ljava/lang/Object;JLjava/lang/Object;JJ)V+29 java.base@18
j  jdk.internal.misc.ScopedMemoryAccess.copyMemoryInternal(Ljdk/internal/misc/ScopedMemoryAccess$Scope;Ljdk/internal/misc/ScopedMemoryAccess$Scope;Ljava/lang/Object;JLjava/lang/Object;JJ)V+32 java.base@18
j  jdk.internal.misc.ScopedMemoryAccess.copyMemory(Ljdk/internal/misc/ScopedMemoryAccess$Scope;Ljdk/internal/misc/ScopedMemoryAccess$Scope;Ljava/lang/Object;JLjava/lang/Object;JJ)V+12 java.base@18
j  jdk.incubator.foreign.MemorySegment.copy(Ljdk/incubator/foreign/MemorySegment;Ljdk/incubator/foreign/ValueLayout;JLjdk/incubator/foreign/MemorySegment;Ljdk/incubator/foreign/ValueLayout;JJ)V+202 jdk.incubator.foreign@18
j  jdk.incubator.foreign.MemorySegment.copy(Ljdk/incubator/foreign/MemorySegment;JLjdk/incubator/foreign/MemorySegment;JJ)V+13 jdk.incubator.foreign@18
j  jdk.incubator.foreign.MemorySegment.copyFrom(Ljdk/incubator/foreign/MemorySegment;)Ljdk/incubator/foreign/MemorySegment;+10 jdk.incubator.foreign@18 (1)
j  io.github.bric3.panama.f.syscalls.LinuxSyscall.memfd_secret_external()V+48
j  io.github.bric3.panama.f.syscalls.LinuxSyscall.main([Ljava/lang/String;)V+99
v  ~StubRoutines::call_stub
V  [libjvm.so+0x81420a]  JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x30a
V  [libjvm.so+0x8a2111]  jni_invoke_static(JNIEnv_*, JavaValue*, _jobject*, JNICallType, _jmethodID*, JNI_ArgumentPusher*, JavaThread*) [clone .isra.174] [clone .constprop.397]+0x351
V  [libjvm.so+0x8a4a05]  jni_CallStaticVoidMethod+0x145
C  [libjli.so+0x47a9]  JavaMain+0xd19
C  [libjli.so+0x7d69]  ThreadJavaMain+0x9
...

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0xffffffffffffffff (2)

...``````
 1 This happened while doing the `MemorySegment::copyFrom` call. 2 Moreover, the segmentation fault appears to have been caused by a memory access to non mapped memory address `SEGV_MAPERR`. The most common other reason for segfault is `SEGV_ACCERR`, which is caused by accessing a memory address with wrong permissions.

So what happened? Actually, the value of the file descriptor was `-1`, which of course is not a valid file descriptor. Also, the call to `ftruncate` seems to cope with an invalid file descriptor without crashing.

The call to `mmap` on this file descriptor also returns `-1`, where a memory segment address was expected.
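This `-1` is the same value as the `si_addr` reported in the crash log: `mmap` signals failure by returning `MAP_FAILED`, defined in C as `((void *) -1)`, and `-1` viewed as a 64-bit address is `0xffffffffffffffff`. The sketch below shows the arithmetic and a guard that could be applied before wrapping the address in a `MemorySegment`. The `MapFailed` class is a made-up helper; with JEP-419 the raw value would presumably come from `MemoryAddress.toRawLongValue()`.

```java
class MapFailed {
    // mmap reports failure with MAP_FAILED, i.e. ((void *) -1) in C
    static final long MAP_FAILED = -1L;

    /** True when the raw address returned by mmap is the failure marker. */
    static boolean isMapFailed(long rawAddress) {
        return rawAddress == MAP_FAILED;
    }

    /** Formats a raw address the way it appears in an hs_err log. */
    static String asHex(long rawAddress) {
        return "0x" + Long.toHexString(rawAddress);
    }

    public static void main(String[] args) {
        System.out.println(asHex(MAP_FAILED));       // 0xffffffffffffffff
        System.out.println(isMapFailed(MAP_FAILED)); // true
    }
}
```

Checking for this marker right after the `mmap` call turns the JVM crash into an ordinary, reportable error.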

So why did this happen? When invoking native functions, syscalls in particular, one needs to be aware of their error handling conventions.

### `errno`

Indeed, when developing in C/C++, when something returns `-1`, it usually means that something went wrong, and that the result is invalid.

Moreover, the `errno` variable is a global variable that is set by the system calls and some library functions, see the relevant `man 3 errno`.

Because it is a global variable its declaration depends on the system.

Example 5. Linux’s `errno`
• `/usr/include/asm-generic/errno.h`

• `/usr/include/asm-generic/errno-base.h`

errno declaration
``````extern int *__errno_location (void) __THROW __attribute_const__;
# define errno (*__errno_location ())``````
errno codes
``````...
/*
* This error code is special: arch syscall entry code will return
* -ENOSYS if users try to call a syscall that doesn't exist.  To keep
* failures of syscalls that really do exist distinguishable from
* failures due to attempts to use a nonexistent syscall, syscall
* implementations should refrain from returning -ENOSYS.
*/
#define ENOSYS          38      /* Invalid system call number */
...``````
Example 6. macOS’s `errno`
• `/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/sys/errno.h`

• `/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/errno.h`

errno declaration
``````extern int * __error(void);
#define errno (*__error())``````
errno codes
``````...
#define ENOLCK          77              /* No locks available */
#define ENOSYS          78              /* Function not implemented */
...``````
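Since the numeric value of `ENOSYS` differs between the two systems (38 on Linux, 78 on macOS, per the headers above), a Java constant cannot simply be hard-coded for both. Here is a minimal sketch that picks the value from the OS name, assuming the usual `os.name` values `Linux` and `Mac OS X`; the `ErrnoConstants` class is a hypothetical helper, not something generated by the JDK.

```java
import java.util.Locale;

class ErrnoConstants {
    /** Returns the ENOSYS value for the given os.name, using the header values quoted above. */
    static int enosysFor(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.contains("linux")) {
            return 38; // asm-generic/errno.h
        }
        if (os.contains("mac")) {
            return 78; // sys/errno.h
        }
        throw new UnsupportedOperationException("no ENOSYS value known for " + osName);
    }

    public static void main(String[] args) {
        System.out.println("Linux ENOSYS: " + enosysFor("Linux"));    // 38
        System.out.println("macOS ENOSYS: " + enosysFor("Mac OS X")); // 78
    }
}
```

In real code the argument would be `System.getProperty("os.name")`.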

So in our case we’ll need to check for errors after each call, as each of these calls is a system call underneath.

On Linux we can see that the `errno` definition is actually a call to a function that returns a pointer: `*__errno_location ()`

checking `errno`
``````MethodHandle __errnoLocationMH = systemCLinker.downcallHandle(
systemCLinker.lookup("__errno_location").get(),
FunctionDescriptor.of(ValueLayout.ADDRESS)
);

int errno = ((MemoryAddress) __errnoLocationMH.invoke()) (1)
.get(ValueLayout.JAVA_INT, 0); (2)``````
 1 Get `errno` address 2 Read `errno` value

On Linux the package `moreutils` has a tool called `errno` that can be used to list all the error codes: `errno -l`.

Additionally, there is a function `strerror` that returns a string from an error code.

getting the error message
``````MethodHandle strerror = systemCLinker.downcallHandle(
systemCLinker.lookup("strerror").get(),
FunctionDescriptor.of(ValueLayout.ADDRESS, ValueLayout.JAVA_INT)
);

String errmsg = ((MemoryAddress) strerror.invoke(errno)).getUtf8String(0);``````

So, placing this check after the `memfd_secret` syscall looked like a good bet. Doing something similar after each call is a good idea as well; it kinda looks like the Go way of checking errors.

memfd_secret error checking
``````fd = (int) sys_memfd_secret.invoke(0);
if (fd == -1) {
var errno = errno();
System.err.println(errno == ENOSYS ?
"tried to call a syscall that doesn't exist (errno=ENOSYS), may need to set the 'secretmem.enable=1' kernel boot option" :
"syscall memfd_secret failed, errno: " + errno + ", " + strerror(errno));
return Optional.empty();
}``````
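Repeating this check after every native call gets verbose quickly, so it can be factored into a small helper. The sketch below is one possible shape, not the code used in this article: `NativeCalls`, `NativeCallException` and `check` are made-up names, and the `errno` reader and `strerror` lookup are passed in as lambdas so the example runs without any native access.

```java
import java.util.function.IntFunction;
import java.util.function.IntSupplier;

class NativeCalls {
    /** Carries the errno observed right after a failed native call. */
    static class NativeCallException extends RuntimeException {
        final int errno;

        NativeCallException(String call, int errno, String message) {
            super(call + " failed, errno: " + errno + ", " + message);
            this.errno = errno;
        }
    }

    /**
     * Returns the result unchanged, or throws if it is -1.
     * errno must be read immediately after the failing call, before any other native call.
     */
    static int check(String call, int result, IntSupplier errno, IntFunction<String> strerror) {
        if (result != -1) {
            return result;
        }
        int code = errno.getAsInt();
        throw new NativeCallException(call, code, strerror.apply(code));
    }

    public static void main(String[] args) {
        // Stubbed errno/strerror; with Panama these would be the errno() and strerror(int) helpers above
        int fd = check("memfd_secret", 4, () -> 0, code -> "");
        System.out.println("fd: " + fd);
        try {
            check("memfd_secret", -1, () -> 38, code -> "Function not implemented");
        } catch (NativeCallException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

With the method handles from this article, a call site could then look like `check("memfd_secret", (int) sys_memfd_secret.invoke(0), () -> errno(), code -> strerror(code))`, assuming `errno()` and `strerror(int)` wrap the handles shown earlier.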

While reviewing the `memfd_secret` commit, we can see there’s a check that returns `ENOSYS` when a condition is not met.

So in order to make the whole thing work, we need to tackle what’s preventing `memfd_secret` from succeeding.

### Linux bootloader flag

So actually, Linux is gating the `memfd_secret` syscall behind a flag named `secretmem_enable`. That may be why `memfd_secret` is not listed when looking at `man 2 syscalls`.

It’s not quite clear from the commit that introduced `memfd_secret`, but for the syscall to work, the machine has to be booted with the flag `secretmem.enable=1`.

DISCLAIMER: I am not responsible if something goes wrong on your machines / OS. The following actually changes the Linux bootloader configuration, and as such, any misconfiguration could make the system non-bootable! Please read and understand the documentation of your system before proceeding.
Enabling this prevents hibernation whenever there are active secret memory users.

My test machine is a Fedora 35, let’s read their page on the GRUB2 bootloader.

From this page, it seems there’s a fairly simple way to change the bootloader configuration.

add `secretmem.enable=1` flag
``sudo grubby --update-kernel=ALL --args="secretmem.enable=1"``
check the configuration
``sudo grubby --info=ALL``
remove `secretmem.enable=1` flag
``sudo grubby --update-kernel=ALL --remove-args="secretmem.enable=1"``

Notice the actual flag name is `secretmem.enable`, not `secretmem_enable` !

Then reboot the OS. Now if the configuration was properly applied, `memfd_secret` should return a valid file descriptor.
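Before running the program, it can be handy to verify that the flag really made it to the running kernel. On Linux the boot parameters are exposed in `/proc/cmdline`, so a minimal check could look like the sketch below. The `SecretMemFlag` class is a made-up helper, and it only looks for the exact `secretmem.enable=1` spelling used in this article.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class SecretMemFlag {
    /** True when the kernel command line contains the exact token secretmem.enable=1. */
    static boolean isEnabled(String cmdline) {
        return Arrays.asList(cmdline.trim().split("\\s+")).contains("secretmem.enable=1");
    }

    public static void main(String[] args) throws IOException {
        Path cmdline = Path.of("/proc/cmdline"); // Linux only
        if (Files.isReadable(cmdline)) {
            System.out.println("secretmem enabled: " + isEnabled(Files.readString(cmdline)));
        } else {
            System.out.println("/proc/cmdline is not available on this system");
        }
    }
}
```

If this prints `false` after the reboot, `memfd_secret` can be expected to keep failing with `ENOSYS`.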

``````$ java --add-modules=jdk.incubator.foreign --enable-native-access=ALL-UNNAMED MemfdSecret.java
WARNING: Using incubator modules: jdk.incubator.foreign
warning: using incubating module(s): jdk.incubator.foreign
1 warning
Secret mem fd: 4 (1)
Secret: super secret decryption key``````
 1 `memfd_secret` here returned the file descriptor `4`

Typically, this secret storage could be used to store a decryption key during startup, which is then used to decrypt encrypted payloads. Of course, care must be taken to prevent this data from leaving this memory, which might not be possible under many circumstances, e.g. with a library that takes a Java String, in which case the secret buffer is copied elsewhere in the heap.

## Improvements

### Trying to replace most panama calls by JDK types

So apart from the `memfd_secret` syscall, the other calls look replaceable?

`MemorySegment.mapFile` looks like a good bet to replace `mmap`. However, upon first use, things start to look problematic. The signature requires a `Path` and the mapping is limited to a single `MapMode`.

`MemorySegment::mapFile` signature
``````static MemorySegment mapFile(
    Path path,
    long bytesOffset,
    long bytesSize,
    FileChannel.MapMode mapMode,
    ResourceScope scope
) throws IOException {``````

Supposing the file descriptor value is `4`, even if it was possible to pass `/dev/fd/4` or `/proc/self/fd/4` as a `Path`, we could not map this segment as read and write via this API. And performing this operation twice, once in read-only mode and once in write-only mode, would not work as this special file descriptor is closed after the first memory mapping.

There are some interesting bits in `FileOutputStream` / `FileInputStream`: they can be created from a JDK `FileDescriptor`, and they allow getting the underlying `FileChannel`, which in turn allows calling `map()` to get a memory mapping.

However, the `FileDescriptor` class does not have a public constructor, and even being able to hack `FileDescriptor` (with `--add-opens=java.base/java.io=ALL-UNNAMED`) is not enough, as we end up in the same situation as above because it’s only possible to have a mapping in read-only or write-only mode.

Basically, we’re stuck with using the `mmap` native function to do what’s necessary. I don’t know if it is out of scope for JEP-419, or the next JEP-424, but I think it would be a good thing to support a `MemorySegment` over an arbitrary file descriptor. In particular when writing programs that run on the command line, this could enable things like `java Main.java <(cat neko | grep meow)`.

Finally, I don’t believe there’s something equivalent available in the JDK for the `ftruncate` function.

### Improving our syscall API

In the snippet above, we’ve declared a `MethodHandle` to the `syscall` function. If there are multiple syscalls, we’ll need to pass the syscall number as the first argument each time. The `MethodHandles` API allows creating partial functions.

syscall partial function
``````var syscallAddress = systemCLinker.lookup("syscall").get();
var syscall = systemCLinker.downcallHandle(
    syscallAddress,
    FunctionDescriptor.of(
        ValueLayout.JAVA_INT,
        ValueLayout.JAVA_INT (1)
    )
);

var sys_getpid = MethodHandles.insertArguments(syscall, 0, SYS_getpid); (2)
sys_getpid.invoke(); (3)``````
 1 The first argument is the syscall number. 2 Capture the `syscall` number and create a "partial function". 3 Invocation of the partial function doesn’t need argument 0.

Now if the syscall has a different arity, `FunctionDescriptor::appendArgumentLayouts` has us covered, so that we can use the basic template of a syscall, sort of, and build on top of this to have specific identifiers for each syscall.

syscall partial function, with added arguments
``````var sys_memfd_secret = MethodHandles.insertArguments(systemCLinker.downcallHandle(
    systemCLinker.lookup("syscall").get(),
    FunctionDescriptor.of(
        ValueLayout.JAVA_INT,
        ValueLayout.JAVA_INT
    ).appendArgumentLayouts(ValueLayout.JAVA_INT) (1)
), 0, SYS_memfd_secret); (2)

int fd = (int) sys_memfd_secret.invoke(0); (3)``````
 1 Append arguments to the function descriptor. 2 Capture the `syscall` number and create a "partial function". 3 Simply invoke the call, passing only the required arguments at the call site.

Other things are possible with `MethodHandle`s that can be handy with Panama, yet they are out of scope for this blog post. Just check the API.

### Generating the `MethodHandle`s with `jextract`

The JDK Panama team also created a tool known as `jextract`, whose job is to do most of the heavy lifting of generating the `MethodHandle`s.

As mentioned in other blog posts or conference talks I gave, `jextract` is now a separate tool. At this point there’s no binary release, which means it has to be built. The `jextract` project page explains how to do this. My test machine is a Fedora, so adapt the commands and the JDK distribution to your needs.

Build jextract
``````sudo dnf install java-latest-openjdk-jmods.x86_64 (1)
curl -LO https://github.com/llvm/llvm-project/releases/download/llvmorg-14.0.0/clang+llvm-14.0.0-x86_64-linux-gnu-ubuntu-18.04.tar.xz (2)
tar xf clang+llvm-14.0.0-x86_64-linux-gnu-ubuntu-18.04.tar.xz
git clone --depth 1 https://github.com/openjdk/jextract.git
cd jextract
sh ./gradlew \
    -Pjdk18_home=/etc/alternatives/java_sdk_18_openjdk \
    -Pllvm_home=/home/bric3/clang+llvm-14.0.0-x86_64-linux-gnu-ubuntu-18.04/ \
    clean verify (3)``````
 1 At this time the latest is OpenJDK 18 2 LLVM 14 for x86_64 3 Run the documented build command with the required home directories.

If everything is alright you can use `jextract`:

`jextract` Version
``````$ build/jextract/bin/jextract --version
WARNING: Using incubator modules: jdk.incubator.foreign
jextract 18.0.1
JDK version 18.0.1+10
clang version 14.0.0``````

The basic usage is `jextract <options> <header file>`. Since there are multiple headers, the trick is to specify a handcrafted header that includes every needed header.

memfd_secret_header.h
``````#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>``````

It would be really neat if `jextract`, or even `java`, could handle arbitrary file descriptors; this would come in handy with here-docs.

jextract with here-doc, for multiple headers
``````$ jextract ... <(cat <<-EOF (1)
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/mman.h>
EOF
)``````
 1 The `<()` process substitution yields a path backed by a file descriptor rather than a file on disk, something like `/proc/self/fd/13`, whose content is the string between the `<<-EOF` and `EOF` markers.

Unfortunately, for now `jextract` bails out when something like `/proc/self/fd/13` is passed, reporting that it is not a file.

Also note that `jextract` supports passing its options in an options file. So we can pass output options, like the target package and class name, but also the symbols we’d like.

memfd_secret_header.jextract.options
``````--source (1)
--output build/generated/sources/jextract-syscall/java (2)
--target-package linux (3)
--header-class-name syscall_h (4)

--include-function syscall
--include-macro SYS_memfd_secret
--include-function close
--include-function ftruncate
--include-function mmap
--include-function munmap
--include-macro PROT_READ
--include-macro PROT_WRITE
--include-macro MAP_SHARED
--include-function strerror

#### Since errno macro is not supported at this time, it is necessary
#### to manually resolve errno, and include OS specific declarations.
#### e.g. __errno_location is Linux specific
--include-function __errno_location (5)``````
 1 Tells `jextract` to generate source code instead of classes.
 2 Specifies the output directory.
 3 Specifies the package name.
 4 Specifies the class name.
 5 Specifies the function used to resolve `errno` on Linux.

Running `jextract`
``$ jextract @memfd_secret_header.jextract.options memfd_secret_header.h (1)``
 1 Assuming `jextract` is in the PATH, or there’s an alias.

So once done, we’ll have a file with all the symbols we need.

syscall_h.java
``````// ...
public class syscall_h {
    public static MemoryAddress __errno_location() {
        // ...
    }

    public static long syscall(long __sysno, Object... x1) {
        // ...
    }

    public static MemoryAddress mmap(Addressable __addr, long __len, int __prot, int __flags, int __fd, long __offset) {
        // ...
    }

    // ...

    public static int SYS_memfd_secret() {
        return (int)447L;
    }
}``````

What’s nice is that the arguments are named, e.g. `__sysno`, `__addr`, `__fd`, etc.

Once you have done your research on which symbols you need, it’s really nice to let `jextract` generate the code for you; the result is likely to be up to date, with best practices baked in.

There’s one thing where this is a bit suboptimal: the `__errno_location` function is actually an OS-specific function that is used to resolve the `errno` value. I’m unsure whether `jextract` should resolve macros in general, yet `errno` seems like something very common, so would it make sense for `jextract` to handle this one? But then that opens the door to macro resolution, which is a different level of complexity.

That being said, it’s not a deal-breaker, just something to be aware of.

## Closing words

memfd_secret

Ever since I heard about this feature in Linux 5.14, I had been hoping to test it in the wake of the Spectre-style attacks, at least from a developer perspective. The first thing is that you’ll need a Linux kernel of at least that version, so forget Docker Desktop for now, as even the latest 4.8 still uses a Linux 5.10 kernel (at least on macOS). Also, deployment-wise, there’s a flag to enable at boot time, which makes it difficult to deploy, in particular on a cloud provider, unless you have control over the bootloader. On your regular laptop, the fact that this feature disables hibernation is almost a deal-breaker for this kind of hardware.

Personally, if an application does not have very tight control over how its secrets are actually used, I fail to see the value of such a feature.

JEP-419

Yet again, project Panama, embodied by JEP-419 in JDK 18, delivers: it’s possible to interact with the system, with some ease, and without having to deal with different build systems. I have almost nothing relevant to mention here. I missed the possibility of creating a `MemorySegment` from a file descriptor, but this might be a rare case, especially with the topic at hand. While I still find the mandatory use of `--enable-native-access=ALL-UNNAMED` impractical, especially for an API that arrives late, after alternatives that do not have this enforcement, this restriction will be relaxed in JEP-424 in JDK 19 (see `java/lang/foreign/package-info.java`): if the flag is not specified, users will get a warning on the first call to a restricted method (one warning per module).

Yet again, I’m happy to see this part of project Panama landing in the JDK to bridge the gap to the native world without third-party libraries.
