Using Linux's memfd_secret syscall from the JVM with JEP-419

Linux 5.14 introduced a new system call, memfd_secret, to mitigate speculative-execution attacks by preventing even the kernel from peeking at the memory segments created through this system call.

In this article I will leverage the API introduced (well, still incubating) by JEP-419 (Project Panama). JEP-419 has been delivered as part of JDK 18.

For those that don’t know, Project Panama is a project that aims to provide an easy, secure and efficient way to call native methods from Java. You can look at my previous articles on the earlier versions of the project, articles on foojay.io by Carl Dea, or those on inside.java.

The Project Panama APIs are the fruit of work started in 2014, from even older ideas. The project is still evolving: the next iteration (JEP-424) in JDK 19 is going to promote these APIs to preview, but API and behavior adjustments are still likely.
The following examples are based on JDK 18 2022-03-22, build 18+37 from Red Hat. The Linux distribution is a Fedora release 35 with the kernel 5.16.20-200.fc35.x86_64.

While this article focuses on Linux, the same concepts apply to other OSes (and CPUs, as we’ll see). Also, this article is not an introduction to JEP-419; that is covered in other material. This post therefore assumes the right compilation and runtime flags for JEP-419, usually --add-modules=jdk.incubator.foreign and --enable-native-access=ALL-UNNAMED.

tl;dr

Here’s how one can make a syscall using JEP-419.

memfd_secret syscall
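var systemCLinker = CLinker.systemCLinker();
var sys_memfd_secret = systemCLinker.downcallHandle(
        systemCLinker.lookup("syscall").get(),
        FunctionDescriptor.of(
                ValueLayout.JAVA_INT, // return value
                ValueLayout.JAVA_INT, // syscall number
                ValueLayout.JAVA_INT  // flags
        )
);

// 447 is the memfd_secret syscall number on x86_64, 0 means no flags
int secret_fd = (int) sys_memfd_secret.invoke(447, 0);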

Now to know more about the whole deal, read on.

What is a system call (syscall) ?

Before jumping to memfd_secret, let’s first understand how to make a system call. And even before that, let’s see what a system call is.

If you are not interested in this part, you can jump to the memfd_secret section.

In order to do something useful, a program has to interact with some resources: memory, disk, network, terminal, etc. On a computer, these resources are handled by a very complex and critical piece of software, the Operating System.

In order to use these resources, a program has to make system calls like read, wait, write, exit, etc. The standard malloc, the native allocator, has to actually place a request to the OS to get memory via a mmap syscall.

[Figure: malloc obtaining memory through the mmap syscall]

As expected the JVM does plenty of syscalls too, e.g. when logging something on stdout or persisting a (unified) log file.

Essentially,

a system call is a way of requesting the kernel to do something for the program.

Why do system calls have to live in the kernel and not in user space, like a standard library? As mentioned earlier, system calls are a way to interact with, or involve, a resource such as devices, the file system, the network, processes, etc. These resources are managed by privileged software: the OS, or more precisely its kernel.

When a system call happens, the program doesn’t simply invoke a method whose code resides at some address; a system call actually makes the CPU switch to kernel mode, because the kernel is privileged software.

Most modern processors have a security model that limits the scope of what a program can do. In particular on Intel-based CPUs, the model is known as processor protection rings (or hierarchical protection domains).

[Figure: protection rings, from Ring 3 (user space, applications, lowest privileges), through Ring 2 and Ring 1 (device drivers), to Ring 0 (kernel space, kernel, highest privileges)]

It seems that Ring 1 and Ring 2 are rarely used because paging (the way the OS handles memory, see my blog post on [off-heap memory]) only has the concept of privileged and unprivileged, which minimizes the actual benefit of those rings, according to Evan Teran’s answer on SO.

When a processor executes some code (in a thread), it knows the current mode; this way the processor is able to gate memory accesses, e.g. a Ring 3 (user-land) program cannot access memory from Ring 0 (the kernel). This is yet another feature of the virtual memory abstraction. The processor can also restrict some instructions and registers to the software running in Ring 0.

Out of scope: there are even negative rings on some CPU architectures for hypervisors or CPU system management, down to Ring -3.

Since these restrictions are enforced by the CPU, a user-land program has to place a request to the kernel in order to do its job. This mechanism is called a syscall; it allows transitioning between rings.

[Figure: Syscall ring transitions. A process thread executes in Ring 3 (user land) and issues a syscall; it stays idle while the kernel executes the syscall in Ring 0, then resumes executing. A mode switch happens on each transition.]

During mode switches a lot is happening: saving and restoring registers, putting the CPU in a specific mode (user vs kernel), etc. And of course the reverse is done once the request has been handled, either with success or with a failure.

Privilege context switches are sufficiently costly that most libraries try to avoid them. For example, reading 8 KiB at a time instead of 256 bytes is a good idea, as it drastically reduces the number of syscalls and, as such, of mode switches.
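To make the buffer-size argument concrete, here is a minimal sketch in plain Java. It does not perform real syscalls: an in-memory ByteArrayInputStream stands in for a file, and each call to read is counted as if it were one read syscall. The names ReadCallCounter, CountingInputStream and countReads are made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class ReadCallCounter {
    /** Wraps a stream and counts how many times read(byte[], int, int) is invoked. */
    static final class CountingInputStream extends FilterInputStream {
        int readCalls = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            readCalls++;
            return super.read(b, off, len);
        }
    }

    /** Drains totalBytes of in-memory data with the given buffer size, returns the number of read calls. */
    static int countReads(int totalBytes, int bufferSize) {
        var in = new CountingInputStream(new ByteArrayInputStream(new byte[totalBytes]));
        byte[] buffer = new byte[bufferSize];
        try {
            while (in.read(buffer, 0, buffer.length) != -1) {
                // the data is discarded, only the number of calls matters here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return in.readCalls;
    }

    public static void main(String[] args) {
        int oneMiB = 1024 * 1024;
        System.out.println("256 B buffer: " + countReads(oneMiB, 256) + " read calls");
        System.out.println("8 KiB buffer: " + countReads(oneMiB, 8192) + " read calls");
    }
}
```

With a real file each of these calls would typically translate into a read syscall, so the larger buffer divides the number of mode switches by 32.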

What does the documentation say about syscalls?

Now let’s get practical.

Looking at man 2 syscall, the manpage sheds some light on how to make the call, specifically in the Architecture calling conventions section. Those details are given in terms of assembly, e.g.

  • processor interrupt 0x80 for i386 processors (32 bits), then specific registers

  • syscall instruction for x86_64 processors (64 bits), then specific registers

The calling conventions of other architectures are also described, e.g. on ARM processors the system call is performed by a swi 0x0 instruction, and on aarch64 by svc #0.

People who are not aware of what exactly a calling convention is should read at least this Wikipedia article on x86 calling conventions. In short, a calling convention defines how and where parameters should be placed in order to call the code: how parameters are passed (registers and/or stack), how values are returned, etc.

This manual page also points out an important difference with regular functions: while we look up system calls by their names (write, read, execve, exit, mmap, memfd_create, etc.), the programs and the kernel actually know them by numbers.

Why numbers? The reason is that syscalls are like messages that are passed down, and these numbers are somewhat like enum ordinals indicating the type of message. These numbers are part of the syscall ABI (Application Binary Interface) and as such they are stable for a given CPU architecture, although unbounded (new syscalls can be added).

Out of scope here: not all syscalls are created equal nowadays. Some syscalls, usually the most used ones, are exported to user-space memory to avoid the cost of switching to kernel mode. In practice, the vDSO (virtual dynamic shared object) is like a library: it is loaded in memory so that it can be accessed from the program memory (glibc knows about this memory region and will use it).

pmap -X {pid}
# pmap -X 1
1:   java ...
         Address Perm   Offset Device   Inode Size  Rss  Pss Referenced Anonymous LazyFree ShmemPmdMapped FilePmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked THPeligible ProtectionKey Mapping
...
    7ffe78f4c000 rw-p 00000000  00:00       0  132  112  112        112       112        0              0             0              0               0    0       0      0           0             0 [stack]
    7ffe78fad000 r--p 00000000  00:00       0   16    0    0          0         0        0              0             0              0               0    0       0      0           0             0 [vvar]
    7ffe78fb1000 r-xp 00000000  00:00       0    8    4    0          4         0        0              0             0              0               0    0       0      0           0             0 [vdso]  (1)
ffffffffff600000 r-xp 00000000  00:00       0    4    0    0          0         0        0              0             0              0               0    0       0      0           0             0 [vsyscall]
...
1 The vDSO 8 KiB segment

To read more about it, one should read the relevant manual page (man 7 vdso). Typically, this page lists the exported syscalls.

E.g. __vdso_clock_gettime, which is called by clock_gettime defined in the standard libc (man 3 clock_gettime).

The syscall numbers are different between architectures! On Linux one can look at their definition in the /include/asm-/unistd-.h files.
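As an illustration of how the same syscall name maps to different numbers, here is a small lookup table in plain Java. The class name SyscallNumbers and its lookup method are hypothetical; the values are the ones found in the unistd_64.h and unistd_32.h headers on an Intel Linux machine, and they should be double-checked against the headers of your own system.

```java
import java.util.Map;

class SyscallNumbers {
    // Syscall numbers per architecture, as found in the Linux unistd_64.h / unistd_32.h headers.
    static final Map<String, Map<String, Integer>> NUMBERS = Map.of(
            "x86_64", Map.of("write", 1, "getpid", 39, "exit", 60, "memfd_secret", 447),
            "i386", Map.of("write", 4, "getpid", 20, "exit", 1, "memfd_secret", 447)
    );

    /** Returns the syscall number for the given architecture, or fails loudly if it is unknown. */
    static int lookup(String arch, String name) {
        var table = NUMBERS.get(arch);
        if (table == null || !table.containsKey(name)) {
            throw new IllegalArgumentException("Unknown syscall " + name + " for " + arch);
        }
        return table.get(name);
    }

    public static void main(String[] args) {
        for (String arch : new String[] {"x86_64", "i386"}) {
            System.out.println(arch + ": write=" + lookup(arch, "write") + ", exit=" + lookup(arch, "exit"));
        }
    }
}
```

Older syscalls such as write and exit have different numbers on each architecture, while recent ones such as memfd_secret tend to share the same number.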

From the syscall manpage the Intel CPUs syscall calling convention is:

Example 1. 64-bit programs
Set the registers
  1. rax ← System Call number

  2. rdi ← First argument

  3. rsi ← Second argument

  4. rdx ← Third argument

Make the syscall
  • execute syscall processor instruction

The actual syscall numbers (for 64-bit programs) are usually defined in /usr/include/asm/unistd_64.h.

Example 2. 32-bit programs
Set the registers
  1. eax ← System Call Number

  2. ebx ← First Argument

  3. ecx ← Second Argument

  4. edx ← Third Argument

Make the syscall
  • Place a processor interrupt int 0x80

The actual syscall numbers (for 32-bit programs) are usually defined in /usr/include/asm/unistd_32.h.
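The two conventions above can be summarized with a tiny plain-Java sketch that shows which register receives which value. Nothing here touches real registers; SyscallRegisters and assign are hypothetical names, and only the syscall number plus the first three arguments listed above are modeled.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SyscallRegisters {
    // Register order: syscall number first, then the first three arguments.
    static final List<String> X86_64 = List.of("rax", "rdi", "rsi", "rdx");
    static final List<String> I386 = List.of("eax", "ebx", "ecx", "edx");

    /** Pairs the syscall number and its arguments with the registers of the given convention. */
    static Map<String, Long> assign(List<String> registers, long number, long... args) {
        if (args.length > registers.size() - 1) {
            throw new IllegalArgumentException("Only " + (registers.size() - 1) + " arguments are modeled");
        }
        Map<String, Long> assignment = new LinkedHashMap<>();
        assignment.put(registers.get(0), number);
        for (int i = 0; i < args.length; i++) {
            assignment.put(registers.get(i + 1), args[i]);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // exit(1): syscall number 60 on x86_64, 1 on i386
        System.out.println("x86_64 exit(1): " + assign(X86_64, 60, 1));
        System.out.println("i386   exit(1): " + assign(I386, 1, 1));
    }
}
```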

My first syscall

In order to quickly practice a syscall, let’s do a very simple hello world. The example will be in assembler, I promise this is the only source snippet in assembly and after that I’ll be back with Java and Panama.

  • /usr/include/asm/unistd_64.h

Example 3. 64-bits (with syscall instruction)
hello_syscall.asm (x86_64)
global _start       ; define entrypoint
section .text
_start:
    mov rax, 0x1    ; syscall number for write (1)
    mov rdi, 0x1    ; int fd                   (2)
    mov rsi, msg    ; const void* buf
    mov rdx, mlen   ; size_t count
    syscall         ; make the call            (3)

    mov rax, 0x3c   ; syscall number for exit  (1)
    mov rdi, 0x1    ; int status               (2)
    syscall         ; make the call            (3)

section .rodata
    msg: db "Hello Linux syscalls!",0x0a, 0x0d  ; message string, terminated by a new line (0A, 0D)
    mlen: equ $-msg                             ; calculate the length of the message
1 This register holds the number of the selected syscall. Note the number comes from /usr/include/asm/unistd_64.h.
2 Syscall arguments are placed in the next registers.
3 Make the syscall with the syscall instruction.
nasm -w+all -f elf64 -o hello_syscall.o hello_syscall.asm (1)
ld -o hello_syscall hello_syscall.o
./hello_syscall
1 Note the elf64 format for 64 bits.
Example 4. 32-bits (with an interrupt)
hello_syscall_via_int80.asm (x86, ie won’t work on ARM)
global _start                ; define entrypoint
section .text
_start:
    mov eax, 4               ; syscall number: write (1)
    mov ebx, 1               ; stdout (2)
    mov ecx, str             ; buffer address
    mov edx, str_len         ; buffer length
    int 0x80                 ; make the call (3)

    mov eax, 1               ; syscall number: exit (1)
    mov ebx, 0               ; exit status (2)
    int 0x80                 ; make the call (3)

section .rodata
    str: db "Hello Linux!", 0Ah  ; message string, terminated by a new line (0A)
    str_len: equ $ - str         ; calculate the length of the message
1 This register holds the number of the selected syscall. Note the number comes from /usr/include/asm/unistd_32.h.
2 Syscall arguments are placed in the next registers.
3 Make the syscall with interrupt 0x80.
compile and run
nasm -w+all -f elf32 -o hello_syscall_via_int80.o hello_syscall_via_int80.asm (1)
ld -m elf_i386 -o hello_syscall_via_int80 hello_syscall_via_int80.o (2)
./hello_syscall_via_int80
1 Note the elf32 format for 32 bits.
2 Note the linker emulation option for i386

When looking at this very simplistic code, something immediately stands out: from the application point of view (user land), a syscall is just like an atomic pseudo machine instruction. I believe this example is more striking than the figure above on syscall ring transitions.

We saw what exactly a syscall is and how to make one using assembly. In general though, it’s rare to invoke a syscall directly, as the standard library exposes wrappers that handle everything for most syscalls.

[Figure: syscall wrappers in the standard library. The program calls printf() in libc, which in turn issues syscall(SYS_write,…) to the kernel.]

Because the memfd_secret syscall was introduced only recently, there is no wrapper function for it in the standard library, hence we’ll need to make the system call ourselves.

Making syscalls from the JVM

The work of the Panama project doesn’t allow us to directly write assembly code and execute it. Fortunately!

And the libc already exposes a syscall function that takes care of the calling convention, as mentioned in man 2 syscall, i.e. it will place the arguments in the right CPU registers.

syscall manual example (omitting headers)
int main(int argc, char *argv[])
{
   pid_t pid;

   pid = syscall(SYS_getpid);  /* getpid takes no argument */
   printf("pid: %d\n", pid);
   return 0;
}

So, basically, to make a syscall using JEP-419, I only have to perform a lookup for the syscall function. And since it’s part of the standard libc, this just needs CLinker.systemCLinker().

syscall manual example with Panama
/*
  On linux (Intel x86_64) in
  - /usr/include/asm/unistd_64.h

  #define __NR_getpid 39

  On macOs (Intel x86_64) in either :
  - /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/sys/syscall.h
  - /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/syscall.h

  #define	SYS_getpid         20
*/
final static int SYS_getpid = 39; (1)

var systemCLinker = CLinker.systemCLinker();
MethodHandle syscall = systemCLinker.downcallHandle(
        systemCLinker.lookup("syscall").get(),
        FunctionDescriptor.of(
                ValueLayout.JAVA_INT, (2)
                ValueLayout.JAVA_INT  (3)
        )
);

int pid = (int) syscall.invoke(SYS_getpid); (4)
System.out.println("pid: " + pid);
1 The syscall number.
2 The return type of the syscall function.
3 The first argument is the syscall number.
4 Making the syscall.

That’s it, we’ve made our first direct syscall using Panama (and JEP-419). Simple, right? Let’s try to use that knowledge for the memfd_secret syscall.

memfd_secret

The memfd_secret syscall was introduced in this commit. Fortunately, Linux has good commit messages, so we can read it and learn more about how to create "secret" memory areas.

The following example demonstrates creation of a secret mapping (error handling is omitted):

fd = memfd_secret(0);
ftruncate(fd, MAP_SIZE);
ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

Basically we need to create the secret file descriptor, truncate it to the desired size, and then memory map it.

  1. First get a file descriptor with memfd_secret

    memfd_secret syscall
    /*
      On linux (Intel x86_64) in /usr/include/asm/unistd_64.h
    
      #define __NR_memfd_secret 447
    */
    final static int SYS_memfd_secret = 447; (1)
    
    MethodHandle syscall = systemCLinker.downcallHandle(
            systemCLinker.lookup("syscall").get(),
            FunctionDescriptor.of(
                    ValueLayout.JAVA_INT, (2)
                    ValueLayout.JAVA_INT, (3)
                    ValueLayout.JAVA_INT  (4)
            )
    );
    
    int secret_fd = (int) syscall.invoke(SYS_memfd_secret, 0); (5)
    1 The memfd_secret number.
    2 The return type of the syscall function.
    3 The first argument is the syscall number.
    4 The flags passed to memfd_secret, currently the only supported flag is O_CLOEXEC according to this LWN article by Jonathan Corbet.
    5 Making the syscall, not using any flags, the returned value is a file descriptor.

    We can proceed with the rest of the process.

  2. Then set the desired size

    // int ftruncate(int fd, off_t length);
    MethodHandle ftruncate = systemCLinker.downcallHandle(
            systemCLinker.lookup("ftruncate").get(),
            FunctionDescriptor.of(
                    ValueLayout.JAVA_INT,
                    ValueLayout.JAVA_INT, // fd
                    ValueLayout.JAVA_LONG // length
            )
    );
    
    var res = (int) ftruncate.invoke( (1)
            secret_fd,
            secret.length()
    );
    1 Invoke the ftruncate from the libc on the file descriptor with the wanted size.
  3. Finally, memory-map this file descriptor. This operation has the effect of unmapping this memory segment from the kernel pages (in Ring 0), so only the user process can read these memory pages.

    // in /usr/include/bits/mman-linux.h
    // #define PROT_READ       0x1             /* Page can be read.  */
    // #define PROT_WRITE      0x2             /* Page can be written.  */
    final int PROT_READ = 1;
    final int PROT_WRITE = 2;
    // #define MAP_SHARED      0x01            /* Share changes.  */
    final int MAP_SHARED = 1;
    
    // in /usr/include/sys/mman.h
    // extern void *mmap (void *__addr, size_t __len, int __prot,
    //                   int __flags, int __fd, __off_t __offset) __THROW;
    MethodHandle mmap = systemCLinker.downcallHandle(
            systemCLinker.lookup("mmap").get(),
            FunctionDescriptor.of(
                    ValueLayout.ADDRESS, // returned address
                    ValueLayout.ADDRESS, // addr
                    ValueLayout.JAVA_LONG, // size
                    ValueLayout.JAVA_INT, // protection modes
                    ValueLayout.JAVA_INT, // flags
                    ValueLayout.JAVA_INT, // fd
                    ValueLayout.JAVA_LONG // offset
            )
    );
    
    var segmentAddress = (MemoryAddress) mmap.invoke( (1)
            MemoryAddress.NULL,
            secret.length(),
            PROT_READ | PROT_WRITE,
            MAP_SHARED,
            secret_fd,
            0
    );
    1 Memory-map the file descriptor, using the same wanted size, and use the right protection modes (read & write), and flags.
  4. Once the memory segment is mapped, we can actually get access to it via the MemorySegment API.

    var secretSegment = MemorySegment.ofAddress(segmentAddress, length, scope); (1)
    secretSegment.copyFrom(MemorySegment.ofArray(secretBytes)); (2)
    var roSecretSegment = secretSegment.asReadOnly(); (3)
    1 Create a MemorySegment from the memory segment address, also using the same size, and the current ResourceScope.
    2 Since secretSegment is actually an off-heap MemorySegment, the secret array has to be wrapped first in an on-heap MemorySegment before being copied to the secret memory mapping.
    3 Eventually make the segment read-only.

    And to read the secret, just extract the byte array from the memory segment.

    var bytes = secretSegment.toArray(ValueLayout.JAVA_BYTE);

With this you have a complete working example of how to use the memfd_secret from Java using Panama (JEP-419).

…or not!

Indeed, running this will make the JVM seg-fault!

stdout
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f561919ffd7, pid=4798, tid=4799
#
# JRE version: OpenJDK Runtime Environment 22.3 (18.0+37) (build 18+37)
# Java VM: OpenJDK 64-Bit Server VM 22.3 (18+37, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# v  ~StubRoutines::jbyte_disjoint_arraycopy
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /home/bob/opensource/core.4798)
#
# An error report file with more information is saved as:
# /home/bob/opensource/hs_err_pid4798.log
#
# If you would like to submit a bug report, please visit:
#   https://bugzilla.redhat.com/enter_bug.cgi?product=Fedora&component=java-latest-openjdk&version=35
#

So, what happened? The problematic frame isn’t helpful if you’re not familiar with JVM internals. Opening hs_err_pid4798.log is more helpful.

hs_err_pid4798.log
...

Stack: [0x00007f734ae3d000,0x00007f734af3e000],  sp=0x00007f734af3c430,  free space=1021k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
v  ~StubRoutines::jbyte_disjoint_arraycopy
V  [libjvm.so+0xe66d70]  Unsafe_CopyMemory0+0xd0
j  jdk.internal.misc.Unsafe.copyMemory0(Ljava/lang/Object;JLjava/lang/Object;JJ)V+0 [email protected]
j  jdk.internal.misc.Unsafe.copyMemory(Ljava/lang/Object;JLjava/lang/Object;JJ)V+29 [email protected]
j  jdk.internal.misc.ScopedMemoryAccess.copyMemoryInternal(Ljdk/internal/misc/ScopedMemoryAccess$Scope;Ljdk/internal/misc/ScopedMemoryAccess$Scope;Ljava/lang/Object;JLjava/lang/Object;JJ)V+32 [email protected]
j  jdk.internal.misc.ScopedMemoryAccess.copyMemory(Ljdk/internal/misc/ScopedMemoryAccess$Scope;Ljdk/internal/misc/ScopedMemoryAccess$Scope;Ljava/lang/Object;JLjava/lang/Object;JJ)V+12 [email protected]
j  jdk.incubator.foreign.MemorySegment.copy(Ljdk/incubator/foreign/MemorySegment;Ljdk/incubator/foreign/ValueLayout;JLjdk/incubator/foreign/MemorySegment;Ljdk/incubator/foreign/ValueLayout;JJ)V+202 [email protected]
j  jdk.incubator.foreign.MemorySegment.copy(Ljdk/incubator/foreign/MemorySegment;JLjdk/incubator/foreign/MemorySegment;JJ)V+13 [email protected]
j  jdk.incubator.foreign.MemorySegment.copyFrom(Ljdk/incubator/foreign/MemorySegment;)Ljdk/incubator/foreign/MemorySegment;+10 [email protected] (1)
j  io.github.bric3.panama.f.syscalls.LinuxSyscall.memfd_secret_external()V+48
j  io.github.bric3.panama.f.syscalls.LinuxSyscall.main([Ljava/lang/String;)V+99
v  ~StubRoutines::call_stub
V  [libjvm.so+0x81420a]  JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x30a
V  [libjvm.so+0x8a2111]  jni_invoke_static(JNIEnv_*, JavaValue*, _jobject*, JNICallType, _jmethodID*, JNI_ArgumentPusher*, JavaThread*) [clone .isra.174] [clone .constprop.397]+0x351
V  [libjvm.so+0x8a4a05]  jni_CallStaticVoidMethod+0x145
C  [libjli.so+0x47a9]  JavaMain+0xd19
C  [libjli.so+0x7d69]  ThreadJavaMain+0x9
...

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0xffffffffffffffff (2)

...
1 This happened while doing the MemorySegment::copyFrom call.
2 Moreover, the segmentation fault appears to have been caused by a memory access to non mapped memory address SEGV_MAPERR. The most common other reason for segfault is SEGV_ACCERR, which is caused by accessing a memory address with wrong permissions.

So what happened? Actually, the value of the file descriptor was -1, which of course is not a valid file descriptor. The call to ftruncate copes well with an invalid file descriptor: it simply fails without crashing.

The call to mmap with this file descriptor also returns -1, and that value is then used as if it were the memory segment address. This matches the si_addr of 0xffffffffffffffff reported above.
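The link between the -1 returned by mmap and the faulting address in the crash report can be checked with a few lines of plain Java. In C, mmap signals failure with MAP_FAILED, defined as (void *) -1. The class and method names below are hypothetical; with JEP-419 the raw value would presumably be obtained from MemoryAddress.toRawLongValue().

```java
class MmapResult {
    /** mmap reports a failure by returning MAP_FAILED, i.e. (void *) -1. */
    static final long MAP_FAILED = -1L;

    /** Tells whether the raw address returned by mmap is the failure marker. */
    static boolean isMapFailed(long rawAddress) {
        return rawAddress == MAP_FAILED;
    }

    /** Formats a raw address the way it appears in the hs_err file. */
    static String asPointer(long rawAddress) {
        return "0x" + Long.toHexString(rawAddress);
    }

    public static void main(String[] args) {
        System.out.println("MAP_FAILED as a pointer: " + asPointer(MAP_FAILED));
        System.out.println("is failure: " + isMapFailed(MAP_FAILED));
    }
}
```

Checking the returned address against this marker before wrapping it in a MemorySegment would turn the crash into a regular error.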

So why did this happen? When invoking native methods, syscalls in particular, one needs to be aware of the error-handling convention of these methods.

errno

Indeed, when developing in C/C++, when something returns -1, it usually means that something went wrong, and that the result is invalid.

Moreover, errno is a global (in practice thread-local) variable that is set by system calls and some library functions, see the relevant man 3 errno.

Because it is a global variable, its declaration depends on the system.

Example 5. Linux’s errno
  • /usr/include/asm-generic/errno.h

  • /usr/include/asm-generic/errno-base.h

errno declaration
extern int *__errno_location (void) __THROW __attribute_const__;
# define errno (*__errno_location ())
errno codes
...
/*
 * This error code is special: arch syscall entry code will return
 * -ENOSYS if users try to call a syscall that doesn't exist.  To keep
 * failures of syscalls that really do exist distinguishable from
 * failures due to attempts to use a nonexistent syscall, syscall
 * implementations should refrain from returning -ENOSYS.
 */
#define ENOSYS          38      /* Invalid system call number */
...
Example 6. macOs’s errno
  • /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/sys/errno.h

  • /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/errno.h

errno declaration
extern int * __error(void);
#define errno (*__error())
errno codes
...
#define ENOLCK          77              /* No locks available */
#define ENOSYS          78              /* Function not implemented */
...

So we’ll need to check the errors after each call in our case, as each of these calls is a system call underneath.

On Linux we can see that the errno definition is actually a call to a function that returns a pointer: *__errno_location ()

checking errno
MethodHandle __errnoLocationMH = systemCLinker.downcallHandle(
        systemCLinker.lookup("__errno_location").get(),
        FunctionDescriptor.of(ValueLayout.ADDRESS)
);

int errno = ((MemoryAddress) __errnoLocationMH.invoke()) (1)
        .get(ValueLayout.JAVA_INT, 0); (2)
1 Get errno address
2 Read errno value

On Linux, the moreutils package has a tool called errno that can be used to list all the error codes: errno -l.

Additionally, there is a function strerror that returns a string from an error code.

getting the error message
MethodHandle strerror = systemCLinker.downcallHandle(
      systemCLinker.lookup("strerror").get(),
      FunctionDescriptor.of(ValueLayout.ADDRESS, ValueLayout.JAVA_INT)
);

String errmsg = ((MemoryAddress) strerror.invoke(errno)).getUtf8String(0);

So, placing this check after the memfd_secret syscall looked like a good bet. Doing something similar after each call is a good idea as well; it kind of looks like the Go way of checking errors.

memfd_secret error checking
fd = (int) sys_memfd_secret.invoke(0);
if (fd == -1) {
  var errno = errno();
  System.err.println(errno == ENOSYS ?
                     "tried to call a syscall that doesn't exist (errno=ENOSYS), may need to set the 'secretmem.enable=1' kernel boot option" :
                     "syscall memfd_secret failed, errno: " + errno + ", " + strerror(errno));
  return Optional.empty();
}
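Since the same check has to follow every native call, it can be factored into a small helper. The following is only a sketch in plain Java: NativeCalls, check and NativeCallException are made-up names, and the errno and strerror lookups are passed in as functions so that the example runs without Panama. In the real program they would delegate to the __errno_location and strerror method handles shown above.

```java
import java.util.function.IntFunction;
import java.util.function.IntSupplier;

class NativeCalls {
    /** Hypothetical unchecked exception carrying the errno of a failed native call. */
    static final class NativeCallException extends RuntimeException {
        final int errno;

        NativeCallException(String call, int errno, String message) {
            super(call + " failed, errno: " + errno + ", " + message);
            this.errno = errno;
        }
    }

    /** Returns the result untouched, unless it is -1, in which case errno is read and reported. */
    static int check(String call, int result, IntSupplier errno, IntFunction<String> strerror) {
        if (result != -1) {
            return result;
        }
        int code = errno.getAsInt(); // read errno right away, before any other native call
        throw new NativeCallException(call, code, strerror.apply(code));
    }

    public static void main(String[] args) {
        // Stubs standing in for the __errno_location and strerror downcalls.
        IntSupplier errno = () -> 38; // ENOSYS on Linux
        IntFunction<String> strerror = code -> "Function not implemented";

        System.out.println("ftruncate returned " + check("ftruncate", 0, errno, strerror));
        try {
            check("memfd_secret", -1, errno, strerror);
        } catch (NativeCallException e) {
            System.out.println(e.getMessage());
        }
    }
}
```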

While reviewing the memfd_secret commit, we can see there’s a check that returns ENOSYS when a condition is not met.

So in order to make the whole thing work, we need to tackle what’s preventing memfd_secret from working.

Linux bootloader flag

So actually, Linux is gating the memfd_secret syscall behind a flag named secretmem_enable. That may be why memfd_secret is not listed when looking at man 2 syscalls.

It’s not quite clear from the commit that introduced memfd_secret, but in order for it to work, the machine has to be booted with the flag secretmem.enable=1.

DISCLAIMER: I am not responsible if something goes wrong on your machines / OS. The following actually changes the Linux bootloader configuration, and as such, any misconfiguration could make the system non-bootable! Please read and understand the documentation of your system before proceeding.
Enabling this prevents hibernation whenever there are active secret memory users.

My test machine is a Fedora 35, let’s read their page on the GRUB2 bootloader.

From this page, it seems there’s a fairly simple way to change the bootloader configuration.

add secretmem.enable=1 flag
sudo grubby --update-kernel=ALL --args="secretmem.enable=1"
check the configuration
sudo grubby --info=ALL
remove secretmem.enable=1 flag
sudo grubby --update-kernel=ALL --remove-args="secretmem.enable=1"

Notice the actual flag name is secretmem.enable, not secretmem_enable !

Then reboot the OS. Now, if the configuration was properly applied, memfd_secret should return a valid file descriptor.

$ java --add-modules=jdk.incubator.foreign --enable-native-access=ALL-UNNAMED MemfdSecret.java
WARNING: Using incubator modules: jdk.incubator.foreign
warning: using incubating module(s): jdk.incubator.foreign
1 warning
Secret mem fd: 4 (1)
Secret: super secret decryption key
1 memfd_secret here returned the file descriptor 4

Typically, this secret storage could be used to store a decryption key during startup, which is later used to decrypt encrypted payloads. Of course, care must be taken to prevent this data from leaving this memory, which might not be possible under many circumstances, e.g. with a library that takes a Java String, in which case the secret buffer is copied elsewhere in the heap.
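One way to limit the exposure, sketched here in plain Java under the assumption that the consuming code accepts a byte[] rather than a String, is to hand out a short-lived copy and wipe it as soon as it has been used. SecretUse and withSecret are hypothetical names. This is only a mitigation: the garbage collector may still have moved the array around, leaving stale copies in the heap.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.function.Function;

class SecretUse {
    /** Runs the action with the on-heap copy of the secret, then overwrites that copy with zeros. */
    static <T> T withSecret(byte[] secretCopy, Function<byte[], T> action) {
        try {
            return action.apply(secretCopy);
        } finally {
            Arrays.fill(secretCopy, (byte) 0); // wipe the copy, even if the action failed
        }
    }

    public static void main(String[] args) {
        // In the real program this copy would come from secretSegment.toArray(ValueLayout.JAVA_BYTE).
        byte[] copy = "super secret decryption key".getBytes(StandardCharsets.UTF_8);
        int length = withSecret(copy, bytes -> bytes.length);
        System.out.println("used a secret of " + length + " bytes");
        System.out.println("copy wiped: " + Arrays.equals(copy, new byte[copy.length]));
    }
}
```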

Improvements

Trying to replace most panama calls by JDK types

So apart from the memfd_secret syscall, the other calls look replaceable?

MemorySegment.mapFile looks like a good bet to replace mmap.

However, upon first use, things start to look problematic. The signature requires a Path and the mapping is limited to a single MapMode.

MemorySegment::mapFile signature
static MemorySegment mapFile(
        Path path,
        long bytesOffset,
        long bytesSize,
        FileChannel.MapMode mapMode,
        ResourceScope scope
) throws IOException {

Supposing the file descriptor value is 4, even if it were possible to pass /dev/fd/4 or /proc/self/fd/4 as a Path, we could not map this segment as read and write via this API. And performing this operation twice, once in read-only mode and once in write-only mode, would not work, as this special file descriptor is closed after the first memory mapping.

There are some interesting bits in FileOutputStream / FileInputStream: as they can be created from a JDK FileDescriptor, they allow getting the underlying FileChannel, which in turn allows calling map() to get a memory mapping. However, the FileDescriptor class does not have a public constructor taking a file descriptor number, and even being able to hack FileDescriptor (with --add-opens=java.base/java.io=ALL-UNNAMED) is not enough, as we end up in the same situation as above because it’s only possible to have a mapping in read-only or write-only mode.

Basically, we’re stuck with using the mmap native function to do what’s necessary. I don’t know if it is out of scope for JEP-419, or the next JEP-424, but I think it would be a good thing to support a MemorySegment over an arbitrary file descriptor; in particular when writing programs that run on the command line, this could enable things like java Main.java <(cat neko | grep meow).

Finally, I don’t believe there’s something equivalent available in JDK for the ftruncate function.

Improving our syscall API.

In the snippets above, we’ve declared a MethodHandle to the syscall function; if there are multiple syscalls, we’ll need to pass the syscall number as the first argument each time. The MethodHandles API allows creating partial functions.

syscall partial function
var syscallAddress = systemCLinker.lookup("syscall").get();
var syscall = systemCLinker.downcallHandle(
        syscallAddress,
        FunctionDescriptor.of(
                ValueLayout.JAVA_INT,
                ValueLayout.JAVA_INT  (1)
        )
);

var sys_getpid = MethodHandles.insertArguments(syscall, 0, SYS_getpid); (2)
sys_getpid.invoke(); (3)
1 The first argument is the syscall number.
2 Capture the syscall number and create a "partial function".
3 Invocation of the partial function doesn’t need argument 0.
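The partial-function mechanism itself is independent of Panama, so it can be tried with a plain Java method standing in for the native syscall function. In this sketch, fakeSyscall, bindSyscallNumber and call are made-up names; only MethodHandles.insertArguments is the actual JDK API discussed above.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class PartialApplication {
    /** Stand-in for the native syscall function: it simply echoes what it received. */
    static String fakeSyscall(int number, int arg) {
        return "syscall(" + number + ", " + arg + ")";
    }

    /** Binds the first parameter (the syscall number), leaving a handle of type (int)String. */
    static MethodHandle bindSyscallNumber(int number) {
        try {
            MethodHandle generic = MethodHandles.lookup().findStatic(
                    PartialApplication.class,
                    "fakeSyscall",
                    MethodType.methodType(String.class, int.class, int.class));
            return MethodHandles.insertArguments(generic, 0, number);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Invokes the bound handle with the remaining argument only. */
    static String call(int number, int arg) {
        try {
            return (String) bindSyscallNumber(number).invoke(arg);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(bindSyscallNumber(447).type());
        System.out.println(call(447, 0));
    }
}
```

The bound handle no longer exposes the syscall number in its type, which is what makes the call site read like a dedicated function.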

Now if the syscall has a different arity, FunctionDescriptor::appendArgumentLayouts has us covered, so that we can use the basic template of a syscall, sort of, and build on top of this to have specific identifiers for each syscall.

syscall partial function, with added arguments
var sys_memfd_secret = MethodHandles.insertArguments(systemCLinker.downcallHandle(
        systemCLinker.lookup("syscall").get(),
        FunctionDescriptor.of(
                ValueLayout.JAVA_INT,
                ValueLayout.JAVA_INT
        ).appendArgumentLayouts(ValueLayout.JAVA_INT) (1)
), 0, SYS_memfd_secret); (2)

int fd = (int) sys_memfd_secret.invoke(0); (3)
1 Append arguments to the function descriptor.
2 Capture the syscall number and creates a "partial function".
3 Simply invoke the call passing only required arguments on call site.

Other things are possible with MethodHandles that can be handy with Panama, yet they are out of scope for this blog post. Just check the API.
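
As a small taste of what that API offers, here is a minimal sketch of MethodHandles.filterReturnValue, which can turn the C convention of returning -1 on failure into an exception. Everything in it is hypothetical: fakeOpen is a plain Java method standing in for a native handle, and checkResult is an illustrative filter, not something from the Panama API.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class CheckedReturnSketch {
    // Hypothetical stand-in for a native call: "fails" with -1 for negative flags.
    static int fakeOpen(int flags) {
        return flags < 0 ? -1 : 3;
    }

    // Filter applied to the return value: rejects -1, passes anything else through.
    static int checkResult(int result) {
        if (result == -1) {
            throw new IllegalStateException("native call failed");
        }
        return result;
    }

    // Builds a handle that calls fakeOpen, then pipes its result through checkResult.
    public static MethodHandle checkedOpen() throws ReflectiveOperationException {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodType intToInt = MethodType.methodType(int.class, int.class);
        MethodHandle target = lookup.findStatic(CheckedReturnSketch.class, "fakeOpen", intToInt);
        MethodHandle filter = lookup.findStatic(CheckedReturnSketch.class, "checkResult", intToInt);
        return MethodHandles.filterReturnValue(target, filter);
    }

    public static void main(String[] args) throws Throwable {
        MethodHandle open = checkedOpen();
        System.out.println((int) open.invoke(0));
        try {
            int ignored = (int) open.invoke(-1);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With a real downcall handle, such a filter would presumably also be the place to look at errno, which this sketch does not attempt.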

Generating the MethodHandles with jextract

The JDK Panama team also created a tool known as jextract, whose job is to do most of the heavy lifting of generating the MethodHandles.

As mentioned in other blog posts or conference talks I gave, jextract is now a separate tool. At this point there’s no binary release, which means it has to be built. The jextract project page explains how to do this. My test machine is a Fedora, so adapt the commands and the JDK distribution to your needs.

Build jextract
sudo dnf install java-latest-openjdk-jmods.x86_64 (1)
curl -LO https://github.com/llvm/llvm-project/releases/download/llvmorg-14.0.0/clang+llvm-14.0.0-x86_64-linux-gnu-ubuntu-18.04.tar.xz (2)
tar xf clang+llvm-14.0.0-x86_64-linux-gnu-ubuntu-18.04.tar.xz
git clone --depth 1 https://github.com/openjdk/jextract.git
cd jextract
sh ./gradlew \
  -Pjdk18_home=/etc/alternatives/java_sdk_18_openjdk \
  -Pllvm_home=/home/bric3/clang+llvm-14.0.0-x86_64-linux-gnu-ubuntu-18.04/ \
  clean verify (3)
1 At this time, the latest is OpenJDK 18.
2 LLVM 14 for x86_64
3 Run the documented build command with the required home directories.

If everything is alright you can use jextract:

jextract Version
$ build/jextract/bin/jextract --version
WARNING: Using incubator modules: jdk.incubator.foreign
jextract 18.0.1
JDK version 18.0.1+10
clang version 14.0.0

The basic usage is jextract <options> <header file>. Since there are multiple headers, the trick is to specify a handcrafted header that includes every needed header.

memfd_secret_header.h
#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>

It would be really neat if jextract, or even java, could handle any file descriptor; this would come in handy with here-docs.

jextract with here-doc, for multiple headers
$ jextract ... <(cat <<-EOF (1)
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/mman.h>
EOF
)
1 The <() construct (process substitution) expands to a path that is not a file on disk, something like /proc/self/fd/13, whose content is the text between the <<-EOF and EOF markers.

Of course, for now jextract bails out when something like /proc/self/fd/13 is passed, reporting that it is not a file.

Also note that jextract supports passing the options in an options file. So we can pass output options, like the target package and class name, but also the symbols we’d like to include.

memfd_secret_header.jextract.options
--source (1)
--output build/generated/sources/jextract-syscall/java (2)
--target-package linux (3)
--header-class-name syscall_h (4)

--include-function syscall
--include-macro SYS_memfd_secret

--include-function close
--include-function ftruncate

--include-function mmap
--include-function munmap
--include-macro PROT_READ
--include-macro PROT_WRITE
--include-macro MAP_SHARED

--include-function strerror

#### Since errno macro is not supported at this time, it is necessary
#### to manually resolve errno, and include OS specific declarations.
#### e.g. __errno_location is Linux specific
--include-function __errno_location (5)
1 Tells jextract to generate the source code, instead of classes.
2 Specifies the output directory.
3 Specifies the package name.
4 Specifies the class name.
5 Specifies the function used to resolve errno on Linux.

Running jextract
$ jextract @memfd_secret_header.jextract.options memfd_secret_header.h (1)
1 Assuming jextract is in the PATH, or there’s an alias.

So once done, we’ll have a file with all the symbols we need.

syscall_h.java
// ...
public class syscall_h  {
    public static MemoryAddress __errno_location () {
        // ...
    }

    public static long syscall ( long __sysno, Object... x1) {
        // ...
    }

    public static MemoryAddress mmap ( Addressable __addr,  long __len,  int __prot,  int __flags,  int __fd,  long __offset) {
        // ...
    }

    // ...

    public static int SYS_memfd_secret() {
        return (int)447L;
    }
}

What’s nice is that the arguments are named, e.g. __sysno, __addr, __fd, etc.

Once you have done your research on which symbols you need, it’s really nice to let jextract generate the code for you; the result is likely to be up to date, with the best practices baked in.

There’s one thing where this is a bit suboptimal: the __errno_location function is actually an OS-specific function that is used to resolve the errno value. I’m unsure if jextract should resolve macros in general, yet errno seems like something very common, so would it make sense for jextract to handle this one? But then it opens the door to macro resolution, which is a different level of complexity.

That being said, it’s not a deal-breaker, just something to be aware of.

Closing words

memfd_secret

Since I heard about this feature in Linux 5.14, I was hoping to test it after the Spectre-style attacks, at least from a developer perspective. The first thing is that you’ll need a Linux with that version, so forget Docker Desktop for now, as even the latest 4.8 is still using a Linux 5.10 kernel (at least on macOS). Also, deployment-wise, there’s a flag to enable at boot time, which makes it difficult to deploy, in particular on a cloud provider, unless you have control over the bootloader. On your regular laptop, the fact that this feature disables hibernation is almost a deal-breaker for this kind of hardware.

Personally, if an application does not have very tight control over how secrets are actually used, I fail to see the value of such a feature.

JEP-419

Yet again project Panama, embodied by JEP-419 in JDK 18, delivers: it’s possible to interact with the system, with some ease, and without having to deal with different build systems. I have almost nothing relevant to mention here. I missed the possibility of creating a MemorySegment from a file descriptor, but this might be a rare case, especially with the topic at hand. I still find the mandatory use of --enable-native-access=ALL-UNNAMED impractical, especially for an API that is arriving late after alternatives that do not have this enforcement. However, this restriction will be relaxed in JEP-424 in JDK 19 (java/lang/foreign/package-info.java): if this flag is not specified, users will get a warning for the first call to a restricted method (one warning per module).

Yet again, I’m happy to see this part of project Panama landing in the JDK to bridge the gap to the native world without third-party libraries.
