Using Linux's memfd_secret syscall from the JVM with JEP-419
Linux 5.14 brought a new system call, memfd_secret, in order to mitigate speculative-execution attacks by preventing the kernel from being able to peek at memory segments created by this system call.
In this article I will leverage the API introduced (well, still incubating) by JEP-419 (Project Panama). JEP-419 has been delivered as part of JDK 18.
For those that don’t know, Project Panama is a project that aims to provide an easy, secure and efficient way to call native methods from Java. You can look at my previous articles on the earlier versions of the project, articles on foojay.io by Carl Dea, or those on inside.java.
The Project Panama APIs are the fruit of work started in 2014, from ideas even older. The project is still evolving: the next iteration (JEP-424) in JDK 19 promotes these APIs to preview, but API and behavior adjustments are still likely.
The following examples are based on JDK 18 2022-03-22, build 18+37 from Red Hat. The Linux distribution is Fedora release 35 with the kernel 5.16.20-200.fc35.x86_64.
While this article focuses on Linux, the same concepts apply to other OSes (and CPUs, as we’ll see). Also, this article is not an introduction to JEP-419, which is covered in other material. This post therefore assumes the right compilation and run flags for JEP-419, usually --add-modules=jdk.incubator.foreign, --enable-native-access=ALL-UNNAMED.
Now to know more about the whole deal, read on.
What is a system call (syscall) ?
Before jumping to memfd_secret, let’s first understand how to make a system call.
And even before that, let’s see what a system call is.
Those not interested in this part can jump to the memfd_secret section.
In order to do something useful a program has to interact with some resources: memory, disk, network, terminal, etc. On a computer, these resources are handled by a very complex and critical piece of software, the Operating System.
In order to use these resources, a program has to make system calls like read, wait, write, exit, etc. The standard malloc, the native allocator, actually has to place a request to the OS to get memory via an mmap syscall.
As expected the JVM does plenty of syscalls too, e.g. when logging something on stdout or persisting a (unified) log file.
Essentially, a system call is a way of requesting the kernel to do something for the program.
Why do system calls have to be in the kernel and not in user space, like a standard library? As mentioned earlier, system calls are a way to interact with, or involve, a resource such as devices, the file system, the network, processes, etc. These resources are managed by privileged software: the OS, or kernel.
When a system call happens, the program doesn’t simply invoke a method whose code resides at some address; a system call actually makes the CPU switch to kernel mode, because the kernel is privileged software.
Most modern processors have a security model that limits the scope of what a program can do. In particular, on Intel-based CPUs the model is known as processor protection rings (or hierarchical protection domains).
It seems that Rings 1 and 2 are rarely used because paging (the way the OS handles memory, see my blog post on [off-heap memory]) only has the concept of privileged and unprivileged, which minimizes the actual benefit of those rings, according to Evan Teran's answer on SO.
When a processor executes some code (in a thread), it knows the current mode, and this way it is able to gate memory accesses, e.g. a Ring 3 user-land program cannot access memory from Ring 0 (the kernel). This is yet another feature of the virtual memory abstraction. The processor can also restrict some instructions and registers to the software running in Ring 0.
Out of scope: there are even negative rings on some CPU architectures, for hypervisors or CPU system management, down to Ring -3.
Restrictions are enforced by the CPU, so in order to fulfil its purpose a user-land program needs to place a request to the kernel. This mechanism is the syscall, and it allows the transition between rings.
During mode switches a lot happens: saving and restoring registers, putting the CPU in a specific mode (user vs kernel), etc. And of course the reverse is done once the request has been handled, with either success or failure.
Privilege context switches are sufficiently costly that most libraries try to avoid them. For example, reading 8 KiB at a time instead of 256 bytes is a good idea, as it drastically reduces the number of syscalls and, as such, of mode switches.
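To give a feel for the numbers, here is a minimal sketch that counts how many read requests are needed to drain 1 MiB with a 256-byte buffer versus an 8 KiB one. The class and method names (ReadSizeDemo, CountingStream, countReads) are mine, and the stream is an in-memory one, so this only counts read requests as a stand-in for read syscalls; on a real unbuffered file stream each request would typically end up as one syscall.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class: counts read requests as a proxy for read(2) syscalls.
final class ReadSizeDemo {

    // Wraps a stream and counts every read request made against it.
    static final class CountingStream extends InputStream {
        private final InputStream delegate;
        int requests;

        CountingStream(InputStream delegate) {
            this.delegate = delegate;
        }

        @Override
        public int read() throws IOException {
            requests++;
            return delegate.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            requests++;
            return delegate.read(b, off, len);
        }
    }

    // Drains totalBytes of in-memory data using a buffer of bufferSize bytes
    // and returns the number of read requests issued (including the last one,
    // which only reports the end of the stream).
    static int countReads(int totalBytes, int bufferSize) {
        var in = new CountingStream(new ByteArrayInputStream(new byte[totalBytes]));
        var buffer = new byte[bufferSize];
        try {
            while (in.read(buffer, 0, buffer.length) != -1) {
                // the data is discarded, only the number of requests matters
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return in.requests;
    }

    public static void main(String[] args) {
        int oneMiB = 1 << 20;
        System.out.println("256 B buffer: " + countReads(oneMiB, 256) + " read requests");
        System.out.println("8 KiB buffer: " + countReads(oneMiB, 8192) + " read requests");
    }
}
```

The ratio between the two counts is what matters here: a buffer 32 times larger means roughly 32 times fewer round trips to the kernel.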
What does the documentation say about syscalls?
Now let’s get practical.
Looking at man 2 syscall, the manpage sheds some light on how to make the call, specifically in the Architecture calling conventions section. Those details are in assembly, e.g.
- processor interrupt 0x80 for i386 processors (32 bits), then specific registers
- syscall instruction for x86_64 processors (64 bits), then specific registers
The calling conventions of other architectures are also described, e.g. on ARM processors the system call is performed by a swi 0x0 instruction, on aarch64 by svc #0.
People not aware of what exactly a calling convention is should read at least this wikipedia article on x86 calling convention. In short, a calling convention defines how and where parameters should be placed in order to call the code: how parameters are passed (registers and/or stack), how values are returned, etc.
This manual page also points out an important difference with regular functions: while we look up system calls by their names (write, read, execve, exit, mmap, memfd_create, etc.), programs and the kernel actually know them by numbers.
Why numbers? The reason is that syscalls are like messages that are passed down, and these numbers are somewhat like enum ordinals indicating the type of message. These numbers are part of the syscall ABI (Application Binary Interface) and as such they are stable for a given CPU architecture, although unbounded (new syscalls can be added).
Outside of this scope: not all syscalls are made equal nowadays. Some syscalls, usually the most used ones, are exported into user space memory to avoid the cost of switching to kernel mode. In practice, the vDSO (virtual dynamic shared object) is like a library: it is loaded in memory so that it can be accessed from the program memory (glibc knows about this memory region and will use it). The region is visible with pmap -X {pid}.
To read more about it, see the relevant manual page, which lists the exported symbols, e.g. __vdso_clock_gettime.
The syscall numbers are different between architectures! On Linux one can look at their definition in the /include/asm-/unistd-.h files.
From the syscall manpage, the Intel CPUs syscall calling convention on x86_64 (64 bits) is:
- Set the registers
  - rax ← System call number
  - rdi ← First argument
  - rsi ← Second argument
  - rdx ← Third argument
- Make the syscall
  - execute the syscall processor instruction
The actual syscall numbers (for 64-bit programs) are usually defined in /usr/include/asm/unistd_64.h.
And on i386 (32 bits):
- Set the registers
  - eax ← System call number
  - ebx ← First argument
  - ecx ← Second argument
  - edx ← Third argument
- Make the syscall
  - raise the processor interrupt int 0x80
The actual syscall numbers (for 32-bit programs) are usually defined in /usr/include/asm/unistd_32.h.
My first syscall
In order to quickly practice a syscall, let’s do a very simple hello world. The example will be in assembler, I promise this is the only source snippet in assembly and after that I’ll be back with Java and Panama.
Hello world on x86_64 (syscall numbers from /usr/include/asm/unistd_64.h, syscall instruction)
global _start ; define entrypoint
section .text
_start:
mov rax, 0x1 ; syscall number for write (1)
mov rdi, 0x1 ; int fd (2)
mov rsi, msg ; const void* buf
mov rdx, mlen ; size_t count
syscall ; make the call (3)
mov rax, 0x3c ; syscall number for exit (1)
mov rdi, 0x1 ; int status (2)
syscall ; make the call (3)
section .rodata
msg: db "Hello Linux syscalls!",0x0a, 0x0d ; message string, terminated by a new line (0A, 0D)
mlen: equ $-msg ; calculate the length of the message
1 | At this point the register holds the selected syscall (a number). Note the number comes from /usr/include/asm/unistd_64.h . |
2 | Syscall arguments are placed in the next registers. |
3 | Make the syscall with the syscall instruction. |
nasm -w+all -f elf64 -o hello_syscall.o hello_syscall.asm (1)
ld -o hello_syscall hello_syscall.o
./hello_syscall
1 | Note the elf64 format for 64 bits. |
Hello world on i386 (syscall numbers from /usr/include/asm/unistd_32.h, int 0x80 interrupt)
global _start ; define entrypoint
section .text
_start:
mov eax, 4 ; syscall number: write (1)
mov ebx, 1 ; stdout (2)
mov ecx, str ; buffer address
mov edx, str_len ; buffer length
int 0x80 ; make the call (3)
mov eax, 1 ; syscall number: exit (1)
mov ebx, 0 ; exit status (2)
int 0x80 ; make the call (3)
section .rodata
str: db "Hello Linux!", 0Ah ; message string, terminated by a new line (0A)
str_len: equ $ - str ; calculate the length of the message
1 | At this point the register holds the selected syscall (a number). Note the number comes from /usr/include/asm/unistd_32.h . |
2 | Syscall arguments are placed in the next registers. |
3 | Make the syscall with interrupt 0x80 . |
nasm -w+all -f elf32 -o hello_syscall_via_int80.o hello_syscall_via_int80.asm (1)
ld -m elf_i386 -o hello_syscall_via_int80 hello_syscall_via_int80.o (2)
./hello_syscall_via_int80
1 | Note the elf32 format for 32 bits. |
2 | Note the linker emulation option for i386 |
When looking at this very simplistic code, something immediately stands out: from the application’s point of view (user land), a syscall is just like an atomic pseudo machine instruction. I believe this example is more striking than the figure above on syscall ring transitions.
We saw what exactly a syscall is and how to make one using assembly. In general though, it’s rare to invoke a syscall directly, as the standard library exposes wrappers that handle everything for most syscalls.
Because the memfd_secret syscall has been introduced recently, there’s no wrapper function in the standard library, hence we’ll need to make the system call ourselves.
Making syscalls from the JVM
The work of the Panama project doesn’t allow us to directly write assembly code and execute it. Fortunately!
And the libc already exposes a syscall function that takes care of the calling convention as mentioned in man 2 syscall, i.e. it will place the arguments in the right CPU registers.
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(int argc, char *argv[])
{
    pid_t pid;
    pid = syscall(SYS_getpid);           /* generic libc entry point for any syscall */
    printf("pid: %ld\n", (long) pid);
    return 0;
}
So, basically, to make a syscall using JEP-419, I only have to perform a lookup for the syscall function; also, since it’s part of the standard libc, this just needs CLinker.systemCLinker().
/*
On linux (Intel x86_64) in
- /usr/include/asm/unistd_64.h
#define __NR_getpid 39
On macOs (Intel x86_64) in either :
- /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/sys/syscall.h
- /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/syscall.h
#define SYS_getpid 20
*/
final static int SYS_getpid = 39; // Linux x86_64 value, 20 is the macOS one (1)
MethodHandle syscall = systemCLinker.downcallHandle(
systemCLinker.lookup("syscall").get(),
FunctionDescriptor.of(
ValueLayout.JAVA_INT, (2)
ValueLayout.JAVA_INT (3)
)
);
int pid = (int) syscall.invoke(SYS_getpid); (4)
System.out.println("pid: " + pid);
1 | The syscall number. |
2 | The return type of the syscall function. |
3 | The first argument is the syscall number. |
4 | Making the syscall. |
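Since the syscall number depends on the OS, it has to be chosen carefully. The following is a small sketch, not part of the original example, that picks the getpid number from the two header values quoted above and prints the pid as seen by the JDK itself via ProcessHandle, which is the value a direct getpid syscall is expected to return. The class and method names are hypothetical, and the numbers are only valid for x86_64.

```java
import java.util.Locale;

// Hypothetical helper: selects the getpid syscall number for the current OS.
// The two values come from the header comments quoted in the snippet above
// and only apply to x86_64.
final class GetpidNumber {
    static final int LINUX_X86_64_GETPID = 39; // __NR_getpid in asm/unistd_64.h
    static final int MACOS_X86_64_GETPID = 20; // SYS_getpid in sys/syscall.h

    static int getpidNumberFor(String osName) {
        var os = osName.toLowerCase(Locale.ROOT);
        if (os.contains("linux")) {
            return LINUX_X86_64_GETPID;
        }
        if (os.contains("mac")) {
            return MACOS_X86_64_GETPID;
        }
        throw new UnsupportedOperationException("no getpid number known for " + osName);
    }

    public static void main(String[] args) {
        var osName = System.getProperty("os.name");
        try {
            System.out.println("getpid syscall number on " + osName + ": " + getpidNumberFor(osName));
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
        // The pid known to the JDK, to compare with the result of the direct syscall.
        System.out.println("pid according to ProcessHandle: " + ProcessHandle.current().pid());
    }
}
```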
That’s it, we’ve made our first direct syscall using Panama (and JEP-419).
Simple, right? Let’s try to use that knowledge for the memfd_secret syscall.
memfd_secret
The memfd_secret syscall was introduced in this commit. Fortunately Linux has good commit messages, so we can read and learn more about how to create "secret" memory areas.
The following example demonstrates creation of a secret mapping (error handling is omitted):
fd = memfd_secret(0);
ftruncate(fd, MAP_SIZE);
ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
Basically we need to create the secret file descriptor, truncate it to the desired size, and then memory map it.
- First, get a file descriptor with memfd_secret
memfd_secret syscall
/*
  On linux (Intel x86_64) in /usr/include/asm/unistd_64.h
  #define __NR_memfd_secret 447
*/
final static int SYS_memfd_secret = 447; (1)
MethodHandle syscall = systemCLinker.downcallHandle(
    systemCLinker.lookup("syscall").get(),
    FunctionDescriptor.of(
        ValueLayout.JAVA_INT, (2)
        ValueLayout.JAVA_INT, (3)
        ValueLayout.JAVA_INT  (4)
    )
);
int secret_fd = (int) syscall.invoke(SYS_memfd_secret, 0); (5)
1 | The memfd_secret number. |
2 | The return type of the syscall function. |
3 | The first argument is the syscall number. |
4 | The flags passed to memfd_secret; currently the only supported flag is O_CLOEXEC according to this LWN article by Jonathan Corbet. |
5 | Making the syscall without any flags; the returned value is a file descriptor. |
We can proceed with the rest of the process.
- Then set the desired size
// int ftruncate(int fd, off_t length);
MethodHandle ftruncate = systemCLinker.downcallHandle(
    systemCLinker.lookup("ftruncate").get(),
    FunctionDescriptor.of(
        ValueLayout.JAVA_INT,
        ValueLayout.JAVA_INT, // fd
        ValueLayout.JAVA_LONG // length
    )
);
var res = (int) ftruncate.invoke( (1)
    secret_fd,
    secret.length()
);
1 | Invoke ftruncate from the libc on the file descriptor with the wanted size. |
- Finally, memory map this file descriptor. This operation has the effect of unmapping this memory segment from the kernel pages (in Ring 0), so only the user process can read these memory pages.
// in /usr/include/bits/mman-linux.h
// #define PROT_READ 0x1 /* Page can be read. */
// #define PROT_WRITE 0x2 /* Page can be written. */
final int PROT_READ = 1;
final int PROT_WRITE = 2;
// #define MAP_SHARED 0x01 /* Share changes. */
final int MAP_SHARED = 1;
// in /usr/include/sys/mman.h
// extern void *mmap (void *__addr, size_t __len, int __prot,
//                    int __flags, int __fd, __off_t __offset) __THROW;
MethodHandle mmap = systemCLinker.downcallHandle(
    systemCLinker.lookup("mmap").get(),
    FunctionDescriptor.of(
        ValueLayout.ADDRESS,   // returned address
        ValueLayout.ADDRESS,   // addr
        ValueLayout.JAVA_LONG, // size
        ValueLayout.JAVA_INT,  // protection modes
        ValueLayout.JAVA_INT,  // flags
        ValueLayout.JAVA_INT,  // fd
        ValueLayout.JAVA_LONG  // offset
    )
);
var segmentAddress = (MemoryAddress) mmap.invoke( (1)
    MemoryAddress.NULL,
    secret.length(),
    PROT_READ | PROT_WRITE,
    MAP_SHARED,
    secret_fd,
    0
);
1 | Memory-map the file descriptor, using the same wanted size, the right protection modes (read & write), and flags. |
- Once the memory segment is mapped, we can actually get access to it via the MemorySegment API.
var secretSegment = MemorySegment.ofAddress(segmentAddress, length, scope); (1)
secretSegment.copyFrom(MemorySegment.ofArray(secretBytes)); (2)
var roSecretSegment = secretSegment.asReadOnly(); (3)
1 | Create a MemorySegment from the memory segment address, also using the same size, and the current ResourceScope. |
2 | Since secretSegment is actually an off-heap MemorySegment, the secret array has to be transformed first into an on-heap MemorySegment before being copied to the secret memory mapping. |
3 | Eventually make the segment read-only. |
And to read the secret, just extract the byte array from the memory segment.
var bytes = secretSegment.toArray(ValueLayout.JAVA_BYTE);
With this you have a complete working example of how to use memfd_secret from Java using Panama (JEP-419).
…or not!
Indeed, running this will make the JVM seg-fault!
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f561919ffd7, pid=4798, tid=4799
#
# JRE version: OpenJDK Runtime Environment 22.3 (18.0+37) (build 18+37)
# Java VM: OpenJDK 64-Bit Server VM 22.3 (18+37, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# v ~StubRoutines::jbyte_disjoint_arraycopy
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /home/bob/opensource/core.4798)
#
# An error report file with more information is saved as:
# /home/bob/opensource/hs_err_pid4798.log
#
# If you would like to submit a bug report, please visit:
# https://bugzilla.redhat.com/enter_bug.cgi?product=Fedora&component=java-latest-openjdk&version=35
#
So, what happened? The problematic frame isn’t helpful if you’re not familiar with JVM internals.
Opening hs_err_pid4798.log is more helpful.
...
Stack: [0x00007f734ae3d000,0x00007f734af3e000], sp=0x00007f734af3c430, free space=1021k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
v ~StubRoutines::jbyte_disjoint_arraycopy
V [libjvm.so+0xe66d70] Unsafe_CopyMemory0+0xd0
j jdk.internal.misc.Unsafe.copyMemory0(Ljava/lang/Object;JLjava/lang/Object;JJ)V+0
j jdk.internal.misc.Unsafe.copyMemory(Ljava/lang/Object;JLjava/lang/Object;JJ)V+29
j jdk.internal.misc.ScopedMemoryAccess.copyMemoryInternal(Ljdk/internal/misc/ScopedMemoryAccess$Scope;Ljdk/internal/misc/ScopedMemoryAccess$Scope;Ljava/lang/Object;JLjava/lang/Object;JJ)V+32
j jdk.internal.misc.ScopedMemoryAccess.copyMemory(Ljdk/internal/misc/ScopedMemoryAccess$Scope;Ljdk/internal/misc/ScopedMemoryAccess$Scope;Ljava/lang/Object;JLjava/lang/Object;JJ)V+12
j jdk.incubator.foreign.MemorySegment.copy(Ljdk/incubator/foreign/MemorySegment;Ljdk/incubator/foreign/ValueLayout;JLjdk/incubator/foreign/MemorySegment;Ljdk/incubator/foreign/ValueLayout;JJ)V+202
j jdk.incubator.foreign.MemorySegment.copy(Ljdk/incubator/foreign/MemorySegment;JLjdk/incubator/foreign/MemorySegment;JJ)V+13
j jdk.incubator.foreign.MemorySegment.copyFrom(Ljdk/incubator/foreign/MemorySegment;)Ljdk/incubator/foreign/MemorySegment;+10 (1)
j io.github.bric3.panama.f.syscalls.LinuxSyscall.memfd_secret_external()V+48
j io.github.bric3.panama.f.syscalls.LinuxSyscall.main([Ljava/lang/String;)V+99
v ~StubRoutines::call_stub
V [libjvm.so+0x81420a] JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x30a
V [libjvm.so+0x8a2111] jni_invoke_static(JNIEnv_*, JavaValue*, _jobject*, JNICallType, _jmethodID*, JNI_ArgumentPusher*, JavaThread*) [clone .isra.174] [clone .constprop.397]+0x351
V [libjvm.so+0x8a4a05] jni_CallStaticVoidMethod+0x145
C [libjli.so+0x47a9] JavaMain+0xd19
C [libjli.so+0x7d69] ThreadJavaMain+0x9
...
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0xffffffffffffffff (2)
...
1 | This happened while doing the MemorySegment::copyFrom call. |
2 | Moreover, the segmentation fault appears to have been caused by a memory access to a non-mapped memory address, SEGV_MAPERR . The other most common reason for a segfault is SEGV_ACCERR , which is caused by accessing a memory address with the wrong permissions. |
So what happened? Actually, the value of the file descriptor was -1, which of course is not a valid file descriptor. Also, the call to ftruncate seems to cope with the case where the file descriptor is not valid. The call to mmap on this file descriptor also returns -1, which is supposed to be the memory segment address.
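This also explains the si_addr value in the crash log: mmap reports failure by returning MAP_FAILED, which is defined as (void *) -1, and -1 read as a 64-bit address is 0xffffffffffffffff. Here is a tiny sketch to convince oneself; the class and method names are mine, and in the JEP-419 code the raw value would presumably come from MemoryAddress::toRawLongValue.

```java
// Hypothetical demo class showing how -1 looks when interpreted as an address.
final class MinusOneAddress {
    // mmap signals failure by returning MAP_FAILED, defined as (void *) -1.
    static final long MAP_FAILED = -1L;

    // Formats a raw 64-bit address the way it appears in the crash log.
    static String asHexAddress(long rawAddress) {
        return "0x" + Long.toHexString(rawAddress);
    }

    // True when the raw address returned by mmap is the failure marker.
    static boolean isMapFailed(long rawAddress) {
        return rawAddress == MAP_FAILED;
    }

    public static void main(String[] args) {
        System.out.println("MAP_FAILED as an address: " + asHexAddress(MAP_FAILED));
        System.out.println("is mapping failed? " + isMapFailed(MAP_FAILED));
    }
}
```

Checking the address returned by mmap against this marker before wrapping it in a MemorySegment would have turned the crash into a regular error.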
So why did this happen? When invoking native methods, syscalls in particular, one needs to be aware of the error handling conventions of these methods.
errno
Indeed, when developing in C/C++, when something returns -1, it usually means that something went wrong, and that the result is invalid.
Moreover, errno is a global variable (in practice one per thread) that is set by system calls and some library functions, see the relevant man 3 errno.
Because it is a global variable its declaration depends on the system.
errno (Linux)
- /usr/include/asm-generic/errno.h
- /usr/include/asm-generic/errno-base.h
extern int *__errno_location (void) __THROW __attribute_const__;
# define errno (*__errno_location ())
...
/*
* This error code is special: arch syscall entry code will return
* -ENOSYS if users try to call a syscall that doesn't exist. To keep
* failures of syscalls that really do exist distinguishable from
* failures due to attempts to use a nonexistent syscall, syscall
* implementations should refrain from returning -ENOSYS.
*/
#define ENOSYS 38 /* Invalid system call number */
...
errno (macOS)
- /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/sys/errno.h
- /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/errno.h
extern int * __error(void);
#define errno (*__error())
...
#define ENOLCK 77 /* No locks available */
#define ENOSYS 78 /* Function not implemented */
...
So we’ll need to check for errors after each call in our case, as each of these calls is a system call underneath.
On Linux we can see that the errno definition is actually a call to a function that returns a pointer: *__errno_location ()
errno
MethodHandle __errnoLocationMH = systemCLinker.downcallHandle(
systemCLinker.lookup("__errno_location").get(),
FunctionDescriptor.of(ValueLayout.ADDRESS)
);
int errno = ((MemoryAddress) __errnoLocationMH.invoke()) (1)
.get(ValueLayout.JAVA_INT, 0); (2)
1 | Get errno address |
2 | Read errno value |
On Linux, the moreutils package has a tool called errno that can be used to list all the error codes with errno -l.
Additionally, there is a function strerror that returns a string from an error code.
MethodHandle strerror = systemCLinker.downcallHandle(
systemCLinker.lookup("strerror").get(),
FunctionDescriptor.of(ValueLayout.ADDRESS, ValueLayout.JAVA_INT)
);
String errmsg = ((MemoryAddress) strerror.invoke(errno)).getUtf8String(0);
So, placing this check after the memfd_secret syscall looked like a good bet. Eventually, doing something similar after each call is a good idea as well; it kinda looks like the Go way of checking errors.
fd = (int) sys_memfd_secret.invoke(0);
if (fd == -1) {
var errno = errno();
System.err.println(errno == ENOSYS ?
"tried to call a syscall that doesn't exist (errno=ENOSYS), may need to set the 'secretmem.enable=1' kernel boot option" :
"syscall memfd_secret failed, errno: " + errno + ", " + strerror(errno));
return Optional.empty();
}
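Since the same check has to be repeated after each call, it can be factored out. The following is a minimal sketch of such a helper in plain Java: the names (SyscallResults, SyscallException, check) are hypothetical, and the errno and strerror lookups are passed in as functions so that the sketch stays independent of the Panama handles shown above.

```java
import java.util.function.IntFunction;
import java.util.function.IntSupplier;

// Hypothetical helper turning the C "-1 means error, look at errno" convention
// into a Java exception.
final class SyscallResults {

    static final class SyscallException extends RuntimeException {
        final int errno;

        SyscallException(String name, int errno, String message) {
            super(name + " failed, errno: " + errno + ", " + message);
            this.errno = errno;
        }
    }

    // Returns the result unchanged unless it is -1, in which case errno is read
    // right away (before any other native call can overwrite it) and reported.
    static int check(String name, int result, IntSupplier errno, IntFunction<String> strerror) {
        if (result != -1) {
            return result;
        }
        int code = errno.getAsInt();
        throw new SyscallException(name, code, strerror.apply(code));
    }

    public static void main(String[] args) {
        // Stand-ins for the real errno() and strerror(int) lookups.
        IntSupplier fakeErrno = () -> 38; // ENOSYS on Linux
        IntFunction<String> fakeStrerror = code -> "error message for " + code;

        System.out.println("fd: " + check("memfd_secret", 4, fakeErrno, fakeStrerror));
        try {
            check("memfd_secret", -1, fakeErrno, fakeStrerror);
        } catch (SyscallException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the real handles, the call site would look like check("ftruncate", res, () -> errno(), code -> strerror(code)), assuming errno() and strerror(int) are the small wrappers used in the snippet above.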
While reviewing the memfd_secret commit, we can see there’s a check that returns ENOSYS when a condition is not met.
So in order to make the whole thing work, we need to tackle what’s preventing memfd_secret from working.
Linux bootloader flag
So actually, Linux is gating the memfd_secret syscall behind a flag named secretmem_enable. That may be why memfd_secret is not listed when looking at man 2 syscalls.
It’s not quite clear from the commit that introduced memfd_secret, but in order to work, the machine boot has to be configured with the flag secretmem.enable=1.
DISCLAIMER: I am not responsible if something goes wrong on your machines / OS. The following actually changes the Linux bootloader configuration, and as such, any misconfiguration could make the system non-bootable! Please read and understand the documentation of your system before proceeding.
Enabling this prevents hibernation whenever there are active secret memory users.
My test machine is a Fedora 35, let’s read their page on the GRUB2 bootloader.
From this page, it seems there’s a fairly simple way to change the bootloader configuration.
Add the secretmem.enable=1 flag
sudo grubby --update-kernel=ALL --args="secretmem.enable=1"
sudo grubby --info=ALL
Remove the secretmem.enable=1 flag
sudo grubby --update-kernel=ALL --remove-args="secretmem.enable=1"
Notice the actual flag name is secretmem.enable, not secretmem_enable!
Then reboot the OS. Now if the configuration was properly applied, memfd_secret should return a valid file descriptor.
$ java --add-modules=jdk.incubator.foreign --enable-native-access=ALL-UNNAMED MemfdSecret.java
WARNING: Using incubator modules: jdk.incubator.foreign
warning: using incubating module(s): jdk.incubator.foreign
1 warning
Secret mem fd: 4 (1)
Secret: super secret decryption key
1 | memfd_secret here returned the file descriptor 4 |
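If the syscall still fails with ENOSYS after the reboot, it is worth checking that the flag really made it to the kernel command line, which Linux exposes in /proc/cmdline. Below is a small sketch doing that from Java; the class and method names are hypothetical, and it only looks for the exact secretmem.enable=1 token.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper: checks whether a boot flag is present on the kernel command line.
final class BootFlagCheck {

    // The kernel command line is a whitespace separated list of tokens,
    // the flag has to match one of them exactly.
    static boolean hasBootFlag(String cmdline, String flag) {
        return Arrays.asList(cmdline.trim().split("\\s+")).contains(flag);
    }

    public static void main(String[] args) {
        var cmdlineFile = Path.of("/proc/cmdline"); // Linux only
        if (!Files.isReadable(cmdlineFile)) {
            System.out.println(cmdlineFile + " is not readable, is this Linux?");
            return;
        }
        try {
            var cmdline = Files.readString(cmdlineFile);
            System.out.println("secretmem.enable=1 present: "
                    + hasBootFlag(cmdline, "secretmem.enable=1"));
        } catch (IOException e) {
            System.out.println("could not read " + cmdlineFile + ": " + e.getMessage());
        }
    }
}
```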
Typically, this secret storage could be used to store a decryption key during startup, which is then used to decrypt encrypted payloads. Of course, care must be taken to prevent this data from leaving this memory, which might not be possible under many circumstances, e.g. with a library that takes a Java String, in which case the secret is copied elsewhere on the heap.
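One way to limit the exposure, sketched below under my own assumptions, is to only ever hand out a temporary byte[] copy of the secret and to zero it as soon as the consumer is done, never going through a String. The names (SecretHygiene, useAndWipe) are hypothetical; in the real code the copy would come from secretSegment.toArray(ValueLayout.JAVA_BYTE). Note this does not give any guarantee: the GC may have moved, and thus duplicated, the array in the meantime.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.function.Function;

// Hypothetical helper: lets a consumer use a temporary copy of the secret,
// then wipes that copy whatever happens.
final class SecretHygiene {

    static <T> T useAndWipe(byte[] secretCopy, Function<byte[], T> action) {
        try {
            return action.apply(secretCopy);
        } finally {
            Arrays.fill(secretCopy, (byte) 0); // overwrite the on-heap copy
        }
    }

    public static void main(String[] args) {
        // Stand-in for the copy extracted from the secret memory segment.
        byte[] copy = "super secret decryption key".getBytes(StandardCharsets.UTF_8);

        int length = useAndWipe(copy, bytes -> bytes.length);

        System.out.println("secret length: " + length);
        System.out.println("copy after use: " + Arrays.toString(copy));
    }
}
```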
Improvements
Trying to replace most panama calls by JDK types
So, apart from the memfd_secret syscall, the other calls look replaceable?
MemorySegment.mapFile looks like a good bet to replace mmap.
However, upon first use, things start to look problematic. The signature requires a Path and the mapping is limited to a single MapMode.
MemorySegment::mapFile signature
static MemorySegment mapFile(
Path path,
long bytesOffset,
long bytesSize,
FileChannel.MapMode mapMode,
ResourceScope scope
) throws IOException {
Supposing the file descriptor value is 4, even if it were possible to pass /dev/fd/4 or /proc/self/fd/4 as a Path, we could not map this segment as read and write via this API.
And performing this operation twice, once in read-only mode and once in write-only mode, would not work as this special file descriptor is closed after the first memory mapping.
There are some interesting bits in FileOutputStream / FileInputStream, as they can be created from a JDK FileDescriptor and they give access to the underlying FileChannel, which in turn allows calling map() to get a memory mapping. However, the FileDescriptor class does not have a public constructor that takes the descriptor value, and even hacking FileDescriptor (with --add-opens=java.base/java.io=ALL-UNNAMED) is not enough, as we end up in the same situation as above because it’s only possible to have a mapping in read-only or write-only mode.
Basically, we’re stuck with using the mmap native function to do what’s necessary. I don’t know if it is out of scope for JEP-419, or the next JEP-424, but I think it would be a good thing to support a MemorySegment over an arbitrary file descriptor. In particular, when writing programs that run on the command line, this could enable things like java Main.java <(cat neko | grep meow).
Finally, I don’t believe there’s something equivalent available in the JDK for the ftruncate function.
Improving our syscall API.
In the snippet above, we’ve declared a MethodHandle to the syscall function; if there are multiple syscalls, we’ll need to pass the syscall number as the first argument each time. The MethodHandles API allows creating a partial function.
var syscallAddress = systemCLinker.lookup("syscall").get();
var syscall = systemCLinker.downcallHandle(
syscallAddress,
FunctionDescriptor.of(
ValueLayout.JAVA_INT,
ValueLayout.JAVA_INT (1)
)
);
var sys_getpid = MethodHandles.insertArguments(syscall, 0, SYS_getpid); (2)
sys_getpid.invoke(); (3)
1 | The first argument is the syscall number. |
2 | Capture the syscall number and create a "partial function". |
3 | Invocation of the partial function doesn’t need argument 0. |
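MethodHandles.insertArguments is not specific to Panama, so the mechanism can be tried without any native access. In this sketch the native syscall function is replaced by a plain static Java method, fakeSyscall, which is purely a stand-in of mine; the rest is the regular java.lang.invoke API.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Hypothetical demo of partial application with method handles, no native code involved.
final class PartialApplicationDemo {

    // Stand-in for the native syscall function: syscall(number, flags).
    static String fakeSyscall(int number, int flags) {
        return "syscall(" + number + ", " + flags + ")";
    }

    // Looks up fakeSyscall and binds its first argument to the given syscall number,
    // the resulting handle only takes the remaining argument.
    static MethodHandle bindSyscallNumber(int number) throws ReflectiveOperationException {
        MethodHandle generic = MethodHandles.lookup().findStatic(
                PartialApplicationDemo.class,
                "fakeSyscall",
                MethodType.methodType(String.class, int.class, int.class));
        return MethodHandles.insertArguments(generic, 0, number);
    }

    // Convenience wrapper that binds the number, invokes the partial function
    // and converts any failure to an unchecked exception.
    static String callBound(int number, int flags) {
        try {
            return (String) bindSyscallNumber(number).invoke(flags);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) throws Throwable {
        MethodHandle sysMemfdSecret = bindSyscallNumber(447);
        System.out.println("bound handle type: " + sysMemfdSecret.type());
        System.out.println((String) sysMemfdSecret.invoke(0));
    }
}
```

The type of the bound handle no longer contains the first int parameter, which is exactly what makes the call site of sys_getpid above argument-free.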
Now if the syscall has a different arity, FunctionDescriptor::appendArgumentLayouts has us covered, so that we can use the basic template of a syscall, sort of, and build on top of this to have specific identifiers for each syscall.
var sys_memfd_secret = MethodHandles.insertArguments(systemCLinker.downcallHandle(
systemCLinker.lookup("syscall").get(),
FunctionDescriptor.of(
ValueLayout.JAVA_INT,
ValueLayout.JAVA_INT
).appendArgumentLayouts(ValueLayout.JAVA_INT) (1)
), 0, SYS_memfd_secret); (2)
int fd = (int) sys_memfd_secret.invoke(0); (3)
1 | Append arguments to the function descriptor. |
2 | Capture the syscall number and creates a "partial function". |
3 | Simply invoke the call passing only required arguments on call site. |
Other things are possible with MethodHandles that can be handy with Panama, yet out of scope for this blog post. Just check the API.
Generating the MethodHandles with jextract
The JDK Panama team also created a tool known as jextract, whose job is to take over most of the work of generating the MethodHandles.
As mentioned in other blog posts or conference talks I gave, jextract is now a separate tool. At this point there’s no binary release, which means it has to be built. The jextract project page explains how to do this. My test machine is a Fedora, so adapt the command and the JDK distribution to your needs.
sudo dnf install java-latest-openjdk-jmods.x86_64 (1)
curl -LO https://github.com/llvm/llvm-project/releases/download/llvmorg-14.0.0/clang+llvm-14.0.0-x86_64-linux-gnu-ubuntu-18.04.tar.xz (2)
tar xf clang+llvm-14.0.0-x86_64-linux-gnu-ubuntu-18.04.tar.xz
git clone --depth 1 https://github.com/openjdk/jextract.git
cd jextract
sh ./gradlew \
-Pjdk18_home=/etc/alternatives/java_sdk_18_openjdk \
-Pllvm_home=/home/bric3/clang+llvm-14.0.0-x86_64-linux-gnu-ubuntu-18.04/ \
clean verify (3)
1 | At the time of writing, the latest is OpenJDK 18. |
2 | LLVM 14 for x86_64 |
3 | Run the documented build command with the required home directories. |
If everything is alright you can use jextract:
jextract version
$ build/jextract/bin/jextract --version
WARNING: Using incubator modules: jdk.incubator.foreign
jextract 18.0.1
JDK version 18.0.1+10
clang version 14.0.0
The basic usage is jextract <options> <header file>. Since there are multiple headers, the trick is to specify a handcrafted header that includes every needed header.
#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
It would be really neat if jextract worked with a here-doc for multiple headers. Of course nowadays, …
Also note that jextract supports passing the options as an option file. So we can pass output options, like the target package and class name, but also the symbols we’d like.
--source (1)
--output build/generated/sources/jextract-syscall/java (2)
--target-package linux (3)
--header-class-name syscall_h (4)
--include-function syscall
--include-macro SYS_memfd_secret
--include-function close
--include-function ftruncate
--include-function mmap
--include-function munmap
--include-macro PROT_READ
--include-macro PROT_WRITE
--include-macro MAP_SHARED
--include-function strerror
#### Since errno macro is not supported at this time, it is necessary
#### to manually resolve errno, and include OS specific declarations.
#### e.g. __errno_location is Linux specific
--include-function __errno_location (5)
1 | Tells jextract to generate the source code, instead of classes. |
2 | Specifies the output directory. |
3 | Specifies the package name. |
4 | Specifies the class name. |
5 | Specifies the function used to resolve errno on Linux. |
jextract
$ jextract @memfd_secret.jextract.options tmp.h (1)
1 | Assuming jextract is in the PATH, or there’s an alias. |
So once done, we’ll have a file with all the symbols we need.
// ...
public class syscall_h {
public static MemoryAddress __errno_location () {
// ...
}
public static long syscall ( long __sysno, Object... x1) {
// ...
}
public static MemoryAddress mmap ( Addressable __addr, long __len, int __prot, int __flags, int __fd, long __offset) {
// ...
}
// ...
public static int SYS_memfd_secret() {
return (int)447L;
}
}
What’s nice is that the arguments are named, e.g. __sysno, __addr, __fd, etc.
Once you have done your research on which symbols you need, it’s really nice to let jextract generate the code for you, which is likely to be up to date, with the best practices baked in.
There’s one thing where this is a bit suboptimal: the __errno_location function is actually an OS-specific function that is used to resolve the errno value. I’m unsure if jextract should resolve macros in general, yet errno seems like something very common, so would it make sense for jextract to handle this one? But then it’s opening the door to macro resolution, which is a different level of complexity.
That being said, it’s not a deal-breaker, just something to be aware of.
Closing words
Since I heard about this feature in Linux 5.14 I was hoping to test it, in the wake of the Spectre-style attacks, at least from a developer perspective. The first thing is that you’ll need a Linux with that version, so forget Docker Desktop for now, as even the latest 4.8 is still using a Linux 5.10 kernel (at least on macOS). Also, deployment-wise, there’s a flag to enable at boot time, which makes it difficult to deploy, in particular at a cloud provider, unless you have your hands on the bootloader. On your regular laptop, the fact that this feature disables hibernation is almost a deal-breaker for this kind of hardware.
Personally, if an application does not have very tight control over how secrets are actually used, I fail to see the value of such a feature.
Yet again Project Panama, embodied by JEP-419 in JDK 18, delivers: it’s possible to interact with the system, with some ease, and without having to deal with different build systems. I have almost nothing relevant to mention here. I missed the possibility of creating a MemorySegment from a file descriptor, but this might be a rare case, especially with the topic at hand.
While I still find the mandatory use of --enable-native-access=ALL-UNNAMED impractical, especially for an API that is arriving late, after alternatives that do not have this enforcement, this restriction will be relaxed with JEP-424 in JDK 19 (java/lang/foreign/package-info.java):
if this flag is not specified, users will get a warning for the first call to a restricted method (one warning per module).
Yet again, I’m happy to see this part of Project Panama landing in the JDK to bridge the gap to the native world without third-party libraries.
-
Athijegannathan Sundararajan and the Panama team