Mimic Kubernetes deployment options with Docker
In this article I will show how to duplicate the options one could see in a Kubernetes deployment descriptor locally, via the docker command line.
I will also show what the JVM sees when it is running within these containers.
In the following article I will focus on the cgroup system paths; however, the JVM has a nifty option, -XshowSettings:system, that reads the content of these files on Linux.
There is an additional way to discover these values with -Xlog=container*=trace, but I find it quite impractical in console output: the cgroup information is logged repeatedly, which spams the console.
Container resources
Resources in a Kubernetes deployment descriptor:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: "java-app"
spec:
  template:
    spec:
      containers:
      - name: "java-app"
        resources:
          limits:
            cpu: "6"
            memory: "5Gi"
          requests:
            cpu: "3"
            memory: "3Gi"
Kubernetes translates these values as follows:
- The limit, expressed in millicores (1000 millicores = 1 core), in fact represents a time quota of the Linux CFS; it is based on two values, cfs_period_us and cfs_quota_us. On Kubernetes the period is 100ms (100000µs) and it cannot be changed in the deployment descriptor at this time. For example 100m CPU, 100 milliCPU, and 0.1 CPU are all the same, and they are translated to a 10000µs quota.
- The request is translated at a rate of 1024 cpu shares per core (1000 millicores). For example 1500 millicores are translated into 1536 cpu shares.
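The two translation rules above can be sketched with shell arithmetic. The 6000m and 3000m figures below match the limit and request of the example descriptor; the 100ms period is the value Kubernetes fixes for CFS.

```shell
# Sketch of the Kubernetes CPU translation rules:
#   quota  = millicores * period / 1000
#   shares = millicores * 1024  / 1000
millicores_limit=6000      # cpu: "6" in the descriptor
millicores_request=3000    # cpu: "3" in the descriptor
period_us=100000           # fixed 100ms CFS period on Kubernetes

quota_us=$((millicores_limit * period_us / 1000))
shares=$((millicores_request * 1024 / 1000))

echo "quota=${quota_us} shares=${shares}"
# -> quota=600000 shares=3072
```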
The resources tree declared above is equivalent to these docker parameters:
$ docker run \
--cpu-shares=3072 \ (1)
--cpu-quota=600000 \ (2)
--memory=5g \ (3)
--memory-swap=5g \ (4)
...
1 | cpu request; this is the relative weight of that container for CPU time |
2 | cpu limit; this limits the CPU time of the container’s processes, which means throttling |
3 | memory limit; tells the OS to kill (oom-kill) the container’s processes if they hit this limit |
4 | Kubernetes disables swap, so the amount of physical memory (--memory ) and the sum of
physical memory and swap (--memory-swap ) need to be set to the same value |
In the command above I used --cpu-quota=600000 to express the CPU limit, but I could have done the same using --cpus=6; the docker cli computes the quota value for us.
These values can be retrieved within the container via the /sys
filesystem.
$ cat /sys/fs/cgroup/cpu/{cpu.cfs_period_us,cpu.cfs_quota_us,cpu.shares}
100000
600000
3072
Using CPU limits may have severe performance drawbacks, in particular if the process is multi-threaded: the quota indicates a time budget that is shared by all of the process’s threads, or more exactly by all threads in the same cgroup. That means that if the limit is 4 CPUs (a 400000µs quota) and 20 threads are running in parallel, the quota for the period will be consumed in 400000 / 20 = 20000µs = 20ms, and the processes in this cgroup will get throttled for the rest of the period, i.e. for 80ms. It’s possible to examine how much a process has been throttled by CFS by reading the cpu.stat file of its cgroup (nr_periods, nr_throttled, throttled_time), for example on a container with a 0.1 CPU limit
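The throttling arithmetic above can be checked with a couple of shell expressions (the 4-CPU limit and 20 runnable threads are the example numbers from the paragraph, not measured values):

```shell
# With a 4-CPU limit, the cgroup gets 400000us of CPU time per 100000us period.
# 20 threads running in parallel burn the quota 20x faster than wall-clock time.
period_us=100000
quota_us=400000
threads=20

running_us=$((quota_us / threads))           # wall-clock time before throttling
throttled_us=$((period_us - running_us))     # throttled for the rest of the period

echo "runs ${running_us}us, throttled ${throttled_us}us per period"
# -> runs 20000us, throttled 80000us per period
```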
The Kubernetes memory request is used for scheduling the pod on nodes.
Beware that the default CPU share is 1024; in that case the JVM will not use this value and will instead use the number of processors of the host machine.
$ cat /sys/fs/cgroup/cpu/cpu.shares
1024 (1)
$ env -u JDK_JAVA_OPTIONS jshell -J-XshowSettings:system -s - \
  <<<'System.out.printf("procs: %d%n", Runtime.getRuntime().availableProcessors())'
Operating System Metrics:
Provider: cgroupv1
Effective CPU Count: 32
CPU Period: 100000us
CPU Quota: -1
CPU Shares: -1 (2)
List of Processors, 32 total:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
List of Effective Processors, 32 total:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
List of Memory Nodes, 1 total:
0
List of Available Memory Nodes, 1 total:
0
CPUSet Memory Pressure Enabled: false
Memory Limit: 2.00G
Memory Soft Limit: Unlimited
Memory & Swap Limit: Unlimited
Kernel Memory Limit: Unlimited
TCP Memory Limit: Unlimited
Out Of Memory Killer Enabled: true
procs: 32
1 | The CPU shares value in this cgroup file is 1024 by default |
2 | The JVM treats the default value of 1024 as "not set" and reports -1 |
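The shares-to-CPU-count conversion the JVM applies can be sketched as follows. This is an assumption based on the observed behavior above: a value of exactly 1024 is treated as "unset" (falling back to the host's processors), otherwise the count is roughly shares divided by 1024, rounded up.

```shell
# Sketch of the cgroup v1 cpu.shares heuristic (assumed, not the actual JDK code)
shares_to_cpus() {
  if [ "$1" -eq 1024 ]; then
    echo "host cpu count"          # default value: JVM ignores it
  else
    echo $((($1 + 1023) / 1024))   # ceil(shares / 1024)
  fi
}

shares_to_cpus 1024   # -> host cpu count
shares_to_cpus 3072   # -> 3
shares_to_cpus 1536   # -> 2
```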
TODO: other resources, like hugepages:
spec.containers[].resources.limits.hugepages-<size>
spec.containers[].resources.requests.hugepages-<size>
NUMA (Non-Uniform Memory Access) or Topology
Even if I understand how it works, I have no concrete experience with it, so I didn’t test it. This is exposed as a beta feature since Kubernetes 1.18, whereas in docker it doesn’t seem to be well supported, or supported at all.
A container in my Kubernetes cluster gives this.
$ cat /proc/6/numa_maps
700000000 default anon=1052736 dirty=1052736 N0=1052736 kernelpagesize_kB=4
801040000 default
5653635fd000 default file=/usr/lib/jvm/java-11-amazon-corretto/bin/java mapped=1 mapmax=2 N0=1 kernelpagesize_kB=4
5653637fe000 default file=/usr/lib/jvm/java-11-amazon-corretto/bin/java anon=1 dirty=1 N0=1 kernelpagesize_kB=4
5653637ff000 default file=/usr/lib/jvm/java-11-amazon-corretto/bin/java anon=1 dirty=1 N0=1 kernelpagesize_kB=4
565363a10000 default heap anon=85366 dirty=85366 N0=85366 kernelpagesize_kB=4
7f808eafb000 default
7f808eaff000 default anon=8 dirty=8 N0=8 kernelpagesize_kB=4
...
However, the numa_maps file is missing on my local docker.
Security context
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: "java-app"
spec:
  template:
    spec:
      containers:
      - name: "java-app"
        securityContext:
          fsGroup: 43600
          runAsUser: 43514
$ id
uid=43514 gid=0(root) groups=0(root),43600
$ ps -A -o pid,user,group,command
PID USER GROUP COMMAND
1 43514 root /usr/bin/dumb-init -- /usr/bin/java -Dfile.encoding=UT
6 43514 root /usr/bin/java -Dfile.encoding=UTF-8 -Duser.timezone=UT
1039 43514 root /bin/bash
1069 43514 root ps -A -o pid,user,group,command
$ ls -lah
total 98M
drwxr-xr-x 1 root root 4.0K Oct 28 08:50 .
drwxr-xr-x 1 root root 4.0K Oct 28 08:50 ..
drwxr-xr-x 1 root root 4.0K Apr 7 2020 bin
drwxr-xr-x 2 root root 4.0K Feb 1 2020 boot
drwxr-xr-x 5 root root 360 Oct 28 08:50 dev
-rw-r--r-- 1 root root 60M Oct 23 08:30 java-app.jar
drwxr-xr-x 1 root root 4.0K Oct 28 08:50 etc
drwxrwsrwx 2 root 43600 4.0K Oct 28 11:57 diag
drwxr-xr-x 2 root root 4.0K Feb 1 2020 home
...
dr-xr-xr-x 1263 root root 0 Oct 28 08:50 proc
drwx------ 2 root root 4.0K Mar 27 2020 root
...
drwxr-xr-x 1 root root 4.0K Mar 27 2020 var
The fsGroup option is not dynamically re-mappable in docker (see this issue moby/moby#2259). You’ll need to chown these mounts within the container. However, if a mounted volume has files with the group id 43600, then the right way to be able to read them is to enable the supplementary group via --group-add.
$ docker run \
--user 43514 \
--group-add 43600 \
...
I have no name!@3f7dc5eef417:/$ id
uid=43514 gid=0(root) groups=0(root),43600
However, if runAsGroup is present, it means the user 43514 is no longer part of the root group:
- name: "java-app"
  securityContext:
    fsGroup: 43600
    runAsUser: 43514
    runAsGroup: 43500
$ docker run \
--user 43514:43500 \
--group-add 43600 \
...
I have no name!@3f7dc5eef417:/$ id
uid=43514 gid=43500 groups=43500,43600
Volume mounts
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: "java-app"
spec:
  template:
    spec:
      containers:
      - name: "java-app"
        volumeMounts:
        - mountPath: "/diag"
          name: "diagnostic-files"
        - mountPath: "/etc/java-app/config.yaml"
          name: "config"
          subPath: "config.yaml"
      volumes:
      - emptyDir: {}
        name: "diagnostic-files"
      - configMap:
          defaultMode: 420
          name: "java-app"
        name: "config"
$ docker run \
--mount=type=bind,source=$(pwd)/test.yaml,target=/etc/java-app/config.yaml \ (1)
--mount=type=bind,source=$(pwd)/tmp-diag,target=/diag \ (2)
...
1 | Bind mount equivalent to the config volume mount |
2 | Bind mount using local folder ./tmp-diag , but this can be replaced by another docker volume |
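Unlike Kubernetes, docker won’t create the bind-mount sources for you, so they have to exist before running the command above. A minimal preparation sketch, assuming the test.yaml content is a hypothetical stand-in for the real configMap data:

```shell
# Local stand-ins for the two volumes:
#   tmp-diag  -> replaces the emptyDir volume
#   test.yaml -> replaces the configMap entry mounted via subPath
mkdir -p tmp-diag

cat > test.yaml <<'EOF'
# hypothetical configuration content
logging:
  level: "INFO"
EOF
```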
Environment variables
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: "java-app"
spec:
  template:
    spec:
      containers:
      - name: "java-app"
        env:
        - name: "JDK_JAVA_OPTIONS"
          value: "-Xms3g -Xmx3g -XX:+AlwaysPreTouch"
        - name: "SECRET_TOKEN"
          valueFrom:
            secretKeyRef:
              key: "secret-token"
              name: "component-token"
        - name: "APP_VERSION"
          valueFrom:
            fieldRef:
              fieldPath: "metadata.labels['java.app.image/version']"
        - name: "HOST_IP"
          valueFrom:
            fieldRef:
              fieldPath: "status.hostIP"
$ docker run \
--env JDK_JAVA_OPTIONS="-Xms3g -Xmx3g -XX:+AlwaysPreTouch" \
...
This one is straightforward, no surprises here.
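The only caveat is that valueFrom sources (secrets, labels, the node IP) have no docker equivalent, so those values have to be resolved by hand and passed as literal --env flags. A sketch, where all three values are hypothetical stand-ins:

```shell
# Resolve the downward-API / secret values manually before calling docker run
APP_VERSION="1.2.3"           # stand-in for metadata.labels['java.app.image/version']
SECRET_TOKEN="dummy-token"    # stand-in for the component-token secret
HOST_IP="127.0.0.1"           # stand-in for status.hostIP

printf -- '--env APP_VERSION=%s --env SECRET_TOKEN=%s --env HOST_IP=%s\n' \
  "$APP_VERSION" "$SECRET_TOKEN" "$HOST_IP"
# -> --env APP_VERSION=1.2.3 --env SECRET_TOKEN=dummy-token --env HOST_IP=127.0.0.1
```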
There are other flags that can be passed to mimic the Kubernetes behavior:
- spec.template.spec.restartPolicy can be mapped to the same values as --restart to control the restart policy, but it’s rarely useful to test that.