Kernel Image Lockdown and eBPF Flexibility!

The Kernel Lockdown feature that was merged in Linux 5.4 is designed to prevent both direct and indirect access to a running kernel image, attempting to protect against unauthorized modification of the kernel image and to prevent access to security and cryptographic data located in kernel memory, whilst still permitting driver modules to be loaded.

Introduction

In his post "Linux kernel lockdown, integrity, and confidentiality", Matthew Garrett first explains that getting the lockdown patches merged was a journey of around 7 years (so thank you for getting it upstreamed!). He then explains why recent kernels need this in a UEFI Secure Boot context: "If you verify your boot chain but allow root to modify that kernel, the benefits of the verified boot chain are significantly reduced. Even if root can't modify the on-disk kernel, root can just hot-patch the kernel and then make this persistent by dropping a binary that repeats the process on system boot."

Lockdown is intended as a mechanism to prevent that: it closes off some of the interfaces that allow root to modify the running kernel image.

If a prohibited or restricted feature is accessed or used, the kernel will emit a message that looks like:

Lockdown: X: Y is restricted, see man kernel_lockdown.7

Where X indicates the process name and Y indicates what is restricted.
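To pick such messages out of dmesg or the journal, X and Y can be extracted with sed. This is a minimal POSIX sh sketch that assumes the message keeps the shape shown above; it only anchors on the "Lockdown: " prefix and the " is restricted" suffix, so a timestamp before the message and the punctuation after it do not matter. The function names and the sample message are illustrative, not captured kernel output.

```shell
#!/bin/sh
# Split a Lockdown kernel message of the form
#   Lockdown: X: Y is restricted, see man kernel_lockdown.7
# into X (the process name) and Y (the restricted feature).
# Anything before "Lockdown: " (e.g. a dmesg timestamp) is ignored.

lockdown_msg_process() {
    printf '%s\n' "$1" | sed -n 's/^.*Lockdown: \([^:]*\): .* is restricted.*$/\1/p'
}

lockdown_msg_feature() {
    printf '%s\n' "$1" | sed -n 's/^.*Lockdown: [^:]*: \(.*\) is restricted.*$/\1/p'
}

# Illustrative message for the /dev/mem case (hypothetical sample).
msg='Lockdown: head: /dev/mem,kmem,port is restricted; see man kernel_lockdown.7'
echo "process=$(lockdown_msg_process "$msg") feature=$(lockdown_msg_feature "$msg")"
```

For example, dmesg | grep 'Lockdown:' lists the raw messages, and each line can then be passed to the two functions.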

Lockdown supports two modes: integrity, which prevents userland from using interfaces that could modify the kernel, and confidentiality, a superset of integrity that additionally prevents userland from extracting kernel memory that may disclose secrets (such as the EVM signing key) usable in offline attacks.
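Because confidentiality is a superset of integrity, the modes can be thought of as points on one ordered scale. As far as I understand the upstream implementation, the kernel orders its lockdown reasons so that every integrity reason sorts below every confidentiality reason, and refuses a feature when the current level is at or above that reason. The sketch below models only this ordering; the numeric levels 0, 1 and 2 and the function names are illustrative, not kernel symbols.

```shell
#!/bin/sh
# Model of the Lockdown ordering (a sketch, not kernel code): modes and
# feature classes share one scale, and a feature is refused when the
# current mode is at or above the feature's class.
lockdown_level() {
    case $1 in
        none) echo 0 ;;
        integrity) echo 1 ;;
        confidentiality) echo 2 ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# is_restricted MODE CLASS: exit 0 if a feature of CLASS
# ("integrity" or "confidentiality") is blocked in MODE.
is_restricted() {
    mode_level=$(lockdown_level "$1") || return 2
    class_level=$(lockdown_level "$2") || return 2
    [ "$class_level" -gt 0 ] && [ "$mode_level" -ge "$class_level" ]
}
```

With this model, is_restricted confidentiality integrity succeeds (the superset also blocks integrity features), while is_restricted integrity confidentiality fails.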

Lockdown in integrity mode restricts the following:

  • Access to /dev/mem, /dev/kmem, /dev/kcore, /dev/ioports, BPF and kprobes.
  • The use of module parameters that directly specify hardware parameters to drivers, whether through the kernel command line or when loading a module.
  • Machine hibernation.
  • Module loading and kexec: only signed modules may be loaded and only signed binaries may be kexec'd.
  • Direct PCI BAR access.
  • The use of the ioperm and iopl system calls on x86.
  • The use of the TIOCSSERIAL serial ioctl.
  • The alteration of MSR registers on x86.
  • The use of various ACPI interfaces and the overriding of ACPI tables.
  • ...

To see if Lockdown is enabled you can read:

cat /sys/kernel/security/lockdown

To change the mode, write the corresponding mode to the same file as root. Note: be careful when doing so, as the mode can only be raised at runtime; you will need to reboot the machine to bring it back to none.
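The file lists every mode and marks the active one with brackets, for example none [integrity] confidentiality. A small helper can print just the active mode; the function name is my own, and the optional argument exists only so the helper can be tried without the real file.

```shell
#!/bin/sh
# Print the active Lockdown mode. The securityfs file lists all modes and
# puts the active one in brackets, e.g. "none [integrity] confidentiality".
# An optional argument replaces the file contents (useful for testing).
lockdown_active_mode() {
    if [ $# -gt 0 ]; then
        line=$1
    else
        line=$(cat /sys/kernel/security/lockdown) || return 1
    fi
    # Keep only the bracketed word.
    printf '%s\n' "$line" | sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}
```

On a live system, calling lockdown_active_mode with no argument reads the real file. To raise the mode, the write itself has to run as root: echo integrity | sudo tee /sys/kernel/security/lockdown works, whereas sudo echo integrity > /sys/kernel/security/lockdown does not, because the redirection is performed by the unprivileged shell.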

Balancing Lockdown and Usability

Lockdown is implemented as a Linux Security Module (LSM), and today it is compiled in by default in most kernels.

For servers and container workloads, Lockdown is not enabled by default. On desktops where it is turned on, some users disable it because it blocks machine hibernation, which conflicts with Lockdown's original design decisions.

Integrity mode blocks a lot of features. If Lockdown is turned off for usability reasons, a more fine-grained policy is a reasonable alternative: it would still improve the security of such machines and protect against other attack vectors.

If kernel Lockdown is off, there is no conflict between the two.

SELinux already supports this, so I decided to try the same with eBPF. After a discussion with Daniel Borkmann, co-creator of eBPF, the approach seems to make sense. Thanks, Daniel, for the feedback!

eBPF to the rescue!

These days we can write dynamically loadable LSM programs using eBPF, which makes it possible to quickly adapt to and support many different workloads.

The first experimental result is kimglock, part of the bpflock project. It is a standalone BPF program with its own userspace component, and it can be used in different contexts.

The supported profiles are privileged, baseline and restricted, which can later be easily mapped to the Kubernetes Pod Security Standards. By default everything is logged and allowed. Under the baseline profile, only programs or containers running in the initial pid and network namespaces are allowed (privileged in Kubernetes terminology); this covers Cilium, bcc, or programs on the desktop. The restricted profile blocks all applications. A more complex filter based on cgroups should be integrated in the future.
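To make the profile rules concrete, here is a hypothetical shell model of the decision described above. It is not bpflock code (kimglock makes this decision inside its BPF program); it only encodes the three rules, treats namespace identifiers as opaque strings such as pid:[4026531836], and accepts allow as a synonym of privileged, as the allow value used with docker in this post suggests.

```shell
#!/bin/sh
# Hypothetical model of the kimglock profile rules (not bpflock's code):
#   allow/privileged -> every process is allowed
#   baseline         -> only processes sharing the initial pid AND net
#                       namespaces are allowed
#   restricted       -> every process is denied
kimglock_decide() {
    profile=$1
    proc_pidns=$2 proc_netns=$3
    init_pidns=$4 init_netns=$5
    case $profile in
        allow|privileged)
            echo allow ;;
        baseline)
            if [ "$proc_pidns" = "$init_pidns" ] && [ "$proc_netns" = "$init_netns" ]; then
                echo allow
            else
                echo deny
            fi ;;
        restricted)
            echo deny ;;
        *)
            echo "unknown profile: $profile" >&2
            return 1 ;;
    esac
}
```

On a live system the identifiers can be read with readlink, for example kimglock_decide baseline "$(readlink /proc/$$/ns/pid)" "$(readlink /proc/$$/ns/net)" "$(sudo readlink /proc/1/ns/pid)" "$(sudo readlink /proc/1/ns/net)"; reading another process's namespace links generally requires root.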

Under the privileged or baseline profiles, kimglock supports the following parameters to restrict individual features:

  • unsigned_module : block unsigned module loading.
  • unsafe_module_parameters : block modules with parameters that directly specify hardware parameters to drivers.
  • dev_mem : access to /dev/{mem,kmem,port} is blocked.
  • kexec : kexec of unsigned images is blocked.
  • hibernation : hibernation is blocked.
  • pci_access : block direct PCI BAR access.
  • ioport : raw io port access is blocked.
  • msr : raw msr access is blocked.
  • mmiotrace : tracing memory mapped I/O is blocked.
  • debugfs : debugfs is blocked.
  • xmon_rw : xmon write access is blocked.
  • bpf_write : block the use of BPF to write to user RAM.
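These per-feature options lend themselves to a simple block list. As a hypothetical illustration (bpflock's real configuration format may look nothing like this), the sketch below keeps the options above in a comma-separated list and builds a desktop-style policy that blocks everything except hibernation.

```shell
#!/bin/sh
# Hypothetical per-option policy; bpflock's real configuration may differ.
# Option names are the kimglock parameters listed above.
ALL_OPTIONS="unsigned_module,unsafe_module_parameters,dev_mem,kexec,hibernation,pci_access,ioport,msr,mmiotrace,debugfs,xmon_rw,bpf_write"

# option_blocked OPTION LIST: exit 0 if OPTION is in the comma-separated LIST.
option_blocked() {
    case ",$2," in
        *",$1,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# all_except OPTION: print ALL_OPTIONS without OPTION, comma-separated.
all_except() {
    printf '%s\n' "$ALL_OPTIONS" | tr ',' '\n' | grep -Fvx -- "$1" | paste -s -d , -
}

# Desktop-style policy: block everything except hibernation.
BLOCKED=$(all_except hibernation)
for opt in dev_mem hibernation; do
    if option_blocked "$opt" "$BLOCKED"; then
        echo "$opt: blocked"
    else
        echo "$opt: allowed"
    fi
done
```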

Also, following the discussion with Daniel, it would be more user-friendly to group some of these under a single option, making it easy to allow trusted applications or containers such as BPF userland tools. Future versions will probably support this.

Block Kernel Image Modifications

Now let's give it a try. We run the bpflock container with exec_snoop enabled, using the default allow|privileged profile:

docker run --name bpflock -it --rm --cgroupns=host \
    --pid=host --privileged \
    -e "BPFLOCK_EXEC_SNOOP=all" \
    -e "BPFLOCK_KIMGLOCK_PROFILE=allow" \
    -v /sys/kernel/:/sys/kernel/ \
    -v /sys/fs/bpf:/sys/fs/bpf linuxlock/bpflock

Then we run:

$ sudo head -c 1 /dev/mem

Logs from bpflock:

time="2022-02-11T11:17:06Z" level=info msg="event=syscall_execve tgid=70711 pid=70711 ppid=70710 uid=0 cgroupid=7014 comm=head pcomm=sudo filename=/usr/bin/head retval=0" bpfprog=execsnoop subsys=bpf

time="2022-02-11T11:17:06Z" level=info msg="event=lsm_locked_down operation=/dev/mem,kmem,port tgid=70711 pid=70711 ppid=70710 uid=0 cgroupid=7014 comm=head pcomm=sudo retval=0 reason=allow (privileged)" bpfprog=kimglock subsys=bpf

The second log entry shows that the operation operation=/dev/mem,kmem,port was allowed.
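When bpflock runs for a while, it is handy to pull just the kimglock decisions out of these logs. The sketch below assumes the text format shown above, with operation= first and reason= last inside msg="..."; it is a quick filter, not a general log parser, and the function name is my own.

```shell
#!/bin/sh
# Read bpflock text logs on stdin and print "<operation> <reason>" for each
# kimglock lsm_locked_down event. Relies on the field order seen in the
# sample logs: operation=... first and reason=... last inside msg="...".
kimglock_events() {
    sed -n '/event=lsm_locked_down/s/.*operation=\([^ ]*\) .*reason=\([^"]*\)".*/\1 \2/p'
}
```

For example, docker logs bpflock 2>&1 | kimglock_events; the 2>&1 merges the container's stdout and stderr, so the filter sees the logs whichever stream they are written to.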

Now let's rerun it under the baseline profile, which by default allows only processes in the initial pid and network namespaces:

docker run --name bpflock -it --rm --cgroupns=host \
    --pid=host --privileged \
    -e "BPFLOCK_EXEC_SNOOP=all" \
    -e "BPFLOCK_KIMGLOCK_PROFILE=baseline" \
    -v /sys/kernel/:/sys/kernel/ \
    -v /sys/fs/bpf:/sys/fs/bpf linuxlock/bpflock

Then we test /dev/mem access in new namespaces:

$ sudo unshare -f -p -n bash
# head -c 1 /dev/mem
head: cannot open '/dev/mem' for reading: Operation not permitted

Logs from bpflock:

time="2022-02-11T11:23:58Z" level=info msg="event=syscall_execve tgid=70902 pid=70902 ppid=70895 uid=0 cgroupid=7014 comm=head pcomm=bash filename=/usr/bin/head retval=0" bpfprog=execsnoop subsys=bpf

time="2022-02-11T11:23:58Z" level=info msg="event=lsm_locked_down operation=/dev/mem,kmem,port tgid=70902 pid=70902 ppid=70895 uid=0 cgroupid=7014 comm=head pcomm=bash retval=-1 reason=denied (baseline)" bpfprog=kimglock subsys=bpf

The second log entry shows that the operation operation=/dev/mem,kmem,port was denied for the process running in non-init namespaces. In JSON form, the main fields of this event are:

{
    "subsys": "bpf",
    "bpfprog": "kimglock",
    "event": "lsm_locked_down",
    "operation": "/dev/mem,kmem,port",
    "tgid": 70902,
    "pid": 70902,
    "cgroupid": 7014,
    "comm": "head",
    "retval": -1,
    "reason": "denied (baseline)"
}
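Over a longer run, a quick summary of denials per command can show which programs a profile is breaking. Here is a minimal awk sketch over the text logs, assuming the comm= and retval= fields shown above and treating a negative retval as a denial; the function name is hypothetical.

```shell
#!/bin/sh
# Count denied kimglock events (retval < 0) per command name, reading
# bpflock text logs on stdin. Output: "<comm> <count>" per line.
kimglock_denials_by_comm() {
    awk '/event=lsm_locked_down/ && / retval=-/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^comm=/)
                n[substr($i, 6)]++    # strip the 5-char "comm=" prefix
    }
    END { for (c in n) print c, n[c] }'
}
```

The output order of the awk for-in loop is unspecified, so pipe it through sort if a stable order matters.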

Running bpflock with kimglock under the restricted profile denies access to all processes, including privileged containers.

Conclusion

  • bpflock, which is experimental, takes advantage of the BPF LSM and the Lockdown LSM to restrict access to some kernel features. As many others have said before, eBPF (https://ebpf.io/) is a great technology; in this particular case, it allows loadable LSM programs that can be pinned or removed at any time.
  • bpflock kimglock does not replace upstream kernel Lockdown, as the latter is more effective and restrictive, and addresses the "trusting the kernel" problem. However, the subset of its restrictions used here is useful in other contexts.
  • Thanks to eBPF's dynamic behaviour, hibernation can be allowed on the desktop while everything else is blocked. This can be achieved in a systemd user context and its connected applications, where it is easy to track physically logged-in users and active sessions.

Note: this work is still experimental and it may contain bugs.