systemd Sandbox or systemd Lightweight Containers

Sandboxing IoT Apps using lightweight containers is an important step for Linux-based IoT devices: it reduces exposure to misconfiguration, bugs, and vulnerability exploitation.

As a simple example, BrickerBot and similar worms did not use complex 0-day exploits. They used simple attack vectors such as unprotected remote shell accounts, and according to internet resources, a lot of IoT devices were affected. The straightforward solution in this case would be a firewall plus a powerful sandbox mechanism for Apps.

With every new systemd release we continue to introduce new sandbox mechanisms, but by default all of them are opt-in. In the future we may add another run-time mode that makes the sandbox opt-out.

Meanwhile, the systemd manager now allows you to run your Apps from an image, like most other container runtimes. However, systemd does not use any standard image format, since most of the container runtimes that use this scheme are usually over-engineered, and some of them abuse Linux kernel features to hide other misbehavior. Right now systemd supports Lightweight Containers: by using only file system mount namespaces to isolate and ship Apps with their dependencies, we avoid, for now, the complexity of container managers. Network namespaces are used sparingly, only to block internet access for Apps by disconnecting network interfaces. We may improve network namespace usage, but only to make it easy to integrate with the “ip” tool, which should handle all possible network cases for Embedded Linux setups.

systemd Security Sandbox features

The following lists some new systemd sandbox options:

1. New File system sandbox option:

    RootImage=

Takes a path to a block device, loopback file, etc., that can be mounted as the new root file system for your App.
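As a rough illustration, a service that runs from its own image might look like the sketch below. The unit name, image path, and binary path are hypothetical. MountAPIVFS= is an additional systemd.exec option not covered in this list; it is included on the assumption that the App expects /proc, /sys and /dev to exist inside the image.

```ini
# /etc/systemd/system/sensor-app.service (hypothetical unit)
[Unit]
Description=Sensor App running from its own root image (illustrative)

[Service]
# Hypothetical image holding the App and its dependencies
RootImage=/var/lib/apps/sensor-app.raw
# Mount /proc, /sys and /dev inside the new root (assumed to be needed by the App)
MountAPIVFS=yes
# This path is resolved inside the image, not on the host
ExecStart=/usr/bin/sensor-app
```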

2. Some User privileges sandbox options:

    DynamicUser=

If set to yes, runs your App under a separate user (Unix UID/GID). The UID is allocated dynamically and released when the App stops, allowing IoT devices to follow the Android model, where each App is executed under a different user, separating Apps and their file access permissions.

    NoNewPrivileges=

If set, ensures that the App and all its children can never gain new privileges through execve().
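A minimal sketch combining the two options above could look as follows. The binary path and directory name are hypothetical. StateDirectory= is a further systemd.exec option, added here on the assumption that the App needs somewhere writable to keep its data, since its UID only exists while it runs.

```ini
[Service]
# Hypothetical App binary
ExecStart=/usr/bin/sensor-app
# Allocate a UID/GID when the App starts, release it when the App stops
DynamicUser=yes
# The App and its children cannot gain privileges through execve()
NoNewPrivileges=yes
# Assumption: the App needs persistent state; creates /var/lib/sensor-app for it
StateDirectory=sensor-app
```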

3. Some Network sandbox options:

    PrivateNetwork=

If set to yes, sets up a new private network namespace containing only a loopback interface, cutting off internet access.
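One way to try this on an existing service without editing its unit file is a drop-in fragment. The service name and file name below are hypothetical.

```ini
# /etc/systemd/system/offline-logger.service.d/no-network.conf (hypothetical drop-in)
[Service]
# The service sees only its own loopback interface
PrivateNetwork=yes
```

After adding the file, “systemctl daemon-reload” followed by a restart of the service should apply the setting.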

    IPAddressDeny=

Takes an IP address prefix; all traffic to and from matching addresses is blocked for the App.

    IPAddressAllow=

The whitelist: the list of permitted IP addresses or network prefixes.

To also block raw packets (AF_PACKET sockets), use the following, which works in blacklisting mode:

    RestrictAddressFamilies=~AF_PACKET

We are working to make this more user friendly. In the near future we may add an “ACCESS_INTERNET=yes|no” alias for those options, to effectively block all inet or internet operations, including constructing raw packets and binding privileged ports.
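Until such an alias exists, the address filters and the address family restriction can be combined by hand. The sketch below denies all IP traffic and then permits only loopback and one local subnet. The binary path and the subnet are example values, to be adjusted for your network.

```ini
[Service]
# Hypothetical App binary
ExecStart=/usr/bin/sensor-app
# Start by denying all IP traffic for this service
IPAddressDeny=any
# Then permit loopback and one example subnet
IPAddressAllow=localhost
IPAddressAllow=192.168.1.0/24
# Refuse creation of raw packet sockets
RestrictAddressFamilies=~AF_PACKET
```

Entries on the allow list take precedence over the deny list, which is why a blanket “IPAddressDeny=any” can be narrowed with specific “IPAddressAllow=” lines.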

4. Kernel attack surface reduction:

    RestrictNamespaces=
    “RestrictNamespaces=yes” or “RestrictNamespaces=~user”

Restricts access to Linux namespaces. Most IoT devices should reduce access to Linux user namespaces, since some vulnerabilities and exploits still target this feature.

    ProtectKernelTunables=

Blocks tuning of kernel parameters by making the relevant /proc and /sys files read-only.

    ProtectKernelModules=

If set, removes the CAP_SYS_MODULE capability and blocks your App from explicitly loading or unloading modules.
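The three options above could be combined in a unit as sketched below. The binary path is hypothetical, and the choice to restrict only user namespaces (rather than all namespaces with “yes”) is just one possible policy.

```ini
[Service]
# Hypothetical App binary
ExecStart=/usr/bin/sensor-app
# Deny creation of user namespaces; other namespace types stay available
RestrictNamespaces=~user
# Kernel tunables under /proc and /sys become read-only for the App
ProtectKernelTunables=yes
# Drop CAP_SYS_MODULE so the App cannot load or unload kernel modules
ProtectKernelModules=yes
```

On recent systemd versions, “systemd-analyze security” followed by the unit name should report which sandbox options a unit uses.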

    SystemCallFilter=

Seccomp-based system call filtering. In systemd we have organized Linux system calls into groups, inspired by the Google Chromium browser. You can restrict your App by functionality by blacklisting groups of system calls: put “~” at the start of the list of groups. As an example, blacklisting:

    “@reboot”

will block all reboot-related system calls.

    “@module”

will block all kernel module system calls.

    “@mount”

will block all file system mount and umount system calls.

For more on system call filtering, please refer to the official systemd documentation, systemd.exec [1]. We have a pretty usable system call filtering feature, and we are actively working on improving it.
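As a small sketch, the three groups named above can be blacklisted in a single line. The binary path is hypothetical. SystemCallErrorNumber= is an additional systemd.exec option, shown on the assumption that you prefer a failing system call over a killed process.

```ini
[Service]
# Hypothetical App binary
ExecStart=/usr/bin/sensor-app
# The leading "~" turns the list into a blacklist of the three groups
SystemCallFilter=~@reboot @module @mount
# Optional: make a blocked call fail with EPERM instead of terminating the App
SystemCallErrorNumber=EPERM
```

To see which system calls a group contains on your system, “systemd-analyze syscall-filter” followed by the group name should list them.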


Future plans for systemd: as the systemd project continues to evolve to handle new use cases, we have to face reality: we need to 1) reduce our functionality to better handle some IoT requirements, and 2) integrate with software update mechanisms. More generally, we have to support more user-friendly features. In the past, systemd was intended to be used by experienced service developers and SysVinit experts; today the user base consists more of users of the Container and Android Apps models. This does not mean that we have to copy those models, but we should start with a new, smooth run-time model.