Linux Capabilities
Linux divides the privileges traditionally associated with the superuser into distinct units, known as capabilities, which can be independently enabled and disabled. Capabilities are a per-thread attribute. See the capabilities(7) man page.
When bpfman is run as a systemd service, the set of Linux capabilities is restricted to only the required capabilities via the bpfman.service file, using the AmbientCapabilities and CapabilityBoundingSet fields (see bpfman.service).
All spawned threads are stripped of all capabilities, removing all sudo privileges (see drop_linux_capabilities() usage), leaving only the main thread with the needed set of capabilities.
Current bpfman Linux Capabilities
Below is the current set of Linux capabilities required by bpfman to operate:
- CAP_BPF:
  - Required to load BPF programs and create BPF maps.
- CAP_DAC_READ_SEARCH:
  - Required by Tracepoint programs. aya needs it to check the tracefs mount point, for example when trying to read "/sys/kernel/tracing" and "/sys/kernel/debug/tracing".
- CAP_NET_ADMIN:
  - Required for TC programs to attach/detach to/from a qdisc.
- CAP_SETPCAP:
  - Required to allow bpfman to drop Linux capabilities on spawned threads.
- CAP_SYS_ADMIN:
  - Kprobe (Kprobe and Uprobe) and Tracepoint programs are considered perfmon programs and require CAP_PERFMON and CAP_SYS_ADMIN to load.
  - TC and XDP programs are considered admin programs and require CAP_NET_ADMIN and CAP_SYS_ADMIN to load.
- CAP_SYS_RESOURCE:
  - Required by bpfman to call setrlimit() on RLIMIT_MEMLOCK.
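To illustrate the CAP_SYS_RESOURCE entry, the following is a minimal Python sketch of a setrlimit() call on RLIMIT_MEMLOCK. The function name raise_memlock_limit is hypothetical, and raising the limit to "unlimited" is an assumption made for illustration; it is not taken from the bpfman source. Raising the hard limit is the step that the kernel refuses without CAP_SYS_RESOURCE.

```python
import resource


def raise_memlock_limit():
    """Try to raise RLIMIT_MEMLOCK to unlimited.

    Returns (ok, (soft, hard)), where ok says whether the call succeeded
    and the tuple is the limit in effect afterwards.
    """
    unlimited = (resource.RLIM_INFINITY, resource.RLIM_INFINITY)
    try:
        resource.setrlimit(resource.RLIMIT_MEMLOCK, unlimited)
        ok = True
    except (ValueError, OSError):
        # Raising the hard limit is rejected for a process that lacks
        # CAP_SYS_RESOURCE; the previous limit stays in place.
        ok = False
    return ok, resource.getrlimit(resource.RLIMIT_MEMLOCK)


# Example use:
#   ok, (soft, hard) = raise_memlock_limit()
#   print("raised" if ok else "refused", soft, hard)
```

If the call is refused when run from the bpfman service context, that would suggest CAP_SYS_RESOURCE is missing from the granted set.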
Debugging Linux Capabilities
As new features are added, the set of Linux capabilities required by bpfman may change over time.
The following describes the steps to determine the set of capabilities required by bpfman.
If there are any Permission denied (os error 13) type errors when starting or running bpfman as a systemd service, adjusting the Linux capabilities is a good place to start.
Determine Required Capabilities
The first step is to turn all capabilities on and see if that fixes the problem.
This can be done without recompiling the code by editing bpfman.service. Comment out the finite list of granted capabilities and set the fields to ~, which indicates all capabilities.
sudo vi /usr/lib/systemd/system/bpfman.service
:
[Service]
:
AmbientCapabilities=~
CapabilityBoundingSet=~
#AmbientCapabilities=CAP_BPF CAP_DAC_OVERRIDE CAP_DAC_READ_SEARCH CAP_NET_ADMIN CAP_PERFMON CAP_SETPCAP CAP_SYS_ADMIN CAP_SYS_RESOURCE
#CapabilityBoundingSet=CAP_BPF CAP_DAC_OVERRIDE CAP_DAC_READ_SEARCH CAP_NET_ADMIN CAP_PERFMON CAP_SETPCAP CAP_SYS_ADMIN CAP_SYS_RESOURCE
Reload the service file, start or restart bpfman, and watch the bpfman logs to see whether the problem is resolved.
If so, then the next step is to watch the set of capabilities being requested by bpfman.
Run the bcc capable tool to watch capabilities being requested in real time, and restart bpfman:
$ sudo /usr/share/bcc/tools/capable
TIME UID PID COMM CAP NAME AUDIT
:
16:36:00 979 75553 tokio-runtime-w 8 CAP_SETPCAP 1
16:36:00 979 75553 tokio-runtime-w 8 CAP_SETPCAP 1
16:36:00 979 75553 tokio-runtime-w 8 CAP_SETPCAP 1
16:36:00 0 616 systemd-journal 19 CAP_SYS_PTRACE 1
16:36:00 0 616 systemd-journal 19 CAP_SYS_PTRACE 1
16:36:00 979 75550 bpfman 24 CAP_SYS_RESOURCE 1
16:36:00 979 75550 bpfman 1 CAP_DAC_OVERRIDE 1
16:36:00 979 75550 bpfman 21 CAP_SYS_ADMIN 1
16:36:00 979 75550 bpfman 21 CAP_SYS_ADMIN 1
16:36:00 0 75555 modprobe 16 CAP_SYS_MODULE 1
16:36:00 0 628 systemd-udevd 2 CAP_DAC_READ_SEARCH 1
16:36:00 0 75556 bpf_preload 24 CAP_SYS_RESOURCE 1
16:36:00 0 75556 bpf_preload 39 CAP_BPF 1
16:36:00 0 75556 bpf_preload 39 CAP_BPF 1
16:36:00 0 75556 bpf_preload 39 CAP_BPF 1
16:36:00 0 75556 bpf_preload 38 CAP_PERFMON 1
16:36:00 0 75556 bpf_preload 38 CAP_PERFMON 1
16:36:00 0 75556 bpf_preload 38 CAP_PERFMON 1
:
Compare the output to the list in bpfman.service and determine the delta.
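The comparison can also be scripted. The following is a rough Python sketch, not part of bpfman: the function names granted_caps, requested_caps and missing_caps are hypothetical, and it assumes that the capable rows of interest can be selected by UID (979 is the bpfman user in the sample output above) and that each data row has the seven columns shown in the header.

```python
def granted_caps(unit_text, key="CapabilityBoundingSet"):
    """Collect the capability names from uncommented `key=` lines."""
    granted = set()
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line.startswith(key + "="):
            continue  # also skips commented-out "#Key=..." lines
        value = line.split("=", 1)[1].strip()
        if value.startswith("~"):
            # "~" inverts the list, so there is no finite set to compare.
            raise ValueError("inverted capability list; restore the finite list")
        if value:
            granted |= set(value.split())
        else:
            granted = set()  # an empty assignment resets the set
    return granted


def requested_caps(capable_output, uid):
    """Collect capability names requested by `uid` in `capable` output."""
    requested = set()
    for line in capable_output.splitlines():
        fields = line.split()
        # Data rows: TIME UID PID COMM CAP NAME AUDIT
        if len(fields) == 7 and fields[1] == str(uid):
            requested.add(fields[5])
    return requested


def missing_caps(unit_text, capable_output, uid):
    """Capabilities that were requested but are not in the granted list."""
    return sorted(requested_caps(capable_output, uid) - granted_caps(unit_text))
```

For the sample output above and the commented-out list in bpfman.service (once restored), the bpfman rows request CAP_SETPCAP, CAP_SYS_RESOURCE, CAP_DAC_OVERRIDE and CAP_SYS_ADMIN, all of which are already granted, so the delta would be empty.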
Determine Capabilities Per Thread
For additional debugging, it may be helpful to know the granted capabilities on a per-thread basis. As mentioned above, all spawned threads are stripped of all Linux capabilities, so if a spawned thread is requesting a capability, that functionality should be moved off the spawned thread and onto the main thread.
First, determine the bpfman process id, then determine the set of threads:
$ ps -ef | grep bpfman
:
bpfman 75550 1 0 16:36 ? 00:00:00 /usr/sbin/bpfman
:
$ ps -T -p 75550
PID SPID TTY TIME CMD
75550 75550 ? 00:00:00 bpfman
75550 75551 ? 00:00:00 tokio-runtime-w
75550 75552 ? 00:00:00 tokio-runtime-w
75550 75553 ? 00:00:00 tokio-runtime-w
75550 75554 ? 00:00:00 tokio-runtime-w
Then dump the capabilities of each thread:
$ grep Cap /proc/75550/status
CapInh: 000000c001201106
CapPrm: 000000c001201106
CapEff: 000000c001201106
CapBnd: 000000c001201106
CapAmb: 000000c001201106
$ grep Cap /proc/75551/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000000000000000
CapAmb: 0000000000000000
$ grep Cap /proc/75552/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000000000000000
CapAmb: 0000000000000000
:
$ capsh --decode=000000c001201106
0x000000c001201106=cap_dac_override,cap_dac_read_search,cap_setpcap,cap_net_admin,cap_sys_admin,cap_sys_resource,cap_perfmon,cap_bpf
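The same per-thread dump and decode can be sketched in Python for cases where capsh is not installed. This is an illustrative sketch only: the helper names are hypothetical, the capability table reflects the bit numbers in linux/capability.h as of recent kernels, and the label used for unknown bits is an arbitrary choice rather than what capsh prints.

```python
import os

# Capability names indexed by bit number (0 = cap_chown ... 40).
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]


def decode_caps(mask_hex):
    """Turn a hex mask from /proc/<pid>/status into capability names."""
    mask = int(mask_hex, 16)
    names = []
    for bit in range(mask.bit_length()):
        if mask & (1 << bit):
            known = bit < len(CAP_NAMES)
            names.append(CAP_NAMES[bit] if known else f"unknown_{bit}")
    return names


def parse_cap_lines(status_text):
    """Extract the Cap* fields from the text of a status file."""
    caps = {}
    for line in status_text.splitlines():
        if line.startswith("Cap"):
            field, value = line.split(":", 1)
            caps[field] = value.strip()
    return caps


def thread_caps(pid):
    """Map each thread id of `pid` to its Cap* fields (Linux only)."""
    result = {}
    for tid in sorted(os.listdir(f"/proc/{pid}/task"), key=int):
        with open(f"/proc/{pid}/task/{tid}/status") as status:
            result[int(tid)] = parse_cap_lines(status.read())
    return result


# Example use, with the process id from the listing above:
#   for tid, caps in thread_caps(75550).items():
#       print(tid, decode_caps(caps["CapEff"]))
```

With the values shown above, the main thread's CapEff mask decodes to the eight granted capabilities, while the spawned threads decode to an empty list.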
Removing CAP_BPF from bpfman Clients
One of the advantages of using bpfman is that it does all the loading and unloading of eBPF programs, so it requires CAP_BPF, while clients of bpfman only make gRPC calls to bpfman, so they do not need to be privileged or hold CAP_BPF. Note that this is only true for kernels 5.19 or higher. Prior to kernel 5.19, all eBPF syscalls required CAP_BPF, including those used to access maps shared between the BPF program and the userspace program. In kernel 5.19, a change went in that requires CAP_BPF only for map creation (BPF_MAP_CREATE) and loading programs (BPF_PROG_LOAD). See bpf: refine kernel.unprivileged_bpf_disabled behaviour.
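A client could check the running kernel before deciding whether it needs CAP_BPF. The following Python sketch shows one way to do that; the function names are hypothetical, and comparing the release string is only a heuristic, since distribution kernels may backport the change to older version numbers.

```python
import platform
import re


def kernel_at_least(release, major, minor):
    """Compare the leading "major.minor" of a kernel release string."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


def clients_need_cap_bpf(release=None):
    """True when the kernel predates 5.19, going by version number alone."""
    if release is None:
        release = platform.release()  # e.g. "6.8.0-45-generic" on Linux
    return not kernel_at_least(release, 5, 19)


# Example use:
#   if clients_need_cap_bpf():
#       print("kernel older than 5.19: map access needs CAP_BPF")
```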