This is the multi-page printable view of this section. Click here to print.
Concepts
- 1: Events
- 2: Tracing Policy
- 2.1: Example
- 2.2: Hook points
- 2.3: Options
- 2.4: Selectors
- 2.5: Tags
- 2.6: Kubernetes Identity Aware Policies
- 3: Runtime Hooks
- 4: Enforcement
- 5: BPF programs statistics
- 6: Event throttling
1 - Events
Tetragon’s events are exposed to the system through either the gRPC endpoint or JSON logs. Commands in this section assume the Getting Started guide was used, but are general other than the namespaces chosen and should work in most environments.
JSON
The first way is to observe the raw json output from the stdout container log:
kubectl logs -n kube-system -l app.kubernetes.io/name=tetragon -c export-stdout -f
The raw JSON events provide Kubernetes API, identity metadata, and OS
level process visibility about the executed binary, its parent and the execution
time. A base Tetragon installation will produce process_exec
and process_exit
events encoded in JSON as shown here,
Process execution event
{
"process_exec": {
"process": {
"exec_id": "Z2tlLWpvaG4tNjMyLWRlZmF1bHQtcG9vbC03MDQxY2FjMC05czk1OjEzNTQ4Njc0MzIxMzczOjUyNjk5",
"pid": 52699,
"uid": 0,
"cwd": "/",
"binary": "/usr/bin/curl",
"arguments": "https://ebpf.io/applications/#tetragon",
"flags": "execve rootcwd",
"start_time": "2023-10-06T22:03:57.700327580Z",
"auid": 4294967295,
"pod": {
"namespace": "default",
"name": "xwing",
"container": {
"id": "containerd://551e161c47d8ff0eb665438a7bcd5b4e3ef5a297282b40a92b7c77d6bd168eb3",
"name": "spaceship",
"image": {
"id": "docker.io/tgraf/netperf@sha256:8e86f744bfea165fd4ce68caa05abc96500f40130b857773186401926af7e9e6",
"name": "docker.io/tgraf/netperf:latest"
},
"start_time": "2023-10-06T21:52:41Z",
"pid": 49
},
"pod_labels": {
"app.kubernetes.io/name": "xwing",
"class": "xwing",
"org": "alliance"
},
"workload": "xwing"
},
"docker": "551e161c47d8ff0eb665438a7bcd5b4",
"parent_exec_id": "Z2tlLWpvaG4tNjMyLWRlZmF1bHQtcG9vbC03MDQxY2FjMC05czk1OjEzNTQ4NjcwODgzMjk5OjUyNjk5",
"tid": 52699
},
"parent": {
"exec_id": "Z2tlLWpvaG4tNjMyLWRlZmF1bHQtcG9vbC03MDQxY2FjMC05czk1OjEzNTQ4NjcwODgzMjk5OjUyNjk5",
"pid": 52699,
"uid": 0,
"cwd": "/",
"binary": "/bin/bash",
"arguments": "-c \"curl https://ebpf.io/applications/#tetragon\"",
"flags": "execve rootcwd clone",
"start_time": "2023-10-06T22:03:57.696889812Z",
"auid": 4294967295,
"pod": {
"namespace": "default",
"name": "xwing",
"container": {
"id": "containerd://551e161c47d8ff0eb665438a7bcd5b4e3ef5a297282b40a92b7c77d6bd168eb3",
"name": "spaceship",
"image": {
"id": "docker.io/tgraf/netperf@sha256:8e86f744bfea165fd4ce68caa05abc96500f40130b857773186401926af7e9e6",
"name": "docker.io/tgraf/netperf:latest"
},
"start_time": "2023-10-06T21:52:41Z",
"pid": 49
},
"pod_labels": {
"app.kubernetes.io/name": "xwing",
"class": "xwing",
"org": "alliance"
},
"workload": "xwing"
},
"docker": "551e161c47d8ff0eb665438a7bcd5b4",
"parent_exec_id": "Z2tlLWpvaG4tNjMyLWRlZmF1bHQtcG9vbC03MDQxY2FjMC05czk1OjEzNTQ4NjQ1MjQ1ODM5OjUyNjg5",
"tid": 52699
}
},
"node_name": "gke-john-632-default-pool-7041cac0-9s95",
"time": "2023-10-06T22:03:57.700326678Z"
}
Will only highlight a few important fields here. For a full specification of events see the Reference
section. All events in Tetragon contain a process_exec
block to identify the process generating the
event. For execution events this is the primary block. For Tracing Policy events the
hook that generated the event will attach further data to this. The process_exec
event provides
a cluster wide unique id the process_exec.exec_id
for this process along with the metadata expected
in a Kubernetes cluster process_exec.process.pod
. The binary and args being executed are part of
the event here process_exec.process.binary
and process_exec.process.args
. Finally, a node_name
and time
provide the location and time for the event and will be present in all event types.
A default deployment writes the JSON log to /var/run/cilium/tetragon/tetragon.log
where it can
be exported through normal log collection tooling, e.g. ‘fluentd’, logstash, etc.. The file will
be rotated and compressed by default. See [Helm Options] for details on how to customize this location.
Export Filtering
Export filters restrict the JSON event output to a subset of desirable events. These export filters are configured as a line-separated list of JSON objects, where each object can contain one or more filter expressions. Filters are combined by taking the logical OR of each line-separated filter object and the logical AND of sibling expressions within a filter object. As a concrete example, suppose we had the following filter configuration:
{"event_set": ["PROCESS_EXEC", "PROCESS_EXIT"], "namespace": "foo"}
{"event_set": ["PROCESS_KPROBE"]}
The above filter configuration would result in a match if:
- The event type is
PROCESS_EXEC
orPROCESS_EXIT
AND the pod namespace is “foo”; OR - The event type is
PROCESS_KPROBE
Tetragon supports two groups of export filters: an allowlist and a denylist. If neither is configured, all events are exported. If only an allowlist is configured, event exports are considered default-deny, meaning only the events in the allowlist are exported. The denylist takes precedence over the allowlist in cases where two filter configurations match on the same event.
You can configure export filters using the provided helm options, command line flags, or environment variables.
List of Process Event Filters
Filter | Description |
---|---|
event_set | Filter process events by event types. Supported types include: PROCESS_EXEC , PROCESS_EXIT , PROCESS_KPROBE , PROCESS_UPROBE , PROCESS_TRACAEPOINT , PROCESS_LOADER |
binary_regex | Filter process events by a list of regular expressions of process binary names (e.g. "^/home/kubernetes/bin/kubelet$" ). You can find the full syntax here. |
health_check | Filter process events if their binary names match Kubernetes liveness / readiness probe commands of their corresponding pods. |
namespace | Filter by Kubernetes pod namespaces. An empty string ("" ) filters processes that do not belong to any pod namespace. |
pid | Filter by process PID. |
pid_set | Like pid but also includes processes that are descendants of the listed PIDs. |
pod_regex | Filter by pod name using a list of regular expressions. You can find the full syntax here. |
arguments_regex | Filter by process arguments using a list of regular expressions. You can find the full syntax here. |
labels | Filter events by pod labels using Kubernetes label selector syntax Note that this filter never matches events without the pod field (i.e. host process events). |
policy_names | Filter events by tracing policy names. |
capabilities | Filter events by Linux process capability. |
parent_binary_regex | Filter process events by a list of regular expressions of parent process binary names (e.g. "^/home/kubernetes/bin/kubelet$" ). You can find the full syntax here. |
parent_arguments_regex | Filter by parent process arguments using a list of regular expressions. You can find the full syntax here. |
Field Filtering
In some cases, it is not desirable to include all of the fields exported in Tetragon events by default. In these cases, you can use field filters to restrict the set of exported fields for a given event type. Field filters are configured similarly to export filters, as line-separated lists of JSON objects.
Field filters select fields using the protobuf field mask syntax
under the "fields"
key. You can define a path of fields using field
names separated by period (.
) characters. To define multiple paths in
a single field filter, separate them with comma (,
) characters. For
example, "fields":"process.binary,parent.binary,pod.name"
would select
only the process.binary
, parent.binary
, and pod.name
fields.
By default, a field filter applies to all process events, although you
can control this behaviour with the "event_set"
key. For example, you
can apply a field filter to PROCESS_CONNECT
and PROCESS_CLOSE
events
by specifying "event_set":["PROCESS_CONNECT","PROCESS_CLOSE"]
in the
filter definition.
Each field filter has an "action"
that determines what the filter
should do with the selected field. The supported action types are
"INCLUDE"
and "EXCLUDE"
. A value of "INCLUDE"
will cause the field
to appear in an event, while a value of "EXCLUDE"
will hide the field.
In the absence of any field filter for a given event type, the export
will include all fields by default. Defining one or more "INCLUDE"
filters for a given event type changes that behaviour to exclude all
other event types by default.
As a simple example of the above, consider the case where we want to include
only exec_id
and parent_exec_id
in all event types except for
PROCESS_EXEC
:
{"fields":"process.exec_id,process.parent_exec_id", "event_set": ["PROCESS_EXEC"], "invert_event_set": true, "action": "INCLUDE"}
Redacting Sensitive Information
Since Tetragon traces the entire system, event exports might sometimes contain
sensitive information (for example, a secret passed via a command line argument
to a process). To prevent this information from being exfiltrated via Tetragon
JSON export, Tetragon provides a mechanism called Redaction Filters which can be
used to string patterns to redact from exported process arguments. These filters are written
in JSON and passed to the Tetragon agent via the --redaction-filters
command
line flag or the redactionFilters
Helm value.
To perform redactions, redaction filters define RE2 regular expressions in the
redact
field. Any capture groups in these RE2 regular expressions are redacted and
replaced with "*****"
.
\Wpasswd\W?
would be written as {"redact": "\\Wpasswd\\W?"}
.For more control, you can select which binary or binaries should have their
arguments redacted with the binary_regex
field.
As a concrete example, the following will redact all passwords passed to
processes with the "--password"
argument:
{"redact": ["--password(?:\\s+|=)(\\S*)"]}
Now, an event that contains the string "--password=foo"
would have that string
replaced with "--password=*****"
.
Suppose we also see some passwords passed via the -p shorthand for a specific binary, foo. We can also redact these as follows:
{"binary_regex": ["(?:^|/)foo$"], "redact": ["-p(?:\\s+|=)(\\S*)"]}
With both of the above redaction filters in place, we are now redacting all password arguments.
tetra
CLI
A second way is to use the tetra
CLI. This
has the advantage that it can also be used to filter and pretty print the output. The tool
allows filtering by process, pod, and other fields. To install tetra see the
Tetra Installation Guide
To start printing events run:
kubectl logs -n kube-system -l app.kubernetes.io/name=tetragon -c export-stdout -f | tetra getevents -o compact
The tetra
CLI is also available inside tetragon
container.
kubectl exec -it -n kube-system ds/tetragon -c tetragon -- tetra getevents -o compact
This was used in the quick start and generates a pretty printing of the events, To further filter by a specific binary and/or pod do the following,
kubectl logs -n kube-system -l app.kubernetes.io/name=tetragon -c export-stdout -f | tetra getevents -o compact --processes curl --pod xwing
Will filter and report just the relevant events.
🚀 process default/xwing /usr/bin/curl https://ebpf.io/applications/#tetragon
💥 exit default/xwing /usr/bin/curl https://ebpf.io/applications/#tetragon 60
gRPC
In addition Tetragon can expose a gRPC endpoint listeners may attach to. The
gRPC is exposed by default helm install on localhost:54321
, but the address
can be configured with the --server-address
option. This can be
set from helm with the tetragon.grpc.address
flag or disabled completely if
needed with tetragon.grpc.enabled
.
helm install tetragon cilium/tetragon -n kube-system --set tetragon.grpc.enabled=true --set tetragon.grpc.address=localhost:54321
An example gRPC endpoint is the Tetra CLI when its not piped JSON output directly,
kubectl exec -ti -n kube-system ds/tetragon -c tetragon -- tetra getevents -o compact
2 - Tracing Policy
Tetragon’s TracingPolicy
is a user-configurable Kubernetes custom resource (CR) that
allows users to trace arbitrary events in the kernel and optionally define
actions to take on a match. Policies consist of a hook point (kprobes,
tracepoints, and uprobes are supported), and selectors for in-kernel filtering
and specifying actions. For more details, see
hook points page and the
selectors page.
TracingPolicy
allows for powerful, yet low-level configuration and, as such,
requires knowledge about the Linux kernel and containers to avoid unexpected
issues such as TOCTOU bugs.For the complete custom resource definition (CRD) refer to the YAML file
cilium.io_tracingpolicies.yaml
.
One practical way to explore the CRD is to use kubectl explain
against a
Kubernetes API server on which it is installed, for example kubectl explain tracingpolicy.spec.kprobes
provides field-specific documentation and details
on kprobe spec.
Tracing Policies can be loaded and unloaded at runtime in Tetragon, or on startup using flags.
- With Kubernetes, you can use
kubectl
to add and remove aTracingPolicy
. - You can use
tetra
gRPC CLI to add and remove aTracingPolicy
. - You can use the
--tracing-policy
and--tracing-policy-dir
flags to statically add policies at startup time, see more in the daemon configuration page.
Hence, even though Tracing Policies are structured as a Kubernetes CR, they can also be used in non-Kubernetes environments using the last two loading methods.
2.1 - Example
To discover TracingPolicy
, let’s understand via an example that will be
explained, part by part, in this document:
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
name: "fd-install"
spec:
kprobes:
- call: "fd_install"
syscall: false
args:
- index: 0
type: "int"
- index: 1
type: "file"
selectors:
- matchArgs:
- index: 1
operator: "Equal"
values:
- "/tmp/tetragon"
matchActions:
- action: Sigkill
The policy checks for file descriptors being created, and sends a SIGKILL
signal to any process that
creates a file descriptor to a file named /tmp/tetragon
. We discuss the policy in more detail
next.
Required fields
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
name: "fd-install"
The first part follows a common pattern among all Cilium Policies or more widely Kubernetes object. It first declares the Kubernetes API used, then the kind of Kubernetes object it is in this API and an arbitrary name for the object that has to comply with Kubernetes naming convention.
Hook point
spec:
kprobes:
- call: "fd_install"
syscall: false
args:
- index: 0
type: "int"
- index: 1
type: "file"
The beginning of the specification describes the hook point to use. Here we are
using a kprobe, hooking on the kernel function fd_install
. That’s the kernel
function that gets called when a new file descriptor is created. We
indicate that it’s not a syscall, but a regular kernel function. We then
specify the function arguments, so that Tetragon’s BPF code will extract
and optionally perform filtering on them.
See the hook points page for further information on the various hook points available and arguments.
Selectors
selectors:
- matchArgs:
- index: 1
operator: "Equal"
values:
- "/tmp/tetragon"
matchActions:
- action: Sigkill
Selectors allow you to filter on the events to extract only a subset of the events based on different properties and optionally take an action.
In the example, we filter on the argument at index 1, passing a file
struct
to the function. Tetragon has the knowledge on how to apply the Equal
operator over a Linux kernel file
struct and match on the
path of the file.
Then we add the Sigkill
action, meaning, that any match of the selector
should send a SIGKILL signal to the process that initiated the event.
Learn more about the various selectors in the dedicated selectors page.
Message
The message
field is an optional short message that will be included in
the generated event to inform users what is happening.
spec:
kprobes:
- call: "fd_install"
message: "Installing a file descriptor"
Tags
Tags are optional fields of a Tracing Policy that are used to categorize generated events. Further reference here: Tags documentation.
Policy effect
First, let’s create the /tmp/tetragon
file with some content:
echo eBPF! > /tmp/tetragon
You can save the policy in an example.yaml
file, compile Tetragon locally, and start Tetragon:
sudo ./tetragon --bpf-lib bpf/objs --tracing-policy example.yaml
(See Quick Kubernetes Install and Quick Local Docker Install for other ways to start Tetragon.)
Once the Tetragon starts, you can monitor events using tetra
, the tetragon CLI:
./tetra tetra getevents -o compact
Reading the /tmp/tetragon
file with cat
:
cat /tmp/tetragon
Results in the following events:
🚀 process /usr/bin/cat /tmp/tetragon
📬 open /usr/bin/cat /tmp/tetragon
💥 exit /usr/bin/cat /tmp/tetragon SIGKILL
And the shell where the cat
command was performed will return:
Killed
See more
For more examples of tracing policies, take a look at the examples/tracingpolicy folder in the Tetragon repository. Also read the following sections on hook points and selectors.
2.2 - Hook points
Tetragon can hook into the kernel using kprobes
and tracepoints
, as well as in user-space
programs using uprobes
. Users can configure these hook points using the correspodning sections of
the TracingPolicy
specification (.spec
). These hook points include arguments and return values
that can be specified using the args
and returnArg
fields as detailed in the following sections.
Kprobes
Kprobes enables you to dynamically hook into any kernel function and execute BPF code. Because kernel functions might change across versions, kprobes are highly tied to your kernel version and, thus, might not be portable across different kernels.
Conveniently, you can list all kernel symbols reading the /proc/kallsyms
file. For example to search for the write
syscall kernel function, you can
execute sudo grep sys_write /proc/kallsyms
, the output should be similar to
this, minus the architecture specific prefixes.
ffffdeb14ea712e0 T __arm64_sys_writev
ffffdeb14ea73010 T ksys_write
ffffdeb14ea73140 T __arm64_sys_write
ffffdeb14eb5a460 t proc_sys_write
ffffdeb15092a700 d _eil_addr___arm64_sys_writev
ffffdeb15092a740 d _eil_addr___arm64_sys_write
You can see that the exact name of the symbol for the write syscall on our
kernel version is __arm64_sys_write
. Note that on x86_64
, the prefix would
be __x64_
instead of __arm64_
.
Kernel symbols contain an architecture specific prefix when they refer to syscall symbols. To write portable tracing policies, i.e. policies that can run on multiple architectures, just use the symbol name without the prefix.
For example, instead of writing call: "__arm64_sys_write"
or call: "__x64_sys_write"
, just write call: "sys_write"
, Tetragon will adapt and add
the correct prefix based on the architecture of the underlying machine. Note
that the event generated as output currently includes the prefix.
In our example, we will explore a kprobe
hooking into the
fd_install
kernel function. The fd_install
kernel function is called each time a file
descriptor is installed into the file descriptor table of a process, typically
referenced within system calls like open
or openat
. Hooking fd_install
has its benefits and limitations, which are out of the scope of this guide.
spec:
kprobes:
- call: "fd_install"
syscall: false
syscall
field, specific to a kprobe
spec, with default value
false
, that indicates whether Tetragon will hook a syscall or just a regular
kernel function. Tetragon needs this information because syscall and kernel
function use a different ABI.Kprobes calls can be defined independently in different policies, or together in the same Policy. For example, we can define trace multiple kprobes under the same tracing policy:
spec:
kprobes:
- call: "sys_read"
syscall: true
# [...]
- call: "sys_write"
syscall: true
# [...]
Tracepoints
Tracepoints are statically defined in the kernel and have the advantage of being stable across kernel versions and thus more portable than kprobes.
To see the list of tracepoints available on your kernel, you can list them
using sudo ls /sys/kernel/debug/tracing/events
, the output should be similar
to this.
alarmtimer ext4 iommu page_pool sock
avc fib ipi pagemap spi
block fib6 irq percpu swiotlb
bpf_test_run filelock jbd2 power sync_trace
bpf_trace filemap kmem printk syscalls
bridge fs_dax kvm pwm task
btrfs ftrace libata qdisc tcp
cfg80211 gpio lock ras tegra_apb_dma
cgroup hda mctp raw_syscalls thermal
clk hda_controller mdio rcu thermal_power_allocator
cma hda_intel migrate regmap thermal_pressure
compaction header_event mmap regulator thp
cpuhp header_page mmap_lock rpm timer
cros_ec huge_memory mmc rpmh tlb
dev hwmon module rseq tls
devfreq i2c mptcp rtc udp
devlink i2c_slave napi sched vmscan
dma_fence initcall neigh scmi wbt
drm interconnect net scsi workqueue
emulation io_uring netlink signal writeback
enable iocost oom skb xdp
error_report iomap page_isolation smbus xhci-hcd
You can then choose the subsystem that you want to trace, and look the
tracepoint you want to use and its format. For example, if we choose the
netif_receive_skb
tracepoints from the net
subsystem, we can read its
format with sudo cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/format
,
the output should be similar to the following.
name: netif_receive_skb
ID: 1398
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:void * skbaddr; offset:8; size:8; signed:0;
field:unsigned int len; offset:16; size:4; signed:0;
field:__data_loc char[] name; offset:20; size:4; signed:0;
print fmt: "dev=%s skbaddr=%px len=%u", __get_str(name), REC->skbaddr, REC->len
Similarly to kprobes, tracepoints can also hook into system calls. For more
details, see the raw_syscalls
and syscalls
subysystems.
An example of tracepoints TracingPolicy
could be the following, observing all
syscalls and getting the syscall ID from the argument at index 4:
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
name: "raw-syscalls"
spec:
tracepoints:
- subsystem: "raw_syscalls"
event: "sys_enter"
args:
- index: 4
type: "int64"
Uprobes
Uprobes are similar to kprobes, but they allow you to dynamically hook into any user-space function and execute BPF code. Uprobes are also tied to the binary version of the user-space program, so they may not be portable across different versions or architectures.
To use uprobes, you need to specify the path to the executable or library file,
and the symbol of the function you want to probe. You can use tools like
objdump
, nm
, or readelf
to find the symbol of a function in a binary
file. For example, to find the readline symbol in /bin/bash
using nm
, you
can run:
nm -D /bin/bash | grep readline
The output should look similar to this, with a few lines redacted:
[...]
000000000009f2b0 T pcomp_set_readline_variables
0000000000097e40 T posix_readline_initialize
00000000000d5690 T readline
00000000000d52f0 T readline_internal_char
00000000000d42d0 T readline_internal_setup
[...]
You can see in the nm
output: first the symbol value, then the symbol type,
for the readline
symbol T
meaning that this symbol is in the text (code)
section of the binary, and finally the symbol name. This confirms that the
readline
symbol is present in the /bin/bash
binary and might be a function
name that we can hook with a uprobe.
You can define multiple uprobes in the same policy, or in different policies. You can also combine uprobes with kprobes and tracepoints to get a comprehensive view of the system behavior.
Here is an example of a policy that defines an uprobe for the readline function in the bash executable, and applies it to all processes that use the bash binary:
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
name: "example-uprobe"
spec:
uprobes:
- path: "/bin/bash"
symbols:
- "readline"
This example shows how to use uprobes to hook into the readline function running in all the bash shells.
LSM BPF
LSM BPF programs allow runtime instrumentation of the LSM hooks by privileged users to implement system-wide MAC (Mandatory Access Control) and Audit policies using eBPF.
List of LSM hooks which can be instrumented can be found in security/security.c
.
To verify if BPF LSM is available use the following command:
cat /boot/config-$(uname -r) | grep BPF_LSM
The output should be similar to this if BPF LSM is supported:
CONFIG_BPF_LSM=y
Then, if provided above conditions are met, use this command to check if BPF LSM is enabled:
cat /sys/kernel/security/lsm
The output might look like this:
bpf,lockdown,integrity,apparmor
If the output includes the bpf
, than BPF LSM is enabled. Otherwise, you can modify /etc/default/grub
:
GRUB_CMDLINE_LINUX="lsm=lockdown,integrity,apparmor,bpf"
Then, update the grub configuration and restart the system.
The provided example of LSM BPF TracingPolicy
monitors access to files
/etc/passwd
and /etc/shadow
with /usr/bin/cat
executable.
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
name: "lsm-file-open"
spec:
lsmhooks:
- hook: "file_open"
args:
- index: 0
type: "file"
selectors:
- matchBinaries:
- operator: "In"
values:
- "/usr/bin/cat"
matchArgs:
- index: 0
operator: "Equal"
values:
- "/etc/passwd"
- "/etc/shadow"
Arguments
Kprobes, uprobes and tracepoints all share a needed arguments fields called args
. It is a list of
arguments to include in the trace output. Tetragon’s BPF code requires
information about the types of arguments to properly read, print and
filter on its arguments. This information needs to be provided by the user under the
args
section. For the available
types,
check the TracingPolicy
CRD.
Following our example, here is the part that defines the arguments:
args:
- index: 0
type: "int"
- index: 1
type: "file"
Each argument can optionally include a ’label’ parameter, which will be included in the output. This can be used to annotate the arguments to help with understanding and processing the output. As an example, here is the same definition, with an appropriate label on the int argument:
args:
- index: 0
type: "int"
label: "FD"
- index: 1
type: "file"
To properly read and hook onto the fd_install(unsigned int fd, struct file *file)
function, the YAML snippet above tells the BPF code that the first
argument is an int
and the second argument is a file
, which is the
struct file
of the kernel. In this way, the BPF code and its printer can properly collect
and print the arguments.
These types are sorted by the index
field, where you can specify the order.
The indexing starts with 0. So, index: 0
means, this is going to be the first
argument of the function, index: 1
means this is going to be the second
argument of the function, etc.
Note that for some args types, char_buf
and char_iovec
, there are
additional fields named returnCopy
and sizeArgIndex
available:
returnCopy
indicates that the corresponding argument should be read later (when the kretprobe for the symbol is triggered) because it might not be populated when the kprobe is triggered at the entrance of the function. For example, a buffer supplied toread(2)
won’t have content until kretprobe is triggered.sizeArgIndex
indicates the (1-based, see warning below) index of the arguments that represents the size of thechar_buf
oriovec
. For example, forwrite(2)
, the third argument,size_t count
is the number ofchar
element that we can read from theconst void *buf
pointer from the second argument. Similarly, if we would like to capture the__x64_sys_writev(long, iovec *, vlen)
syscall, theniovec
has a size ofvlen
, which is going to be the third argument.
sizeArgIndex
is inconsistent at the moment and does not take the index, but
the number of the index (or index + 1). So if the size is the third argument,
index 2, the value should be 3.These flags can be combined, see the example below.
- call: "sys_write"
syscall: true
args:
- index: 0
type: "int"
- index: 1
type: "char_buf"
returnCopy: true
sizeArgIndex: 3
- index: 2
type: "size_t"
Note that you can specify which arguments you would like to print from a
specific syscall. For example if you don’t care about the file descriptor,
which is the first int
argument with index: 0
and just want the char_buf
,
what is written, then you can leave this section out and just define:
args:
- index: 1
type: "char_buf"
returnCopy: true
sizeArgIndex: 3
- index: 2
type: "size_t"
This tells the printer to skip printing the int
arg because it’s not useful.
For char_buf
type up to the 4096 bytes are stored. Data with bigger size are
cut and returned as truncated bytes.
You can specify maxData
flag for char_buf
type to read maximum possible data
(currently 327360 bytes), like:
args:
- index: 1
type: "char_buf"
maxData: true
sizeArgIndex: 3
- index: 2
type: "size_t"
This field is only used for char_buff
data. When this value is false (default),
the bpf program will fetch at most 4096 bytes. In later kernels (>=5.4) tetragon
supports fetching up to 327360 bytes if this flag is turned on.
The maxData
flag does not work with returnCopy
flag at the moment, so it’s
usable only for syscalls/functions that do not require return probe to read the
data.
Return values
A TracingPolicy
spec can specify that the return value should be reported in
the tracing output. To do this, the return
parameter of the call needs to be
set to true
, and the returnArg
parameter needs to be set to specify the
type
of the return argument. For example:
- call: "sk_alloc"
syscall: false
return: true
args:
- index: 1
type: int
label: "family"
returnArg:
type: sock
In this case, the sk_alloc
hook is specified to return a value of type sock
(a pointer to a struct sock
). Whenever the sk_alloc
hook is hit, not only
will it report the family
parameter in index 1, it will also report the socket
that was created.
Return values for socket tracking
A unique feature of a sock
being returned from a hook such as sk_alloc
is that
the socket it refers to can be tracked. Most networking hooks in the network stack
are run in a context that is not that of the process that owns the socket for which
the actions relate; this is because networking happens asynchronously and not
entirely in-line with the process. The sk_alloc
hook does, however, occur in the
context of the process, such that the task, the PID, and the TGID are of the process
that requested that the socket was created.
Specifying socket tracking tells Tetragon to store a mapping between the socket
and the process’ PID and TGID; and to use that mapping when it sees the socket in a
sock
argument in another hook to replace the PID and TGID of the context with the
process that actually owns the socket. This can be done by adding a returnArgAction
to the call. Available actions are TrackSock
and UntrackSock
.
See TrackSock
and UntrackSock
.
- call: "sk_alloc"
syscall: false
return: true
args:
- index: 1
type: int
label: "family"
returnArg:
type: sock
returnArgAction: TrackSock
Socket tracking is only available on kernels >=5.3.
Lists
It’s possible to define list of functions and use it in the kprobe’s call
field.
Following example traces all sys_dup[23]
syscalls.
spec:
lists:
- name: "dups"
type: "syscalls"
values:
- "sys_dup"
- "sys_dup2"
- "sys_dup3"
kprobes:
- call: "list:dups"
It is basically a shortcut for following policy:
spec:
kprobes:
- call: "sys_dup"
syscall: true
- call: "sys_dup2"
syscall: true
- call: "sys_dup3"
syscall: true
As shown in subsequent examples, its main benefit is allowing a single definition for calls that have the same filters.
The list is defined under lists
field with arbitrary values for name
and values
fields.
spec:
lists:
- name: "dups"
type: "syscalls"
values:
- "sys_dup"
- "sys_dup2"
- "sys_dup3"
...
It’s possible to define multiple lists.
spec:
lists:
- name: "dups"
type: "syscalls"
values:
- "sys_dup"
- "sys_dup2"
- "sys_dup3"
name: "another"
- "sys_open"
- "sys_close"
Syscalls specified with sys_
prefix are translated to their 64 bit equivalent function names.
It’s possible to specify a syscall for an alternative ABI by using the ABI name as a prefix. For example:
spec:
lists:
- name: "dups"
type: "syscalls"
values:
- "sys_dup"
- "i386/sys_dup"
name: "another"
- "sys_open"
- "sys_close"
Specific list can be referenced in kprobe’s call
field with "list:NAME"
value.
spec:
lists:
- name: "dups"
...
kprobes:
- call: "list:dups"
The kprobe definition creates a kprobe for each item in the list and shares the rest of the config specified for kprobe.
List can also specify type
field that implies extra checks on the values (like for syscall
type)
or denote that the list is generated automatically (see below).
User must specify syscall
type for list with syscall functions. Also syscall
functions
can’t be mixed with regular functions in the list.
The additional selector configuration is shared with all functions in the list. In following example we create 3 kprobes that share the same pid filter.
spec:
lists:
- name: "dups"
type: "syscalls"
values:
- "sys_dup"
- "sys_dup2"
- "sys_dup3"
kprobes:
- call: "list:dups"
selectors:
- matchPIDs:
- operator: In
followForks: true
isNamespacePID: false
values:
- 12345
It’s possible to use argument filter together with the list
.
It’s important to understand that the argument will be retrieved by using the specified argument type for all the functions in the list.
Following example adds argument filter for first argument on all functions in dups list to match value 9999.
spec:
lists:
- name: "dups"
type: "syscalls"
values:
- "sys_dup"
- "sys_dup2"
- "sys_dup3"
kprobes:
- call: "list:dups"
args:
- index: 0
type: int
selectors:
- matchArgs:
- index: 0
operator: "Equal"
values:
- 9999
There are two additional special types of generated lists.
The generated_syscalls
type of list that generates list with all possible
syscalls on the system.
Following example traces all syscalls for /usr/bin/kill
binary.
spec:
lists:
- name: "all-syscalls"
type: "generated_syscalls"
kprobes:
- call: "list:all-syscalls"
selectors:
- matchBinaries:
- operator: "In"
values:
- "/usr/bin/kill"
The generated_ftrace
type of list that generates functions from ftrace available_filter_functions
file with specified filter. The filter is specified with pattern
field and expects regular expression.
Following example traces all kernel ksys_*
functions for /usr/bin/kill
binary.
spec:
lists:
- name: "ksys"
type: "generated_ftrace"
pattern: "^ksys_*"
kprobes:
- call: "list:ksys"
selectors:
- matchBinaries:
- operator: "In"
values:
- "/usr/bin/kill"
Note that if syscall list is used in selector with InMap operator, the argument type needs to be syscall64
, like.
spec:
lists:
- name: "dups"
type: "syscalls"
values:
- "sys_dup"
- "i386/sys_dup"
tracepoints:
- subsystem: "raw_syscalls"
event: "sys_enter"
args:
- index: 4
type: "syscall64"
selectors:
- matchArgs:
- index: 0
operator: "InMap"
values:
- "list:dups"
2.3 - Options
It’s possible to pass options through spec file as an array of name and value pairs:
spec:
options:
- name: "option-1"
value: "True"
- name: "option-2"
value: "10"
Options array is passed and processed by each hook used in the spec file that supports options. At the moment it’s availabe for kprobe and uprobe hooks.
Kprobe Options
: options for kprobe hooks.
Kprobe options
disable-kprobe-multi
: disable kprobe multi link
disable-kprobe-multi
This option disables kprobe multi link interface for all the kprobes defined in the spec file. If enabled, all the defined kprobes will be atached through standard kprobe interface. It stays enabled for another spec file without this option.
It takes boolean as value, by default it’s false.
Example:
options:
- name: "disable-kprobe-multi"
value: "1"
Uprobe Options
: options for uprobe hooks.
Uprobe options
disable-uprobe-multi
: disable uprobe multi link
disable-uprobe-multi
This option disables uprobe multi link interface for all the uprobes defined in the spec file. If enabled, all the defined uprobes will be atached through standard uprobe interface. It stays enabled for another spec file without this option.
It takes boolean as value, by default it’s false.
Example:
options:
- name: "disable-uprobe-multi"
value: "1"
2.4 - Selectors
Selectors are a way to perform in-kernel BPF filtering on the events to export, or on the events on which to apply an action.
A TracingPolicy
can contain from 0 to 5 selectors. A selector is composed of
1 or more filters. The available filters are the following:
matchArgs
: filter on the value of arguments.matchReturnArgs
: filter on the return value.matchPIDs
: filter on PID.matchBinaries
: filter on binary path.matchNamespaces
: filter on Linux namespaces.matchCapabilities
: filter on Linux capabilities.matchNamespaceChanges
: filter on Linux namespaces changes.matchCapabilityChanges
: filter on Linux capabilities changes.matchActions
: apply an action on selector matching.matchReturnActions
: apply an action on return selector matching.
Arguments filter
Arguments filters can be specified under the matchArgs
field and provide
filtering based on the value of the function’s argument.
In the next example, a selector is defined with a matchArgs
filter that tells
the BPF code to process only the function call for which the second argument,
index equal to 1, concerns the file under the path /etc/passwd
or
/etc/shadow
. It’s using the operator Equal
to match against the value of
the argument.
Note that conveniently, we can match against a path directly when the argument
is of type file
.
selectors:
- matchArgs:
- index: 1
operator: "Equal"
values:
- "/etc/passwd"
- "/etc/shadow"
The available operators for matchArgs
are:
Equal
NotEqual
Prefix
Postfix
Mask
Further examples
In the previous example, we used the operator Equal
, but we can also use the
Prefix
operator and match against all files under /etc
with:
selectors:
- matchArgs:
- index: 1
operator: "Prefix"
values:
- "/etc"
In this situation, an event will be created every time a process tries to
access a file under /etc
.
Although it makes less sense, you can also match over the first argument, to only detect events that will use the file descriptor 4, which is usually the first that come afters stdin, stdout and stderr in process. And combine that with the previous example.
- matchArgs:
- index: 0
operator: "Equal"
values:
- "3"
- index: 1
operator: "Prefix"
values:
- "/etc"
Return args filter
Arguments filters can be specified under the returnMatchArgs
field and
provide filtering based on the value of the function return value. It allows
you to filter on the return value, thus success, error or value returned by a
kernel call.
matchReturnArgs:
- operator: "NotEqual"
values:
- 0
The available operators for matchReturnArgs
are:
Equal
NotEqual
Prefix
Postfix
A use case for this would be to detect the failed access to certain files, like
/etc/shadow
. Doing cat /etc/shadow
will use a openat
syscall that will
returns -1
for a failed attempt with an unprivileged user.
PIDs filter
PIDs filters can be specified under the matchPIDs
field and provide filtering
based on the value of host pid of the process. For example, the following
matchPIDs
filter tells the BPF code that observe only hooks for which the
host PID is equal to either pid1
or pid2
or pid3
:
- matchPIDs:
- operator: "In"
followForks: true
values:
- "pid1"
- "pid2"
- "pid3"
The available operators for matchPIDs
are:
In
NotIn
Further examples
Another example can be to collect all processes not associated with a
container’s init PID, which is equal to 1. In this way, we are able to detect
if there was a kubectl exec
performed inside a container because processes
created by kubectl exec
are not children of PID 1.
- matchPIDs:
- operator: NotIn
followForks: false
isNamespacePID: true
values:
- 1
Binaries filter
Binary filters can be specified under the matchBinaries
field and provide
filtering based on the value of a certain binary name. For example, the
following matchBinaries
selector tells the BPF code to process only system
calls and kernel functions that are coming from cat
or tail
.
- matchBinaries:
- operator: "In"
values:
- "/usr/bin/cat"
- "/usr/bin/tail"
The available operators for matchBinaries
are:
In
NotIn
Prefix
NotPrefix
Postfix
NotPostfix
The values
field has to be a map of strings
. The default behaviour
is followForks: true
, so all the child processes are followed.
Follow children
the matchBinaries
filter can be configured to also apply to children of matching processes. To do
this, set followChildren
to true
. For example:
- matchBinaries:
- operator: "In"
values:
- "/usr/sbin/sshd"
followChildren: true
There are a number of limitations when using followChildren:
- Children created before the policy was installed will not be matched
- The number of
matchBinaries
sections withfollowChildren: true
cannot exceed 64. - Operators other than
In
are not supported.
Further examples
One example can be to monitor all the sys_write
system calls which are
coming from the /usr/sbin/sshd
binary and its child processes and writing to
stdin/stdout/stderr
.
This is how we can monitor what was written to the console by different users
during different ssh sessions. The matchBinaries
selector in this case is the
following:
- matchBinaries:
- operator: "In"
values:
- "/usr/sbin/sshd"
while the whole kprobe
call is the following:
- call: "sys_write"
syscall: true
args:
- index: 0
type: "int"
- index: 1
type: "char_buf"
sizeArgIndex: 3
- index: 2
type: "size_t"
selectors:
# match to /sbin/sshd
- matchBinaries:
- operator: "In"
values:
- "/usr/sbin/sshd"
# match to stdin/stdout/stderr
matchArgs:
- index: 0
operator: "Equal"
values:
- "1"
- "2"
- "3"
Namespaces filter
Namespaces filters can be specified under the matchNamespaces
field and
provide filtering of calls based on Linux namespace. You can specify the
namespace inode or use the special host_ns
keyword, see the example and
description for more information.
An example syntax is:
- matchNamespaces:
- namespace: Pid
operator: In
values:
- "4026531836"
- "4026531835"
This will match if: [Pid
namespace is 4026531836
] OR
[Pid
namespace is
4026531835
]
namespace
can be:Uts
,Ipc
,Mnt
,Pid
,PidForChildren
,Net
,Cgroup
, orUser
.Time
andTimeForChildren
are also available in Linux >= 5.6.operator
can beIn
orNotIn
values
can be raw numeric values (i.e. obtained fromlsns
) or"host_ns"
which will automatically be translated to the appropriate value.
Limitations
- We can have up to 4
values
. These can be both numeric andhost_ns
inside a singlenamespace
. - We can have up to 4
namespace
values undermatchNamespaces
in Linux kernel < 5.3. In Linux >= 5.3 we can have up to 10 values (i.e. the maximum number of namespaces that modern kernels provide).
Further examples
We can have multiple namespace filters:
selectors:
- matchNamespaces:
- namespace: Pid
operator: In
values:
- "4026531836"
- "4026531835"
- namespace: Mnt
operator: In
values:
- "4026531833"
- "4026531834"
This will match if: ([Pid
namespace is 4026531836
] OR
[Pid
namespace is
4026531835
]) AND
([Mnt
namespace is 4026531833
] OR
[Mnt
namespace
is 4026531834
])
Use cases examples
Generate a kprobe event if
/etc/shadow
was opened by/bin/cat
which either had hostNet
orMnt
namespace access
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
name: "example_ns_1"
spec:
kprobes:
- call: "fd_install"
syscall: false
args:
- index: 0
type: int
- index: 1
type: "file"
selectors:
- matchBinaries:
- operator: "In"
values:
- "/bin/cat"
matchArgs:
- index: 1
operator: "Equal"
values:
- "/etc/shadow"
matchNamespaces:
- namespace: Mnt
operator: In
values:
- "host_ns"
- matchBinaries:
- operator: "In"
values:
- "/bin/cat"
matchArgs:
- index: 1
operator: "Equal"
values:
- "/etc/shadow"
matchNamespaces:
- namespace: Net
operator: In
values:
- "host_ns"
This example has 2 selectors
. Note that each selector starts with -
.
Selector 1:
- matchBinaries:
- operator: "In"
values:
- "/bin/cat"
matchArgs:
- index: 1
operator: "Equal"
values:
- "/etc/shadow"
matchNamespaces:
- namespace: Mnt
operator: In
values:
- "host_ns"
Selector 2:
- matchBinaries:
- operator: "In"
values:
- "/bin/cat"
matchArgs:
- index: 1
operator: "Equal"
values:
- "/etc/shadow"
matchNamespaces:
- namespace: Net
operator: In
values:
- "host_ns"
We have [
Selector1 OR
Selector2]
. Inside each selector we have filters
.
Both selectors have 3 filters (i.e. matchBinaries
, matchArgs
, and
matchNamespaces
) with different arguments. Adding a -
in the beginning of a
filter will result in a new selector.
So the previous CRD will match if:
[
binary == /bin/cat AND
arg1 == /etc/shadow AND
MntNs == host]
OR
[
binary == /bin/cat AND
arg1 == /etc/shadow AND
NetNs is host]
We can modify the previous example as follows:
Generate a kprobe event if
/etc/shadow
was opened by/bin/cat
which has hostNet
andMnt
namespace access
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
name: "example_ns_2"
spec:
kprobes:
- call: "fd_install"
syscall: false
args:
- index: 0
type: int
- index: 1
type: "file"
selectors:
- matchBinaries:
- operator: "In"
values:
- "/bin/cat"
matchArgs:
- index: 1
operator: "Equal"
values:
- "/etc/shadow"
matchNamespaces:
- namespace: Mnt
operator: In
values:
- "host_ns"
- namespace: Net
operator: In
values:
- "host_ns"
Here we have a single selector. This CRD will match if:
[
binary == /bin/cat AND
arg1 == /etc/shadow AND
(
MntNs == host AND
NetNs == host)
]
Capabilities filter
Capabilities filters can be specified under the matchCapabilities
field and
provide filtering of calls based on Linux capabilities in the specific sets.
An example syntax is:
- matchCapabilities:
- type: Effective
operator: In
values:
- "CAP_CHOWN"
- "CAP_NET_RAW"
This will match if: [Effective
capabilities contain CAP_CHOWN
] OR
[Effective
capabilities contain CAP_NET_RAW
]
type
can be:Effective
,Inheritable
, orPermitted
.operator
can beIn
orNotIn
values
can be any supported capability. A list of all supported capabilities can be found in/usr/include/linux/capability.h
.
Limitations
- There is no limit in the number of capabilities listed under
values
. - Only one
type
field can be specified undermatchCapabilities
.
Namespace changes filter
Namespace changes filter can be specified under the matchNamespaceChanges
field and provide filtering based on calls that are changing Linux namespaces.
This filter can be useful to track execution of code in a new namespace or even
container escapes that change their namespaces.
For instance, if an unprivileged process creates a new user namespace, it gains full privileges within that namespace. This grants the process the ability to perform some privileged operations within the context of this new namespace that would otherwise only be available to privileged root user. As a result, such filter is useful to track namespace creation, which can be abused by untrusted processes.
To keep track of the changes, when a process_exec
happens, the namespaces of
the process are recorded and these are compared with the current namespaces on
the event with a matchNamespaceChanges
filter.
matchNamespaceChanges:
- operator: In
values:
- "Mnt"
The unshare
command, or executing in the host namespace using nsenter
can
be used to test this feature. See a
demonstration example
of this feature.
Capability changes filter
Capability changes filter can be specified under the matchCapabilityChanges
field and provide filtering based on calls that are changing Linux capabilities.
To keep track of the changes, when a process_exec
happens, the capabilities
of the process are recorded and these are compared with the current
capabilities on the event with a matchCapabilityChanges
filter.
matchCapabilityChanges:
- type: Effective
operator: In
isNamespaceCapability: false
values:
- "CAP_SETUID"
See a demonstration example of this feature.
Actions filter
Actions filters are a list of actions that execute when an appropriate selector
matches. They are defined under matchActions
and currently, the following
action
types are supported:
- Sigkill action
- Signal action
- Override action
- FollowFD action
- UnfollowFD action
- CopyFD action
- GetUrl action
- DnsLookup action
- Post action
- NoPost action
- TrackSock action
- UntrackSock action
- Notify Enforcer action
Sigkill
, Override
, FollowFD
, UnfollowFD
, CopyFD
, Post
,
TrackSock
and UntrackSock
are
executed directly in the kernel BPF code while GetUrl
and DnsLookup
are
happening in userspace after the reception of events.Sigkill action
Sigkill
action terminates synchronously the process that made the call that
matches the appropriate selectors from the kernel. In the example below, every
sys_write
system call with a PID not equal to 1 or 0 attempting to write to
/etc/passwd
will be terminated. Indeed when using kubectl exec
, a new
process is spawned in the container PID namespace and is not a child of PID 1.
- call: "sys_write"
syscall: true
args:
- index: 0
type: "fd"
- index: 1
type: "char_buf"
sizeArgIndex: 3
- index: 2
type: "size_t"
selectors:
- matchPIDs:
- operator: NotIn
followForks: true
isNamespacePID: true
values:
- 0
- 1
matchArgs:
- index: 0
operator: "Prefix"
values:
- "/etc/passwd"
matchActions:
- action: Sigkill
Signal action
Signal
action sends specified signal to current process. The signal number
is specified with argSig
value.
Following example is equivalent to the Sigkill action example above.
The difference is to use the signal action with SIGKILL(9)
signal.
- call: "sys_write"
syscall: true
args:
- index: 0
type: "fd"
- index: 1
type: "char_buf"
sizeArgIndex: 3
- index: 2
type: "size_t"
selectors:
- matchPIDs:
- operator: NotIn
followForks: true
isNamespacePID: true
values:
- 0
- 1
matchArgs:
- index: 0
operator: "Prefix"
values:
- "/etc/passwd"
matchActions:
- action: Signal
argSig: 9
Override action
Override
action allows to modify the return value of call. While Sigkill
will terminate the entire process responsible for making the call, Override
will run in place of the original kprobed function and return the value
specified in the argError
field. It’s then up to the code path or the user
space process handling the returned value to whether stop or proceed with the
execution.
For example, you can create a TracingPolicy
that intercepts sys_symlinkat
and will make it return -1
every time the first argument is equal to the
string /etc/passwd
:
kprobes:
- call: "sys_symlinkat"
syscall: true
args:
- index: 0
type: "string"
- index: 1
type: "int"
- index: 2
type: "string"
selectors:
- matchArgs:
- index: 0
operator: "Equal"
values:
- "/etc/passwd\0"
matchActions:
- action: Override
argError: -1
Override
uses the kernel error injection framework and is only available
on kernels compiled with CONFIG_BPF_KPROBE_OVERRIDE
configuration option.
Overriding system calls is the primary use case, but there are other kernel
functions that support error injections too. These functions are annotated
with ALLOW_ERROR_INJECTION()
in the kernel source, and can be identified by
reading the file /sys/kernel/debug/error_injection/list
.
Starting from kernel version 5.7
overriding security_
hooks is also possible.
FollowFD action
The FollowFD
action allows to create a mapping using a BPF map between file
descriptors and filenames. After its creation, the mapping can be maintained
through UnfollowFD
and CopyFD
actions. Note that proper maintenance of the mapping is up to the tracing policy
writer.
FollowFD
is typically used at hook points where a file descriptor and its
associated filename appear together. The kernel function fd_install
is a good example.
The fd_install
kernel function is called each time a file descriptor must be
installed into the file descriptor table of a process, typically referenced
within system calls like open
or openat
. It is a good place for tracking
file descriptor and filename matching.
Let’s take a look at the following example:
- call: "fd_install"
syscall: false
args:
- index: 0
type: int
- index: 1
type: "file"
selectors:
- matchPIDs:
# [...]
matchArgs:
# [...]
matchActions:
- action: FollowFD
argFd: 0
argName: 1
This action uses the dedicated argFd
and argName
fields to get respectively
the index of the file descriptor argument and the index of the name argument in
the call.
While the mapping between the file descriptor and filename remains in place
(that is, between FollowFD
and UnfollowFD
for the same file descriptor)
tracing policies may refer to filenames instead of file descriptors. This
offers greater convenience and allows more functionality to reside inside the
kernel, thereby reducing overhead.
For instance, assume that you want to prevent writes into file
/etc/passwd
. The system call sys_write
only receives a file descriptor,
not a filename, as argument. Yet with a bracketing pair of FollowFD
and UnfollowFD
actions in place the tracing policy that hooks into sys_write
can nevertheless refer to the filename /etc/passwd
,
if it also marks the relevant argument as of type fd
.
The following example combines actions FollowFD
and UnfollowFD
as well
as an argument of type fd
to such effect:
kprobes:
- call: "fd_install"
syscall: false
args:
- index: 0
type: int
- index: 1
type: "file"
selectors:
- matchArgs:
- index: 1
operator: "Equal"
values:
- "/tmp/passwd"
matchActions:
- action: FollowFD
argFd: 0
argName: 1
- call: "sys_write"
syscall: true
args:
- index: 0
type: "fd"
- index: 1
type: "char_buf"
sizeArgIndex: 3
- index: 2
type: "size_t"
selectors:
- matchArgs:
- index: 0
operator: "Equal"
values:
- "/tmp/passwd"
matchActions:
- action: Sigkill
- call: "sys_close"
syscall: true
args:
- index: 0
type: "int"
selectors:
- matchActions:
- action: UnfollowFD
argFd: 0
argName: 0
UnfollowFD action
The UnfollowFD
action takes a file descriptor from a system call and deletes
the corresponding entry from the BPF map, where it was put under the FollowFD
action.
It is typically used at hooks points where the scope of association between
a file descriptor and a filename ends. The system call sys_close
is a
good example.
Let’s take a look at the following example:
- call: "sys_close"
syscall: true
args:
- index: 0
type: "int"
selectors:
- matchPIDs:
- operator: NotIn
followForks: true
isNamespacePID: true
values:
- 0
- 1
matchActions:
- action: UnfollowFD
argFd: 0
Similar to the FollowFD
action, the index of the file descriptor is described
under argFd
:
matchActions:
- action: UnfollowFD
argFd: 0
In this example, argFD
is 0. So, the argument from the sys_close
system
call at index: 0
will be deleted from the BPF map whenever a sys_close
is
executed.
- index: 0
type: "int"
FollowFD
block,
there should be a matching UnfollowFD
block, otherwise the BPF map will be
broken.CopyFD action
The CopyFD
action is specific to duplication of file descriptor use cases.
Similary to FollowFD
, it takes an argFd
and argName
arguments. It can
typically be used tracking the dup
, dup2
or dup3
syscalls.
See the following example for illustration:
- call: "sys_dup2"
syscall: true
args:
- index: 0
type: "fd"
- index: 1
type: "int"
selectors:
- matchPIDs:
# [...]
matchActions:
- action: CopyFD
argFd: 0
argName: 1
- call: "sys_dup3"
syscall: true
args:
- index: 0
type: "fd"
- index: 1
type: "int"
- index: 2
type: "int"
selectors:
- matchPIDs:
# [...]
matchActions:
- action: CopyFD
argFd: 0
argName: 1
GetUrl action
The GetUrl
action can be used to perform a remote interaction such as
triggering Thinkst canaries or any system that can be triggered via an URL
request. It uses the argUrl
field to specify the URL to request using GET
method.
matchActions:
- action: GetUrl
argUrl: http://ebpf.io
DnsLookup action
The DnsLookup
action can be used to perform a remote interaction such as
triggering Thinkst canaries or any system that can be triggered via an DNS
entry request. It uses the argFqdn
field to specify the domain to lookup.
matchActions:
- action: DnsLookup
argFqdn: ebpf.io
Post action
The Post
action allows an event to be transmitted to the agent, from
kernelspace to userspace. By default, all TracingPolicy
hook will create an
event with the Post
action except in those situations:
- a
NoPost
action was specified in amatchActions
; - a rate-limiting parameter is in place, see details below.
This action allows you to specify parameters for the Post
action.
Rate limiting
Post
takes the rateLimit
parameter with a time value. This value defaults
to seconds, but post-fixing ’m’ or ‘h’ will cause the value to be interpreted
in minutes or hours. When this parameter is specified for an action, that
action will check if the same action has fired, for the same thread, within
the time window, with the same inspected arguments. (Only the first 40 bytes
of each inspected argument is used in the matching. Only supported on kernels
v5.3 onwards.)
For example, you can specify a selector to only generate an event every 5 minutes with adding the following action and its paramater:
matchActions:
- action: Post
rateLimit: 5m
By default, the rate limiting is applied per thread, meaning that only repeated actions by the same thread will be rate limited. This can be expanded to all threads for a process by specifying a rateLimitScope with value “process”; or can be expanded to all processes by specifying the same with the value “global”.
Stack traces
Post
takes the kernelStackTrace
parameter, when turned to true
(by default to
false
) it enables dump of the kernel stack trace to the hook point in kprobes
events. To dump user space stack trace set userStackTrace
parameter to true
.
For example, the following kprobe hook can be used to retrieve the
kernel stack to kfree_skb_reason
, the function called in the kernel to drop
kernel socket buffers.
kprobes:
- call: kfree_skb_reason
selectors:
- matchActions:
- action: Post
kernelStackTrace: true
userStackTrace: true
By default Tetragon does not expose the linear addresses from kernel space or
user space, you need to enable the flag --expose-stack-addresses
to get the
addresses along the rest.
Note that the Tetragon agent is using its privilege to read the kernel symbols and their address. Being able to retrieve kernel symbols address can be used to break kernel address space layout randomization (KASLR) so only privileged users should be able to enable this feature and read events containing stack traces. The same thing we can say about retrieving address for user mode processes. Stack trace addresses can be used to bypass address space layout randomization (ASLR).
Once loaded, events created from this policy will contain a new kernel_stack_trace
field on the process_kprobe
event with an output similar to:
{
"address": "18446744072119856613",
"offset": "5",
"symbol": "kfree_skb_reason"
},
{
"address": "18446744072119769755",
"offset": "107",
"symbol": "__sys_connect_file"
},
{
"address": "18446744072119769989",
"offset": "181",
"symbol": "__sys_connect"
},
[...]
The “address” is the kernel function address, “offset” is the offset into the native instruction for the function and “symbol” is the function symbol name.
User mode stack trace is contained in user_stack_trace
field on the
process_kprobe
event and looks like:
{
"address": "140498967885099",
"offset": "1209643",
"symbol": "__connect",
"module": "/usr/lib/x86_64-linux-gnu/libc.so.6"
},
{
"address": "140498968021470",
"offset": "1346014",
"symbol": "inet_pton",
"module": "/usr/lib/x86_64-linux-gnu/libc.so.6"
},
{
"address": "140498971185511",
"offset": "106855",
"module": "/usr/lib/x86_64-linux-gnu/libcurl.so.4.7.0"
},
The “address” is the function address, “offset” is the function offset from the beginning of the binary module. “module” is the absolute path of the binary file to which address belongs. “symbol” is the function symbol name. “symbol” may be missing if the binary file is stripped.
Information from procfs (/proc/<pid>/maps)
is used to symbolize user
stack trace addresses. Stack trace addresses extraction and symbolizing are async.
It might happen that process is terminated and the /proc/<pid>/maps
file will be
not existed at user stack trace symbolization step. In such case user stack traces
for very short living process might be not collected.
For Linux kernels before 5.15 user stack traces may be incomplete (some stack traces entries may be missed).
This output can be enhanced in a more human friendly using the tetra getevents -o compact
command. Indeed, by default, it will print the stack trace along
the compact output of the event similarly to this:
❓ syscall /usr/bin/curl kfree_skb_reason
Kernel:
0xffffffffa13f2de5: kfree_skb_reason+0x5
0xffffffffa13dda9b: __sys_connect_file+0x6b
0xffffffffa13ddb85: __sys_connect+0xb5
0xffffffffa13ddbd8: __x64_sys_connect+0x18
0xffffffffa1714bd8: do_syscall_64+0x58
0xffffffffa18000e6: entry_SYSCALL_64_after_hwframe+0x6e
User space:
0x7f878cf2752b: __connect (/usr/lib/x86_64-linux-gnu/libc.so.6+0x12752b)
0x7f878cf489de: inet_pton (/usr/lib/x86_64-linux-gnu/libc.so.6+0x1489de)
0x7f878d1b6167: (/usr/lib/x86_64-linux-gnu/libcurl.so.4.7.0+0x1a167)
The printing format for kernel stack trace is "0x%x: %s+0x%x", address, symbol, offset
.
The printing format for user stack trace is "0x%x: %s (%s+0x%x)", address, symbol, module, offset
.
0x0
, see the above note on
--expose-stack-addresses
for more info.File hash collection with IMA
Post
takes the imaHash
parameter, when turned to true
(by default to
false
) it adds file hashes in LSM events calculated by Linux integrity subsystem.
The following list of LSM hooks is supported:
- bprm_check_security
- bprm_committed_creds
- bprm_committing_creds
- bprm_creds_from_file
- file_ioctl
- file_lock
- file_open
- file_post_open
- file_receive
- mmap_file
First, you need to be sure that LSB BPF is enabled.
To verify if IMA-measurement is available use the following command:
cat /boot/config-$(uname -r) | grep "CONFIG_IMA\|CONFIG_INTEGRITY"
The output should be similar to this if IMA-measurement is supported:
CONFIG_INTEGRITY=y
CONFIG_IMA=y
If provided above conditions are met, you can enable IMA-measurement by modifying /etc/deault/grub
:
GRUB_CMDLINE_LINUX="lsm=integrity,bpf ima_policy=tcb"
Then, update the grub configuration and restart the system.
ima_policy=
is used to define which files will be measured. tcb
measures all executables run,
all mmap’d files for execution (such as shared libraries), all kernel modules loaded,
and all firmware loaded. Additionally, a files opened for read by root are measured as well.
ima_policy=
can be specified multiple times, and the result is the union of the policies.
To know more about ima_policy
you can follow this link.
iversion
. Mounting with iversion
helps IMA not recalculating hash if file is not changed. From kernel 6.1 iversion
is by default.
It is not necessary to enable IMA to calculate hashes with Tetragon if you have kernel 6.1+.
But hashes will be recalculated no matter if file is not changed. See implementation details of
bpf_ima_file_hash
helper.The provided example of TracingPolicy
collects hashes of executed binaries from
zsh
and bash
interpreters:
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
spec:
lsmhooks:
- hook: "bprm_check_security"
args:
- index: 0
type: "linux_binprm"
selectors:
- matchBinaries:
- operator: "In"
values:
- "/usr/bin/zsh"
- "/usr/bin/bash"
matchActions:
- action: Post
imaHash: true
LSM event with file hash can look like this:
{
"process_lsm": {
"process": {
...
},
"parent": {
...
},
"function_name": "bprm_check_security",
"policy_name": "file-integrity-monitoring",
"args": [
{
"linux_binprm_arg": {
"path": "/usr/bin/grep",
"permission": "-rwxr-xr-x"
}
}
],
"action": "KPROBE_ACTION_POST",
"ima_hash": "sha256:73abb4280520053564fd4917286909ba3b054598b32c9cdfaf1d733e0202cc96"
},
}
ima_hash
field contains information about hashing algorithm and the hash value itself
separated by ‘:’.
This output can be enhanced in a more human friendly using the
tetra getevents -e PROCESS_LSM -o compact
command.
🔒 LSM user-nix /usr/bin/zsh bprm_check_security
/usr/bin/cat sha256:dd5526c5872cce104a80f4d4e7f787c56ab7686a5b8dedda0ba4e8b36a3c084c
🔒 LSM user-nix /usr/bin/zsh bprm_check_security
/usr/bin/grep sha256:73abb4280520053564fd4917286909ba3b054598b32c9cdfaf1d733e0202cc96
NoPost action
The NoPost
action can be used to suppress the event to be generated, but at
the same time all its defined actions are performed.
It’s useful when you are not interested in the event itself, just in the action being performed.
Following example override openat syscall for “/etc/passwd” file but does not generate any event about that.
- call: "sys_openat"
return: true
syscall: true
args:
- index: 0
type: int
- index: 1
type: "string"
- index: 2
type: "int"
returnArg:
type: "int"
selectors:
- matchPIDs:
matchArgs:
- index: 1
operator: "Equal"
values:
- "/etc/passwd"
matchActions:
- action: Override
argError: -2
- action: NoPost
TrackSock action
The TrackSock
action allows to create a mapping using a BPF map between sockets
and processes. It however needs to maintain a state
correctly, see UntrackSock
related action. TrackSock
works similarly to FollowFD
, specifying the argument with the sock
type using
argSock
instead of specifying the FD argument with argFd
.
It is however more likely that socket tracking will be performed on the return
value of sk_alloc
as described above.
Socket tracking is only available on kernel >=5.3.
UntrackSock action
The UntrackSock
action takes a struct sock pointer from a function call and deletes
the corresponding entry from the BPF map, where it was put under the TrackSock
action.
Let’s take a look at the following example:
- call: "__sk_free"
syscall: false
args:
- index: 0
type: sock
selectors:
- matchActions:
- action: UntrackSock
argSock: 0
Similar to the TrackSock
action, the index of the sock is described under argSock
:
- matchActions:
- action: UntrackSock
argSock: 0
In this example, argSock
is 0. So, the argument from the __sk_free
function
call at index: 0
will be deleted from the BPF map whenever a __sk_free
is
executed.
- index: 0
type: "sock"
TrackSock
block,
there should be a matching UntrackSock
block, otherwise the BPF map will be
broken.Socket tracking is only available on kernel >=5.3.
Notify Enforcer action
The NotifyEnforcer
action notifies the enforcer program to kill or override a syscall.
It’s meant to be used on systems with kernel that lacks multi kprobe feature, that allows to attach many kprobes quickly). To workaround that the enforcer sensor uses the raw syscall tracepoint and attaches simple program to syscalls that we need to kill or override.
The specs needs to have enforcer
program definition, that instructs tetragon to load
the enforcer
program and attach it to specified syscalls.
spec:
enforcers:
- calls:
- "list:dups"
The syscalls expects list of syscalls or list:XXX
pointer to list.
Note that currently only single enforcer definition is allowed.
The NotifyEnforcer
action takes 2 arguments.
matchActions:
- action: "NotifyEnforcer"
argError: -1
argSig: 9
If specified the argError will be passed to bpf_override_return
helper to override the syscall return value.
If specified the argSig will be passed to bpf_send_signal
helper to override the syscall return value.
The following is spec for killing /usr/bin/bash
program whenever it calls sys_dup
or sys_dup2
syscalls.
spec:
lists:
- name: "dups"
type: "syscalls"
values:
- "sys_dup"
- "sys_dup2"
enforcers:
- calls:
- "list:dups"
tracepoints:
- subsystem: "raw_syscalls"
event: "sys_enter"
args:
- index: 4
type: "syscall64"
selectors:
- matchArgs:
- index: 0
operator: "InMap"
values:
- "list:dups"
matchBinaries:
- operator: "In"
values:
- "/usr/bin/bash"
matchActions:
- action: "NotifyEnforcer"
argSig: 9
Note as mentioned above the NotifyEnforcer
with enforcer program is meant to be used only on kernel versions
with no support for fast attach of multiple kprobes (kprobe_multi
link).
With kprobe_multi
link support the above example can be easily replaced with:
spec:
lists:
- name: "syscalls"
type: "syscalls"
values:
- "sys_dup"
- "sys_dup2"
kprobes:
- call: "list:syscalls"
selectors:
- matchBinaries:
- operator: "In"
values:
- "/usr/bin/bash"
matchActions:
- action: "Sigkill"
Selector Semantics
The selector
semantics of the CiliumTracingPolicy
follows the standard
Kubernetes semantics and the principles that are used by Cilium to create a
unified policy definition.
To explain deeper the structure and the logic behind it, let’s consider first the following example:
selectors:
- matchPIDs:
- operator: In
followForks: true
values:
- pid1
- pid2
- pid3
matchArgs:
- index: 0
operator: "Equal"
values:
- fdString1
In the YAML above matchPIDs
and matchArgs
are logically AND
together
giving the expression:
(pid in {pid1, pid2, pid3} AND arg0=fdstring1)
Multiple values
When multiple values are given, we apply the OR
operation between them. In
case of having multiple values under the matchPIDs
selector, if any value
matches with the given pid from pid1
, pid2
or pid3
then we accept the
event:
pid==pid1 OR pid==pid2 OR pid==pid3
As an example, we can filter for sys_read()
syscalls that were not part of
the container initialization and the main pod process and tried to read from
the /etc/passwd
file by using:
selectors:
- matchPIDs:
- operator: NotIn
followForks: true
values:
- 0
- 1
matchArgs:
- index: 0
operator: "Equal"
values:
- "/etc/passwd"
Similarly, we can use multiple values under the matchArgs
selector:
(pid in {pid1, pid2, pid3} AND arg0={fdstring1, fdstring2})
If any value matches with fdstring1
or fdstring2
, specifically
(string==fdstring1 OR string==fdstring2)
then we accept the event.
For example, we can monitor sys_read()
syscalls accessing both the
/etc/passwd
or the /etc/shadow
files:
selectors:
- matchPIDs:
- operator: NotIn
followForks: true
values:
- 0
- 1
matchArgs:
- index: 0
operator: "Equal"
values:
- "/etc/passwd"
- "/etc/shadow"
Multiple operators
When multiple operators are supported under matchPIDs
or matchArgs
, they
are logically AND
together. In case if we have multiple operators under
matchPIDs
:
selectors:
- matchPIDs:
- operator: In
followForks: true
values:
- pid1
- operator: NotIn
followForks: true
values:
- pid2
then we would build the following expression on the BPF side:
(pid == 0[following forks]) && (pid != 1[following forks])
In case of having multiple matchArgs
:
selectors:
- matchPIDs:
- operator: In
followForks: true
values:
- pid1
- pid2
- pid3
matchArgs:
- index: 0
operator: "Equal"
values:
- 1
- index: 2
operator: "lt"
values:
- 500
Then we would build the following expression on the BPF side
(pid in {pid1, pid2, pid3} AND arg0=1 AND arg2 < 500)
Operator types
There are different types supported for each operator. In case of matchArgs
:
- Equal
- NotEqual
- Prefix
- Postfix
- Mask
- GreaterThan (aka GT)
- LessThan (aka LT)
- SPort - Source Port
- NotSPort - Not Source Port
- SPortPriv - Source Port is Privileged (0-1023)
- NotSPortPriv - Source Port is Not Privileged (Not 0-1023)
- DPort - Destination Port
- NotDPort - Not Destination Port
- DPortPriv - Destination Port is Privileged (0-1023)
- NotDPortPriv - Destination Port is Not Privileged (Not 0-1023)
- SAddr - Source Address, can be IPv4/6 address or IPv4/6 CIDR (for ex 1.2.3.4/24 or 2a1:56::1/128)
- NotSAddr - Not Source Address
- DAddr - Destination Address
- NotDAddr - Not Destination Address
- Protocol
- Family
- State
The operator types Equal
and NotEqual
are used to test whether the certain
argument of a system call is equal to the defined value in the CR.
For example, the following YAML snippet matches if the argument at index 0 is
equal to /etc/passwd
:
matchArgs:
- index: 0
operator: "Equal"
values:
- "/etc/passwd"
Both Equal
and NotEqual
are set operations. This means if multiple values
are specified, they are OR
d together in case of Equal
, and AND
d together
in case of NotEqual
.
For example, in case of Equal
the following YAML snippet matches if the
argument at index 0 is in the set of {arg0, arg1, arg2}
.
matchArgs:
- index: 0
operator: "Equal"
values:
- "arg0"
- "arg1"
- "arg2"
The above would be executed in the kernel as
arg == arg0 OR arg == arg1 OR arg == arg2
In case of NotEqual
the following YAML snippet matches if the argument at
index 0 is not in the set of {arg0, arg1}
.
matchArgs:
- index: 0
operator: "NotEqual"
values:
- "arg0"
- "arg1"
The above would be executed in the kernel as
arg != arg0 AND arg != arg1
The operator type Mask
performs and bitwise operation on the argument value
and defined values. The argument type needs to be one of the value types.
For example in following YAML snippet we match second argument for bits 1 and 9 (0x200 value). We could use single value 0x201 as well.
matchArgs:
- index: 2
operator: "Mask"
values:
- 1
- 0x200
The above would be executed in the kernel as
arg & 1 OR arg & 0x200
The value can be specified as hexadecimal (with 0x prefix) octal (with 0 prefix) or decimal value (no prefix).
The operator Prefix
checks if the certain argument starts with the defined value,
while the operator Postfix
compares if the argument matches to the defined value
as trailing.
The operators relating to ports, addresses and protocol are used with sock or skb
types. Port operators can accept a range of ports specified as min:max
as well
as lists of individual ports. Address operators can accept IPv4/6 CIDR ranges as well
as lists of individual addresses.
The Protocol
operator can accept integer values to match against, or the equivalent
IPPROTO_ enumeration. For example, UDP can be specified as either IPPROTO_UDP
or 17;
TCP can be specified as either IPPROTO_TCP
or 6.
The Family
operator can accept integer values to match against or the equivalent
AF_ enumeration. For example, IPv4 can be specified as either AF_INET
or 2; IPv6
can be specified as either AF_INET6
or 10.
The State
operator can accept integer values to match against or the equivalent
TCP_ enumeration. For example, an established socket can be matched with
TCP_ESTABLISHED
or 1; a closed socket with TCP_CLOSE
or 7.
In case of matchPIDs
:
- In
- NotIn
The operator types In
and NotIn
are used to test whether the pid of a
system call is found in the provided values
list in the CR. Both In
and
NotIn
are set operations, which means if multiple values are specified they
are OR
d together in case of In
and AND
d together in case of NotIn
.
For example, in case of In
the following YAML snippet matches if the pid of a
certain system call is being part of the list of {0, 1}
:
- matchPIDs:
- operator: In
followForks: true
isNamespacePID: true
values:
- 0
- 1
The above would be executed in the kernel as
pid == 0 OR pid == 1
In case of NotIn
the following YAML snippet matches if the pid of a certain
system call is not being part of the list of {0, 1}
:
- matchPIDs:
- operator: NotIn
followForks: true
isNamespacePID: true
values:
- 0
- 1
The above would be executed in the kernel as
pid != 0 AND pid != 1
In case of matchBinaries
:
- In
The In
operator type is used to test whether a binary name of a system call
is found in the provided values
list. For example, the following YAML snippet
matches if the binary name of a certain system call is being part of the list
of {binary0, binary1, binary2}
:
- matchBinaries:
- operator: "In"
values:
- "binary0"
- "binary1"
- "binary2"
Multiple selectors
When multiple selectors are configured they are logically OR
d together.
selectors:
- matchPIDs:
- operator: In
followForks: true
values:
- pid1
- pid2
- pid3
matchArgs:
- index: 0
operator: "Equal"
values:
- 1
- index: 2
operator: "lt"
values:
- 500
- matchPIDs:
- operator: In
followForks: true
values:
- pid1
- pid2
- pid3
matchArgs:
- index: 0
operator: "Equal"
values:
- 2
The above would be executed in kernel as:
(pid in {pid1, pid2, pid3} AND arg0=1 AND arg2 < 500) OR
(pid in {pid1, pid2, pid3} AND arg0=2)
Limitations
Those limitations might be outdated, see issue #709.
Because BPF must be bounded we have to place limits on how many selectors can exist.
- Max Selectors 8.
- Max PID values per selector 4
- Max MatchArgs per selector 5 (one per index)
- Max MatchArg Values per MatchArgs 1 (limiting initial implementation can bump to 16 or so)
Return Actions filter
Return actions filters are a list of actions that execute when an return selector
matches. They are defined under matchReturnActions
and currently support all
the Actions filter action
types.
2.5 - Tags
Tags are optional fields of a Tracing Policy that are used to categorize generated events.
Introduction
Tags are specified in Tracing policies and will be part of the generated event.
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
name: "file-monitoring-filtered"
spec:
kprobes:
- call: "security_file_permission"
message: "Sensitive file system write operation"
syscall: false
args:
- index: 0
type: "file" # (struct file *) used for getting the path
- index: 1
type: "int" # 0x04 is MAY_READ, 0x02 is MAY_WRITE
selectors:
- matchArgs:
- index: 0
operator: "Prefix"
values:
- "/etc" # Writes to sensitive directories
- "/boot"
- "/lib"
- "/lib64"
- "/bin"
- "/usr/lib"
- "/usr/local/lib"
- "/usr/local/sbin"
- "/usr/local/bin"
- "/usr/bin"
- "/usr/sbin"
- "/var/log" # Writes to logs
- "/dev/log"
- "/root/.ssh" # Writes to sensitive files add here.
- index: 1
operator: "Equal"
values:
- "2" # MAY_WRITE
tags: [ "observability.filesystem", "observability.process" ]
Every kprobe call can have up to max 16 tags.
Namespaces
Observability namespace
Events in this namespace relate to collect and export data about the internal system state.
- “observability.filesystem”: the event is about file system operations.
- “observability.privilege_escalation”: the event is about raising permissions of a user or a process.
- “observability.process”: the event is about an instance of a Linux program being executed.
User defined Tags
Users can define their own tags inside Tracing Policies. The official supported tags are documented in the Namespaces section.
2.6 - Kubernetes Identity Aware Policies
Motivation
Tetragon is configured via TracingPolicies. Broadly speaking, TracingPolicies define what situations Tetragon should react to and how. The what can be, for example, specific system calls with specific argument values. The how defines what action the Tetragon agent should perform when the specified situation occurs. The most common action is generating an event, but there are others (e.g., returning an error without executing the function or killing the corresponding process).
Here, we discuss how to apply tracing policies only on a subset of pods running on the system via the followings mechanisms:
- namespaced policies
- pod-label filters
- container field filters
Tetragon implements these mechanisms in-kernel via eBPF. This is important for both observability and enforcement use-cases. For observability, copying only the relevant events from kernel- to user-space reduces overhead. For enforcement, performing the enforcement action in the kernel avoids the race-condition of doing it in user-space. For example, let us consider the case where we want to block an application from performing a system call. Performing the filtering in-kernel means that the application will never finish executing the system call, which is not possible if enforcement happens in user-space (after the fact).
To ensure that namespaced tracing policies are always correctly applied, Tetragon needs to perform actions before containers start executing. Tetragon supports this via OCI runtime hooks. If such hooks are not added, Tetragon will apply policies in a best-effort manner using information from the k8s API server.
Namespace filtering
For namespace filtering we use TracingPolicyNamespaced
which has the same contents as a
TracingPolicy
, but it is defined in a specific namespace and it is only applied to pods of that
namespace.
Pod label filters
For pod label filters, we use the PodSelector
field of tracing policies to select the pods that
the policy is applied to.
Container field filters
For container field filters, we use the containerSelector
field of tracing policies to select the containers that the policy is applied to. At the moment, the only supported field is name
.
Demo
Setup
For this demo, we use containerd and configure appropriate run-time hooks using minikube.
First, let us start minikube, build and load images, and install Tetragon and OCI hooks:
minikube start --container-runtime=containerd
./contrib/tetragon-rthooks/minikube-containerd-install-hook.sh
make image image-operator
minikube image load --daemon=true cilium/tetragon:latest cilium/tetragon-operator:latest
minikube ssh -- sudo mount bpffs -t bpf /sys/fs/bpf
helm install --namespace kube-system \
--set tetragonOperator.image.override=cilium/tetragon-operator:latest \
--set tetragon.image.override=cilium/tetragon:latest \
--set tetragon.grpc.address="unix:///var/run/cilium/tetragon/tetragon.sock" \
tetragon ./install/kubernetes/tetragon
Once the tetragon pod is up and running, we can get its name and store it in a variable for convenience.
tetragon_pod=$(kubectl -n kube-system get pods -l app.kubernetes.io/name=tetragon -o custom-columns=NAME:.metadata.name --no-headers)
Once the tetragon operator pod is up and running, we can also get its name and store it in a variable for convenience.
tetragon_operator=$(kubectl -n kube-system get pods -l app.kubernetes.io/name=tetragon-operator -o custom-columns=NAME:.metadata.name --no-headers)
Next, we check the tetragon-operator logs and tetragon agent logs to ensure that everything is in order.
First, we check if the operator installed the TracingPolicyNamespaced CRD.
kubectl -n kube-system logs -c tetragon-operator $tetragon_operator
The expected output is:
level=info msg="Tetragon Operator: " subsys=tetragon-operator
level=info msg="CRD (CustomResourceDefinition) is installed and up-to-date" name=TracingPolicy/v1alpha1 subsys=k8s
level=info msg="Creating CRD (CustomResourceDefinition)..." name=TracingPolicyNamespaced/v1alpha1 subsys=k8s
level=info msg="CRD (CustomResourceDefinition) is installed and up-to-date" name=TracingPolicyNamespaced/v1alpha1 subsys=k8s
level=info msg="Initialization complete" subsys=tetragon-operator
Next, we check that policyfilter (the low-level mechanism that implements the desired functionality) is indeed enabled.
kubectl -n kube-system logs -c tetragon $tetragon_pod
The output should include:
level=info msg="Enabling policy filtering"
Namespaced policies
For illustration purposes, we will use the lseek system call with an invalid argument. Specifically a file descriptor (the first argument) of -1. Normally, this operation would return a “Bad file descriptor error”.
Let us start a pod in the default namespace:
kubectl -n default run test --image=python -it --rm --restart=Never -- python
Above command will result in the following python shell:
If you don't see a command prompt, try pressing enter.
>>>
There is no policy installed, so attempting to do the lseek operation will just return an error. Using the python shell, we can execute an lseek and see the returned error.
>>> import os
>>> os.lseek(-1,0,0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OSError: [Errno 9] Bad file descriptor
>>>
In another terminal, we install a policy in the default namespace:
cat << EOF | kubectl apply -n default -f -
apiVersion: cilium.io/v1alpha1
kind: TracingPolicyNamespaced
metadata:
name: "lseek-namespaced"
spec:
kprobes:
- call: "sys_lseek"
syscall: true
args:
- index: 0
type: "int"
selectors:
- matchArgs:
- index: 0
operator: "Equal"
values:
- "-1"
matchActions:
- action: Sigkill
EOF
The above tracing policy will kill the process that performs a lseek system call with a file
descriptor of -1
. Note that we use a SigKill
action only for illustration purposes because it’s
easier to observe its effects.
Then, attempting the lseek operation on the previous terminal, will result in the process getting killed:
>>> os.lseek(-1, 0, 0)
pod "test" deleted
pod default/test terminated (Error)
The same is true for a newly started container:
kubectl -n default run test --image=python -it --rm --restart=Never -- python
If you don't see a command prompt, try pressing enter.
>>> import os
>>> os.lseek(-1, 0, 0)
pod "test" deleted
pod default/test terminated (Error)
Doing the same on another namespace:
kubectl create namespace test
kubectl -n test run test --image=python -it --rm --restart=Never -- python
Will not kill the process and result in an error:
If you don't see a command prompt, try pressing enter.
>>> import os
>>> os.lseek(-1, 0, 0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OSError: [Errno 9] Bad file descriptor
Pod label filters
Let’s install a tracing policy with a pod label filter.
cat << EOF | kubectl apply -f -
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
name: "lseek-podfilter"
spec:
podSelector:
matchLabels:
app: "lseek-test"
kprobes:
- call: "sys_lseek"
syscall: true
args:
- index: 0
type: "int"
selectors:
- matchArgs:
- index: 0
operator: "Equal"
values:
- "-1"
matchActions:
- action: Sigkill
EOF
Pods without the label will not be affected:
kubectl run test --image=python -it --rm --restart=Never -- python
If you don't see a command prompt, try pressing enter.
>>> import os
>>> os.lseek(-1, 0, 0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OSError: [Errno 9] Bad file descriptor
>>>
But pods with the label will:
kubectl run test --labels "app=lseek-test" --image=python -it --rm --restart=Never -- python
If you don't see a command prompt, try pressing enter.
>>> import os
>>> os.lseek(-1, 0, 0)
pod "test" deleted
pod default/test terminated (Error)
Container field filters
Let’s install a tracing policy with a container field filter.
cat << EOF | kubectl apply -f -
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
name: "lseek-containerfilter"
spec:
containerSelector:
matchExpressions:
- key: name
operator: In
values:
- main
kprobes:
- call: "sys_lseek"
syscall: true
args:
- index: 0
type: "int"
selectors:
- matchArgs:
- index: 0
operator: "Equal"
values:
- "-1"
matchActions:
- action: Sigkill
EOF
Let’s create a pod with 2 containers:
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: lseek-pod
spec:
containers:
- name: main
image: python
command: ['sh', '-c', 'sleep infinity']
- name: sidecar
image: python
command: ['sh', '-c', 'sleep infinity']
EOF
Containers that don’t match the name main
will not be affected:
kubectl exec -it lseek-pod -c sidecar -- python3
>>> import os
>>> os.lseek(-1, 0, 0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OSError: [Errno 9] Bad file descriptor
>>>
But containers matching the name main
will:
kubectl exec -it lseek-pod -c main -- python3
>>> import os
>>> os.lseek(-1, 0, 0)
command terminated with exit code 137
3 - Runtime Hooks
Applying Kubernetes Identity Aware Policies requires information about Kubernetes (K8s) pods (e.g., namespaces and labels). Based on this information, the Tetragon agent can update the state so that Kubernetes Identify filtering can be applied in-kernel via BPF.
One way that this information is available to the Tetragon agent is via the K8s API server. Relying on the API server, however, can lead to a delay before the container starts and the policy is applied. This might be undesirable, especially for enforcement policies.
Runtime hooks address this issue by “hooking” into the container run-time system, and ensuring that the Tetragon agent sets up the necessary state for filtering before the container starts.
The OCI hooks are implemented via a tetragon-oci-hook
binary which is responsible for contacting
the agent via a gRPC socket. tetragon-oci-hook
can be configured to either fail or succeed when
connecting to the tetragon agent fails (this is needed, so that Tetragon itself, as well as other
mission critical containers can be started).
┌────────────────────┐ ┌────────────────────┐ ┌──────────────────┐
│ tetragon-oci-hook │ │ tetragon.sock │ │ tetragon agent │
│ (binary) │─────────┤ (gRPC UNIX socket) │──────► │ │
│ │ │ │ │ │
└────────────────────┘ └────────────────────┘ └──────────────────┘
Depending on the container runtime, there are different ways to configure the runtime so that
tetragon-oci-hook
is executed before a container starts:
CRI-O
CRI-O implements the OCI hooks configuration directories as described in: https://github.com/containers/common/blob/main/pkg/hooks/docs/oci-hooks.5.md. Hence, enabling the hook requires adding an appropriate file to this directory.
Containerd (with NRI)
Recent versions of containerd support NRI: NRI support was
added in 1.7 and will be enabled by
default starting with
2.0.
To use tetragon-oci-hook
with NRI, there is a simple NRI plugin (called tetragon-nri-hook
) that
adds the tetragon-oci-hook
to the container spec.
Containerd (without NRI)
Containerd can be configured to use a custom container spec that includes tetragon-oci-hook
.
Configuration
4 - Enforcement
Tetragon allows enforcing events in the kernel inline with the operation itself. This document describes the types of enforcement provided by Tetragon and concerns policy implementors must be aware of.
There are two ways that Tetragon performs enforcement: overriding the return value of a function and
sending a signal (e.g., SIGKILL
) to the process.
Override return value
Override the return value of a call means that the function will never be executed and, instead, a value (typically an error) will be returned to the caller. Generally speaking, only system calls and security check functions allow to change their return value in this manner. Details about how users can configure tracing policies to override the return value can be found in the Override action documentation.
Signals
Another type of enforcement is signals. For example, users can write a TracingPolicy (details can be
found in the Signal action
documentation) that sends a SIGKILL
to a process matching certain criteria and thus terminate it.
In contrast with overriding the return value, sending a SIGKILL
signal does not always stop the
operation being performed by the process that triggered the operation. For example, a SIGKILL
sent
in a write()
system call does not guarantee that the data will not be written to the file.
However, it does ensure that the process is terminated synchronously (and any threads will be
stopped). In some cases it may be sufficient to ensure the process is stopped and the process does
not handle the return of the call. To ensure the operation is not completed, though, the Signal
action should be combined with the Override
action.
4.1 - Persistent enforcement
This page shows you how to configure persistent enforcement.
Concept
The idea of persistent enforcement is to allow the enforcement policy to continue running even when its tetragon process is gone.
This is configured with the --keep-sensors-on-exit
option.
When the tetragon process exits, the policy stays active because it’s pinned
in sysfs bpf tree under /sys/fs/bpf/tetragon
directory.
When a new tetragon process is started, it performs the following actions:
- checks if there’s existing
/sys/fs/bpf/tetragon
and moves it to/sys/fs/bpf/tetragon_old
directory; - sets up configured policy;
- removes
/sys/fs/bpf/tetragon_old
directory.
Example
This example shows how the persistent enforcement works on simple tracing policy.
Consider the following enforcement tracing policy that kills any process that touches
/tmp/tetragon
file.apiVersion: cilium.io/v1alpha1 kind: TracingPolicy metadata: name: "enforcement" spec: kprobes: - call: "fd_install" syscall: false args: - index: 0 type: int - index: 1 type: "file" selectors: - matchArgs: - index: 1 operator: "Equal" values: - "/tmp/tetragon" matchActions: - action: Sigkill
Spawn tetragon with the above policy and
--keep-sensors-on-exit
option.tetragon --bpf-lib bpf/objs/ --keep-sensors-on-exit --tracing-policy enforcement.yaml
Verify that the enforcement policy is in place.
cat /tmp/tetragon
The output should be similar to
Killed
Kill tetragon with CTRL+C.
time="2024-07-26T14:47:45Z" level=info msg="Perf ring buffer size (bytes)" percpu=68K total=272K time="2024-07-26T14:47:45Z" level=info msg="Perf ring buffer events queue size (events)" size=63K time="2024-07-26T14:47:45Z" level=info msg="Listening for events..." ^C time="2024-07-26T14:50:50Z" level=info msg="Received signal interrupt, shutting down..." time="2024-07-26T14:50:50Z" level=info msg="Listening for events completed." error="context canceled"
Verify that the enforcement policy is STILL in place.
cat /tmp/tetragon
The output should be still similar to
Killed
Limitations
At the moment we are not able to receive any events during the tetragon down time, only the the enforcement is in place.
5 - BPF programs statistics
This page shows you how to monitor bpf programs statistics.
Concept
The BPF subsystem provides performance data for each loaded program and tetragon exports that in metrics or display that in terminal in top like tool.
In terminal
The tetra command allows to display loaded BPF programs in terminal with:
tetra debug progs
The default output shows tetragon programs only and looks like:
2024-10-31 11:12:45.94715546 +0000 UTC m=+8.038098448
Ovh(%) Id Cnt Time Name Pin
0.00 22201 0 0 event_execve /sys/fs/bpf/tetragon/__base__/event_execve/prog
0.00 22198 0 0 event_exit_acct_process /sys/fs/bpf/tetragon/__base__/event_exit/prog
0.00 22200 0 0 event_wake_up_new_task /sys/fs/bpf/tetragon/__base__/kprobe_pid_clear/prog
0.00 22207 0 0 tg_cgroup_rmdir /sys/fs/bpf/tetragon/__base__/tg_cgroup_rmdir/prog
0.00 22206 0 0 tg_kp_bprm_committing_creds /sys/fs/bpf/tetragon/__base__/tg_kp_bprm_committing_creds/prog
0.00 22221 0 0 generic_kprobe_event /sys/fs/bpf/tetragon/syswritefollowfdpsswd/generic_kprobe/__x64_sys_close/prog
0.00 22225 0 0 generic_kprobe_event /sys/fs/bpf/tetragon/syswritefollowfdpsswd/generic_kprobe/__x64_sys_write/prog
0.00 22211 0 0 generic_kprobe_event /sys/fs/bpf/tetragon/syswritefollowfdpsswd/generic_kprobe/fd_install/prog
The fields have following meaning:
Ovh
is system wide overhead of the BPF programId
is global BPF ID of the program (as shown bybpftool prog
)Cnt
is count with number of BPF program executionsTime
is sum of the time of all BPF program executionsPin
is BPF program pin path in bpfffs
It’s possible to display all BPF programs with --all
:
tetra debug progs --all
That has following output:
2024-10-31 11:19:37.720137195 +0000 UTC m=+7.165535117
Ovh(%) Id Cnt Time Name Pin
0.00 159 2 82620 event_execve -
0.00 171 68 18564 iter -
0.00 158 2 10170 event_wake_up_n -
0.00 164 2 4254 tg_kp_bprm_comm -
0.00 157 2 3868 event_exit_acct -
0.00 97 2 1680 -
0.00 35 2 1442 -
0.00 83 0 0 sd_devices -
0.00 9 0 0 -
0.00 7 0 0 -
0.00 8 0 0 -
0.00 87 0 0 sd_devices -
...
Above commands should run properly on top of the tetragon sources. At the moment to run it properly under Kubernetes you need to specify extra directory flags:
kubectl exec -ti -n kube-system tetragon-66rk4 -c tetragon -- tetra debug progs --bpf-dir /run/cilium/bpffs/tetragon/ --all --bpf-lib /var/lib/tetragon/
Note that there are other options to customize the behaviour:
tetra debug progs --help
Retrieve information about BPF programs on the host.
Examples:
- tetragon BPF programs top style
# tetra debug progs
- all BPF programs top style
# tetra debug progs --all
- one shot mode (displays one interval data)
# tetra debug progs --once
- change interval to 10 seconds
# tetra debug progs --timeout 10
- change interval to 10 seconds in one shot mode
# tetra debug progs --once --timeout 10
Usage:
tetra debug progs [flags]
Aliases:
progs, top
Flags:
--all Get all programs
--bpf-dir string Location of bpffs tetragon directory (default "/sys/fs/bpf/tetragon")
--bpf-lib string Location of Tetragon libs (btf and bpf files) (default "bpf/objs/")
-h, --help help for progs
--no-clear Do not clear screen between rounds
--once Run in one shot mode
--timeout int Interval in seconds (delay in one shot mode) (default 1)
Metrics
The BPF subsystem provides performance data for each loaded program and tetragon exports that in metrics.
For each loaded BPF program we get:
run count
which counts how many times the BPF program was executedrun time
which sums the time BPF program spent in all its executions
Hence for each loaded BPF program we export 2 related metrics:
tetragon_overhead_time_program_total[namespace,policy,sensor,attach]
tetragon_overhead_cnt_program_total[namespace,policy,sensor,attach]
Each loaded program is identified by labels:
namespace
is policy Kubernetes namespacepolicy
is policy namesensor
is sensor nameattach
is program attachment name
If we have generic_kprobe
sensor attached on __x64_sys_close
kernel function
under syswritefollowfdpsswd
policy, the related metrics will look like:
tetragon_overhead_program_runs_total{attach="__x64_sys_close",policy="syswritefollowfdpsswd",policy_namespace="",sensor="generic_kprobe"} 15894
tetragon_overhead_program_seconds_total{attach="__x64_sys_close",policy="syswritefollowfdpsswd",policy_namespace="",sensor="generic_kprobe"} 1.03908217e+08
Limitations
Note that the BPF programs statistics are not enabled by default, because they introduce extra overhead, so it’s necessary to enable them manually.
Either with
sysctl
:sysctl kernel.bpf_stats_enabled=1
and make sure you disable the stats when it’s no longer needed:
sysctl kernel.bpf_stats_enabled=0
Or with following
tetra
command:tetra debug enable-stats ^C
where the stats are enabled as long as the command is running (sleeping really).
6 - Event throttling
This page shows you how to configure per-cgroup rate monitoring.
Concept
The idea is that tetragon monitors events rate per cgroup and throttle them (stops posting its events) if they cross configured threshold.
The throttled cgroup is monitored and if its traffic gets stable under the limit again, it stops the cgroup throttling and tetragon resumes receiving the cgroup’s events.
The throttle action generates following events:
THROTTLE
start event is sent when the group rate limit is crossedTHROTTLE
stop event is sent when the cgroup rate is again below the limit stable for 5 seconds
At the moment we monitor and limit base sensor events:
PROCESS_EXEC
PROCESS_EXIT
Setup
The cgroup rate is configured with --cgroup-rate
option:
--cgroup-rate string
Base sensor events cgroup rate <events,interval> disabled by default
('1000,1s' means rate 1000 events per second)
--cgroup-rate=10,1s
sets the cgroup threshold on 10 events per 1 second
--cgroup-rate=1000,1s
sets the cgroup threshold on 1000 events per 1 second
--cgroup-rate=100,1m
sets the cgroup threshold on 1000 events per 1 minutes
--cgroup-rate=10000,10m
sets the cgroup threshold on 1000 events per 10 minutes
Events
The throttle events contains fields as follows.
THROTTLE_START
{ "process_throttle": { "type": "THROTTLE_START", "cgroup": "session-429.scope" }, "node_name": "ubuntu-22", "time": "2024-07-26T13:07:43.178407128Z" }
THROTTLE_STOP
{ "process_throttle": { "type": "THROTTLE_STOP", "cgroup": "session-429.scope" }, "node_name": "ubuntu-22", "time": "2024-07-26T13:07:55.501718877Z" }
Example
This example shows how to generate throttle events when cgroup rate monitoring is enabled.
Start tetragon with cgroup rate monitoring 10 events per second.
tetragon --bpf-lib ./bpf/objs/ --cgroup-rate=10,1s
The successful configuration will show in tetragon log.
... time="2024-07-26T13:33:19Z" level=info msg="Cgroup rate started (10/1s)" ...
Spawn more than 10 events per second.
while :; do sleep 0.001s; done
Monitor events shows throttling.
tetra getevents -o compact
The output should be similar to:
🚀 process ubuntu-22 /usr/bin/sleep 0.001s 💥 exit ubuntu-22 /usr/bin/sleep 0.001s 0 🚀 process ubuntu-22 /usr/bin/sleep 0.001s 💥 exit ubuntu-22 /usr/bin/sleep 0.001s 0 🚀 process ubuntu-22 /usr/bin/sleep 0.001s 💥 exit ubuntu-22 /usr/bin/sleep 0.001s 0 🚀 process ubuntu-22 /usr/bin/sleep 0.001s 💥 exit ubuntu-22 /usr/bin/sleep 0.001s 0 🚀 process ubuntu-22 /usr/bin/sleep 0.001s 💥 exit ubuntu-22 /usr/bin/sleep 0.001s 0 🚀 process ubuntu-22 /usr/bin/sleep 0.001s 💥 exit ubuntu-22 /usr/bin/sleep 0.001s 0 🚀 process ubuntu-22 /usr/bin/sleep 0.001s 🧬 throttle START session-429.scope 🚀 process ubuntu-22 /usr/bin/sleep 0.001s 💥 exit ubuntu-22 /usr/bin/sleep 0.001s 0 🚀 process ubuntu-22 /usr/bin/sleep 0.001s 💥 exit ubuntu-22 /usr/bin/sleep 0.001s 0 🚀 process ubuntu-22 /usr/bin/sleep 0.001s 💥 exit ubuntu-22 /usr/bin/sleep 0.001s 0 🚀 process ubuntu-22 /usr/bin/sleep 0.001s 💥 exit ubuntu-22 /usr/bin/sleep 0.001s 0 🚀 process ubuntu-22 /usr/bin/sleep 0.001s 🚀 process ubuntu-22 /usr/bin/sleep 0.001s 💥 exit ubuntu-22 /usr/bin/sleep 0.001s 0 🚀 process ubuntu-22 /usr/bin/sleep 0.001s 💥 exit ubuntu-22 /usr/bin/sleep 0.001s 0 🚀 process ubuntu-22 /usr/bin/sleep 0.001s 💥 exit ubuntu-22 /usr/bin/sleep 0.001s 0 🚀 process ubuntu-22 /usr/bin/sleep 0.001s 🧬 throttle STOP session-429.scope
When you stop the while loop from the other terminal you will get above
throttle STOP
event after 5 seconds.
Limitations
- The cgroup rate is monitored per CPU
- At the moment we only monitor and limit base sensor and kprobe events:
PROCESS_EXEC
PROCESS_EXIT