This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Concepts

The concepts section helps you understand various Tetragon abstractions and mechanisms.

1: Events
2: Tracing Policy

2.1: Example
2.2: Argument types
2.3: Hook points
2.4: Options
2.5: Selectors
2.6: Tags
2.7: Kubernetes Identity Aware Policies
2.8: Enforcement Mode

3: Runtime Hooks
4: Enforcement

4.1: Persistent enforcement

5: Event throttling

1 - Events

Documentation for Tetragon Events

Tetragon’s events are exposed to the system through either the gRPC endpoint or JSON logs. Commands in this section assume the Getting Started guide was used, but are general other than the namespaces chosen and should work in most environments.

JSON

The first way is to observe the raw json output from the stdout container log:

kubectl logs -n kube-system -l app.kubernetes.io/name=tetragon -c export-stdout -f

The raw JSON events provide Kubernetes API, identity metadata, and OS level process visibility about the executed binary, its parent and the execution time. A base Tetragon installation will produce process_exec and process_exit events encoded in JSON as shown here,

Process execution event

style=color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4>

{ "process_exec": { "process": { "exec_id": "Z2tlLWpvaG4tNjMyLWRlZmF1bHQtcG9vbC03MDQxY2FjMC05czk1OjEzNTQ4Njc0MzIxMzczOjUyNjk5", "pid": 52699, "uid": 0, "cwd": "/", "binary": "/usr/bin/curl", "arguments": "https://ebpf.io/applications/#tetragon", "flags": "execve rootcwd", "start_time": "2023-10-06T22:03:57.700327580Z", "auid": 4294967295, "pod": { "namespace": "default", "name": "xwing", "container": { "id": "containerd://551e161c47d8ff0eb665438a7bcd5b4e3ef5a297282b40a92b7c77d6bd168eb3", "name": "spaceship", "image": { "id": "docker.io/tgraf/netperf@sha256:8e86f744bfea165fd4ce68caa05abc96500f40130b857773186401926af7e9e6", "name": "docker.io/tgraf/netperf:latest" }, "start_time": "2023-10-06T21:52:41Z", "pid": 49 }, "pod_labels": { "app.kubernetes.io/name": "xwing", "class": "xwing", "org": "alliance" }, "workload": "xwing" }, "docker": "551e161c47d8ff0eb665438a7bcd5b4", "parent_exec_id": "Z2tlLWpvaG4tNjMyLWRlZmF1bHQtcG9vbC03MDQxY2FjMC05czk1OjEzNTQ4NjcwODgzMjk5OjUyNjk5", "tid": 52699 }, "parent": { "exec_id": "Z2tlLWpvaG4tNjMyLWRlZmF1bHQtcG9vbC03MDQxY2FjMC05czk1OjEzNTQ4NjcwODgzMjk5OjUyNjk5", "pid": 52699, "uid": 0, "cwd": "/", "binary": "/bin/bash", "arguments": "-c \"curl https://ebpf.io/applications/#tetragon\"", "flags": "execve rootcwd clone", "start_time": "2023-10-06T22:03:57.696889812Z", "auid": 4294967295, "pod": { "namespace": "default", "name": "xwing", "container": { "id": "containerd://551e161c47d8ff0eb665438a7bcd5b4e3ef5a297282b40a92b7c77d6bd168eb3", "name": "spaceship", "image": { "id": "docker.io/tgraf/netperf@sha256:8e86f744bfea165fd4ce68caa05abc96500f40130b857773186401926af7e9e6", "name": "docker.io/tgraf/netperf:latest" }, "start_time": "2023-10-06T21:52:41Z", "pid": 49 }, "pod_labels": { "app.kubernetes.io/name": "xwing", "class": "xwing", "org": "alliance" }, "workload": "xwing" }, "docker": "551e161c47d8ff0eb665438a7bcd5b4", "parent_exec_id": "Z2tlLWpvaG4tNjMyLWRlZmF1bHQtcG9vbC03MDQxY2FjMC05czk1OjEzNTQ4NjQ1MjQ1ODM5OjUyNjg5", "tid": 52699 } }, "node_name": "gke-john-632-default-pool-7041cac0-9s95", "time": "2023-10-06T22:03:57.700326678Z" style=color:#111>}

Will only highlight a few important fields here. For a full specification of events see the Reference section. All events in Tetragon contain a process_exec block to identify the process generating the event. For execution events this is the primary block. For Tracing Policy events the hook that generated the event will attach further data to this. The process_exec event provides a cluster wide unique id the process_exec.exec_id for this process along with the metadata expected in a Kubernetes cluster process_exec.process.pod. The binary and args being executed are part of the event here process_exec.process.binary and process_exec.process.args. Finally, a node_name and time provide the location and time for the event and will be present in all event types.

A default deployment writes the JSON log to /var/run/cilium/tetragon/tetragon.log where it can be exported through normal log collection tooling, e.g. ‘fluentd’, logstash, etc.. The file will be rotated and compressed by default. See [Helm Options] for details on how to customize this location.

`Export Filtering`

Export filters restrict the JSON event output to a subset of desirable events. These export filters are configured as a line-separated list of JSON objects, where each object can contain one or more filter expressions. Filters are combined by taking the logical OR of each line-separated filter object and the logical AND of sibling expressions within a filter object. As a concrete example, suppose we had the following filter configuration:

{"event_set": ["PROCESS_EXEC", "PROCESS_EXIT"], "namespace": ["foo"]}
{"event_set": ["PROCESS_KPROBE"]}

The above filter configuration would result in a match if:

The event type is PROCESS_EXEC or PROCESS_EXIT AND the pod namespace is “foo”; OR
The event type is PROCESS_KPROBE

Tetragon supports two groups of export filters: an allowlist and a denylist. If neither is configured, all events are exported. If only an allowlist is configured, event exports are considered default-deny, meaning only the events in the allowlist are exported. The denylist takes precedence over the allowlist in cases where two filter configurations match on the same event.

You can configure export filters using the provided helm options, command line flags, or environment variables.

Caution The denylist and allowlist filters mentioned above only apply to events
exported as JSON files (e.g., through a configured file sink). These filters
do not affect events streamed via the gRPC API, which are consumed by tools
like the tetra CLI. You will still see unfiltered events when using the
tetra CLI, even if those events would be filtered out by denylist or
allowlist for JSON export.

`List of Process Event Filters`

The type for each filter attribute can be determined by referring to the events.proto file.

Filter	Description
`event_set`	Filter process events by event types. Supported types include: `PROCESS_EXEC`, `PROCESS_EXIT`, `PROCESS_KPROBE`, `PROCESS_UPROBE`, `PROCESS_TRACEPOINT`, `PROCESS_LOADER`
`binary_regex`	Filter process events by a list of regular expressions of process binary names (e.g. `"^/home/kubernetes/bin/kubelet$"`). You can find the full syntax here.
`health_check`	Filter process events if their binary names match Kubernetes liveness / readiness probe commands of their corresponding pods.
`namespace`	Filter by Kubernetes pod namespaces. An empty string (`""`) filters processes that do not belong to any pod namespace.
`pid`	Filter by process PID.
`pid_set`	Like `pid` but also includes processes that are descendants of the listed PIDs.
`pod_regex`	Filter by pod name using a list of regular expressions. You can find the full syntax here.
`arguments_regex`	Filter by process arguments using a list of regular expressions. You can find the full syntax here.
`labels`	Filter events by pod labels using Kubernetes label selector syntax Note that this filter never matches events without the pod field (i.e. host process events).
`policy_names`	Filter events by tracing policy names.
`capabilities`	Filter events by Linux process capability.
`cel_expression`	Filter using CEL expressions. CEL filters support IP and CIDR notiation extensions from the k8s project. See https://pkg.go.dev/k8s.io/apiserver/pkg/cel/library#IP and https://pkg.go.dev/k8s.io/apiserver/pkg/cel/library#CIDR for details.
`parent_binary_regex`	Filter process events by a list of regular expressions of parent process binary names (e.g. `"^/home/kubernetes/bin/kubelet$"`). You can find the full syntax here.
`parent_arguments_regex`	Filter by parent process arguments using a list of regular expressions. You can find the full syntax here.
`container_id`	Filter by the container ID in the process.docker field using RE2 regular expression syntax: https://github.com/google/re2/wiki/Syntax
`in_init_tree`	Filter containerized processes based on whether they are descendants of the container’s init process. This can be used, for example, to watch for processes injected into a container via docker exec, kubectl exec, or similar mechanisms.
`ancestor_binary_regex`	Filter process events by a list of regular expressions of ancestor processes’ binary names (e.g. `"^/home/kubernetes/bin/kubelet$"`). You can find the full syntax here.

`Field Filtering`

In some cases, it is not desirable to include all of the fields exported in Tetragon events by default. In these cases, you can use field filters to restrict the set of exported fields for a given event type. Field filters are configured similarly to export filters, as line-separated lists of JSON objects.

Field filters select fields using the protobuf field mask syntax under the "fields" key. You can define a path of fields using field names separated by period (.) characters. To define multiple paths in a single field filter, separate them with comma (,) characters. For example, "fields":"process.binary,parent.binary,pod.name" would select only the process.binary, parent.binary, and pod.name fields.

By default, a field filter applies to all process events, although you can control this behaviour with the "event_set" key. For example, you can apply a field filter to PROCESS_KPROBE and PROCESS_TRACEPOINT events by specifying "event_set":["PROCESS_KPROBE","PROCESS_TRACEPOINT"] in the filter definition.

Each field filter has an "action" that determines what the filter should do with the selected field. The supported action types are "INCLUDE" and "EXCLUDE". A value of "INCLUDE" will cause the field to appear in an event, while a value of "EXCLUDE" will hide the field. In the absence of any field filter for a given event type, the export will include all fields by default. Defining one or more "INCLUDE" filters for a given event type changes that behaviour to exclude all other event types by default.

As a simple example of the above, consider the case where we want to include only exec_id and parent_exec_id in all event types except for PROCESS_EXEC:

{"fields":"process.exec_id,process.parent_exec_id", "event_set": ["PROCESS_EXEC"], "invert_event_set": true, "action": "INCLUDE"}

`Redacting Sensitive Information`

Since Tetragon traces the entire system, event exports might sometimes contain sensitive information (for example, a secret passed via a command line argument to a process). To prevent this information from being exfiltrated via Tetragon JSON export, Tetragon provides a mechanism called Redaction Filters which can be used to string patterns to redact from exported process arguments. These filters are written in JSON and passed to the Tetragon agent via the --redaction-filters command line flag or the redactionFilters Helm value.

To perform redactions, redaction filters define RE2 regular expressions in the redact field. Any capture groups in these RE2 regular expressions are redacted and replaced with "*****".

Note This feature uses RE2 as its regular expression library. Make sure that you follow
RE2 regular expression guidelines as you may observe unexpected results otherwise.
More information on RE2 syntax can be found here.

Warning When writing regular expressions in JSON, it is important to escape backslash
characters. For instance \Wpasswd\W? would be written as {"redact": "\\Wpasswd\\W?"}.

For more control, you can select which binary or binaries should have their arguments redacted with the binary_regex field.

As a concrete example, the following will redact all passwords passed to processes with the "--password" argument:

{"redact": ["--password(?:\\s+|=)(\\S*)"]}

Now, an event that contains the string "--password=foo" would have that string replaced with "--password=*****".

Suppose we also see some passwords passed via the -p shorthand for a specific binary, foo. We can also redact these as follows:

{"binary_regex": ["(?:^|/)foo$"], "redact": ["-p(?:\\s+|=)(\\S*)"]}

With both of the above redaction filters in place, we are now redacting all password arguments.

`tetra CLI`

A second way is to use the tetra CLI. This has the advantage that it can also be used to filter and pretty print the output. The tool allows filtering by process, pod, and other fields. To install tetra see the Tetra Installation Guide

To start printing events run:

kubectl logs -n kube-system -l app.kubernetes.io/name=tetragon -c export-stdout -f | tetra getevents -o compact

The tetra CLI is also available inside tetragon container.

kubectl exec -it -n kube-system ds/tetragon -c tetragon -- tetra getevents -o compact

This was used in the quick start and generates a pretty printing of the events, To further filter by a specific binary and/or pod do the following,

kubectl logs -n kube-system -l app.kubernetes.io/name=tetragon -c export-stdout -f | tetra getevents -o compact --processes curl --pod xwing

Will filter and report just the relevant events.

🚀 process default/xwing /usr/bin/curl https://ebpf.io/applications/#tetragon
💥 exit    default/xwing /usr/bin/curl https://ebpf.io/applications/#tetragon 60

`gRPC`

In addition Tetragon can expose a gRPC endpoint listeners may attach to. The gRPC is exposed by default helm install on localhost:54321, but the address can be configured with the --server-address option. This can be set from helm with the tetragon.grpc.address flag or disabled completely if needed with tetragon.grpc.enabled.

helm install tetragon cilium/tetragon -n kube-system --set tetragon.grpc.enabled=true --set tetragon.grpc.address=localhost:54321

An example gRPC endpoint is the Tetra CLI when its not piped JSON output directly,

 kubectl exec -ti -n kube-system ds/tetragon -c tetragon -- tetra getevents -o compact

`2 - Tracing Policy`

Documentation for the TracingPolicy custom resource

Tetragon’s TracingPolicy is a user-configurable Kubernetes custom resource (CR) that allows users to trace arbitrary events in the kernel and optionally define actions to take on a match. Policies consist of a hook point (kprobes, tracepoints, and uprobes are supported), and selectors for in-kernel filtering and specifying actions. For more details, see hook points page and the selectors page.

Caution TracingPolicy allows for powerful, yet low-level configuration and, as such,
requires knowledge about the Linux kernel and containers to avoid unexpected
issues such as TOCTOU bugs.

For the complete custom resource definition (CRD) refer to the YAML file cilium.io_tracingpolicies.yaml. One practical way to explore the CRD is to use kubectl explain against a Kubernetes API server on which it is installed, for example kubectl explain tracingpolicy.spec.kprobes provides field-specific documentation and details on kprobe spec.

Tracing Policies can be loaded and unloaded at runtime in Tetragon, or on startup using flags.

With Kubernetes, you can use kubectl to add and remove a TracingPolicy.
You can use tetra gRPC CLI to add and remove a TracingPolicy.
You can use the --tracing-policy and --tracing-policy-dir flags to statically add policies at startup time, see more in the daemon configuration page.

Hence, even though Tracing Policies are structured as a Kubernetes CR, they can also be used in non-Kubernetes environments using the last two loading methods.

`2.1 - Example`

Learn the basics of Tracing Policy via an example

To discover TracingPolicy, let’s understand via an example that will be explained, part by part, in this document:

Warning This policy is for illustration purposes only and should not be used to
restrict access to certain files. It can be easily bypassed by, for example,
using hard links.

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "fd-install"
spec:
  kprobes:
  - call: "fd_install"
    syscall: false
    args:
    - index: 0
      type: "int"
    - index: 1
      type: "file"
    selectors:
    - matchArgs:
      - index: 1
        operator: "Equal"
        values:
        - "/tmp/tetragon"
      matchActions:
      - action: Sigkill

The policy checks for file descriptors being created, and sends a SIGKILL signal to any process that creates a file descriptor to a file named /tmp/tetragon. We discuss the policy in more detail next.

`Required fields`

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "fd-install"

The first part follows a common pattern among all Cilium Policies or more widely Kubernetes object. It first declares the Kubernetes API used, then the kind of Kubernetes object it is in this API and an arbitrary name for the object that has to comply with Kubernetes naming convention.

`Hook point`

spec:
  kprobes:
  - call: "fd_install"
    syscall: false
    args:
    - index: 0
      type: "int"
    - index: 1
      type: "file"

The beginning of the specification describes the hook point to use. Here we are using a kprobe, hooking on the kernel function fd_install. That’s the kernel function that gets called when a new file descriptor is created. We indicate that it’s not a syscall, but a regular kernel function. We then specify the function arguments, so that Tetragon’s BPF code will extract and optionally perform filtering on them.

See the hook points page for further information on the various hook points available and arguments.

`Selectors`

    selectors:
    - matchArgs:
      - index: 1
        operator: "Equal"
        values:
        - "/tmp/tetragon"
      matchActions:
      - action: Sigkill

Selectors allow you to filter on the events to extract only a subset of the events based on different properties and optionally take an action.

In the example, we filter on the argument at index 1, passing a file struct to the function. Tetragon has the knowledge on how to apply the Equal operator over a Linux kernel file struct and match on the path of the file.

Then we add the Sigkill action, meaning, that any match of the selector should send a SIGKILL signal to the process that initiated the event.

Learn more about the various selectors in the dedicated selectors page.

`Message`

The message field is an optional short message that will be included in the generated event to inform users what is happening.

spec:
  kprobes:
  - call: "fd_install"
    message: "Installing a file descriptor"

`Tags`

Tags are optional fields of a Tracing Policy that are used to categorize generated events. Further reference here: Tags documentation.

`Policy effect`

First, let’s create the /tmp/tetragon file with some content:

echo eBPF! > /tmp/tetragon

You can save the policy in an example.yaml file, compile Tetragon locally, and start Tetragon:

sudo ./tetragon --bpf-lib bpf/objs --tracing-policy example.yaml

(See Quick Kubernetes Install and Quick Local Docker Install for other ways to start Tetragon.)

Note Stop tetragon with Ctrl+C to disable the policy and
remove the BPF programs.

Once the Tetragon starts, you can monitor events using tetra, the tetragon CLI:

./tetra getevents -o compact

Reading the /tmp/tetragon file with cat:

cat /tmp/tetragon

Results in the following events:

🚀 process  /usr/bin/cat /tmp/tetragon
📬 open     /usr/bin/cat /tmp/tetragon
💥 exit     /usr/bin/cat /tmp/tetragon SIGKILL

And the shell where the cat command was performed will return:

Killed

`See more`

For more examples of tracing policies, take a look at the examples/tracingpolicy folder in the Tetragon repository. Also read the following sections on hook points and selectors.

`2.2 - Argument types`

Argument types for data retrieval

Each argument definition specifies data type to be retrieved from kernel argument. The list contains simple POD types and several complex kernel objects that are represented by extracted data type.

List of described data types:

`sint8, int8`

The data type extracts 8-bit signed value.

`uint8`

The data type extracts 8-bit unsigned value.

`sint16, int16`

The data type extracts 16-bit signed value.

`uint16`

The data type extracts 16-bit unsigned value.

`int, sint32, int32`

The data type extracts 32-bit signed value.

`uint32`

The data type extracts 32-bit unsigned value.

`long, sint64, int64`

The data type extracts 64-bit signed value.

`ulong, uint64, size_t`

The data type extracts 64-bit unsigned value.

`string`

The data type extracts string terminated with zero byte.

`skb`

TBD

`sock`

TBD

`char_buf`

TBD

`char_iovec`

TBD

`filename`

TBD

`fd`

TBD

`cred`

TBD

`const_buf`

TBD

`nop`

TBD

`bpf_attr`

TBD

`perf_event`

TBD

`bpf_map`

TBD

`user_namespace`

TBD

`capability`

TBD

`kiocb`

TBD

`iov_iter`

TBD

`load_info`

TBD

`module`

TBD

`syscall64`

TBD

`kernel_cap_t`

TBD

`cap_inheritable`

TBD

`cap_permitted`

TBD

`cap_effective`

TBD

`linux_binprm`

The linux_binprm data type represents kernel struct linux_binprm object and retrieves the struct linux_binprm::file full path.

See general path limitations in path retrieval limits)

`data_loc`

TBD

`net_device`

TBD

`sockaddr`

TBD

`socket`

TBD

`file`

The file data type represents kernel struct file object and retrieves the file’s full path.

See general path limitations in path retrieval limits)

`dentry`

The dentry data type represents kernel struct dentry object retrieves the related path within one mountpoint.

This stems from the fact that with just struct dentry tetragon does not have mount information and does not have enough data to pass through main point within the path.

See general path limitations in path retrieval limits)

`path`

The path data type represents kernel struct path object retrieves the related path.

CautionFull path retrieval is available only on kernels v5.3 and later.
On older kernels, there’s a limit of 256 path components, which means
we can retrieve up to the maximum path length (4096 bytes), but only
with 256 path entries (directories and file name).

`2.3 - Hook points`

Hook points for Tracing Policies and arguments description

Tetragon can hook into the kernel using kprobes and tracepoints, as well as in user-space programs using uprobes. Users can configure these hook points using the correspodning sections of the TracingPolicy specification (.spec). These hook points include arguments and return values that can be specified using the args and returnArg fields as detailed in the following sections.

`Kprobes`

Kprobes enables you to dynamically hook into any kernel function and execute BPF code. Because kernel functions might change across versions, kprobes are highly tied to your kernel version and, thus, might not be portable across different kernels.

Conveniently, you can list all kernel symbols reading the /proc/kallsyms file. For example to search for the write syscall kernel function, you can execute sudo grep sys_write /proc/kallsyms, the output should be similar to this, minus the architecture specific prefixes.

ffffdeb14ea712e0 T __arm64_sys_writev
ffffdeb14ea73010 T ksys_write
ffffdeb14ea73140 T __arm64_sys_write
ffffdeb14eb5a460 t proc_sys_write
ffffdeb15092a700 d _eil_addr___arm64_sys_writev
ffffdeb15092a740 d _eil_addr___arm64_sys_write

You can see that the exact name of the symbol for the write syscall on our kernel version is __arm64_sys_write. Note that on x86_64, the prefix would be __x64_ instead of __arm64_.

CautionKernel symbols contain an architecture specific prefix when they refer to
syscall symbols. To write portable tracing policies, i.e. policies that can run
on multiple architectures, just use the symbol name without the prefix.
For example, instead of writing call: "__arm64_sys_write" or call: "__x64_sys_write", just write call: "sys_write", Tetragon will adapt and add
the correct prefix based on the architecture of the underlying machine. Note
that the event generated as output currently includes the prefix.

In our example, we will explore a kprobe hooking into the fd_install kernel function. The fd_install kernel function is called each time a file descriptor is installed into the file descriptor table of a process, typically referenced within system calls like open or openat. Hooking fd_install has its benefits and limitations, which are out of the scope of this guide.

spec:
  kprobes:
  - call: "fd_install"
    syscall: false

Note Notice the syscall field, specific to a kprobe spec, with default value
false, that indicates whether Tetragon will hook a syscall or just a regular
kernel function. Tetragon needs this information because syscall and kernel
function use a different ABI.

Kprobes calls can be defined independently in different policies, or together in the same Policy. For example, we can define trace multiple kprobes under the same tracing policy:

spec:
  kprobes:
  - call: "sys_read"
    syscall: true
    # [...]
  - call: "sys_write"
    syscall: true
    # [...]

`Tracepoints`

Tracepoints are statically defined in the kernel and have the advantage of being stable across kernel versions and thus more portable than kprobes.

To see the list of tracepoints available on your kernel, you can list them using sudo ls /sys/kernel/debug/tracing/events, the output should be similar to this.

alarmtimer    ext4            iommu           page_pool     sock
avc           fib             ipi             pagemap       spi
block         fib6            irq             percpu        swiotlb
bpf_test_run  filelock        jbd2            power         sync_trace
bpf_trace     filemap         kmem            printk        syscalls
bridge        fs_dax          kvm             pwm           task
btrfs         ftrace          libata          qdisc         tcp
cfg80211      gpio            lock            ras           tegra_apb_dma
cgroup        hda             mctp            raw_syscalls  thermal
clk           hda_controller  mdio            rcu           thermal_power_allocator
cma           hda_intel       migrate         regmap        thermal_pressure
compaction    header_event    mmap            regulator     thp
cpuhp         header_page     mmap_lock       rpm           timer
cros_ec       huge_memory     mmc             rpmh          tlb
dev           hwmon           module          rseq          tls
devfreq       i2c             mptcp           rtc           udp
devlink       i2c_slave       napi            sched         vmscan
dma_fence     initcall        neigh           scmi          wbt
drm           interconnect    net             scsi          workqueue
emulation     io_uring        netlink         signal        writeback
enable        iocost          oom             skb           xdp
error_report  iomap           page_isolation  smbus         xhci-hcd

You can then choose the subsystem that you want to trace, and look the tracepoint you want to use and its format. For example, if we choose the netif_receive_skb tracepoints from the net subsystem, we can read its format with sudo cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/format, the output should be similar to the following.

name: netif_receive_skb
ID: 1398
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:void * skbaddr;   offset:8;       size:8; signed:0;
        field:unsigned int len; offset:16;      size:4; signed:0;
        field:__data_loc char[] name;   offset:20;      size:4; signed:0;

print fmt: "dev=%s skbaddr=%px len=%u", __get_str(name), REC->skbaddr, REC->len

Similarly to kprobes, tracepoints can also hook into system calls. For more details, see the raw_syscalls and syscalls subysystems.

An example of tracepoints TracingPolicy could be the following, observing all syscalls and getting the syscall ID from the argument at index 4:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "raw-syscalls"
spec:
  tracepoints:
  - subsystem: "raw_syscalls"
    event: "sys_enter"
    args:
    - index: 4
      type: "int64"

`Raw Tracepoints`

Raw tracepoints allow same attachment as tracepoints, but allow access to raw tracepoint arguments (compared to standard tracepoints that provide access to arguments through tracepoint format).

Raw tracepoints are attach directly without the perf layer in the middle, so should be bit faster than standard tracepoints.

Each tracepoint is defined in kernel with similar statement like following:

TRACE_EVENT(sched_process_exec,

        TP_PROTO(struct task_struct *p, pid_t old_pid,
                 struct linux_binprm *bprm),
        ...
);

which defines sched:sched_process_exec tracepoint with arguments.

Raw tracepoints are configured by setting with raw spec tag.

apiversion: cilium.io/v1alpha1
kind: tracingpolicy
metadata:
  name: "rawtp"
spec:
  tracepoints:
    - subsystem: "sched"
      event: "sched_process_exec"
      raw: true

We can use raw tracepoint to attach the tracepoint and access above arguments directly.

Following example stores 3rd argument:

apiversion: cilium.io/v1alpha1
kind: tracingpolicy
metadata:
  name: "rawtp"
spec:
  tracepoints:
    - subsystem: "sched"
      event: "sched_process_exec"
      raw: true
      args:
        - index: 2
          type: "linux_binprm"

which is represented as path and permission data in the resulted event:

    "subsys": "sched",
    "event": "sched_process_exec",
    "args": [
      {
        "linux_binprm_arg": {
          "path": "/usr/bin/ls",
          "permission": "rwxr-xr-x"
        }
      }

It’s also possible to resolve data from arguments with Resolve, like:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "rawtp"
spec:
  tracepoints:
    - subsystem: "sched"
      event: "sched_process_exec"
      raw: true
      args:
        - index: 0
          type: "int"
          resolve: "pid"
        - index: 2
          type: "linux_binprm"

which gives you pid field from first task_struct argument, resulting in following event data:

    "subsys": "sched",
    "event": "sched_process_exec",
    "args": [
      {
        "int_arg": 938600
      },
      {
        "linux_binprm_arg": {
          "path": "/usr/bin/ls",
          "permission": "rwxr-xr-x"
        }
      }

`Uprobes`

Uprobes are similar to kprobes, but they allow you to dynamically hook into any user-space function and execute BPF code. Uprobes are also tied to the binary version of the user-space program, so they may not be portable across different versions or architectures.

To use uprobes, you need to specify the path to the executable or library file, and the symbol of the function you want to probe. You can use tools like objdump, nm, or readelf to find the symbol of a function in a binary file. For example, to find the readline symbol in /bin/bash using nm, you can run:

nm -D /bin/bash | grep readline

The output should look similar to this, with a few lines redacted:

[...]
000000000009f2b0 T pcomp_set_readline_variables
0000000000097e40 T posix_readline_initialize
00000000000d5690 T readline
00000000000d52f0 T readline_internal_char
00000000000d42d0 T readline_internal_setup
[...]

You can see in the nm output: first the symbol value, then the symbol type, for the readline symbol T meaning that this symbol is in the text (code) section of the binary, and finally the symbol name. This confirms that the readline symbol is present in the /bin/bash binary and might be a function name that we can hook with a uprobe.

You can define multiple uprobes in the same policy, or in different policies. You can also combine uprobes with kprobes and tracepoints to get a comprehensive view of the system behavior.

Here is an example of a policy that defines an uprobe for the readline function in the bash executable, and applies it to all processes that use the bash binary:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "example-uprobe"
spec:
  uprobes:
  - path: "/bin/bash"
    symbols:
    - "readline"

This example shows how to use uprobes to hook into the readline function running in all the bash shells.

`USDTs`

Tetragon allows to attach and monitor USDT (User Statically-Defined Tracing) probes.

You can display available USDT probes directly by checking ELF binary stapsdt notes with:

readelf -n ..../contrib/tester-progs/usdt
...
  stapsdt              0x00000024       NT_STAPSDT (SystemTap probe descriptors)
    Provider: test
    Name: usdt0
    Location: 0x000000000000115b, Base: 0x0000000000002004, Semaphore: 0x0000000000004030
    Arguments:
  stapsdt              0x00000043       NT_STAPSDT (SystemTap probe descriptors)
    Provider: test
    Name: usdt3
    Location: 0x0000000000001177, Base: 0x0000000000002004, Semaphore: 0x0000000000004032
    Arguments: -4@-12(%rbp) -8@-8(%rbp) 8@%rax
  stapsdt              0x00000084       NT_STAPSDT (SystemTap probe descriptors)
    Provider: test
    Name: usdt12
    Location: 0x000000000000120a, Base: 0x0000000000002004, Semaphore: 0x0000000000004034
    Arguments: -4@-12(%rbp) -4@%edi -8@-8(%rbp) -8@%r8 -4@$5 -8@%r9 8@%rax 8@%r10 -4@$-9 -2@%dx -2@%cx -1@%sil

Or use tetra to generate tracing policy for available USDT probes with:

tetra tracingpolicy generate usdts --binary ..../contrib/tester-progs/usdt

that generates following tracing policy:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  creationTimestamp: "2025-07-28T21:23:36Z"
  name: usdts
spec:
  usdts:
  - message: ""
    name: usdt0
    path: ./contrib/tester-progs/usdt
    provider: test
  - args:
    - index: 0
      label: -4@-12(%rbp)
      maxData: false
      resolve: ""
      returnCopy: false
      sizeArgIndex: 0
      type: int32
    - index: 1
      label: -8@-8(%rbp)
      maxData: false
      resolve: ""
      returnCopy: false
      sizeArgIndex: 0
      type: int64
    - index: 2
      label: 8@%rax
      maxData: false
      resolve: ""
      returnCopy: false
      sizeArgIndex: 0
      type: uint64

Each USDT probe is defined with binary Path plus Provider and Name tags.

The combination of Provider and Name allows to group usdt probes and is usually specific for each application.

Each probe can have up to 5 arguments (which is a tetragon limitation) that follow standard argument definitions used for other probes.

`LSM BPF`

LSM BPF programs allow runtime instrumentation of the LSM hooks by privileged users to implement system-wide MAC (Mandatory Access Control) and Audit policies using eBPF.

List of LSM hooks which can be instrumented can be found in security/security.c.

To verify if BPF LSM is available use the following command:

cat /boot/config-$(uname -r) | grep BPF_LSM

The output should be similar to this if BPF LSM is supported:

CONFIG_BPF_LSM=y

Then, if provided above conditions are met, use this command to check if BPF LSM is enabled:

cat /sys/kernel/security/lsm

The output might look like this:

bpf,lockdown,integrity,apparmor

If the output includes the bpf, than BPF LSM is enabled. Otherwise, you can modify /etc/default/grub:

GRUB_CMDLINE_LINUX="lsm=lockdown,integrity,apparmor,bpf"

Then, update the grub configuration and restart the system.

The provided example of LSM BPF TracingPolicy monitors access to files /etc/passwd and /etc/shadow with /usr/bin/cat executable.

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "lsm-file-open"
spec:
  lsmhooks:
  - hook: "file_open"
    args:
      - index: 0
        type: "file"
    selectors:
    - matchBinaries:
      - operator: "In"
        values:
        - "/usr/bin/cat"
      matchArgs:
      - index: 0
        operator: "Equal"
        values:
        - "/etc/passwd"
        - "/etc/shadow"

`Arguments`

Kprobes, uprobes and tracepoints all share a needed arguments fields called args. It is a list of arguments to include in the trace output. Tetragon’s BPF code requires information about the types of arguments to properly read, print and filter on its arguments. This information needs to be provided by the user under the args section. For the available types, check the TracingPolicy CRD.

Following our example, here is the part that defines the arguments:

args:
- index: 0
  type: "int"
- index: 1
  type: "file"

Each argument can optionally include a ’label’ parameter, which will be included in the output. This can be used to annotate the arguments to help with understanding and processing the output. As an example, here is the same definition, with an appropriate label on the int argument:

args:
- index: 0
  type: "int"
  label: "FD"
- index: 1
  type: "file"

To properly read and hook onto the fd_install(unsigned int fd, struct file *file) function, the YAML snippet above tells the BPF code that the first argument is an int and the second argument is a file, which is the struct file of the kernel. In this way, the BPF code and its printer can properly collect and print the arguments.

These types are sorted by the index field, where you can specify the order. The indexing starts with 0. So, index: 0 means, this is going to be the first argument of the function, index: 1 means this is going to be the second argument of the function, etc.

Note that for some args types, char_buf and char_iovec, there are additional fields named returnCopy and sizeArgIndex available:

returnCopy indicates that the corresponding argument should be read later (when the kretprobe for the symbol is triggered) because it might not be populated when the kprobe is triggered at the entrance of the function. For example, a buffer supplied to read(2) won’t have content until kretprobe is triggered.
sizeArgIndex indicates the (1-based, see warning below) index of the arguments that represents the size of the char_buf or iovec. For example, for write(2), the third argument, size_t count is the number of char element that we can read from the const void *buf pointer from the second argument. Similarly, if we would like to capture the __x64_sys_writev(long, iovec *, vlen) syscall, then iovec has a size of vlen, which is going to be the third argument.

Caution sizeArgIndex is inconsistent at the moment and does not take the index, but
the number of the index (or index + 1). So if the size is the third argument,
index 2, the value should be 3.

These flags can be combined, see the example below.

- call: "sys_write"
  syscall: true
  args:
  - index: 0
    type: "int"
  - index: 1
    type: "char_buf"
    returnCopy: true
    sizeArgIndex: 3
  - index: 2
    type: "size_t"

Note that you can specify which arguments you would like to print from a specific syscall. For example if you don’t care about the file descriptor, which is the first int argument with index: 0 and just want the char_buf, what is written, then you can leave this section out and just define:

args:
- index: 1
  type: "char_buf"
  returnCopy: true
  sizeArgIndex: 3
- index: 2
  type: "size_t"

This tells the printer to skip printing the int arg because it’s not useful.

For char_buf type up to the 4096 bytes are stored. Data with bigger size are cut and returned as truncated bytes.

You can specify maxData flag for char_buf type to read maximum possible data (currently 327360 bytes), like:

args:
- index: 1
  type: "char_buf"
  maxData: true
  sizeArgIndex: 3
- index: 2
  type: "size_t"

This field is only used for char_buff data. When this value is false (default), the bpf program will fetch at most 4096 bytes. In later kernels (>=5.4) tetragon supports fetching up to 327360 bytes if this flag is turned on.

The maxData flag does not work with returnCopy flag at the moment, so it’s usable only for syscalls/functions that do not require return probe to read the data.

`Attribute resolution`

CautionFor kprobes, available only from kernel version 5.4.
For LSM, available only from kernel version 5.7.
For uprobes, this functionality is not supported.

This functionality allows you to dynamically extract specific attributes from kernel structures passed as parameters to Kprobes and LSM hooks. For example, when using the bprm_check_security LSM hook, you can access and display attributes from the parameter struct linux_binprm *bprm at index 0. This parameter contains many informations you may want to access. With the resolve flag, you can easily access fields such as mm.owner.real_parent.comm, which provides the parent process comm.

`How it works`

The following tracing policy demonstrates how to use the resolve flag to extract the parent process’s comm during the execution of a binary:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "lsm"
spec:
  lsmhooks:
  - hook: "bprm_check_security"
    args:
    - index: 0 # struct linux_binprm *bprm
      type: "string"
      resolve: "mm.owner.real_parent.comm"
    selectors:
    - matchActions:
      - action: Post

index flag : The parameter at index 0 is a pointer to the struct linux_binprm.
resolve flag : Using the resolve flag, the policy extracts the mm.owner.real_parent.comm field, representing the parent process’s comm.

CautionThis feature requires you to read the kernel structure definitions to find what you’re
looking for in the hook parameter attributes. For instance, if you want to have a look
at what is available inside struct linux_binprm, take a look at its definition in
include/linux/binfmts.h
Some structures are dynamic. This means that they may change at runtime.

Tetragon can also handle some structures such as struct file or struct path and a few others. This means you can also extract the whole struct, if it is available in the attributes of the parameter, and set the type with the correct type like this :

...
  lsmhooks:
  - hook: "bprm_check_security"
    args:
      - index: 0 # struct linux_binprm *bprm
        type: "file"
        resolve: "file"
...

Or

...
  lsmhooks:
  - hook: "bprm_check_security"
    args:
      - index: 0
        type: "path"
        resolve: "file.f_path"
...

The following tracing policy demonstrates how to use the resolve flag on nested pointers. For exemple, the hook security_inode_copy_up (defined in security/security.c) takes two parameters: struct dentry *src and struct cred **new. The resolve flag allows you to access fields within these nested structures transparently.

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "lsm"
spec:
  kprobes:
  - call: "security_inode_copy_up"
    syscall: false
    args:
    - index: 0 # struct dentry *src
      type: "int64"
    - index: 1 # struct cred **new
      type: "int"
      resolve: "user.uid.val"
    selectors:
    - matchActions:
      - action: Post

`Return values`

A TracingPolicy spec can specify that the return value should be reported in the tracing output. To do this, the return parameter of the call needs to be set to true, and the returnArg parameter needs to be set to specify the type of the return argument. For example:

- call: "sk_alloc"
  syscall: false
  return: true
  args:
  - index: 1
    type: int
    label: "family"
  returnArg:
    index: 0
    type: sock

In this case, the sk_alloc hook is specified to return a value of type sock (a pointer to a struct sock). Whenever the sk_alloc hook is hit, not only will it report the family parameter in index 1, it will also report the socket that was created.

`Return values for socket tracking`

A unique feature of a sock being returned from a hook such as sk_alloc is that the socket it refers to can be tracked. Most networking hooks in the network stack are run in a context that is not that of the process that owns the socket for which the actions relate; this is because networking happens asynchronously and not entirely in-line with the process. The sk_alloc hook does, however, occur in the context of the process, such that the task, the PID, and the TGID are of the process that requested that the socket was created.

Specifying socket tracking tells Tetragon to store a mapping between the socket and the process’ PID and TGID; and to use that mapping when it sees the socket in a sock argument in another hook to replace the PID and TGID of the context with the process that actually owns the socket. This can be done by adding a returnArgAction to the call. Available actions are TrackSock and UntrackSock. See TrackSock and UntrackSock.

- call: "sk_alloc"
  syscall: false
  return: true
  args:
  - index: 1
    type: int
    label: "family"
  returnArg:
    index: 0
    type: sock
  returnArgAction: TrackSock

Socket tracking is only available on kernels >=5.3.

`Lists`

It’s possible to define list of functions and use it in the kprobe’s call field.

Following example traces all sys_dup[23] syscalls.

spec:
  lists:
  - name: "dups"
    type: "syscalls"
    values:
    - "sys_dup"
    - "sys_dup2"
    - "sys_dup3"
  kprobes:
  - call: "list:dups"

It is basically a shortcut for following policy:

spec:
  kprobes:
  - call: "sys_dup"
    syscall: true
  - call: "sys_dup2"
    syscall: true
  - call: "sys_dup3"
    syscall: true

As shown in subsequent examples, its main benefit is allowing a single definition for calls that have the same filters.

The list is defined under lists field with arbitrary values for name and values fields.

spec:
  lists:
  - name: "dups"
    type: "syscalls"
    values:
    - "sys_dup"
    - "sys_dup2"
    - "sys_dup3"
    ...

It’s possible to define multiple lists.

spec:
  lists:
  - name: "dups"
    type: "syscalls"
    values:
    - "sys_dup"
    - "sys_dup2"
    - "sys_dup3"
    name: "another"
    - "sys_open"
    - "sys_close"

Syscalls specified with sys_ prefix are translated to their 64 bit equivalent function names.

It’s possible to specify a syscall for an alternative ABI by using the ABI name as a prefix. For example:

spec:
  lists:
  - name: "dups"
    type: "syscalls"
    values:
    - "sys_dup"
    - "i386/sys_dup"
    name: "another"
    - "sys_open"
    - "sys_close"

Specific list can be referenced in kprobe’s call field with "list:NAME" value.

spec:
  lists:
  - name: "dups"
  ...

  kprobes:
  - call: "list:dups"

The kprobe definition creates a kprobe for each item in the list and shares the rest of the config specified for kprobe.

List can also specify type field that implies extra checks on the values (like for syscall type) or denote that the list is generated automatically (see below). User must specify syscall type for list with syscall functions. Also syscall functions can’t be mixed with regular functions in the list.

The additional selector configuration is shared with all functions in the list. In following example we create 3 kprobes that share the same pid filter.

spec:
  lists:
  - name: "dups"
    type: "syscalls"
    values:
    - "sys_dup"
    - "sys_dup2"
    - "sys_dup3"
  kprobes:
  - call: "list:dups"
    selectors:
    - matchPIDs:
      - operator: In
        followForks: true
        isNamespacePID: false
        values:
        - 12345

It’s possible to use argument filter together with the list.

It’s important to understand that the argument will be retrieved by using the specified argument type for all the functions in the list.

Following example adds argument filter for first argument on all functions in dups list to match value 9999.

spec:
  lists:
  - name: "dups"
    type: "syscalls"
    values:
    - "sys_dup"
    - "sys_dup2"
    - "sys_dup3"
  kprobes:
  - call: "list:dups"
    args:
    - index: 0
      type: int
    selectors:
    - matchArgs:
      - index: 0
        operator: "Equal"
        values:
        - 9999

There are two additional special types of generated lists.

The generated_syscalls type of list that generates list with all possible syscalls on the system.

Following example traces all syscalls for /usr/bin/kill binary.

spec:
  lists:
  - name: "all-syscalls"
    type: "generated_syscalls"
  kprobes:
  - call: "list:all-syscalls"
    selectors:
    - matchBinaries:
      - operator: "In"
        values:
        - "/usr/bin/kill"

The generated_ftrace type of list that generates functions from ftrace available_filter_functions file with specified filter. The filter is specified with pattern field and expects regular expression.

Following example traces all kernel ksys_* functions for /usr/bin/kill binary.

spec:
  lists:
  - name: "ksys"
    type: "generated_ftrace"
    pattern: "^ksys_*"
  kprobes:
  - call: "list:ksys"
    selectors:
    - matchBinaries:
      - operator: "In"
        values:
        - "/usr/bin/kill"

Note that if syscall list is used in selector with InMap operator, the argument type needs to be syscall64, like.

spec:
  lists:
  - name: "dups"
    type: "syscalls"
    values:
    - "sys_dup"
    - "i386/sys_dup"
  tracepoints:
  - subsystem: "raw_syscalls"
    event: "sys_enter"
    args:
    - index: 4
      type: "syscall64"
    selectors:
    - matchArgs:
      - index: 0
        operator: "InMap"
        values:
        - "list:dups"

`2.4 - Options`

Pass options to hook

It’s possible to pass options through spec file as an array of name and value pairs:

spec:
  options:
    - name: "option-1"
      value: "True"
    - name: "option-2"
      value: "10"

Options array is passed and processed by each hook used in the spec file that supports options. At the moment it’s availabe for kprobe and uprobe hooks.

Kprobe Options: options for kprobe hooks.
Uprobe Options: options for uprobe hooks.

`Kprobe options`

disable-kprobe-multi: disable kprobe multi link

`disable-kprobe-multi`

This option disables kprobe multi link interface for all the kprobes defined in the spec file. If enabled, all the defined kprobes will be atached through standard kprobe interface. It stays enabled for another spec file without this option.

It takes boolean as value, by default it’s false.

Example:

  options:
    - name: "disable-kprobe-multi"
      value: "1"

`Uprobe options`

disable-uprobe-multi: disable uprobe multi link

`disable-uprobe-multi`

This option disables uprobe multi link interface for all the uprobes defined in the spec file. If enabled, all the defined uprobes will be atached through standard uprobe interface. It stays enabled for another spec file without this option.

It takes boolean as value, by default it’s false.

Example:

  options:
    - name: "disable-uprobe-multi"
      value: "1"

`2.5 - Selectors`

Perform in-kernel BPF filtering and actions on events

Selectors enable per-hook in-kernel BPF filtering and actions. Each selector defines a set of filters as well as a set of (optional) actions to be performed if the selector filters match. Each hook can contain up to 5 selectors. If no selectors are defined on a hook, the default action (Post, i.e., post an event) will be used.

Each selector comprises a set of filters:

matchArgs: filter on the value of arguments.
matchReturnArgs: filter on the return value.
matchPIDs: filter on PID.
matchBinaries: filter on binary path.
matchNamespaces: filter on Linux namespaces.
matchCapabilities: filter on Linux capabilities.
matchNamespaceChanges: filter on Linux namespaces changes.
matchCapabilityChanges: filter on Linux capabilities changes.

And a set of actions that will be performed if the specified filters match:

matchActions: apply an action on selector matching.
matchReturnActions: apply an action on return selector matching.

For a selector to match, all of its filters must match (AND operator). If a selector defines no filters, it matches. If multiple selectors are defined in a hook, only the action defined in the first matching selector will be applied (short-circuited OR). If a selector defines no action, the default action (Post) is applied.

For example, the following policy will generate events when the sys_mount system call is executed by binaries other than /usr/bin/mount.

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "mount-example"
spec:
  kprobes:
  - call: "sys_mount"
    syscall: true
    selectors:
      # first selector
      - matchBinaries:
        - operator: In
          values:
          - "/usr/bin/mount"
        matchActions:
        - action: NoPost
      # second selector
      - matchActions:
        - action: Post

`Arguments filter`

Arguments filters can be specified under the matchArgs field and provide filtering based on the value of the function’s argument.

You can specify the argument either in the index or in args field. The index field denotes the argument position within the function arguments, while the args field denotes the arguments position within the spec arguments. Both fields are zero based (1st argument has zero value). The args field (if defined) takes precedence over index field.

The args field can support multiple arguments. Currently, the only operator that supports more than 1 argument is the CapabilitiesGained operator.

In the next example, a selector is defined with a matchArgs filter that tells the BPF code to process only the function call for which the second argument, index equal to 1, concerns the file under the path /etc/passwd or /etc/shadow. It’s using the operator Equal to match against the value of the argument.

Note that conveniently, we can match against a path directly when the argument is of type file.

selectors:
- matchArgs:
  - index: 1
    operator: "Equal"
    values:
    - "/etc/passwd"
    - "/etc/shadow"

In the next example, a selector is defined with a matchArgs filter that tells the BPF code to process only the function call for which the second spec argument, args equals to [1] (which represents 1st function argument, index equals to 0), has value 0xcoffee.

- args:
  - index: 2
    type: "int"
    label: "arg0-index2"
  - index: 0
    type: "int"
    label: "arg1-index1"
  - index: 1
    type: "int"
    label: "arg2-index0"
  selectors:
  - matchArgs:
    - args: [1]
      operator: "Equal"
      values:
      - "0xcoffee"

Note that you can mix index and arg fields within matchArgs selector definitions.

The available operators for matchArgs are:

Equal
NotEqual
Prefix
Postfix
Mask

Further examples

In the previous example, we used the operator Equal, but we can also use the Prefix operator and match against all files under /etc with:

selectors:
- matchArgs:
  - index: 1
    operator: "Prefix"
    values:
    - "/etc"

In this situation, an event will be created every time a process tries to access a file under /etc.

Although it makes less sense, you can also match over the first argument, to only detect events that will use the file descriptor 4, which is usually the first that come afters stdin, stdout and stderr in process. And combine that with the previous example.

- matchArgs:
  - index: 0
    operator: "Equal"
    values:
    - "3"
  - index: 1
    operator: "Prefix"
    values:
    - "/etc"

`Return args filter`

Arguments filters can be specified under the returnMatchArgs field and provide filtering based on the value of the function return value. It allows you to filter on the return value, thus success, error or value returned by a kernel call.

matchReturnArgs:
- operator: "NotEqual"
  values:
  - 0

The available operators for matchReturnArgs are:

Equal
NotEqual
Prefix
Postfix

A use case for this would be to detect the failed access to certain files, like /etc/shadow. Doing cat /etc/shadow will use a openat syscall that will returns -1 for a failed attempt with an unprivileged user.

`PIDs filter`

PIDs filters can be specified under the matchPIDs field and provide filtering based on the value of host pid of the process. For example, the following matchPIDs filter tells the BPF code that observe only hooks for which the host PID is equal to either pid1 or pid2 or pid3:

- matchPIDs:
  - operator: "In"
    followForks: true
    values:
    - "pid1"
    - "pid2"
    - "pid3"

The available operators for matchPIDs are:

In
NotIn

Further examples

Another example can be to collect all processes not associated with a container’s init PID, which is equal to 1. In this way, we are able to detect if there was a kubectl exec performed inside a container because processes created by kubectl exec are not children of PID 1.

- matchPIDs:
  - operator: NotIn
    followForks: false
    isNamespacePID: true
    values:
    - 1

`Binaries filter`

Binary filters can be specified under the matchBinaries field and provide filtering based on the value of a certain binary name. For example, the following matchBinaries selector tells the BPF code to process only system calls and kernel functions that are coming from cat or tail.

- matchBinaries:
  - operator: "In"
    values:
    - "/usr/bin/cat"
    - "/usr/bin/tail"

The available operators for matchBinaries are:

In
NotIn
Prefix
NotPrefix
Postfix
NotPostfix

The values field has to be a map of strings. The default behaviour is followForks: true, so all the child processes are followed.

`Follow children`

the matchBinaries filter can be configured to also apply to children of matching processes. To do this, set followChildren to true. For example:

- matchBinaries:
  - operator: "In"
    values:
    - "/usr/sbin/sshd"
    followChildren: true

There are a number of limitations when using followChildren:

Children created before the policy was installed will not be matched
The number of matchBinaries sections with followChildren: true cannot exceed 64.
Operators other than In are not supported.

Further examples

One example can be to monitor all the sys_write system calls which are coming from the /usr/sbin/sshd binary and its child processes and writing to stdin/stdout/stderr.

This is how we can monitor what was written to the console by different users during different ssh sessions. The matchBinaries selector in this case is the following:

- matchBinaries:
  - operator: "In"
    values:
    - "/usr/sbin/sshd"

while the whole kprobe call is the following:

- call: "sys_write"
  syscall: true
  args:
  - index: 0
    type: "int"
  - index: 1
    type: "char_buf"
    sizeArgIndex: 3
  - index: 2
    type: "size_t"
  selectors:
  # match to /sbin/sshd
  - matchBinaries:
    - operator: "In"
      values:
      - "/usr/sbin/sshd"
  # match to stdin/stdout/stderr
    matchArgs:
    - index: 0
      operator: "Equal"
      values:
      - "1"
      - "2"
      - "3"

`Namespaces filter`

Namespaces filters can be specified under the matchNamespaces field and provide filtering of calls based on Linux namespace. You can specify the namespace inode or use the special host_ns keyword, see the example and description for more information.

An example syntax is:

- matchNamespaces:
  - namespace: Pid
    operator: In
    values:
    - "4026531836"
    - "4026531835"

This will match if: [Pid namespace is 4026531836] OR [Pid namespace is 4026531835]

namespace can be: Uts, Ipc, Mnt, Pid, PidForChildren, Net, Cgroup, or User. Time and TimeForChildren are also available in Linux >= 5.6.
operator can be In or NotIn
values can be raw numeric values (i.e. obtained from lsns) or "host_ns" which will automatically be translated to the appropriate value.

Limitations

We can have up to 4 values. These can be both numeric and host_ns inside a single namespace.
We can have up to 4 namespace values under matchNamespaces in Linux kernel < 5.3. In Linux >= 5.3 we can have up to 10 values (i.e. the maximum number of namespaces that modern kernels provide).

Further examples

We can have multiple namespace filters:

selectors:
- matchNamespaces:
  - namespace: Pid
    operator: In
    values:
    - "4026531836"
    - "4026531835"
  - namespace: Mnt
    operator: In
    values:
    - "4026531833"
    - "4026531834"

This will match if: ([Pid namespace is 4026531836] OR [Pid namespace is 4026531835]) AND ([Mnt namespace is 4026531833] OR [Mnt namespace is 4026531834])

Use cases examples

Generate a kprobe event if /etc/shadow was opened by /bin/cat which either had host Net or Mnt namespace access

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "example_ns_1"
spec:
  kprobes:
    - call: "fd_install"
      syscall: false
      args:
        - index: 0
          type: int
        - index: 1
          type: "file"
      selectors:
        - matchBinaries:
          - operator: "In"
            values:
            - "/bin/cat"
          matchArgs:
          - index: 1
            operator: "Equal"
            values:
            - "/etc/shadow"
          matchNamespaces:
          - namespace: Mnt
            operator: In
            values:
            - "host_ns"
        - matchBinaries:
          - operator: "In"
            values:
            - "/bin/cat"
          matchArgs:
          - index: 1
            operator: "Equal"
            values:
            - "/etc/shadow"
          matchNamespaces:
          - namespace: Net
            operator: In
            values:
            - "host_ns"

This example has 2 selectors. Note that each selector starts with -.

Selector 1:

        - matchBinaries:
          - operator: "In"
            values:
            - "/bin/cat"
          matchArgs:
          - index: 1
            operator: "Equal"
            values:
            - "/etc/shadow"
          matchNamespaces:
          - namespace: Mnt
            operator: In
            values:
            - "host_ns"

Selector 2:

        - matchBinaries:
          - operator: "In"
            values:
            - "/bin/cat"
          matchArgs:
          - index: 1
            operator: "Equal"
            values:
            - "/etc/shadow"
          matchNamespaces:
          - namespace: Net
            operator: In
            values:
            - "host_ns"

We have [Selector1 OR Selector2]. Inside each selector we have filters. Both selectors have 3 filters (i.e. matchBinaries, matchArgs, and matchNamespaces) with different arguments. Adding a - in the beginning of a filter will result in a new selector.

So the previous CRD will match if:

[binary == /bin/cat AND arg1 == /etc/shadow AND MntNs == host] OR [binary == /bin/cat AND arg1 == /etc/shadow AND NetNs is host]

We can modify the previous example as follows:

Generate a kprobe event if /etc/shadow was opened by /bin/cat which has host Net and Mnt namespace access

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "example_ns_2"
spec:
  kprobes:
    - call: "fd_install"
      syscall: false
      args:
        - index: 0
          type: int
        - index: 1
          type: "file"
      selectors:
        - matchBinaries:
          - operator: "In"
            values:
            - "/bin/cat"
          matchArgs:
          - index: 1
            operator: "Equal"
            values:
            - "/etc/shadow"
          matchNamespaces:
          - namespace: Mnt
            operator: In
            values:
            - "host_ns"
          - namespace: Net
            operator: In
            values:
            - "host_ns"

Here we have a single selector. This CRD will match if:

[binary == /bin/cat AND arg1 == /etc/shadow AND (MntNs == host AND NetNs == host) ]

`Capabilities filter`

Capabilities filters can be specified under the matchCapabilities field and provide filtering of calls based on Linux capabilities in the specific sets.

An example syntax is:

- matchCapabilities:
  - type: Effective
    operator: In
    values:
    - "CAP_CHOWN"
    - "CAP_NET_RAW"

This will match if: [Effective capabilities contain CAP_CHOWN] OR [Effective capabilities contain CAP_NET_RAW]

type can be: Effective, Inheritable, or Permitted.
operator can be In or NotIn
values can be any supported capability. A list of all supported capabilities can be found in /usr/include/linux/capability.h.

Limitations

There is no limit in the number of capabilities listed under values.
Only one type field can be specified under matchCapabilities.

`Namespace changes filter`

Namespace changes filter can be specified under the matchNamespaceChanges field and provide filtering based on calls that are changing Linux namespaces. This filter can be useful to track execution of code in a new namespace or even container escapes that change their namespaces.

For instance, if an unprivileged process creates a new user namespace, it gains full privileges within that namespace. This grants the process the ability to perform some privileged operations within the context of this new namespace that would otherwise only be available to privileged root user. As a result, such filter is useful to track namespace creation, which can be abused by untrusted processes.

To keep track of the changes, when a process_exec happens, the namespaces of the process are recorded and these are compared with the current namespaces on the event with a matchNamespaceChanges filter.

matchNamespaceChanges:
- operator: In
  values:
  - "Mnt"

The unshare command, or executing in the host namespace using nsenter can be used to test this feature. See a demonstration example of this feature.

`Capability changes filter`

Capability changes filter can be specified under the matchCapabilityChanges field and provide filtering based on calls that are changing Linux capabilities.

To keep track of the changes, when a process_exec happens, the capabilities of the process are recorded and these are compared with the current capabilities on the event with a matchCapabilityChanges filter.

matchCapabilityChanges:
- type: Effective
  operator: In
  isNamespaceCapability: false
  values:
  - "CAP_SETUID"

See a demonstration example of this feature.

`Actions filter`

Actions filters are a list of actions that execute when an appropriate selector matches. They are defined under matchActions and currently, the following action types are supported:

Warning The FollowFD and related (UnfollowFD, CopyFD) actions have been deprecated due to being unsafe and
are planned to be removed in version 1.5.

Note Sigkill, Override, FollowFD, UnfollowFD, CopyFD, Post,
TrackSock and UntrackSock are
executed directly in the kernel BPF code while GetUrl and DnsLookup are
happening in userspace after the reception of events.

`Sigkill action`

Sigkill action terminates synchronously the process that made the call that matches the appropriate selectors from the kernel. In the example below, every sys_write system call with a PID not equal to 1 or 0 attempting to write to /etc/passwd will be terminated. Indeed when using kubectl exec, a new process is spawned in the container PID namespace and is not a child of PID 1.

- call: "sys_write"
  syscall: true
  args:
  - index: 0
    type: "fd"
  - index: 1
    type: "char_buf"
    sizeArgIndex: 3
  - index: 2
    type: "size_t"
  selectors:
  - matchPIDs:
    - operator: NotIn
      followForks: true
      isNamespacePID: true
      values:
      - 0
      - 1
    matchArgs:
    - index: 0
      operator: "Prefix"
      values:
      - "/etc/passwd"
    matchActions:
    - action: Sigkill

Caution Please consult the Enforcement section if you plan to use
this action for enforcement.

`Signal action`

Signal action sends specified signal to current process. The signal number is specified with argSig value.

Following example is equivalent to the Sigkill action example above. The difference is to use the signal action with SIGKILL(9) signal.

- call: "sys_write"
  syscall: true
  args:
  - index: 0
    type: "fd"
  - index: 1
    type: "char_buf"
    sizeArgIndex: 3
  - index: 2
    type: "size_t"
  selectors:
  - matchPIDs:
    - operator: NotIn
      followForks: true
      isNamespacePID: true
      values:
      - 0
      - 1
    matchArgs:
    - index: 0
      operator: "Prefix"
      values:
      - "/etc/passwd"
    matchActions:
    - action: Signal
      argSig: 9

Caution Please consult the Enforcement section if you plan to use
this action for enforcement.

`Override action`

Override action allows to modify the return value of call. While Sigkill will terminate the entire process responsible for making the call, Override will run in place of the original kprobed function and return the value specified in the argError field. It’s then up to the code path or the user space process handling the returned value to whether stop or proceed with the execution.

For example, you can create a TracingPolicy that intercepts sys_symlinkat and will make it return -1 every time the first argument is equal to the string /etc/passwd:

kprobes:
- call: "sys_symlinkat"
  syscall: true
  args:
  - index: 0
    type: "string"
  - index: 1
    type: "int"
  - index: 2
    type: "string"
  selectors:
  - matchArgs:
    - index: 0
      operator: "Equal"
      values:
      - "/etc/passwd\0"
    matchActions:
    - action: Override
      argError: -1

NoteOverride uses the kernel error injection framework and is only available
on kernels compiled with CONFIG_BPF_KPROBE_OVERRIDE configuration option.
Overriding system calls is the primary use case, but there are other kernel
functions that support error injections too. These functions are annotated
with ALLOW_ERROR_INJECTION() in the kernel source, and can be identified by
reading the file /sys/kernel/debug/error_injection/list.
Starting from kernel version 5.7 overriding security_ hooks is also possible.

Caution For kernel developers: if you want to override your kernel functions then
ensure they properly follow the Error Injectable Functions guide.

`FollowFD action`

Warning The FollowFD and related (UnfollowFD, CopyFD) actions have been deprecated due to being unsafe and
are planned to be removed in version 1.5.

The FollowFD action allows to create a mapping using a BPF map between file descriptors and filenames. After its creation, the mapping can be maintained through UnfollowFD and CopyFD actions. Note that proper maintenance of the mapping is up to the tracing policy writer.

FollowFD is typically used at hook points where a file descriptor and its associated filename appear together. The kernel function fd_install is a good example.

The fd_install kernel function is called each time a file descriptor must be installed into the file descriptor table of a process, typically referenced within system calls like open or openat. It is a good place for tracking file descriptor and filename matching.

Let’s take a look at the following example:

- call: "fd_install"
  syscall: false
  args:
  - index: 0
    type: int
  - index: 1
    type: "file"
  selectors:
  - matchPIDs:
      # [...]
    matchArgs:
      # [...]
    matchActions:
    - action: FollowFD
      argFd: 0
      argName: 1

This action uses the dedicated argFd and argName fields to get respectively the index of the file descriptor argument and the index of the name argument in the call.

While the mapping between the file descriptor and filename remains in place (that is, between FollowFD and UnfollowFD for the same file descriptor) tracing policies may refer to filenames instead of file descriptors. This offers greater convenience and allows more functionality to reside inside the kernel, thereby reducing overhead.

For instance, assume that you want to prevent writes into file /etc/passwd. The system call sys_write only receives a file descriptor, not a filename, as argument. Yet with a bracketing pair of FollowFD and UnfollowFD actions in place the tracing policy that hooks into sys_write can nevertheless refer to the filename /etc/passwd, if it also marks the relevant argument as of type fd.

The following example combines actions FollowFD and UnfollowFD as well as an argument of type fd to such effect:

kprobes:
- call: "fd_install"
  syscall: false
  args:
  - index: 0
    type: int
  - index: 1
    type: "file"
  selectors:
  - matchArgs:
    - index: 1
      operator: "Equal"
      values:
      - "/tmp/passwd"
    matchActions:
    - action: FollowFD
      argFd: 0
      argName: 1
- call: "sys_write"
  syscall: true
  args:
  - index: 0
    type: "fd"
  - index: 1
    type: "char_buf"
    sizeArgIndex: 3
  - index: 2
    type: "size_t"
  selectors:
  - matchArgs:
    - index: 0
      operator: "Equal"
      values:
      - "/tmp/passwd"
    matchActions:
    - action: Sigkill
- call: "sys_close"
  syscall: true
  args:
  - index: 0
     type: "int"
  selectors:
  - matchActions:
    - action: UnfollowFD
      argFd: 0
      argName: 0

`UnfollowFD action`

Warning The FollowFD and related (UnfollowFD, CopyFD) actions have been deprecated due to being unsafe and
are planned to be removed in version 1.5.

The UnfollowFD action takes a file descriptor from a system call and deletes the corresponding entry from the BPF map, where it was put under the FollowFD action. It is typically used at hooks points where the scope of association between a file descriptor and a filename ends. The system call sys_close is a good example.

Let’s take a look at the following example:

- call: "sys_close"
  syscall: true
  args:
  - index: 0
    type: "int"
  selectors:
  - matchPIDs:
    - operator: NotIn
      followForks: true
      isNamespacePID: true
      values:
      - 0
      - 1
    matchActions:
    - action: UnfollowFD
      argFd: 0

Similar to the FollowFD action, the index of the file descriptor is described under argFd:

matchActions:
- action: UnfollowFD
  argFd: 0

In this example, argFD is 0. So, the argument from the sys_close system call at index: 0 will be deleted from the BPF map whenever a sys_close is executed.

- index: 0
  type: "int"

Caution Whenever we would like to follow a file descriptor with a FollowFD block,
there should be a matching UnfollowFD block, otherwise the BPF map will be
broken.

`CopyFD action`

Warning The FollowFD and related (UnfollowFD, CopyFD) actions have been deprecated due to being unsafe and
are planned to be removed in version 1.5.

The CopyFD action is specific to duplication of file descriptor use cases. Similary to FollowFD, it takes an argFd and argName arguments. It can typically be used tracking the dup, dup2 or dup3 syscalls.

See the following example for illustration:

- call: "sys_dup2"
  syscall: true
  args:
  - index: 0
    type: "fd"
  - index: 1
    type: "int"
  selectors:
  - matchPIDs:
    # [...]
    matchActions:
    - action: CopyFD
      argFd: 0
      argName: 1
- call: "sys_dup3"
  syscall: true
  args:
  - index: 0
    type: "fd"
  - index: 1
    type: "int"
  - index: 2
    type: "int"
  selectors:
  - matchPIDs:
    # [...]
    matchActions:
    - action: CopyFD
      argFd: 0
      argName: 1

`GetUrl action`

The GetUrl action can be used to perform a remote interaction such as triggering Thinkst canaries or any system that can be triggered via an URL request. It uses the argUrl field to specify the URL to request using GET method.

matchActions:
- action: GetUrl
  argUrl: http://ebpf.io

`DnsLookup action`

The DnsLookup action can be used to perform a remote interaction such as triggering Thinkst canaries or any system that can be triggered via an DNS entry request. It uses the argFqdn field to specify the domain to lookup.

matchActions:
- action: DnsLookup
  argFqdn: ebpf.io

`Post action`

The Post action allows an event to be transmitted to the agent, from kernelspace to userspace. By default, all TracingPolicy hook will create an event with the Post action except in those situations:

a NoPost action was specified in a matchActions;
a rate-limiting parameter is in place, see details below.

This action allows you to specify parameters for the Post action.

`Rate limiting`

Post takes the rateLimit parameter with a time value. This value defaults to seconds, but post-fixing ’m’ or ‘h’ will cause the value to be interpreted in minutes or hours. When this parameter is specified for an action, that action will check if the same action has fired, for the same thread, within the time window, with the same inspected arguments. (Only the first 40 bytes of each inspected argument is used in the matching. Only supported on kernels v5.3 onwards.)

For example, you can specify a selector to only generate an event every 5 minutes with adding the following action and its paramater:

matchActions:
- action: Post
  rateLimit: 5m

By default, the rate limiting is applied per thread, meaning that only repeated actions by the same thread will be rate limited. This can be expanded to all threads for a process by specifying a rateLimitScope with value “process”; or can be expanded to all processes by specifying the same with the value “global”.

`Stack traces`

Post takes the kernelStackTrace parameter, when turned to true (by default to false) it enables dump of the kernel stack trace to the hook point in kprobes events. To dump user space stack trace set userStackTrace parameter to true. For example, the following kprobe hook can be used to retrieve the kernel stack to kfree_skb_reason, the function called in the kernel to drop kernel socket buffers.

kprobes:
  - call: kfree_skb_reason
    selectors:
    - matchActions:
      - action: Post
        kernelStackTrace: true
        userStackTrace: true

CautionBy default Tetragon does not expose the linear addresses from kernel space or
user space, you need to enable the flag --expose-stack-addresses to get the
addresses along the rest.
Note that the Tetragon agent is using its privilege to read the kernel symbols
and their address. Being able to retrieve kernel symbols address can be used to
break kernel address space layout randomization (KASLR) so only privileged users
should be able to enable this feature and read events containing stack traces.
The same thing we can say about retrieving address for user mode processes.
Stack trace addresses can be used to bypass address space layout randomization (ASLR).

Once loaded, events created from this policy will contain a new kernel_stack_trace field on the process_kprobe event with an output similar to:

{
  "address": "18446744072119856613",
  "offset": "5",
  "symbol": "kfree_skb_reason"
},
{
  "address": "18446744072119769755",
  "offset": "107",
  "symbol": "__sys_connect_file"
},
{
  "address": "18446744072119769989",
  "offset": "181",
  "symbol": "__sys_connect"
},
[...]

The “address” is the kernel function address, “offset” is the offset into the native instruction for the function and “symbol” is the function symbol name.

User mode stack trace is contained in user_stack_trace field on the process_kprobe event and looks like:

{
  "address": "140498967885099",
  "offset": "1209643",
  "symbol": "__connect",
  "module": "/usr/lib/x86_64-linux-gnu/libc.so.6"
},
{
  "address": "140498968021470",
  "offset": "1346014",
  "symbol": "inet_pton",
  "module": "/usr/lib/x86_64-linux-gnu/libc.so.6"
},
{
  "address": "140498971185511",
  "offset": "106855",
  "module": "/usr/lib/x86_64-linux-gnu/libcurl.so.4.7.0"
},

The “address” is the function address, “offset” is the function offset from the beginning of the binary module. “module” is the absolute path of the binary file to which address belongs. “symbol” is the function symbol name. “symbol” may be missing if the binary file is stripped.

NoteInformation from procfs (/proc/<pid>/maps) is used to symbolize user
stack trace addresses. Stack trace addresses extraction and symbolizing are async.
It might happen that process is terminated and the /proc/<pid>/maps file will be
not existed at user stack trace symbolization step. In such case user stack traces
for very short living process might be not collected.
For Linux kernels before 5.15 user stack traces may be incomplete (some stack
traces entries may be missed).

This output can be enhanced in a more human friendly using the tetra getevents -o compact command. Indeed, by default, it will print the stack trace along the compact output of the event similarly to this:

❓ syscall /usr/bin/curl kfree_skb_reason
Kernel:
   0xffffffffa13f2de5: kfree_skb_reason+0x5
   0xffffffffa13dda9b: __sys_connect_file+0x6b
   0xffffffffa13ddb85: __sys_connect+0xb5
   0xffffffffa13ddbd8: __x64_sys_connect+0x18
   0xffffffffa1714bd8: do_syscall_64+0x58
   0xffffffffa18000e6: entry_SYSCALL_64_after_hwframe+0x6e
User space:
   0x7f878cf2752b: __connect (/usr/lib/x86_64-linux-gnu/libc.so.6+0x12752b)
   0x7f878cf489de: inet_pton (/usr/lib/x86_64-linux-gnu/libc.so.6+0x1489de)
   0x7f878d1b6167:  (/usr/lib/x86_64-linux-gnu/libcurl.so.4.7.0+0x1a167)

The printing format for kernel stack trace is "0x%x: %s+0x%x", address, symbol, offset. The printing format for user stack trace is "0x%x: %s (%s+0x%x)", address, symbol, module, offset.

Note Compact output will display missing addresses as 0x0, see the above note on
--expose-stack-addresses for more info.

`File hash collection with IMA`

Post takes the imaHash parameter, when turned to true (by default to false) it adds file hashes in LSM events calculated by Linux integrity subsystem. The following list of LSM hooks is supported:

bprm_check_security
bprm_committed_creds
bprm_committing_creds
bprm_creds_from_file
file_ioctl
file_lock
file_open
file_post_open
file_receive
mmap_file

First, you need to be sure that LSM BPF is enabled.

To verify if IMA-measurement is available use the following command:

cat /boot/config-$(uname -r) | grep "CONFIG_IMA\|CONFIG_INTEGRITY"

The output should be similar to this if IMA-measurement is supported:

CONFIG_INTEGRITY=y
CONFIG_IMA=y

If provided above conditions are met, you can enable IMA-measurement by modifying /etc/deault/grub:

GRUB_CMDLINE_LINUX="lsm=integrity,bpf ima_policy=tcb"

Then, update the grub configuration and restart the system.

ima_policy= is used to define which files will be measured. tcb measures all executables run, all mmap’d files for execution (such as shared libraries), all kernel modules loaded, and all firmware loaded. Additionally, a files opened for read by root are measured as well. ima_policy= can be specified multiple times, and the result is the union of the policies. To know more about ima_policy you can follow this link.

Note Hash calculation with IMA subsystem and LSM BPF is supported from 5.11 kernel version.
For kernel versions below 6.1 is recommended to mount filesystems with iversion. Mounting with iversion
helps IMA not recalculating hash if file is not changed. From kernel 6.1 iversion is by default.
It is not necessary to enable IMA to calculate hashes with Tetragon if you have kernel 6.1+.
But hashes will be recalculated no matter if file is not changed. See implementation details of
bpf_ima_file_hash helper.

The provided example of TracingPolicy collects hashes of executed binaries from zsh and bash interpreters:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
spec:
  lsmhooks:
  - hook: "bprm_check_security"
    args:
      - index: 0
        type: "linux_binprm"
    selectors:
    - matchBinaries:
      - operator: "In"
        values:
        - "/usr/bin/zsh"
        - "/usr/bin/bash"
      matchActions:
        - action: Post
          imaHash: true

LSM event with file hash can look like this:

{
  "process_lsm": {
    "process": {
        ...
    },
    "parent": {
        ...
    },
    "function_name": "bprm_check_security",
    "policy_name": "file-integrity-monitoring",
    "args": [
      {
        "linux_binprm_arg": {
          "path": "/usr/bin/grep",
          "permission": "-rwxr-xr-x"
        }
      }
    ],
    "action": "KPROBE_ACTION_POST",
    "ima_hash": "sha256:73abb4280520053564fd4917286909ba3b054598b32c9cdfaf1d733e0202cc96"
  },
}

ima_hash field contains information about hashing algorithm and the hash value itself separated by ‘:’.

This output can be enhanced in a more human friendly using the tetra getevents -e PROCESS_LSM -o compact command.

🔒 LSM     user-nix /usr/bin/zsh bprm_check_security
   /usr/bin/cat sha256:dd5526c5872cce104a80f4d4e7f787c56ab7686a5b8dedda0ba4e8b36a3c084c
🔒 LSM     user-nix /usr/bin/zsh bprm_check_security
   /usr/bin/grep sha256:73abb4280520053564fd4917286909ba3b054598b32c9cdfaf1d733e0202cc96

`NoPost action`

The NoPost action can be used to suppress the event to be generated, but at the same time all its defined actions are performed.

It’s useful when you are not interested in the event itself, just in the action being performed.

Following example override openat syscall for “/etc/passwd” file but does not generate any event about that.

- call: "sys_openat"
  return: true
  syscall: true
  args:
  - index: 0
    type: int
  - index: 1
    type: "string"
  - index: 2
    type: "int"
  returnArg:
    index: 0
    type: "int"
  selectors:
  - matchPIDs:
    matchArgs:
    - index: 1
      operator: "Equal"
      values:
      - "/etc/passwd"
    matchActions:
    - action: Override
      argError: -2
    - action: NoPost

`TrackSock action`

The TrackSock action allows to create a mapping using a BPF map between sockets and processes. It however needs to maintain a state correctly, see UntrackSock related action. TrackSock works similarly to FollowFD, specifying the argument with the sock type using argSock instead of specifying the FD argument with argFd.

It is however more likely that socket tracking will be performed on the return value of sk_alloc as described above.

Socket tracking is only available on kernel >=5.3.

`UntrackSock action`

The UntrackSock action takes a struct sock pointer from a function call and deletes the corresponding entry from the BPF map, where it was put under the TrackSock action.

Let’s take a look at the following example:

- call: "__sk_free"
  syscall: false
  args:
    - index: 0
      type: sock
  selectors:
    - matchActions:
      - action: UntrackSock
        argSock: 0

Similar to the TrackSock action, the index of the sock is described under argSock:

- matchActions:
  - action: UntrackSock
    argSock: 0

In this example, argSock is 0. So, the argument from the __sk_free function call at index: 0 will be deleted from the BPF map whenever a __sk_free is executed.

- index: 0
  type: "sock"

Caution Whenever we would like to track a socket with a TrackSock block,
there should be a matching UntrackSock block, otherwise the BPF map will be
broken.

Socket tracking is only available on kernel >=5.3.

`Notify Enforcer action`

The NotifyEnforcer action notifies the enforcer program to kill or override a syscall.

It’s meant to be used on systems with kernel that lacks multi kprobe feature, that allows to attach many kprobes quickly). To workaround that the enforcer sensor uses the raw syscall tracepoint and attaches simple program to syscalls that we need to kill or override.

The specs needs to have enforcer program definition, that instructs tetragon to load the enforcer program and attach it to specified syscalls.

spec:
  enforcers:
  - calls:
    - "list:dups"

The syscalls expects list of syscalls or list:XXX pointer to list.

Note that currently only single enforcer definition is allowed.

The NotifyEnforcer action takes 2 arguments.

matchActions:
- action: "NotifyEnforcer"
  argError: -1
  argSig: 9

If specified the argError will be passed to bpf_override_return helper to override the syscall return value. If specified the argSig will be passed to bpf_send_signal helper to override the syscall return value.

The following is spec for killing /usr/bin/bash program whenever it calls sys_dup or sys_dup2 syscalls.

spec:
  lists:
  - name: "dups"
    type: "syscalls"
    values:
    - "sys_dup"
    - "sys_dup2"
  enforcers:
  - calls:
    - "list:dups"
  tracepoints:
  - subsystem: "raw_syscalls"
    event: "sys_enter"
    args:
    - index: 4
      type: "syscall64"
    selectors:
    - matchArgs:
      - index: 0
        operator: "InMap"
        values:
        - "list:dups"
      matchBinaries:
      - operator: "In"
        values:
        - "/usr/bin/bash"
      matchActions:
      - action: "NotifyEnforcer"
        argSig: 9

Note as mentioned above the NotifyEnforcer with enforcer program is meant to be used only on kernel versions with no support for fast attach of multiple kprobes (kprobe_multi link).

With kprobe_multi link support the above example can be easily replaced with:

spec:
  lists:
  - name: "syscalls"
    type: "syscalls"
    values:
    - "sys_dup"
    - "sys_dup2"
  kprobes:
  - call: "list:syscalls"
    selectors:
    - matchBinaries:
      - operator: "In"
        values:
        - "/usr/bin/bash"
      matchActions:
      - action: "Sigkill"

`Selector Semantics`

The selector semantics of the CiliumTracingPolicy follows the standard Kubernetes semantics and the principles that are used by Cilium to create a unified policy definition.

To explain deeper the structure and the logic behind it, let’s consider first the following example:

selectors:
 - matchPIDs:
   - operator: In
     followForks: true
     values:
     - pid1
     - pid2
     - pid3
  matchArgs:
  - index: 0
    operator: "Equal"
    values:
    - fdString1

In the YAML above matchPIDs and matchArgs are logically AND together giving the expression:

(pid in {pid1, pid2, pid3} AND arg0=fdstring1)

`Multiple values`

When multiple values are given, we apply the OR operation between them. In case of having multiple values under the matchPIDs selector, if any value matches with the given pid from pid1, pid2 or pid3 then we accept the event:

pid==pid1 OR pid==pid2 OR pid==pid3

As an example, we can filter for sys_read() syscalls that were not part of the container initialization and the main pod process and tried to read from the /etc/passwd file by using:

selectors:
 - matchPIDs:
   - operator: NotIn
     followForks: true
     values:
     - 0
     - 1
  matchArgs:
  - index: 0
    operator: "Equal"
    values:
    - "/etc/passwd"

Similarly, we can use multiple values under the matchArgs selector:

(pid in {pid1, pid2, pid3} AND arg0={fdstring1, fdstring2})

If any value matches with fdstring1 or fdstring2, specifically (string==fdstring1 OR string==fdstring2) then we accept the event.

For example, we can monitor sys_read() syscalls accessing both the /etc/passwd or the /etc/shadow files:

selectors:
 - matchPIDs:
   - operator: NotIn
     followForks: true
     values:
     - 0
     - 1
  matchArgs:
  - index: 0
    operator: "Equal"
    values:
    - "/etc/passwd"
    - "/etc/shadow"

`Multiple operators`

When multiple operators are supported under matchPIDs or matchArgs, they are logically AND together. In case if we have multiple operators under matchPIDs:

selectors:
  - matchPIDs:
    - operator: In
      followForks: true
      values:
      - pid1
    - operator: NotIn
      followForks: true
      values:
      - pid2

then we would build the following expression on the BPF side:

(pid == 0[following forks]) && (pid != 1[following forks])

In case of having multiple matchArgs:

selectors:
 - matchPIDs:
   - operator: In
     followForks: true
     values:
     - pid1
     - pid2
     - pid3
  matchArgs:
  - index: 0
    operator: "Equal"
    values:
    - 1
  - index: 2
    operator: "lt"
    values:
    - 500

Then we would build the following expression on the BPF side

(pid in {pid1, pid2, pid3} AND arg0=1 AND arg2 < 500)

`Operator types`

There are different types supported for each operator. In case of matchArgs:

Equal
NotEqual
Prefix
Postfix
Mask
GreaterThan (aka GT)
LessThan (aka LT)
SPort - Source Port
NotSPort - Not Source Port
SPortPriv - Source Port is Privileged (0-1023)
NotSPortPriv - Source Port is Not Privileged (Not 0-1023)
DPort - Destination Port
NotDPort - Not Destination Port
DPortPriv - Destination Port is Privileged (0-1023)
NotDPortPriv - Destination Port is Not Privileged (Not 0-1023)
SAddr - Source Address, can be IPv4/6 address or IPv4/6 CIDR (for ex 1.2.3.4/24 or 2a1:56::1/128)
NotSAddr - Not Source Address
DAddr - Destination Address
NotDAddr - Not Destination Address
Protocol
Family
State

The operator types Equal and NotEqual are used to test whether the certain argument of a system call is equal to the defined value in the CR.

For example, the following YAML snippet matches if the argument at index 0 is equal to /etc/passwd:

      matchArgs:
      - index: 0
        operator: "Equal"
        values:
        - "/etc/passwd"

Both Equal and NotEqual are set operations. This means if multiple values are specified, they are ORd together in case of Equal, and ANDd together in case of NotEqual.

For example, in case of Equal the following YAML snippet matches if the argument at index 0 is in the set of {arg0, arg1, arg2}.

matchArgs:
- index: 0
  operator: "Equal"
  values:
  - "arg0"
  - "arg1"
  - "arg2"

The above would be executed in the kernel as

arg == arg0 OR arg == arg1 OR arg == arg2

In case of NotEqual the following YAML snippet matches if the argument at index 0 is not in the set of {arg0, arg1}.

matchArgs:
- index: 0
  operator: "NotEqual"
  values:
  - "arg0"
  - "arg1"

The above would be executed in the kernel as

arg != arg0 AND arg != arg1

The operator type Mask performs and bitwise operation on the argument value and defined values. The argument type needs to be one of the value types.

For example in following YAML snippet we match second argument for bits 1 and 9 (0x200 value). We could use single value 0x201 as well.

matchArgs:
- index: 2
  operator: "Mask"
  values:
  - 1
  - 0x200

The above would be executed in the kernel as

arg & 1 OR arg & 0x200

The value can be specified as hexadecimal (with 0x prefix) octal (with 0 prefix) or decimal value (no prefix).

The operator Prefix checks if the certain argument starts with the defined value, while the operator Postfix compares if the argument matches to the defined value as trailing.

The operators relating to ports, addresses and protocol are used with sock, skb, sockaddr and socket types. Port operators can accept a range of ports specified as min:max as well as lists of individual ports. Address operators can accept IPv4/6 CIDR ranges as well as lists of individual addresses.

The Protocol operator can accept integer values to match against, or the equivalent IPPROTO_ enumeration. For example, UDP can be specified as either IPPROTO_UDP or 17; TCP can be specified as either IPPROTO_TCP or 6.

The Family operator can accept integer values to match against or the equivalent AF_ enumeration. For example, IPv4 can be specified as either AF_INET or 2; IPv6 can be specified as either AF_INET6 or 10.

The State operator can accept integer values to match against or the equivalent TCP_ enumeration. For example, an established socket can be matched with TCP_ESTABLISHED or 1; a closed socket with TCP_CLOSE or 7.

In case of matchPIDs:

In
NotIn

The operator types In and NotIn are used to test whether the pid of a system call is found in the provided values list in the CR. Both In and NotIn are set operations, which means if multiple values are specified they are ORd together in case of In and ANDd together in case of NotIn.

For example, in case of In the following YAML snippet matches if the pid of a certain system call is being part of the list of {0, 1}:

- matchPIDs:
  - operator: In
    followForks: true
    isNamespacePID: true
    values:
    - 0
    - 1

The above would be executed in the kernel as

pid == 0 OR pid == 1

In case of NotIn the following YAML snippet matches if the pid of a certain system call is not being part of the list of {0, 1}:

- matchPIDs:
  - operator: NotIn
    followForks: true
    isNamespacePID: true
    values:
    - 0
    - 1

The above would be executed in the kernel as

pid != 0 AND pid != 1

In case of matchBinaries:

In

The In operator type is used to test whether a binary name of a system call is found in the provided values list. For example, the following YAML snippet matches if the binary name of a certain system call is being part of the list of {binary0, binary1, binary2}:

- matchBinaries:
  - operator: "In"
    values:
    - "binary0"
    - "binary1"
    - "binary2"

`Multiple selectors`

When multiple selectors are configured they are logically ORd together.

selectors:
 - matchPIDs:
   - operator: In
     followForks: true
     values:
     - pid1
     - pid2
     - pid3
  matchArgs:
  - index: 0
    operator: "Equal"
    values:
    - 1
  - index: 2
    operator: "lt"
    values:
    -  500
 - matchPIDs:
   - operator: In
     followForks: true
     values:
     - pid1
     - pid2
     - pid3
  matchArgs:
  - index: 0
    operator: "Equal"
    values:
    - 2

The above would be executed in kernel as:

(pid in {pid1, pid2, pid3} AND arg0=1 AND arg2 < 500) OR
(pid in {pid1, pid2, pid3} AND arg0=2)

`Limitations`

Those limitations might be outdated, see issue #709.

Because BPF must be bounded we have to place limits on how many selectors can exist.

Max Selectors 8.
Max PID values per selector 4
Max MatchArgs per selector 5 (one per index)
Max MatchArg Values per MatchArgs 1 (limiting initial implementation can bump to 16 or so)

`Return Actions filter`

Return actions filters are a list of actions that execute when an return selector matches. They are defined under matchReturnActions and currently support all the Actions filter action types.

`2.6 - Tags`

Use Tags to categorize events

Tags are optional fields of a Tracing Policy that are used to categorize generated events.

`Introduction`

Tags are specified in Tracing policies and will be part of the generated event.

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "file-monitoring-filtered"
spec:
  kprobes:
  - call: "security_file_permission"
    message: "Sensitive file system write operation"
    syscall: false
    args:
    - index: 0
      type: "file" # (struct file *) used for getting the path
    - index: 1
      type: "int" # 0x04 is MAY_READ, 0x02 is MAY_WRITE
    selectors:
    - matchArgs:
      - index: 0
        operator: "Prefix"
        values:
        - "/etc"              # Writes to sensitive directories
        - "/boot"
        - "/lib"
        - "/lib64"
        - "/bin"
        - "/usr/lib"
        - "/usr/local/lib"
        - "/usr/local/sbin"
        - "/usr/local/bin"
        - "/usr/bin"
        - "/usr/sbin"
        - "/var/log"          # Writes to logs
        - "/dev/log"
        - "/root/.ssh"        # Writes to sensitive files add here.
      - index: 1
        operator: "Equal"
        values:
        - "2" # MAY_WRITE
    tags: [ "observability.filesystem", "observability.process" ]

Every kprobe call can have up to max 16 tags.

`Namespaces`

`Observability namespace`

Events in this namespace relate to collect and export data about the internal system state.

“observability.filesystem”: the event is about file system operations.
“observability.privilege_escalation”: the event is about raising permissions of a user or a process.
“observability.process”: the event is about an instance of a Linux program being executed.

`User defined Tags`

Users can define their own tags inside Tracing Policies. The official supported tags are documented in the Namespaces section.

`2.7 - Kubernetes Identity Aware Policies`

Tetragon in-kernel filtering based on Kubernetes namespaces, pod labels, and container fields

`Motivation`

Tetragon is configured via TracingPolicies. Broadly speaking, TracingPolicies define what situations Tetragon should react to and how. The what can be, for example, specific system calls with specific argument values. The how defines what action the Tetragon agent should perform when the specified situation occurs. The most common action is generating an event, but there are others (e.g., returning an error without executing the function or killing the corresponding process).

Here, we discuss how to apply tracing policies only on a subset of pods running on the system via the followings mechanisms:

namespaced policies
pod-label filters
container field filters

Tetragon implements these mechanisms in-kernel via eBPF. This is important for both observability and enforcement use-cases. For observability, copying only the relevant events from kernel- to user-space reduces overhead. For enforcement, performing the enforcement action in the kernel avoids the race-condition of doing it in user-space. For example, let us consider the case where we want to block an application from performing a system call. Performing the filtering in-kernel means that the application will never finish executing the system call, which is not possible if enforcement happens in user-space (after the fact).

To ensure that namespaced tracing policies are always correctly applied, Tetragon needs to perform actions before containers start executing. Tetragon supports this via OCI runtime hooks. If such hooks are not added, Tetragon will apply policies in a best-effort manner using information from the k8s API server.

`Namespace filtering`

For namespace filtering we use TracingPolicyNamespaced which has the same contents as a TracingPolicy, but it is defined in a specific namespace and it is only applied to pods of that namespace.

`Pod label filters`

For pod label filters, we use the PodSelector field of tracing policies to select the pods that the policy is applied to.

`Container field filters`

For container field filters, we use the containerSelector field of tracing policies to select the containers that the policy is applied to. At the moment, the only supported fields are name and repo which refers to the container repository.

`Demo`

`Setup`

For this demo, we use containerd and configure appropriate run-time hooks using minikube.

First, let us start minikube, build and load images, and install Tetragon and OCI hooks:

minikube start --container-runtime=containerd
./contrib/tetragon-rthooks/scripts/minikube-install-hook.sh
make image image-operator
minikube image load --daemon=true cilium/tetragon:latest cilium/tetragon-operator:latest
minikube ssh -- sudo mount bpffs -t bpf /sys/fs/bpf
helm install --namespace kube-system \
	--set tetragonOperator.image.override=cilium/tetragon-operator:latest \
	--set tetragon.image.override=cilium/tetragon:latest  \
	--set tetragon.grpc.address="unix:///var/run/cilium/tetragon/tetragon.sock" \
	tetragon ./install/kubernetes/tetragon

Once the tetragon pod is up and running, we can get its name and store it in a variable for convenience.

tetragon_pod=$(kubectl -n kube-system get pods -l app.kubernetes.io/name=tetragon -o custom-columns=NAME:.metadata.name --no-headers)

Once the tetragon operator pod is up and running, we can also get its name and store it in a variable for convenience.

tetragon_operator=$(kubectl -n kube-system get pods -l app.kubernetes.io/name=tetragon-operator -o custom-columns=NAME:.metadata.name --no-headers)

Next, we check the tetragon-operator logs and tetragon agent logs to ensure that everything is in order.

First, we check if the operator installed the TracingPolicyNamespaced CRD.

kubectl -n kube-system logs -c tetragon-operator $tetragon_operator

The expected output is:

level=info msg="Tetragon Operator: " subsys=tetragon-operator
level=info msg="CRD (CustomResourceDefinition) is installed and up-to-date" name=TracingPolicy/v1alpha1 subsys=k8s
level=info msg="Creating CRD (CustomResourceDefinition)..." name=TracingPolicyNamespaced/v1alpha1 subsys=k8s
level=info msg="CRD (CustomResourceDefinition) is installed and up-to-date" name=TracingPolicyNamespaced/v1alpha1 subsys=k8s
level=info msg="Initialization complete" subsys=tetragon-operator

Next, we check that policyfilter (the low-level mechanism that implements the desired functionality) is indeed enabled.

kubectl -n kube-system logs -c tetragon $tetragon_pod

The output should include:

level=info msg="Enabling policy filtering"

`Namespaced policies`

For illustration purposes, we will use the lseek system call with an invalid argument. Specifically a file descriptor (the first argument) of -1. Normally, this operation would return a “Bad file descriptor error”.

Let us start a pod in the default namespace:

kubectl -n default run test --image=python -it --rm --restart=Never  -- python

Above command will result in the following python shell:

If you don't see a command prompt, try pressing enter.
>>>

There is no policy installed, so attempting to do the lseek operation will just return an error. Using the python shell, we can execute an lseek and see the returned error.

>>> import os
>>> os.lseek(-1,0,0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 9] Bad file descriptor
>>>

In another terminal, we install a policy in the default namespace:

cat << EOF | kubectl apply -n default -f -
apiVersion: cilium.io/v1alpha1
kind: TracingPolicyNamespaced
metadata:
  name: "lseek-namespaced"
spec:
  kprobes:
  - call: "sys_lseek"
    syscall: true
    args:
    - index: 0
      type: "int"
    selectors:
    - matchArgs:
      - index: 0
        operator: "Equal"
        values:
        - "-1"
      matchActions:
      - action: Sigkill
EOF

The above tracing policy will kill the process that performs a lseek system call with a file descriptor of -1. Note that we use a SigKill action only for illustration purposes because it’s easier to observe its effects.

Then, attempting the lseek operation on the previous terminal, will result in the process getting killed:

>>> os.lseek(-1, 0, 0)
pod "test" deleted
pod default/test terminated (Error)

The same is true for a newly started container:

kubectl -n default run test --image=python -it --rm --restart=Never  -- python

If you don't see a command prompt, try pressing enter.
>>> import os
>>> os.lseek(-1, 0, 0)
pod "test" deleted
pod default/test terminated (Error)

Doing the same on another namespace:

kubectl create namespace test
kubectl -n test run test --image=python -it --rm --restart=Never  -- python

Will not kill the process and result in an error:

If you don't see a command prompt, try pressing enter.
>>> import os
>>> os.lseek(-1, 0, 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 9] Bad file descriptor

`Pod label filters`

Let’s install a tracing policy with a pod label filter.

cat << EOF | kubectl apply -f -
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "lseek-podfilter"
spec:
  podSelector:
    matchLabels:
      app: "lseek-test"
  kprobes:
  - call: "sys_lseek"
    syscall: true
    args:
    - index: 0
      type: "int"
    selectors:
    - matchArgs:
      - index: 0
        operator: "Equal"
        values:
        - "-1"
      matchActions:
      - action: Sigkill
EOF

Pods without the label will not be affected:

kubectl run test  --image=python -it --rm --restart=Never  -- python

If you don't see a command prompt, try pressing enter.
>>> import os
>>> os.lseek(-1, 0, 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  OSError: [Errno 9] Bad file descriptor
  >>>

But pods with the label will:

kubectl run test --labels "app=lseek-test" --image=python -it --rm --restart=Never  -- python

If you don't see a command prompt, try pressing enter.
>>> import os
>>> os.lseek(-1, 0, 0)
pod "test" deleted
pod default/test terminated (Error)

`Container field filters`

Let’s install a tracing policy with a container field filter.

cat << EOF | kubectl apply -f -
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "lseek-containerfilter"
spec:
  containerSelector:
    matchExpressions:
      - key: name
        operator: In
        values:
        - main
  kprobes:
  - call: "sys_lseek"
    syscall: true
    args:
    - index: 0
      type: "int"
    selectors:
    - matchArgs:
      - index: 0
        operator: "Equal"
        values:
        - "-1"
      matchActions:
      - action: Sigkill
EOF

Let’s create a pod with 2 containers:

cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: lseek-pod
spec:
  containers:
  - name: main
    image: python
    command: ['sh', '-c', 'sleep infinity']
  - name: sidecar
    image: python
    command: ['sh', '-c', 'sleep infinity']
EOF

Containers that don’t match the name main will not be affected:

kubectl exec -it lseek-pod -c sidecar -- python3

>>> import os
>>> os.lseek(-1, 0, 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  OSError: [Errno 9] Bad file descriptor
>>>

But containers matching the name main will:

kubectl exec -it lseek-pod -c main -- python3

>>> import os
>>> os.lseek(-1, 0, 0)
command terminated with exit code 137

`2.8 - Enforcement Mode`

Configuring enforcement in tracing policies

Beyond monitoring, Tetragon tracing policies include enforcement actions. Configuring the mode of a policy allows you to disable enforcement in a policy, without modifying the policy itself.

A tracing policy can be set in two modes:

monitoring: enforcement operations are elided
enforcement: enforcement operations are respected and performed

Using the tetra CLI, you can inspect the mode of each policy. For example:

tetra tracingpolicy list

Will produce out similar to:

ID   NAME               STATE     FILTERID   NAMESPACE   SENSORS              KERNELMEMORY   MODE
1    enforce-security   enabled   1          pizza       generic_kprobe       7.53 MB        enforce
2    allsyscalls        enabled   0          (global)    generic_tracepoint   4.25 MB        monitor

There are three ways that a policy mode can be set. From lower to higher priority, they are:

Setting the mode in the policy itself
Setting the mode when loading the policy
Setting the mode at runtime (via gRPC)

`Setting the mode in the policy itself`

The policy mode can be set in the spec.options field, under the policy-mode key. For example:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicyNamespaced
metadata:
  name: "enforce-policy"
  namespace: "default"
spec:
  options:
    - name: "policy-mode"
      value: "monitor"
   ...

`Setting the mode when loading the policy`

This can be done via the tetra CLI:

tetra tracingpolicy add --mode monitor policy.yaml

Will load policy.yaml in monitor mode.

`Setting the policy at runtime`

You can use the tetra CLI to set the policy mode once the policy is loaded. For example:

tetra tp set-mode --namespace pizza enforce-security enforce

`3 - Runtime Hooks`

Tetragon Runtime Hooks

Applying Kubernetes Identity Aware Policies requires information about Kubernetes (K8s) pods (e.g., namespaces and labels). Based on this information, the Tetragon agent can update the state so that Kubernetes Identify filtering can be applied in-kernel via BPF.

One way that this information is available to the Tetragon agent is via the K8s API server. Relying on the API server, however, can lead to a delay before the container starts and the policy is applied. This might be undesirable, especially for enforcement policies.

Runtime hooks address this issue by “hooking” into the container run-time system, and ensuring that the Tetragon agent sets up the necessary state for filtering before the container starts.

The OCI hooks are implemented via a tetragon-oci-hook binary which is responsible for contacting the agent via a gRPC socket. tetragon-oci-hook can be configured to either fail or succeed when connecting to the tetragon agent fails (this is needed, so that Tetragon itself, as well as other mission critical containers can be started).

┌────────────────────┐         ┌────────────────────┐        ┌──────────────────┐
│  tetragon-oci-hook │         │   tetragon.sock    │        │  tetragon agent  │
│    (binary)        │─────────┤ (gRPC UNIX socket) │──────► │                  │
│                    │         │                    │        │                  │
└────────────────────┘         └────────────────────┘        └──────────────────┘

Depending on the container runtime, there are different ways to configure the runtime so that tetragon-oci-hook is executed before a container starts:

`CRI-O`

CRI-O implements the OCI hooks configuration directories as described in: https://github.com/containers/common/blob/main/pkg/hooks/docs/oci-hooks.5.md. Hence, enabling the hook requires adding an appropriate file to this directory.

`Containerd (with NRI)`

Recent versions of containerd support NRI: NRI support was added in 1.7 and will be enabled by default starting with 2.0. To use tetragon-oci-hook with NRI, there is a simple NRI plugin (called tetragon-nri-hook) that adds the tetragon-oci-hook to the container spec.

`Containerd (without NRI)`

Containerd can be configured to use a custom container spec that includes tetragon-oci-hook.

`Configuration`

See Configure Runtime Hooks.

`4 - Enforcement`

Documentation for Tetragon enforcement system

Tetragon allows enforcing events in the kernel inline with the operation itself. This document describes the types of enforcement provided by Tetragon and concerns policy implementors must be aware of.

There are two ways that Tetragon performs enforcement: overriding the return value of a function and sending a signal (e.g., SIGKILL) to the process.

`Override return value`

Override the return value of a call means that the function will never be executed and, instead, a value (typically an error) will be returned to the caller. Generally speaking, only system calls and security check functions allow to change their return value in this manner. Details about how users can configure tracing policies to override the return value can be found in the Override action documentation.

`Signals`

Another type of enforcement is signals. For example, users can write a TracingPolicy (details can be found in the Signal action documentation) that sends a SIGKILL to a process matching certain criteria and thus terminate it.

In contrast with overriding the return value, sending a SIGKILL signal does not always stop the operation being performed by the process that triggered the operation. For example, a SIGKILL sent in a write() system call does not guarantee that the data will not be written to the file. However, it does ensure that the process is terminated synchronously (and any threads will be stopped). In some cases it may be sufficient to ensure the process is stopped and the process does not handle the return of the call. To ensure the operation is not completed, though, the Signal action should be combined with the Override action.

`4.1 - Persistent enforcement`

How to configure persistent enforcement

This page shows you how to configure persistent enforcement.

`Concept`

The idea of persistent enforcement is to allow the enforcement policy to continue running even when its tetragon process is gone.

This is configured with the --keep-sensors-on-exit option.

When the tetragon process exits, the policy stays active because it’s pinned in sysfs bpf tree under /sys/fs/bpf/tetragon directory.

When a new tetragon process is started, it performs the following actions:

checks if there’s existing /sys/fs/bpf/tetragon and moves it to /sys/fs/bpf/tetragon_old directory;
sets up configured policy;
removes /sys/fs/bpf/tetragon_old directory.

`Example`

This example shows how the persistent enforcement works on simple tracing policy.

Consider the following enforcement tracing policy that kills any process that touches /tmp/tetragon file.

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
 name: "enforcement"
spec:
 kprobes:
 - call: "fd_install"
   syscall: false
   args:
   - index: 0
     type: int
   - index: 1
     type: "file"
   selectors:
   - matchArgs:
     - index: 1
       operator: "Equal"
       values:
       - "/tmp/tetragon"
     matchActions:
     - action: Sigkill

Spawn tetragon with the above policy and --keep-sensors-on-exit option.

tetragon --bpf-lib bpf/objs/ --keep-sensors-on-exit --tracing-policy enforcement.yaml

Verify that the enforcement policy is in place.
```
cat /tmp/tetragon
```
The output should be similar to
```
Killed
```

Kill tetragon with CTRL+C.

time="2024-07-26T14:47:45Z" level=info msg="Perf ring buffer size (bytes)" percpu=68K total=272K
time="2024-07-26T14:47:45Z" level=info msg="Perf ring buffer events queue size (events)" size=63K
time="2024-07-26T14:47:45Z" level=info msg="Listening for events..."
^C
time="2024-07-26T14:50:50Z" level=info msg="Received signal interrupt, shutting down..."
time="2024-07-26T14:50:50Z" level=info msg="Listening for events completed." error="context canceled"

Verify that the enforcement policy is STILL in place.
```
cat /tmp/tetragon
```
The output should be still similar to
```
Killed
```

`Limitations`

At the moment we are not able to receive any events during the tetragon down time, only the the enforcement is in place.

`5 - Event throttling`

Monitor and throttle cgroup events rate

This page shows you how to configure per-cgroup rate monitoring.

`Concept`

The idea is that tetragon monitors events rate per cgroup and throttle them (stops posting its events) if they cross configured threshold.

The throttled cgroup is monitored and if its traffic gets stable under the limit again, it stops the cgroup throttling and tetragon resumes receiving the cgroup’s events.

The throttle action generates following events:

THROTTLE start event is sent when the group rate limit is crossed
THROTTLE stop event is sent when the cgroup rate is again below the limit stable for 5 seconds

Note The threshold for given cgroup is monitored per CPU.
When the events are spread around on multiple CPUs we will throttle
them per CPU only if they cross the threshold on that CPU.

NoteAt the moment we monitor and limit base sensor events:
PROCESS_EXEC
PROCESS_EXIT

`Setup`

The cgroup rate is configured with --cgroup-rate option:

--cgroup-rate string
  Base sensor events cgroup rate <events,interval> disabled by default
  ('1000,1s' means rate 1000 events per second)

--cgroup-rate=10,1s
sets the cgroup threshold on 10 events per 1 second
--cgroup-rate=1000,1s
sets the cgroup threshold on 1000 events per 1 second
--cgroup-rate=100,1m
sets the cgroup threshold on 100 events per 1 minute
--cgroup-rate=10000,10m
sets the cgroup threshold on 10000 events per 10 minutes

`Events`

The throttle events contains fields as follows.

THROTTLE_START

{
  "process_throttle": {
    "type": "THROTTLE_START",
    "cgroup": "session-429.scope"
  },
  "node_name": "ubuntu-22",
  "time": "2024-07-26T13:07:43.178407128Z"
}

THROTTLE_STOP

{
  "process_throttle": {
    "type": "THROTTLE_STOP",
    "cgroup": "session-429.scope"
  },
  "node_name": "ubuntu-22",
  "time": "2024-07-26T13:07:55.501718877Z"
}

`Example`

This example shows how to generate throttle events when cgroup rate monitoring is enabled.

Start tetragon with cgroup rate monitoring 10 events per second.

tetragon --bpf-lib ./bpf/objs/ --cgroup-rate=10,1s

The successful configuration will show in tetragon log.

...
time="2024-07-26T13:33:19Z" level=info msg="Cgroup rate started (10/1s)"
...

Spawn more than 10 events per second.
```
while :; do sleep 0.001s; done
```

Monitor events shows throttling.

tetra getevents -o compact

The output should be similar to:

🚀 process ubuntu-22 /usr/bin/sleep 0.001s
💥 exit    ubuntu-22 /usr/bin/sleep 0.001s 0
🚀 process ubuntu-22 /usr/bin/sleep 0.001s
💥 exit    ubuntu-22 /usr/bin/sleep 0.001s 0
🚀 process ubuntu-22 /usr/bin/sleep 0.001s
💥 exit    ubuntu-22 /usr/bin/sleep 0.001s 0
🚀 process ubuntu-22 /usr/bin/sleep 0.001s
💥 exit    ubuntu-22 /usr/bin/sleep 0.001s 0
🚀 process ubuntu-22 /usr/bin/sleep 0.001s
💥 exit    ubuntu-22 /usr/bin/sleep 0.001s 0
🚀 process ubuntu-22 /usr/bin/sleep 0.001s
💥 exit    ubuntu-22 /usr/bin/sleep 0.001s 0
🚀 process ubuntu-22 /usr/bin/sleep 0.001s
🧬 throttle START session-429.scope
🚀 process ubuntu-22 /usr/bin/sleep 0.001s
💥 exit    ubuntu-22 /usr/bin/sleep 0.001s 0
🚀 process ubuntu-22 /usr/bin/sleep 0.001s
💥 exit    ubuntu-22 /usr/bin/sleep 0.001s 0
🚀 process ubuntu-22 /usr/bin/sleep 0.001s
💥 exit    ubuntu-22 /usr/bin/sleep 0.001s 0
🚀 process ubuntu-22 /usr/bin/sleep 0.001s
💥 exit    ubuntu-22 /usr/bin/sleep 0.001s 0
🚀 process ubuntu-22 /usr/bin/sleep 0.001s
🚀 process ubuntu-22 /usr/bin/sleep 0.001s
💥 exit    ubuntu-22 /usr/bin/sleep 0.001s 0
🚀 process ubuntu-22 /usr/bin/sleep 0.001s
💥 exit    ubuntu-22 /usr/bin/sleep 0.001s 0
🚀 process ubuntu-22 /usr/bin/sleep 0.001s
💥 exit    ubuntu-22 /usr/bin/sleep 0.001s 0
🚀 process ubuntu-22 /usr/bin/sleep 0.001s

🧬 throttle STOP  session-429.scope

When you stop the while loop from the other terminal you will get above throttle STOP event after 5 seconds.

`Limitations`

The cgroup rate is monitored per CPU
At the moment we only monitor and limit base sensor and kprobe events:PROCESS_EXEC PROCESS_EXIT