By: Ania Kapuścińska (@lambdanis)
Edited by: Shaun Mouton (@sdmouton)
Like many engineers, for a long time I’ve thought of the Linux kernel as a black box. I've been using Linux daily for many years - but my usage was mostly limited to following the installation guide, interacting with the command line interface and writing bash scripts.
Some time ago I heard about eBPF (extended BPF). The first thing I heard was that it’s a programmable interface for the Linux kernel. Wait a second. Does that mean I can now inject my code into Linux without fully understanding all the internals and compiling the kernel? The answer turns out to be approximately yes!
An eBPF (or BPF - these acronyms are used practically interchangeably) program is written in a restricted version of C. Restricted, because a dedicated verifier checks that the program is safe to run in a BPF VM - it can’t crash, loop infinitely, or access arbitrary memory. If the program passes the check, it can be attached to some kind of event in the Linux kernel, and run every time this event happens.
A growing ecosystem makes it easier to create tools on top of BPF. One very popular framework is BCC (BPF Compiler Collection), which contains a Python interface for writing BPF programs. Python is a very popular scripting language, for a good reason - simple syntax, dynamic typing and a rich standard library make writing even complex scripts quick and fun. On top of that, bcc takes care of compiling BPF programs, attaching them to events and processing their output. That makes it the perfect tool to start experimenting with writing BPF code.
To run code examples from this article, you will need a Linux machine with a fairly recent kernel version (supporting eBPF). If you don’t have a Linux machine available, you can experiment in a Vagrant box. You will also need to install the Python bcc package.
Very complicated hello
Let’s start in a very unoriginal way - with a “hello world” program. As I mentioned before, BPF programs are written in (restricted) C. A BPF program printing “Hello World!” can look like this:
hello.c
#define HELLO_LENGTH 13

BPF_PERF_OUTPUT(output);

struct message_t {
    char hello[HELLO_LENGTH];
};

static int strcp(char *src, char *dest) {
    for (int i = 0; src[i] != '\0'; i++) {
        dest[i] = src[i];
    }
    return 0;
}

int hello_world(struct pt_regs *ctx) {
    struct message_t message = {};
    strcp("Hello World!", message.hello);
    output.perf_submit(ctx, &message, sizeof(message));
    return 0;
}
The main piece here is the hello_world function - later we will attach it to a kernel event. We don’t have access to many common libraries, so we are implementing strcp (string copy) functionality ourselves. Extra functions are allowed in BPF code, but have to be defined as static. Loops are also allowed, but the verifier will check that they are guaranteed to complete.
The way we output data might look unusual. First, we define a perf ring buffer called “output” using the BPF_PERF_OUTPUT macro. Then we define the data structure that we will put in this buffer - message_t. Finally, we write to the “output” buffer using the perf_submit function.
Now it’s time to write some Python:
hello.py
from bcc import BPF

b = BPF(src_file="hello.c")
b.attach_kprobe(
    event=b.get_syscall_fnname("clone"),
    fn_name="hello_world"
)

def print_message(_cpu, data, _size):
    message = b["output"].event(data)
    print(message.hello)

b["output"].open_perf_buffer(print_message)

while True:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        exit()
We import BPF from bcc because the BPF class is the core of the Python interface to eBPF in the bcc package. It loads our C program, compiles it, and gives us a Python object to operate on. The program has to be attached to a Linux kernel event - in this case the clone system call, which is used to create a new process. The attach_kprobe method hooks the hello_world C function to the start of the clone system call.
The rest of the Python code reads and prints the output. A great feature of bcc is the automatic translation of C structures (here, the message_t items in the “output” perf ring buffer) into Python objects. We access the buffer with a simple b["output"], and use the open_perf_buffer method to associate it with the print_message function. In this function we read incoming messages with the event method. The C structure we used to send them gets automatically converted into a Python object, so we can read “Hello World!” by accessing the hello attribute.
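To make that translation a bit less magical, here is a minimal sketch of what happens to the raw bytes of one event. It does not use bcc at all: Message is a hand-written ctypes mirror of message_t, and decode_message is a hypothetical helper name. bcc generates an equivalent class for you, so treat this as an illustration under those assumptions, not as bcc's actual implementation.

```python
import ctypes

HELLO_LENGTH = 13  # must match HELLO_LENGTH in hello.c


class Message(ctypes.Structure):
    """Hand-written Python mirror of the C struct message_t."""
    _fields_ = [("hello", ctypes.c_char * HELLO_LENGTH)]


def decode_message(raw: bytes) -> str:
    """Interpret the raw bytes of one event as a Message and return its text."""
    message = Message.from_buffer_copy(raw)
    # Reading a c_char array field yields bytes up to the first NUL byte.
    return message.hello.decode()


# Simulated payload: 12 characters plus the terminating NUL, 13 bytes in total.
print(decode_message(b"Hello World!\x00"))
```

This snippet runs without root privileges, because it never touches the kernel. It only shows the decoding step.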
To see hello.py in action, run it with root privileges:
> sudo python hello.py
In a different terminal window run any commands, e.g. ls. “Hello World!” messages will start popping up.
Does it look awfully complicated for a “hello world” example? Yes, it does :) But it covers a lot, and most of the complexity comes from the fact that we are sending data to user space via a perf ring buffer.
In fact, similar functionality can be achieved with much simpler code. We can get rid of the complex printing logic by using the bpf_trace_printk function to write a message to the shared trace_pipe. Then, in the Python script, we can read from this pipe using the trace_print method. This approach is not recommended for real world tools, as trace_pipe is global and the output format is limited - but for experiments or debugging it’s perfectly fine.
Additionally, bcc allows us to write C code inline in the Python script. We can also use a shortcut for attaching C functions to kernel events - if we name the C function kprobe__<kernel function name>, it will get hooked to the desired kernel function automatically. In this case we want to hook into the sys_clone function.
So, hello world, the simplest version, can look like this:
from bcc import BPF

BPF(text='int kprobe__sys_clone(void *ctx) { bpf_trace_printk("Hello World!\\n"); return 0; }').trace_print()
The output will be different, but what doesn’t change is that while the script is running, custom code will run whenever a clone system call is starting.
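If the raw trace_pipe lines are too noisy, bcc also offers trace_fields, which splits each line into a tuple. The sketch below assumes Python 3; hello, format_event and run are hypothetical names chosen for illustration, and the output layout is my own choice rather than anything bcc prescribes.

```python
# BPF program writing to the shared trace pipe; "hello" is a hypothetical name.
PROGRAM = r"""
int hello(void *ctx) {
    bpf_trace_printk("Hello World!\n");
    return 0;
}
"""


def format_event(task, pid, ts, msg):
    """Build one output line; on Python 3 bcc returns task and msg as bytes."""
    if isinstance(task, bytes):
        task = task.decode(errors="replace")
    if isinstance(msg, bytes):
        msg = msg.decode(errors="replace")
    return f"{ts:.6f} {task} {pid} {msg}"


def run():
    """Attach to clone and print parsed events until Ctrl-C (needs root and bcc)."""
    from bcc import BPF  # imported lazily so format_event works without bcc

    b = BPF(text=PROGRAM)
    b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")
    while True:
        try:
            task, pid, cpu, flags, ts, msg = b.trace_fields()
        except ValueError:
            continue  # skip lines that could not be parsed
        except KeyboardInterrupt:
            break
        print(format_event(task, pid, ts, msg))
```

Calling run() as root should print one line per clone call, containing the timestamp, the name of the calling task, its PID and the message.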
What even is an event?
Code compilation and attaching functions to events are greatly simplified by the bcc interface. But a lot of its power lies in the fact that we can glue many BPF programs together with Python. Nothing prevents us from defining multiple C functions in one Python script and attaching them to multiple different hook points.
Let’s talk about these “hook points”. What we used in the “hello world” example is a kprobe (kernel probe). It’s a way to dynamically run code at the beginning of Linux kernel functions. We can also define a kretprobe to run code when a kernel function returns. Similarly, for programs running in user space, there are uprobes and uretprobes.
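As a rough sketch of a kprobe and a kretprobe working as a pair, the script below attaches one function to the entry of the clone system call and another to its return. on_entry, on_return and run are hypothetical names; PT_REGS_RC is the bcc macro I would expect to use for reading the return value, so check it against the reference guide for your bcc version.

```python
# Two BPF functions: one for the entry of clone, one for its return.
PROGRAM = r"""
#include <uapi/linux/ptrace.h>

int on_entry(struct pt_regs *ctx) {
    bpf_trace_printk("clone: enter\n");
    return 0;
}

int on_return(struct pt_regs *ctx) {
    // PT_REGS_RC reads the return value of the probed function.
    long ret = PT_REGS_RC(ctx);
    bpf_trace_printk("clone: return %ld\n", ret);
    return 0;
}
"""


def run():
    """Attach both probes and stream the trace pipe (needs root and bcc)."""
    from bcc import BPF  # imported lazily so the module loads without bcc

    b = BPF(text=PROGRAM)
    fn = b.get_syscall_fnname("clone")
    b.attach_kprobe(event=fn, fn_name="on_entry")
    b.attach_kretprobe(event=fn, fn_name="on_return")
    b.trace_print()
```

With run() active, every clone call should produce an "enter" line followed by a "return" line. In the parent process the return value is the PID of the new child.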
Probes are extremely useful for dynamic tracing use cases. They can be attached almost anywhere, but that can cause stability problems - a function rename in a newer kernel could break our program. Better stability can be achieved by using predefined static tracepoints wherever possible. The Linux kernel provides many of those, and for user space tracing you can define them too (user statically defined tracepoints - USDTs).
Network events are very interesting hook points. BPF can be used to inspect, filter and route packets, opening a whole sea of possibilities for very performant networking and security tools. In this category, XDP (eXpress Data Path) is a BPF framework that allows running BPF programs not only in the Linux kernel, but also on supported network devices.
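For comparison, here is a minimal sketch of the “hello world” hooked to a static tracepoint instead of a kprobe. It assumes the syscalls:sys_enter_clone tracepoint exists on your kernel (available tracepoints are listed under /sys/kernel/debug/tracing/events) and uses bcc's TRACEPOINT_PROBE macro, which attaches the function automatically. run is a hypothetical name.

```python
# TRACEPOINT_PROBE(category, event) defines a function and attaches it
# to the tracepoint category:event when the program is loaded.
PROGRAM = r"""
TRACEPOINT_PROBE(syscalls, sys_enter_clone) {
    bpf_trace_printk("Hello from a tracepoint!\n");
    return 0;
}
"""


def run():
    """Load the program and stream the trace pipe (needs root and bcc)."""
    from bcc import BPF  # imported lazily so the module loads without bcc

    BPF(text=PROGRAM).trace_print()
```

Unlike the kprobe version, this one does not depend on the name of the kernel function that implements clone, only on the tracepoint name.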
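To give a feel for XDP, below is a deliberately tiny sketch: a program that logs every packet arriving on one interface and lets it through. The names xdp_hello and run are hypothetical, and the default device "lo" is only an assumption that keeps the experiment local. load_func, attach_xdp and remove_xdp are the bcc calls I would expect to use here, but check them against the reference guide for your bcc version.

```python
# An XDP program runs for every packet received on the attached device.
PROGRAM = r"""
#include <uapi/linux/bpf.h>

int xdp_hello(struct xdp_md *ctx) {
    bpf_trace_printk("packet seen\n");
    return XDP_PASS;  // let the packet continue up the network stack
}
"""


def run(device="lo"):
    """Attach the XDP program to `device` until Ctrl-C (needs root and bcc)."""
    from bcc import BPF  # imported lazily so the module loads without bcc

    b = BPF(text=PROGRAM)
    fn = b.load_func("xdp_hello", BPF.XDP)
    b.attach_xdp(device, fn, 0)
    try:
        b.trace_print()
    except KeyboardInterrupt:
        pass
    finally:
        # Always detach, otherwise the program stays on the device.
        b.remove_xdp(device, 0)
```

Returning XDP_DROP instead of XDP_PASS would turn the same skeleton into a very blunt packet filter, so be careful which device you try that on.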
We need to store data
So far I’ve mentioned functions attached to other functions many times. But interesting computer programs generally have something more than functions - a state that can be shared between function calls. That can be a database or a filesystem, and in the BPF world that’s BPF maps.
BPF maps are key/value stores that live in the Linux kernel. They can be accessed by both kernel and user space programs, allowing communication between them. Usually BPF maps are defined with C macros, and read or modified with BPF helpers. There are several different types of BPF maps, e.g.: hash tables, histograms, arrays, queues and stacks. In newer kernel versions, some types of maps let you protect concurrent access with spin locks.
In fact, we’ve seen a BPF map in action already. The perf ring buffer we created with the BPF_PERF_OUTPUT macro is nothing more than a BPF map of type BPF_MAP_TYPE_PERF_EVENT_ARRAY. We also saw that it can be accessed from a Python bcc script, including automatic translation of its items into Python objects.
A good, but still simple example of using a hash table BPF map for communication between different BPF programs can be found in the “Linux Observability with BPF” book (or in the accompanying repo). It’s a script using a uprobe and a uretprobe to measure the duration of a Go binary execution:
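Maps are also an easy way to hand aggregated data to user space. The following sketch counts clone calls per process in a hash map and reads the totals from Python after a few seconds. counts, count_clone, top_callers and run are hypothetical names, and the sleep-then-read approach is just the simplest thing that works, not a recommended design.

```python
# Kernel side: one counter per process ID, kept in a BPF hash map.
PROGRAM = r"""
BPF_HASH(counts, u32, u64);

int count_clone(void *ctx) {
    // The upper 32 bits of this helper's result hold the process ID.
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);
    return 0;
}
"""


def top_callers(pairs, limit=5):
    """Return the `limit` (pid, count) pairs with the highest counts."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:limit]


def run(seconds=10):
    """Count clone calls for `seconds`, then print the busiest PIDs (needs root and bcc)."""
    import time
    from bcc import BPF  # imported lazily so top_callers works without bcc

    b = BPF(text=PROGRAM)
    b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="count_clone")
    time.sleep(seconds)
    # Keys and values come back as ctypes objects; .value unwraps them.
    pairs = [(key.value, value.value) for key, value in b["counts"].items()]
    for pid, count in top_callers(pairs):
        print(f"pid {pid}: {count} clone calls")
```

Compared to the perf buffer, nothing is sent per event here: the kernel side only updates counters, and user space decides when to look at them.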
from bcc import BPF

bpf_source = """
BPF_HASH(cache, u64, u64);

int trace_start_time(struct pt_regs *ctx) {
  u64 pid = bpf_get_current_pid_tgid();
  u64 start_time_ns = bpf_ktime_get_ns();
  cache.update(&pid, &start_time_ns);
  return 0;
}
"""

bpf_source += """
int print_duration(struct pt_regs *ctx) {
  u64 pid = bpf_get_current_pid_tgid();
  u64 *start_time_ns = cache.lookup(&pid);
  if (start_time_ns == 0) {
    return 0;
  }
  u64 duration_ns = bpf_ktime_get_ns() - *start_time_ns;
  // duration_ns is a u64, so print it with %llu rather than %d
  bpf_trace_printk("Function call duration: %llu\\n", duration_ns);
  return 0;
}
"""

bpf = BPF(text = bpf_source)
bpf.attach_uprobe(name = "./hello-bpf", sym = "main.main", fn_name = "trace_start_time")
bpf.attach_uretprobe(name = "./hello-bpf", sym = "main.main", fn_name = "print_duration")
bpf.trace_print()
First, a hash table called “cache” is defined with the BPF_HASH macro. Then we have two C functions: trace_start_time, which writes the start time to the map using cache.update(), and print_duration, which reads this value using cache.lookup(). The former is attached to a uprobe, and the latter to a uretprobe for the same function - main.main in the hello-bpf binary. That allows print_duration to, well, print the duration of the Go program execution.
Sounds great! Now what?
To start using the bcc framework, visit its GitHub repo. There is a developer tutorial and a reference guide. Many tools have been built on the bcc framework - you can learn about them from a tutorial or check their code. It’s a great inspiration and a great way to learn - the code of a single tool is usually not extremely complicated.
Two goldmines of eBPF resources are ebpf.io and the eBPF awesome list. Start browsing either of those, and you have all your winter evenings sorted :)
Have fun!