Support kernel stack map #2671

Open: wants to merge 3 commits into base: main


i-Pear
Contributor

@i-Pear i-Pear commented Apr 1, 2024

Solves #2553

This involves some hacks and is for testing purposes only.

@i-Pear
Contributor Author

i-Pear commented Apr 1, 2024

Currently, I am unsure about who should hold the stackIDMap (perhaps the front end which prints the stack? Is it localManager or IgManager?), nor do I know how the stack should be printed. I need further understanding of the project's structure.

@i-Pear
Contributor Author

i-Pear commented Apr 1, 2024

The method mentioned in #2553 seems overly complicated; I am unsure why it requires mapping fd:

struct gadget_stack_ref {
    int stack_map; // fd of the stack map
    int stack_id; // value returned by bpf_get_stackid()
};

Furthermore, its purpose differs from that of MntNsFilter. I think we cannot simply copy the code from MntNsFilter because MntNsFilter needs to be accessed by facilities like localManager responsible for filtering container access, whereas stackIdMap is only used by the frontend and does not need to be accessed in various parts of the project.

@alban
Member

alban commented Apr 1, 2024

I had not thought of defining the ebpf map in a header (include/gadget/stack_map.h). I was thinking it should be the responsibility of the gadget itself to define its stack map.

With your method, there can be only one stack map per gadget, and it has to be named "gadget_stack_trace_map". That name becomes part of the ABI between IG and the gadget, so it would need to be documented in gadget-helper-api.md.

The headers in include/gadget are meant to be helpers but are not mandatory. It should be possible for gadget authors to write their gadgets in a different way, for example writing them in Rust, compiling them to eBPF and packaging them in the OCI image. IG should be agnostic about that.

> I am unsure why it requires mapping fd:
>
>     struct gadget_stack_ref {
>         int stack_map; // fd of the stack map
>         int stack_id; // value returned by bpf_get_stackid()
>     };

I added stack_map as a way to write a reference to the stack map because I didn't want to limit this to one stack map per gadget.

Also, note that a uprobe program such as:

SEC("uprobe/libc:free")

could be attached to several containers with different versions of libc. So the address of a function on the stack has to be interpreted differently depending on the libc version. This can be resolved by using a different stack map for each uprobe attachment. In this case, it is useful to have the field stack_map to distinguish which stack map it is referring to.
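
As a rough illustration of how a consumer could use the stack_map field to pick the right map, here is a minimal sketch that decodes a gadget_stack_ref from raw event bytes. The Go type name, the helper name and the little-endian byte order are assumptions made for this example; they are not taken from the PR.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// gadgetStackRef mirrors the C struct gadget_stack_ref quoted above:
// two 32-bit signed integers, 8 bytes in total, no padding.
// The Go name is hypothetical.
type gadgetStackRef struct {
	StackMap int32 // reference to the stack map (an fd in the proposal)
	StackID  int32 // value returned by bpf_get_stackid()
}

// decodeStackRef reads one gadgetStackRef from raw event bytes.
// Little-endian is an assumption (true on x86-64 and arm64).
func decodeStackRef(raw []byte) (gadgetStackRef, error) {
	var ref gadgetStackRef
	err := binary.Read(bytes.NewReader(raw), binary.LittleEndian, &ref)
	return ref, err
}

func main() {
	// Hand-written sample bytes: stack_map = 3, stack_id = 42.
	raw := []byte{3, 0, 0, 0, 42, 0, 0, 0}
	ref, err := decodeStackRef(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("map=%d id=%d\n", ref.StackMap, ref.StackID)
}
```

With one stack map per uprobe attachment, the decoded StackMap value would select which map (and therefore which libc build) the StackID belongs to.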

If you look at pkg/networktracer/tracer.go as an example, it has its own listen(), eventHandler() and SetEventHandler() functions. If pkg/uprobetracer/tracer.go does the same, then it has access to the []byte from the ring buffer and also to the ebpf stack maps. But that's probably not the right place to add this, because stack maps should work for different kinds of ebpf programs...

@flyth I don't know where to add the code after the refactoring. I see field accessors have a method Set() which takes []byte as input, but in the case of a stack passed in a ring buffer, the serialized bytes are not enough because we need to have access to the stack map and do a bpf(BPF_MAP_LOOKUP_ELEM). Could you shed some light on this?

@i-Pear
Contributor Author

i-Pear commented Apr 2, 2024

> containers with different versions of libc. So the address of a function on the stack has to be interpreted differently depending on the libc version. This can be resolved by using a different stack map for each uprobe attachment.

I may lack the necessary knowledge, but why can't stacks from different libraries & different gadgets be mixed within the same stack map? As I understand, the stack map simply associates a stackID with a stack (a set of addresses). In the example eBPF program attached to this PR, stackIDs are recorded alongside PIDs, and then the frontend uses the stackID to find the corresponding stack data, interpreting it using symbols associated with the PID. This doesn't seem to cause any confusion.

@alban
Member

alban commented Apr 2, 2024

>> containers with different versions of libc. So the address of a function on the stack has to be interpreted differently depending on the libc version. This can be resolved by using a different stack map for each uprobe attachment.

> I may lack the necessary knowledge, but why can't stacks from different libraries & different gadgets be mixed within the same stack map? As I understand, the stack map simply associates a stackID with a stack (a set of addresses).

Correct.

> In the example eBPF program attached to this PR, stackIDs are recorded alongside PIDs, and then the frontend uses the stackID to find the corresponding stack data, interpreting it using symbols associated with the PID. This doesn't seem to cause any confusion.

Do we have to use the PID? I am concerned that if the target process terminates before ig could open /proc/$pid/maps, it can't work. I thought that if we know that the event comes from /bin/bash and /lib64/libc.so.6, we can resolve the addresses to the symbols even when the process terminated. In this case, we need to know if a probe comes from /lib64/libc.so.6 or another version of libc. But I guess I didn't account for memory relocations of dynamic libraries, so maybe my idea does not work.
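
To make the relocation concern concrete, here is a small sketch of translating a runtime address into an offset within the backing file, using the text format of /proc/$pid/maps. The sample maps content, the addresses and all function names are invented for illustration. Turning the resulting offset into a symbol name would additionally require reading the ELF symbol table, which this sketch does not attempt.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// mapping is one executable, file-backed region from /proc/<pid>/maps.
type mapping struct {
	start, end, pgoff uint64
	path              string
}

// sampleMaps is made-up content in the format of /proc/<pid>/maps.
const sampleMaps = `55d0c0a00000-55d0c0b00000 r-xp 00000000 08:01 1234 /bin/bash
7f3a10028000-7f3a101bd000 r-xp 00028000 08:01 5678 /lib64/libc.so.6`

// parseMaps keeps only executable regions that have a path.
func parseMaps(text string) []mapping {
	var out []mapping
	for _, line := range strings.Split(text, "\n") {
		f := strings.Fields(line)
		// Fields: address range, perms, offset, dev, inode, path.
		if len(f) < 6 || !strings.Contains(f[1], "x") {
			continue
		}
		lo, hi, ok := strings.Cut(f[0], "-")
		if !ok {
			continue
		}
		start, err1 := strconv.ParseUint(lo, 16, 64)
		end, err2 := strconv.ParseUint(hi, 16, 64)
		pgoff, err3 := strconv.ParseUint(f[2], 16, 64)
		if err1 != nil || err2 != nil || err3 != nil {
			continue
		}
		out = append(out, mapping{start, end, pgoff, f[5]})
	}
	return out
}

// fileOffset maps a runtime address to (file path, offset in that file).
func fileOffset(maps []mapping, addr uint64) (string, uint64, bool) {
	for _, m := range maps {
		if addr >= m.start && addr < m.end {
			return m.path, addr - m.start + m.pgoff, true
		}
	}
	return "", 0, false
}

func main() {
	path, off, ok := fileOffset(parseMaps(sampleMaps), 0x7f3a10030000)
	fmt.Println(path, fmt.Sprintf("%#x", off), ok)
}
```

Because the load address differs per process, the same libc function shows up at different runtime addresses in different processes. Knowing only the file path is therefore not enough; the mapping's start address and offset are needed as well, and those are only available while the process is alive unless they are captured earlier.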

@flyth
Member

flyth commented Apr 2, 2024

> @flyth I don't know where to add the code after the refactoring. I see field accessors have a method Set() which takes []byte as input, but in the case of a stack passed in a ring buffer, the serialized bytes are not enough because we need to have access to the stack map and do a bpf(BPF_MAP_LOOKUP_ELEM). Could you shed some light on this?

I think I need more info here - but it sounds like you would want to receive the stackID from the ring buffer, do the lookup in userspace and then send whatever you receive through the DataSource (and not (just) the stackID).

@alban
Member

alban commented Apr 2, 2024

>> @flyth I don't know where to add the code after the refactoring. I see field accessors have a method Set() which takes []byte as input, but in the case of a stack passed in a ring buffer, the serialized bytes are not enough because we need to have access to the stack map and do a bpf(BPF_MAP_LOOKUP_ELEM). Could you shed some light on this?

> I think I need more info here - but it sounds like you would want to receive the stackID from the ring buffer, do the lookup in userspace and then send whatever you receive through the DataSource (and not (just) the stackID).

Yes.

  1. In the ring buffer, we get e.g. stack_id = 42.
  2. We look up stack_id = 42 in the stack map and we get the value []uintptr{0x123, 0x456, 0x789} (if the stack has a depth of 3).
  3. We look at the target process to resolve those 3 addresses and return the stack []string{"getchar", "readline", "main"} to the user.
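
The three steps above can be sketched in plain Go as follows. The stack map is simulated with an ordinary Go map and the symbol table is hard-coded; in a real implementation both would come from the kernel and from the target process. All names and symbol start addresses here are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// symbol is one entry of a (simulated) symbol table.
type symbol struct {
	addr uint64 // start address of the function
	name string
}

// symbolize returns the name of the last symbol that starts at or
// before addr. syms must be sorted by ascending address.
func symbolize(syms []symbol, addr uint64) string {
	i := sort.Search(len(syms), func(i int) bool { return syms[i].addr > addr })
	if i == 0 {
		return "[unknown]"
	}
	return syms[i-1].name
}

// resolveStack covers steps 2 and 3: look up the stack id in the
// (simulated) stack map, then resolve every address to a name.
func resolveStack(stackMap map[uint32][]uint64, syms []symbol, id uint32) ([]string, bool) {
	addrs, ok := stackMap[id]
	if !ok {
		return nil, false
	}
	names := make([]string, 0, len(addrs))
	for _, a := range addrs {
		names = append(names, symbolize(syms, a))
	}
	return names, true
}

func main() {
	// Step 1: the event read from the ring buffer carried stack_id = 42.
	const stackID = 42

	stackMap := map[uint32][]uint64{42: {0x123, 0x456, 0x789}}
	syms := []symbol{{0x100, "getchar"}, {0x400, "readline"}, {0x700, "main"}}

	names, ok := resolveStack(stackMap, syms, stackID)
	fmt.Println(names, ok)
}
```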

@flyth
Member

flyth commented Apr 2, 2024

>>> @flyth I don't know where to add the code after the refactoring. I see field accessors have a method Set() which takes []byte as input, but in the case of a stack passed in a ring buffer, the serialized bytes are not enough because we need to have access to the stack map and do a bpf(BPF_MAP_LOOKUP_ELEM). Could you shed some light on this?

>> I think I need more info here - but it sounds like you would want to receive the stackID from the ring buffer, do the lookup in userspace and then send whatever you receive through the DataSource (and not (just) the stackID).

> Yes.
>
>   1. In the ring buffer, we get e.g. stack_id = 42.
>   2. We look up stack_id = 42 in the stack map and we get the value []uintptr{0x123, 0x456, 0x789} (if the stack has a depth of 3).
>   3. We look at the target process to resolve those 3 addresses and return the stack []string{"getchar", "readline", "main"} to the user.

What should the UX be for now, then? Just a list of function names returned as string in a "stack" field - in JSON + columns?

Here's how I'd approach it (afaiu the PR):

  • prep: this needs to be extended to support maps in a generic way (right now we do it for hardcoded names, but we should also populate maps with a given prefix, IMHO); for a quick test, just add another exception a couple of lines below
  • create a new operator named "StackOperator" or something like it; make sure the operator registers itself on init() of the file like most other operators do and include it where other operators are included (for testing, could manually add in both occurrences in cmd/common/oci.go)
  • need to implement DataOperator interface and additionally the DataOperatorInstance interface on another type that can hold the stack map and is returned by the InstantiateDataOperator() func
  • in the InstantiateDataOperator(), check for the stack-map reference (and create/set/cache it) by using GetVar("mapname") & SetVar(); if it's there, also check DataSources in gadgetCtx for stackId (type:gadget_stack_id); if found, add a new field called "stack" to the DataSource and cache the accessor.
  • in PreStart(), subscribe to DataSources with the stackId; in the callbacks, extract stackId using the accessor, do the map lookup + extract info from the target process and set info using the accessor for the "stack" field (for initial version, I'd suggest just a concatenated, comma-separated string)
  • in Stop(), destroy the map.

See other operators like OciHandler, pkg/datasource/compat (used by KubeManager + LocalManager) and pkg/operators/formatters for general info on how operators are used.
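
The callback part of this plan (extract the stack id, look it up, write a comma-separated "stack" field, drop the map on stop) might look roughly like the sketch below. Every type here is a toy stand-in written for this example. It does not use or reproduce the real Inspektor Gadget DataOperator, DataSource or accessor APIs, whose signatures are not shown in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// event is a toy stand-in for one row of a data source.
type event map[string]interface{}

// stackEnricher is a hypothetical operator instance. It holds what the
// real instance would cache: the stack map and a way to resolve symbols.
type stackEnricher struct {
	stackMap map[uint32][]uint64      // stands in for the eBPF stack map
	resolve  func(addr uint64) string // stands in for symbol resolution
}

// onEvent plays the role of the callback subscribed in PreStart():
// read the stack id, do the map lookup, set the "stack" field.
func (s *stackEnricher) onEvent(ev event) {
	id, ok := ev["stack_id"].(uint32)
	if !ok {
		return // this event carries no stack id
	}
	addrs, ok := s.stackMap[id]
	if !ok {
		return // id not present in the map
	}
	names := make([]string, len(addrs))
	for i, a := range addrs {
		names[i] = s.resolve(a)
	}
	ev["stack"] = strings.Join(names, ",")
}

// stop plays the role of Stop(): release the map.
func (s *stackEnricher) stop() { s.stackMap = nil }

func main() {
	op := &stackEnricher{
		stackMap: map[uint32][]uint64{42: {0x123, 0x456}},
		resolve:  func(a uint64) string { return fmt.Sprintf("fn_%x", a) },
	}
	ev := event{"stack_id": uint32(42)}
	op.onEvent(ev)
	fmt.Println(ev["stack"])
	op.stop()
}
```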

I just talked to @alban and he said it might be much more complex than what I proposed since there could be multiple stack maps per gadget run (per container + per target lib version...). So if you prefer, feel free to go ahead and do a PoC that just prints the results to stdout and we'll find a way to integrate it properly afterwards.

@alban
Member

alban commented Apr 3, 2024

> What should the UX be for now, then? Just a list of function names returned as string in a "stack" field - in JSON + columns?

I think yes. There could be an option for a multiline output, such as bcc's tcpdrop tool:
https://github.com/iovisor/bcc/blob/6a5602cef2ebd97c351554d53a4f95532db6a568/tools/tcpdrop_example.txt#L7-L38

@i-Pear i-Pear force-pushed the stack_map branch 3 times, most recently from d9819ef to 1f27ea4 Compare April 25, 2024 07:55
@i-Pear
Contributor Author

i-Pear commented Apr 25, 2024

I've pushed a demo version in which IG can read the stack and print it to stdout. The architecture of this version is probably not sound; it's just my initial idea, and I'll revisit the comments above later. Currently, I'm using a "Converter" to translate a stackID into stack data.

@i-Pear i-Pear force-pushed the stack_map branch 3 times, most recently from 4bd9e2a to 5a0f4a7 Compare April 28, 2024 15:53
@i-Pear i-Pear changed the title from "[WIP] Support stack map" to "[WIP] Support kernel stack map" Apr 29, 2024
@i-Pear i-Pear force-pushed the stack_map branch 2 times, most recently from 489974a to 29ad10a Compare April 30, 2024 12:15
@i-Pear i-Pear changed the title from "[WIP] Support kernel stack map" to "Support kernel stack map" Apr 30, 2024
@i-Pear i-Pear marked this pull request as ready for review April 30, 2024 12:15
@i-Pear i-Pear requested a review from flyth April 30, 2024 12:15
@i-Pear
Contributor Author

i-Pear commented May 3, 2024

I also added the trace_capabilities gadget, see #173 and #1319. But I don't know how to test it.

Update: tested with --host, and it looks good.

pkg/operators/ebpf/converters.go (outdated, resolved)
Comment on lines 510 to 511
ValueSize: 8 * PerfMaxStackDepth,
MaxEntries: 10000,
Member

PerfMaxStackDepth and MAX_ENTRIES should be kept in sync between the Go source and the header file.

Add a comment // Keep in sync with ....
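
To illustrate why the two sides need to agree, here is a hedged sketch of decoding one raw stack-map value on the Go side. The value 127 is the kernel's PERF_MAX_STACK_DEPTH; whether the PR uses the same number is not shown here, so treat it as an assumption. The header path in the comment is a placeholder, and the function name and little-endian byte order are also assumptions.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	// Keep in sync with the header that defines the stack map
	// (placeholder comment; the exact header path is not assumed here).
	perfMaxStackDepth = 127   // assumed: the kernel's PERF_MAX_STACK_DEPTH
	maxEntries        = 10000 // value from the excerpt above
	valueSize         = 8 * perfMaxStackDepth
)

// decodeStack turns one raw stack-map value into a list of addresses.
// It stops at the first zero entry, since unused slots are zero-filled.
func decodeStack(value []byte) ([]uint64, error) {
	if len(value) != valueSize {
		return nil, fmt.Errorf("stack value is %d bytes, want %d", len(value), valueSize)
	}
	var addrs []uint64
	for off := 0; off < len(value); off += 8 {
		a := binary.LittleEndian.Uint64(value[off:])
		if a == 0 {
			break
		}
		addrs = append(addrs, a)
	}
	return addrs, nil
}

func main() {
	value := make([]byte, valueSize)
	binary.LittleEndian.PutUint64(value[0:], 0x123)
	binary.LittleEndian.PutUint64(value[8:], 0x456)
	addrs, err := decodeStack(value)
	fmt.Println(addrs, err)
}
```

If the header used a different depth than the Go constant, the length check would fail (or, without the check, frames would be misread), which is the kind of mismatch the "keep in sync" comment is meant to prevent.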

I am also concerned about the growing API surface between ig and the gadget. Since third-party gadgets and ig are not released in lockstep, we could have a gadget compiled with stack_map.h from an older version of ig and then run with a newer version of ig.

It might not matter for a map of type StackTrace, but for other kinds of maps, this can cause problems. So the code pattern makes me uneasy.

I think we can accept it for now, but I am hoping we can use ebpf extensions later on.

cc @mauriciovasquezbernal

Contributor Author

I agree. Since I have no experience with ebpf extensions, I will try it with USDT arguments first.

@i-Pear
Contributor Author

i-Pear commented May 13, 2024

TODO: use ebpf extension to refactor this

@i-Pear
Contributor Author

i-Pear commented May 16, 2024

The failure in the documentation checks can be ignored: once this is merged, the link will be available.

@i-Pear
Contributor Author

i-Pear commented May 18, 2024

@i-Pear i-Pear force-pushed the stack_map branch 3 times, most recently from 6857285 to 77e7552 Compare May 23, 2024 17:27
@i-Pear
Contributor Author

i-Pear commented May 23, 2024

Changed kernel stack map name to ig_kstack.

@i-Pear i-Pear force-pushed the stack_map branch 2 times, most recently from b99d6b5 to 1d1e7e6 Compare May 27, 2024 02:20
Member

@alban alban left a comment


Thanks!

gadgets/trace_capabilities/gadget.yaml (outdated, resolved)
gadgets/trace_capabilities/gadget.yaml (outdated, resolved)
gadgets/trace_capabilities/gadget.yaml (outdated, resolved)
gadgets/trace_capabilities/program.bpf.c (resolved)
gadgets/trace_capabilities/program.bpf.c (outdated, resolved)
include/gadget/kernel_stack_map.h (outdated, resolved)
pkg/operators/ebpf/converters.go (outdated, resolved)
Comment on lines +320 to +322
bpf_map_update_elem(&current_syscall, &pid_tgid, &sc_ctx,
BPF_ANY);
Member

Contributor Author

Could we put the code from #2475 in a common header? For example, we could add fix_execve.h and provide two helper functions, hook_execve_enter and hook_execve_exit. Also, it seems the execveat syscall needs the same fix.

Member

Unfortunately the implementation is slightly different for container-hook vs trace-exec. I didn't find a way to have common code... Maybe it's easier to fix it separately in this trace-capabilities gadget, and do the refactoring in a separate PR.

About execveat: it seems that the trace-exec gadgets (builtin and image-based) miss events from execveat, but that's a separate bug. The ebpf maps are not getting full in that case.

Contributor Author

Shouldn't the implementations in trace-exec and trace_capabilities be the same? Container-hook might have different requirements, but I just want to provide a common implementation for gadgets.

Contributor Author

Opened #2965

Comment on lines +259 to +265
if (LINUX_KERNEL_VERSION >= KERNEL_VERSION(5, 1, 0)) {
event->audit = (ap->cap_opt & CAP_OPT_NOAUDIT) == 0;
event->insetid = (ap->cap_opt & CAP_OPT_INSETID) != 0;
} else {
event->audit = ap->cap_opt;
event->insetid = -1;
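
For readers unfamiliar with the version gate in this excerpt, the same decision can be modeled in userspace Go as below. The KERNEL_VERSION arithmetic and the two flag values (bit 1 and bit 2) are assumed to match the kernel's definitions; the function names are made up for this sketch, and this is not the gadget's actual code.

```go
package main

import "fmt"

// kernelVersion mirrors the kernel's KERNEL_VERSION(a, b, c) macro,
// which clamps the patch level to 255.
func kernelVersion(a, b, c uint32) uint32 {
	if c > 255 {
		c = 255
	}
	return (a << 16) + (b << 8) + c
}

const (
	capOptNoAudit = 1 << 1 // assumed value of CAP_OPT_NOAUDIT
	capOptInSetID = 1 << 2 // assumed value of CAP_OPT_INSETID
)

// boolToInt converts a bool to 0 or 1, as the C code does implicitly.
func boolToInt(b bool) int32 {
	if b {
		return 1
	}
	return 0
}

// decodeCapOpt mirrors the branch in the excerpt. From 5.1.0 on, cap_opt
// is a bit field. Before that, the argument was the audit flag itself,
// and insetid is unknown, which the excerpt encodes as -1.
func decodeCapOpt(running uint32, capOpt int32) (audit, insetid int32) {
	if running >= kernelVersion(5, 1, 0) {
		audit = boolToInt((capOpt & capOptNoAudit) == 0)
		insetid = boolToInt((capOpt & capOptInSetID) != 0)
		return audit, insetid
	}
	return capOpt, -1
}

func main() {
	audit, insetid := decodeCapOpt(kernelVersion(6, 6, 0), capOptInSetID)
	fmt.Println(audit, insetid)
	audit, insetid = decodeCapOpt(kernelVersion(4, 19, 0), 1)
	fmt.Println(audit, insetid)
}
```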
Member

For a future PR:

It might be possible to use bpf_core_type_matches instead of LINUX_KERNEL_VERSION:

This comes from: torvalds/linux@c1a85a0

$ sudo bpftool btf dump id 1 format c
 union security_list_options {
-        int (*capable)(const struct cred *, struct user_namespace *, int, int);
+        int (*capable)(const struct cred *, struct user_namespace *, int, unsigned int);
 }

Contributor Author

I don't recommend using bpf_core_type_matches here, because struct cred also changed in Linux 6.7-rc6 [1].

Because the vmlinux.h in IG is from version 6.6 and my local environment runs 6.10, I spent a lot of time figuring out why bpf_core_type_matches returned false. In such cases, it seems we would need to trace all the nested structures and provide headers for each version. What do you think?

[1] torvalds/linux@f8fa5d7

Contributor Author

I guess your motivation for using bpf_core_type_matches is that some distributions may cherry-pick patches or do backports, so checking the structure is more reliable than checking the version number.

However, we cannot predict which of the two patches (1: updating struct cred, 2: updating union security_list_options) is included in the user's Linux distribution. This leads to four scenarios, with potentially more combinations in the future.

i-Pear and others added 3 commits May 31, 2024 00:53
Co-authored-by: Alban Crequy <albancrequy@linux.microsoft.com>
Signed-off-by: Tianyi Liu <i.pear@outlook.com>
Co-authored-by: Alban Crequy <albancrequy@linux.microsoft.com>
Signed-off-by: Tianyi Liu <i.pear@outlook.com>
Co-authored-by: Alban Crequy <albancrequy@linux.microsoft.com>
Signed-off-by: Tianyi Liu <i.pear@outlook.com>