Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unwind through frame pointer #116

Merged
merged 27 commits into from May 9, 2022
Merged

Unwind through frame pointer #116

merged 27 commits into from May 9, 2022

Conversation

YangKeao
Copy link
Member

@YangKeao YangKeao commented Apr 6, 2022

Add the feature to unwind through frame pointer in pprof-rs.

The frame pointer has much better performance and stability. The only problem is that we cannot ensure every components have correct frame pointer. During my testing, I have met several problems:

  1. The rust std binary, doesn't have a correct frame pointer for every function. Ask the cargo to rebuild the std library from source can solve this.
  2. The libstdc++ in my PC doesn't enable the frame pointer (but I used gentoo, so I could recompile it simply)

Compiling with CFLAGS="-fno-omit-frame-pointer", CXXFLAGS="-fno-omit-frame-pointer" RUSTFLAGS="-Cforce-frame-pointers=yes" cargo +nightly build -Z build-std will get a correct behavior.

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
@sticnarf
Copy link
Contributor

sticnarf commented Apr 7, 2022

If the Rust std library or other C/C++ dependencies are not compiled with frame pointers, what will the result look like? Will it panic or just drop some information?

@YangKeao
Copy link
Member Author

YangKeao commented Apr 7, 2022

@sticnarf panic. As I cannot give a proper upper bound of the stack 😢 . (And that's exactly how I found those problem)

I remember @mornyx was working on getting a stack boundary, but I don't know the progress. And I still have some confusion about how @mornyx will implement the boundary check, as the stack for non-main thread is allocated dynamically.

@mornyx
Copy link
Contributor

mornyx commented Apr 7, 2022

@sticnarf panic. As I cannot give a proper upper bound of the stack 😢 . (And that's exactly how I found those problem)

I remember @mornyx was working on getting a stack boundary, but I don't know the progress. And I still have some confusion about how @mornyx will implement the boundary check, as the stack for non-main thread is allocated dynamically.

I use thread_local! here to initialize its own address range for each thread (via /proc/<PID>/tasks/<TID>/maps).

https://github.com/mornyx/unwind/blob/master/src/utils/maps.rs

(I just judge whether the address is readable or not, without distinguishing the stack address or instruction address.)

The only thing that needs to be confirmed is whether the address range of each thread will change once it is allocated? If so, I think we should remove thread_local! and re-parse each time instead, but this has a performance overhead.

@YangKeao
Copy link
Member Author

YangKeao commented Apr 7, 2022

The only thing that needs to be confirmed is whether the address range of each thread will change once it is allocated? If so, I think we should remove thread_local! and re-parse each time instead, but this has a performance overhead.

@mornyx It will, but I think it doesn't matter (at least for unwinding). Because though there will be new maps created through mmap syscall (by mallocator), the stack range will not change. While unwinding, we only care for the stack.

@mornyx
Copy link
Contributor

mornyx commented Apr 7, 2022

The only thing that needs to be confirmed is whether the address range of each thread will change once it is allocated? If so, I think we should remove thread_local! and re-parse each time instead, but this has a performance overhead.

@mornyx It will, but I think it doesn't matter (at least for unwinding). Because though there will be new maps created through mmap syscall (by mallocator), the stack range will not change. While unwinding, we only care for the stack.

When I write the following test in Rust:

#[test]
fn test_address_is_readable() {
     let v = 0;
     assert!(address_is_readable(&v as *const i32 as u64));
}

I found that the address of the variable v is not in the range of [stack], but in the range of anonymous addresses in maps. I guess I still need to figure this out to be sure it's working correctly.

@YangKeao
Copy link
Member Author

YangKeao commented Apr 7, 2022

I found that the address of the variable v is not in the range of [stack], but in the range of anonymous addresses in maps. I guess I still need to figure this out to be sure it's working correctly.

@mornyx Sure. The [stack] is allocated by the system for the execution. The later thread is created by pthread_create (the corresponding clone system call), and their stack is allocated by the glibc (through malloc, or cached previous allocated stack), and the system cannot distinguish the "stack" memory and "heap" memeory, so there will not be special mark in the maps file.

As the glibc doesn't provide any AS-safe api to get the stack boundary, I think getting the whole accessible address space is fine. We have three different ranges here:

  1. The "stack" memory
  2. The accessible memory space, recorded when we first call the with
  3. The real whole accessible memory space.

We can assume that, 1 is the subset of the intersection of 2 and 3.

(This is only about the glibc implementation of stack. Theoretically, a userspace program can use the userfaultd to handle the page fault, and increase the stack dynamically, but as I know, no famous implementation did this.)

@YangKeao
Copy link
Member Author

YangKeao commented Apr 7, 2022

@mornyx Oops, there is still a problem: the 2 and 3 doesn't have other meaningful relationship, which means the accessible memory in 2 is not necessarily accessible in 3 (because the mmap can be unmapped). It's still not 100% safe to only read once for every thread.

@mornyx
Copy link
Contributor

mornyx commented Apr 7, 2022

the 2 and 3 doesn't have other meaningful relationship, which means the accessible memory in 2 is not necessarily accessible in 3 (because the mmap can be unmapped). It's still not 100% safe to only read once for every thread.

As you said, the space for the thread stack is allocated by glibc and passed to pthread_create, so when does the munmap happen if the user doesn't explicitly call the pthread API? If user does not actively access the pthread library, can it be considered that the original maps have not been modified?

@YangKeao
Copy link
Member Author

YangKeao commented Apr 7, 2022

the 2 and 3 doesn't have other meaningful relationship, which means the accessible memory in 2 is not necessarily accessible in 3 (because the mmap can be unmapped). It's still not 100% safe to only read once for every thread.

As you said, the space for the thread stack is allocated by glibc and passed to pthread_create, so when does the munmap happen if the user doesn't explicitly call the pthread API? If user does not actively access the pthread library, can it be considered that the original maps have not been modified?

@mornyx The "2" contains not only the "stack", it also contains all heaps, mmapped files, etc. The mallocator will munmap the allocated memory (if these memory are not needed), and the storage engine may munmap unneeded files.

Take the frame pointer for an example. Consider the situation: the program (which omit the frame pointer) use rbp register to store an address in a mmap backed by a file, then this file is munmapped. At this time, the signal handler is triggered to get the frame pointer and find that the content in rbp register is accessible (based on outdated "2"), but read it will cause a panic.

We can shrink the possibility by ignoring any mmap backed by a file, but the mallocator is still possible to call munmap.

@mornyx
Copy link
Contributor

mornyx commented Apr 7, 2022

@YangKeao Thanks for the detailed explanation! I think we do need to parse each time. I'm going to measure the performance overhead and decide if we need a better idea..

@mornyx
Copy link
Contributor

mornyx commented Apr 18, 2022

@mornyx Ping. Is there any result about the benchmark and test?

I've done sysbench tests on my local cluster of VMs, but have not observed a noticeable performance regression (PS. with RUSTFLAGS="-Cforce-frame-pointers=yes, the result looks correct). I guess the VM env might be somewhat inaccurate. Since benchbot's testing resources are occupied for a long time, I am preparing to deploy a dedicated cluster on aws for benchmarking, which will be done today.

@mornyx
Copy link
Contributor

mornyx commented Apr 19, 2022

In general, after opening framer-pointer (with branch YangKeao/frame-pointer, with flag -Cforce-frame-pointers=yes), the performance dropped by about 1%~3% (based on sysbench and tpcc load). Note that this is in the case of "long term" profiling, the overhead of continuous profiling should be lower.

My test environment is as follows:

env value
Image Ubuntu 20.04 AArch64 (ami-02af65b2d1ebdfafc)
Machine ARM64 8c16g (a1.2xlarge)
Topology 3 TiDB, 1 TiKV, 1 PD

@YangKeao
Copy link
Member Author

Thanks for @mornyx

@BusyJay @sticnarf I think this PR is ready for review.

@YangKeao YangKeao mentioned this pull request Apr 22, 2022
src/lib.rs Outdated Show resolved Hide resolved
src/profiler.rs Outdated Show resolved Hide resolved
src/backtrace/frame_pointer.rs Show resolved Hide resolved
src/backtrace/frame_pointer.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated
Comment on lines 48 to 49
#[cfg(target_os = "linux")]
mod addr_validate;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also provide addr_validate for macOS, this is possible. The only problem is that macOS doesn't have pipe2, but we can simply simulate it.

Here is a possible piece of code:

#[inline]
#[cfg(target_os = "linux")]
unsafe fn create_pipe(fds: *mut libc::c_int) -> libc::c_int {
    libc::pipe2(fds, libc::O_CLOEXEC | libc::O_NONBLOCK)
}

#[cfg(target_os = "macos")]
unsafe fn create_pipe(fds: *mut libc::c_int) -> libc::c_int {
    let res = libc::pipe(fds);
    if res != 0 {
        return res;
    }
    let fds = fds as *mut [libc::c_int; 2];
    for n in 0..2 {
        let mut flags = libc::fcntl((*fds)[n], libc::F_GETFD);
        flags |= libc::O_CLOEXEC;
        let res = libc::fcntl((*fds)[n], libc::F_SETFD, flags);
        if res != 0 {
            return res;
        }
        let mut flags = libc::fcntl((*fds)[n], libc::F_GETFL);
        flags |= libc::O_NONBLOCK;
        let res = libc::fcntl((*fds)[n], libc::F_SETFL, flags);
        if res != 0 {
            return res;
        }
    }
    0
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rework an implementation with nix. Could you test it on Mac OS (as I don't have a mac environment)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran cargo test --features=flamegraph on ARM64 macOS and got test addr_validate::test::failed_validate has been running for over 60 seconds. I guess the problem happens because we haven't set O_NONBLOCK for pipe on macOS at src/addr_validate.rs#L21-L38.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mornyx PTAL

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FdFlag::O_NONBLOCK does not exist, I have made a suggestion below.

src/backtrace/frame_pointer.rs Outdated Show resolved Hide resolved
@mornyx
Copy link
Contributor

mornyx commented May 5, 2022

Hi @sticnarf @BusyJay, could you please review this PR? Let's bring this to the TiKV v6.1 to fix profiling's crash~

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
@YangKeao YangKeao force-pushed the frame-pointer branch 2 times, most recently from 96caf30 to 1136ee1 Compare May 7, 2022 07:53
Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
src/addr_validate.rs Outdated Show resolved Hide resolved
Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
@mornyx
Copy link
Contributor

mornyx commented May 9, 2022

LGTM

}

#[inline]
#[cfg(target_os = "macos")]
Copy link
Contributor

@sticnarf sticnarf May 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, addr validation is only enabled on Linux. But there is an implementation for macos?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. It's a mistake. I'll remove the conditional flag on addr validate.

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants