Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use xctrace on macOS #246

Open
kraktus opened this issue Jun 11, 2022 · 15 comments · May be fixed by #286
Open

Use xctrace on macOS #246

kraktus opened this issue Jun 11, 2022 · 15 comments · May be fixed by #286

Comments

@kraktus
Copy link

kraktus commented Jun 11, 2022

This is a crosspost from the flamegraph repo: flamegraph-rs/flamegraph#192

Right now flamegraph uses dtrace on macOS but this runs into some limitations:

Inline frames are not properly handled (flamegraph-rs/flamegraph#113)
Running dtrace requires root rights
dtrace is generally considered deprecated functionality on macOS and will likely disappear longer term
One potential solution is to switch to xctrace and then implement a reader for the .trace / .run files. speedscope appears to have partial support for these files but I haven't been able to make it work at all yet. There is an issue from a while ago that it's not working (jlfwong/speedscope#196) but the code there might still be useful.

Although not familiar with the codebase I'd be ready to give it a go, but would it falls into the scope of inferno or do you not want to support an additional profiler?

@jonhoo
Copy link
Owner

jonhoo commented Jun 11, 2022

Yeah, what you'd want to do is implement a "collapser" for the xctrace output format (see the existing collapsers). Once that's in place, you should be able to take an output file from xctrace and feed it through inferno-collapse-xctrace and from then into inferno-flamegraph!

@kraktus
Copy link
Author

kraktus commented Jun 18, 2022

Progress so far

The raw file is undocumented and as of Xcode >=11 there is no OSS that can parse it. However it xctrace comes with an export command which produces XML files.

Here is the different schema that can be accessed after xctrace record -t time (CPU profiling).
$ xctrace export --toc --output test.xml --input PATH/TO/TRACE/FILE.trace

# test.xml
<table schema="tick" frequency="10"/>
<table schema="life-cycle-period" target-pid="SINGLE"/>
<table schema="tick" frequency="1"/>
<table schema="device-thermal-state-intervals"/>
<table schema="os-log" category="PointsOfInterest"/>
<table subsystem="&quot;com.apple.ConditionInducer.LowSeverity&quot;" schema="os-signpost" category="InduceCondition"/>
<table schema="os-signpost" category="PointsOfInterest"/>
<table target-pid="SINGLE" exclude-os-logs="0" schema="global-poi-layout" signpost-code-map="" colorize-by-arg4="0"/>
<table target-pid="SINGLE" context-switch-sampling="0" high-frequency-sampling="0" schema="time-profile" needs-kernel-callstack="0" record-waiting-threads="0"/>
<table target-pid="SINGLE" schema="kdebug-signpost" signpost-code-map=""/>
<table codes="&quot;0x1f,0x05&quot; &quot;0x1f,0x07&quot; &quot;0x1f,0x08&quot;" schema="kdebug-strings"/>
<table schema="dyld-library-load" target-pid="SINGLE"/>
<table schema="os-log-arg" category="PointsOfInterest"/>
<table codes="&quot;0x2b,0xdc&quot;" schema="kdebug" target="SINGLE"/>
<table target-pid="SINGLE" kdebug-match-rule="0" exclude-os-logs="0" schema="region-of-interest" signpost-code-map=""/>
<table codes="&quot;0x1,0x25&quot;" schema="kdebug" target="SINGLE"/>
<table target="SINGLE" schema="kdebug" codes="&quot;0x2b,0x87&quot;"/>
<table codes="&quot;0x2b,0x65&quot;" schema="kdebug"/>
<table codes="&quot;0x1f,0x7&quot;" schema="kdebug" target="SINGLE"/>
<table target="SINGLE" schema="kdebug" codes="&quot;0x2b,0xd8&quot;"/>
<table codes="&quot;0x2d,*&quot;" schema="kdebug" target="SINGLE" callstack="user"/>
<table schema="developer-thread-name-update"/>
<table target="SINGLE" schema="kdebug" codes="&quot;0x31,0xca&quot;"/>
<table codes="&quot;46,2&quot;" schema="kdebug" target="SINGLE" callstack="user"/>
<table schema="app-spin" target-pid="SINGLE"/>
<table codes="&quot;0x1,0xa&quot;" schema="kdebug" target="SINGLE" callstack="user"/>
<table codes="&quot;0x07,0x00&quot; &quot;0x1f,0x05&quot; &quot;0x1f,0x07&quot; &quot;0x1f,0x08&quot;" schema="kdebug" target="SINGLE" callstack="user"/>
<table target-pid="SINGLE" kdebug-match-rule="0" schema="global-roi-layout" signpost-code-map="" colorize-by-arg4="0"/>
<table codes="&quot;0x21,0xa&quot;" schema="kdebug" target="SINGLE" callstack="user"/>
<table schema="gcd-perf-event" target-pid="SINGLE"/>
<table schema="thread-name" target-pid="SINGLE"/>
<table target-pid="SINGLE" kdebug-match-rule="0" exclude-os-logs="0" schema="roi-metadata" signpost-code-map=""/>
<table sample-rate-micro-seconds="1000" callstack="user" schema="time-sample" target="SINGLE" all-thread-states="NO"/>
<table schema="os-signpost-arg" category="PointsOfInterest"/>

I think we'd be interested by the time-profile one (sample):
$ xctrace export --output test-time-sample.xml --input PATH/TO/TRACE/FILE.trace --xpath '/trace-toc/run[@number="1"]/data/table[@schema="time-sample"]'

<row><sample-time id="22" fmt="00:00.039.455">39455541</sample-time><thread ref="2"/><process ref="4"/><core ref="14"/><thread-state ref="8"/><weight ref="9"/><backtrace id="23" fmt="arrayvec::arrayvec::ArrayVec$LT$T$C$_$GT$::retain::h202491b79ad35b30 ← (11 other frames)"><process ref="4"/><text-addresses id="24" fmt="frag 1732">4331406713 4331406716</text-addresses><process ref="4"/><text-addresses id="25" fmt="frag 1733">4331403404 4331396176 4331396516 4331396516 4331396992 4331396636 4331396660 4331486836 4331398140 4336799884</text-addresses></backtrace></row>
<row><sample-time id="26" fmt="00:00.040.455">40455541</sample-time><thread ref="2"/><process ref="4"/><core ref="14"/><thread-state ref="8"/><weight ref="9"/><backtrace id="27" fmt="arrayvec::arrayvec::ArrayVec$LT$T$C$_$GT$::retain::h202491b79ad35b30 ← (11 other frames)"><process ref="4"/><text-addresses id="28" fmt="frag 1735">4331406669 4331406716</text-addresses><process ref="4"/><text-addresses ref="25"/></backtrace></row>
<row><sample-time id="29" fmt="00:00.041.455">41455541</sample-time><thread ref="2"/><process ref="4"/><core ref="14"/><thread-state ref="8"/><weight ref="9"/><backtrace id="30" fmt="shakmaty::attacks::attacks::ha1bbcd54d01602de ← (13 other frames)"><process ref="4"/><text-addresses id="31" fmt="frag 1737">4331408097 4331404772</text-addresses><process ref="4"/><text-addresses id="32" fmt="frag 1738">4331404284 4331406716 4331403404 4331396176 4331396516 4331396516 4331396992 4331396636 4331396660 4331486836 4331398140 4336799884</text-addresses></backtrace></row>
<row><sample-time id="33" fmt="00:00.042.455">42455541</sample-time><thread ref="2"/><process ref="4"/><core ref="14"/><thread-state ref="8"/><weight ref="9"/><backtrace id="34" fmt="retroboard::retroboard::RetroBoard::pseudo_legal_unmoves::h01fcdf6bff18987a ← (11 other frames)"><process ref="4"/><text-addresses id="35" fmt="frag 1740">4331400581 4331400208</text-addresses><process ref="4"/><text-addresses ref="21"/></backtrace></row>

This is where I'm not sure on how to continue to parse it and reconstruct the tree, as not very familiar with low-level CPU record, and there is little documentation. I think that text-addresses is the list the addresses of all the frames at during that sample.

@jonhoo
Copy link
Owner

jonhoo commented Jun 19, 2022

Hmm, interesting. It's not clear that there's quite enough information in there to construct a flamegraph since it chops off the other frames in the backtrace (or at least it looks that way). Maybe there's enough context elsewhere to reconstruct the full call stack the way flamegraph requires though.

In general, the goal is be able to out the total number of samples spent in each call stack leaf (so ~= function, but taking where that function was called from into account) not including its children, and to print that out along with the call stack that led to that function call. So, for example, if foo calls bar calls baz, however many samples we have for baz should be printed as foo;bar;baz;N. If baz was also separately called by quox, there should be a separate line like quox;baz;M. Any time spent in, say, bar that's not spent in baz should be emitted under foo;bar;O, etc.

I'm afraid I know basically nothing about the xctrace data storage or output formats, so can't be of much help in terms of going from that to what flamegraph needs.

@stevecheckoway
Copy link

For what it's worth, I spent some time looking at the exported xml files. They make heavy use of references to nodes in the XML document. Some samples reference complete backtraces. Other samples split the backtrace into two parts, a unique portion (I think) and a shared portion. Here's an example of that.

<backtrace id="182" fmt="first ← (3 other frames)">
  <process ref="141"/>
  <text-addresses id="183" fmt="frag 1862">4309499693 4309499756</text-addresses>
  <process ref="141"/>
  <text-addresses ref="180"/>
</backtrace>

Note the <text-addresses ref="180"/> node. It's referring to the second text-addresses node in the backtrace from this earlier sample.

<backtrace id="178" fmt="first ← (3 other frames)">
  <process ref="141"/>
  <text-addresses id="179" fmt="frag 1859">4309499697 4309499756</text-addresses>
  <process ref="141"/>
  <text-addresses id="180" fmt="frag 1860">4309499756 4310323340</text-addresses>
</backtrace>

So I think the backtrace consists of these addresses:
4309499693 4309499756 4309499756 4310323340

I don't know why there's a duplicated address (in bold above). In my experiments, there is usually—but not always—a duplicated address.

Note that these addresses are subject to address space layout randomization. I wasn't able to find the dynamic image base addresses in the exported files. I considered using function name (first is the name of the function in both of these samples) plus symbols in the binary to find the offset between the dynamic base address and the static base address.

For example, this test binary has the following symbols.

$ nm a.out
0000000100000000 T __mh_execute_header
0000000100008000 S _dummy
0000000100003ed0 T _first
0000000100003d48 T _fourth
0000000100003f58 T _main
                 U _puts
0000000100003e48 T _second
0000000100003dc0 T _third

The difference between the first text address and the address of _first is 4309499693 - 0x0000000100003ed0 = 0xdd805d. This computer is an M1 so the page size is 0x4000. This suggests that the difference between the dynamic image base of a.out when I ran xctrace was 0xdd8000.

We can subtract that value from each of the addresses in the backtrace.

  1. 4309499693 - 0xdd8000 = 0x100003f2d
  2. 4309499756 - 0xdd8000 = 0x100003f6c
  3. 4309499756 - 0xdd8000 = 0x100003f6c
  4. 4310323340 - 0xdd8000 = 0x1000cd08c

The second and third are duplicates as mentioned above and the fourth is wrong because it's not actually from a.out but is from the start function which calls main. I confirmed this by looking at the corresponding sample in Instruments.

Sample Time	Core	Process	Thread	State	Backtrace	
00:00.051.294	CPU 3 (Performance)	a.out (94640)	Main Thread  0x766c18	Running	first ← (3 other frames)	

The Backtrace pane for this sample shows first, main, main, and start.

You might notice that the first address is odd. I have no idea what to make of that!
Here's some disassembly in first from around that address.

100003f24: 4a 29 cb 9a 	asr	x10, x10, x11
100003f28: 08 01 0a eb 	subs	x8, x8, x10
100003f2c: 28 01 00 b9 	str	w8, [x9]
100003f30: 01 00 00 14 	b	0x100003f34 <_first+0x64>
100003f34: e8 07 40 f9 	ldr	x8, [sp, #8]
100003f38: 08 05 00 91 	add	x8, x8, #1

Maybe it refers to address 0x100003f2c. I'm not sure.

Here's the initial disassembly of main.

0000000100003f58 <_main>:
100003f58: ff 83 00 d1 	sub	sp, sp, #32
100003f5c: fd 7b 01 a9 	stp	x29, x30, [sp, #16]
100003f60: fd 43 00 91 	add	x29, sp, #16
100003f64: bf c3 1f b8 	stur	wzr, [x29, #-4]
100003f68: da ff ff 97 	bl	0x100003ed0 <_first>

The address 0x100003f6c makes sense. It's the address of the instruction following the bl instruction in main which is exactly what you'd get by walking a call stack.

I stopped looking at this when I first saw the odd address a week or so ago hence this is the end of my analysis.

Open questions:

  1. Can the load addresses of the binary and libraries be recovered correctly? (My approach above is heuristic and is going to fail at least when a sample is taken more than a page from the start of a function which is more likely on x86-64 than on ARM64 due to page size differences.)
  2. Why are some addresses odd?

@kraktus
Copy link
Author

kraktus commented Aug 13, 2022

Hey,

Thanks for your investigations, on my end also tried to reconstruct the backtrace from the addresses but failed. I was under the assumption that several text-addresses meant multiple threads.

The encoding is an array of packed values, where the 32 MSBs are the process and the 32LSBs is the fragment.

Debugging display only. Fragments lack the process they came from, so those text addresses can't be converted to symbols or further analyzed.

A month ago I did create a post about it on the apple forums which instructed me to file a report feedback, and I haven’t gotten any update since :| Maybe bumping the thread will be on interest for those who have an apple dev account.

@mstange
Copy link

mstange commented Oct 11, 2022

There's some existing parsing code for this format in speedscope and in a Firefox Profiler PR, but neither works with the format from recent Xcode versions. Anyone interested in implementing this may want to consult firefox-devtools/profiler#3684 (comment) and follow various links from there.

@kraktus
Copy link
Author

kraktus commented Oct 11, 2022

Unfortunately all links there are outdated. I feel upvoting the question on apple forums is our best bet. Maybe asking on SO could be beneficial too.

@kraktus
Copy link
Author

kraktus commented Oct 30, 2022

I have created a SO question hoping for someone more knowledgable about xctrace to answer, feel free to upvote to raise its visibility.

@benjaminRomano
Copy link

benjaminRomano commented Nov 6, 2022

I just stumbled upon this thread. I’ve built a tool that can convert Instruments into Gecko format for Firefox Profiler. It can read the samples from XCTrace, extract load address for all loaded libraries from the KDebug tracepoints and desymbolicate using that. I’m going to see if I can open source this, then you could use that as a frame of reference (its written in Kotlin). My use case was that I wanted interactive flame graphs and more importantly time-ordered sample view.

In the meantime, here is a run down of how this works...

Background

First a quick primer on the XCTrace output format. XCTrace tries to avoid redundancies in the output by using reference nodes. Each node with a given tag, set of attributes and value is given a unique ID for that tag . Future nodes that use the same tag, attributes and values will include a ref attribute whose value is the ID of the original node.

<kdebug-class id="1" fmt="0x1">1</kdebug-class>
<!-- Equivalent to <kdebug-class id="1" fmt="0x1">1</kdebug-class> -->
<kdebug-class ref="1 />

Documentation on the set of tables available and their formats can be found by opening up a trace file in Instruments and then going to Documents > Inspector.

Secondly, the XCTrace file contains kdebug events. We can inspect KDebug tracepoints to find the data we need for load addresses.

For example, we can find the launch executable kdebug event which contains the loadAddress of main application

 launchTraceID = dyld3::kdebug_trace_dyld_duration_start(DBG_DYLD_TIMING_LAUNCH_EXECUTABLE, (uint64_t)state.config.process.mainExecutable, 0, 0);

This event corresponds to KDEBUG_CODE(0x1f, 0x07, 1) or the following in the XML:


<row>
   <kdebug-class fmt="0x1f">31</kdebug-class>
    <kdebug-subclass fmt="0x5">5</debug-subclass>
    <kdebug-code>1</debug-code>
    <kdebug-func>
    <kdebug-arg>
        <-- first arg is result for this event (this can be determined by tracing through dyld code -->
    </kdebug-arg>
    <kdebug-arg>
       <-- load address -->
    </debug-arg>
</row>

Note: As shown below, we'll actually want to use a different tracepoint but this one is simpler to understand

Summary

At a high-level, we need to extract the backtraces from the XCTrace format and secondly find the load addresses for every library loaded by dylidb using the kdebug event info.

1. Extract backtraces

// Grab the backtraces (Note: `@schema="time-sample" may be easier to parse, but I only found out about that table after the fact)
// Note: The below assumes that we are analyzing the first run in the trace file as indicated by `run[1]` 
xtrace export --input <file> --xpath '/trace-toc[1]/run[1]/data[1]/table[@schema="time-profile"]'

Find backtrace nodes and then the sample-time and thread info can be found as siblings of this backtrace node. Within the backtrace node can be multiple text-addresses as far as I can tell (this is worth sanity checking though) you just parse the text-address values left to right (top of stack to bottom of stack). For the text addresses, they can then be converted into hex codes.

 <row>
     <sample-time id="177" fmt="00:00.411.046">411046250</sample-time>
    <thread ref="169"/>
     <process ref="40"/>
    <core ref="171"/><thread-state ref="172"/>
    <weight ref="173"/>
     <backtrace id="178" fmt="0x11d690681 ← (1 other frames)">
        <process ref="40"/>
        <text-addresses id="179" fmt="frag 801">4788389505</text-addresses>
        <process ref="40"/>
         <text-addresses id="180" fmt="frag 802">4788032064</text-addresses>
     </backtrace>
</row>

Note, the backtrace node fmt attribute has a string which can be ignored AFAICT.

3. Find load addresses

We can get library name and it's load address using KDBG_CODE(DBG_DYLD, DBG_DYLD_UUID, DBG_DYLD_UUID_MAP_A) event (reference). Note, the info for this tracepoint is split into another kdebug event KDBG_CODE(DBG_DYLD, DBG_DYLD_UUID, DBG_DYLD_UUID_MAP_B).

// Grab the kdebug events 
xtrace export --input <file> --xpath '/trace-toc/run[1]/data/table[@schema="kdebug"]'


Find the relevant kdebug node for KDBG_CODE(DBG_DYLD, DBG_DYLD_UUID, DBG_DYLD_UUID_MAP_A) and the associated KDBG_CODE(DBG_DYLD, DBG_DYLD_UUID, DBG_DYLD_UUID_MAP_B) entries in XML. I just scanned for each individually and then associated the two in-order.

From their get the relevant arguments from each. The 2nd arg of DYLD_UUID_MAP_A is the load address and 1st arg of DYLD_UUID_MAP_B is the string ID for the library.

Given that string ID, go and find the following entry <kdebug-string fmt="<ID>" then the string will be in a sibling node, <raw-string> under the fmt attribute.

// Grab kdebug strings (we want strings from `0x1f,0x05` events which corresponds to DBG_DYLD,DBG_DYLD_UUID)
xtrace export --input <file> --xpath '/trace-toc/run[1]/data/table[@schema="kdebug-strings"][@codes="&quot;0x1f,0x05&quot; &quot;0x1f,0x07&quot; &quot;0x1f,0x08&quot;"]'

e.g.

<row><kdebug-string fmt="0x1">1</kdebug-string><raw-string>String for ID #1</raw-string></row>

4. De-symbolicate

Given the backtraces and addresses, de-symbolicate using a tool like atos. The framework dsyms can be found at:

# Replace `16.1` with your device / arch
# Library name comes from kdebug strings above
~/Library/Developer/Xcode/iOS\ DeviceSupport/16.1\ \(20B82\)\ arm64e/Symbols/<libraryName>
# Run the list of addresses through `atos` for every dSYM file
# Unfortunately, this is really slow. Maybe there is a better way. It takes ~1m30s for ~800 dSYMs on my M1 macbook even when parallelized
$ atos -arch arm64e -o <dSymFile> -l <loadAddress> -f <file with all addresses>

Running this for each dSym will give you a list of partially de-obfuscated symbols. Merge the lists together (preserve the lines that don't start with 0x). This will give you a final address to symbolName mapping.

With that, you can take the backtraces and the mapping and transform into some other format.

@mstange
Copy link

mstange commented Nov 6, 2022

This sounds excellent. It would be amazing if you could open source your code. I'd love to integrate a rust version of it into samply, which can perform fast symbolication and also allows the source view in the Firefox profiler to work.

@benjaminRomano
Copy link

Here is repo with logic to convert Instruments to Gecko: https://github.com/benjaminRomano/instruments-to-gecko

@kraktus
Copy link
Author

kraktus commented Nov 12, 2022

Thank you very much for sharing your findings and code @benjaminRomano!

Quick question, I see that you parse the backtraces then de-symbolicate. I think that for inferno it’d would make better sense to get as input de-symbolicated backtraces. The preprocessing would be left to the user. Do you think that would cause issues? Can we just de-symbolicate beforehand with atmos and replace addresses in the backtrace file without corrupting it?

@benjaminRomano
Copy link

benjaminRomano commented Nov 13, 2022

So the xctrace file should already contain all the symbols (and if it doesn't xctrace symbolicate can be used), but the open question is how do we get to that data from xtrace export, which its very possible it's not exported. For example, there is a dyld-library-load table listed in the table of contents of xtrace export --toc, but accessing it through it's xpath doesn't return any data even though I can see it in Instruments Inspector.

If we can figure that out it greatly simplifies the conversion and it also avoids one of the issues I ran into where the KDebug tracepoints with the load addresses only get emitted on process creation.

@benjaminRomano
Copy link

@kraktus XCode 14.3 Beta updated xctrace export to enable getting the fully desymbolicated backtraces. I've updated https://github.com/benjaminRomano/instruments-to-gecko to use that approach. This makes the conversion much simpler and faster.

@kraktus
Copy link
Author

kraktus commented Mar 25, 2023

Very nice! Apple recently came around my feedback assistant request with the same answer, but I wasn’t able to install the beta version due to lack of developer status at the time.

@ldm0 ldm0 linked a pull request Mar 26, 2023 that will close this issue
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants