Add more nvFuser debug information #387

riccardofelluga · 2024-05-08T19:51:26Z

🚀 Feature

For debugging purposes, I would like to be able to quickly retrieve input information for a fusion.

Motivation

When debugging performance issues, it is important to be able to narrow the search space quickly and efficiently.

Pitch

At the time of writing, debugging nvFuser fusions is already very approachable, since fusion definitions can be retrieved from the trace. Retrieving the input information, however, is not trivial. Therefore, to improve the experience even further, I propose to add a mechanism to retrieve input information for a fusion definition.

Alternatives

Trigger an error in nvFuser and let it print the debugging information.
Meticulously read a trace and write down input shapes.

Additional context

This idea came up after starting to work on #205, where @kevinstephano suggested showing how to dump the debug information for specific fusions and execute them in a notebook. This issue works toward that goal.

cc @carmocca @apaz-cli @Borda @tfogal

IvanYashchuk · 2024-05-10T14:41:07Z

An nvFuser fusion region in the Thunder trace is represented as a BoundSymbol. You can get all bound symbols from TraceCtx.bound_symbols, and for each bound symbol you can access its inputs with BoundSymbol.args, BoundSymbol.kwargs, or BoundSymbol.flat_args.
All nvFuser bound symbols have BoundSymbol.is_fusion set to True and their names start with nvFusion, so you can filter out all other symbols with a simple list comprehension:

nvfuser_symbols = [bsym for bsym in trace.bound_symbols if bsym.sym.name.startswith("nvFusion")]

There's also the thunder.examine.get_fusion_symbols function, which does the same using is_fusion:

lightning-thunder/thunder/examine/__init__.py, line 207 in dd42bb3:

def get_fusion_symbols(trace: TraceCtx, warn_if_fusions_unavailable: bool = True) -> list[BoundSymbol]:

Here's an example session using bound symbols info to retrieve information on inputs:

In [1]: import torch

In [2]: import thunder

In [3]: @thunder.jit
   ...: def func(x):
   ...:     t1 = thunder.prims.var(x, (0, 1), correction=1)
   ...:     t2 = thunder.prims.add(t1, t1)
   ...:     return t2
   ...: 

In [4]: x = torch.randn(512, 512, device="cuda")

In [5]: out = func(x)

In [6]: thunder.last_traces(func)[-1]
Out[6]: 
# Constructed by Delete Last Used (took 0 milliseconds)
import torch
from thunder.executors.torchex import no_autocast

@torch.no_grad()
@no_autocast
def computation(x):
  # x: "cuda:0 f32[512, 512]"
  [t2] = nvFusion0(x)
    # t1 = prims.var(x, (0, 1), correction=1)  # t1: "cuda:0 f32[]"
    # t2 = prims.add(t1, t1)  # t2: "cuda:0 f32[]"
  del x
  return t2

In [7]: import thunder.examine

In [8]: thunder.examine.get_fusion_symbols(thunder.last_traces(func)[-1])
Out[8]: 
[[t2] = nvFusion0(x)
   # t1 = prims.var(x, (0, 1), correction=1)  # t1: "cuda:0 f32[]"
   # t2 = prims.add(t1, t1)  # t2: "cuda:0 f32[]"]

In [9]: trace = thunder.last_traces(func)[-1]

In [10]: nvfuser_symbols = [bsym for bsym in trace.bound_symbols if bsym.sym.name.startswith("nvFusion")]

In [11]: nvfuser_symbols
Out[11]: 
[[t2] = nvFusion0(x)
   # t1 = prims.var(x, (0, 1), correction=1)  # t1: "cuda:0 f32[]"
   # t2 = prims.add(t1, t1)  # t2: "cuda:0 f32[]"]

In [12]: nvfuser_symbols[0].args
Out[12]: (x,)

In [13]: nvfuser_symbols[0].args[0].shape
Out[13]: (512, 512)

In [14]: nvfuser_symbols[0].args[0].dtype
Out[14]: float32
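The per-argument lookups in the session above can be gathered in one pass. What follows is a minimal sketch, not an existing Thunder API: summarize_fusion_inputs is a hypothetical helper that relies only on the attributes shown in this comment (sym.name, flat_args, shape, dtype) plus name and device attributes on each input, which are assumptions. It is duck-typed, so the stand-in objects at the bottom take the place of a real trace for illustration.

```python
from types import SimpleNamespace


def summarize_fusion_inputs(bound_symbols):
    """Map each nvFusion region name to a list of (name, shape, dtype, device).

    Hypothetical helper: it only reads `sym.name`, `flat_args`, and the
    per-input `name`/`shape`/`dtype`/`device` attributes (the last two names
    are assumptions). Inputs without a shape, such as scalars, are skipped.
    """
    summary = {}
    for bsym in bound_symbols:
        if not bsym.sym.name.startswith("nvFusion"):
            continue  # not an nvFuser fusion region
        inputs = []
        for arg in bsym.flat_args:
            if not hasattr(arg, "shape"):
                continue  # e.g. Python numbers carry no shape
            inputs.append((
                getattr(arg, "name", None),
                tuple(arg.shape),
                str(getattr(arg, "dtype", None)),
                str(getattr(arg, "device", None)),
            ))
        summary[bsym.sym.name] = inputs
    return summary


# Stand-in objects mimicking the session above (not real Thunder proxies).
x = SimpleNamespace(name="x", shape=(512, 512), dtype="float32", device="cuda:0")
fusion = SimpleNamespace(sym=SimpleNamespace(name="nvFusion0"), flat_args=[x])
other = SimpleNamespace(sym=SimpleNamespace(name="python_del"), flat_args=[x])

print(summarize_fusion_inputs([fusion, other]))
```

With a real trace, the same function would presumably be called as summarize_fusion_inputs(trace.bound_symbols).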

"I propose to add a mechanism to retrieve input information for a fusion definition."

Given the information above, what mechanism are you planning to add?

riccardofelluga · 2024-05-10T17:09:29Z

Thanks for the comment! To see what mechanism I came up with before you had a chance to comment, please check out the linked PR #388.

With this added context, I'll check out how I can reuse the examine mechanism in my PR and update this issue.

riccardofelluga · 2024-05-15T16:49:59Z

After further inspection, @IvanYashchuk, I still think these are two slightly different things. In the PR, the output is ready-to-run Python code for the fusion, whereas the method you explained gives similar information but lacks the strides and the code to run the fusions. However, I agree that it might be better to move the code from that PR to the examine.py file, say by adding something similar to get_fusion_symbols that returns the repro code for the fusions. What do you think?

Even better, I think I can get the information about the inputs from the trace using your technique, eliminating the need to modify the nvfuserex_impl.py file at all.
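To make the idea of "repro code for the fusions" concrete, here is a minimal sketch of the input-creation part only. It is not the code from PR #388: render_input_code and render_repro_inputs are hypothetical helpers that turn shape, dtype, device, and optional stride information into lines of Python source. The dtype is passed as the name of a torch dtype, which assumes the trace's dtype can be mapped to one.

```python
def render_input_code(name, shape, dtype, device, strides=None):
    """Return one line of Python source that recreates an input tensor.

    Hypothetical helper. `dtype` is the name of a torch dtype (e.g. "float32").
    Without strides the line uses torch.randn; with strides it uses
    torch.empty_strided, which leaves the values uninitialized.
    """
    if strides is None:
        return (
            f"{name} = torch.randn({tuple(shape)!r}, "
            f"dtype=torch.{dtype}, device={device!r})"
        )
    return (
        f"{name} = torch.empty_strided({tuple(shape)!r}, {tuple(strides)!r}, "
        f"dtype=torch.{dtype}, device={device!r})"
    )


def render_repro_inputs(inputs):
    """Join the lines for several inputs into a small script header."""
    lines = ["import torch", ""]
    lines += [render_input_code(**spec) for spec in inputs]
    return "\n".join(lines)


print(render_repro_inputs([
    {"name": "x", "shape": (512, 512), "dtype": "float32", "device": "cuda:0"},
]))
```

Generating text rather than tensors keeps the helper usable on a machine without a GPU; the emitted script is what would eventually run next to the fusion definition.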

IvanYashchuk · 2024-05-16T10:13:30Z

What is the goal?
Depending on the goal, stride information may be either unnecessary or essential.

Let's think of debugging scenarios. Here are the examples I could come up with:

thunder.jit(fn)(inputs) worked, but the result is incorrect or performance is poor, and we'd like to rerun only one specific fusion region, which happens to be an nvFuser one since this issue is about improving the Thunder+nvFuser experience. We can query the FusionDefinition, print nvFuser's representation, and call FusionDefinition.execute. Having the stride information of the actual inputs for this fusion region is a must, because the contiguity of a tensor is a static property of a given FusionDefinition instance and also affects performance.
thunder.jit(fn)(inputs) is not used; instead, the Thunder Trace object is constructed manually, following the steps that usually happen in thunder.jit, and before we even attempt to run the full execution trace we want to test individual fusion regions. At this point we don't know what the strides could be in a real program because we're executing in isolation. We can create sample inputs using the shape, type, and device information from the trace and pass them to our FusionDefinitionWrapper, which has a stride info cache. Allowing users to specify different memory layouts for inputs could be beneficial for testing the performance and correctness of similar but different FusionDefinitions.
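Both scenarios turn on the fact that one shape can be backed by different strides. As a small, library-free illustration (a sketch with hypothetical helper names, not Thunder or nvFuser code), the functions below compute the strides of a dense row-major layout, counted in elements as PyTorch does, and check whether a given set of strides matches it.

```python
def row_major_strides(shape):
    """Strides (in elements) of a dense, row-major tensor of `shape`."""
    strides = []
    running = 1
    for size in reversed(shape):
        strides.append(running)
        running *= max(size, 1)  # treat empty dims as size 1 for striding
    return tuple(reversed(strides))


def is_row_major_contiguous(shape, strides):
    """True if `strides` describe a dense row-major layout.

    Dimensions of size 1 are ignored, since their stride never affects
    which element is addressed.
    """
    expected = row_major_strides(shape)
    return all(
        size == 1 or stride == want
        for size, stride, want in zip(shape, strides, expected)
    )


shape = (512, 512)
print(row_major_strides(shape))                   # layout of a fresh tensor
print(is_row_major_contiguous(shape, (512, 1)))   # same layout
print(is_row_major_contiguous(shape, (1, 512)))   # transposed layout
```

The shape and dtype recorded in the trace are identical for the last two cases, which is why the trace alone cannot tell them apart in the first scenario.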

That's all specific to nvFuser as a FusingExecutor. How can this be extended to run any slice of a trace that involves any FusingExecutor and/or OperatorExecutor ops?

riccardofelluga added enhancement New feature or request debugging developer efficiency nvfuser labels May 8, 2024

riccardofelluga self-assigned this May 8, 2024

This was referenced May 8, 2024

Add input information to fusion definitions for trace inspection and debugging #388

Open

benchmarking — create a notebook showing how to work with the single gpu benchmarks #205

Open

mruberry added triage review and removed triage review labels May 13, 2024
