
Consider adding node labels for more diagnosable error messages for async errors. #585

Open
philloooo opened this issue Feb 26, 2024 · 17 comments

Comments

@philloooo
Contributor

As a follow-up to #572, we propose that platform-specific validations be done during the async build step.

This poses a challenge for developers: when they submit a complex graph and one step within it fails a platform-specific check, it's hard to trace the error back to the specific operand in the graph.

I propose following WebGPU's practice: define an MLObjectBase with a label field and have MLOperand extend from it.
Usage would look like:

const builder = new MLGraphBuilder(context);

const A = builder.input('0', operandType);
const B = builder.input('1', operandType);

const C = builder.matmul(A, B); 
C.label = "step1:matmul";
const D = builder.add(A, C);
D.label = "step2:add";
// ... keep building a complex graph
...
const finalOperand = builder.add(E, F);

// Build the graph.
const graph = await builder.build({'output': finalOperand});


> Uncaught DOMException: Model graph build error: [Operand "step1:matmul"] input dimensions XXX exceed supported limit.

The MLObjectBase could also be extended by:

  • MLBuffer to help with debugging async buffer related errors.
  • MLGraph to help with debugging async errors from chained inference.
@zolkis
Collaborator

zolkis commented Feb 27, 2024

Could the build process auto-add the labels according to an auto-instrumentation algorithm to be specified?

@philloooo
Contributor Author

@zolkis that's an interesting idea, can you elaborate more with some examples?

@zolkis
Collaborator

zolkis commented Feb 28, 2024

Well, IIUC the example above annotates ops with labels to help developers figure out what went wrong when exceptions are thrown.

If that is the main use case, boilerplate code is needed to manually add a label at every single step. When no label is specified, the exception carries no information.

Instead of manually setting explicit labels, annotations (implicit labels) could be added automatically by the build algorithm, which knows where in the compute graph it currently is. When an exception occurs, the internal label like "step1:matmul" (standard names to be defined) could be passed.

The example changes only in that the labels are not developer-injected but internally generated (as specified by an eventual future algorithm).

const builder = new MLGraphBuilder(context);

const A = builder.input('0', operandType);
const B = builder.input('1', operandType);

const C = builder.matmul(A, B); 
const D = builder.add(A, C);
// ... keep building a complex graph
...
const finalOperand = builder.add(E, F);

// Build the graph.
const graph = await builder.build({'output': finalOperand});

> Uncaught DOMException: Model graph build error: [Operand "step1:matmul"] input dimensions XXX exceed supported limit.

IOW, the labels could be owned and attached by the implementation.

The advantage is that this covers all graphs at every level automatically; the disadvantage is the lack of developer-given labels (and no possibility to piggyback other instrumentation). However, the whole process is under the implementation's control, with no worries about sanitizing/checking developer-injected labels.

I am not even sure we must standardize the namespace for such labels, as this could be owned by the implementations (when it's only meant for human eyes), unless programmatic handling of that information is needed.

On the other hand, if there is experience and positive developer feedback on the label feature in WebGPU (citations needed), I have no objection to using it in WebNN as well.

@inexorabletash
Contributor

> annotations (implicit labels) could be added automatically by the build algorithm, which knows where in the compute graph it currently is

Thanks for explaining - I like this idea. Basically, in build() it would copy the label from the MLOperand if set; if not, it would use an algorithm to generate one (sketched after the notes below). This gets stuffed into what I'm calling the "platform operator" (or is it the "platform operand"?) and is used if build() or compute() fails.

A few notes:

  • Since build() is async, we need to deal with labels changing during the build process. Right now MLOperand is immutable; this proposal would change that. Various options are possible, including freezing or snapshotting, but it needs to be thought through. Also relevant if a builder can be re-used (Can an MLGraphBuilder be reused? #567), I think?
  • Re: "I am not even sure we must standardize the name space for such labels" - someone will end up parsing them, even if we say "please don't!", so we probably will have to specify them, but we can look for precedent
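A minimal sketch of that copying step (resolveLabel and autoLabel are hypothetical names; nothing here is specified):

// Inside build(), for each operand in the graph (illustrative only):
function resolveLabel(operand, index) {
  // A developer-provided label wins; otherwise generate one, e.g. "step1:matmul".
  return operand.label || autoLabel(operand, index);
}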

@zolkis
Collaborator

zolkis commented Mar 7, 2024

> Basically, in build() it would copy the label from the MLOperand if set; if not, it would use an algorithm to generate one.

I support this, looks like the best way to go.

@fdwr fdwr changed the title Consider using label to allow better error handling for async errors. Consider adding node labels for more diagnosable error messages for async errors. Mar 7, 2024
@philloooo
Contributor Author

Thanks @zolkis! Auto-instrumentation would indeed be less work for developers. My concern is whether the system can add meaningful enough labels. The more complex a model gets, the less useful something like a step number (as in your example) becomes.

Imagine a transformer model: developers would probably namespace the labels as something like decoder.attn.0.cross_attn.conv1 to give structural context.
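For instance (label text hypothetical), the matmul from the earlier example might be labeled:

const C = builder.matmul(A, B);
C.label = 'decoder.attn.0.cross_attn.qk_matmul';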

Allowing developers to specify labels, with a fallback to a system-generated label, seems more plausible.

As for WebGPU label usage, I don't have exact stats to point to, but I did consult our WebGPU team and they mentioned developers find the labels extremely useful.

@zolkis
Collaborator

zolkis commented Mar 7, 2024

Right, developers can add labels - but when they don't, does that mean they don't want any other information, i.e. should we ditch the auto-generation? Or add an option for that?

@philloooo
Contributor Author

philloooo commented Mar 7, 2024

For auto-generated information, we have a couple options:

  1. Use them as default labels only when developers don't specify labels.
  2. Always add this information to the error message, after the user-specified labels. We don't need to specify it in the spec; it's an implementation detail. The spec only specifies that if developers provide labels, they will be included in the error message.

Option 2 would look like:

[Operand "user_specified_label"] op:mul step:2 - input dimensions XXX exceed supported limit.

It seems that if the user agent can add useful annotations to the error message, it can just always add them. So I'd prefer option 2?
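A sketch of how an implementation might assemble such a message (the format is purely illustrative, not specified anywhere):

// userLabel is developer-provided; opName, step, and detail come from the implementation.
const message =
    `[Operand "${userLabel}"] op:${opName} step:${step} - ${detail}`;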

@fdwr
Collaborator

fdwr commented Mar 18, 2024

Related, we're currently trying to diagnose some Yolo-V9 slice issues, and the lack of diagnostic info is impeding investigation ("Failed to execute 'slice' on 'MLGraphBuilder': For dimension (0): the starting index to slice must be less than input size (2)." - cool, but which slice node?...)

@inexorabletash
Contributor

I may have said this in a WG telecon, but this feels like something where a prototype implementation would help inform the spec. So if any of the Chromium contributors who are feeling the pain want to hack something in, don't wait on spec discussions and it doesn't need to be perfect! Let's iterate and learn.

@lisa0314

lisa0314 commented Apr 26, 2024

This CL (5492314: WebNN: initial implementation of adding label for MLOperand | https://chromium-review.googlesource.com/c/chromium/src/+/5492314) attempts to add a label to MLOperand to report more detailed error messages during the async build. As a POC, I only add the label for the slice operator.

const builder = new MLGraphBuilder(context);

const A = builder.input('0', operandType);
const B = builder.input('1', operandType);

const C = builder.matmul(A, B); 
C.label = "step1:matmul";
const D = builder.add(A, C);
D.label = "step2:add";
// ... keep building a complex graph
...
const finalOperand = builder.add(E, F);

// Build the graph.
const graph = await builder.build({'output': finalOperand});

One question about the IDL definition: if a sequence of operands is returned when invoking builder.split(), should we set labels on all of them?
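For illustration, with the label-on-MLOperand proposal each returned operand could be labeled individually (label strings hypothetical):

const [C0, C1] = builder.split(A, 2);
C0.label = 'split:0';
C1.label = 'split:1';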

@lisa0314

lisa0314 commented Apr 26, 2024

Another proposal: the label could also be added to MLOperator.

const builder = new MLGraphBuilder(context);

const A = builder.input('0', operandType);
const B = builder.input('1', operandType);

const C = builder.matmul(A, B, "matmul"); 
const D = builder.add(A, C, "add");
// ... keep building a complex graph
...
const finalOperand = builder.add(E, F, "add2");

// Build the graph.
const graph = await builder.build({'output': finalOperand});

Any thoughts?

@mingmingtasd
Contributor

> Another proposal: the label could also be added to MLOperator. [...] Any thoughts?

Agree! In our experience debugging graphs translated from frameworks like onnxruntime, the operator/node name is very useful for matching an operator in the .onnx model with the implemented operator in WebNN backends.
Especially in large models with many operators, it's difficult to tell which operator is involved, since operators can share the same type/attributes. But the operator's name is unique: if we can pass the name to the WebNN backend, debugging becomes easier, and we can report more detailed error messages to web users/developers, for example: Failed to create the matmul operator: /layers.0/self_attn/q_proj/MatMul. You can then open the .onnx model in a visualization tool like https://netron.app/ and search for the operator by name.

[Screenshot: Netron view of the model with the named MatMul operator selected]

@mingmingtasd
Contributor

We can have both: an operand name and an operator name.
Another proposal: for the operand label, we could extend it like

const C = builder.matmul(A, B); 
C.label = {name: "C", fromOperator: "step1:matmul"}

@philloooo
Contributor Author

philloooo commented May 2, 2024

The downside of adding it to MLOperator is that MLOperator doesn't exist as a concept in the spec, so we would need to add the parameter to each builder method. It also makes the parameter lists longer.

Alternative A - add to options dict

Another alternative is to add it to the options dict: all builder methods would take an options dict, and existing option dicts like MLClampOptions would inherit from a base options dict that has a label field.
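A usage sketch for alternative A, assuming methods like matmul gain an optional options dict and MLClampOptions inherits the label member:

const C = builder.matmul(A, B, {label: 'step1:matmul'});
const D = builder.clamp(C, {minValue: 0, label: 'step2:clamp'});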

Alternative B - extend MLOperand's label field

Extending the operand label field seems a bit non-intuitive:

  • In most cases we just want the label to be a string. Would we then support both string and object types for label?
  • If we have const [C, D] = builder.method(A, B); and C and D set different fromOperator values, does the last one win?

Alternative C - keep with current proposal

We can also keep the current proposal and treat the multiple-outputs case as an edge case, handling it less elegantly.
For example, for lstm, we can iterate through the output operands and use the first non-empty label string. That's probably still informative enough for debugging.
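A hypothetical sketch of that fallback for a multi-output op like lstm:

// Use the first output operand whose label is non-empty (illustrative only).
const label = outputs.find(o => o.label)?.label ?? '';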

@zolkis
Collaborator

zolkis commented May 3, 2024

Considering Joshua's comment, alternative A seems to fit best (it handles MLOperand's immutability vs. labels' mutability). The options dict parameter can be optional.

We also need an algorithm for generating good enough default labels (implementation specific, but as per the comment above, we should rather standardize the namespace before devs start to parse them and come up with various private namespaces).
If that is good enough, we'd probably not need manual labels very often.

But if labels are used frequently, then from a developer coding perspective, I'd prefer the solution that sets labels on separate lines, like in alternative B, as it allows separating code instrumentation from business logic. If we could find a means to do this correctly, I'd go with that.

@lisa0314

lisa0314 commented May 11, 2024

This CL (5528797: WebNN: initial implementation of adding name for MLOperator | https://chromium-review.googlesource.com/c/chromium/src/+/5528797) attempts to add a label to MLOperator to report more detailed error messages during the async build. As a POC, I only add the label for the resample2d operator.

const builder = new MLGraphBuilder(context);

const inputShape = [1, 1, 2, 4];
const input = builder.input(
    'input', {dataType: 'float32', dimensions: inputShape});
const options = {scales: [2.0, 2.0], label: 'resample2d'};
const resample2d = builder.resample2d(input, options);

// Build the graph.
const graph = await builder.build({'output': resample2d});
