
Input size estimates for more robust heuristic cost calculations #477

Closed · TristonianJones opened this issue Feb 3, 2022 · 5 comments

@TristonianJones (Collaborator)
Change
There is limited support for estimating the heuristic compute cost of an expression. The inferences are
limited, though, when dealing with lists and maps, particularly within comprehensions, where the cost
is estimated to be infinite because the number of elements in the value is not known.

To aid in the computation of cost, provide a callback that can return the max size of a declaration
provided to the CEL environment. Knowing the size range ensures that the min and max compute
costs can be estimated more accurately in the majority of cases.

Example

env.Program(ast, cel.DeclSizeEstimator(func(d *exprpb.Decl) int {
    // Known large variable: assume at most 1000 elements.
    if d.GetName() == "big_list" {
        return 1000
    }
    // Default estimate for any other list-typed declaration.
    if d.GetIdent().GetType().GetListType() != nil {
        return 10
    }
    return 1
}))

Alternatives considered

Attaching a size directly to the Decl would be vastly preferred, but is not yet possible as all declarations
are based on protobuf objects. In the future, these proto-based declarations will likely be replaced with
Go-native structs, but this is a bigger refactor than is needed to provide this functionality.

@jpbetz (Collaborator) commented Feb 7, 2022

I tried out a quick prototype of this and was able to get it basically working.

The biggest problem is that I needed to use a func(t *exprpb.Type) int callback (note the use of a type, not a decl).

When I do this, there needs to be a clear way to map t to the type information the callback has size estimates for. This is tricky because a schema can contain multiple lists with the same element type, each with its own maxLength limit, yet their Types will be identical.

I haven't figured out how to address that yet. Some options I've considered:

  1. Pass the exact same type reference that was provided to the env into the callback so the callback can associate it with the types that were passed into the env (this is tricky because of sanitize(), but maybe CEL could use sanitized types everywhere in the codebase except where it calls the callback?)
  2. Find some way to associate identifiers with the types ("named types" in the env?), and provide the identifier to the callback. I haven't figured out a way to do this.
  3. Provide both the type and a field path to the callback (e.g. "var1.field.path.to.list").

I'm not thrilled with any of these options, TBH. (2) might be the most future-proof against replacing the protobuf types with Go structs, but I haven't found a way to associate the identifiers. (3) is complicated, and I haven't proved to myself that the information needed is available. (1) is something I know how to implement, but relying on reference-equality checks seems less than ideal.
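To make (3) concrete, here's a rough sketch of what such a callback might look like. SizeEstimate, PathSizeEstimator, and schemaEstimator are hypothetical names, not existing cel-go API, and the path and limit are illustrative:

package estimator

import (
	exprpb "google.golang.org/genproto/googleapis/api/expr/v1alpha1"
)

// SizeEstimate is a hypothetical min/max bound on element count.
type SizeEstimate struct {
	Min, Max int64
}

// PathSizeEstimator receives both the field path and the type of the value
// whose size is being estimated, per option (3).
type PathSizeEstimator func(path string, t *exprpb.Type) *SizeEstimate

// schemaEstimator resolves the path against a schema the caller knows about.
func schemaEstimator(path string, t *exprpb.Type) *SizeEstimate {
	if path == "var1.field.path.to.list" {
		return &SizeEstimate{Min: 0, Max: 100} // e.g. the schema's length limit
	}
	return nil // unknown path: fall back to the default heuristic
}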

@TristonianJones (Collaborator, Author)

Hi Joe,

Can you help me understand why you're using the Type as input? Perhaps it would make more sense to provide an Attribute and the referenced Type? There are a few flavors of Attribute:

  • Absolute - Fully qualified name of a variable identifier plus a set of qualifiers.
  • Maybe - One of several possible qualified variable names and qualifiers.
  • Conditional - Branched variable selection followed by a qualifier.
  • Relative - Value selection that depends on some other attribute being resolved first.

It had been assumed that all variables have a fixed cost of 1 and that each qualifier costs only 1 to resolve; however, it now seems that the cost should instead be derived from the type of the value the attribute produces ...

You might want to consider wiring your estimation changes into the AttributeFactory in interpreter/attributes.go: the factory has access to the ref.TypeProvider and could be combined with additional metadata about the cost of various fields within the object graph. That would make it possible to estimate the input range size of comprehensions more accurately. A rough sketch of the decoration follows.
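This is only a sketch under assumptions: sizingAttrFactory and sizeHints are illustrative names, and exactly how the hint is surfaced to the cost estimator is the open design question in this thread:

package estimator

import (
	"github.com/google/cel-go/interpreter"
)

// sizingAttrFactory is a hypothetical decorator over the standard factory
// that looks up configured max sizes by fully qualified variable name.
type sizingAttrFactory struct {
	interpreter.AttributeFactory
	sizeHints map[string]int64 // e.g. {"var1.list": 50}
}

func (f *sizingAttrFactory) AbsoluteAttribute(id int64, names ...string) interpreter.NamespacedAttribute {
	attr := f.AttributeFactory.AbsoluteAttribute(id, names...)
	for _, name := range names {
		if max, ok := f.sizeHints[name]; ok {
			// A real implementation would attach max to the returned
			// attribute or report it to the cost estimator; how to do
			// that is precisely what remains to be designed.
			_ = max
		}
	}
	return attr
}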

Thoughts?

@jpbetz (Collaborator) commented Feb 7, 2022

Can you help me understand why you're using the Type as input?

It's not ideal actually.

Given an OpenAPIv3 schema:

type: object
properties:
  list1:
    type: array
    maxLength: 50
    items:
      type: int
  list2:
    type: array
    maxLength: 200
    items:
      type: int

And a CEL expression: var.list1.all(x, x < 100) && var.list2.all(x, x < 200)

There would be only one declaration (var) with two fields that are both of type list[int], so providing the type to the callback is ambiguous. Hence the list of options in my previous comment.
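A small illustration of the ambiguity, assuming the v1alpha1 expression protos: both fields materialize as structurally identical Type messages, so a type-keyed lookup cannot distinguish them:

package main

import (
	"fmt"

	exprpb "google.golang.org/genproto/googleapis/api/expr/v1alpha1"
	"google.golang.org/protobuf/proto"
)

// listOfInt builds the list[int] type proto that both fields resolve to.
func listOfInt() *exprpb.Type {
	return &exprpb.Type{
		TypeKind: &exprpb.Type_ListType_{
			ListType: &exprpb.Type_ListType{
				ElemType: &exprpb.Type{
					TypeKind: &exprpb.Type_Primitive{Primitive: exprpb.Type_INT64},
				},
			},
		},
	}
}

func main() {
	list1Type := listOfInt() // type of var.list1 (maxLength: 50)
	list2Type := listOfInt() // type of var.list2 (maxLength: 200)
	// The two types are equal, so a func(t *exprpb.Type) int callback
	// cannot tell which maxLength applies.
	fmt.Println(proto.Equal(list1Type, list2Type)) // true
}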

Perhaps it would make more sense to provide an Attribute and the referenced Type?

Yes, this looks like it's close to what I want.

You might want to consider wiring your estimation changes into the AttributeFactory in interpreter/attributes.go since the factory has access to the ref.TypeProvider and could be combined with additional metadata regarding the cost of various fields within the object graph more accurately. This change would then make it possible to have a more accurate estimate of the input range size to comprehensions.

Thoughts?

If I can use the schema path to find the corresponding node in the OpenAPIv3 schema, then I have what I need. Attribute seems like the right concept. I don't know how to handle this case:

var1.list.all(sublist, sublist.all(e, e < 100))

For the inner sublist, it looks like the attribute is just "sublist"? I would need the "var1.list" path part as well to match it up with the OpenAPIv3 schema.

@TristonianJones (Collaborator, Author) commented Feb 8, 2022

Comprehension variables are harder to track. You would have access to the initial iterRange and could then determine whether var1.list contains other lists. I'm not sure whether the sublists in your case would have identifiers associated with them or a per-element max size that could be configured. One thought, though, would be to move the m*n cost calculation into the client's domain as the top-level configuration size, and then pretend that all nested operations on the iterVar cost the same as operations on single-element lists.
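As a toy illustration of that model (the numbers are assumed, not from cel-go): with a client-configured top-level size of 50 for var1.list, the nested all() is costed as if its range held a single element, so the m*n product collapses to m*1:

package main

import "fmt"

func main() {
	// Client-supplied top-level size hint for var1.list (assumed value).
	outerMax := int64(50)
	// The nested comprehension over the iterVar is treated as a
	// single-element list, so its per-iteration cost is a constant.
	innerCostPerElem := int64(1)
	fmt.Println("estimated max cost:", outerMax*innerCostPerElem) // 50
}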

@TristonianJones (Collaborator, Author)
Closed, tracking remaining work as #504
