Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support "User defined coercion" rules #10423

Closed
alamb opened this issue May 8, 2024 · 2 comments · Fixed by #10439
Closed

Support "User defined coercion" rules #10423

alamb opened this issue May 8, 2024 · 2 comments · Fixed by #10439
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented May 8, 2024

Is your feature request related to a problem or challenge?

DataFusion automatically "coerces" (see docs here) input argument types to match the types required of operations or functions.

For functions, this is described by a desired TypeSignature

pub enum TypeSignature {
    Variadic(Vec<DataType>),
    VariadicEqual,
    VariadicAny,
    Uniform(usize, Vec<DataType>),
    Exact(Vec<DataType>),
    Any(usize),
    OneOf(Vec<TypeSignature>),
    ArraySignature(ArrayFunctionSignature),
}

However, some functions have special hard coded coercion logic such as sum and count (TODO link) as well as some Array functions like make_array. We started down the path of encoding the special array semantics into TypeSignature (see ArrayFunctionSignature))

However, as we continue to find other examples of different desired rules (most recently in sum and count), TypeSignature will grow and become more and more specialized

Describe the solution you'd like

@jayzhan211 had a great suggestion #10268 (comment) that in addition to encoding common coercion behaviors in TypeSignature, we can also add a variant of TypeSignature that permits user defined coercion rules

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added the enhancement New feature or request label May 8, 2024
@alamb alamb changed the title Support "User defined coercsion" function Support "User defined coercion" rules May 8, 2024
@jayzhan211 jayzhan211 self-assigned this May 8, 2024
@jayzhan211
Copy link
Contributor

jayzhan211 commented May 9, 2024

I found that to have coerce_types for both scalarUDF and aggregateUDF, I would need to introduce a more general trait for both function

impl UDFImpl for T {
    fn name(&self) -> &str {
       &self.name
    }

    fn coerce_types(&self, data_types: &[DataType]) -> Result<Vec<DataType>> {
        not_impl_err!("Function {} does not implement coerce_types", self.name)
    }
}

impl ScalarUDFImpl: UDFImpl
impl AggregateUDFImpl: UDFImpl

Then we can have

fn coerce_arguments_for_signature(
    expressions: Vec<Expr>,
    schema: &DFSchema,
    signature: &Signature,
    func: Arc<dyn UDFImpl>,
) -> Result<Vec<Expr>> {}

Another alternative is having a duplicate function for scalar and aggregate for a related function

fn coerce_arguments_for_signature(
    expressions: Vec<Expr>,
    schema: &DFSchema,
    signature: &Signature,
    func: &ScalarUDF,
) -> Result<Vec<Expr>> {}

fn coerce_arguments_for_signature(
    expressions: Vec<Expr>,
    schema: &DFSchema,
    signature: &Signature,
    func: &AggregateUDF,
) -> Result<Vec<Expr>> {}

I think the first option is potentially beneficial in the long run(?) but the user now needs to define two traits. The second option only increases the maintenance cost.

What do you think about this @alamb
I also track if there is rust solution for this in https://users.rust-lang.org/t/inheritance-like-with-rust-trait/111102

@alamb
Copy link
Contributor Author

alamb commented May 9, 2024

It is a good observation that ScalarUDFImpl and AggregateUDFImpl don't share a common base trait and thus adding functionality that affects both requires duplication of code

I would need to introduce a more general trait for both function
Another alternative is having a duplicate function for scalar and aggregate for a related function

I agree with your analysis of the tradeoffs: a common base trait would result in less duplication in DataFusion

However, I personally prefer duplicating coerce_arguments_for_signature in each trait rather than introducing a common base trait because:

  1. It is backwards compatible (not an API change for the existing library of functions)
  2. Makes it slightly easier to implement ScalarUDF and AggregateUDF (especially when new to rust) -- rather than two impls for your function, you only need one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants