Upgrade to DataFusion 13 (784f10bb) / Arrow 25.0.0 #176

Merged 12 commits on Oct 27, 2022

Commits on Oct 26, 2022

  1. Upgrade DataFusion to 13.0.0, Arrow to 25.0.0

    The actual 13.0.0 DF release uses Arrow 24.0.0, but we need to pick up 25.0.0,
    since it brings back the Arrow Schema/Field-to-JSON serialization code (albeit
    in a different crate for integration tests).
    
    apache/arrow-rs#2868
    apache/arrow-rs#2724
    mildbyte committed Oct 26, 2022
    0cb2ed9
  2. Remove hashbrown

    hashbrown is now the implementation behind the standard library's HashMap,
    and DF's planner uses the std type as well, so we can use
    std::collections::HashMap everywhere.
    mildbyte committed Oct 26, 2022
    b2d78ef
  3. 94dc4d4
  4. efb6187
  5. Fix some expected output change tests

    Arrow file hash changes and minor changes in the query plan output
    mildbyte committed Oct 26, 2022
    2635b05
  6. 38c0568
  7. Include UPDATE/DELETE in the query optimizer

    Make the `Update`/`Delete` nodes expose `inputs` and `expressions` in order to
    let the DF query optimizer work on the `WHERE ...` / `SET col = expr`
    expressions. This is slightly hacky:
    
      - as an "input", we return a `TableScan` node that we don't use after that
        (this is just so that the optimizer knows the input schema for all the
        expressions)
      - return the expressions used by the node and add code to pack/unpack them
        into a list
    
    The point of this is to let DataFusion run the `TypeCoercion` optimization,
    without which something like `WHERE float_col > 42` will raise an error (as
    of DF 13, these type coercions were removed from other places and moved
    into optimizer passes).
    
    (NB: this doesn't work yet; we still get type coercion errors.)
    mildbyte committed Oct 26, 2022
    7a61a7c
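
    The pack/unpack step described in this commit can be sketched roughly as
    follows. This is a toy model, not the code from the PR: `Expr`, `Update`,
    `expressions` and `with_expressions` are hypothetical stand-ins for the
    DataFusion types and node methods, and the layout (predicate first, then
    one expression per `SET` assignment) is an assumption made for the example.

    ```rust
    // Toy stand-ins for DataFusion's expression and plan node types.
    // All names here are hypothetical and exist only for illustration.
    #[derive(Debug, Clone, PartialEq)]
    enum Expr {
        Column(String),
        Int(i64),
        Gt(Box<Expr>, Box<Expr>),
    }

    #[derive(Debug, Clone, PartialEq)]
    struct Update {
        /// Optional `WHERE` predicate.
        selection: Option<Expr>,
        /// `SET column = expr` pairs.
        assignments: Vec<(String, Expr)>,
    }

    impl Update {
        /// Pack: flatten every expression into one list, predicate first (if any).
        fn expressions(&self) -> Vec<Expr> {
            self.selection
                .iter()
                .cloned()
                .chain(self.assignments.iter().map(|(_, e)| e.clone()))
                .collect()
        }

        /// Unpack: rebuild the node from a rewritten list that uses the same layout.
        fn with_expressions(&self, exprs: Vec<Expr>) -> Update {
            let mut exprs = exprs.into_iter();
            // The first expression is the predicate only if the node had one.
            let selection = if self.selection.is_some() {
                exprs.next()
            } else {
                None
            };
            // The remaining expressions pair up with the original column names.
            let assignments = self
                .assignments
                .iter()
                .map(|(col, _)| col.clone())
                .zip(exprs)
                .collect();
            Update { selection, assignments }
        }
    }
    ```

    An optimizer would call `expressions()`, rewrite each entry, and hand the
    list back through `with_expressions()`; the node only has to remember
    whether the first entry is the predicate.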

Commits on Oct 27, 2022

  1. Run the query optimizer for UPDATE/DELETE

    (normally it's run only by DataFusion's `create_physical_plan`, but we don't run
    that, so we have to execute it manually to get auto type coercion working)
    mildbyte committed Oct 27, 2022
    0c6ece6
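
    To illustrate what automatic type coercion buys here, the sketch below
    rewrites `float_col > 42` so that both sides of the comparison have the
    same type. It is a self-contained toy, not DataFusion's `TypeCoercion`
    rule: `DataType`, `Expr`, `data_type` and `coerce` are hypothetical, and
    casting the integer side to `Float64` is an assumption about the outcome.

    ```rust
    // Hypothetical, heavily simplified type and expression model.
    #[derive(Debug, Clone, Copy, PartialEq)]
    enum DataType {
        Int64,
        Float64,
        Boolean,
    }

    #[derive(Debug, Clone, PartialEq)]
    enum Expr {
        Column { name: String, ty: DataType },
        Int(i64),
        Cast(Box<Expr>, DataType),
        Gt(Box<Expr>, Box<Expr>),
    }

    /// Static type of an expression in this toy model.
    fn data_type(expr: &Expr) -> DataType {
        match expr {
            Expr::Column { ty, .. } => *ty,
            Expr::Int(_) => DataType::Int64,
            Expr::Cast(_, to) => *to,
            Expr::Gt(..) => DataType::Boolean,
        }
    }

    /// Insert a cast wherever a comparison mixes Int64 and Float64 operands.
    fn coerce(expr: Expr) -> Expr {
        match expr {
            Expr::Gt(l, r) => {
                let (l, r) = (coerce(*l), coerce(*r));
                match (data_type(&l), data_type(&r)) {
                    (DataType::Float64, DataType::Int64) => Expr::Gt(
                        Box::new(l),
                        Box::new(Expr::Cast(Box::new(r), DataType::Float64)),
                    ),
                    (DataType::Int64, DataType::Float64) => Expr::Gt(
                        Box::new(Expr::Cast(Box::new(l), DataType::Float64)),
                        Box::new(r),
                    ),
                    // Types already agree (or are not handled by this toy).
                    _ => Expr::Gt(Box::new(l), Box::new(r)),
                }
            }
            other => other,
        }
    }
    ```

    Without a pass like this, the mixed-type comparison reaches execution
    unchanged, which matches the type coercion errors mentioned in the
    previous commit.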
  2. 2a27fc2
  3. Add more verbose plan output to Update/Delete

    Include `SET` expressions and the predicate if it exists to aid debugging.
    mildbyte committed Oct 27, 2022
    b4cfc90
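
    A more verbose plan line of this kind could be produced along these
    lines. The struct, its fields and the exact output format are invented
    for illustration (expressions are plain strings here); the PR's actual
    output may look different.

    ```rust
    use std::fmt;

    // Hypothetical node: expressions are kept as pre-rendered strings.
    struct Update {
        table: String,
        assignments: Vec<(String, String)>,
        selection: Option<String>,
    }

    impl fmt::Display for Update {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            // Render every `SET` assignment as `column = expr`.
            let set = self
                .assignments
                .iter()
                .map(|(col, expr)| format!("{} = {}", col, expr))
                .collect::<Vec<_>>()
                .join(", ");
            write!(f, "Update: {}, SET: {}", self.table, set)?;
            // Only mention the predicate when the statement has one.
            if let Some(predicate) = &self.selection {
                write!(f, ", WHERE: {}", predicate)?;
            }
            Ok(())
        }
    }
    ```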
  4. Remove aliases from optimized Update/Deletes

    These expressions are similar to what DataFusion uses in the `Filter` node,
    and leaving the aliases in seems to break partition pruning (perhaps it
    stops at the `Alias` node and doesn't prune anything; I didn't investigate
    in depth).
    
    Copy the `ExprRewriter` visitor from
    https://github.com/apache/arrow-datafusion/blob/c50573939d21de40e591c04915d41f7c46a51d0d/datafusion/expr/src/utils.rs#L384-L428
    and adapt it to remove aliases from all expressions that the query optimizer
    gives back to `Update`/`Delete` nodes.
    mildbyte committed Oct 27, 2022
    32eabcc
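
    The idea of stripping aliases can be sketched with a plain recursive
    function. This is a toy version under stated assumptions: the PR adapts
    DataFusion's `ExprRewriter` visitor, whereas `Expr` and `strip_aliases`
    below are hypothetical and cover only a handful of expression kinds.

    ```rust
    // Hypothetical, simplified expression tree with an `Alias` wrapper.
    #[derive(Debug, Clone, PartialEq)]
    enum Expr {
        Column(String),
        Int(i64),
        Alias(Box<Expr>, String),
        Gt(Box<Expr>, Box<Expr>),
    }

    /// Remove every `Alias` node, at any depth, keeping the aliased expression.
    fn strip_aliases(expr: Expr) -> Expr {
        match expr {
            // Drop the alias and keep descending, in case aliases are nested.
            Expr::Alias(inner, _) => strip_aliases(*inner),
            // Recurse into both operands of a comparison.
            Expr::Gt(l, r) => Expr::Gt(
                Box::new(strip_aliases(*l)),
                Box::new(strip_aliases(*r)),
            ),
            // Leaves have nothing to strip.
            leaf => leaf,
        }
    }
    ```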
  5. Assert the query plan in update/delete tests

    Make sure the constants are correctly cast and let us detect changes to the
    optimizer faster with new DF updates.
    mildbyte committed Oct 27, 2022
    7877070