Make `CommonSubexprEliminate` faster by avoiding the use of strings #10426

alamb · 2024-05-08T17:20:28Z

Is your feature request related to a problem or challenge?

Part of #5637

One of the optimizer passes is "common subexpression elimination" (CSE), which removes redundant computation.

However, as @peter-toth noted on #10396 and as the CSE code itself says:

datafusion/datafusion/optimizer/src/common_subexpr_eliminate.rs

Lines 108 to 119 in d58bae4

    
    /// Identifier for each subexpression.
    ///
    /// Note that the current implementation uses the `Display` of an expression
    /// (a `String`) as `Identifier`.
    ///
    /// An identifier should (ideally) be able to "hash", "accumulate", "equal" and "have no
    /// collision (as low as possible)"
    ///
    /// Since an identifier is likely to be copied many times, it is better that an identifier
    /// is small or "copy". otherwise some kinds of reference count is needed. String description
    /// here is not such a good choose.
    type Identifier = String;

It tracks common subexpressions with string manipulation, which is not ideal for several reasons (including the cost of creating those strings).

Describe the solution you'd like

Revisit the identifiers: using string identifiers as the keys of ExprStats was not the best choice. Please note that this is how CSE has worked since the feature was first added.

Describe alternatives you've considered

No response

Additional context

No response


peter-toth · 2024-05-08T17:26:25Z

I'm happy to take this.

erratic-pattern · 2024-05-10T22:55:59Z

Are there any potential issues with simply using the existing Hash implementation of Expr to create HashSets?

Several other optimization passes use string names as keys for expressions in data structures. I am wondering if any of these could also be refactored to simply use HashSet<Expr> or HashSet<&Expr>

synthetic group by expressions for aggregates:

datafusion/datafusion/expr/src/logical_plan/builder.rs

Lines 1246 to 1270 in accce97

    
    fn add_group_by_exprs_from_dependencies(
        mut group_expr: Vec<Expr>,
        schema: &DFSchemaRef,
    ) -> Result<Vec<Expr>> {
        // Names of the fields produced by the GROUP BY exprs for example, `GROUP BY
        // c1 + 1` produces an output field named `"c1 + 1"`
        let mut group_by_field_names = group_expr
            .iter()
            .map(|e| e.display_name())
            .collect::<Result<Vec<_>>>()?;
        if let Some(target_indices) =
            get_target_functional_dependencies(schema, &group_by_field_names)
        {
            for idx in target_indices {
                let expr = Expr::Column(Column::from(schema.qualified_field(idx)));
                let expr_name = expr.display_name()?;
                if !group_by_field_names.contains(&expr_name) {
                    group_by_field_names.push(expr_name);
                    group_expr.push(expr);
                }
            }
        }
        Ok(group_expr)
    }

functional dependencies rely heavily on display_name to represent group by exprs:
https://github.com/apache/datafusion/blob/main/datafusion/common/src/functional_dependencies.rs

decorrelate:

datafusion/datafusion/optimizer/src/decorrelate.rs

Line 65 in accce97

pub type ExprResultMap = HashMap<String, Expr>;

push down filter for aggregates:

datafusion/datafusion/optimizer/src/push_down_filter.rs

Lines 788 to 837 in accce97

    
        // We can push down Predicate which in groupby_expr.
        let group_expr_columns = agg
            .group_expr
            .iter()
            .map(|e| Ok(Column::from_qualified_name(e.display_name()?)))
            .collect::<Result<HashSet<_>>>()?;
        let predicates = split_conjunction_owned(filter.predicate.clone());
        let mut keep_predicates = vec![];
        let mut push_predicates = vec![];
        for expr in predicates {
            let cols = expr.to_columns()?;
            if cols.iter().all(|c| group_expr_columns.contains(c)) {
                push_predicates.push(expr);
            } else {
                keep_predicates.push(expr);
            }
        }
        // As for plan Filter: Column(a+b) > 0 -- Agg: groupby:[Column(a)+Column(b)]
        // After push, we need to replace `a+b` with Column(a)+Column(b)
        // So we need create a replace_map, add {`a+b` --> Expr(Column(a)+Column(b))}
        let mut replace_map = HashMap::new();
        for expr in &agg.group_expr {
            replace_map.insert(expr.display_name()?, expr.clone());
        }
        let replaced_push_predicates = push_predicates
            .iter()
            .map(|expr| replace_cols_by_name(expr.clone(), &replace_map))
            .collect::<Result<Vec<_>>>()?;
        let child = match conjunction(replaced_push_predicates) {
            Some(predicate) => LogicalPlan::Filter(Filter::try_new(
                predicate,
                agg.input.clone(),
            )?),
            None => (*agg.input).clone(),
        };
        let new_agg = filter
            .input
            .with_new_exprs(filter.input.expressions(), vec![child])?;
        match conjunction(keep_predicates) {
            Some(predicate) => LogicalPlan::Filter(Filter::try_new(
                predicate,
                Arc::new(new_agg),
            )?),
            None => new_agg,
        }
    }

single distinct to group by:

datafusion/datafusion/optimizer/src/single_distinct_to_groupby.rs

Lines 69 to 96 in accce97

    
    LogicalPlan::Aggregate(Aggregate { aggr_expr, .. }) => {
        let mut fields_set = HashSet::new();
        let mut aggregate_count = 0;
        for expr in aggr_expr {
            if let Expr::AggregateFunction(AggregateFunction {
                func_def: AggregateFunctionDefinition::BuiltIn(fun),
                distinct,
                args,
                filter,
                order_by,
                null_treatment: _,
            }) = expr
            {
                if filter.is_some() || order_by.is_some() {
                    return Ok(false);
                }
                aggregate_count += 1;
                if *distinct {
                    for e in args {
                        fields_set.insert(e.canonical_name());
                    }
                } else if !matches!(fun, Sum | Min | Max) {
                    return Ok(false);
                }
            }
        }
        Ok(aggregate_count == aggr_expr.len() && fields_set.len() == 1)
    }

datafusion/datafusion/optimizer/src/single_distinct_to_groupby.rs

Line 185 in accce97

&& group_fields_set.insert(args[0].display_name()?)
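To make the HashSet<&Expr> idea from the top of this comment concrete, here is a minimal sketch on a toy expression type. The `Expr` enum and `add_missing_group_exprs` are hypothetical stand-ins (not DataFusion APIs); the sketch only assumes that the expression type derives `Hash` and `Eq`, and it deduplicates by structure instead of by formatted name.

```rust
use std::collections::HashSet;

// Toy stand-in for an expression type; hypothetical, for illustration only.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Expr {
    Column(String),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
}

/// Appends each candidate to `group_expr` unless a structurally equal
/// expression is already present. No expression is formatted to a string.
fn add_missing_group_exprs(mut group_expr: Vec<Expr>, candidates: Vec<Expr>) -> Vec<Expr> {
    let mut to_add = Vec::new();
    {
        // Keys are references, so the existing expressions are not cloned.
        let mut seen: HashSet<&Expr> = group_expr.iter().collect();
        for c in &candidates {
            // `insert` returns true only the first time a value is seen.
            if seen.insert(c) {
                to_add.push(c.clone());
            }
        }
    }
    group_expr.extend(to_add);
    group_expr
}
```

Note that every `insert` re-hashes the whole expression tree, so this trades string allocation for repeated hashing; whether that is a win would have to be measured.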

peter-toth · 2024-05-11T10:16:11Z

> Are there any potential issues with simply using the existing Hash implementation of Expr to create HashSets?
>
> Several other optimization passes use string names as keys for expressions in data structures. I am wondering if any of these could also be refactored to simply use HashSet<Expr> or HashSet<&Expr>

Thanks for these references @erratic-pattern.

Background and general thoughts:

I'm only familiar with the CSE code, and in its case non-unique stringified expressions were unfortunately used as keys of the map that stores the occurrence counts. This bug was introduced in #9871 and reverted in #10396. The issue with these colliding string keys is explained in detail here: #10333 (comment).

Some thoughts about CSE:

After #10396 we still use stringified expressions as keys (Identifier), but the strings we use encode whole expression subtrees. This is far from optimal and this ticket / my work in progress change would like to help with that.

In the case of CSE we could use Expr as keys of the ExprStats map, but then we would need to clone Exprs when we fill up the ExprStats map during the first traversal. This would be particularly costly in CSE because we need to store not only the counts for all top level expressions, but also the counts of all their descendant subexpressions.

We could also use &Expr as keys (so we wouldn't need to clone the expressions), but there is a problem here. The current TreeNode::apply() / TreeNode::visit() APIs aren't capable of filling up such a HashMap<&Expr, ...> map. This is because of the restricted TreeNode reference lifetimes used in closures / TreeNodeVisitor methods.
I.e. this currently doesn't work:

let e = sum((col("a") * (lit(1) - col("b"))) * (lit(1) + col("c")));
let mut m = HashMap::new();
e.apply(|e| {
    *m.entry(e).or_insert(0) += 1;
    Ok(TreeNodeRecursion::Continue)
});

println!("m: {:#?}", m);

This issue can be solved by adding new TreeNode APIs or fixing the current ones.
I have a WIP commit here: peter-toth@e844799 that adds TreeNode::apply_ref() / TreeNode::visit_ref().
Using apply_ref() in the above example would make it work, but I haven't opened a PR yet as there are a few things to consider:
a. We don't really want to add any more new APIs (especially if their purpose is similar to existing ones).
b. We can't easily change the lifetimes of references in the current apply() / visit(). This is because some TreeNode implementations are not compatible with that. (E.g. DynTreeNode doesn't have a method to get references to its children, LogicalPlan creates temporary objects in its map_subqueries(), ...).

Despite the fact that my WIP commit adds new APIs, I would prefer and lean towards option b. But since this ticket is the only one I'm aware of that requires this change to the APIs, I haven't opened the PR yet.
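The lifetime restriction described above can be sketched on a toy tree. Everything here is hypothetical and only mimics the shape of the problem: `apply` hands the closure a reference with an anonymous per-call lifetime, while the `apply_ref`-style variant ties every visited reference to the lifetime `'n` of the root, which is what a `HashMap<&Expr, _>` needs. This is not the code of the WIP commit.

```rust
use std::collections::HashMap;

// Toy expression tree; hypothetical, not DataFusion's `Expr`.
#[derive(Debug, PartialEq, Eq, Hash)]
enum Expr {
    Column(&'static str),
    Literal(i64),
    Binary(Box<Expr>, char, Box<Expr>),
}

impl Expr {
    /// The closure must accept a reference of *any* lifetime, so it may not
    /// store that reference in a collection that outlives the call.
    fn apply<F: FnMut(&Expr)>(&self, f: &mut F) {
        f(self);
        if let Expr::Binary(l, _, r) = self {
            l.apply(f);
            r.apply(f);
        }
    }

    /// Hypothetical variant: each visited reference lives as long as the root.
    fn apply_ref<'n, F: FnMut(&'n Expr)>(&'n self, f: &mut F) {
        f(self);
        if let Expr::Binary(l, _, r) = self {
            l.apply_ref(f);
            r.apply_ref(f);
        }
    }
}

/// Fine with `apply`: the closure only counts and keeps no references.
fn node_count(root: &Expr) -> usize {
    let mut n = 0;
    root.apply(&mut |_: &Expr| n += 1);
    n
}

/// Needs `apply_ref`: the visited references are stored as map keys.
/// Swapping in `apply` here is rejected by the borrow checker.
fn count_subexprs<'a>(root: &'a Expr) -> HashMap<&'a Expr, usize> {
    let mut counts: HashMap<&'a Expr, usize> = HashMap::new();
    root.apply_ref(&mut |e: &'a Expr| *counts.entry(e).or_insert(0) += 1);
    counts
}
```

For `(a * (1 - b)) * (1 + c)` the tree has nine nodes but only eight distinct subexpressions, because the literal `1` occurs twice.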

Now there is another thing to consider if we want to use &Expr as keys of ExprStats. The current CSE algorithm, which was added by the original author of CSE in DataFusion (and not myself), is very clever and does the following:
In the first traversal it:

1. Creates a mapping for each top level expression (this is called IdArray) that stores the preorder visit index of a node to an Identifier (of a subexpression tree).
2. And also creates a map (this is called ExprStats) that contains the Identifier -> count stats gathered for all top level expressions and their subexpressions.

This is very nice, because the second, rewriting traversal can use the preorder visit index again to look up the identifier first and then the count from the ExprStats map. Provided that an identifier is small, this can be much faster than using &Expr as keys because:

1. Computing hash() of an &Expr (instead of using preorder index) in the second traversal is costly if the expression is deep and contains lots of indirections (Boxes).
2. When we generate the identifiers in the first traversal we can use the traversal's bottom-up phase to build up identifiers from the current node and the identifiers of the node's children very effectively.
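The two traversals described above can be sketched roughly as follows. This is a simplified toy, not the DataFusion implementation: the `Expr` enum is hypothetical, the identifier is a bare `u64` hash (so collisions are ignored here), and storing the subtree size next to each identifier is an assumption made only so the second pass can skip a matched subtree.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Toy expression tree; hypothetical.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(&'static str),
    Literal(i64),
    Binary(Box<Expr>, char, Box<Expr>),
}

type Identifier = u64;
/// Preorder visit index -> (identifier of that subtree, node count of that subtree).
type IdArray = Vec<(Identifier, usize)>;
/// Identifier -> number of occurrences.
type ExprStats = HashMap<Identifier, usize>;

/// First traversal: reserves the preorder slot on the way down and builds the
/// identifier from the node plus its children's identifiers on the way up.
fn collect(expr: &Expr, ids: &mut IdArray, stats: &mut ExprStats) -> Identifier {
    let idx = ids.len();
    ids.push((0, 0)); // placeholder, filled in below
    let mut h = DefaultHasher::new();
    match expr {
        Expr::Column(name) => (0u8, name).hash(&mut h),
        Expr::Literal(v) => (1u8, v).hash(&mut h),
        Expr::Binary(l, op, r) => {
            let l_id = collect(l, ids, stats);
            let r_id = collect(r, ids, stats);
            (2u8, op, l_id, r_id).hash(&mut h);
        }
    }
    let id = h.finish();
    ids[idx] = (id, ids.len() - idx);
    *stats.entry(id).or_insert(0) += 1;
    id
}

/// Second traversal: top-down, driven by the preorder index. It never hashes
/// an `Expr`; it only indexes `ids` and looks up a `u64` in `stats`.
fn find_common<'a>(
    expr: &'a Expr,
    ids: &IdArray,
    stats: &ExprStats,
    idx: &mut usize,
    out: &mut Vec<&'a Expr>,
) {
    let (id, size) = ids[*idx];
    let is_leaf = !matches!(expr, Expr::Binary(..));
    if !is_leaf && stats[&id] > 1 {
        out.push(expr);
        *idx += size; // skip the whole matched subtree
        return;
    }
    *idx += 1;
    if let Expr::Binary(l, _, r) = expr {
        find_common(l, ids, stats, idx, out);
        find_common(r, ids, stats, idx, out);
    }
}

/// Runs both passes and returns the repeated non-leaf subexpressions.
fn common_subexprs(expr: &Expr) -> Vec<&Expr> {
    let (mut ids, mut stats) = (IdArray::new(), ExprStats::new());
    collect(expr, &mut ids, &mut stats);
    let mut out = Vec::new();
    find_common(expr, &ids, &stats, &mut 0, &mut out);
    out
}
```

For `(a + b) * (a + b)` the first pass fills seven preorder slots, and the second pass reports both `a + b` occurrences without descending into them.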

In my work in progress change for this issue I would like to finalize the:

1. TreeNode API changes required (maybe open a separate PR for it)
2. and replace the current String based identifier with a (u64, &Expr) like tuple/struct.
   - The first item contains a precomputed hash of the identifier. (As I mentioned, we can use the bottom-up phase of the first traversal to compute that effectively since this logic is already implemented in the CSE algorithm.) The overridden hash() of the struct should return this precomputed hash.
   - The second item is a &Expr that can be used in the struct's eq() implementation in case of hash collision.
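A minimal sketch of such a (u64, &Expr) identifier, assuming a toy `Expr` type. The names `Identifier`, `Identifier::new` and `identify` are hypothetical and are not taken from the draft PR; the sketch only illustrates the idea of a precomputed hash with a structural `eq()` as a collision fallback.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Toy expression tree; hypothetical.
#[derive(Debug, PartialEq, Eq, Hash)]
enum Expr {
    Column(&'static str),
    Literal(i64),
    Binary(Box<Expr>, char, Box<Expr>),
}

/// Small, `Copy` identifier: a precomputed hash plus a borrowed expression.
#[derive(Debug, Clone, Copy)]
struct Identifier<'n> {
    hash: u64,
    expr: &'n Expr,
}

impl<'n> Identifier<'n> {
    /// Combines the node's own data with the already computed hashes of its
    /// children, so no subtree is hashed more than once.
    fn new(expr: &'n Expr, children: &[Identifier<'n>]) -> Self {
        let mut h = DefaultHasher::new();
        std::mem::discriminant(expr).hash(&mut h);
        match expr {
            Expr::Column(name) => name.hash(&mut h),
            Expr::Literal(v) => v.hash(&mut h),
            Expr::Binary(_, op, _) => op.hash(&mut h),
        }
        for c in children {
            h.write_u64(c.hash);
        }
        Identifier { hash: h.finish(), expr }
    }
}

impl Hash for Identifier<'_> {
    /// Feeds only the precomputed value; the expression is not walked again.
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write_u64(self.hash);
    }
}

impl PartialEq for Identifier<'_> {
    /// Cheap hash comparison first, structural comparison on a hash match.
    fn eq(&self, other: &Self) -> bool {
        self.hash == other.hash && self.expr == other.expr
    }
}
impl Eq for Identifier<'_> {}

/// Bottom-up pass that builds identifiers and counts occurrences.
fn identify<'n>(expr: &'n Expr, stats: &mut HashMap<Identifier<'n>, usize>) -> Identifier<'n> {
    let children: Vec<Identifier<'n>> = match expr {
        Expr::Binary(l, _, r) => vec![identify(l, stats), identify(r, stats)],
        _ => vec![],
    };
    let id = Identifier::new(expr, &children);
    *stats.entry(id).or_insert(0) += 1;
    id
}
```

Because `eq()` implies equal `hash` fields, and structurally equal expressions produce the same precomputed hash, the usual requirement that equal keys hash equally holds as long as every `Identifier` is built through `Identifier::new`.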

Back to the original question of using `HashSet<Expr, ...>` or `HashSet<&Expr, ...>`:

I think both are acceptable, but CSE is special as the maps need to store all the descendant subexpressions as well, and the implemented CSE algorithm seems to offer a way to implement a better identifier than just a simple &Expr.
I don't know the other referenced usecases but if collision of string names can happen there then we should definitely fix it.

erratic-pattern · 2024-05-11T12:53:44Z

Thanks for the detailed write up @peter-toth . Though I did mention HashSet<Expr> specifically, my suggestion more generally goes along the lines of using the Hash implementation in some way to produce the identifiers. After looking at the code a bit more, I do see the cloning/lifetime issues with using Expr or &Expr as keys directly. I also did not consider the cost of re-computing hashes. I do think in that case it does make sense to pre-compute the hash instead.

I like the idea of generalizing the (u64, &Expr) struct into something reusable across optimizations, as it seems to be a common pattern where we need to:

- produce some unique identifier for an expression that can be stored in a data structure
- use that identifier to generate aliases for newly generated expressions, or create a new Column/Field somewhere with that expression as a name. This can be done thanks to the &Expr in the struct, which would allow us to call display_name
- do so in a way that doesn't conflict with ownership/borrowing semantics. We might still run into borrowing issues because of the &Expr reference, but it's hard to say without trying to adapt this solution to other optimizers. Rc or Arc is a potential option as well. The struct could potentially be generic over Borrow to support any of these.
- avoid recomputing the hash/key on every insert/lookup operation

Anyway, I don't want to over-abstract just yet, so for now just build something that works for CSE and then we can take it and see if it can be applied to any of the other optimizations.

I am curious if overriding hash() in this way will conflict with the Hash / Eq property in some unforeseen way. I think as long as we're constructing it such that the &Expr is always a reference to the Expr that produced the hash, it should be fine.

peter-toth · 2024-05-11T13:29:08Z

> I like the idea of generalizing the (u64, &Expr) struct into something reusable across optimizations.

Honestly, I don't know those referenced usecases, but I feel (u64, &Expr) (and any Identifier in general) makes sense only for CSE (2 traversals, we can build up a preorder visit cache of Identifiers in the first traversal and second traversal is top-down) and not sure the others have the same characteristics... If that's not the case then it doesn't make sense to use Identifiers instead of Expr/&Exprs.

Anyways, I will try to open the PR with it next week and then feel free to generalize the idea for other usecases if it makes sense.

peter-toth · 2024-05-12T18:17:02Z

I've opened a draft PR: #10473 and will try to wrap it up in the following days.

alamb · 2024-05-14T13:16:45Z

> I have a WIP commit here: peter-toth@e844799 that adds TreeNode::apply_ref() / TreeNode::visit_ref().
> Using apply_ref() in the above example would make it work, but I haven't opened a PR yet as there are a few things to consider:
> ...
> But since I'm only aware of this ticket that requires this change to the APIs, I haven't opened the PR yet.

Here is one example API that I would love to implement with such a tree-node api: #10505

I also ran into an example when trying to find embedded Subquerys in an Expr in

datafusion/datafusion/optimizer/src/scalar_subquery_to_join.rs

Lines 54 to 68 in 424757f

    
    fn extract_subquery_exprs(
        &self,
        predicate: &Expr,
        alias_gen: Arc<AliasGenerator>,
    ) -> Result<(Vec<(Subquery, String)>, Expr)> {
        let mut extract = ExtractScalarSubQuery {
            sub_query_info: vec![],
            alias_gen,
        };
        predicate
            .clone()
            .rewrite(&mut extract)
            .data()
            .map(|new_expr| (extract.sub_query_info, new_expr))
    }

peter-toth · 2024-05-14T13:44:43Z

> Here is one example API that I would love to implement with such a tree-node api: #10505

Thanks for sharing this @alamb. It's good to know that there are other possible usecases for this new API. #10473 seems to pass all tests now. I will extract the first commit of it into a separate PR today or tomorrow to add the new TreeNode API.

crepererum · 2024-05-15T09:27:57Z

FWIW I've also seen the high cost of expression string formatting (using Display/to_string()) in a good number of profiles.

I think there's nothing wrong about having a "display" infrastructure, but it shouldn't be used eagerly. As others pointed out, using a hash or any form of numeric ID would probably be better in many places.

alamb · 2024-05-15T17:44:44Z

> FWIW I've also seen the high cost of expression string formatting (using Display/to_string()) in a good number of profiles.
>
> I think there's nothing wrong about having a "display" infrastructure, but it shouldn't be used eagerly. As others pointed out, using a hash or any form of numeric ID would probably be better in many places.

100% -- btw #10454 from @erratic-pattern made this code faster (fewer allocations) though it would be better still as you point out to not use display as much.

I will say from personal experience working on postgres / postgres derived systems (which does use a numeric id to identify columns), using strings is much easier to debug when problems occur. I do think we can reduce it significantly however

erratic-pattern · 2024-05-16T20:43:44Z

> I think there's nothing wrong about having a "display" infrastructure, but it shouldn't be used eagerly. As others pointed out, using a hash or any form of numeric ID would probably be better in many places.

@crepererum I am working on moving away from string allocations in a number of the optimization rules and switching to Hash based implementations.

Most of these use the Expr::display_name method which - maybe confusingly - doesn't actually use Display but instead uses an internal create_name function. It is similar to the Display implementation but has some differences. For instance, Cast expressions are ignored and column references are rendered with a different syntax.
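To illustrate why this difference matters when switching from name-based keys to Hash-based keys, here is a toy sketch. The `Expr` enum and `toy_display_name` are hypothetical and only mimic the "casts are ignored" behaviour mentioned above; they are not DataFusion's `create_name`.

```rust
use std::collections::HashSet;

// Toy expression type; hypothetical.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Expr {
    Column(String),
    Cast(Box<Expr>, &'static str),
}

/// Toy name function that skips casts, so `a` and `CAST(a AS Int64)`
/// produce the same name.
fn toy_display_name(e: &Expr) -> String {
    match e {
        Expr::Column(c) => c.clone(),
        Expr::Cast(inner, _) => toy_display_name(inner),
    }
}

/// Number of distinct expressions when keyed by the formatted name.
fn distinct_by_name(exprs: &[Expr]) -> usize {
    exprs.iter().map(toy_display_name).collect::<HashSet<_>>().len()
}

/// Number of distinct expressions when keyed by the derived `Hash` / `Eq`.
fn distinct_by_hash(exprs: &[Expr]) -> usize {
    exprs.iter().collect::<HashSet<&Expr>>().len()
}
```

For the pair `a` and `CAST(a AS Int64)`, name-based keys see one entry while Hash-based keys see two, so a refactoring from one to the other can change behaviour and needs to be checked rule by rule.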

I would be interested in seeing the profile data you mentioned, especially those that use Display and to_string as that might indicate that something else could be improved beyond the changes I am currently working on.

Also since we're no longer talking about strictly CommonSubexprEliminate at this point, it might be a good idea to track this as a separate issue and link this discussion.

crepererum · 2024-05-17T09:07:26Z

I can see if I can organize some profiles for you next week 🙂

alamb added the enhancement New feature or request label May 8, 2024

This was referenced May 8, 2024

[Epic] A collection of issues to improve planning performance / speed / efficiency #5637

Open

Fix and improve CommonSubexprEliminate rule #10396

Merged

alamb assigned peter-toth May 8, 2024

peter-toth mentioned this issue May 9, 2024

make common expression alias human-readable #10333

Merged

erratic-pattern mentioned this issue May 11, 2024

Support "Tracing" / Spans #9415

Open

peter-toth mentioned this issue May 11, 2024

Stop copying Exprs and LogicalPlans so much during Common Subexpression Elimination #9873

Open

peter-toth linked a pull request May 12, 2024 that will close this issue

Better CSE identifier #10473

Draft

peter-toth mentioned this issue May 16, 2024

Add reference visitor TreeNode APIs #10543

Open
