Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make CommonSubexprEliminate faster by avoiding the use of strings #10426

Open
Tracked by #5637
alamb opened this issue May 8, 2024 · 12 comments · May be fixed by #10473
Open
Tracked by #5637

Make CommonSubexprEliminate faster by avoiding the use of strings #10426

alamb opened this issue May 8, 2024 · 12 comments · May be fixed by #10473
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented May 8, 2024

Is your feature request related to a problem or challenge?

Part of #5637

One of the optimizer passes is "common subexpression elimination" that removes redundant computation

However, as @peter-toth noted on #10396 and the CSE code says

/// Identifier for each subexpression.
///
/// Note that the current implementation uses the `Display` of an expression
/// (a `String`) as `Identifier`.
///
/// An identifier should (ideally) be able to "hash", "accumulate", "equal" and "have no
/// collision (as low as possible)"
///
/// Since an identifier is likely to be copied many times, it is better that an identifier
/// is small or "copy". otherwise some kinds of reference count is needed. String description
/// here is not such a good choose.
type Identifier = String;

The way it tracks common subexpressions is with string manipulation is is non ideal for several reasons (including the cost of creating those strings)

Describe the solution you'd like

Revisit the identifiers as using these string identifiers as the keys of ExprStats was not the best choice. Please note this is how CSE has been working since the feature was added initially.

Describe alternatives you've considered

No response

Additional context

No response

@peter-toth
Copy link
Contributor

I'm happy to take this.

@erratic-pattern
Copy link
Contributor

Are there any potential issues with simply using the existing Hash implementation of Expr to create HashSets?

Serveral other optimization passes use string names as keys for expressions in data structures. I am wondering if any of these could also be refactored to simply use HashSet<Expr> or HashSet<&Expr>

synthetic group by expressions for aggregates:

fn add_group_by_exprs_from_dependencies(
mut group_expr: Vec<Expr>,
schema: &DFSchemaRef,
) -> Result<Vec<Expr>> {
// Names of the fields produced by the GROUP BY exprs for example, `GROUP BY
// c1 + 1` produces an output field named `"c1 + 1"`
let mut group_by_field_names = group_expr
.iter()
.map(|e| e.display_name())
.collect::<Result<Vec<_>>>()?;
if let Some(target_indices) =
get_target_functional_dependencies(schema, &group_by_field_names)
{
for idx in target_indices {
let expr = Expr::Column(Column::from(schema.qualified_field(idx)));
let expr_name = expr.display_name()?;
if !group_by_field_names.contains(&expr_name) {
group_by_field_names.push(expr_name);
group_expr.push(expr);
}
}
}
Ok(group_expr)
}

functional dependencies heavily uses display_name to represent group by exprs:
https://github.com/apache/datafusion/blob/main/datafusion/common/src/functional_dependencies.rs

decorrelate:

pub type ExprResultMap = HashMap<String, Expr>;

push down filter for aggregates:

// We can push down Predicate which in groupby_expr.
let group_expr_columns = agg
.group_expr
.iter()
.map(|e| Ok(Column::from_qualified_name(e.display_name()?)))
.collect::<Result<HashSet<_>>>()?;
let predicates = split_conjunction_owned(filter.predicate.clone());
let mut keep_predicates = vec![];
let mut push_predicates = vec![];
for expr in predicates {
let cols = expr.to_columns()?;
if cols.iter().all(|c| group_expr_columns.contains(c)) {
push_predicates.push(expr);
} else {
keep_predicates.push(expr);
}
}
// As for plan Filter: Column(a+b) > 0 -- Agg: groupby:[Column(a)+Column(b)]
// After push, we need to replace `a+b` with Column(a)+Column(b)
// So we need create a replace_map, add {`a+b` --> Expr(Column(a)+Column(b))}
let mut replace_map = HashMap::new();
for expr in &agg.group_expr {
replace_map.insert(expr.display_name()?, expr.clone());
}
let replaced_push_predicates = push_predicates
.iter()
.map(|expr| replace_cols_by_name(expr.clone(), &replace_map))
.collect::<Result<Vec<_>>>()?;
let child = match conjunction(replaced_push_predicates) {
Some(predicate) => LogicalPlan::Filter(Filter::try_new(
predicate,
agg.input.clone(),
)?),
None => (*agg.input).clone(),
};
let new_agg = filter
.input
.with_new_exprs(filter.input.expressions(), vec![child])?;
match conjunction(keep_predicates) {
Some(predicate) => LogicalPlan::Filter(Filter::try_new(
predicate,
Arc::new(new_agg),
)?),
None => new_agg,
}
}

single distinct to group by:

LogicalPlan::Aggregate(Aggregate { aggr_expr, .. }) => {
let mut fields_set = HashSet::new();
let mut aggregate_count = 0;
for expr in aggr_expr {
if let Expr::AggregateFunction(AggregateFunction {
func_def: AggregateFunctionDefinition::BuiltIn(fun),
distinct,
args,
filter,
order_by,
null_treatment: _,
}) = expr
{
if filter.is_some() || order_by.is_some() {
return Ok(false);
}
aggregate_count += 1;
if *distinct {
for e in args {
fields_set.insert(e.canonical_name());
}
} else if !matches!(fun, Sum | Min | Max) {
return Ok(false);
}
}
}
Ok(aggregate_count == aggr_expr.len() && fields_set.len() == 1)
}

&& group_fields_set.insert(args[0].display_name()?)

@peter-toth
Copy link
Contributor

peter-toth commented May 11, 2024

Are there any potential issues with simply using the existing Hash implementation of Expr to create HashSets?

Serveral other optimization passes use string names as keys for expressions in data structures. I am wondering if any of these could also be refactored to simply use HashSet<Expr> or HashSet<&Expr>

Thanks for these references @erratic-pattern.

Background and general thoughts:

I'm only familiar with CSE code and in its case unfortunately non-unique stringified expression were used as keys of the map that stores the occurrance counts. This bug was introduced in #9871 and reverted in #10396. The issue with these colliding string keys are explained here in details: #10333 (comment).

Some thougths about CSE:

After #10396 we still use stringified expressions as keys (Identifier), but the strings we use encode whole expression subtrees. This is far from optimal and this ticket / my work in progress change would like to help with that.

In case of CSE we could use Expr as keys of the ExprStats map, but then we would need to clone Exprs when we fill up the ExprStats map during the first traversal. This would be particulary costly in CSE because we need to store not only the counts for all top level expressions, but the counts of all their descendant subexpressions.

We could also use &Expr as keys (and so we didn't need to clone the expressions), but there is a problem here. The current TreeNode::apply() / TreeNode::visit() APIs aren't capable to fill up such a HashMap<&Expr, ...> map. This is because of restricted TreeNode reference lifetimes used in closures / TreeNodeVisitor methods.
I.e. this currently doesn't work:

let e = sum((col("a") * (lit(1) - col("b"))) * (lit(1) + col("c")));
let mut m = HashMap::new();
e.apply(|e| {
    *m.entry(e).or_insert(0) += 1;
    Ok(TreeNodeRecursion::Continue)
});

println!("m: {:#?}", m);

This issue can be solved by adding new TreeNode APIs or fixing the current ones.
I have a WIP commit here: peter-toth@e844799 that adds TreeNode::apply_ref() / TreeNode::visit_ref().
Using apply_ref() in the above example would make it work, but I haven't opened a PR yet as there are a few things to consider:
a. We don't really want to add any more new APIs (especially if their puspose is similar to existing ones).
b. We can't change the lifetimes of references in the current apply() / visit() easily. This is because some TreeNode implementations are not compatible with that. (E.g. DynTreeNode doesn't have a method to get references to its children, LogicalPlan creates temprorary objects in its map_subqueries(), ...).

Despite the fact that my WIP commit adds new APIs, I would prefer and lean towards option b.. But since I'm only aware of this ticket that requires this change to the APIs, I haven't opened the PR yet.

Now there is another thing to consider if we want use &Expr as keys of ExprStats. The current CSE algorithm, that was added by the original author of CSE in DataFusion (and not myself), is very clever and does the following:
In the first traversal it:

  • Creates a mapping for each top level expression (this is called IdArray) that stores the preorder visit index of a node to an Identifier (of a subexpression tree).
  • And also creates a map (this is called ExprStats) that contains the Identifier -> count stats gathered for all top level expressions and their subexpressions.

This is very nice, because the second, rewriting traversal can use the preorder visit index again to look up the identifier first and then the count from the ExprStats map. Providing that an identifier is small, this can be much faster then using &Expr as keys because:

  • Computing hash() of an &Expr (instead of using preorder index) in the second traversal is costly if the expression is deep and contains lots of indirections (Boxes).
  • When we generate the identifiers in the first traversal we can use the traversal's bottom-up phase to build up identifiers from the current node and the identifiers of the node's children very effectively.

In my work in progress change for this issue I would like to finalize the:

  • TreeNode API changes required (maybe open a separate PR for it)
  • and replce the current String based identifier to a (u64, &Expr) like tuple/struct.
    The first item contains a precomputed hash of the identifier. (As I mentioned, we can use the bottom-up phase of the first traversal to compute that effectively since this logic is already implmeneted in the CSE algorithm.) The overriden hash() of the struct should return this precomputed hash.
    The second item is a &Expr that can be used in the struct's eq() implementation in case of hash collision.

Back to the original question of using HashSet<Expr, ...> or HashSet<&Expr, ...>:

I think both are accepable but CSE is special as the maps need to store all the descendant subexpressions as well and the impemented CSE algorithm seems to offer a way to implement a better identifier than just a simple &Expr.
I don't know the other referenced usecases but if collision of string names can happen there then we should definitely fix it.

@erratic-pattern
Copy link
Contributor

erratic-pattern commented May 11, 2024

Thanks for the detailed write up @peter-toth . Though I did mention HashSet<Expr> specifically, my suggestion more generally goes along the lines of using the Hash implementation in some way to produce the identifiers. After looking at the code a bit more, I do see the cloning/lifetime issues with using Expr or &Expr as keys directly. I also did not consider the cost of re-computing hashes. I do think in that case it does make sense to pre-compute the hash instead.

I like the idea of generalizing the (u64, &Expr) struct into something reuseable across optimizations, as it seems to be a common pattern where we need to:

  • produce some unique identifier for an expression that can be stored in a data structure
  • use that identifier to generate aliases for newly generated expressions, or create a new Column/Field somewhere with that expression as a name. this can be done thanks to the &Expr in the struct which would allow us to call display_name
  • do so in a way that doesn't conflict with ownership/borrowing semantics. we might still run into borrowing issues because of the &Expr reference, but it's hard to say without trying to adapt this solution to other optimizers. Rc or Arc is a potential option as well. The struct could potentially be generic over Borrow to support any of these.
  • avoid recomputing the hash/key on every insert/lookup operation

Anyway, I don't want to over-abstract just yet, so for now just build something that works for CSE and then we can take it and see if it can be applied to any of the other optimizations.

I am curious if overriding hash() in this way will conflict with the Hash Eq property in some unforseen way. I think as long as we're constructing it such that the &Expr is always a reference to the Expr that produced the hash, it should be fine.

@peter-toth
Copy link
Contributor

peter-toth commented May 11, 2024

I like the idea of generalizing the (u64, &Expr) struct into something reuseable across optimizations.

Honestly, I don't know those referenced usecases, but I feel (u64, &Expr) (and any Identifier in general) makes sense only for CSE (2 traversals, we can build up a preorder visit cache of Identifiers in the first traversal and second traversal is top-down) and not sure the others have the same characteristics... If that's not the case then it doesn't make sense to use Identifiers instead of Expr/&Exprs.

Anyways, I will try to open the PR with it next week and then feel free to generalize the idea for other usecases if it makes sense.

@peter-toth peter-toth linked a pull request May 12, 2024 that will close this issue
@peter-toth
Copy link
Contributor

I've opened a draft PR: #10473 and will try to wrap it up in the following days.

@alamb
Copy link
Contributor Author

alamb commented May 14, 2024

I have a WIP commit here: peter-toth@e844799 that adds TreeNode::apply_ref() / TreeNode::visit_ref().
Using apply_ref() in the above example would make it work, but I haven't opened a PR yet as there are a few things to consider:
...
But since I'm only aware of this ticket that requires this change to the APIs, I haven't opened the PR yet.

Here is one example API that I would love to implement with such a tree-node api: #10505

I also ran into an example when trying to find embedded Subquerys in an Expr in

fn extract_subquery_exprs(
&self,
predicate: &Expr,
alias_gen: Arc<AliasGenerator>,
) -> Result<(Vec<(Subquery, String)>, Expr)> {
let mut extract = ExtractScalarSubQuery {
sub_query_info: vec![],
alias_gen,
};
predicate
.clone()
.rewrite(&mut extract)
.data()
.map(|new_expr| (extract.sub_query_info, new_expr))
}

@peter-toth
Copy link
Contributor

Here is one example API that I would love to implement with such a tree-node api: #10505

Thanks for sharing this @alamb. It's good to know that there are other possible usecases for this new API. #10473 seems to pass all tests now. I will extract the first commit of it into a separate PR today or tomorrow to add the new TreeNode API.

@crepererum
Copy link
Contributor

FWIW I've also seen the high cost of expression string formatting (using Display/to_string()) in a good number of profiles.

I think there's nothing wrong about having a "display" infrastructure, but it shouldn't be used eagerly. As others pointed out, using a hash or any form of numeric ID would probably be better in many places.

@alamb
Copy link
Contributor Author

alamb commented May 15, 2024

FWIW I've also seen the high cost of expression string formatting (using Display/to_string()) in a good number of profiles.

I think there's nothing wrong about having a "display" infrastructure, but it shouldn't be used eagerly. As others pointed out, using a hash or any form of numeric ID would probably be better in many places.

100% -- btw #10454 from @erratic-pattern made this code faster (fewer allocations) though it would be better still as you point out to not use display as much.

I will say from personal experience working on postgres / postgres derived systems (which does use a numeric id to identify columns), using strings is much easier to debug when problems occur. I do think we can reduce it significantly however

@erratic-pattern
Copy link
Contributor

I think there's nothing wrong about having a "display" infrastructure, but it shouldn't be used eagerly. As others pointed out, using a hash or any form of numeric ID would probably be better in many places.

@crepererum I am working on moving away from string allocations in a number of the optimization rules and switching to Hash based implementations.

Most of these use the Expr::display_name method which - maybe confusingly - doesn't actually use Display but instead uses an internal create_name function. It is similar to the Display implementation but has some differences. For instance, Cast expressions are ignored and column references are rendered with a different syntax.

I would be interested in seeing the profile data you mentioned, especially those that use Display and to_string as that might indicate that something else could be improved beyond the changes I am currently working on.

Also since we're no longer talking about strictly CommonSubexprEliminate at this point, it might be a good idea to track this as a separate issue and link this discussion.

@crepererum
Copy link
Contributor

I can see if I get organize you some profiles next week 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants