Assertion failure in Join when ancestors in other domains #146

Open
JustusAdam opened this issue Oct 29, 2019 · 7 comments
Comments

@JustusAdam
Collaborator

Setup

I am trying to run a query that computes an average. The graph and the operators are generated from a different language by a compiler, but in SQL it would look something like this:

SELECT sum(x) / count(*)
FROM Tab

Error

The query itself runs fine, but I wanted to test how the performance would be if count(*) and sum(x) were computed on different domains. So I hacked into the domain assignment to force these operators onto their own domains.

When I do that, however, the join after the two calculations tries to access a nonexistent index in its right ancestor. I expanded the error message (see below), which says that the right ancestor with id 4 was short: in the generate_row function, it tries to access index 2 in the other slice, which only has two elements.

This is the error message for the two-domain case. In the four-domain case it's the same, but the id is different (because more ingress/egress operators are generated).

'right (4) was short', noria-server/dataflow/src/ops/join.rs:181:21
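To make the failure mode concrete, here is a minimal sketch of the kind of indexing that produces it. This is not the code in join.rs; the `Source` enum, the `i64` column type, and this `generate_row` signature are assumptions made up for illustration. The join builds each output row by picking columns out of a left and a right record, and a right record with only two elements cannot serve a request for index 2.

```rust
/// Where one output column of the join comes from.
/// Hypothetical type for illustration; not Noria's actual representation.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Source {
    Left(usize),
    Right(usize),
}

/// Builds one output row by picking columns from the left and right records.
/// Returns an error naming the short side instead of panicking.
fn generate_row(left: &[i64], right: &[i64], emit: &[Source]) -> Result<Vec<i64>, String> {
    emit.iter()
        .map(|src| {
            // Resolve which record and which column this output column reads.
            let (side, record, idx) = match *src {
                Source::Left(i) => ("left", left, i),
                Source::Right(i) => ("right", right, i),
            };
            // `get` returns None when the record has fewer than idx + 1 columns.
            record.get(idx).copied().ok_or_else(|| {
                format!(
                    "{} was short: index {} requested, length {}",
                    side,
                    idx,
                    record.len()
                )
            })
        })
        .collect()
}
```

Calling this with a two-element right record and an emit list containing `Source::Right(2)` returns the error rather than a row, which mirrors the situation described above.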

Questions

Is there something I am missing about domains? Can I not just make any operator into its own domain? Are there any invariants around what can and can't go on a domain?

Runtime graphs

Here are the dot graphs for two domains and four domains and, for good measure, the original (working) single domain.

The relevant operators here are ohua.generated/op_s_acc_0_0 (count(*)) and ohua.generated/op_s_acc_1_0 (sum(x)) and the join afterwards. (The rest is just generated code that does some column renaming)

How to reproduce

I uploaded a branch (join-after-domain-error-reproduction) to my fork that should contain the complete state of the system necessary (including generated operators) to reproduce the error.

In the udf-benchmarks directory run cargo run --bin features avg-split-domain/two-domainsf.toml

This will run the two-domain scenario. For the one- or four-domain scenario, use the one-domain.toml or four-domains.toml config, respectively.

@jonhoo
Contributor

jonhoo commented Oct 29, 2019

I'm confused.. There's no join in the query you gave? The query looks like it'd hit the same issue as #137, no?

@jonhoo
Contributor

jonhoo commented Oct 29, 2019

As to your question about domain assignment, you can move most operators into arbitrary domains, as long as you do so before you call migrate (because it adds a bunch of necessary internal operators at domain boundaries). Joins are "special" in that they always require that their inputs are materialized within the same domain as themselves, so moving them may not achieve the effect that you want.

@JustusAdam
Collaborator Author

You are right. So basically I give it a different description of this query and it generates one similar to #137 but without the extra views for the two different aggregations.

Interesting. What do you think, would it work if I inserted just an Identity in between the join and its ancestors?

@jonhoo
Contributor

jonhoo commented Oct 31, 2019

I think you'll have to specifically write the query such that the aggregations are done separately and then join them together, as in #137. I'm not sure what purpose the Identity would serve?

@ms705
Member

ms705 commented Oct 31, 2019

The dot graphs for your multi-domain assignments look correct, and I would expect them to work. The error you get seems to indicate that you receive a record of incorrect length; are you sure that the Ohua-generated operators always produce the right output records?

@jonhoo My understanding (from looking at the graphs) is that @JustusAdam wrote the join-based version of the query (as per #137), and that he wants the aggregations to be in different domains for parallel processing. The join input materialization will use extra space (and some compute), but that's fine for his purpose.

@JustusAdam There's no need for an identity node, and it won't change anything -- the join merely forces the automatically-generated "ingress" node to be (partially) materialized, as indicated by the 3/4 symbol in the top right corner. If you added an identity node, that would get materialized instead.

@JustusAdam
Collaborator Author

Ah, good to know.

I am fairly confident that it produces the right output, because the one-domain version works just fine. But I will run a trace over it anyway to figure out if it produces bogus output at any point.
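A minimal sketch of the kind of check such a trace could apply, assuming the records emitted by an operator can be collected as plain vectors. The `short_records` helper and the `i64` column type are hypothetical and not part of Noria or the generated operators.

```rust
/// Returns (position, actual length) for every record that has fewer
/// columns than `expected_len`. Hypothetical helper for illustration.
fn short_records(records: &[Vec<i64>], expected_len: usize) -> Vec<(usize, usize)> {
    records
        .iter()
        .enumerate()
        // Keep only records that are too short for the downstream join.
        .filter(|(_, r)| r.len() < expected_len)
        .map(|(i, r)| (i, r.len()))
        .collect()
}
```

Running it over the output of each aggregation with the column count the join expects would show whether a short record is ever produced, and by which operator.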

@JustusAdam
Collaborator Author

Also I am sorry for oversimplifying the query. Yes @ms705 is correct, I am generating the join-based query from #137.
