adapter,tests: make test_session_linearizability resistant to slow propagation of source uppers #27080

Merged

aljoscha
Contributor

I discovered this with my PR that adds more concurrency to the controllers, where upper updates can sometimes be delayed. But it's already a potential problem today, because upper updates from sources can also be delayed.

The first commit renames and cleans up a few things. The second commit is the actual fix.

Comment on lines +1869 to +1875
    let mut source_ts;
    loop {
        source_ts = test_util::get_explain_timestamp(pg_table_name, &mz_client).await;
        if source_ts > 0 {
            break;
        }
    }
Contributor

Feel free to ignore this, but the following might allow us to avoid the break.

    let mut source_ts = 0;
    while source_ts == 0 {
        source_ts = test_util::get_explain_timestamp(pg_table_name, &mz_client).await;
    }
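
If an unbounded wait is a concern, a deadline could be added on top of either loop. The following is only a sketch under assumptions and not what this PR merged: `poll_until_positive` is a hypothetical helper, the `read_ts` closure stands in for the `test_util::get_explain_timestamp(...)` call, and it is written synchronously with the standard library only so that it is self-contained.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Polls `read_ts` until it returns a timestamp greater than zero,
/// or gives up once `timeout` has elapsed.
///
/// `read_ts` is a hypothetical stand-in for the test's
/// `test_util::get_explain_timestamp(...)` call.
fn poll_until_positive<F>(mut read_ts: F, timeout: Duration, interval: Duration) -> Option<u64>
where
    F: FnMut() -> u64,
{
    let deadline = Instant::now() + timeout;
    loop {
        let ts = read_ts();
        if ts > 0 {
            // The upper has moved past 0 from the reader's point of view.
            return Some(ts);
        }
        if Instant::now() >= deadline {
            // Give up instead of spinning forever.
            return None;
        }
        // Back off briefly so the loop does not hammer the server.
        sleep(interval);
    }
}
```

A caller would treat `None` as a test failure with a clear message, rather than letting the test hang until the CI timeout.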

Comment on lines +1869 to +1875
    let mut source_ts;
    loop {
        source_ts = test_util::get_explain_timestamp(pg_table_name, &mz_client).await;
        if source_ts > 0 {
            break;
        }
    }
Contributor

It's still not really clear to me why the query in test_util::wait_for_pg_table_population would choose a timestamp at a value larger than the source upper, causing it to return before the upper passes 0.

Contributor Author

@aljoscha aljoscha May 14, 2024

While the upper is still [0] (as far as the coord knows), the coord also holds a since at [0]. The wait_for_pg_table_population query will therefore use an as_of of 0, which is the lowest one it can pick and for which we happen to still have a read hold. The data is already there in the source shard, because the real upper in the source shard is a bit further ahead, so the query succeeds.
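
To make the reasoning concrete, here is a toy model of that situation. It is a sketch under assumptions, not the actual timestamp selection code: the types `CoordinatorView` and `SourceShard` and the functions `pick_as_of`, `can_serve`, and `upper_has_propagated` are all hypothetical names, and timestamps are plain `u64` values.

```rust
/// What the coordinator currently believes about the source (toy model).
struct CoordinatorView {
    /// Lowest readable timestamp, kept open by a read hold.
    since: u64,
    /// Upper as last reported to the coordinator; this may lag behind the shard.
    known_upper: u64,
}

/// The state of the underlying source shard (toy model).
struct SourceShard {
    /// All timestamps strictly below this value are complete in the shard.
    real_upper: u64,
}

/// Assumption: the population query reads at the lowest timestamp it may pick.
fn pick_as_of(view: &CoordinatorView) -> u64 {
    view.since
}

/// Assumption: a read at `as_of` can be answered once the shard's real upper is past it.
fn can_serve(shard: &SourceShard, as_of: u64) -> bool {
    as_of < shard.real_upper
}

/// True once the coordinator has seen the upper advance past 0.
fn upper_has_propagated(view: &CoordinatorView) -> bool {
    view.known_upper > 0
}
```

With `since = 0`, `known_upper = 0`, and `real_upper = 1`, `pick_as_of` returns 0 and `can_serve` is true, yet `upper_has_propagated` is still false. In this model that is the window in which the population query has already returned but the coordinator has not yet seen the upper move, which is why the test additionally waits for a timestamp greater than 0.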

Contributor

We spoke about this in person. The key detail is that the first batch of data in the source gets written at timestamp 0, and the next batch gets written at the current system time. This doesn't seem correct, but it explains the observed behavior. We've had multiple discussions about writing source data in the past, so I won't rehash that here.

@aljoscha aljoscha merged commit 8a44d26 into MaterializeInc:main May 15, 2024
73 checks passed
@aljoscha aljoscha deleted the adapter-harden-session-lin-test branch May 15, 2024 10:16