-
Notifications
You must be signed in to change notification settings - Fork 454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adapter,tests: make test_session_linearizability resistant to slow propagation of source uppers #27080
adapter,tests: make test_session_linearizability resistant to slow propagation of source uppers #27080
Conversation
let mut source_ts; | ||
loop { | ||
source_ts = test_util::get_explain_timestamp(pg_table_name, &mz_client).await; | ||
if source_ts > 0 { | ||
break; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to ignore this, but the following might allow us to avoid the break
.
let mut source_ts = 0;
while source_ts == 0 {
source_ts = test_util::get_explain_timestamp(pg_table_name, &mz_client).await;
}
let mut source_ts; | ||
loop { | ||
source_ts = test_util::get_explain_timestamp(pg_table_name, &mz_client).await; | ||
if source_ts > 0 { | ||
break; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's still not really clear to me why the query in test_util::wait_for_pg_table_population
would choose a timestamp at a value larger than the source upper, causing it to return before the upper passes 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While the upper is still [0]
(as far as the coord knows), it also holds a since at [0]
. And the wait_for_pg_table_population
query will use an as_of of 0
, which is the lowest one it can pick and we happen to still have a read hold for it. And the data happens to be there in the source shard and the query succeeds, because the real upper in the source shard is already a bit further ahead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spoke about this in person, the key detail is that the first batch of data in the source will get written at timestamp 0. The next batch will get written at the current system time. This seems not correct, but explains the observed behavior. We've had multiple discussions about writing source data in the past, so I won't rehash that here.
I discovered this with my PR that adds more concurrency to the controllers, where upper updates can sometimes be delayed. But it's a potential problem already today, where upper updates from sources can also be delayed.
The first commit is renaming and clearing up some things. The second commit is the actual fix.
Checklist
$T ⇔ Proto$T
mapping (possibly in a backwards-incompatible way), then it is tagged with aT-proto
label.