Add secondary watermark to robustly handle the case where xmin spans many rows #75

Open
drdee opened this issue Dec 24, 2019 · 0 comments

drdee commented Dec 24, 2019

Currently, the xmin system column is used as the watermark column for the initial load of a table. When the table is very large and ingesting the data takes more than 6 hours, one of the following problems occurs:

  1. The job gets stuck on an xmin value because it can never process all rows with the same xmin within a 6-hour window. Depending on the number of columns in the table, this happens at around 50M rows with the same xmin.
  2. The job does get past an xmin with many rows, but it typically takes two attempts. The first attempt starts near the end of the runtime window and fails. The second attempt succeeds because that xmin is the first one processed in the new window, so more runtime is available. However, the first attempt will already have loaded rows into the destination table, leaving duplicate data that has to be cleaned up manually.

Adding support for a secondary watermark, either a timestamp column or an ID field, will prevent both problems, because progress can then be checkpointed partway through a large group of rows that share the same xmin.
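
To make the idea concrete, here is a minimal in-memory sketch of a composite `(xmin, id)` watermark. It is not the project's implementation: the row layout, the `next_batch` and `ingest` helpers, and the `max_batches` budget (standing in for the runtime window) are all hypothetical. It also assumes the secondary column is unique within an xmin, and that writing a batch and saving the watermark happen atomically.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, int, str]    # (xmin, id, payload); hypothetical row layout
Watermark = Tuple[int, int]   # (xmin, id) of the last row written


def next_batch(rows: List[Row], watermark: Optional[Watermark],
               batch_size: int) -> List[Row]:
    """Return up to batch_size rows strictly after the watermark.

    Rows are ordered by (xmin, id), so the scan can resume in the
    middle of a group of rows that share the same xmin.
    """
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    if watermark is not None:
        ordered = [r for r in ordered if (r[0], r[1]) > watermark]
    return ordered[:batch_size]


def ingest(rows: List[Row], destination: List[Row],
           watermark: Optional[Watermark], batch_size: int,
           max_batches: int) -> Optional[Watermark]:
    """Copy batches until the budget runs out; return the new watermark.

    max_batches is a stand-in for the runtime window (assumption).
    """
    for _ in range(max_batches):
        batch = next_batch(rows, watermark, batch_size)
        if not batch:
            break
        destination.extend(batch)
        # Advance the watermark only after the batch has been written.
        watermark = (batch[-1][0], batch[-1][1])
    return watermark


# Five rows share xmin=100; the first run is cut off after one batch.
source = [(100, i, "row") for i in range(1, 6)] + [(101, 6, "row"), (101, 7, "row")]
dest: List[Row] = []
wm = ingest(source, dest, None, batch_size=3, max_batches=1)
wm = ingest(source, dest, wm, batch_size=3, max_batches=10)
```

With xmin alone as the watermark, the second run would have to restart at the beginning of xmin=100 and reload rows that were already written. With the composite watermark it resumes after `(100, 3)`, so each source row lands in `dest` once.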
