
[Question] Implementing incremental models for a custom plugin #349

Open
ramonvermeulen opened this issue Feb 27, 2024 · 2 comments

ramonvermeulen commented Feb 27, 2024

This is more of a question than an issue: are there examples of custom dbt-duckdb plugins that implement incremental models (and therefore incremental sources)? In my situation I have a custom plugin that fetches pull requests from the GitHub API for a list of given GitHub handles. However, the GitHub API applies rate limiting, which makes ingestion via the plugin rather slow (e.g. waiting 3600 seconds for the rate limit to reset, a couple of times per dbt execution). Ideally, on an incremental run I only want to fetch pull requests and repositories from the GitHub API that I did not ingest in previous runs, i.e. those with a higher updated_on than the max() of what I already ingested.
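Conceptually, something along these lines, run against the .duckdb file before hitting the API (the file, table and column names here are just placeholders):

```python
import duckdb

# Placeholder names: "github.duckdb", "pull_requests", "updated_on", "author".
con = duckdb.connect("github.duckdb")
row = con.execute(
    "SELECT max(updated_on) FROM pull_requests WHERE author = ?",
    ["some-handle"],
).fetchone()
since = row[0]  # None when the table is empty -> fall back to a full load
# (On the very first run the table may not exist at all, so in practice this
# needs a try/except around it.)

# On incremental runs, only request pull requests updated after `since`,
# e.g. via the GitHub search qualifier f"updated:>{since}"; otherwise fetch
# everything.
```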

For instance, is it possible, inside the load function, to know whether the plugin is running for an incremental source, and whether it is an initial load (first run or --full-refresh provided) or an incremental load?

Would love to see an example 🙌

In my current implementation I inherit from: https://github.com/duckdb/dbt-duckdb/blob/master/dbt/adapters/duckdb/plugins/__init__.py

I guess somehow there should be a way to access the RuntimeConfigObject from dbt.

EDIT:
Diving a bit deeper, it seems I can access that via the TargetConfig. I will give this a try and close the issue if that covers the question.

EDIT2:
After testing, it seems that the store() method is only called for external sources.

@ramonvermeulen ramonvermeulen changed the title Implementing incremental models for a plugin [Question] Implementing incremental models for a plugin Feb 27, 2024
@ramonvermeulen ramonvermeulen changed the title [Question] Implementing incremental models for a plugin [Question] Implementing incremental models for a custom plugin Feb 27, 2024

jwills commented Feb 28, 2024

Ah you might be interested in the work @milicevica23 is doing over here: #332

ramonvermeulen commented Feb 29, 2024

> Ah you might be interested in the work @milicevica23 is doing over here: #332

Thanks for the mention, I'll take a look at it. It seems I have incremental more or less working with my current set-up as well, but I had to do some "hacky" stuff that I'd rather not do. Basically, I misuse the configure_connection method to fetch all records that already exist in my .duckdb file for my source tables, and load them into dataframes that I store on the Plugin instance (in-memory). I then use these dataframes to look up the max timestamp per user and the list of repositories that are already in my database. When calling the GitHub API (in my use case), I use this to only retrieve "newer" pull requests / repositories. In the end I concat the "old" and "new" dataframes into a single dataframe and return that as the source.

I really don't like the code behind it, for instance the check I need in the configure_connection method to determine whether it is the "first" execution, because that method seems to get called for every model. But it works 😅

See my impl:
https://github.com/godatadriven/github-contributions/blob/fe44bfb601c9c89c479ae69ca380b0f3059ab6e8/github_contributions/src/github_contributions/plugin.py#L151
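A condensed sketch of that workaround, assuming the BasePlugin hooks (initialize, configure_connection, load) from dbt-duckdb's plugins/__init__.py; the table/column names and the fetch_new_pull_requests helper below are made-up stand-ins, the real code is in the linked file:

```python
import pandas as pd

from dbt.adapters.duckdb.plugins import BasePlugin


def fetch_new_pull_requests(since):
    # Hypothetical stand-in for the GitHub API calls; only returns pull
    # requests with updated_on greater than `since` (everything if None).
    return pd.DataFrame()


class Plugin(BasePlugin):
    def initialize(self, plugin_config):
        self._existing = {}  # in-memory cache of already-ingested rows

    def configure_connection(self, conn):
        # configure_connection is invoked for every model, so only snapshot
        # the existing source data on the first call.
        if "pull_requests" in self._existing:
            return
        try:
            self._existing["pull_requests"] = conn.execute(
                "SELECT * FROM pull_requests"
            ).df()
        except Exception:
            # First run (or --full-refresh): the table does not exist yet.
            self._existing["pull_requests"] = pd.DataFrame()

    def load(self, source_config):
        old = self._existing.get("pull_requests", pd.DataFrame())
        since = old["updated_on"].max() if not old.empty else None
        new = fetch_new_pull_requests(since)
        # Return old + new rows so the resulting source table stays complete.
        return pd.concat([old, new], ignore_index=True)
```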
