
[Question] Implementing incremental models for a custom plugin #349

Open
ramonvermeulen opened this issue Feb 27, 2024 · 2 comments

ramonvermeulen commented Feb 27, 2024

This is more of a question than an issue: are there examples of custom dbt-duckdb plugins that implement incremental models (and therefore incremental sources)? In my situation I have a custom plugin that fetches pull requests from the GitHub API for a list of given GitHub handles. However, the GitHub API applies rate limiting, which makes ingestion via the plugin rather slow (e.g. waiting 3600 seconds for the rate limit to reset, a couple of times per dbt execution). Ideally, on an incremental run I only want to fetch pull requests and repositories from the GitHub API that I did not ingest in previous runs, i.e. those with a higher updated_on than the max() of what I already ingested.
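Conceptually, something along these lines, run against the .duckdb file before hitting the API (the file, table and column names here are just placeholders):

```python
import duckdb

# Placeholder names: "github.duckdb", "pull_requests", "updated_on", "author".
con = duckdb.connect("github.duckdb")
row = con.execute(
    "SELECT max(updated_on) FROM pull_requests WHERE author = ?",
    ["some-handle"],
).fetchone()
since = row[0]  # None when the table is empty -> fall back to a full load
# (On the very first run the table may not exist at all, so in practice this
# needs a try/except around it.)

# On incremental runs, only request pull requests updated after `since`,
# e.g. via the GitHub search qualifier f"updated:>{since}"; otherwise fetch
# everything.
```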

For instance, is it possible, inside the load function, to know whether the plugin is running for an incremental source, and whether it is an initial load (first run or --full-refresh provided) or an incremental load?

Would love to see an example 🙌

In my current implementation I inherit from: https://github.com/duckdb/dbt-duckdb/blob/master/dbt/adapters/duckdb/plugins/__init__.py

I guess somehow there should be a way to access the RuntimeConfigObject from dbt.

EDIT:
Diving a bit deeper, it seems I can access that via the TargetConfig. I will give this a try and close the issue if that covers the question.

EDIT2:
After testing, it seems that the store() method is only called for external sources.

@ramonvermeulen ramonvermeulen changed the title Implementing incremental models for a plugin [Question] Implementing incremental models for a plugin Feb 27, 2024
@ramonvermeulen ramonvermeulen changed the title [Question] Implementing incremental models for a plugin [Question] Implementing incremental models for a custom plugin Feb 27, 2024

jwills commented Feb 28, 2024

Ah you might be interested in the work @milicevica23 is doing over here: #332

ramonvermeulen commented Feb 29, 2024

> Ah you might be interested in the work @milicevica23 is doing over here: #332

Thanks for the mention, I'll take a look at it. It seems I have incremental more or less working with my current set-up as well, but I had to do some "hacky" stuff that I'd rather not do. Basically, I misuse the configure_connection method to fetch all records that already exist in my .duckdb file for my source tables, and load them into dataframes that I store on the Plugin instance (in-memory). I then use these dataframes to look up the max timestamp per user and the list of repositories that are already in my database. When calling the GitHub API (in my use case), I use this to only retrieve "newer" pull requests / repositories. In the end I concat the "old" and "new" dataframes into a single dataframe and return that as the source.

I really don't like the code behind it, for instance the check I need in the configure_connection method to determine whether it is the "first" execution, because that method seems to get called for every model. But it works 😅

See my impl:
https://github.com/godatadriven/github-contributions/blob/fe44bfb601c9c89c479ae69ca380b0f3059ab6e8/github_contributions/src/github_contributions/plugin.py#L151
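A condensed sketch of that workaround, assuming the BasePlugin hooks (initialize, configure_connection, load) from dbt-duckdb's plugins/__init__.py; the table/column names and the fetch_new_pull_requests helper below are made-up stand-ins, the real code is in the linked file:

```python
import pandas as pd

from dbt.adapters.duckdb.plugins import BasePlugin


def fetch_new_pull_requests(since):
    # Hypothetical stand-in for the GitHub API calls; only returns pull
    # requests with updated_on greater than `since` (everything if None).
    return pd.DataFrame()


class Plugin(BasePlugin):
    def initialize(self, plugin_config):
        self._existing = {}  # in-memory cache of already-ingested rows

    def configure_connection(self, conn):
        # configure_connection is invoked for every model, so only snapshot
        # the existing source data on the first call.
        if "pull_requests" in self._existing:
            return
        try:
            self._existing["pull_requests"] = conn.execute(
                "SELECT * FROM pull_requests"
            ).df()
        except Exception:
            # First run (or --full-refresh): the table does not exist yet.
            self._existing["pull_requests"] = pd.DataFrame()

    def load(self, source_config):
        old = self._existing.get("pull_requests", pd.DataFrame())
        since = old["updated_on"].max() if not old.empty else None
        new = fetch_new_pull_requests(since)
        # Return old + new rows so the resulting source table stays complete.
        return pd.concat([old, new], ignore_index=True)
```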
