
Is it possible to reference tables as parquet files in iceberg on glue via external source? #184

Open
dioptre opened this issue Jun 13, 2023 · 3 comments


dioptre commented Jun 13, 2023

I see a method to write to Glue, but no reference for using it to get the data back out.

Please help!

jwills (Collaborator) commented Jun 13, 2023

Mmm I've started working on iceberg as a source using an AWS catalog here via a plugin mechanism: https://github.com/jwills/dbt-duckdb/blob/master/dbt/adapters/duckdb/plugins/iceberg.py
and then used here like this:
https://github.com/jwills/dbt-duckdb/blob/master/tests/functional/plugins/test_iceberg.py

...but it's still pretty early/experimental. Can you tell me a bit more about what you're up to, and I'll see if I can figure out a way to help?

dioptre (Author) commented Jun 13, 2023

Yes, we have a DB schema we pipe to Glue that we want to do some testing/analysis on.

The DB updates every few hours with an Iceberg merge, so we need just the latest snapshot to pipe into DuckDB, and we then want to run dbt transformations on it.

jwills (Collaborator) commented Jun 13, 2023

Okay -- so in theory, this will work:

  1. pip3 install 'pyiceberg[glue,s3fs,pyarrow]'
  2. pip3 install git+https://github.com/jwills/dbt-duckdb.git (because I haven't cut a release with this functionality enabled yet)
  3. Set up the environment variables for pyiceberg as specified here: https://py.iceberg.apache.org/configuration/
  4. Configure the dbt profiles.yml with the iceberg plugin enabled, which would look something like this:

jaffle_shop:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: jaffle_shop.duckdb
      threads: 4
      plugins:
        - module: iceberg
          config:
            catalog: default

  5. Then finally define a sources: config entry in a dbt schema.yml file that references the iceberg plugin and specifies the catalog/name of the table to load in the iceberg_table config property, sort of like it's done here: https://github.com/jwills/dbt-duckdb/blob/master/tests/functional/plugins/test_iceberg.py#L8
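For step 5, a hedged sketch of what such a sources: entry might look like (the source and table names here are illustrative, and the exact meta keys should be checked against the linked test file, which is the authoritative shape):

```yaml
version: 2

sources:
  - name: iceberg_source          # illustrative name
    schema: main
    meta:
      plugin: iceberg             # matches the module name in profiles.yml
    tables:
      - name: my_table            # illustrative dbt-side table name
        meta:
          iceberg_table: "my_namespace.my_table"  # catalog identifier to load
```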

...and yeah, in theory, that should load the latest iceberg snapshot in its entirety as a pyarrow table into DuckDB for subsequent transformation by the rest of the pipeline. (But again, this is super-new, I haven't tried it for real, maybe take a look at dbt-athena and see if that meets your needs, etc., etc.)
