
Is it possible to reference tables as parquet files in iceberg on glue via external source? #184

Open
dioptre opened this issue Jun 13, 2023 · 3 comments


dioptre commented Jun 13, 2023

I see a method to write to Glue, but no reference for using it to get the data back out.

Please help!

jwills (Collaborator) commented Jun 13, 2023

Mmm I've started working on iceberg as a source using an AWS catalog here via a plugin mechanism: https://github.com/jwills/dbt-duckdb/blob/master/dbt/adapters/duckdb/plugins/iceberg.py
and then used here like this:
https://github.com/jwills/dbt-duckdb/blob/master/tests/functional/plugins/test_iceberg.py

...but it's still pretty early/experimental. Can you tell me a bit more about what you're up to, and I'll see if I can figure out a way to help?

dioptre (Author) commented Jun 13, 2023

Yes, we have a DB schema we pipe to Glue that we want to do some testing/analysis on.

The DB updates every few hours with an Iceberg merge, so we need just the latest snapshot to pipe into DuckDB, and we then want to run dbt transformations on it.

jwills (Collaborator) commented Jun 13, 2023

Okay -- so in theory, this will work:

  1. pip3 install 'pyiceberg[glue,s3fs,pyarrow]'
  2. pip3 install git+https://github.com/jwills/dbt-duckdb.git (because I haven't cut a release with this functionality enabled yet)
  3. Set up the environment variables for pyiceberg as specified here: https://py.iceberg.apache.org/configuration/
  4. Configure the dbt profiles.yml with the iceberg plugin enabled, which would look something like this:

jaffle_shop:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: jaffle_shop.duckdb
      threads: 4
      plugins:
        - module: iceberg
          config:
            catalog: default

  5. Then finally define a sources: config entry in a dbt schema.yml file that references the iceberg plugin and specifies the catalog/name of the table to load in the iceberg_table config property, sort of like it's done here: https://github.com/jwills/dbt-duckdb/blob/master/tests/functional/plugins/test_iceberg.py#L8
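For step 5, a hedged sketch of what such a sources: entry might look like (the source and table names here are illustrative, and the exact meta keys should be checked against the linked test file, which is the authoritative shape):

```yaml
version: 2

sources:
  - name: iceberg_source          # illustrative name
    schema: main
    meta:
      plugin: iceberg             # matches the module name in profiles.yml
    tables:
      - name: my_table            # illustrative dbt-side table name
        meta:
          iceberg_table: "my_namespace.my_table"  # catalog identifier to load
```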

...and yeah, in theory, that should load the latest iceberg snapshot in its entirety as a pyarrow table into DuckDB for subsequent transformation by the rest of the pipeline. (But again, this is super-new, I haven't tried it for real, maybe take a look at dbt-athena and see if that meets your needs, etc., etc.)
