Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can not read external file from s3 private bucket #371

Open
Aaron-Zhou opened this issue Mar 28, 2024 · 2 comments
Open

Can not read external file from s3 private bucket #371

Aaron-Zhou opened this issue Mar 28, 2024 · 2 comments

Comments

@Aaron-Zhou
Copy link

Aaron-Zhou commented Mar 28, 2024

  1. Create a private bucket from AWS S3, set environment variables.
  2. Run by dbt run -s ./models
  3. Got error Python model failed: HTTP Error: HTTP GET error on 'https://bucket.s3.amazonaws.com/ms/distribution/csv/orders.csv' (HTTP 403)
  4. My codes:

profiles.yml:

default:
  outputs:
    dev:
      type: duckdb
      path: db/dbt.duckdb
      extensions:
        - httpfs
        - parquet
        - aws
      settings:
        s3_region: "us-east-1"
        s3_access_key_id: "{{ env_var('S3_KEY') }}"
        s3_secret_access_key: "{{ env_var('S3_SECRET') }}"
  target: dev

sources.yml:

sources:
  - name: csv
    meta:
      external_location: "s3://bucket/ms/distribution/csv/{name}.csv"
    tables:
      - name: orders
      - name: customers  

models.py:

def model(dbt, session):
    orders = (
        dbt.source("csv", "orders").filter("status='PENDING_PAYMENT'").set_alias("o")
    )
    customers = dbt.source("csv", "customers").set_alias("c")
    pending_orders = orders.join(customers, "o.customer_id=c.id")
    return pending_orders
  1. Some findings: If I persist the s3 secret to duckdb home folder - "~/.duckdb/stored_secrets/" then the dbt run can work without error. It seems the dbt-duckdb did not actually feed s3 credentials to duckdb.
@jwills
Copy link
Collaborator

jwills commented Mar 28, 2024

It would be surprising if the problem was that we weren't feeding the s3 credentials to DuckDB given that so many people run with those settings (that is, I feel like I would have heard that complaint from lots of people if that was broken.)

The most common problem I see with these settings is that the region isn't configured correctly (i.e., the data lives in an S3 bucket defined in a region that is different from the one defined in s3_region.) Of course, if you have a way to make this work using the CREATE SECRET construct (which it sounds like you do), you can just run that in a macro that is configured using the on-run-start hook and be on your way.

@Aaron-Zhou
Copy link
Author

Aaron-Zhou commented Mar 29, 2024

@jwills Thanks for your reply, I double checked my s3_region and I am pretty sure the value is correct. I also did some testing by adding below config to write files to s3 bucket, it worked without using CREATE SECRET construct. So it proved that my s3 settings were correct. I also don't believe DuckDB has problem with reading files from s3, but could be something from dbt-duckdb adapter caused s3 settings not proper configured on DuckDB so caused reading from s3 bucket not working.
Below is the config I used so write table to external files on S3 bucket. Again, writing worked without the CREATE SECRET construct, but reading was not working.

Model.py


def model(dbt, session):
    dbt.config(
        materialized="external",
        location="s3://bucket/parquet/duckdb-sample/pending_orders.parquet",
        format="parquet",
        options={"partition_by": "customer_id", "overwrite_or_ignore": True},
        glue_register=True,
        glue_database="duckdb_sample",
    )
    orders = (
        dbt.source("csv", "orders").filter("status='PENDING_PAYMENT'").set_alias("o")
    )
    customers = dbt.source("csv", "customers").set_alias("c")
    pending_orders = orders.join(customers, "o.customer_id=c.id")
    return pending_orders

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants