
Using use_credential_provider: aws with instance profiles gives HTTP error 400 #226

Open
sacundim opened this issue Aug 4, 2023 · 11 comments

sacundim commented Aug 4, 2023

  • dbt-core version: 1.6.0
  • dbt-duckdb version: 1.6.0
  • profiles.yml

When trying to use the aws target in the linked profile, either from an ECS container or from an EC2 instance that's known to have the correct permissions, we nevertheless get an HTTP 400 error:

05:37:56  Runtime Error in model biostatistics_deaths (models/biostatistics/staging/biostatistics_deaths.sql)
05:37:56    HTTP Error: HTTP GET error on '/?encoding-type=url&list-type=2&prefix=biostatistics.salud.pr.gov%2Fdeaths%2Fparquet_v2%2F' (HTTP 400)

But if, on the same EC2 instance, I instead configure it this way, with credentials obtained from aws sts get-session-token, it works:

    aws:
      type: duckdb
      extensions:
        - httpfs
        - parquet
      threads: 4
      external_root: "{{ env_var('OUTPUT_ROOT') }}"
      settings:
        s3_region: us-west-2
        s3_access_key_id: "{{ env_var('S3_ACCESS_KEY_ID') }}"
        s3_secret_access_key: "{{ env_var('S3_SECRET_ACCESS_KEY') }}"
        s3_session_token: "{{ env_var('S3_SESSION_TOKEN') }}"
jwills (Collaborator) commented Aug 4, 2023

huh, k-- these kinds of issues are very hard for me to debug since I don't have ready access to the environment in question; I think the best bet here is to open a python shell/run a simple script that calls the relevant function in dbt-duckdb (which is defined here) and see if we can deduce where the error is coming from, e.g.:

import dbt.adapters.duckdb.credentials as creds
creds._load_aws_credentials()

sacundim pushed a commit to sacundim/covid-19-puerto-rico that referenced this issue Aug 4, 2023
sacundim (Author) commented Aug 4, 2023

...working on it, I've put together a simple Docker image to try out your approach, gotta get it running in AWS Batch to do the real deal

sacundim (Author) commented Aug 4, 2023

Running on Batch prints out a dict with keys s3_access_key_id, s3_secret_access_key, s3_session_token, and s3_region. The values are sensitive, so obviously I can't share them. I did launch DuckDB 0.8.1 manually outside of AWS, issued the corresponding SET statements, and I can query from there, so the problem is somewhere in between. I'll try to extend my Python program to exercise more of the bits between those two working parts.

sacundim (Author) commented Aug 4, 2023

I tried the following inside a Fargate container:

import dbt.adapters.duckdb.credentials as creds
import duckdb

credentials = creds._load_aws_credentials()
print(f'credentials keys = {credentials.keys()}')
connection = duckdb.connect()
cursor = connection.cursor()

cursor.execute('INSTALL httpfs')
cursor.execute('LOAD httpfs')
for key, value in credentials.items():
    cursor.execute(f"SET {key} = '{value}'")

# ...then a query shaped like the failing one (path is illustrative):
cursor.execute("SELECT count(*) FROM read_parquet('s3://<bucket>/<prefix>/*.parquet')")

...and ran a query like the one my dbt project gets the error for, but it works fine. Maybe the adapter is doing something elsewhere that interferes with this? I looked at e.g. DuckDBConnectionWrapper but can't spot anything untoward.

jwills (Collaborator) commented Aug 4, 2023

Hrm-- maybe related to this? duckdb/duckdb#6563

sacundim (Author) commented Aug 4, 2023

My apologies: it turns out my reproduction attempts failed to reproduce one element of the original failure. The jobs with the errors run in an ECS cluster with EC2 nodes, but my earlier reproduction attempts ran on Fargate.

I see this perhaps crucial difference:

  1. Under Fargate, the _load_aws_credentials() call returns four keys: s3_access_key_id, s3_secret_access_key, s3_session_token, and s3_region.
  2. Under EC2, it returns only three keys: s3_region is missing!

And I can reproduce the HTTP 400 outside of AWS by not setting the s3_region.
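The difference can be sketched outside AWS. Here `settings_from_credentials` is a hypothetical helper mirroring the SET statements dbt-duckdb builds from the returned dict, and the credential values are illustrative, not real:

```python
# On Fargate the credentials dict includes s3_region; on EC2 it does not,
# so s3_region is never SET and regional S3 requests fail with HTTP 400.
def settings_from_credentials(credentials):
    """Build the SET statements dbt-duckdb would issue for each key."""
    return [f"SET {key} = '{value}'" for key, value in credentials.items()]

fargate_creds = {  # illustrative values, not real credentials
    "s3_access_key_id": "AKIAEXAMPLE",
    "s3_secret_access_key": "secret",
    "s3_session_token": "token",
    "s3_region": "us-west-2",
}
ec2_creds = {k: v for k, v in fargate_creds.items() if k != "s3_region"}

assert any(s.startswith("SET s3_region") for s in settings_from_credentials(fargate_creds))
assert not any(s.startswith("SET s3_region") for s in settings_from_credentials(ec2_creds))
```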

jwills (Collaborator) commented Aug 4, 2023

Ah, good to know-- and nice detective work!

jwills (Collaborator) commented Aug 4, 2023

Thinking I should add some logging in that _load_aws_credentials function to note which keys were set via the sts token call (tho obviously not the values) to help future folks track down these kinds of problems
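A minimal sketch of what that logging could look like (the function name and placement are assumptions, not the actual dbt-duckdb code):

```python
import logging

logger = logging.getLogger("dbt.adapters.duckdb.credentials")

def log_aws_credential_keys(credentials):
    """Log which settings the AWS credential lookup produced (keys only,
    never the values, since those are secrets)."""
    keys = sorted(credentials)
    logger.debug("AWS credential provider returned keys: %s", ", ".join(keys))
    return keys

# e.g. on an EC2 instance profile this would reveal that s3_region is absent:
assert log_aws_credential_keys({
    "s3_access_key_id": "x", "s3_secret_access_key": "y", "s3_session_token": "z",
}) == ["s3_access_key_id", "s3_secret_access_key", "s3_session_token"]
```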

jwills (Collaborator) commented Aug 4, 2023

...and also that it's possible that this extension may run into some of the same issues: https://github.com/duckdblabs/duckdb_aws

sacundim (Author) commented Aug 4, 2023

I've just confirmed a working workaround for the issue:

      use_credential_provider: aws
      settings:
        # In theory this shouldn't be necessary:
        s3_region: "{{ env_var('S3_REGION') }}"

sacundim (Author) commented Aug 5, 2023

> ...and also that it's possible that this extension may run into some of the same issues: https://github.com/duckdblabs/duckdb_aws

Actually, I think we have a bug in the httpfs extension here. Its requests to the S3 endpoint fail with inscrutable errors in a scenario where other tools, most notably boto3 and the official AWS CLI, work fine. I wonder, e.g., whether it's sending an empty string for the region when it's supposed to either send none or send a valid one.
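One defensive sketch on the adapter side, assuming that hypothesis: drop empty-string settings instead of passing them through, so a blank region is never sent. This is a hypothetical guard, not the actual httpfs fix:

```python
def safe_s3_settings(credentials):
    """Filter out empty-string values before issuing SET statements, so
    DuckDB's httpfs never signs requests with a blank s3_region
    (the blank-region failure mode is a hypothesis, not confirmed)."""
    return {k: v for k, v in credentials.items() if v != ""}

filtered = safe_s3_settings({"s3_region": "", "s3_access_key_id": "AKIAEXAMPLE"})
assert "s3_region" not in filtered
assert filtered["s3_access_key_id"] == "AKIAEXAMPLE"
```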
