GCS as S3 Failed to list bucket when use_memory_table=true #235

Closed
jychen7 opened this issue Dec 27, 2022 · 3 comments · Fixed by #236

@jychen7
Collaborator

jychen7 commented Dec 27, 2022

Steps to reproduce

# demo_gcs.yml
addr:
  # binding address for TCP port that speaks HTTP protocol
  http: 0.0.0.0:8084
  # binding address for TCP port that speaks Postgres wire protocol
  postgres: 0.0.0.0:5432
tables:
  - name: "nyc"
    uri: "s3://{gcs_bucket_name}/yellow_tripdata_2022-10.parquet"
    option:
      format: "parquet"
      use_memory_table: true

As of 2022-12, check out main at commit c0bff95

export AWS_ACCESS_KEY_ID=******
export AWS_SECRET_ACCESS_KEY=******
export AWS_ENDPOINT_URL="https://storage.googleapis.com"
export AWS_REGION=us-east1

cargo build
RUST_LOG=debug ./target/debug/roapi -c demo_gcs.yml

Expected

[2022-12-27T02:38:08Z INFO  roapi::context] registered `uri(s3://{gcs_bucket_name}/yellow_tripdata_2022-10.parquet)` as table `nyc`
[2022-12-27T02:38:08Z INFO  roapi::startup] 🚀 Listening on 0.0.0.0:5432 for Postgres traffic...
[2022-12-27T02:38:08Z INFO  roapi::startup] 🚀 Listening on 0.0.0.0:8084 for HTTP traffic...

Actual

Error: Error loading data from S3 store: Failed to list bucket: Error obtaining body: Error obtaining chunk: error reading a body from connection: stream error received: unspecific protocol error detected

full debug log

[2022-12-27T02:54:41Z DEBUG datafusion::execution::memory_manager] Creating memory manager with initial size 11744051.2 TB
[2022-12-27T02:54:41Z INFO  roapi::context] loading `uri(s3://{gcs_bucket_name}/yellow_tripdata_2022-10.parquet)` as table `nyc`
[2022-12-27T02:54:41Z DEBUG columnq::io::s3] using custom S3 endpoint https://storage.googleapis.com with region us-east1
[2022-12-27T02:54:41Z DEBUG rusoto_core::request] Full request:
     method: GET
     final_uri: https://storage.googleapis.com/{gcs_bucket_name}/yellow_tripdata_2022-10.parquet
    Headers:

[2022-12-27T02:54:41Z DEBUG rusoto_core::request] authorization:"AWS4-HMAC-SHA256 Credential=****** SignedHeaders=content-type;host;x-amz-content-sha256;x-amz-date, Signature=******"
[2022-12-27T02:54:41Z DEBUG rusoto_core::request] content-length:"0"
[2022-12-27T02:54:41Z DEBUG rusoto_core::request] content-type:"application/octet-stream"
[2022-12-27T02:54:41Z DEBUG rusoto_core::request] host:"storage.googleapis.com"
[2022-12-27T02:54:41Z DEBUG rusoto_core::request] x-amz-content-sha256:"******"
[2022-12-27T02:54:41Z DEBUG rusoto_core::request] x-amz-date:"20221227T025441Z"
[2022-12-27T02:54:41Z DEBUG rusoto_core::request] user-agent:"rusoto/0.47.0 rust/1.66.0 macos"
[2022-12-27T02:54:41Z DEBUG hyper::client::connect::dns] resolving host="storage.googleapis.com"
[2022-12-27T02:54:41Z DEBUG hyper::client::connect::http] connecting to 142.251.33.176:443
[2022-12-27T02:54:41Z DEBUG hyper::client::connect::http] connected to 142.251.33.176:443
[2022-12-27T02:54:41Z DEBUG rustls::client::hs] No cached session for DNSNameRef("storage.googleapis.com")
[2022-12-27T02:54:41Z DEBUG rustls::client::hs] Not resuming any session
[2022-12-27T02:54:41Z DEBUG rustls::client::hs] Using ciphersuite TLS13_CHACHA20_POLY1305_SHA256
[2022-12-27T02:54:41Z DEBUG rustls::client::tls13] Not resuming
[2022-12-27T02:54:41Z DEBUG rustls::client::tls13] TLS1.3 encrypted extensions: [Protocols([PayloadU8([104, 50])])]
[2022-12-27T02:54:41Z DEBUG rustls::client::hs] ALPN protocol is Some(b"h2")
[2022-12-27T02:54:41Z DEBUG h2::client] binding client connection
[2022-12-27T02:54:41Z DEBUG h2::client] client connection bound
[2022-12-27T02:54:41Z DEBUG h2::codec::framed_write] send frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384 }
[2022-12-27T02:54:41Z DEBUG h2::proto::connection] Connection; peer=Client
[2022-12-27T02:54:41Z DEBUG h2::codec::framed_write] send frame=WindowUpdate { stream_id: StreamId(0), size_increment: 5177345 }
[2022-12-27T02:54:41Z DEBUG hyper::client::pool] pooling idle connection for ("https", storage.googleapis.com)
[2022-12-27T02:54:41Z DEBUG h2::codec::framed_write] send frame=Headers { stream_id: StreamId(1), flags: (0x5: END_HEADERS | END_STREAM) }
[2022-12-27T02:54:41Z DEBUG rustls::client::tls13] Ticket saved
[2022-12-27T02:54:41Z DEBUG rustls::client::tls13] Ticket saved
[2022-12-27T02:54:41Z DEBUG h2::codec::framed_read] received frame=Settings { flags: (0x0), max_concurrent_streams: 100, initial_window_size: 1048576, max_header_list_size: 65536 }
[2022-12-27T02:54:41Z DEBUG h2::codec::framed_write] send frame=Settings { flags: (0x1: ACK) }
[2022-12-27T02:54:41Z DEBUG h2::codec::framed_read] received frame=WindowUpdate { stream_id: StreamId(0), size_increment: 983041 }
[2022-12-27T02:54:41Z DEBUG h2::codec::framed_read] received frame=Settings { flags: (0x1: ACK) }
[2022-12-27T02:54:41Z DEBUG h2::proto::settings] received settings ACK; applying Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384 }
[2022-12-27T02:54:41Z DEBUG h2::codec::framed_read] received frame=Headers { stream_id: StreamId(1), flags: (0x4: END_HEADERS) }
[2022-12-27T02:54:41Z DEBUG h2::codec::framed_read] received frame=Reset { stream_id: StreamId(1), error_code: PROTOCOL_ERROR }
[2022-12-27T02:54:41Z DEBUG columnq::io::s3] using custom S3 endpoint https://storage.googleapis.com with region us-east1
[2022-12-27T02:54:42Z DEBUG rusoto_core::request] Full request:
     method: GET
     final_uri: https://storage.googleapis.com/{gcs_bucket_name}?list-type=2&prefix=yellow_tripdata_2022-10.parquet%2F&start-after=yellow_tripdata_2022-10.parquet%2F
    Headers:

[2022-12-27T02:54:42Z DEBUG rusoto_core::request] authorization:"AWS4-HMAC-SHA256 Credential=****** SignedHeaders=content-type;host;x-amz-content-sha256;x-amz-date, Signature=******"
[2022-12-27T02:54:42Z DEBUG rusoto_core::request] content-length:"0"
[2022-12-27T02:54:42Z DEBUG rusoto_core::request] content-type:"application/octet-stream"
[2022-12-27T02:54:42Z DEBUG rusoto_core::request] host:"storage.googleapis.com"
[2022-12-27T02:54:42Z DEBUG rusoto_core::request] x-amz-content-sha256:"******"
[2022-12-27T02:54:42Z DEBUG rusoto_core::request] x-amz-date:"20221227T025442Z"
[2022-12-27T02:54:42Z DEBUG rusoto_core::request] user-agent:"rusoto/0.47.0 rust/1.66.0 macos"
[2022-12-27T02:54:42Z DEBUG hyper::client::connect::dns] resolving host="storage.googleapis.com"
[2022-12-27T02:54:42Z DEBUG hyper::client::connect::http] connecting to 142.251.33.176:443
[2022-12-27T02:54:42Z DEBUG hyper::client::connect::http] connected to 142.251.33.176:443
[2022-12-27T02:54:42Z DEBUG rustls::client::hs] No cached session for DNSNameRef("storage.googleapis.com")
[2022-12-27T02:54:42Z DEBUG rustls::client::hs] Not resuming any session
[2022-12-27T02:54:42Z DEBUG rustls::client::hs] Using ciphersuite TLS13_CHACHA20_POLY1305_SHA256
[2022-12-27T02:54:42Z DEBUG rustls::client::tls13] Not resuming
[2022-12-27T02:54:42Z DEBUG rustls::client::tls13] TLS1.3 encrypted extensions: [Protocols([PayloadU8([104, 50])])]
[2022-12-27T02:54:42Z DEBUG rustls::client::hs] ALPN protocol is Some(b"h2")
[2022-12-27T02:54:42Z DEBUG h2::client] binding client connection
[2022-12-27T02:54:42Z DEBUG h2::client] client connection bound
[2022-12-27T02:54:42Z DEBUG h2::codec::framed_write] send frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384 }
[2022-12-27T02:54:42Z DEBUG h2::proto::connection] Connection; peer=Client
[2022-12-27T02:54:42Z DEBUG h2::codec::framed_write] send frame=WindowUpdate { stream_id: StreamId(0), size_increment: 5177345 }
[2022-12-27T02:54:42Z DEBUG hyper::client::pool] pooling idle connection for ("https", storage.googleapis.com)
[2022-12-27T02:54:42Z DEBUG h2::codec::framed_write] send frame=Headers { stream_id: StreamId(1), flags: (0x5: END_HEADERS | END_STREAM) }
[2022-12-27T02:54:42Z DEBUG rustls::client::tls13] Ticket saved
[2022-12-27T02:54:42Z DEBUG rustls::client::tls13] Ticket saved
[2022-12-27T02:54:42Z DEBUG h2::codec::framed_read] received frame=Settings { flags: (0x0), max_concurrent_streams: 100, initial_window_size: 1048576, max_header_list_size: 65536 }
[2022-12-27T02:54:42Z DEBUG h2::codec::framed_write] send frame=Settings { flags: (0x1: ACK) }
[2022-12-27T02:54:42Z DEBUG h2::codec::framed_read] received frame=WindowUpdate { stream_id: StreamId(0), size_increment: 983041 }
[2022-12-27T02:54:42Z DEBUG h2::codec::framed_read] received frame=Settings { flags: (0x1: ACK) }
[2022-12-27T02:54:42Z DEBUG h2::proto::settings] received settings ACK; applying Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384 }
[2022-12-27T02:54:42Z DEBUG h2::codec::framed_read] received frame=Headers { stream_id: StreamId(1), flags: (0x4: END_HEADERS) }
[2022-12-27T02:54:42Z DEBUG h2::codec::framed_read] received frame=Reset { stream_id: StreamId(1), error_code: PROTOCOL_ERROR }
[2022-12-27T02:54:42Z DEBUG h2::codec::framed_write] send frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(0) }
[2022-12-27T02:54:42Z DEBUG h2::codec::framed_write] send frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(0) }
[2022-12-27T02:54:42Z DEBUG h2::proto::connection] Connection::poll; connection error error=GoAway(b"", NO_ERROR, Library)
[2022-12-27T02:54:42Z DEBUG h2::proto::connection] Connection::poll; connection error error=GoAway(b"", NO_ERROR, Library)
[2022-12-27T02:54:42Z DEBUG rustls::session] Sending warning alert CloseNotify
[2022-12-27T02:54:42Z DEBUG rustls::session] Sending warning alert CloseNotify
Error: Error loading data from S3 store: Failed to list bucket: Error obtaining body: Error obtaining chunk: error reading a body from connection: stream error received: unspecific protocol error detected
@jychen7
Collaborator Author

jychen7 commented Dec 27, 2022

I initially thought this was related to ListObjectsV2, but GCS has supported it since 2021-11. https://cloud.google.com/storage/docs/release-notes#November_01_2021

Now I think the problem is related to rustls + http2. Reported at rusoto/rusoto#1985
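
For readers following the debug log: the line `TLS1.3 encrypted extensions: [Protocols([PayloadU8([104, 50])])]` is the ALPN result, and the bytes are plain ASCII. A minimal sketch for decoding them (the helper names are hypothetical and not part of rustls or roapi):

```rust
/// Decode an ALPN protocol id that rustls prints as a byte list,
/// e.g. `PayloadU8([104, 50])` in the debug log.
/// Hypothetical helper, for illustration only.
fn alpn_to_string(payload: &[u8]) -> Option<String> {
    // ALPN ids are opaque bytes; the registered ones are ASCII,
    // so a UTF-8 decode is enough to make them readable.
    std::str::from_utf8(payload).ok().map(str::to_owned)
}

/// True if the negotiated protocol id is HTTP/2 ("h2").
fn is_http2(payload: &[u8]) -> bool {
    payload == b"h2"
}
```

`alpn_to_string(&[104, 50])` yields `Some("h2")`, which matches the following log line `ALPN protocol is Some(b"h2")`. Both requests in the log negotiate h2 and then receive `Reset { stream_id: StreamId(1), error_code: PROTOCOL_ERROR }`.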

A quick fix could be to force HTTP/1, similar to quickwit-oss/quickwit#1612.
What do you think ❓
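
Why forcing HTTP/1 would sidestep the error can be modelled with ALPN alone: if the client no longer offers `h2`, the server cannot select it. The sketch below is a simplified model of that negotiation, not the actual change (which is in the linked PR); `client_alpn` and `negotiate` are hypothetical names, and the server preference order is an assumption.

```rust
/// Protocols the client advertises via ALPN, in preference order.
/// Hypothetical model: `force_http1` stands in for the proposed quick fix.
fn client_alpn(force_http1: bool) -> Vec<&'static str> {
    if force_http1 {
        vec!["http/1.1"]
    } else {
        vec!["h2", "http/1.1"]
    }
}

/// Simplified server-side selection: pick the first protocol in the
/// server's preference list that the client also offered.
fn negotiate<'a>(server_prefs: &[&'a str], client_offer: &[&'a str]) -> Option<&'a str> {
    server_prefs
        .iter()
        .copied()
        .find(|p| client_offer.contains(p))
}
```

With an assumed server preference of `["h2", "http/1.1"]`, `negotiate(&["h2", "http/1.1"], &client_alpn(false))` returns `Some("h2")`, while passing `client_alpn(true)` returns `Some("http/1.1")`, so the HTTP/2 code path is never entered.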


For the long term, we could probably switch to the object_store crate (the same crate used by datafusion) and support GCS natively.

@houqp
Member

houqp commented Dec 28, 2022

Good catch @jychen7. Both disabling http2 and switching to the object_store crate sound good to me!

@jychen7
Collaborator Author

jychen7 commented Jan 1, 2023

@houqp I created a PR to disable http2 here: #236 :D

I think "switch to the object_store crate" could be the next step, if ObjectStoreProvider is implemented in #227 (comment)

@houqp houqp closed this as completed in #236 Jan 1, 2023
houqp pushed a commit that referenced this issue Jan 1, 2023
* upgrade rusoto and force http1 for rustls

* add dependency for openssl_build