Unable to open a file in GCS #36993

Open
Fokko opened this issue Aug 2, 2023 · 2 comments

Comments

@Fokko
Contributor

Fokko commented Aug 2, 2023

Describe the bug, including details regarding any error messages, version, and platform.

I'm writing integration tests against a local GCS instance using fake-gcs-server. Writing a file and fetching its file info both succeed, but opening the same file for reading fails:

➜  python git:(fd-gcs) ✗ ipython
Python 3.9.17 (main, Jun 20 2023, 18:00:22) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.14.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from pyarrow.fs import GcsFileSystem
   ...: from datetime import datetime
   ...: 
   ...: fs = GcsFileSystem(
   ...:   access_token='anon',
   ...:   credential_token_expiration=datetime(2023, 8, 2, 16, 30, 4),
   ...:   scheme='http',
   ...:   endpoint_override='0.0.0.0:4443'
   ...: )

In [2]: location = 'warehouse/vo.txt'
   ...: 
   ...: with fs.open_output_stream(location) as f:
   ...:   print(f.write(b"foo"))
3

In [3]: print(fs.get_file_info(location))
<FileInfo for 'warehouse/vo.txt': type=FileType.File, size=3>

In [4]: with fs.open_input_file(location) as f:
   ...:   print(f.read())
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[4], line 1
----> 1 with fs.open_input_file(location) as f:
      2   print(f.read())

File ~/Library/Python/3.9/lib/python/site-packages/pyarrow/_fs.pyx:763, in pyarrow._fs.FileSystem.open_input_file()

File ~/Library/Python/3.9/lib/python/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/Library/Python/3.9/lib/python/site-packages/pyarrow/error.pxi:113, in pyarrow.lib.check_status()

FileNotFoundError: [Errno 2] google::cloud::Status(NOT_FOUND: Permanent error in Read(): ). Detail: [errno 2] No such file or directory

Can be reproduced using:

from pyarrow.fs import GcsFileSystem
from datetime import datetime

fs = GcsFileSystem(
  access_token='anon',
  credential_token_expiration=datetime(2023, 8, 2, 16, 30, 4),
  scheme='http',
  endpoint_override='0.0.0.0:4443'
)

location = 'warehouse/vo.txt'

with fs.open_output_stream(location) as f:
  print(f.write(b"foo"))

print(fs.get_file_info(location))

with fs.open_input_file(location) as f:
  print(f.read())

Failing calls with PyArrow

time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"GET /storage/v1/b/warehouse/o/1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt HTTP/1.1\" 404 59"
time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"GET /storage/v1/b/warehouse/o?prefix=1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt%2F&pageToken= HTTP/1.1\" 200 27"
time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"POST /upload/storage/v1/b/warehouse/o?uploadType=resumable&name=1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt HTTP/1.1\" 200 335"
time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"PUT /upload/storage/v1/b/warehouse/o?uploadType=resumable&name=1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt&upload_id=43a8ec7bc33a15592b750fc916790750 HTTP/1.1\" 200 570"
time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"GET /storage/v1/b/warehouse/o/1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt HTTP/1.1\" 200 570"
time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"GET /warehouse/1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt HTTP/1.1\" 404 10"

The last call is the one returning the 404. Unlike the preceding requests, its path does not start with /storage/v1/b/.
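To spot such requests quickly in a longer log, something like the following could be used. It is only a sketch: the helper name `non_json_api_requests` and the regular expression are made up for illustration, and it assumes the fake-gcs-server log format shown above (request line wrapped in escaped quotes, followed by the status code).

```python
import re

# Path prefixes used by the JSON API requests in the log above.
JSON_API_PREFIXES = ("/storage/v1/b/", "/upload/storage/v1/b/")

# Matches: \"GET /some/path HTTP/1.1\" 404  (quotes are backslash-escaped in the log)
REQUEST_RE = re.compile(
    r'\\"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+\\" (?P<status>\d{3})'
)


def non_json_api_requests(log_lines):
    """Return (method, path, status) for requests that lack a JSON API prefix."""
    hits = []
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match and not match.group("path").startswith(JSON_API_PREFIXES):
            hits.append(
                (match.group("method"), match.group("path"), int(match.group("status")))
            )
    return hits


# The last two PyArrow log lines from above, as raw strings.
sample = [
    r'time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"GET /storage/v1/b/warehouse/o/1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt HTTP/1.1\" 200 570"',
    r'time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"GET /warehouse/1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt HTTP/1.1\" 404 10"',
]

for method, path, status in non_json_api_requests(sample):
    print(method, path, status)
```

On the two sample lines this flags only the final GET on /warehouse/….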

The requests made by the equivalent code using GCSSpec:

time="2023-08-02T14:36:10Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:36:10 +0000] \"GET /storage/v1/b/warehouse/o/d3057e83-52ab-4ce4-b16f-d55af7ba3525.txt HTTP/1.1\" 404 59"
time="2023-08-02T14:36:10Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:36:10 +0000] \"GET /storage/v1/b/warehouse/o?delimiter=/&prefix=d3057e83-52ab-4ce4-b16f-d55af7ba3525.txt/ HTTP/1.1\" 200 27"
time="2023-08-02T14:36:10Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:36:10 +0000] \"GET /storage/v1/b/warehouse/o/d3057e83-52ab-4ce4-b16f-d55af7ba3525.txt HTTP/1.1\" 404 59"
time="2023-08-02T14:36:10Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:36:10 +0000] \"POST /upload/storage/v1/b/warehouse/o?uploadType=resumable HTTP/1.1\" 200 335"
time="2023-08-02T14:36:10Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:36:10 +0000] \"POST /upload/storage/v1/b/warehouse/o?uploadType=resumable&name=d3057e83-52ab-4ce4-b16f-d55af7ba3525.txt&upload_id=2b6f8d48acf8dd87cc86d1e51bd3120e HTTP/1.1\" 200 570"
time="2023-08-02T14:36:10Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:36:10 +0000] \"GET /storage/v1/b/warehouse/o/d3057e83-52ab-4ce4-b16f-d55af7ba3525.txt HTTP/1.1\" 200 570"

This only seems to happen when endpoint_override is set.

Component(s)

Python

@pitrou
Member

pitrou commented Aug 22, 2023

cc @coryan

@coryan
Contributor

coryan commented Aug 22, 2023

I am not sure which version of google-cloud-cpp this is using under the hood. Until v2.7.0, google-cloud-cpp used the XML API for (most) downloads. The XML API does not use the /storage/v1/b/ prefix. AFAIK fake-gcs-server does not support the XML API. At least it used not to support it, and #331 is still open.

Setting the GOOGLE_CLOUD_CPP_STORAGE_REST_CONFIG environment variable to disable-xml should disable XML and allow you to use fake-gcs-server.
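In a Python test setup this could look roughly like the sketch below. The variable name and value are the ones mentioned above; the helper `make_fake_gcs_filesystem` is hypothetical, and the constructor arguments are simply copied from the reproduction in the report. It assumes the variable has to be in the environment before the GCS client is created, so it is set at the very top.

```python
import os
from datetime import datetime

# Set this before any GcsFileSystem is created (assumption: the underlying
# google-cloud-cpp client picks it up when the client is constructed).
os.environ["GOOGLE_CLOUD_CPP_STORAGE_REST_CONFIG"] = "disable-xml"


def make_fake_gcs_filesystem(endpoint="0.0.0.0:4443"):
    """Hypothetical helper: a GcsFileSystem pointed at a local fake-gcs-server."""
    # Imported lazily so the environment variable above is already in place.
    from pyarrow.fs import GcsFileSystem

    # Same arguments as in the reproduction from the report.
    return GcsFileSystem(
        access_token="anon",
        credential_token_expiration=datetime(2023, 8, 2, 16, 30, 4),
        scheme="http",
        endpoint_override=endpoint,
    )
```

Exporting the variable in the shell or the test runner's environment before starting Python should have the same effect.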

If you are using a newer version of google-cloud-cpp then disregard these comments.
