Support GCS files without credentials #537

Closed
3 tasks done
ppwwyyxx opened this issue Sep 17, 2020 · 10 comments · Fixed by #728

@ppwwyyxx

Problem description

smart_open should be able to read public GCS files without requiring credentials.

Steps/code to reproduce the problem

path = "gs://tensorflow-nightly/prod/tensorflow/release/ubuntu_16/gpu_py37_full/nightly_release/18/20190813-010608/github/tensorflow/pip_pkg/tf_nightly_gpu-1.15.0.dev20190813-cp37-cp37m-linux_x86_64.whl"

import smart_open
try:
    f = smart_open.smart_open(path)
except Exception as e:
    print(e)


import tensorflow as tf
f = tf.io.gfile.GFile(path, "rb")
with open("out.whl", "wb") as fout:
    fout.write(f.read())

Running the above code, smart_open failed with

Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started

while tf.io is able to successfully download the public file, although with a warning:

W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Not found: Could not locate the credentials file.". Retrieving token from GCE failed with "Failed precondition: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".

Since it's possible to download the file, it's best to not require a credential so that public files can be easily downloaded by anyone.

Versions

Linux-5.4.63-1-lts-x86_64-with-glibc2.2.5
Python 3.8.5 (default, Sep 17 2020, 00:56:56)
smart_open 2.1.1

Checklist

Before you create the issue, please make sure you have:

  • Described the problem clearly
  • Provided a minimal reproducible example, including any required data
  • Provided the version numbers of the relevant software

@piskvorky
Owner

Thanks! That makes sense, I wasn't aware of that use case.

Can you open a PR with a fix?

@petedannemann
Contributor

petedannemann commented Sep 17, 2020

I do not suggest we make any code changes. Using an anonymous client is a rare use case and I think forcing someone to be explicit about it is OK. To accomplish what you want you just need to create the anonymous client and pass it into smart_open.open as a transport_param as documented in the README.

path = "gs://tensorflow-nightly/prod/tensorflow/release/ubuntu_16/gpu_py37_full/nightly_release/18/20190813-010608/github/tensorflow/pip_pkg/tf_nightly_gpu-1.15.0.dev20190813-cp37-cp37m-linux_x86_64.whl"

import smart_open
import google.cloud.storage

client = google.cloud.storage.Client.create_anonymous_client()
f = smart_open.open(path, transport_params=dict(client=client))

EDIT: I just tested this and I can confirm this works
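
For completeness, here is a minimal sketch of how the workaround above might be used to actually save the file to disk. It is not part of the original comment. The helper name `download_public_gcs_file` is hypothetical; the `create_anonymous_client()` call and the `client` transport parameter are the ones shown above. The file is opened in binary mode because a .whl is a binary archive.

```python
import shutil


def download_public_gcs_file(uri, dest_path):
    """Stream a public gs:// object to a local file without credentials.

    Hypothetical helper for illustration; not part of smart_open.
    """
    # Imported lazily so the function can be defined even if the
    # packages are not installed yet.
    import google.cloud.storage
    import smart_open

    # The anonymous client sends no credentials, so only publicly
    # readable objects can be accessed through it.
    client = google.cloud.storage.Client.create_anonymous_client()

    # "rb" keeps the payload as bytes; context managers close both
    # handles even if the copy fails midway.
    with smart_open.open(uri, "rb", transport_params=dict(client=client)) as fin:
        with open(dest_path, "wb") as fout:
            shutil.copyfileobj(fin, fout)
```

Calling `download_public_gcs_file(path, "out.whl")` with the `path` from the issue would then mirror the `tf.io.gfile` snippet in the original report, assuming the bucket is still public.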

@ravindrabhargava

May I work on this, if there are no objections and the issue is still open?

@petedannemann
Contributor

petedannemann commented Sep 17, 2020

I am in favor of documenting this use case explicitly in the README though

@piskvorky
Owner

piskvorky commented Sep 17, 2020

Hm, maybe we can move these recipes for the various storages (S3, GC, HTTPS…) into separate Wiki pages? And link to them from the README.

Because I'm worried the README is becoming unwieldy. CC @mpenkov. Or is a single comprehensive page better? Needs a TOC though.

Btw README is showing a red "build failing" badge at the moment.

@ppwwyyxx
Author

What's the down side of always trying the anonymous client when no credential is found?
If that fails with permission issues an error can still be thrown. It seems like a strict improvement to me. For projects that use GCS to serve public files, this is a very useful improvement.
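
A minimal sketch of the fallback being proposed, to make the idea concrete. This is not code from smart_open, and the function and stand-in names below are hypothetical. The client factories and the exception type are passed in so the logic can be run without GCS installed.

```python
def client_with_anonymous_fallback(make_default, make_anonymous, missing_credentials_error):
    """Return the default client, or an anonymous one if no credentials are found.

    Hypothetical helper: the three arguments are callables / an exception
    class supplied by the caller, not smart_open APIs.
    """
    try:
        return make_default()
    except missing_credentials_error:
        # No credentials available: fall back to unauthenticated access.
        # A later permission error on a private object would still surface
        # when the object is actually read.
        return make_anonymous()


# Stand-ins so the sketch runs on its own; both names are hypothetical.
class NoCredentials(Exception):
    pass


def failing_default():
    raise NoCredentials("Could not automatically determine credentials.")


client = client_with_anonymous_fallback(
    failing_default, lambda: "anonymous-client", NoCredentials
)
print(client)  # prints "anonymous-client"
```

For GCS, the arguments would presumably be `google.cloud.storage.Client`, `google.cloud.storage.Client.create_anonymous_client`, and `google.auth.exceptions.DefaultCredentialsError`, on the assumption that this is the exception behind the error message quoted in the issue description.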

@petedannemann
Contributor

> What's the down side of always trying the anonymous client when no credential is found?
> If that fails with permission issues an error can still be thrown. It seems like a strict improvement to me. For projects that use GCS to serve public files, this is a very useful improvement.

Different behavior than the google.cloud.storage API

@mpenkov
Collaborator

mpenkov commented Sep 18, 2020

@piskvorky We already have a how-to guide explicitly for capturing edge cases like this.

https://github.com/RaRe-Technologies/smart_open/blob/develop/howto.md

@petedannemann I agree, let's deal with this in documentation for now.

@ppwwyyxx Please feel free to add to that guide using a PR.

@ppwwyyxx
Author

> Different behavior than the google.cloud.storage API

That's reasonable. However I thought the exact goal of this project is to provide simpler and more unified (in other words, less backend-specific) APIs. So this argument doesn't seem very compelling to me.

But I'll leave that to maintainers who know more about what's best for the project.

@petedannemann
Contributor

> > Different behavior than the google.cloud.storage API
>
> That's reasonable. However I thought the exact goal of this project is to provide simpler and more unified (in other words, less backend-specific) APIs. So this argument doesn't seem very compelling to me.
>
> But I'll leave that to maintainers who know more about what's best for the project.

My understanding is that the goal of this project was to provide a unified API for file-like objects. I thought handling authentication to the "file systems" that hold these file-like objects was expected to differ so much from system to system that smart_open defers to the underlying Python package for each file system for authentication. That is why our transport_params kwarg exists. I defer to the maintainers of this project on this topic though.

mpenkov pushed a commit that referenced this issue Oct 2, 2022

…ials (#728)