Upload blob from HTTPResponse #28

Open
jared-martin opened this issue Feb 4, 2019 · 3 comments
Labels
api: storage Issues related to the googleapis/python-storage API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@jared-martin

I'm trying to use Blob.upload_from_file to upload an http.client.HTTPResponse object without saving it to disk first. It seems like this, or a version of this that wraps the HTTPResponse in an io object, should be possible.

However, because the response may be larger than _MAX_MULTIPART_SIZE, Blob.upload_from_file creates a resumable upload, which depends on tell to make sure the stream is at the beginning. Here is the code that reproduces this issue:

from google.cloud import storage
client = storage.Client()
bucket = client.bucket('my-bucket')
blob = bucket.blob('my-file.csv', chunk_size=1 << 20)

import urllib.request
a_few_megs_of_data = 'https://baseballsavant.mlb.com/statcast_search/csv?all=true&batter_stands=&game_date_gt=2018-09-06&game_date_lt=2018-09-09&group_by=name&hfAB=&hfBBL=&hfBBT=&hfC=&hfFlag=&hfGT=R%7CPO%7CS%7C&hfInn=&hfNewZones=&hfOuts=&hfPR=&hfPT=&hfRO=&hfSA=&hfSea=2018%7C&hfSit=&hfZ=&home_road=&metric_1=&min_abs=0&min_pitches=0&min_results=0&opponent=&pitcher_throws=&player_event_sort=h_launch_speed&player_type=batter&position=&sort_col=pitches&sort_order=desc&stadium=&team=&type=details'
response = urllib.request.urlopen(a_few_megs_of_data)

blob.upload_from_file(response)

Traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1081, in upload_from_file
    client, file_obj, content_type, size, num_retries, predefined_acl
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 991, in _do_upload
    client, stream, content_type, size, num_retries, predefined_acl
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 934, in _do_resumable_upload
    predefined_acl=predefined_acl,
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 883, in _initiate_resumable_upload
    stream_final=False,
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/resumable_media/requests/upload.py", line 323, in initiate
    total_bytes=total_bytes, stream_final=stream_final)
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/resumable_media/_upload.py", line 409, in _prepare_initiate_request
    if stream.tell() != 0:
io.UnsupportedOperation: seek

Is it possible to read an HTTP response in chunks and write it to the blob without using the filesystem as an intermediary, or is this bad practice? If it is possible and not discouraged, what is the recommended way to do this?

@jared-martin
Author

Update: wrapping the HTTP response in a class that trivially implements tell makes this work as expected.

from google.cloud import storage
client = storage.Client()
bucket = client.bucket('my-bucket')
blob = bucket.blob('my-file.csv', chunk_size=1 << 20)

import urllib.request
a_few_megs_of_data = 'https://baseballsavant.mlb.com/statcast_search/csv?all=true&batter_stands=&game_date_gt=2018-09-06&game_date_lt=2018-09-09&group_by=name&hfAB=&hfBBL=&hfBBT=&hfC=&hfFlag=&hfGT=R%7CPO%7CS%7C&hfInn=&hfNewZones=&hfOuts=&hfPR=&hfPT=&hfRO=&hfSA=&hfSea=2018%7C&hfSit=&hfZ=&home_road=&metric_1=&min_abs=0&min_pitches=0&min_results=0&opponent=&pitcher_throws=&player_event_sort=h_launch_speed&player_type=batter&position=&sort_col=pitches&sort_order=desc&stadium=&team=&type=details'
response = urllib.request.urlopen(a_few_megs_of_data)

class HTTPResponseWithTell:
    """Wrap an unseekable HTTPResponse, tracking position so tell() works."""

    def __init__(self, http_response):
        self.http_response = http_response
        self.number_of_bytes_read = 0

    def tell(self):
        return self.number_of_bytes_read

    def read(self, *args, **kwargs):
        buffer = self.http_response.read(*args, **kwargs)
        self.number_of_bytes_read += len(buffer)
        return buffer

response_with_tell = HTTPResponseWithTell(response)
blob.upload_from_file(response_with_tell)

This reads the response 1 MB at a time and uploads it to cloud storage without ever storing the whole thing in memory.

However, after reading through the code and understanding ResumableUpload a little bit better, the point seems to be that unseekable streams are not resumable, since seek is required to resume an upload from where it left off in the event of failure. There doesn't seem to be a supported option for uploading data in chunks that is not strictly "resumable".
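For what it's worth, the underlying pattern itself — read a fixed-size chunk, write it to a destination file object, repeat — needs nothing beyond the stdlib, and newer python-storage releases expose `Blob.open('wb')`, whose writer can serve as the destination. Here is a minimal sketch of the pattern using `io.BytesIO` objects as stand-ins for the HTTP response and the blob writer (the 1 MiB chunk length mirrors the `chunk_size` above):

```python
import io
import shutil

# Stand-ins for illustration only: in real use, `source` would be the
# HTTPResponse and `destination` a writable blob handle such as
# blob.open('wb') in newer python-storage releases.
source = io.BytesIO(b"x" * (3 * (1 << 20)))  # ~3 MiB of fake response data
destination = io.BytesIO()

# Copy in 1 MiB chunks; at most one chunk is held in memory at a time.
shutil.copyfileobj(source, destination, length=1 << 20)
```

`shutil.copyfileobj` only ever calls `read(length)` on the source and `write()` on the destination, so no `seek` or `tell` support is required on either side.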

@crwilcox
Contributor

Thanks for providing this feedback. It seems we would need to alter the inner workings so they don't depend on being able to seek backwards through the stream. To my knowledge our Node client supports this, so it isn't an unreasonable ask for Python.

Thanks for the feedback!

@jiajie-chen-havas

This would definitely be super helpful for our team as well!
Some of the data we're trying to upload is generated from generators/iterables, which we wrap in a custom read-only subclass of io.RawIOBase.
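A minimal sketch of such a wrapper, for anyone hitting the same need (the class name `IterStream` is hypothetical, not part of any library): a read-only `io.RawIOBase` subclass that implements `readinto`, which `io.BufferedReader` can then turn into a normal chunked stream over any iterable of bytes.

```python
import io

class IterStream(io.RawIOBase):
    """Read-only stream over an iterable of bytes chunks (hypothetical helper)."""

    def __init__(self, iterable):
        self._iter = iter(iterable)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, b):
        # Fill the caller's buffer from the current chunk, pulling the next
        # chunk from the iterator whenever the leftover bytes run out.
        while not self._leftover:
            try:
                self._leftover = next(self._iter)
            except StopIteration:
                return 0  # EOF
        n = min(len(b), len(self._leftover))
        b[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n

reader = io.BufferedReader(IterStream([b"abc", b"defg"]))
data = reader.read()  # b"abcdefg"
```

Note this stream is still unseekable, so it runs into the same resumable-upload restriction described above; a `tell`-tracking wrapper like the one earlier in this thread is still needed.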

@crwilcox crwilcox transferred this issue from googleapis/google-cloud-python Jan 31, 2020