Upload blob from HTTPResponse #28

Open
jared-martin opened this issue Feb 4, 2019 · 3 comments
Labels
api: storage Issues related to the googleapis/python-storage API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@jared-martin

I'm trying to use Blob.upload_from_file to upload an http.client.HTTPResponse object without saving it to disk first. It seems like this, or a version of this that wraps the HTTPResponse in an io object, should be possible.

However, because the response may be larger than _MAX_MULTIPART_SIZE, Blob.upload_from_file creates a resumable upload, which depends on tell to make sure the stream is at the beginning. Here is the code that reproduces this issue:

from google.cloud import storage
client = storage.Client()
bucket = client.bucket('my-bucket')
blob = bucket.blob('my-file.csv', chunk_size=1 << 20)

import urllib.request
a_few_megs_of_data = 'https://baseballsavant.mlb.com/statcast_search/csv?all=true&batter_stands=&game_date_gt=2018-09-06&game_date_lt=2018-09-09&group_by=name&hfAB=&hfBBL=&hfBBT=&hfC=&hfFlag=&hfGT=R%7CPO%7CS%7C&hfInn=&hfNewZones=&hfOuts=&hfPR=&hfPT=&hfRO=&hfSA=&hfSea=2018%7C&hfSit=&hfZ=&home_road=&metric_1=&min_abs=0&min_pitches=0&min_results=0&opponent=&pitcher_throws=&player_event_sort=h_launch_speed&player_type=batter&position=&sort_col=pitches&sort_order=desc&stadium=&team=&type=details'
response = urllib.request.urlopen(a_few_megs_of_data)

blob.upload_from_file(response)

Traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1081, in upload_from_file
    client, file_obj, content_type, size, num_retries, predefined_acl
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 991, in _do_upload
    client, stream, content_type, size, num_retries, predefined_acl
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 934, in _do_resumable_upload
    predefined_acl=predefined_acl,
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 883, in _initiate_resumable_upload
    stream_final=False,
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/resumable_media/requests/upload.py", line 323, in initiate
    total_bytes=total_bytes, stream_final=stream_final)
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/resumable_media/_upload.py", line 409, in _prepare_initiate_request
    if stream.tell() != 0:
io.UnsupportedOperation: seek

Is it possible to read an HTTP response in chunks and write it to the blob without using the filesystem as an intermediary, or is this bad practice? If it is possible and not discouraged, what is the recommended way to do this?

@jared-martin
Author

Update: wrapping the HTTP response in a class that trivially implements tell makes this work as expected.

from google.cloud import storage
client = storage.Client()
bucket = client.bucket('my-bucket')
blob = bucket.blob('my-file.csv', chunk_size=1 << 20)

import urllib.request
a_few_megs_of_data = 'https://baseballsavant.mlb.com/statcast_search/csv?all=true&batter_stands=&game_date_gt=2018-09-06&game_date_lt=2018-09-09&group_by=name&hfAB=&hfBBL=&hfBBT=&hfC=&hfFlag=&hfGT=R%7CPO%7CS%7C&hfInn=&hfNewZones=&hfOuts=&hfPR=&hfPT=&hfRO=&hfSA=&hfSea=2018%7C&hfSit=&hfZ=&home_road=&metric_1=&min_abs=0&min_pitches=0&min_results=0&opponent=&pitcher_throws=&player_event_sort=h_launch_speed&player_type=batter&position=&sort_col=pitches&sort_order=desc&stadium=&team=&type=details'
response = urllib.request.urlopen(a_few_megs_of_data)

class HTTPResponseWithTell:
    """Wrap an unseekable HTTPResponse, tracking position so tell() works."""

    def __init__(self, http_response):
        self.http_response = http_response
        self.number_of_bytes_read = 0

    def tell(self):
        return self.number_of_bytes_read

    def read(self, *args, **kwargs):
        buffer = self.http_response.read(*args, **kwargs)
        self.number_of_bytes_read += len(buffer)
        return buffer

response_with_tell = HTTPResponseWithTell(response)
blob.upload_from_file(response_with_tell)

This reads the response 1 MB at a time and uploads it to cloud storage without ever storing the whole thing in memory.

However, after reading through the code and understanding ResumableUpload a little bit better, the point seems to be that unseekable streams are not resumable, since seek is required to resume an upload from where it left off in the event of failure. There doesn't seem to be a supported option for uploading data in chunks that is not strictly "resumable".
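For what it's worth, the underlying pattern itself — read a fixed-size chunk, write it to a destination file object, repeat — needs nothing beyond the stdlib, and newer python-storage releases expose `Blob.open('wb')`, whose writer can serve as the destination. Here is a minimal sketch of the pattern using `io.BytesIO` objects as stand-ins for the HTTP response and the blob writer (the 1 MiB chunk length mirrors the `chunk_size` above):

```python
import io
import shutil

# Stand-ins for illustration only: in real use, `source` would be the
# HTTPResponse and `destination` a writable blob handle such as
# blob.open('wb') in newer python-storage releases.
source = io.BytesIO(b"x" * (3 * (1 << 20)))  # ~3 MiB of fake response data
destination = io.BytesIO()

# Copy in 1 MiB chunks; at most one chunk is held in memory at a time.
shutil.copyfileobj(source, destination, length=1 << 20)
```

`shutil.copyfileobj` only ever calls `read(length)` on the source and `write()` on the destination, so no `seek` or `tell` support is required on either side.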

@crwilcox
Contributor

Thanks for providing this feedback. It seems we would need to alter the inner workings so they don't depend on being able to seek backwards through the stream. To my knowledge our Node client supports this, so it isn't an unreasonable ask for Python.

Thanks for the feedback!

@jiajie-chen-havas

This would definitely be super helpful for our team as well!
Some of the data we're trying to upload is generated from generators/iterables, which we wrap in a custom read-only subclass of io.RawIOBase.
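A minimal sketch of such a wrapper, for anyone hitting the same need (the class name `IterStream` is hypothetical, not part of any library): a read-only `io.RawIOBase` subclass that implements `readinto`, which `io.BufferedReader` can then turn into a normal chunked stream over any iterable of bytes.

```python
import io

class IterStream(io.RawIOBase):
    """Read-only stream over an iterable of bytes chunks (hypothetical helper)."""

    def __init__(self, iterable):
        self._iter = iter(iterable)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, b):
        # Fill the caller's buffer from the current chunk, pulling the next
        # chunk from the iterator whenever the leftover bytes run out.
        while not self._leftover:
            try:
                self._leftover = next(self._iter)
            except StopIteration:
                return 0  # EOF
        n = min(len(b), len(self._leftover))
        b[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n

reader = io.BufferedReader(IterStream([b"abc", b"defg"]))
data = reader.read()  # b"abcdefg"
```

Note this stream is still unseekable, so it runs into the same resumable-upload restriction described above; a `tell`-tracking wrapper like the one earlier in this thread is still needed.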

@crwilcox crwilcox transferred this issue from googleapis/google-cloud-python Jan 31, 2020