
Support incremental uploads #1430

Open
mariosasko opened this issue Oct 21, 2022 · 8 comments
Labels
enhancement New feature or request filesystem

Comments

@mariosasko
Contributor

mariosasko commented Oct 21, 2022

It would be great to support incremental uploads to avoid temporary file creation in HfFileSystemFile._initiate_upload and to be more aligned with fsspec's philosophy (see huggingface/hffs#1 (comment))

When uploading an HfFileSystemFile, the file's contents are not known in advance, meaning we can't compute the file's SHA and size, which are needed to fetch the upload mode, compute the number of parts in multi-part upload mode on the moon-landing side, etc.

Fixing this would probably require a new endpoint that accepts file contents in chunks, computes their git metadata, and writes them to a repo (as a regular or an LFS file).

(cc @julien-c @coyotte508)
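To illustrate why the size must be known before upload: git's blob id is a sha1 over a header that embeds the byte length ahead of the content, so the hash can't even be started for a stream of unknown length. A minimal sketch:

```python
import hashlib

def git_blob_sha(data: bytes) -> str:
    # Git's blob id hashes a header containing the total size, then the
    # content. The size appears first, so it must be known up front.
    header = f"blob {len(data)}\0".encode()
    return hashlib.sha1(header + data).hexdigest()
```

This is why a streaming write has to be buffered somewhere (e.g. a temporary file) before the commit metadata can be computed.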

@mariosasko mariosasko added the enhancement New feature or request label Oct 21, 2022
@julien-c
Member

I don't think we can really do this, as:

  • LFS files are not uploaded to hf.co but to an external bucket
  • we wouldn't want to proxy files, especially large files; it would become a big bandwidth SPOF

I think you should look into client-side workarounds for this

@julien-c
Member

tagging @XciD @coyotte508 @Kakulukian just in case (but I think it's a long-term subject)

@coyotte508
Member

coyotte508 commented Jan 13, 2023

We'd need to create a new cloud backend service that the user can stream arbitrary content to, including files over 5GB, and that can then compute the size / sha of the files.

IMO it'd be a new service separate from moon-landing, and it needs dev time
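A minimal sketch of what such a service could do (hypothetical; the write to the storage backend is omitted): consume the upload stream chunk by chunk, folding each chunk into running size and sha256 accumulators so that neither needs to be known up front.

```python
import hashlib

def ingest_stream(chunks):
    # Server-side sketch: accept arbitrary-length content as a stream and
    # compute total size and sha256 incrementally. In a real service each
    # chunk would also be appended to the storage backend here.
    size = 0
    sha = hashlib.sha256()
    for chunk in chunks:
        size += len(chunk)
        sha.update(chunk)
    return size, sha.hexdigest()
```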

@XciD
Member

XciD commented Jan 16, 2023

With the S3 upload URL, isn't it possible to push in multipart directly from the client? Then you wouldn't need to know the number of parts?

@coyotte508
Member

coyotte508 commented Jan 16, 2023

Ah... I thought you needed to know the file size beforehand and have a minimum file size, but apparently not.

I guess you can use the old multipart endpoint @mariosasko, /api/:repoType(models|spaces|datasets)?/:namespace?/:repo/upload/:rev/*. You'd need to stream-upload a file there and set a Content-Length header that is at least the size of the file, e.g. 50GB if you want to be safe.

It's deprecated and will get interrupted every time the hub reloads, but maybe it's possible to put it in a separate kube pod @XciD

Edit: see the old code, maybe you can reuse it; you just need to set a 50GB Content-Length.

cc @julien-c @Pierrci
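As a rough sketch of this workaround (the host and exact header handling are assumptions; only the route shape comes from the path above): build the legacy endpoint URL and pin Content-Length to a safe upper bound, since the real size is unknown at the start of the stream.

```python
def legacy_upload_request(repo_type, namespace, repo, rev, path,
                          declared_size=50 * 1024**3):
    # Hypothetical helper: target the deprecated streaming endpoint and
    # declare an upper-bound Content-Length (50GB by default) because the
    # true size of the stream is not known in advance.
    url = (f"https://huggingface.co/api/{repo_type}/{namespace}/{repo}"
           f"/upload/{rev}/{path}")
    headers = {"Content-Length": str(declared_size)}
    return url, headers
```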

@Wauplin
Contributor

Wauplin commented Jan 16, 2023

> With the S3 upload URL, isn't it possible to push in multipart directly from the client?

The current problem is that before getting an S3 upload URL, we need to send the server the size and SHA of the file so it knows whether the client should upload it as a regular or an LFS file.
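Concretely, the exchange looks something like this (function names and the threshold are illustrative assumptions, not the actual API): the client must hash the complete content before asking for an upload URL, which is exactly what blocks a streaming upload.

```python
import hashlib

LFS_THRESHOLD = 10 * 1024 * 1024  # assumed cut-off, for illustration only

def preupload_payload(data: bytes) -> dict:
    # Client side: both values require the full content to be available.
    return {"size": len(data), "sha256": hashlib.sha256(data).hexdigest()}

def upload_mode(payload: dict) -> str:
    # Server side: pick regular vs LFS from the declared size.
    return "lfs" if payload["size"] >= LFS_THRESHOLD else "regular"
```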

@mariosasko
Contributor Author

> The current problem is that before getting an S3 upload URL, we need to send the server the size and SHA of the file so it knows whether the client should upload it as a regular or an LFS file.

If I'm not mistaken, this also means we cannot address huggingface/datasets#5045 (uploading Parquet shards iteratively in Dataset.push_to_hub), as the shards' sizes/SHAs are not known in advance.

And Dataset.push_to_hub is one of the most used methods in datasets, so being able to do this would be nice.

@mariosasko mariosasko transferred this issue from huggingface/hffs Apr 6, 2023
@mariosasko mariosasko changed the title Avoid temporary file creation in file upload Avoid temporary file creation in HfFileSystem's file upload Apr 6, 2023
@julien-c
Member

julien-c commented Apr 6, 2023

For the record, what we've been thinking about a bit recently would be to move away from git hosting somewhat, and potentially either:

  • be able to stream-upload directly to the bucket and create the commit a posteriori, at the end of the upload (if we want to keep git compatibility)
  • more extreme version: being able to opt out of git and just use a repo as a bucket (🤔)
  • or something else in between

In all cases, this is very long-term

@mariosasko mariosasko changed the title Avoid temporary file creation in HfFileSystem's file upload Support incremental uploads Dec 22, 2023