Relationship between dvc commands (push, pull, add), file size and number of files. #10337

The pull processing time depends on the number of files, not on their total size. Is this because the time is dominated by I/O to the storage?
The push processing time is almost the same. Why is that?

Downloading lots of small files is always going to be slower than downloading the same amount of data as a few big files, because every file carries its own fixed overhead. That's probably what you are seeing in the benchmarks.
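To make the small-files point concrete, here is a rough back-of-the-envelope model. It is only a sketch: the function name, the 50 ms per-file overhead, and the 100 MB/s bandwidth are illustrative assumptions, not measurements from DVC or from any particular remote.

```python
def estimated_transfer_seconds(
    num_files,
    total_bytes,
    per_file_overhead_s=0.05,      # assumed fixed cost per file (request, metadata, ...)
    bandwidth_bytes_per_s=100e6,   # assumed sustained bandwidth of 100 MB/s
):
    """Toy model: fixed per-file cost plus bandwidth-limited cost."""
    return num_files * per_file_overhead_s + total_bytes / bandwidth_bytes_per_s


ONE_GB = 1_000_000_000

# Same 1 GB of data, split differently.
one_big = estimated_transfer_seconds(1, ONE_GB)
many_small = estimated_transfer_seconds(10_000, ONE_GB)

print(f"1 file:       {one_big:.2f} s")
print(f"10,000 files: {many_small:.2f} s")
```

Under these assumed numbers the per-file term dominates as soon as the file count grows, so the total time tracks the number of files much more closely than the total size.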

It also depends on which remote you are using. If you are using S3, Azure, or Google Cloud Storage, DVC tries to upload files asynchronously in batches. For other filesystems, DVC uses a multithreaded executor, which might have high overhead.
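The two transfer styles mentioned above can be sketched roughly as follows. This is not DVC's actual code: the upload functions are stand-ins that only sleep, and every name here (`fake_async_upload`, `upload_in_batches`, `upload_with_threads`, `FAKE_LATENCY_S`) is hypothetical and chosen for illustration.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

FAKE_LATENCY_S = 0.01  # assumed per-file round trip, for illustration only


async def fake_async_upload(name):
    # Stand-in for a non-blocking upload call.
    await asyncio.sleep(FAKE_LATENCY_S)
    return name


async def upload_in_batches(names, batch_size):
    """Asynchronous style: start a whole batch at once, wait for it, repeat."""
    done = []
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        done.extend(await asyncio.gather(*(fake_async_upload(n) for n in batch)))
    return done


def fake_blocking_upload(name):
    # Stand-in for a blocking upload call.
    time.sleep(FAKE_LATENCY_S)
    return name


def upload_with_threads(names, jobs):
    """Threaded style: a fixed pool of worker threads, one file per task."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(fake_blocking_upload, names))


names = [f"file_{i}" for i in range(40)]
async_result = asyncio.run(upload_in_batches(names, batch_size=20))
thread_result = upload_with_threads(names, jobs=4)
print(len(async_result), len(thread_result))
```

In both styles the per-file latency is paid once for every file, so the wall-clock time still grows with the file count. The difference is how many transfers are in flight at once and how much bookkeeping each one costs.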

There may also be some overhead from building the index, a SQLite database in which DVC keeps a record of the files.
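As a minimal sketch of why an index adds per-file work, the snippet below stores one row per tracked file in an in-memory SQLite database. The table name, columns, and `build_index` helper are assumptions made up for this example. They are not DVC's real schema or API.

```python
import sqlite3


def build_index(entries):
    """Create a toy file index with one row per file (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_index (path TEXT PRIMARY KEY, md5 TEXT, size INTEGER)"
    )
    # One INSERT per file: the work grows with the number of files,
    # regardless of how large each file is.
    conn.executemany("INSERT INTO file_index VALUES (?, ?, ?)", entries)
    conn.commit()
    return conn


# 1,000 made-up entries of 1 KiB each; the hashes are placeholders, not real MD5s.
entries = [(f"data/img_{i}.png", f"{i:032x}", 1024) for i in range(1000)]
conn = build_index(entries)

row_count = conn.execute("SELECT COUNT(*) FROM file_index").fetchone()[0]
total_size = conn.execute("SELECT SUM(size) FROM file_index").fetchone()[0]
print(row_count, total_size)
```

The number of rows written depends only on how many files there are, which is one way an index could contribute time that scales with file count rather than with data size.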

The add processing time also depends on the number of …

@Shin-ichi-Takayama
@skshetry
@Shin-ichi-Takayama
@skshetry
@Shin-ichi-Takayama
Answer selected by Shin-ichi-Takayama
Category
Help
3 participants