
[C++][FS][Azure] Expose parallel transfer config options available in the Azure SDK #40035

Open
Tom-Newton opened this issue Feb 11, 2024 · 9 comments

Comments

@Tom-Newton
Contributor

Describe the enhancement requested

Optimisation to #37511
Child of #18014

When reading from Azure Blob Storage, the bandwidth we get per connection is highly dependent on the latency to the filesystem. With high latency, far greater concurrency is needed to achieve good bandwidth. For example, this is relevant when reading from blob storage in a different region from your compute.

As an example, let's consider reading a Parquet file. There are two levels of parallelism that I'm aware of when using Arrow and the native AzureFileSystem:

  1. Arrow will make concurrent calls to ReadAt for each column and row group combination. At most we can have one concurrent connection per column and row group combination, so for small Parquet files this may be less concurrency than we would like.
  2. Within ReadAt, the AzureFileSystem calls BlobClient::DownloadTo, which implements some extra concurrency internally: https://github.com/Azure/azure-sdk-for-cpp/blob/ddd0f4bd075d6715ac3004136a690445c4cde5c2/sdk/storage/azure-storage-blobs/src/blob_client.cpp#L516. The purpose of this issue is to expose the config options controlling this parallelism to the user (see the sketch below).
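For reference, here is a rough sketch of how a ReadAt-style range read could forward those SDK options. The TransferOptions field names (InitialChunkSize, ChunkSize, Concurrency) come from the Azure SDK for C++; the wrapper function and the specific values are purely illustrative, not the actual azurefs.cc code.

    #include <cstddef>
    #include <cstdint>

    #include <azure/storage/blobs.hpp>

    // Sketch: a range read that forwards the SDK's parallel-transfer options.
    void ReadRange(Azure::Storage::Blobs::BlobClient& client, int64_t position,
                   int64_t nbytes, uint8_t* out) {
      Azure::Storage::Blobs::DownloadBlobToOptions options;
      Azure::Core::Http::HttpRange range;
      range.Offset = position;
      range.Length = nbytes;
      options.Range = range;
      // The parallel-transfer knobs this issue proposes exposing to users:
      options.TransferOptions.InitialChunkSize = 2 * 1024 * 1024;  // illustrative value
      options.TransferOptions.ChunkSize = 2 * 1024 * 1024;         // illustrative value
      options.TransferOptions.Concurrency = 10;                    // illustrative value
      client.DownloadTo(out, static_cast<std::size_t>(nbytes), options);
    }

If we expose these three values in AzureOptions, the change is essentially just plumbing them into DownloadBlobToOptions at the call site.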

Component(s)

C++

@Tom-Newton
Contributor Author

I'm going to run some performance tests for my use case.

@Tom-Newton
Contributor Author

Tom-Newton commented Feb 12, 2024

I can measure a clear performance advantage from tuning these parameters a bit.
Defaults: median=9.0 seconds, P95=12.5 seconds

Setting

        initial_chunk_size=2*1024*1024,
        chunk_size=2*1024*1024,
        concurrency=10

median=5.6 seconds, P95=6.7 seconds

I did not try especially hard to tune this, but I think it is enough evidence to justify exposing these config options. Exposing them is easy.

@Tom-Newton
Contributor Author

take

@Tom-Newton
Contributor Author

Given how small this change is, I think I will make one PR covering both C++ and Python. Therefore I will wait for #40021 to merge first.

@felipecrv
Contributor

I think exposing all these settings in AzureOptions may be premature. They are per-request settings, so a config in AzureOptions would force the internal implementation to stick to one set of values for every DownloadTo request.

My suggestion: we keep statistics about the ReadAt calls in a file handle and adjust the options as the calls come in. After this exercise we might expose settings in AzureOptions that express which policy should be used (assuming we can't come up with a good adaptive policy). What the policies would be called depends on which workloads we can isolate in the benchmarks: latency vs. throughput, sequential vs. random.

An alternative to the named policies could be to expose only read_file_max_concurrency [1] and keep learning a simple model that sets the best initial_chunk_size and chunk_size parameters adaptively. The goal becomes minimizing the latency of each individual ReadAt call; the user can set a low max concurrency factor for latency-optimized workloads and a high one for throughput-optimized workloads. A low concurrency factor would also be useful when the user of the FileSystem interface is managing multiple threads themselves.

[1] set it to some multiple of CPU cores by default
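For illustration, the minimal surface could look something like this; the field name read_file_max_concurrency and the default multiplier are hypothetical, not an agreed design:

    #include <algorithm>
    #include <cstdint>
    #include <thread>

    // Hypothetical addition to AzureOptions (sketch only, not the actual Arrow API).
    struct AzureOptions {
      // ... existing fields ...

      // Upper bound on concurrent range requests per ReadAt call. Chunk sizes
      // would be chosen internally (possibly adaptively), not exposed here.
      int32_t read_file_max_concurrency =
          static_cast<int32_t>(std::max(1u, std::thread::hardware_concurrency())) * 2;
    };

azurefs.cc could then clamp the SDK's TransferOptions.Concurrency to this value and pick the chunk sizes itself.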

@Tom-Newton
Contributor Author

These are all good suggestions but they are a lot more complex. Personally I would not be comfortable committing to implement something like that.

@felipecrv
Contributor

These are all good suggestions but they are a lot more complex. Personally I would not be comfortable committing to implement something like that.

Start exposing read_file_max_concurrency with a generous value and hardcode values for initial_chunk_size and chunk_size inside azurefs.cc to your liking, then open a separate issue that I can work on.

My goal is to avoid prematurely exposing settings that (1) are difficult to tweak in a well-informed way, and (2) would prevent future optimizations on our side based on real-world workloads.

The default settings of the SDK seem to be very conservative regarding parallelization, because most SDK users are likely making full-blob downloads inside a system that already manages multiple I/O threads; that explains their huge threshold for parallelization (>256 MB). We know Arrow workloads are probably small to medium ReadAt calls to the same file on a single thread (e.g. a Jupyter notebook driving the calls), so we benefit from the parallelism offered by the SDK: lower initial_chunk_size and chunk_size with a high concurrency.

@Tom-Newton Tom-Newton removed their assignment Mar 20, 2024
@Tom-Newton
Contributor Author

I unassigned myself because I don't really know when I can work on this, but I might pick it up again at a later date.

@Tom-Newton
Contributor Author

Tom-Newton commented May 19, 2024

I've been thinking a bit about a policy we can use to set these parameters automatically (as @felipecrv suggested). So far my only idea is to make each call to ReadAt one iteration of an optimisation algorithm, e.g. gradient descent optimising for ReadAt duration.

If we imagine varying chunk_size from very small to very large, the optimal performance is going to be somewhere in the middle, and I expect the optimum will depend on the latency and bandwidth between blob storage and the client.
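To make that concrete, one very rough interpretation is a per-file-handle hill climber that takes one step per ReadAt call, moving chunk_size in whichever direction last improved throughput. The class name, constants, and step rule below are purely illustrative, not a worked-out design:

    #include <algorithm>
    #include <chrono>
    #include <cstdint>

    // Hypothetical per-file-handle tuner: one step of a gradient-descent-like
    // search per ReadAt call, adjusting chunk size towards better throughput.
    class ChunkSizeTuner {
     public:
      int64_t chunk_size() const { return chunk_size_; }

      // Call after each ReadAt with the bytes read and the elapsed wall time.
      void Record(int64_t bytes, std::chrono::duration<double> elapsed) {
        const double throughput = bytes / std::max(elapsed.count(), 1e-9);
        if (throughput < last_throughput_) {
          direction_ = -direction_;  // the last step made things worse; reverse
        }
        last_throughput_ = throughput;
        chunk_size_ = std::clamp<int64_t>(chunk_size_ + direction_ * kStep,
                                          kMinChunk, kMaxChunk);
      }

     private:
      static constexpr int64_t kStep = 1 * 1024 * 1024;
      static constexpr int64_t kMinChunk = 1 * 1024 * 1024;
      static constexpr int64_t kMaxChunk = 64 * 1024 * 1024;
      int64_t chunk_size_ = 4 * 1024 * 1024;
      double last_throughput_ = 0.0;
      int direction_ = 1;
    };

A real policy would also have to cope with ReadAt calls of very different sizes, so this is only a starting point.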

I've also taken a look at how azcopy implements this, since my experience is that azcopy is very fast in a wide variety of situations without needing any configuration.
I thought it was interesting that it deliberately tries to read out of order: https://github.com/Azure/azure-storage-azcopy/blob/dae00e95050b5e3308106fd15313963694db18a8/cmd/copyEnumeratorHelper.go#L21
I think the downloading actually happens here: https://github.com/Azure/azure-storage-azcopy/blob/dae00e95050b5e3308106fd15313963694db18a8/ste/xfer-remoteToLocal-file.go#L307-L332
There is mention of applying an auto-correct policy to the block size if the user did not set it.

I'm definitely going to dig a bit deeper into what azcopy does before I commit to a particular strategy.
