OOM when using S3TransferManager.downloadDirectory() #4987

Closed
zzz8307 opened this issue Mar 4, 2024 · 20 comments
Assignees
Labels
bug This issue is a bug. crt-client p1 This is a high priority issue

Comments

@zzz8307

zzz8307 commented Mar 4, 2024

Describe the bug

Downloading a directory that contains hundreds of thousands of files from S3 with S3TransferManager.downloadDirectory() fails with an OutOfMemoryError.

Expected Behavior

S3TransferManager should work regardless of how many files there are or how large each file is.

Current Behavior

An OutOfMemoryError is thrown after some of the files have been downloaded.

Caused by: 
software.amazon.awssdk.core.exception.SdkClientException: Failed to send request
	at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:111) ~[sdk-core-2.24.13.jar!/:na]
	at software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:47) ~[sdk-core-2.24.13.jar!/:na]
	at software.amazon.awssdk.transfer.s3.internal.DownloadDirectoryHelper.lambda$doDownloadDirectory$2(DownloadDirectoryHelper.java:121) ~[s3-transfer-manager-2.24.13.jar!/:na]
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[na:na]
	at software.amazon.awssdk.transfer.s3.internal.AsyncBufferingSubscriber.onError(AsyncBufferingSubscriber.java:79) ~[s3-transfer-manager-2.24.13.jar!/:na]
	at software.amazon.awssdk.utils.async.DelegatingSubscriber.onError(DelegatingSubscriber.java:40) ~[utils-2.24.13.jar!/:na]
	at software.amazon.awssdk.core.internal.pagination.async.ItemsSubscription.lambda$fetchNextPage$0(ItemsSubscription.java:89) ~[sdk-core-2.24.13.jar!/:na]
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[na:na]
	at software.amazon.awssdk.utils.CompletableFutureUtils.lambda$forwardExceptionTo$0(CompletableFutureUtils.java:79) ~[utils-2.24.13.jar!/:na]
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[na:na]
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncApiCallMetricCollectionStage.lambda$execute$0(AsyncApiCallMetricCollectionStage.java:56) ~[sdk-core-2.24.13.jar!/:na]
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[na:na]
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncApiCallTimeoutTrackingStage.lambda$execute$2(AsyncApiCallTimeoutTrackingStage.java:67) ~[sdk-core-2.24.13.jar!/:na]
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[na:na]
	at software.amazon.awssdk.utils.CompletableFutureUtils.lambda$forwardExceptionTo$0(CompletableFutureUtils.java:79) ~[utils-2.24.13.jar!/:na]
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[na:na]
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.maybeAttemptExecute(AsyncRetryableStage.java:103) ~[sdk-core-2.24.13.jar!/:na]
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.maybeRetryExecute(AsyncRetryableStage.java:184) ~[sdk-core-2.24.13.jar!/:na]
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.lambda$attemptExecute$1(AsyncRetryableStage.java:159) ~[sdk-core-2.24.13.jar!/:na]
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[na:na]
	at software.amazon.awssdk.utils.CompletableFutureUtils.lambda$forwardExceptionTo$0(CompletableFutureUtils.java:79) ~[utils-2.24.13.jar!/:na]
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[na:na]
	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage.lambda$execute$0(MakeAsyncHttpRequestStage.java:108) ~[sdk-core-2.24.13.jar!/:na]
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[na:na]
	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage.completeResponseFuture(MakeAsyncHttpRequestStage.java:255) ~[sdk-core-2.24.13.jar!/:na]
	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage.lambda$executeHttpRequest$3(MakeAsyncHttpRequestStage.java:167) ~[sdk-core-2.24.13.jar!/:na]
	at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
	at java.base/java.lang.Thread.run(Thread.java:829) ~[na:na]
Caused by: java.util.concurrent.CompletionException: software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Java heap space
	at software.amazon.awssdk.utils.CompletableFutureUtils.errorAsCompletionException(CompletableFutureUtils.java:65) ~[utils-2.24.13.jar!/:na]
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncExecutionFailureExceptionReportingStage.lambda$execute$0(AsyncExecutionFailureExceptionReportingStage.java:51) ~[sdk-core-2.24.13.jar!/:na]
	at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) ~[na:na]
	... 32 common frames omitted
Caused by: software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Java heap space
	at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:111) ~[sdk-core-2.24.13.jar!/:na]
	at software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:47) ~[sdk-core-2.24.13.jar!/:na]
	at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:223) ~[sdk-core-2.24.13.jar!/:na]
	at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:218) ~[sdk-core-2.24.13.jar!/:na]
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.maybeRetryExecute(AsyncRetryableStage.java:182) ~[sdk-core-2.24.13.jar!/:na]
	... 23 common frames omitted
Caused by: java.lang.OutOfMemoryError: Java heap space

Reproduction Steps

private final S3AsyncClient s3Client = S3AsyncClient.crtBuilder()
        .region(Region.AP_EAST_1)
        .build();

private final S3TransferManager s3TransferManager = S3TransferManager.builder().s3Client(s3Client).build();

public Path downloadDirectory(Path path, String bucket, String key) {
    CompletableFuture<CompletedDirectoryDownload> future = downloadDirectory(bucket, key, path)
            .exceptionally(e -> {
                throw new GeneralException(e);
            });
    CompletedDirectoryDownload completedDirectoryDownload = future.join();

    List<FailedFileDownload> failedTransfers = completedDirectoryDownload.failedTransfers();
    if (!failedTransfers.isEmpty()) {
        ArrayList<CompletableFuture<CompletedFileDownload>> retryFutures = new ArrayList<>(failedTransfers.size());
        failedTransfers.forEach(transfer -> {
            CompletableFuture<CompletedFileDownload> retryFuture = downloadFile(transfer.request());
            retryFutures.add(retryFuture);
        });
        CompletableFuture.allOf(retryFutures.toArray(new CompletableFuture[0])).join();
    }
    return path;
}

public CompletableFuture<CompletedDirectoryDownload> downloadDirectory(String bucket, String key, Path destination) {
    DownloadDirectoryRequest request = DownloadDirectoryRequest.builder()
            .destination(destination)
            .bucket(bucket)
            .listObjectsV2RequestTransformer(l -> l.prefix(key))
            .build();
    return s3TransferManager.downloadDirectory(request).completionFuture();
}

public CompletableFuture<CompletedFileDownload> downloadFile(DownloadFileRequest request) {
    return s3TransferManager.downloadFile(request).completionFuture();
}

Possible Solution

No response

Additional Information/Context

No response

AWS Java SDK version used

2.24.13

JDK version used

java version "11.0.22" 2024-01-16 LTS Java(TM) SE Runtime Environment 18.9 (build 11.0.22+9-LTS-219) Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.22+9-LTS-219, mixed mode)

Operating System and version

Windows 10

@zzz8307 zzz8307 added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Mar 4, 2024
@debora-ito
Member

Hi @zzz8307 I'll try to repro the issue.

In the meantime, can you try limiting the memory used by specifying lower values for targetThroughputInGbps (default 10 Gbps) and maxNativeMemoryLimitInBytes?

Check the S3CrtAsyncClientBuilder javadoc for more info.

@debora-ito debora-ito removed the needs-triage This issue or PR still needs to be triaged. label Mar 6, 2024
@debora-ito debora-ito self-assigned this Mar 6, 2024
@debora-ito debora-ito added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. p2 This is a standard priority issue labels Mar 6, 2024
@zzz8307
Author

zzz8307 commented Mar 7, 2024

Hi @debora-ito, I added the limits to the S3AsyncClient as shown below, but the heap size still kept growing.

private final S3AsyncClient s3Client = S3AsyncClient.crtBuilder()
        .region(Region.AP_EAST_1)
        .targetThroughputInGbps(1.0)
        .maxNativeMemoryLimitInBytes(1L * 1024 * 1024 * 1024)
        .build();

You can reproduce the issue by duplicating 100k small files in the same directory on S3 and then downloading the folder. It looks like S3AsyncClient holds on to everything it downloads, so the GC fails to free the memory.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Mar 7, 2024
@jensvogt

We are facing the same problem. We need to split a 10 GB XML file into small pieces (~2 million small files). The files are stored temporarily on the local disk. Afterwards we tried to use transferManager.uploadDirectory to upload the whole directory to an S3 bucket. This produces an OOM after a while (even with a 10 GB Java heap).

In my opinion, the problem is in UploadDirectoryHelper.java:

private void doUploadDirectory(CompletableFuture<CompletedDirectoryUpload> returnFuture,
                                   UploadDirectoryRequest uploadDirectoryRequest) {

        Path directory = uploadDirectoryRequest.source();

        validateDirectory(uploadDirectoryRequest);

        Collection<FailedFileUpload> failedFileUploads = new ConcurrentLinkedQueue<>();
        List<CompletableFuture<CompletedFileUpload>> futures;

        try (Stream<Path> entries = listFiles(directory, uploadDirectoryRequest)) {
            futures = entries.map(path -> {
                CompletableFuture<CompletedFileUpload> future = uploadSingleFile(uploadDirectoryRequest,
                                                                                 failedFileUploads, path);

                // Forward cancellation of the return future to all individual futures.
                CompletableFutureUtils.forwardExceptionTo(returnFuture, future);
                return future;
            }).collect(Collectors.toList());
        }

        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                         .whenComplete((r, t) -> returnFuture.complete(CompletedDirectoryUpload.builder()
                                                                                               .failedTransfers(failedFileUploads)
                                                                                               .build()));
    }

This creates a huge list of CompletableFutures (2 million entries), which results in an OOM after a while. Since you do not care about the result of the individual file uploads (only failures are reported), you can skip the list altogether.

This seems to work:

private void doUploadDirectory(CompletableFuture<CompletedDirectoryUpload> returnFuture,
                                   UploadDirectoryRequest uploadDirectoryRequest) {

        Path directory = uploadDirectoryRequest.source();

        validateDirectory(uploadDirectoryRequest);

        Collection<FailedFileUpload> failedFileUploads = new ConcurrentLinkedQueue<>();

        try (Stream<Path> entries = listFiles(directory, uploadDirectoryRequest)) {
            entries.forEach(path -> {
                CompletableFuture<CompletedFileUpload> future = uploadSingleFile(uploadDirectoryRequest,
                                                                                 failedFileUploads, path);

                // Forward cancellation of the return future to all individual futures.
                CompletableFutureUtils.forwardExceptionTo(returnFuture, future);
                CompletableFutureUtils.joinInterruptibly(future);
            });
        }

        returnFuture.complete(CompletedDirectoryUpload.builder().failedTransfers(failedFileUploads).build());
    }

@debora-ito
Member

@zzz8307 we are investigating the issue.

@jensvogt memory issues with UploadDirectory were reported in a separate issue - #4999 (comment) - and we released a fix. Can you try the latest SDK version?

@debora-ito debora-ito added crt-client p1 This is a high priority issue and removed p2 This is a standard priority issue labels Mar 14, 2024
@jensvogt
Copy link

jensvogt commented Mar 15, 2024

@debora-ito sure, I'll test the newest version.

But actually, this is not a memory leak; it's simply a bad design. If I want to upload 2 million files, the uploadDirectory method collects 2 million CompletableFutures in a Java ArrayList, which results in a huge memory allocation. I wonder whether holding 2 million CompletableFutures in memory is really needed. Maybe there is a cleverer way to collect the results of the CompletableFutures inside the directory filename stream.

@jensvogt

@debora-ito I created a new issue for the upload problem, as it is slightly different from the issue described here. See #5023

@zzz8307
Author

zzz8307 commented Mar 17, 2024

Hi @jensvogt, @debora-ito,
I think both issues share the same root cause: S3TransferManager holds all CompletableFutures in memory when downloading or uploading a large number of files. Currently I’m using downloadFile() and maintaining the directory structure myself, collecting the futures every 1000 files as a workaround. This solves the OOM issue.
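
A minimal sketch of the batching workaround described above, written against plain java.util.concurrent so it runs without the SDK. The class BatchedTransfers, the method runInBatches, and the startTransfer callback are hypothetical names, not SDK APIs; in the setup from this thread the callback would presumably wrap something like s3TransferManager.downloadFile(request).completionFuture().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Hypothetical helper, not part of the AWS SDK. */
final class BatchedTransfers {

    /**
     * Starts one transfer per key, but waits for every full batch to finish
     * before starting the next one, so this code references at most
     * batchSize futures at any time.
     *
     * @return the number of transfers that completed exceptionally
     */
    static <T> int runInBatches(Iterable<String> keys,
                                int batchSize,
                                Function<String, CompletableFuture<T>> startTransfer) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<CompletableFuture<T>> batch = new ArrayList<>(batchSize);
        int failed = 0;
        for (String key : keys) {
            batch.add(startTransfer.apply(key));
            if (batch.size() == batchSize) {
                failed += drain(batch);
            }
        }
        // Wait for the last, possibly partial, batch.
        failed += drain(batch);
        return failed;
    }

    /** Joins every future in the batch, counts failures, then drops the references. */
    private static <T> int drain(List<CompletableFuture<T>> batch) {
        int failed = 0;
        for (CompletableFuture<T> future : batch) {
            try {
                future.join();
            } catch (RuntimeException e) {
                // join() reports failures as CompletionException or CancellationException.
                failed++;
            }
        }
        batch.clear();
        return failed;
    }
}
```

The trade-off is that each batch waits for its slowest transfer, so throughput dips at batch boundaries; the batch size of 1000 mentioned above is the reporter's choice, not a recommended value.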

@jensvogt

@zzz8307 Yes, you're right. We did the same as a workaround. Currently, we're using a "paged" solution, where pages of 10000 files are uploaded using transferManager.uploadDirectory. It would be nicer if the AWS SDK took care of the paging.

The failed uploads are collected anyway (the number of failed uploads should be much smaller than the total), so there is no need to collect the successes. The number of successes is simply (total - failed).
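
The idea of recording only failures and deriving successes as (total - failed) can be sketched as follows. This is an illustration using plain java.util.concurrent, not the SDK's implementation; FailureOnlyTracker, Summary, transferAll, and the maxInFlight limit are all hypothetical names and assumptions.

```java
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

/** Hypothetical helper, not part of the AWS SDK. */
final class FailureOnlyTracker {

    /** Result holder: only failed keys are stored, successes are derived. */
    static final class Summary {
        final long total;
        final Queue<String> failedKeys;

        Summary(long total, Queue<String> failedKeys) {
            this.total = total;
            this.failedKeys = failedKeys;
        }

        long succeeded() {
            return total - failedKeys.size();
        }
    }

    /**
     * Starts one transfer per key with at most maxInFlight running at once.
     * No list of futures is kept; each future only gets a callback that
     * records a failure (if any) and returns its permit.
     */
    static Summary transferAll(Iterable<String> keys,
                               int maxInFlight,
                               Function<String, CompletableFuture<?>> startTransfer) {
        Semaphore permits = new Semaphore(maxInFlight);
        Queue<String> failedKeys = new ConcurrentLinkedQueue<>();
        long total = 0;
        for (String key : keys) {
            permits.acquireUninterruptibly(); // blocks while maxInFlight transfers are running
            total++;
            CompletableFuture<?> future;
            try {
                future = startTransfer.apply(key);
            } catch (RuntimeException e) {
                // The transfer could not even be started: count it as failed.
                failedKeys.add(key);
                permits.release();
                continue;
            }
            future.whenComplete((result, error) -> {
                if (error != null) {
                    failedKeys.add(key);
                }
                permits.release();
            });
        }
        // Holding every permit means every started transfer has completed.
        permits.acquireUninterruptibly(maxInFlight);
        return new Summary(total, failedKeys);
    }
}
```

Memory use here grows with the number of failures and with maxInFlight, not with the total number of files.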

@zoewangg
Contributor

For downloadDirectory, we don't actually store all CompletableFutures in a list. I think it's CompletableFutureUtils.forwardExceptionTo that prevents the futures from getting GC'd.
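
To illustrate why forwarding from one long-lived future could keep many short-lived futures reachable, here is a small sketch. The forwardExceptionTo method below is a simplified stand-in written for this example and is only assumed to resemble the SDK utility; ForwardingRetentionDemo and dependentsBeforeAndAfterCompletion are hypothetical names. It relies on the JDK method CompletableFuture.getNumberOfDependents(), which returns an estimate of the pending callbacks registered on a future.

```java
import java.util.concurrent.CompletableFuture;

/** Illustration only; the forwarding helper is a simplified stand-in, not SDK source. */
final class ForwardingRetentionDemo {

    /** Forwards an exceptional completion of src to dst (assumed behavior). */
    static <T> CompletableFuture<T> forwardExceptionTo(CompletableFuture<T> src,
                                                       CompletableFuture<?> dst) {
        src.whenComplete((result, error) -> {
            if (error != null) {
                dst.completeExceptionally(error);
            }
        });
        return src;
    }

    /**
     * Registers one forwarder per child on a single parent future, completes
     * every child, and reports the parent's pending callbacks before and
     * after the parent itself completes.
     */
    static int[] dependentsBeforeAndAfterCompletion(int children) {
        CompletableFuture<Void> parent = new CompletableFuture<>();
        for (int i = 0; i < children; i++) {
            CompletableFuture<String> child = new CompletableFuture<>();
            forwardExceptionTo(parent, child);
            // The child is done, but the callback on the parent still captures it.
            child.complete("done");
        }
        int before = parent.getNumberOfDependents();
        parent.complete(null);
        int after = parent.getNumberOfDependents();
        return new int[] {before, after};
    }

    public static void main(String[] args) {
        int[] counts = dependentsBeforeAndAfterCompletion(1000);
        System.out.println("before=" + counts[0] + ", after=" + counts[1]);
    }
}
```

Each callback captures its child future, so as long as the parent stays incomplete, every child remains reachable through the parent even after the child has finished. In a directory transfer the parent would be the future returned to the caller, which completes only when the whole transfer ends.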

I'm working on the fix.

@zzz8307 just wanted to double check, what is the average size of the objects you are downloading?

@jensvogt we'll make the fix for uploadDirectory as well.

@jensvogt

@zoewangg Just for your info: we sometimes get a 10-12 GB XML file with ~2 million product XMLs. We split the 10 GB XML into small product XML files. Each product XML is roughly 4-8 kB. We use transferManager.downloadFile to download the 10 GB XML and transferManager.uploadDirectory to upload the 2 million small XML files. The download takes ~2-5 min, splitting ~20-40 min, and the upload ~1-2 h. Uploading each XML piece individually (using putObject instead of transferManager) is not an option, as it takes ~46 h. So it's mainly a performance issue.

@zzz8307
Author

zzz8307 commented Mar 19, 2024

@zoewangg in our use case we are downloading ~1 million files of between 1 and 100 KB each.

@zoewangg
Contributor

Hi @zzz8307 we released a fix in 2.25.15, can you try with the latest version?

https://github.com/aws/aws-sdk-java-v2/releases/tag/2.25.15

@zoewangg zoewangg added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Mar 21, 2024
@thai-op

thai-op commented Mar 21, 2024

@zoewangg I'm experiencing an issue when uploading a single 12 GB file via TransferManager and the CRT client: the container runs out of memory. I'm trying your fix right now. Fingers crossed!

@thai-op

thai-op commented Mar 21, 2024

It's progressing a bit further, but still not good enough. I'm using a 6 GB container to upload a single 12 GB file to S3, and it still runs out of memory :(

@thai-op

thai-op commented Mar 21, 2024

Using 2.25.15 btw

@zoewangg
Contributor

@thai-op this issue tracks the memory issue for the downloadDirectory method specifically.
The error you are seeing seems to be a different issue. Do you mind creating a new GH issue? Can you share your client configuration in the issue?

@thai-op

thai-op commented Mar 21, 2024

@zoewangg #5032 we can chat over there if you have any questions.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Mar 22, 2024
@zoewangg zoewangg added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Mar 22, 2024
@zzz8307
Author

zzz8307 commented Mar 26, 2024

Hi @zoewangg, I've tested the latest version and the memory issue has been fixed. Big thanks for the effort!

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Mar 26, 2024
@zoewangg
Contributor

Awesome, thanks for verifying. Closing the issue.

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
