S3 timing out on CI since 3.0.3 (tests using LocalStack) #1118

Open
daniel-frak opened this issue Mar 26, 2024 · 4 comments
Labels
component: s3 S3 integration related issue status: waiting-for-feedback Waiting for feedback from issuer

Comments


daniel-frak commented Mar 26, 2024

Type: Bug

Component:
S3

Describe the bug
Running any version of spring-cloud-aws-dependencies higher than 3.0.2 (tested with 3.0.3 through 3.1.1) causes S3 operations to time out on our GitLab CI runner (an AWS t3.xlarge instance):

software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Read timed out
	at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:111)
	at software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:47)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:223)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:83)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
	at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
	at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:224)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:173)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:80)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
	at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
	at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
	at software.amazon.awssdk.services.s3.DefaultS3Client.createBucket(DefaultS3Client.java:1156)
	at software.amazon.awssdk.services.s3.S3Client.createBucket(S3Client.java:1708)
	at [...]
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 1 failure: Unable to execute HTTP request: Read timed out
	Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 2 failure: Unable to execute HTTP request: Read timed out
	Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 3 failure: Unable to execute HTTP request: Read timed out
	Suppressed: java.lang.NullPointerException: Cannot invoke "java.util.Set.forEach(java.util.function.Consumer)" because the return value of "org.junit.jupiter.api.extension.ExtensionContext$Store.remove(Object, java.lang.Class)" is null
		at org.mockito.junit.jupiter.MockitoExtension.afterEach(MockitoExtension.java:190)

While the tests usually take 6-10 minutes, here they take up to 43 minutes before failing the build.

We run the tests using Maven and create a LocalStack instance using Testcontainers (1.19.7). The relevant code is:

s3Client = S3Client.builder()
  .endpointOverride(LOCAL_STACK_CONTAINER.getEndpointOverride(LocalStackContainer.Service.S3))
  .credentialsProvider(StaticCredentialsProvider.create(AwsBasicCredentials.create(
    LOCAL_STACK_CONTAINER.getAccessKey(), LOCAL_STACK_CONTAINER.getSecretKey())))
  .region(Region.of(LOCAL_STACK_CONTAINER.getRegion()))
  .build();

// ...

// It times out here (@BeforeEach):
s3Client.createBucket(request -> request.bucket(bucket));

// ...And here (@AfterEach):
s3Client.deleteBucket(request -> request.bucket(bucket));
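As a side note for anyone debugging the same symptom: capping the client's timeouts makes the hang surface in seconds rather than minutes. A minimal sketch (the durations are illustrative, and it assumes the Apache sync HTTP client is on the classpath, which is the SDK's usual default):

```java
import java.time.Duration;

import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.services.s3.S3Client;

// Sketch: cap connect/read times and the overall API call so a hung
// request fails fast instead of stalling the build for minutes.
S3Client s3Client = S3Client.builder()
        .httpClientBuilder(ApacheHttpClient.builder()
                .connectionTimeout(Duration.ofSeconds(5))  // TCP connect
                .socketTimeout(Duration.ofSeconds(10)))    // per read
        .overrideConfiguration(ClientOverrideConfiguration.builder()
                .apiCallTimeout(Duration.ofSeconds(30))        // whole call, incl. retries
                .apiCallAttemptTimeout(Duration.ofSeconds(10)) // each attempt
                .build())
        // ... endpointOverride / credentialsProvider / region as above ...
        .build();
```

With apiCallTimeout set, the SDK throws ApiCallTimeoutException once the budget is exhausted instead of retrying reads for the better part of an hour.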

Importantly, everything works perfectly well on a local machine (the entire build takes about 3.5 minutes); the issue only occurs on CI. Furthermore, the issue does not occur on version 3.0.2, where all S3 operations take milliseconds, as expected (even on CI).

I thought the new AWS SDK version might be at fault, so I tried downgrading the AWS SDK to the version used by 3.0.2:

<dependencyManagement>
	<dependencies>
		<dependency>
			<groupId>software.amazon.awssdk</groupId>
			<artifactId>bom</artifactId>
			<version>2.20.63</version>
			<type>pom</type>
			<scope>import</scope>
		</dependency>
		<dependency>
			<groupId>software.amazon.awssdk</groupId>
			<artifactId>s3-transfer-manager</artifactId>
			<version>2.20.63</version>
			<optional>true</optional>
		</dependency>
	</dependencies>
</dependencyManagement>

I found a similar problem on LocalStack's GitHub, so I tried running LocalStack with the suggested environment variables:

DockerImageName imageName = DockerImageName.parse("localstack/localstack:latest")
        .asCompatibleSubstituteFor("localstack/localstack");
return (LocalStackContainerWrapper) new LocalStackContainer(imageName)
        // https://github.com/localstack/localstack/issues/10340#issuecomment-2008323237
        .withEnv("GATEWAY_SERVER", "twisted")
        // https://github.com/localstack/localstack/issues/10340#issuecomment-1979282202
        .withEnv("PROVIDER_OVERRIDE_S3", "legacy_v2");

None of the above fixed the problem. I looked through the changes between 3.0.2 and 3.0.3 but couldn't find anything that stands out as a likely culprit. (EDIT: It turns out the AWS SDK is to blame after all. See the comment below.)

The bucket the tests are failing to create and then remove is simply called test-bucket, so I don't think name-related problems are to blame.

At the same time, I really need to update this dependency (I need the FIFO queue name bugfix introduced in 3.1.1), so I'm at a loss as to what to do.

The runner instance shows only mild memory and CPU consumption, so a lack of resources doesn't seem to be the problem. Below is a simplified version of our .gitlab-ci.yml:

variables:
  MAVEN_CLI_OPTS: >-
    -gs ./ci_settings.xml 
    -Dmaven.repo.local=${CI_PROJECT_DIR}/.m2/repository 
    -Djavax.net.ssl.trustStore=./cacerts
    --batch-mode
    --errors

image: [...]/maven:3.9.4-amazoncorretto-21

code_quality_analysis:
  before_script:
    - >
      export DOCKER_AUTH_CONFIG=[...]
  script:
    - > # Run tests
      mvn clean verify $MAVEN_CLI_OPTS
      -T 2C
# ...

I don't think the proxies in our ci_settings.xml are to blame, but just in case, here's a simplified version:

<settings xmlns="http://maven.apache.org/SETTINGS/1.2.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.2.0 https://maven.apache.org/xsd/settings-1.2.0.xsd">
  
  <proxies>
    <proxy>
      <id>[...]-http</id>
      <active>true</active>
      <protocol>http</protocol>
      <port>8080</port>
      <host>[...]</host>
      <nonProxyHosts>localhost|127.0.0.1|[...]</nonProxyHosts>
    </proxy>
    <proxy>
      <id>[...]-https</id>
      <active>true</active>
      <protocol>https</protocol>
      <port>8080</port>
      <host>[...]</host>
      <nonProxyHosts>localhost|127.0.0.1|[...]</nonProxyHosts>
    </proxy>
  </proxies>
</settings>

In summary, some change in 3.0.3 (EDIT: the AWS SDK update) is causing S3 operations to time out on our CI, and we can't figure out why.

Sample
Unfortunately, so far I have been unable to reproduce this problem outside of GitLab CI.


daniel-frak commented Mar 26, 2024

Apologies, it seems I previously downgraded the dependencies incorrectly. The following does fix the problem after all, at least on 3.0.3:

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>bom</artifactId>
            <version>2.20.63</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

Unfortunately, it doesn't work with 3.1.1:

Caused by: java.lang.NoSuchMethodError: 'software.amazon.awssdk.services.s3.S3BaseClientBuilder software.amazon.awssdk.services.s3.S3BaseClientBuilder.crossRegionAccessEnabled(java.lang.Boolean)'
	at io.awspring.cloud.autoconfigure.s3.S3AutoConfiguration.s3ClientBuilder(S3AutoConfiguration.java:77)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:140)
	... 100 more

EDIT:

I tried running 3.1.1 with AWS SDK 2.25.17 and the original problem persists (S3 operations time out on CI).
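For completeness, here is a quick way to confirm which SDK build actually ends up on the test classpath (a diagnostic sketch; getImplementationVersion reads the jar manifest's Implementation-Version attribute and may return null if it is absent):

```java
// Diagnostic: print the AWS SDK version actually resolved onto the
// classpath, and which jar the S3Client class was loaded from.
// Note: getCodeSource() can be null under some class loaders.
Package s3Pkg = software.amazon.awssdk.services.s3.S3Client.class.getPackage();
System.out.println("s3 artifact version: " + s3Pkg.getImplementationVersion());
System.out.println("loaded from: "
        + software.amazon.awssdk.services.s3.S3Client.class
                .getProtectionDomain().getCodeSource().getLocation());
```

Running this inside a test rules out a stale jar in the CI runner's Maven repository cache masquerading as the pinned version.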

daniel-frak (Author) commented:

It occurred to me that this is probably not the correct repo to create this issue in. I created another one for AWS SDK:

aws/aws-sdk-java-v2#5045

While I still think that this one might be useful, feel free to close it if you don't think it's relevant for spring-cloud-aws.

maciejwalkowiak (Contributor) commented:

@daniel-frak which LocalStack version are you using? It does happen that when AWS releases a new SDK, LocalStack sometimes stops behaving as expected; they then release a new version and all works well again. So if your issue still happens on LocalStack 3.2.0, I would recommend opening an issue there. The AWS team has nothing to do with LocalStack.

@maciejwalkowiak maciejwalkowiak added component: s3 S3 integration related issue status: waiting-for-feedback Waiting for feedback from issuer labels Mar 27, 2024

daniel-frak commented Mar 28, 2024

I tried with LocalStack 3.2.0, as well as latest. I also use LocalStack locally, and there it works perfectly well (on both versions). Nevertheless, I opened an additional issue there:

localstack/localstack#10565
