
etcd robustness test on Prow is flaky #17717

Open
siyuanfoundation opened this issue Apr 4, 2024 · 7 comments
Labels
area/robustness-testing area/testing priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. type/flake

Comments

@siyuanfoundation
Contributor

Which GitHub Action / Prow Jobs are flaking?

https://testgrid.k8s.io/sig-etcd-periodics#ci-etcd-robustness-amd64

Which tests are flaking?

The flakiness is not specific to any particular test.
For each Prow job, there is always one test run failing.
I don't think it is specific to LazyFS.

The majority of failures report an error like `Requiring minimal 200.000000 qps for test results to be reliable, got 72.143568 qps`.

It is always Run0 that fails. I'm not sure whether Prow just orders the results so that the failed run ranks at the top, or whether there is a warmup issue.
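For context, the failure mode is a throughput guard, not a correctness failure: the harness rejects runs whose observed QPS is too low for the results to be trusted. A minimal sketch of that kind of guard (hypothetical; `checkTrafficQPS` and its callers are illustrative, not the actual etcd robustness test code):

```go
package main

import "fmt"

// minQPS mirrors the threshold in the reported error message.
const minQPS = 200.0

// checkTrafficQPS returns an error when observed throughput is too low
// for the test results to be considered reliable.
func checkTrafficQPS(requests int, seconds float64) error {
	qps := float64(requests) / seconds
	if qps < minQPS {
		return fmt.Errorf("Requiring minimal %f qps for test results to be reliable, got %f qps", minQPS, qps)
	}
	return nil
}

func main() {
	// A slow run (e.g. contended disk I/O on a shared CI node) trips the guard.
	if err := checkTrafficQPS(722, 10.0); err != nil {
		fmt.Println(err)
	}
}
```

Under this reading, the flake is environmental: anything that starves the cluster of I/O or CPU during the run drops QPS below the threshold.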

GitHub Action / Prow Job link

https://testgrid.k8s.io/sig-etcd-periodics#ci-etcd-robustness-amd64

Reason for failure (if possible)

No response

Anything else we need to know?

No response

@siyuanfoundation
Contributor Author

siyuanfoundation commented Apr 4, 2024

cc @jmhbnz @BenTheElder @ArkaSaha30 @upodroid

@BenTheElder

Possible quick solution: use a memory-backed emptyDir for this test's scratch space (configure this in the prowjob pod, and make sure the test setup knows to use the mounted path for scratch space).
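A sketch of what that prowjob pod change could look like (volume and mount names are illustrative, and the test would have to be told to use the mounted path):

```yaml
spec:
  containers:
  - name: test
    volumeMounts:
    - name: scratch
      mountPath: /tmp/etcd-robustness   # test scratch space must point here
  volumes:
  - name: scratch
    emptyDir:
      medium: Memory    # tmpfs-backed; sidesteps shared-disk I/O contention
```

A memory-backed emptyDir counts against the pod's memory limit, so the job's memory request would need to cover the scratch data as well.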

@siyuanfoundation
Contributor Author

> Possible quick solution: use a memory-backed emptyDir for this test's scratch space (configure this in the prowjob pod, and make sure the test setup knows to use the mounted path for scratch space).

Thanks, Ben, for the suggestion. But for this test, I think we actually want to test with disk I/O.

@BenTheElder

At least on the GKE clusters, we're running on local SSDs IIRC, but we can't guarantee much more than that about I/O.

If this job is reduced to fit on n1-highmem-8 (leaving room for system-reserved), it could run on the k8s-infra Prow builds GKE cluster, and we could see if that's better.

I'm not sure what the disk config is on EKS; @upodroid or @xmudrii would know better.

@xmudrii
Contributor

xmudrii commented Apr 5, 2024

@BenTheElder @siyuanfoundation The disk config on EKS is also local NVMe SSDs, so disk performance shouldn't be a problem. However, EKS nodes are large, which leads to denser bin-packing: if your test lands on a node that already has other tests running, they may affect its I/O performance.

I see two options:

  • Try to reduce the job to fit on n1-highmem-8 on the GKE cluster
  • Increase resources for the job so that we have only that one job running on the node in the EKS cluster
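For the second option, one way to get a node effectively to yourself is to request most of the node's allocatable capacity so the scheduler can't co-locate other test pods. A hypothetical prowjob resource stanza sized for an r5ad.4xlarge (values are illustrative and leave headroom for system-reserved):

```yaml
resources:
  requests:
    cpu: "14"        # node has 16 vCPUs
    memory: 100Gi    # node has 128 GB RAM
  limits:
    cpu: "14"
    memory: 100Gi
```

The trade-off is cost: the job then occupies a whole large node for its duration even though it may not use all of it.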

@siyuanfoundation
Contributor Author

> @BenTheElder @siyuanfoundation The disk config on EKS is also local NVMe SSDs, so disk performance shouldn't be a problem. However, EKS nodes are large, which leads to denser bin-packing: if your test lands on a node that already has other tests running, they may affect its I/O performance.
>
> I see two options:
>
>   • Try to reduce the job to fit on n1-highmem-8 on the GKE cluster
>   • Increase resources for the job so that we have only that one job running on the node in the EKS cluster

Thanks @xmudrii. Do you know what machine type the EKS clusters use?

@xmudrii
Contributor

xmudrii commented Apr 8, 2024

@siyuanfoundation r5ad.4xlarge (16 vCPUs, 128 GB RAM, 2 x 300 GB NVMe SSD).
