
etcd robustness test on Prow is flaky #17717

Open
siyuanfoundation opened this issue Apr 4, 2024 · 7 comments
Labels
area/robustness-testing area/testing priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. type/flake

Comments

@siyuanfoundation
Contributor

Which GitHub Action / Prow Jobs are flaking?

https://testgrid.k8s.io/sig-etcd-periodics#ci-etcd-robustness-amd64

Which tests are flaking?

The flakiness is not specific to any particular test.
For each Prow job, there is always one test run failing.
I don't think it is specific to LazyFS.

The majority of failures report an error like `Requiring minimal 200.000000 qps for test results to be reliable, got 72.143568 qps`.

It is always Run0 that fails. I'm not sure whether Prow just orders the results so that the failed run ranks at the top, or whether there is a warmup issue.
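For context, the failure mode is a throughput guard, not a correctness failure: the harness rejects runs whose observed QPS is too low for the results to be trusted. A minimal sketch of that kind of guard (hypothetical; `checkTrafficQPS` and its callers are illustrative, not the actual etcd robustness test code):

```go
package main

import "fmt"

// minQPS mirrors the threshold in the reported error message.
const minQPS = 200.0

// checkTrafficQPS returns an error when observed throughput is too low
// for the test results to be considered reliable.
func checkTrafficQPS(requests int, seconds float64) error {
	qps := float64(requests) / seconds
	if qps < minQPS {
		return fmt.Errorf("Requiring minimal %f qps for test results to be reliable, got %f qps", minQPS, qps)
	}
	return nil
}

func main() {
	// A slow run (e.g. contended disk I/O on a shared CI node) trips the guard.
	if err := checkTrafficQPS(722, 10.0); err != nil {
		fmt.Println(err)
	}
}
```

Under this reading, the flake is environmental: anything that starves the cluster of I/O or CPU during the run drops QPS below the threshold.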

GitHub Action / Prow Job link

https://testgrid.k8s.io/sig-etcd-periodics#ci-etcd-robustness-amd64

Reason for failure (if possible)

No response

Anything else we need to know?

No response

@siyuanfoundation
Contributor Author

siyuanfoundation commented Apr 4, 2024

cc @jmhbnz @BenTheElder @ArkaSaha30 @upodroid

@BenTheElder

Possible quick solution: use a memory-backed emptyDir for this test's scratch space (configure this in the prowjob pod, and make sure the test setup knows to use the mounted path for scratch space).
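A sketch of what that prowjob pod change could look like (volume and mount names are illustrative, and the test would have to be told to use the mounted path):

```yaml
spec:
  containers:
  - name: test
    volumeMounts:
    - name: scratch
      mountPath: /tmp/etcd-robustness   # test scratch space must point here
  volumes:
  - name: scratch
    emptyDir:
      medium: Memory    # tmpfs-backed; sidesteps shared-disk I/O contention
```

A memory-backed emptyDir counts against the pod's memory limit, so the job's memory request would need to cover the scratch data as well.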

@siyuanfoundation
Contributor Author

> Possible quick solution: use a memory-backed emptyDir for this test's scratch space (configure this in the prowjob pod, and make sure the test setup knows to use the mounted path for scratch space).

Thanks, Ben, for the suggestion. But for this test, I think we actually want to test with disk I/O.

@BenTheElder

At least on the GKE clusters, we're running on local SSDs IIRC, but we can't guarantee much more than that about I/O.

If this job is reduced to fit on n1-highmem-8 (leaving room for system-reserved), it could run on the k8s-infra Prow builds GKE cluster, and we could see if that's better.

I'm not sure what the disk config is on EKS; @upodroid or @xmudrii would know better.

@xmudrii
Contributor

xmudrii commented Apr 5, 2024

@BenTheElder @siyuanfoundation The disk config on EKS is also local NVMe SSDs, so disk performance shouldn't be a problem. However, EKS nodes are large, which leads to denser bin-packing: if your test lands on a node that already has other tests running, they may affect its I/O performance.

I see two options:

  • Try to reduce the job to fit on n1-highmem-8 on the GKE cluster
  • Increase resources for the job so that we have only that one job running on the node in the EKS cluster
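For the second option, one way to get a node effectively to yourself is to request most of the node's allocatable capacity so the scheduler can't co-locate other test pods. A hypothetical prowjob resource stanza sized for an r5ad.4xlarge (values are illustrative and leave headroom for system-reserved):

```yaml
resources:
  requests:
    cpu: "14"        # node has 16 vCPUs
    memory: 100Gi    # node has 128 GB RAM
  limits:
    cpu: "14"
    memory: 100Gi
```

The trade-off is cost: the job then occupies a whole large node for its duration even though it may not use all of it.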

@siyuanfoundation
Contributor Author

> @BenTheElder @siyuanfoundation The disk config on EKS is also local NVMe SSDs, so disk performance shouldn't be a problem. However, EKS nodes are large, which leads to denser bin-packing: if your test lands on a node that already has other tests running, they may affect its I/O performance.
>
> I see two options:
>
>   • Try to reduce the job to fit on n1-highmem-8 on the GKE cluster
>   • Increase resources for the job so that we have only that one job running on the node in the EKS cluster

Thanks @xmudrii. Do you know what machine type the EKS clusters use?

@xmudrii
Contributor

xmudrii commented Apr 8, 2024

@siyuanfoundation r5ad.4xlarge (16 vCPUs, 128 GB RAM, 2 x 300 GB NVMe SSD).
