etcd robustness test on Prow is flaky #17717
Comments
Possible quick solution: use a memory-backed emptyDir for this test's scratch space (configure it in the prowjob pod, and make sure the test setup knows to use the mounted path for scratch space).
Thanks Ben for the suggestion. But for this test, I think we actually want to test with disk I/O.
At least on the GKE clusters, we're running on local SSD IIRC, but we can't guarantee much more than that about I/O. If this job is reduced to fit on n1-highmem-8 (and leave room for system-reserved), then it could run on the k8s infra prow builds GKE cluster to see if that's better. I'm not sure what the disk config is on EKS; upodroid or @xmudrii would know better.
@BenTheElder @siyuanfoundation The disk config on EKS is also local NVMe SSDs, so disk performance shouldn't be a problem. However, EKS nodes are large, which means more jobs get bin-packed onto each node. If your test lands on a node that already has other tests running, that might affect I/O performance. I see two options:
Thanks @xmudrii. Do you know what machine type the EKS clusters use?
@siyuanfoundation r5ad.4xlarge (16 vCPUs, 128 GB RAM, 2 x 300 NVMe SSD).
Which Github Action / Prow Jobs are flaking?
https://testgrid.k8s.io/sig-etcd-periodics#ci-etcd-robustness-amd64
Which tests are flaking?
The flakiness is not specific to any particular test.
In each Prow job, there is always one failing test run.
I don't think it is specific to LazyFS.
The majority of failures report an error like:
Requiring minimal 200.000000 qps for test results to be reliable, got 72.143568 qps
It is always Run0 that fails. Not sure whether Prow just orders the results so that failed tests rank at the top, or whether there is a warmup issue.
Github Action / Prow Job link
https://testgrid.k8s.io/sig-etcd-periodics#ci-etcd-robustness-amd64
Reason for failure (if possible)
No response
Anything else we need to know?
No response