Fix flaky compliance checks #291

Open
wants to merge 1 commit into
base: master

Conversation

tatsuhiro-t
Contributor

Previously, the compliance check searched stdout and stderr for the
strings "exited with code 127" or "exit status 127". It turns out that
these strings are occasionally missing. To work around this, this
commit checks the exit code of the relevant container directly, using
the docker-compose --exit-code-from flag. The flag implies
--abort-on-container-exit.
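
The difference between the two approaches can be sketched in Python as follows. This is a minimal illustration, not the runner's actual code: the helper names are hypothetical, and a plain Python subprocess stands in for the docker-compose invocation so that the sketch runs without Docker.

```python
import subprocess
import sys
from typing import List, Tuple

# Exit code the compliance check looks for, per the description above.
EXPECTED_EXIT_CODE = 127

# The two substrings the previous check searched for in stdout + stderr.
MARKERS = ("exited with code 127", "exit status 127")


def output_mentions_127(output: str) -> bool:
    """Old approach: look for a marker string in the captured output."""
    return any(marker in output for marker in MARKERS)


def run_and_capture(cmd: List[str], timeout: float = 60.0) -> Tuple[int, str]:
    """Run cmd and return (exit code, combined stdout + stderr)."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout


def exited_with_127(cmd: List[str]) -> bool:
    """New approach: rely on the exit code alone, ignoring the output."""
    code, _ = run_and_capture(cmd)
    return code == EXPECTED_EXIT_CODE


if __name__ == "__main__":
    # Stand-in for a container that exits with 127 without printing anything.
    silent_127 = [sys.executable, "-c", "import sys; sys.exit(127)"]
    code, out = run_and_capture(silent_127)
    print("exit code check:", code == EXPECTED_EXIT_CODE)
    print("string check:", output_mentions_127(out))
```

A process that exits with 127 but prints no marker line is caught by the exit code check and missed by the string check, which is the failure mode this commit works around.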

When running an actual test case, the interop runner also checks
whether the test case is supported by an implementation. The same
method cannot be applied there because we only get an exit code from a
single service. However, the cost of not detecting an unsupported test
case is low: it just results in a failed test. In contrast, a failed
compliance check skips all test cases for that particular client and
server combination.
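
To illustrate why a single exit code is not enough for a real test case, here is a small sketch. It assumes that 127 signals "unsupported" for every service and that per-service exit codes are available as a dictionary; the service names and both helper functions are hypothetical and not part of the runner.

```python
from typing import Dict, List

# Assumed meaning: a service exits with 127 when it does not support the test case.
UNSUPPORTED_EXIT_CODE = 127


def unsupported_services(exit_codes: Dict[str, int]) -> List[str]:
    """With every service's exit code, list the services that reported 127."""
    return sorted(
        name for name, code in exit_codes.items() if code == UNSUPPORTED_EXIT_CODE
    )


def seen_via_single_service(exit_codes: Dict[str, int], service: str) -> bool:
    """With only one service's exit code, report whether that service returned 127."""
    return exit_codes[service] == UNSUPPORTED_EXIT_CODE


if __name__ == "__main__":
    # Made-up exit codes for one run: only the server reports "unsupported".
    codes = {"sim": 0, "client": 0, "server": 127}
    print(unsupported_services(codes))
    print(seen_via_single_service(codes, "client"))
```

In this made-up run the server does not support the test case, but a check that only sees the client's exit code would not notice.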

Collaborator

@marten-seemann marten-seemann left a comment

Why would the string not be present in rare cases?

@@ -130,12 +130,12 @@ def _check_impl_is_compliant(self, name: str) -> bool:
 "DOWNLOADS=" + downloads_dir.name + " "
 'SCENARIO="simple-p2p --delay=15ms --bandwidth=10Mbps --queue=25" '
 "CLIENT=" + self._implementations[name]["image"] + " "
-"docker-compose up --timeout 0 --abort-on-container-exit -V sim client"
+"docker-compose up --timeout 0 --exit-code-from client -V sim client"
Collaborator

We’re starting multiple containers, and we can’t know which one exits first.

Contributor Author

Why would the sim container exit before the client container? I think a compliant client exits first with exit code 127, and then docker-compose kills sim.

@tatsuhiro-t
Contributor Author

> Why would the string not be present in rare cases?

I do not know. docker-compose might be doing something unexpected, or it could be a race condition.

@tatsuhiro-t
Contributor Author

This issue showed up in a recent run:

https://github.com/marten-seemann/quic-interop-runner/actions/runs/3169239332/jobs/5161033391

Saving logs to logs.
ngtcp2 server not compliant.
Not compliant, skipping
Run took 0:00:06.716867
+------+--------+
|      | ngtcp2 |
+------+--------+
| neqo |        |
|      |        |
|      |        |
+------+--------+

https://github.com/marten-seemann/quic-interop-runner/actions/runs/3169239332/jobs/5161033427

Saving logs to logs.
ngtcp2 server not compliant.
Not compliant, skipping
Run took 0:00:05.832533
+--------+--------+
|        | ngtcp2 |
+--------+--------+
| msquic |        |
|        |        |
|        |        |
+--------+--------+

But the ngtcp2 server is fully compliant in the other combinations.
This also happens with other implementations and is not specific to the ngtcp2 server.

@larseggert
Contributor

I see the same issue when I run locally: spurious "non-compliant" errors that usually go away on the next run.

@marten-seemann
Collaborator

@larseggert Does this PR fix the problem?
