
regression tests getting stuck with jetty upgrade of v9.4.33 #5922

Closed
4devwithgit opened this issue Jan 27, 2021 · 28 comments
Labels: More Info Required · Question · Stale (for auto-closed stale issues and pull requests)

Comments

@4devwithgit commented Jan 27, 2021

Jetty version: 9.4.33

Java version: IBM JDK 8 SR6 FP20

After we upgraded Jetty in our product from v9.4.26 to 9.4.33/9.4.35, the regression test cases consistently get stuck after running for a couple of hours, at different test cases each time. But if we downgrade Jetty back to 9.4.26, the tests run as usual.
We do have some stack traces where Jetty classes run into exceptions, but we have yet to find a test case that reproduces the problem consistently.
We have also reviewed thread dumps taken while the test cases were hung, but they don't indicate any issue. Neither the VM nor the DB connectivity was found to be problematic, so we are at a loss to understand the issue.

So, any ideas to diagnose or resolve this issue are highly appreciated. We believe this behavior is related to the Jetty upgrade, but we don't have clear log errors to file a bug with, which is where we need help.

@gregw (Contributor) commented Jan 27, 2021

Not much there for us to go on! Perhaps share your thread dumps and stack traces?
Also, can you get a Jetty server dump as well?
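
For reference, if running `jstack <pid>` (or triggering a javacore with `kill -3` on IBM JDKs) from outside is awkward in a CI environment, a thread dump can also be captured from inside the JVM. A minimal sketch; the class name is illustrative and not part of any Jetty API:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

// Prints a thread dump of the current JVM to stderr, as an in-process
// alternative to running `jstack <pid>` from outside.
public class ThreadDump {
    public static void dump() {
        ThreadInfo[] threads = ManagementFactory.getThreadMXBean().dumpAllThreads(true, true);
        for (ThreadInfo info : threads) {
            System.err.printf("\"%s\" state=%s%n", info.getThreadName(), info.getThreadState());
            for (StackTraceElement frame : info.getStackTrace()) {
                System.err.println("\tat " + frame);
            }
        }
    }

    public static void main(String[] args) {
        dump();
    }
}
```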

@gregw (Contributor) commented Jan 27, 2021

Also, tell us about your app. Is it using async servlets? Async I/O? WebSocket? JDBC?

@4devwithgit (Author)

Here is the thread dump, though we don't see much related to the Jetty threads:
swathi_regression_jvm_dumps.zip

@4devwithgit (Author)

We are using Jetty as the HTTP server for the product Sterling B2B Integrator. We don't use async servlets, but the product does use JDBC and WebSocket.
I believe you mentioned a Jetty server dump. But since we don't really see errors in the Jetty threads, would a Jetty server dump really be useful here?

@4devwithgit (Author)

Any update or findings on this issue?

@4devwithgit (Author)

We have 30k test cases, so is the Jetty server dump going to help? The run gets stuck after 13.5k tests; we're just wondering if the dump will fill up the machine without generating the key data point we need.

@janbartel (Contributor)

The thread dumps don't even mention Jetty, and I don't see any Jetty classes listed in the Java classpath: it looks like all that is listed is just CruiseControl, not what CruiseControl is running. I would try running these tests outside of CruiseControl; maybe that will give you better thread dumps.

When you say the tests "hang", what does that mean? Is a garbage collection in progress? Are there enough server resources (file descriptors, memory, etc.)? Is CruiseControl itself experiencing a problem?

BTW, the suggestion of doing a server dump was so that we could see what your Jetty configuration and deployment look like.
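
For reference, a server dump can be produced from embedded code roughly like this (a sketch; it assumes you hold a reference to the `Server` instance):

```java
import org.eclipse.jetty.server.Server;

public class ServerDumpExample {
    public static void main(String[] args) throws Exception {
        Server server = new Server(8080);

        // Option 1: print the full component tree automatically once started.
        server.setDumpAfterStart(true);

        server.start();

        // Option 2: dump on demand, e.g. from a watchdog when a test hangs.
        System.err.println(server.dump());

        server.stop();
    }
}
```

The dump shows the component tree, including connectors, thread pool state, and deployed contexts, which is what "configuration and deployment" refers to above.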

@4devwithgit (Author)

Do you have any suggestions? If our tests are hung, how can we troubleshoot them with respect to Jetty?

@4devwithgit (Author)

The VM where the CI tests are running is at 75% CPU.

```
$ free -g
              total  used  free  shared  buff/cache  available
Mem:             45     8     1       0          36         36
Swap:            12     3     8
```
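
As a side note on the resource question above, open file descriptor counts can also be checked from inside the JVM on JDKs that expose com.sun.management (this may not be present on every IBM JDK build, hence the instanceof guard); a minimal sketch:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

import com.sun.management.UnixOperatingSystemMXBean;

public class FdCheck {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            // Compare current usage against the process limit (ulimit -n).
            System.out.printf("open fds: %d / max fds: %d%n",
                    unix.getOpenFileDescriptorCount(),
                    unix.getMaxFileDescriptorCount());
        } else {
            System.out.println("UnixOperatingSystemMXBean not available on this JVM");
        }
    }
}
```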

@4devwithgit (Author)

We are using JUnit 4.4.

@4devwithgit (Author)

@janbartel do you have any response based on my previous messages?
This is a critical issue for us and for all the customers of Sterling B2B Integrator, and we can't really upgrade from 9.4.26 to a more secure version (9.4.33, which has the PSIRT fix) given that our regression tests are not completing.

@jmcc0nn3ll (Contributor)

If time is an issue, you may want to consider support through webtide.com, since open source support is on an as-available basis, especially if you are hesitant to share information. This sort of triage is a normal aspect of that support, and is typically isolated or specific enough to a situation like yours as to be ill-suited for support in this project forum. It would be different if you could point to a specific commit or issue that is causing your problem, but asking for triage is a nebulous ask.

@gregw (Contributor) commented Feb 3, 2021

@4devwithgit sorry, but we just don't have enough information. We don't even know what "getting stuck" means in your context. Is it Jetty not responding? Or just a test that doesn't complete?

Of your 30k tests, you say the run is getting "stuck" after 13.5k of them. Can you identify the individual test it gets stuck on? Can you run just that test by itself? Does it pass? If you remove that test, do the remaining 29,999 tests pass, or do you just get stuck at test 13,501?

Ultimately we need to see something that is actually stuck, with a description of what it is stuck waiting for, ideally with a thread dump and a server dump to match. If you can provide us some of this information here, then we can assist in the open source project. But if you can't provide any more information publicly and this is time critical, then please do consider commercial support.

@joakime @lachlan-roberts Can you think of any websocket changes since 9.4.26 that could cause an app to become stuck?

@joakime (Contributor) commented Feb 3, 2021

Changes in websocket since 9.4.26

And lots of new tests, javadoc updates, and documentation updates.

@joakime (Contributor) commented Feb 3, 2021

@gregw if their code is using InputStream or Reader as a message delivery option with the javax.websocket API, and the OP has implemented some kind of workaround for message delivery (or message order) because of how the threading works with the streaming delivery options in 9.4.26, then those workarounds are likely the cause of the issues they are experiencing now.

Keep in mind that the InputStream and Reader options in the API are not designed for delivery of lots of messages on the websocket connection; they are designed for users that need a single, long-term stream of data over the connection. Think video transfer, audio transfer, games, etc. Those that use them to deliver many messages are often surprised by the need to dispatch each and every message to a new thread (per the API spec). Historically, this has resulted in users of the javax.websocket API not understanding that, because of the dispatch nature of the streaming API, messages can appear to arrive out of order to the application even though they actually arrived in order on the connection. 9.4.26 had this behavior (and many projects aware of it worked around it in their own code). 9.4.36 no longer does: we changed it to not read/parse the next message until the active onMessage(InputStream) call (or equivalent) has exited. This change was made for two reasons: to make things easier for users of the API, and to alleviate the thread usage spikes that occur when applications receive lots of small ("small" in this context is under 40MB) messages on the connection.
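
For context, the delivery mode being described looks roughly like this (a hypothetical endpoint; the path and class name are illustrative):

```java
import java.io.IOException;
import java.io.InputStream;

import javax.websocket.OnMessage;
import javax.websocket.server.ServerEndpoint;

// Streaming delivery of binary websocket messages via the javax.websocket API.
// In 9.4.26 each message was dispatched to its own thread, so onMessage calls
// could appear to run out of order; after the change described above, the next
// message is not read/parsed until this method returns.
@ServerEndpoint("/stream")
public class StreamingEndpoint {

    @OnMessage
    public void onMessage(InputStream in) throws IOException {
        byte[] buffer = new byte[8192];
        while (in.read(buffer) != -1) {
            // Consume the message body. Blocking here now also blocks
            // delivery of subsequent messages on the same connection.
        }
    }
}
```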

Finally, for this specific issue, we have no details on what the "stuck" is, what it means, or how it manifests.

@4devwithgit (Author) commented Feb 4, 2021

Thanks for the above explanation, @joakime @gregw @jmcc0nn3ll.

Just to clarify further:
9.4.26 - No issue seen; this is what the product is using right now.
9.4.33 - This has the PSIRT fix, so we wanted to upgrade to at least this version. But we see the issue here.
9.4.35 - We see the same issue here as well.
9.4.36 - We have not tested it yet.
Latest tests - 9.4.27 completes all the tests.

So, do the above explanations match the behavior seen with the versions used in our product?
Is there a workaround possible for the hung state, like some flag or code change, which we can try out in our product to overcome this issue?

We are not really hesitant to give more information on the issue, but we really don't have any concrete information:

  1. Logs - we can't enable Jetty logs; with 30k tests, the machine will run out of space before giving us a relevant data point.
  2. Thread dumps - already shared. But as you noted too, we really don't have anything pointing to the Jetty threads.
  3. JUnit test case - the issue is not specific to one test case. If we remove the specific test, it gets stuck on some other test after crossing the 13k or 14k mark.
  4. Please also note that the hung behavior is seen when the run moves from one test suite to another. It is not seen WHILE running a certain test case, but when it transitions between test suites; probably it hangs while loading the new test suite. So, when it is hung, the current suite will show 0 test cases run, while the previous one will have all its test cases executed.
  5. I will try to share the Jetty server dumps as soon as I can.
  6. To narrow down the issue with Jetty versions, we are trying out different versions between 9.4.26 and 9.4.36; I will update here as soon as I have the results.

@gregw (Contributor) commented Feb 4, 2021

So it is hanging between tests. Potentially Jetty is leaking something or filling something up? But it is still hard to say without your test framework.

Is this possible: use your test framework to start a Jetty server the way you currently start it, and deploy the simplest webapp possible. Then have a really simple test that you somehow duplicate 20k times. This might tickle the same problem and hang the test framework after 13k to 14k tests, in which case you can give us the whole thing, as it will not have your application in it.
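
A minimal sketch of such a reproducer loop with embedded Jetty (the handler, the trivial request, and the iteration count are illustrative assumptions, not the OP's actual setup):

```java
import java.net.HttpURLConnection;
import java.net.URL;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.eclipse.jetty.server.Request;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;
import org.eclipse.jetty.server.handler.AbstractHandler;

// Start a trivial server, send one request, stop it; repeat 20k times.
// If this hangs after ~13k iterations, it can be shared as-is, since it
// contains no application code.
public class StartStopReproducer {
    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 20_000; i++) {
            Server server = new Server(0); // port 0 = pick a free port
            server.setHandler(new AbstractHandler() {
                @Override
                public void handle(String target, Request baseRequest,
                                   HttpServletRequest request, HttpServletResponse response) {
                    response.setStatus(HttpServletResponse.SC_OK);
                    baseRequest.setHandled(true);
                }
            });
            server.start();

            int port = ((ServerConnector) server.getConnectors()[0]).getLocalPort();
            HttpURLConnection conn =
                    (HttpURLConnection) new URL("http://localhost:" + port + "/").openConnection();
            conn.getResponseCode(); // one trivial request per iteration
            conn.disconnect();

            server.stop();
            if (i % 1_000 == 0) {
                System.err.println("completed iteration " + i);
            }
        }
    }
}
```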

Even if that is impossible, if you can give us something that shows how you start Jetty, deploy webapps, send test requests, and stop the server after the test, then we can try the same.

Do you use the websocket client at all?

@gregw (Contributor) commented Feb 4, 2021

The other thing to do is to take your application and start/stop it 15k times in a similar environment to your test setup and see what happens.

@4devwithgit (Author)

Thanks for the suggestions, @gregw. I will see if I can generate data using the approaches you suggested.

Meanwhile, I ran the tests using 9.4.27 and 9.4.29, and both passed, i.e. no hung state for the tests.

We are now testing with these versions:
9.4.30
9.4.36

Does 9.4.36 have any known issues that we need to be aware of?

@4devwithgit (Author)

I see the regression tests are completing with 9.4.30 and 9.4.36. So, most likely the issue was introduced in Jetty v9.4.31/32.

We will evaluate whether we can upgrade to 9.4.36; since it's quite new, we need to review it.

Thanks
Dev

@joakime (Contributor) commented Feb 24, 2021

9.4.37.v20210219 has been released.

9.4.38 is in progress as well.

The OP still has not provided any actionable information about the reported regression.
No other users of these features (and we have some exceedingly aggressive users of the websocket features) have reported a regression.

@4devwithgit (Author)

With 9.4.36, our 6.0.3.4 release is working fine.
However, our next release, 6.1.0.2, shows the same behavior of tests getting stuck. We are planning to run some performance tests, and if we see anything related to Jetty, I will keep it posted here.

Is it possible to port the fix from the 9.4.33 version back to 9.4.26?
CVEs (details as of the time of ADV creation):
CVE ID: CVE-2020-27216
Description: Eclipse Jetty could allow a local authenticated attacker to gain elevated privileges on the system, caused by a race condition in the creation of the temporary subdirectory. By sending a specially-crafted request, an authenticated attacker could exploit this vulnerability to gain elevated privileges.
CVSS Base Score: 7.8
CVSS Temporal Score: see https://exchange.xforce.ibmcloud.com/vulnerabilities/190474 for more information
CVSS Vector: CVSS:3.0/AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H

Also, is it possible to reveal the details of the fix?

Thanks

@gregw (Contributor) commented Feb 25, 2021

@4devwithgit backporting fixes to specific versions is a service that we provide for our commercial support clients. We can't do that on an open source basis, or else we'd end up with infinite versions to support.

The details of the fix are in #5452, so you can build your own version.

Alternatively, use one of the workarounds and wait until a recent release is mature enough for you.

@4devwithgit (Author)

"Alternately, use one of the work arounds and wait until a recent release is mature enough for you." what is the work around you are referring to?

@joakime (Contributor) commented Feb 25, 2021

@joakime (Contributor) commented Feb 25, 2021

Note: there are two follow-up PRs that address issues within Multipart and PutFilter that are also impacted by the CVE you listed.
See PRs #5453 and #5458 as well.
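
For readers looking for the workaround referenced above: the mitigation commonly documented for CVE-2020-27216 is to give each webapp an explicit, pre-created work directory instead of relying on the shared, world-writable java.io.tmpdir default. A minimal embedded sketch (the WAR and directory paths are illustrative):

```java
import java.io.File;

import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.webapp.WebAppContext;

public class ExplicitTempDir {
    public static void main(String[] args) throws Exception {
        Server server = new Server(8080);

        WebAppContext webapp = new WebAppContext();
        webapp.setWar("/path/to/app.war"); // illustrative path

        // Pre-create a private work directory so Jetty never races to
        // create a temporary subdirectory under java.io.tmpdir.
        File workDir = new File(System.getProperty("user.home"), "jetty-work");
        workDir.mkdirs();
        webapp.setTempDirectory(workDir);

        server.setHandler(webapp);
        server.start();
        server.join();
    }
}
```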

@github-actions

This issue has been automatically marked as stale because it has been a full year without activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions bot added the Stale label on Feb 26, 2022
@github-actions

This issue has been closed due to it having no activity.
