Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WebSocket + Role-based authentication stopped working with Quarkus 3.9.x: Security Identity is not available #40307

Closed
lyrst opened this issue Apr 26, 2024 · 30 comments · Fixed by #40408

Comments

@lyrst
Copy link

lyrst commented Apr 26, 2024

Describe the bug

I hava a WebSocket-endpoint that is secured with the "RolesAllowed"-annotation.
In order to secure the websocket endpint I followed the approach that is described here #29919

The issue that the Test testFoo1Endpoint does not even run with a NullPointerException.

As soon as I switch to Quarkus 3.8.4 the tests fails with an assertion-related error.

Expected behavior

The testFoo1Endpoint-tests is able to run without the NullPointerException.

Actual behavior

The testFoo1Endpoint-tests is fails with a NullPointerException.

How to Reproduce?

Run the following example code.

quarkus-sec-issue.zip

Output of uname -a or ver

Darwin Kernel Version 23.4.0: Fri Mar 15 00:12:25 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6030 arm64

Output of java -version

No response

Quarkus version or git rev

3.9.4

Build tool (ie. output of mvnw --version or gradlew --version)

mvn

Additional information

No response

@lyrst lyrst added the kind/bug Something isn't working label Apr 26, 2024
@quarkus-bot
Copy link

quarkus-bot bot commented Apr 26, 2024

/cc @sberyozkin (security)

@sberyozkin
Copy link
Member

@lyrst Thanks for the reproducer
@mkouba @michalvavrik Are you aware of what might have changed that causes Websockets authenticator and configurator in the attached reproducer failing to provide the identity info with the token ? WebSockets next is an independent project AFAIK so should not have impacted the code which was running in 3.8.4

Thanks

@sberyozkin
Copy link
Member

I don't recall us providing any doc-ed recommendations along those lines, so may be we can recommend to switch to WebSockets Next to achieve the expected RBAC ?

@mkouba
Copy link
Contributor

mkouba commented Apr 26, 2024

Are you aware of what might have changed that causes Websockets authenticator and configurator in the attached reproducer failing to provide the identity info with the token ?

I have no idea.

WebSockets next is an independent project AFAIK so should not have impacted the code which was running in 3.8.4

Yes, I've created #40312 to track our security-related efforts for WebSockets Next.

@lyrst
Copy link
Author

lyrst commented Apr 26, 2024

I tried to use the new WebSocketNext extension as I thought that the RolesAllowed-annotation would work out of the box as it is based on the HttpServer (quote from documentation "The WebSocket handling reuses the main HTTP server.") but this does not seem to work ;-) But I guess I was lacking more information and as @mkouba created a new issue, it seems that more code changes are needed so make "WebSocketNext" work with the @RolesAllowed

@mkouba
Copy link
Contributor

mkouba commented Apr 26, 2024

I tried to use the new WebSocketNext extension as I thought that the RolesAllowed-annotation would work out of the box as it is based on the HttpServer (quote from documentation "The WebSocket handling reuses the main HTTP server.") but this does not seem to work ;) But I guess I was lacking more information ;) and as @mkouba created a new issue, it seems that more code changes are needed so make "WebSocketNext" work with the @RolesAllowed

Yes, the integration with the security extension is currently WIP.

@sberyozkin
Copy link
Member

Hi Martin, @mkouba, as far as this issue is concerned, can you please have a quick look at the #29919, as you helped there as well, and see if something rings a bell with respect to possible changes to the code referenced in #29919, between 3.8 and 3.9, np if nothing obvious springs to mind

@michalvavrik
Copy link
Contributor

NPE is thrown in Quarkus REST handler, I'll have a look.

@michalvavrik
Copy link
Contributor

I'm afraid this will take a CDI expert, at the very least to give me a hint. I'll write you an email.

@mkouba
Copy link
Contributor

mkouba commented Apr 29, 2024

Hm, so if I understand it correctly the failing test (ExampleResourceTest) does not involve a WebSocket connection at all. It's merely testing the JAX-RS resource, right?

@michalvavrik
Copy link
Contributor

Hm, so if I understand it correctly the failing test (ExampleResourceTest) does not involve a WebSocket connection at all. It's merely testing the JAX-RS resource, right?

That's correct. The HTTP request comes into the WebSocket handler, but leaves it as it does not detect the WS header.

@michalvavrik
Copy link
Contributor

When Reactive Routes are used together with Quarkus REST and some @RouteFilter is applied before the Quarkus REST begin processing, CDI request context comes active to the Quarkus REST. However Quarkus REST creates a new CDI request context state with io.quarkus.arc.ManagedContext#activate(). I verified that when invoked on the same thread, the CDI request context state keeps changing. Here are some of my debug logging, not sure if useful for anyone else:

io.quarkus.vertx.runtime.VertxCurrentContextFactory is fallback
io.quarkus.vertx.runtime.VertxCurrentContextFactory is duplicate
io.quarkus.vertx.runtime.VertxCurrentContextFactory old state (null) is not a next state (io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7)
io.quarkus.vertx.runtime.VertxCurrentContextFactory setting next state to io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7
TestIdentityAssociation.setIdentity deferred - thread 124 context state io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7
///// routing context  /hello/foo1
|||||||||||||||| end rt ctx
io.quarkus.vertx.runtime.VertxCurrentContextFactory is duplicate
cdi request scope activation in REST thread 124 context state io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7
io.quarkus.vertx.runtime.VertxCurrentContextFactory is duplicate
io.quarkus.vertx.runtime.VertxCurrentContextFactory old state (io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7) is not a next state (io.quarkus.arc.impl.RequestContext$RequestContextState@1478249c)
io.quarkus.vertx.runtime.VertxCurrentContextFactory setting next state to io.quarkus.arc.impl.RequestContext$RequestContextState@1478249c
TestIdentityAssociation.setIdentity sync - thread 124 context state io.quarkus.arc.impl.RequestContext$RequestContextState@1478249c
io.quarkus.vertx.runtime.VertxCurrentContextFactory is duplicate
io.quarkus.vertx.runtime.VertxCurrentContextFactory old state (io.quarkus.arc.impl.RequestContext$RequestContextState@1478249c) is not a next state (io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7)
io.quarkus.vertx.runtime.VertxCurrentContextFactory setting next state to io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7
TestIdentityAssociation.getDeferredIdentity thread 124 context state io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7
io.quarkus.vertx.runtime.VertxCurrentContextFactory is duplicate
io.quarkus.vertx.runtime.VertxCurrentContextFactory old state (io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7) is not a next state (io.quarkus.arc.impl.RequestContext$RequestContextState@1478249c)
io.quarkus.vertx.runtime.VertxCurrentContextFactory setting next state to io.quarkus.arc.impl.RequestContext$RequestContextState@1478249c
io.quarkus.vertx.runtime.VertxCurrentContextFactory is duplicate
io.quarkus.vertx.runtime.VertxCurrentContextFactory old state (io.quarkus.arc.impl.RequestContext$RequestContextState@1478249c) is not a next state (io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7)
io.quarkus.vertx.runtime.VertxCurrentContextFactory setting next state to io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7
io.quarkus.vertx.runtime.VertxCurrentContextFactory is duplicate
io.quarkus.vertx.runtime.VertxCurrentContextFactory old state (io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7) is not a next state (io.quarkus.arc.impl.RequestContext$RequestContextState@1478249c)
io.quarkus.vertx.runtime.VertxCurrentContextFactory setting next state to io.quarkus.arc.impl.RequestContext$RequestContextState@1478249c

Interesting is also why the state has changed, I added printing of a stack trace, however it's too long to put here. See the attachment.

What fixes (or works around?) the issue is for example this: bf7880ab3
test-log.txt
65734ebe7aaa265dae724cd0010cd20

I found other ways to work around the issue, but it really requires a feedback from a CDI experts to determine the one that should be used. Use reproducer from the description. Thank you

cc @mkouba @manovotn @Ladicek @geoand

@michalvavrik
Copy link
Contributor

My simplistic understanding is that the context state changes when subscribed Uni is created when the old context state is still in place, even though subscription is happening when the new context state is in place.

@manovotn
Copy link
Contributor

manovotn commented Apr 29, 2024

It seems like AbstractResteasyReactiveContext#requireCDIRequestScope() completely omits the case where you already have active req. context that you didn't start up yourself and instead always tries to enforce new, blank instance through ThreadSetupAction#activateInitial.
While I do not claim I understand the whole chain happening there (not even close in fact), this seems like an issue.

What fixes (or works around?) the issue is for example this: bf7880ab3

Yes, my debugging attempts landed at pretty much the same thing to resolve it - manovotn@7923d22

That being said, I am still unsure about:

  • Is it expected that you can get a req, context that's already active?
    • Seems like yes, because in this case, it's Vertx RouteHandler that does the activation and that seems perfectly valid?
  • Given that it is permitted scenario, should we have some special handling for its deactivation given that someone else activated it and might want to deactive/destroy later?
    • From my observations in this reproducer, the deactivation and destruction of context is handled by Vertx RouteHandler prior to the logic implemented in AbstractResteasyReactiveContext which then does no-op in this regard

@michalvavrik
Copy link
Contributor

michalvavrik commented Apr 29, 2024

Very nice @manovotn , thank you for looking at it! I might not be a good adept to answer the questions raised, maybe someone else will, but I'll try.

Yes, my debugging attempts landed at pretty much the same thing to resolve it - manovotn@7923d22

lgtm, but tests will be needed. I can prepare them.

Is it expected that you can get a req, context that's already active?
Seems like yes, because in this case, it's Vertx RouteHandler that does the activation and that seems perfectly valid?

I was asking myself a same thing, but considering that Reactive Routes are "finished product" (the way I see it; it's removed from RHBQ at least and recently dropped from SR JWT) I don't want to propose important changes there.

Given that it is permitted scenario, should we have some special handling for its deactivation given that someone else activated it and might want to deactive/destroy later?

My thinking was that we should not accept that users activate CDI request context. If user code or extensions activate CDI request context prior to Quarkus REST processing, they should destroy it as well. This is very early in processing and user code often needs Quarkus REST to prepare the request context for example with current identity and RoutingContext. We know how to handle and detect case when Reactive Routes activate it, but we can't really keep checking if someone destroyed active context during the processing inside Quarkus REST, can we? Wouldn't it require loop constantly checking as it can be async process?

So my thinking was that we allow it for Reactive Routes, but consider it "illegal state" when it comes in other cases. What should be done I'm not sure. Can we destroy the activate request context ourself and activate a new one? It is enough for cases where concrete state is kept, but not enough when someone simply calls requestContext.terminate(). It should be simply illegal to activate request context without termination so early. WDYT @geoand ?

From my observations in this reproducer, the deactivation and destruction of context is handled by Vertx RouteHandler prior to the login implemented in AbstractResteasyReactiveContext which then does no-op in this regard

+1

@manovotn
Copy link
Contributor

lgtm, but tests will be needed. I can prepare them.

That would be awesome, thank you.
I didn't mean to propose this as a final fix, just what I came up with while discovering what's going on :)

My thinking was that we should not accept that users activate CDI request context.

There are CDI APIs allowing you to programmatically activate req. context and even more of them in Arc itself. However, I don't think you'd be able to get that far without first activating it from within some other Quarkus part.

... but we can't really keep checking if someone destroyed active context during the processing inside Quarkus REST, can we? Wouldn't it require loop constantly checking as it can be async process?

Yea, if you go on tempering with context activation and destruction forcefully, you are IMO on your own.
I don't think we should assume it's being manipulated with; doing so would break user app expectations anyway (i.e. state of existing user defined beans for example).

So my thinking was that we allow it for Reactive Routes, but consider it "illegal state" when it comes in other cases. What should be done I'm not sure. Can we destroy the activate request context ourself and activate a new one? It is enough for cases where concrete state is kept, but not enough when someone simply calls requestContext.terminate(). It should be simply illegal to activate request context without termination so early. WDYT @geoand ?

I'd say it is generally easier to respect if the context is already active when your framework/integration needs it and just reuse that one. Whoever creates the context is also responsible for its destruction.
My original question was more along the lines of whether we can be sure that the context that we get here will outlast the window in which we need it to work.

@geoand
Copy link
Contributor

geoand commented Apr 30, 2024

Let me try and summarize so as to check if I have understood the issue correctly:

There is some piece of code that runs in a Vertx route that runs before Quarkus REST and that route activates the CDI context which then Quarkus REST completely ignores and starts a new one?
Also, I have not understood from the discussion what has changed between 3.8 and 3.9.

@michalvavrik
Copy link
Contributor

(in short :-)) @manovotn I agree with your conclusions / answers. Thank you.

My original question was more along the lines of whether we can be sure that the context that we get here will outlast the window in which we need it to work.

So far I believe we can in case of Reactive Routes.

There is some piece of code that runs in a Vertx route that runs before Quarkus REST and that route activates the CDI context which then Quarkus REST completely ignores and starts a new one?

yes

Also, I have not understood from the discussion what has changed between 3.8 and 3.9.

I think it comes from refactoring of the EagerSecurityHandler as if you move request context activation few lines above, it works. Problem is that we can't just accept that is the cause. There is a race. It would make for a fragile code that noone can touch. That's why I asked for help from CDI experts to find a root cause.

@geoand
Copy link
Contributor

geoand commented Apr 30, 2024

It's definitely possible to use the existing request scoped state if it exists, but is that really something we want to do in all cases?

@manovotn
Copy link
Contributor

It's definitely possible to use the existing request scoped state if it exists, but is that really something we want to do in all cases?

Can you think of a case where that would be an issue?
I would put it the other way and rather say that we shouldn't forcibly destroy previous context and put up a new one. That can easily break expectations of other code (be it user or extension) that activated it.

@geoand
Copy link
Contributor

geoand commented May 1, 2024

Can you think of a case where that would be an issue?

Not off hand

@geoand
Copy link
Contributor

geoand commented May 1, 2024

What I think can become problematic is how we handle the end of the request processing in Quarkus REST.

Currently as we always start the context, we always terminate it as well. If however now we move to not starting the context, what is the expectation going to be about closing it?

@geoand
Copy link
Contributor

geoand commented May 1, 2024

#40388 is a draft that reuses the request context which you folks can use to test.

@manovotn
Copy link
Contributor

manovotn commented May 1, 2024

What I think can become problematic is how we handle the end of the request processing in Quarkus REST.

Currently as we always start the context, we always terminate it as well. If however now we move to not starting the context, what is the expectation going to be about closing it?

That it is handled by whoever started it. Which is the case that I observed in this reproducer as well (see last bit of #40307 (comment))

#40388 is a draft that reuses the request context which you folks can use to test.

+1, this is very similar to what I proposed earlier (manovotn@7923d22) and to what Michal has (bf7880ab3).
Although I'd prefer if the activateInitial method actually does unconditional context activation but that's just minor comment :)

@geoand
Copy link
Contributor

geoand commented May 1, 2024

That it is handled by whoever started it

Theoretically yes, but is that actually the case in practice.
If so, we also need to update the patch to not destroy the context

@manovotn
Copy link
Contributor

manovotn commented May 1, 2024

That it is handled by whoever started it

Theoretically yes, but is that actually the case in practice. If so, we also need to update the patch to not destroy the context

A repeated invocation of context destruction is a no-op (as is the case in the reproducer).
I see your point though; we could instead alter it so that if we detect an active context, we don't even attempt to destroy it.
We should be able to do that through the requestScopeActivated boolean attribute with some additional checks in place.
I could take a look tomorrow as I am not in front of my laptop now (bank holiday here).

@geoand
Copy link
Contributor

geoand commented May 1, 2024

Right, that's my point.

I'll update my draft PR soon

@geoand
Copy link
Contributor

geoand commented May 1, 2024

I see your point though; we could instead alter it so that if we detect an active context, we don't even attempt to destroy it.
We should be able to do that through the requestScopeActivated boolean attribute with some additional checks in place

I didn't really find a way to do that, because I don't see how I can attach anything to the RequestContextState

@manovotn
Copy link
Contributor

manovotn commented May 2, 2024

I see your point though; we could instead alter it so that if we detect an active context, we don't even attempt to destroy it.
We should be able to do that through the requestScopeActivated boolean attribute with some additional checks in place

I didn't really find a way to do that, because I don't see how I can attach anything to the RequestContextState

@geoand you cannot attach anything to it.
We should be able to use the property inside AbstractResteasyReactiveContext to track whether the context originated here or elsewhere and perform cleanup conditionally.
Here's what I mean - #40408

@geoand
Copy link
Contributor

geoand commented May 2, 2024

We can do that I guess

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants