Identity API at /api/me #671

Merged
merged 2 commits into from Apr 29, 2022

minrk
Contributor

@minrk minrk commented Jan 20, 2022

Draft implementation of the identity API (closes #638). Includes #165.

Key issues:

  • Jupyter Server has never put any requirements on the return of get_user other than truthiness - this poses a challenge in coercing to a user model. I'm aware of two patterns:

    1. current_user is a string username (the default), and
    2. current_user is a dict with 'username' field or 'name' field (JupyterHub uses 'name')

    so these are supported, and others result in username: "unknown".
    We could take this opportunity to enforce a structure on .current_user, or we could require custom User objects to use a corresponding Authorizer class to define the new Authorizer.user_model method to adapt to the dict format.

  • checking permissions - the design in Add authorization layer to server request handlers #165 doesn't allow a priori listing of available resource+action combinations, so permissions to check are inputs to the API request. Currently, permissions input is two lists - resources and actions, but it could be a JSON dict of resource: [action,] (matching output).
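The two current_user patterns from the first key issue can be sketched roughly as follows. This is only an illustration of the coercion described above, not the code in this PR; the function name coerce_user_model is hypothetical.

```python
from typing import Any, Dict


def coerce_user_model(current_user: Any) -> Dict[str, str]:
    """Coerce an opaque, truthy current_user into a minimal user model.

    Hypothetical helper: the name and return shape are assumptions.
    """
    if isinstance(current_user, str):
        # Pattern 1: the default, a plain username string.
        return {"username": current_user}
    if isinstance(current_user, dict):
        # Pattern 2: a dict with 'username' or 'name' (JupyterHub uses 'name').
        for key in ("username", "name"):
            value = current_user.get(key)
            if isinstance(value, str) and value:
                return {"username": value}
    # Anything else is opaque to the server.
    return {"username": "unknown"}
```

Custom user objects fall through to "unknown" here, which is the case the proposed user_model hook would let implementers handle.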

TODO:

@SylvainCorlay
Contributor

Thanks for opening this.

I would be super happy to see a minimal version of this merged (the change to get_current_user). I would like to build upon it to create a new handler in the JupyterLab server extension listing all users connected to documents through the RTC WebSocket.

{
    "notebook:Untitled1": [
        USER1, USER3
    ],
    "notebook:example": [
        USER2, USER3, USER4
    ]
}

Whenever a user connects to or disconnects from a document via the RTC WebSocket, they would add or remove themselves from that dict.
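A minimal sketch of such a registry, assuming users are identified by username strings. The class DocumentPresence and its method names are hypothetical and are not part of JupyterLab or Jupyter Server.

```python
from collections import defaultdict
from typing import Dict, List, Set


class DocumentPresence:
    """Hypothetical registry of which users are connected to which document."""

    def __init__(self) -> None:
        self._users: Dict[str, Set[str]] = defaultdict(set)

    def connect(self, document: str, username: str) -> None:
        # Called when a user opens the RTC WebSocket for a document.
        self._users[document].add(username)

    def disconnect(self, document: str, username: str) -> None:
        # Called when the WebSocket closes; drop documents nobody is viewing.
        users = self._users.get(document)
        if users is None:
            return
        users.discard(username)
        if not users:
            del self._users[document]

    def to_json(self) -> Dict[str, List[str]]:
        # Sets are not JSON-serializable, so emit sorted lists.
        return {doc: sorted(users) for doc, users in self._users.items()}
```

A handler listing connected users could then return to_json() as its response body.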

@SylvainCorlay
Contributor

Hey @minrk, I think that it may be relevant to look at the ICurrentUser API that was added in issues and PRs (jupyterlab/jupyterlab#11443, jupyterlab/jupyterlab#11657, jupyterlab/jupyterlab#11774), in which it was discussed what fields should be included in the user model.

@minrk
Contributor Author

minrk commented Jan 27, 2022

@SylvainCorlay yes, absolutely!

IUser currently has these fields:

Readonly anonymous: boolean

    Whether the user is anonymous or not.

Optional Readonly avatar_url?: string

    User's avatar url. The url to the user's image for the icon.

Readonly color: string

    User's cursor color and icon color if avatar_url is undefined (there is no image).

Optional Readonly cursor?: Cursor

    User's cursor position on the document.

    If undefined, the user is not on a document.

Readonly displayName: string

    User's last name.

Readonly initials: string

    User's name initials.

Readonly name: string

    User's name.

Readonly username: string

    User's username.

First question: what fields should come from the server. At first glance, it seems like username, name, and displayName make sense. Should the server also be responsible for picking avatar url and color? What about anonymous?

Second question: what should be done when some fields are undefined? In almost all cases, username is going to be the only piece of information available, so the result is presumably: displayName=name=username, initials=username[0]. Should it be the responsibility of the server to fill out some or all of these, or should they be defined and null to let the UI make its own choices for generating defaults from username?

My first impression:

  • username is the only field required to be defined and non-null
  • name, displayName, initials must be present, but should be null (or empty strings?) if the identity provider does not have an authoritative answer (i.e. server should not derive these from username, but Lab may)
  • other fields in IUser may be present, and should be used by JupyterLab if they are (color, avatar_url), but they rarely will be, if ever. Or we could just say no here, at least for now.
  • not quite sure what to do for the 'anonymous' field - is that an implementation detail in Lab, derived from e.g. username == 'anonymous'?

@hbcarlos
Contributor

Hi @minrk, thank you so much for working on this!

If I understood correctly, the get_user method is a hook that server extensions providing the user's identity must override. If this is correct, in my opinion, username, name, display_name, initials, and avatar_url should come from the server. The reason is that all this information usually comes from the identity provider, and we can not assume displayName=name=username or initials=username[0]. The extension overriding the get_user method should define this logic depending on which identity provider the information comes from.

  • username is the unique identifier of the user.
  • name is the complete name of the user. Some identity providers split this information into name and family name or last name.
  • display_name is the name that users want to show to others and how they want to display it. Some identity providers provide this field because, depending on the region, names are written in a different order, or some parts are omitted (family name, name or name family name, etc.). There is a long discussion with some references here Drop familyName from the ICurrentUser interface? jupyterlab/jupyterlab#11657.
  • initials depend on how the name or display name is composed, so we can not assume anything.
  • avatar_url is the URL to the user's profile image. This URL is different depending on the identity provider and could be null.
  • color is the cursor color or the color used to identify the user on the UI. Even though it is pure frontend related, it is information that should be stored with the other fields, this way it doesn't change every time the user opens a new session. Not sure how to do it, I suggest leaving it as a frontend problem for now and we can iterate later to find a better solution.
  • cursor pure frontend stuff and we will handle that in the frontend.

In my opinion, we should enforce the extensions to specify the username, name, display_name, initials by making them non-null and writing proper documentation about each field. In the case of not having one of the fields, they can always use an empty string.

@minrk
Contributor Author

minrk commented Jan 31, 2022

Thanks for filling in!

in my opinion, username, name, display_name, initials, and avatar_url should come from the server. The reason is that all this information usually comes from the identity provider, and we can not assume displayName=name=username or initials=username[0].

Yes, I agree the server should provide this information when available. But in the ~90% of cases where username is all the identity provider gives us, my question is whether the server should communicate to JupyterLab directly that this information is absent by leaving empty strings (as JupyterLab assumes here), or should the server implement its own defaults for empty values?

I don't really understand the difference between name and displayName from JupyterLab's perspective. When would you use name instead of displayName? Where does JupyterLab use .name and not .displayName?

  • color is the cursor color or the color used to identify the user on the UI. Even though it is pure frontend related, it is information that should be stored with the other fields, this way it doesn't change every time the user opens a new session. Not sure how to do it, I suggest leaving it as a frontend problem for now and we can iterate later to find a better solution.

I think that makes sense. It makes sense to me to follow the same pattern as avatar_url: implement it in the server model so it can be provided if the IdP has such information, but it will usually be null which means Lab still has to choose.

@minrk minrk added the feature label Jan 31, 2022
@hbcarlos
Contributor

But in the ~90% of cases where username is all the identity provider gives us, my question is whether the server should communicate to JupyterLab directly that this information is absent by leaving empty strings (as JupyterLab assumes here), or should the server implement its own defaults for empty values?

If you mean the default implementation, we should follow the anonymous pattern implemented at the moment in the frontend, creating random usernames from the list of Jupiter's moons. But this should probably be a JupyterLab server extension. In any other case (Hub, or other solutions), if the IdP doesn't have the information, then those fields will be an empty string (the collaborative experience will be completely broken in those cases, but we can not do anything, that is the responsibility of who is deploying the application).

I don't understand the difference between name and displayName from JupyterLab's perspective. When would you use name instead of displayName? Where does JupyterLab use .name and not .displayName?

On the UI, we only use displayName. The difference is that some deployments could show the username, others only the name or name and family name. It is just a way of being more flexible because we can have infinite possibilities. This is an excellent reference to understand the problem with the name https://www.w3.org/International/questions/qa-personal-names.

@minrk
Contributor Author

minrk commented Jan 31, 2022

If you mean the default implementation,

I mean JupyterHub - normally, only a username is provided, no other information. So in the vast majority of fully authenticated deployments of Jupyter, a username is all we have to work with, so it's an important case to handle.

if the IdP doesn't have the information, then those fields will be an empty string (the collaborative experience will be completely broken in those cases, but we can not do anything, that is the responsibility of who is deploying the application).

Why is it broken? I don't see any reason that more than a username is needed for anything to function, only aesthetic refinement. This is going to be close to all cases in practice, so it probably shouldn't be broken, and it seems to be handled just fine by using name=display_name=username. I'm mainly asking whether you think it's useful to the frontend to be able to make those decisions when the information is unavailable (as the code already does today), or if it should be the responsibility of the backend to duplicate username into these fields (i.e. JupyterLab shouldn't know the duplication is happening).

we should follow the anonymous pattern implemented at the moment in the frontend

Moving that to the server makes sense, whether as an extension or in the default server itself. Probably for another PR, though, at least. What is the anonymous field used for in JupyterLab? Is that something that would be meaningful to expose from the server, or an internal detail?

@hbcarlos
Contributor

Why is it broken? I don't see any reason that more than a username is needed for anything to function, only aesthetic refinement. This is going to be close to all cases in practice, so it probably shouldn't be broken, and it seems to be handled just fine by using name=display_name=username. I'm mainly asking whether you think it's useful to the frontend to be able to make those decisions when the information is unavailable (as the code already does today), or if it should be the responsibility of the backend to duplicate username into these fields (i.e. JupyterLab shouldn't know the duplication is happening).

What I meant by broken is that if we leave the displayName empty, we don't see the user's name in the UI.
It is possible to use name=display_name=username, but we can not decide on the front end or the backend. That logic should be implemented by whoever overrides the get_user method. For example, I suppose that JupyterHub overrides that method to provide its user information; then Hub is the one deciding what should be in each of those fields. Because if we decide how to fill those fields, that will affect every IdP.

Moving that to the server makes sense, whether as an extension or in the default server itself. Probably for another PR, though, at least.

Sure, I can work on this later in another PR. That will help me to get into the server source code.

What is the anonymous field used for in JupyterLab?

That field is used to know whether the user is randomly generated on the frontend or is an identified user whose name comes from an IdP. The idea is to allow users to rename themselves. The problem is that if the user information comes from an identity provider, we can not rename the user, because we would need to PUT the info to the IdP, and we should not rename the user, because the identity is verified by the IdP: if we allowed renaming on the frontend, anyone could impersonate someone else.

Is that something that would be meaningful to expose from the server, or an internal detail

We can remove it for now because we can not rename users yet.

@codecov-commenter

codecov-commenter commented Feb 1, 2022

Codecov Report

Merging #671 (dbe1109) into main (73a03a6) will increase coverage by 0.16%.
The diff coverage is 80.12%.

@@            Coverage Diff             @@
##             main     #671      +/-   ##
==========================================
+ Coverage   69.96%   70.13%   +0.16%     
==========================================
  Files          62       63       +1     
  Lines        7368     7493     +125     
  Branches     1223     1251      +28     
==========================================
+ Hits         5155     5255     +100     
- Misses       1841     1857      +16     
- Partials      372      381       +9     
Impacted Files Coverage Δ
jupyter_server/auth/utils.py 93.10% <ø> (ø)
jupyter_server/gateway/handlers.py 31.87% <0.00%> (ø)
jupyter_server/serverapp.py 65.06% <50.00%> (-0.06%) ⬇️
jupyter_server/base/handlers.py 64.13% <55.55%> (-1.31%) ⬇️
jupyter_server/log.py 95.23% <83.33%> (-4.77%) ⬇️
jupyter_server/services/api/handlers.py 93.22% <84.61%> (-6.78%) ⬇️
jupyter_server/auth/identity.py 90.19% <90.19%> (ø)
jupyter_server/auth/__init__.py 100.00% <100.00%> (ø)
jupyter_server/auth/authorizer.py 88.88% <100.00%> (+1.38%) ⬆️
jupyter_server/auth/decorator.py 89.65% <100.00%> (+8.40%) ⬆️
... and 6 more


@minrk
Contributor Author

minrk commented Feb 1, 2022

What I meant with broken is that, If we leave the displayName empty, we don't see the user's name in the UI

Ah, sure. A value needs to be set for these fields eventually, by the time they are actually used. My question is about the (majority of) cases where some fields aren't available, which piece of code is responsible for filling in the derivative values, in the order in which they occur:

  1. The IdP adapter (e.g. JupyterHub) - this means duplicating the same logic in every adapter, and making backward-incompatible requirements for all current implementations of get_user.
  2. jupyter-server - single implementation, consistent spec, values always defined in the model retrieved by Lab (same as 1. from Lab's perspective, but much simpler and more broadly compatible)
  3. Lab and other clients - able to make decisions with the most available information (e.g. when using fallback displayName=username, render it differently somehow because it is known to be a 'username' and not a for-humans 'display name'.)

I think 2. is probably the simplest, while 3. allows the UI to make the most informed display decisions because it knows what fields are actually authoritative vs generated from fallbacks. 1. offers no benefits over 2, so I don't think it makes sense.

we can not decide on the front end or the backend.

I don't think that follows. We can't define display_name=username, but we can have:

if not display_name:
    display_name = username # or name, etc.

without any issues. JupyterLab already has exactly this logic: displayName !== '' ? displayName : name.

I've updated this PR to implement 2 to show you what I mean - these are purely fallbacks for when authoritative info for some fields is unavailable; the IdP adapter may return any and all fields it has an answer for (this includes computing its own derivative values, so 2. covers a strict superset of 1.), so this is just filling in the missing values when they are missing.

@echarles
Member

echarles commented Feb 1, 2022

This PR builds on top of the authorization branch, which makes perfect sense.

To get the gist of the changes related to identity, I use this link Zsailer/jupyter_server@authorization...minrk:identity

@hbcarlos
Contributor

hbcarlos commented Feb 1, 2022

I think 2. is probably the simplest, while 3. allows the UI to make the most informed display decisions because it knows what fields are actually authoritative vs generated from fallbacks. 1. offers no benefits over 2, so I don't think it makes sense.

I can't entirely agree with this part because, in the server or the frontend, we do not know if the IdP is using a UUID, email, or specific combination of the name to create a username (to create a unique identifier). I don't think it is good to display a UUID as the name for users. We are not the most informed to make a display decision.

Maybe the best we can do is introduce the anonymous attribute as mandatory, along with the username. The IdP then has two options:

  1. set anonymous=true, which means the user is authenticated but doesn't have an identity. The IdP adapter provides the username, and we create an anonymous name from one of Jupiter's moons, picked at random (the same logic we currently use on the frontend).
  2. set anonymous=false, which means the user is authenticated and has an identity. The IdP adapter then provides at least the username and display_name.
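The two options could look roughly like the sketch below. The function names and the short moon list are assumptions for illustration, and the sketch deliberately swaps the random choice for a hash of the username so that the same user keeps the same name across sessions.

```python
import hashlib

# A few of Jupiter's moons; the frontend list may differ (assumption).
MOONS = ["Io", "Europa", "Ganymede", "Callisto", "Amalthea", "Himalia", "Elara", "Thebe"]


def anonymous_display_name(username: str) -> str:
    """Pick a stable moon name from a hash of the username (hypothetical helper)."""
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(MOONS)
    return f"Anonymous {MOONS[index]}"


def identity_for(username: str, anonymous: bool, display_name: str = "") -> dict:
    """Build a minimal identity for the two options (hypothetical helper)."""
    if anonymous:
        # Option 1: authenticated but without an identity, so generate a name.
        display_name = anonymous_display_name(username)
    elif not display_name:
        # Option 2: identified users must come with a display_name.
        raise ValueError("identified users must provide display_name")
    return {"username": username, "anonymous": anonymous, "display_name": display_name}
```

With this shape, a UUID username never has to appear in the UI, because either the IdP adapter supplies display_name or a generated one is used.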

I've updated this PR to implement 2 to show you what I mean - these are purely fallbacks for when authoritative info for some fields is unavailable; the IdP adapter may return any and all fields it has an answer for (this includes computing its own derivative values, so 2. covers a strict superset of 1.), so this is just filling in the missing values when they are missing.

As a fallback, this is fine. But still, if the username is a UUID, we will be displaying UUIDs in the UI.

JupyterLab already has exactly this logic: displayName !== '' ? displayName : name.

That's not 100% true. The logic you are talking about is here and is a temporary solution to allow anonymous users to rename themselves by passing the name, color, and initials as parameters in the URL. And, for example, we are not extracting the initials from the name, as you can see in this code. Instead, we keep the old initials, forcing the user to enter new ones. The reason is that there are a lot of different cultures where names are entirely different, and we should never assume anything.

let name = decodeURIComponent(env.getParam('--username', ''));
let color = env.getParam('--usercolor', '');
let initials = decodeURIComponent(env.getParam('--initials', ''));
.
.
.
this._username = user.username as string;
this._name = name !== '' ? name : (user.name as string);
this._displayName = name !== '' ? name : (user.displayName as string);
this._initials = initials !== '' ? initials : (user.initials as string);

@minrk
Contributor Author

minrk commented Feb 2, 2022

I understand the concern about showing UUIDs to the user, that makes sense. I think it's reasonable, then, to keep the implementation of fallbacks I have here and put that requirement on the IdP adapter that if all it has for a username is a UUID, it should provide a (generated) real name as well. That can be the responsibility of the IdP and a suggestion to always provide a 'nice' name, since everything works correctly with any username, but the experience is more pleasant for UUID usernames. If users should have the right to edit anonymous names, it makes sense that this would be a server-side feature, as it will have server-side storage components.

I think we can work out the details of who generates/tracks these anonymous names in a second round and/or extension. If anonymous providers are common (or we just need it for the default case), an implementation of anonymous utilities may well make sense in this repo.

The reason is that there are a lot of different cultures where names are entirely different, and we should never assume anything.

I agree that's true - the problem is that we are implementing every step of this process, and initials aren't provided anywhere. So it's still up to 'us' to do it, just not JupyterLab. If JupyterLab forces Jupyter Server to do it, and if Server doesn't do it itself, it's forcing JupyterHub to do it. At no point in that process are initials available from a trustworthy source. So what should we do, then?

Can we eliminate the requirement for initials, since there's ~never a safe way to generate them? If the user should be able to pick them, then I think it should be JupyterLab's responsibility, as that's the only place they are used, and also the only place UI can feasibly be presented to do so. Maybe Server needs to be involved, too, for storage purposes, but I think this is principally a task for Lab.

@minrk
Contributor Author

minrk commented Feb 2, 2022

How's this for a general design:

  1. Identity provider should provide as much information as it has available.
  2. If only a username is available and usernames aren't very human friendly, IdP may derive name and/or display_name if there's an appropriate scheme based on the structure of usernames (e.g. email.split('@')[0] if only a single email domain is allowed, or an Anonymous Moon name picked based on the hash of a UUID username), but such transforms will not be done by Jupyter Server itself.
  3. Server will fill some derivative fields if left empty, but will never transform names due to a lack of localization info. That is:
     if not name:
         name = username
     if not display_name:
         display_name = name
  4. if there is no authoritative source for initials, it will be left as an empty string, because we do not do automatic transforms of names anywhere.

The result is that server can guarantee these fields are present and non-empty: username, name, display_name, but initials may be an empty string, and avatar_url, color may be null.
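Steps 3 and 4 of this design can be sketched as a single function. The name fill_defaults and the dict-based model are assumptions for illustration; the fallbacks themselves follow the list above, with no transformation of names.

```python
from typing import Any, Dict, Optional


def fill_defaults(identity: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Fill derivative fields left empty by the identity provider (hypothetical helper)."""
    username = identity.get("username")
    if not username:
        raise ValueError("username is required")
    # Copy values between fields, but never transform them.
    name = identity.get("name") or username
    display_name = identity.get("display_name") or name
    return {
        "username": username,
        "name": name,
        "display_name": display_name,
        # No authoritative source for initials means an empty string.
        "initials": identity.get("initials") or "",
        # Optional fields stay null when the provider has no answer.
        "avatar_url": identity.get("avatar_url") or None,
        "color": identity.get("color") or None,
    }
```

For a provider that only knows the username, fill_defaults({"username": "alice"}) yields "alice" for username, name, and display_name, an empty initials string, and null avatar_url and color.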

@hbcarlos
Contributor

hbcarlos commented Feb 3, 2022

Can we eliminate the requirement for initials, since there's ~never a safe way to generate them? If the user should be able to pick them, then I think it should be JupyterLab's responsibility, as that's the only place they are used, and also the only place UI can feasibly be presented to do so. Maybe Server needs to be involved, too, for storage purposes, but I think this is principally a task for Lab.

That's true. Usually, IdPs don't provide initials, so I think we can deal with the initials in the frontend for now and see what to do with them later. I was thinking of maybe swapping initials for random images of Jupiter's moons, à la Google Docs with animals, but we can think about that in a second round.

The result is that server can guarantee these fields are present and non-empty: username, name, display_name, but initials may be an empty string, and avatar_url, color may be null

I like the design you are presenting. Just one thing: I think it would be better to have initials as null instead of an empty string.
So, username, name, display_name non-empty and initials, avatar_url and color may be null.

@minrk
Contributor Author

minrk commented Feb 3, 2022

I think it would be better to have initials as null instead of an empty string.

👍 so null for all unspecified values. Makes sense, I'll get on that. I think I have all the info to get a complete draft implementation. Thank you!

This PR builds on top of the authorization branch, which makes perfect sense.

If it would help with review, I can remove the permissions field from this PR and rebase without #165. Then I could add the permissions field in a third PR once this and #165 are merged.

Configuration/implementation question: do folks have a preference for whether the identity-provider-adapter and Authorizer should be different classes or just one? I initially thought just one (currently reflected in this PR) since I suspect they will ~always go together, but it might be better for them to be separate classes to allow for one identity provider to be mixed with different authorization schemes (e.g. oauth + read-only, etc.).

  • pro single class: most 'real' cases (e.g. jupyterhub) will be a single implementation, and these can be selected with a single config value. Harder to have a non-sensible config by selecting only one of the two.
  • pro multiple classes: with a single class, every valid combination of identity provider and authorization pattern would need to be a separate class (or support class-level configuration to modify authorization).

@echarles
Member

echarles commented Feb 3, 2022

If it would help with review, I can remove the permissions field from this PR and rebase without #165. Then I could add the permissions field in a third PR once this and #165 are merged.

With the link I have shared Zsailer/jupyter_server@authorization...minrk:identity I am fine having this PR as it is and avoid additional work to you.

Authorization and authentication (aka identity) are both auth-related, and looking at them at the same time is an added value to me. I would understand if others think they are separate concerns that deserve separate PRs.

To me, both can remain in this PR.

Configuration/implementation question: do folks have a preference for whether the identity-provider-adapter and Authorizer should be different classes or just one?

They sound to me like separate classes, but I need to make sense of what you say in the above comment: Harder to have a non-sensible config by selecting only one of the two.

@minrk
Contributor Author

minrk commented Feb 3, 2022

They sound to me like separate classes

That's what I'm thinking, too (jupyter_server.auth.IdentityProvider atm).

They sound to me like separate classes, but I need to make sense of what you say in the above comment

I mean, for example, to configure Jupyter Server to run under JupyterHub, it must always set both the authorizer and identity provider, e.g.

c.ServerApp.authorizer_class = JupyterHubAuthorizer
c.ServerApp.identity_provider_class = JupyterHubIdentityProvider

If one made the mistake of only setting authorizer_class, it wouldn't work at all because the JupyterHubAuthorizer assumes the user object is what was returned from the JupyterHubIdentityProvider (it doesn't make sense to use JupyterHubAuthorizer with any other IdentityProvider).

On the other hand, setting only the JupyterHub identity provider would not result in properly applying permissions - it would appear to work because authentication would be correct. But authenticated requests that shouldn't have permission will be allowed, too (it's easy to overlook a configuration error when things that are supposed to work do, but things that aren't supposed to work do, too). This is technically correct because that's what the default authorizer is meant to do, but is likely not the configurer's intention.

In the specific case of JupyterHub, I don't expect this to be an issue because it will be taken care of by the jupyterhub-singleuser wrapper, but that's the kind of error I was thinking of.

@echarles
Member

echarles commented Feb 3, 2022

In the specific case of JupyterHub, I don't expect this to be an issue because it will be taken care of by the jupyterhub-singleuser wrapper, but that's the kind of error I was thinking of.

OK, I see, thanks for clarifying. There are so many configuration options to set up that I don't see it as an issue to ask the platform administrator to define the needed configuration... as long as (in capitals)... as long as the system is secure by default.

@ellisonbg
Contributor

Thanks for working on this @minrk! I would like to make sure I understand the overall model (at least parts of it). It looks like the identity API itself defers to the Authorizer subclass to return a user model. By providing a custom Authorizer, any deployment of Jupyter Server can integrate with 1) whatever service/API they are using for identity, and 2) the actual permissions/access system about who can access what. The key things I am trying to understand are which parts are standardized, and which are intended to be customized in a particular deployment. On the frontend, is the plan to have ICurrentUser in JupyterLab pull information from the identity API?

@minrk
Contributor Author

minrk commented Feb 7, 2022

It looks like the identity API itself defers to the Authorizer subclass to return a user model. By providing a custom Authorizer, any deployment of Jupyter Server can integrate with 1) whatever service/API they are using for identity, and 2) the actual permissions/access system about who can access what.

Yes, though I think the latest consensus is that we'll have separate IdentityProvider to provide the user model, and Authorizer to make authorization decisions based on that model and the resource being requested. I haven't made those changes yet. These are loosely coupled in that the Authorizer may make assumptions about the user model from the identity provider. Applications that provide both (like JupyterHub) will implement both APIs, while others may only provide one or the other (e.g. an OAuthIdentityProvider with an EverybodysReadOnly Authorizer).
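The loose coupling can be illustrated with stand-in classes. None of these are the real jupyter_server classes: StaticIdentityProvider, ReadOnlyAuthorizer, and AllowAllAuthorizer are hypothetical, and the is_authorized argument order is an assumption made for this sketch.

```python
from typing import Any, Dict


class StaticIdentityProvider:
    """Hypothetical identity provider that always returns one fixed user."""

    def __init__(self, username: str) -> None:
        self.username = username

    def get_user(self, handler: Any) -> Dict[str, str]:
        # A real provider would inspect the request (cookies, tokens, OAuth).
        return {"username": self.username}


class ReadOnlyAuthorizer:
    """Hypothetical authorizer: any authenticated user may read, nothing else."""

    def is_authorized(self, handler: Any, user: Dict[str, str], action: str, resource: str) -> bool:
        return action == "read"


class AllowAllAuthorizer:
    """Hypothetical authorizer: every authenticated request is allowed."""

    def is_authorized(self, handler: Any, user: Dict[str, str], action: str, resource: str) -> bool:
        return True
```

The same provider can be paired with either authorizer, which is the mix-and-match argument for keeping the two classes separate.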

On the frontend, is the plan to have ICurrentUser in JupyterLab pull information from the identity API?

Yes, that's the goal, so we want to make sure we have all the fields ICurrentUser may want defined, and be clear what to expect/require when certain fields aren't available from the IdentityProvider.

@echarles
Member

echarles commented Feb 7, 2022

Yes, though I think the latest consensus is that we'll have separate IdentityProvider to provide the user model, and Authorizer to make authorization decisions based on that model and the resource being requested.

This is indeed what has been discussed recently.

It looks like the identity API itself defers to the Authorizer subclass to return a user model

Looking at the diff, there is indeed a def user_model(self, user: Any) -> Dict: method in the jupyter_server/services/auth/authorizer.py file, which confuses me, as the user model is a responsibility of the identity aspect. Would it make sense to create a jupyter_server/services/auth/identity.py file and move that method there?

@minrk
Contributor Author

minrk commented Feb 9, 2022

Would it make sense to create a jupyter_server/services/auth/identity.py file and move that method there?

Yup, that's the idea. I have done it locally already, hope to have a draft up by the end of the week.

@minrk
Contributor Author

minrk commented Feb 10, 2022

Rebased now that #671 is merged. Not quite ready for a full review, but close enough that folks can take a look.

Overview:

IdentityProvider defines the Authentication API to be combined with Authorizer for authorization.

  • IdentityProvider.get_user(Handler) is the new API for returning a user to be stored in .current_user for @authenticated handlers. This is the object passed to Authorizer.is_authorized().
    The default implementation preserves existing behavior by calling LoginHandler.get_user(), but if a subclass defines this method,
    LoginHandler.get_user is ignored.
  • IdentityProvider.identity_model casts the above object to a standard model, suitable for consumption by JupyterLab's IUser. The default implementation works with common known implementations (the default in jupyter server, and JupyterHub), but will likely want to be overridden in most cases.
  • /api/me is served by IdentityHandler which returns the identity model with a permissions key. permissions may be passed as a JSON-serialized dict of {resource: [action]}. The dict returned will have the exact same form, only including authorized permissions.
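
A minimal sketch of the permissions round trip described in the last bullet, not the PR's actual handler code. `filter_permissions` and `read_only` are hypothetical names, and the callable stands in for `Authorizer.is_authorized()` with the handler argument omitted. Keeping a resource key with an empty list when nothing is authorized is an assumption made here for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for Authorizer.is_authorized(); takes (user, action, resource).
IsAuthorized = Callable[[str, str, str], bool]


def filter_permissions(
    user: str,
    requested: Dict[str, List[str]],
    is_authorized: IsAuthorized,
) -> Dict[str, List[str]]:
    """Return the same {resource: [action]} shape, keeping only authorized actions."""
    allowed: Dict[str, List[str]] = {}
    for resource, actions in requested.items():
        # Assumption: a resource with no authorized actions keeps an empty list.
        allowed[resource] = [a for a in actions if is_authorized(user, a, resource)]
    return allowed


def read_only(user: str, action: str, resource: str) -> bool:
    """Toy authorizer: everybody may read, nobody may do anything else."""
    return action == "read"


requested = {"contents": ["read", "write"], "kernels": ["execute"]}
granted = filter_permissions("alice", requested, read_only)
```

The client supplies the combinations it cares about because the server has no registry of every resource/action pair to enumerate.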

Remaining tasks:

  • define IdentityModel
  • implement default IdentityProvider
  • implement /api/me returning identity model, including IUser fields and permissions
  • docs
  • example
  • tests

Design decisions:

  • should the default identity_model fail if it cannot determine a username, or use 'unknown' or an otherwise generated value? This is a backward-compatibility issue, as currently any opaque truthy object is allowed for a custom LoginHandler.get_user implementation.
  • Should additional fields be allowed in the identity model if the identity provider has them (jupyterhub user models have fields like 'kind', group membership, etc.), and if so, how should they be included (an identity.ext black-box dict?)
  • Should there be a field identifying the identity provider itself?
  • Should we use the IdentityModel, and/or enforce our own User class directly in .current_user - this may be hard without breaking compatibility.

Things that I think are good to do, but ideally not in this PR:

  • deprecate LoginHandler.get_user and other authentication customization options on LoginHandler in favor of consolidating authentication customization on IdentityProvider (IdentityProvider.get_user already has higher priority, but default implementation for backward-compatibility still calls LoginHandler.get_user where the default authentication implementation still resides)
  • (as suggested by @hbcarlos) implement new default anonymous user scheme with generated names, instead of using UUIDs, since these may now be visible

@minrk
Contributor Author

minrk commented Feb 11, 2022

I added some docs and tests. Could still use an example, but I think there's enough to review for now.

@minrk
Contributor Author

minrk commented Mar 28, 2022

I've run a successful test with this PR and jupyterhub/jupyterhub#3833

With this, I'm able to run a fully authenticated JupyterLab server with read-only access for some users without the JupyterHub wrappers at all, purely an implementation of the IdentityProvider and Authorizer APIs. That tells me we are pretty close to my goal of being able to replace the jupyterhub-singleuser wrapper and its extensive monkeypatching with a public-api-consuming ExtensionApp. The main thing I needed was a get_handlers() (as seen in JupyterHub.Authenticator and in ContentsManager) to register the oauth callback handler.

Config files: https://gist.github.com/e5f47d6f4253532635e9716076db0fb4

@Zsailer
Member

Zsailer commented Apr 14, 2022

We've created a 1.x branch during this week's contributing hour. main now represents our 2.x branch. We're going to merge this PR here as part of this new branch.

Thanks for all the great work here @minrk!

@Zsailer Zsailer mentioned this pull request Apr 14, 2022
9 tasks
@minrk
Contributor Author

minrk commented Apr 22, 2022

I can make async get_user work, I'm just thinking through the deprecation process.

The 'right' way to check the current user is to access handler.current_user. This has always been the case, but sometimes the underlying .get_current_user() is called directly, instead. This has been harmless until now, because .current_user is by default a cached proxy to .get_current_user(). If .get_current_user is allowed to be async, it can return a truthy coroutine, even when the request is not authenticated, granting unauthorized access if code isn't updated.
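
To make the hazard concrete, here is a small standalone illustration using only asyncio. The `get_current_user` function below is a hypothetical stand-in, not the server's method. It shows that an un-awaited coroutine object is truthy even when the awaited result is None.

```python
import asyncio


async def get_current_user():
    # Hypothetical stand-in: simulates an unauthenticated request (no user).
    return None


coro = get_current_user()

# Code that does `if handler.get_current_user():` would see this as "logged in".
unawaited_is_truthy = bool(coro)

# Awaiting the coroutine yields the real answer: no user.
awaited_result = asyncio.run(coro)
```

Any caller that tests the return value of a now-async method without awaiting it would therefore treat every request as authenticated.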

Assuming nobody overrides .get_current_user, this is not hard to handle: we can populate .current_user in prepare, and raise or show a deprecation warning if .get_current_user() is called instead (e.g. warn and return if .current_user is already set, raise if it isn't e.g. due to .prepare() being overridden without super, or unusual inheritance).

The trickier case is what JupyterHub does currently, which is override .get_current_user() itself (the standard way to override auth in Tornado). If we ignore that and call IdentityProvider.get_user directly, JupyterHub auth will be bypassed and all requests will be allowed without warning. This is the main situation I want to avoid. Deprecation would be best, error second choice, but it shouldn't silently ignore the deprecated auth override in favor of allowing everything.

So far, I think:

  1. base class .get_current_user raises/warns
  2. detect if using default IdentityProvider and overridden get_current_user: warn and call overridden method
  3. otherwise, call IdentityProvider.get_user directly

That way, setting a custom IdentityProvider takes precedence, but the deprecated approach still works without issue. Only downside is the somewhat convoluted check for the deprecated case on each request.
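
Roughly, the check could look like the sketch below. This is a simplified, synchronous illustration under my own assumptions, not the code in this PR: `BaseHandler`, `resolve_user`, `LegacyHandler`, and `CustomIdP` are hypothetical names, and the real `get_user` may be async.

```python
import warnings


class IdentityProvider:
    """Hypothetical default provider; a real one would inspect the request."""

    def get_user(self, handler):
        return None


class BaseHandler:
    identity_provider = IdentityProvider()

    def get_current_user(self):
        # Step 1: the base-class method is no longer the way to get the user.
        raise RuntimeError("get_current_user is deprecated; use .current_user")

    def resolve_user(self):
        overridden = type(self).get_current_user is not BaseHandler.get_current_user
        default_idp = type(self.identity_provider) is IdentityProvider
        if default_idp and overridden:
            # Step 2: honor the deprecated override rather than bypassing auth.
            warnings.warn(
                "Overriding get_current_user is deprecated; use an IdentityProvider",
                DeprecationWarning,
                stacklevel=2,
            )
            return self.get_current_user()
        # Step 3: a custom IdentityProvider takes precedence.
        return self.identity_provider.get_user(self)


class LegacyHandler(BaseHandler):
    def get_current_user(self):
        return "legacy-user"


class CustomIdP(IdentityProvider):
    def get_user(self, handler):
        return "idp-user"


class NewHandler(LegacyHandler):
    identity_provider = CustomIdP()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_user = LegacyHandler().resolve_user()
    new_user = NewHandler().resolve_user()
```

The important property is that the legacy override is never silently skipped: either it runs (with a warning) or an explicitly configured provider replaces it.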

@minrk
Contributor Author

minrk commented Apr 22, 2022

FWIW, I'd also be okay if the JupyterHub approach errored in a simple, reliable way, as long as the failure isn't allowing requests it shouldn't. That just means JupyterHub users need to wait for a release before upgrading to server 2.0. I'm fine saying that, but I want to be very sure that the incompatible-versions failure mode is not ignoring auth altogether.

@minrk
Contributor Author

minrk commented Apr 22, 2022

Has warn_disabled_authorization been released in 1.x? If so, I can backport the fix removing that from this PR.

@minrk
Contributor Author

minrk commented Apr 25, 2022

Same issue comes up again in check_xsrf_cookie, called in _execute, needs token_authenticated -> needs current_user -> before prepare -> cannot be async. Need to think a bit on that one.

@Zsailer
Member

Zsailer commented Apr 27, 2022

Has warn_disabled_authorization been released in 1.x? If so, I can backport the fix removing that from this PR.

Yes, it was released in v1.15.5 and above.

@minrk
Contributor Author

minrk commented Apr 29, 2022

Worked out the async get_user. I suspect this may come up as a sticky issue a few times in the future because some of tornado's 'pre-flight' checks run pre-prepare, and many of them now require running post-prepare.

identity includes fields:

- username: str
- name: Optional[str]
- display_name: Optional[str]
- initials: Optional[str]
- avatar_url: Optional[str]
- color: Optional[str]

permissions is sibling to identity,
with the form: {"resource": ["action", ],}

where permissions are only populated _by request_,
because the server cannot know what all resource/action combinations are available.
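
For illustration only, the fields above could be modelled as a dataclass and rendered into a response with identity and permissions as sibling keys. This is a sketch under assumptions: the `me_response` helper and the sample values are hypothetical, and the real class may differ in defaults and validation.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class User:
    """Illustrative model with the fields listed above; only username is required."""

    username: str
    name: Optional[str] = None
    display_name: Optional[str] = None
    initials: Optional[str] = None
    avatar_url: Optional[str] = None
    color: Optional[str] = None


def me_response(user: User, permissions: Dict[str, List[str]]) -> Dict[str, Any]:
    # Hypothetical helper: identity and permissions are siblings in the reply.
    return {"identity": asdict(user), "permissions": permissions}


response = me_response(
    User(username="alice", display_name="Alice"),
    {"contents": ["read"]},
)
```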

Defines new jupyter_server.auth.IdentityProvider API for implementing authentication

- IdP.get_user(Handler) returns User dataclass
- IdP.identity_model renders dataclass to standard JSON dict model
- IdP.get_handlers returns possible custom RequestHandlers (e.g. OAuth callback)
- get user_id from cookie, even when token-authenticated,
  for stable random user ids in the default implementation, even for token-authenticated requests.

Default get_user implementation still resides in LoginHandler.get_user,
to be moved and deprecated in subsequent PR.

Ensures authorizer, identity_provider are defined
in case of custom Application, load default Authorizer/IdentityProvider if none is available in settings

For CI: Fix numpy-style docstring format
careful to deprecate overridden get_current_user without ignoring auth

Needs some changes due to early steps that are called before prepare,
but must now be moved to prepare due to the reliance on auth info.

- setting CORS headers (set_default_headers)
- check_xsrf_cookie
- check_origin

now that get_user is async, we have to re-run these bits in prepare after user is authenticated
@Zsailer
Member

Zsailer commented Apr 29, 2022

Worked out the async get_user

🎉 Great work, Min!

I suspect this may come up as a sticky issue a few times in the future because some of tornado's 'pre-flight' checks run pre-prepare, and many of them now require running post-prepare.

Do you mind expanding on this a little bit? I don't immediately see the issue, but I'm sure your experience from JupyterHub makes you much more keen to see these issues.

This looks good to go in my mind! 👍 🚀

@Zsailer
Member

Zsailer commented Apr 29, 2022

I'm going to merge this now, since we've had multiple rounds of review. I think it's better to get this merged and start using it. We can iterate in future PRs if issues arise.

Thanks so much, @minrk! Super excited for this work!

@Zsailer Zsailer merged commit 6d84507 into jupyter-server:main Apr 29, 2022
@minrk
Contributor Author

minrk commented May 2, 2022

Do you mind expanding on this a little bit?

It's mostly references to .token_authenticated, or possibly direct calls to .get_current_user(), in the various tornado preflight methods. If any extension handlers call some of these methods before the user is authenticated, they will see an error. The likely failures I can think of:

  • custom set_default_headers that take this into account (probably unlikely, though the base class did)
  • calling super().prepare() either without awaiting it now that it's async, or at the end of a custom prepare, which will no longer work if it references the current user, since it won't be defined yet.

I think these will be rare, but I don't know exactly how rare.
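
A toy reproduction of the second failure mode, using plain asyncio classes rather than real tornado handlers. All class names here are hypothetical, and the base class only mimics the idea that the user is populated inside an async prepare().

```python
import asyncio


class BaseHandler:
    """Hypothetical stand-in for a base handler whose prepare() authenticates."""

    def __init__(self):
        self._current_user = None

    @property
    def current_user(self):
        if self._current_user is None:
            raise RuntimeError("current_user accessed before prepare() ran")
        return self._current_user

    async def prepare(self):
        await asyncio.sleep(0)  # stands in for an async get_user()
        self._current_user = "alice"


class GoodHandler(BaseHandler):
    async def prepare(self):
        # Await the base prepare first so the user is defined.
        await super().prepare()
        self.greeting = f"hello {self.current_user}"


class BadHandler(BaseHandler):
    async def prepare(self):
        # References the user before the base class has authenticated: raises.
        self.greeting = f"hello {self.current_user}"
        await super().prepare()


good = GoodHandler()
asyncio.run(good.prepare())

try:
    asyncio.run(BadHandler().prepare())
    bad_error = None
except RuntimeError as exc:
    bad_error = str(exc)
```

Moving the super().prepare() call to the top of a custom prepare, and awaiting it, is the fix in both cases.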

We could make this work far more often if the failures only occurred when get_user is async, rather than always. That's likely to hide bugs, though, as the default auth is always going to be sync and JupyterHub probably shouldn't be (even though it is right now).
