Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement metadata support in server (to replace Graphql) #2048

Closed
npalaska opened this issue Dec 17, 2020 · 10 comments · May be fixed by #2049
Closed

Implement metadata support in server (to replace Graphql) #2048

npalaska opened this issue Dec 17, 2020 · 10 comments · May be fixed by #2049
Assignees
Projects
Milestone

Comments

@npalaska
Copy link
Member

Right now we have GraphQL server running with Prisma and Apollo server, however, in an efforts to create more robust Pbench server APIs we need to move the implementation on the server side. Therefore we need to re-define our GraphQL types, schema and mutations in server code.

So the GraphQL implementation should sit in the server where as the front-end request the required data for the dashboard using only the exposed APIs. Making a GraphQL queries from front-end shouldn't be required. The APIs would receive a simple JSON request which server then parses after passing all the authentication and make a valid GraphQL schema for execution. Upon completing the schema execution server returns the results back to the front-end.

@portante portante added this to the v0.72 milestone Dec 17, 2020
@portante portante added this to To do in Server via automation Dec 17, 2020
@portante portante added this to To do in v0.72 via automation Dec 17, 2020
@dbutenhof
Copy link
Member

Just a clarifying note: the motivation for using GraphQL in the dashboard was to allow the efficient access and mutation of a small subset of a potentially large volume of user metadata stored in the backend DB (currently postgresql).

In migrating the implementation from dashboard to server, the greatest benefit specific to a GraphQL "backend" would be that dashboard code could be supported without change. While that's not without value, it's also not essential; and, in particular, your proposal above doesn't achieve that because it hypothesizes a simple JSON REST API model. If we're not exposing a GraphQL API, then using GraphQL internally may become 100% overhead vs simply using the right SQL queries directly. (GraphQL on top of postgresql can't do anything we can't do with SQL, and if we're tying the operational range down by exposing a non-GraphQL API, our client doesn't benefit from the flexibility of the GraphQL engine.)

The original concern is that we might end up with a GET metadata / PUT metadata API model where the entire payload of user metadata is all in one bundle. But there are other alternatives; for example, a GET and PUT where specific parts (e.g., keywords) are specified.

@npalaska
Copy link
Member Author

Summary of my discussion with @dbutenhof on this issue:

We figured there are 2 possible approaches to move forward:

  1. Expose a graphql endpoint where an end user would make a graphql query and server just act as an authentication gate and if query is authenticated we return the graphql result as it is to the user or with some additional post-processing.
  2. Expose an endpoint that only accept a JSON data just like any other API endpoint we have and it returns the the desired results after making a SQL query against the Postgres. Requesting a JSON and then server doing the post-processing of converting to graphql query and executing it would not benefit us given the amount of data we have currently related to a User and a corresponding UrlSession models and it would be an extra overhead as mentioned in earlier Dave's comment.

With 1st approach we would not have to change the dashboard code much as it can continue to make graphql queries against the a respective graphql endpoint, however, this could create some inconsistency in our API architecture where we would have certain endpoints that needs a JSON input and certain endpoints that would need a graphql query and dashboard needs to know which endpoint need what type of data.

With 2nd approach we would have to change the dashboard code a bit specifically about how it makes the graphql query. However, this would create a consistency among all the server APIs definition without adding any extra overhead. Also our dashboard code needs some changes anyway once all the APIs and authentication mechanism is baked out.

The url/sessionUrl model data that we fetch or query using graphql right now is not significantly large at this point that the REST would become inefficient. However, it is possible that the data we handle right now with user/url/sessionUrl models grows significantly where we need to make efficient queries to get/post data.

Therefore we decided to possibly keep the graphql schema definitions in the code but just not using it for now instead making a SQL query to handle the requested data and therefore essentially adopting a 2nd approach but keeping the graphql schema definition in the code so if we have to make a transition, its quicker (but this is in the anticipation that if we ever have to).

Any comments/suggestions?

@dbutenhof
Copy link
Member

One problem with exposing a GraphQL endpoint is that we can't just "check authorization" up front; we need to be sure either that the GraphQL schema doesn't allow any possible mutation that might be restricted, or we need to analyze each GraphQL query payload to be sure that it doesn't try to mutate (or in fact even examine) anything we wouldn't allow for the user.

I conceive of our metadata model as something like a 2-level per-user NOSQL query. That is, our metadata is structured somewhat like this, conceptually: (visualized as JSON, though we're really talking about one or more SQL tables and we probably only manifest the user's individual metadata values as JSON)

"username": {
    "metadata-key1": {
        "individual-id1": {
            "created": "timestamp",
            "modified": "timestamp",
            "value": "user-data-blob"
        },
        "individual-id2": {
            "created": "timestamp",
            "modified": "timestamp",
            "value": "user-data-blob"
        }
    },
    "metadata-key2": { ... }
}

E.g., the current dashboard "URL" metadata might be represented as "metadata key 1" above, with a name like dashboard-saved-view and multiple "session ID" values for the "individual IDs", and the dashboard's config and description being part of the "value" data blob. (Or we could make "description" part of the schema, possibly both for each metadata key and for each instance of metadata; but "config" is too specific... that's just an opaque blob (whether JSON or more generic) for the server APIs and DB.)

@portante portante linked a pull request Mar 16, 2021 that will close this issue
@portante portante removed this from To do in v0.72 Mar 16, 2021
@portante portante added this to To do in v0.71 via automation Mar 16, 2021
@portante portante moved this from To do to In progress in Server Mar 16, 2021
@portante portante moved this from To do to In progress in v0.71 Mar 16, 2021
@portante
Copy link
Member

portante commented Mar 18, 2021

Right now we have GraphQL server running with Prisma and Apollo server, however, in an efforts to create more robust Pbench server APIs we need to move the implementation on the server side. Therefore we need to re-define our GraphQL types, schema and mutations in server code.

Can you enumerate how the pbench-dashboard uses GraphQL? What is the current schema today?

I believe it is defined at: https://github.com/distributed-system-analysis/pbench-dashboard/blob/main/prisma/datamodel.prisma

For convenience, copied here:

type User {
  id: ID! @id
  firstName: String!
  lastName: String!
  username: String! @unique
  password: String!
}

type Session {
  id: ID! @id
  createdAt: DateTime! @createdAt
  updatedAt: DateTime! @updatedAt
  config: String!
  description: String
}

The User model above should be dropped entirely since that is already implemented.

The Session module just stores an arbitrary config string, and its description.

Instead of writing an entire API, why not use something like memcached [1] or redis [2] to just store the blobs with a timeout, instead of creating a backend infrastructure?

[1] https://memcached.org/
[2] https://redis.io/

I am personally very fond of the simplicity, performance, and flexibility of memcached, as I have used it for past web sites to make them scale well. One of the use cases we used it for was for session tokens and associated stored data.

@npalaska
Copy link
Member Author

Right now we have GraphQL server running with Prisma and Apollo server, however, in an efforts to create more robust Pbench server APIs we need to move the implementation on the server side. Therefore we need to re-define our GraphQL types, schema and mutations in server code.

Can you enumerate how the pbench-dashboard uses GraphQL? What is the current schema today?

I believe it is defined at: https://github.com/distributed-system-analysis/pbench-dashboard/blob/main/prisma/datamodel.prisma

For convenience, copied here:

type User {
  id: ID! @id
  firstName: String!
  lastName: String!
  username: String! @unique
  password: String!
}

type Session {
  id: ID! @id
  createdAt: DateTime! @createdAt
  updatedAt: DateTime! @updatedAt
  config: String!
  description: String
}

The User model above should be dropped entirely since that is already implemented.

The Session module just stores an arbitrary config string, and its description.

Instead of writing an entire API, why not use something like memcached [1] or redis [2] to just store the blobs with a timeout, instead of creating a backend infrastructure?

[1] https://memcached.org/
[2] https://redis.io/

I am personally very fond of the simplicity, performance, and flexibility of memcached, as I have used it for past web sites to make them scale well. One of the use cases we used it for was for session tokens and associated stored data.

The session module from the dashboard is same as the metadata object module (sqlalchemy table) defined here. This table has a value column that can store a text blob and this blog can be anything dashboard/client wants it to be along with a user defined key to segregate different types of metadata objects. The reason for using a sql table is to persist the metadata. These metadatas can be saved/favorite view/sessions that user wants to save so that they can get consistent view when they login everytime. You seem to be suggesting that we use the cashing mechanism instead of persistent storage. I am not sure how and why the metadata objects should be short lived. A user would want to persist their choice of favorite/saved views to get a consistent dashboard experience isn't it?

Earlier the dashboard was using the Graphql interface to create/query the metadata objects (session module from dashboard). However the only difference is we decided to implement it using the REST protocol keeping everything else same.

@portante
Copy link
Member

The existing Session is not managed in any way, and unbounded, and while this PR provides the same functionality from GraphQL, we still don't have a mechanism to manage the sessions created.

To that end, we need to convert these URL sharing sessions to something temporary in the short term (memcached or redis), until we can provide the user with an interface to the list of their shared URLs, where there is a boundary on either how many they can create and persist, or on how long they last.

All this mechanism for a 1-to-1 feature compatibility with GraphQL does not seem worth it. We might as well keep using the existing GraphQL and add a timeout for now.

Before we continue with this work, we really need to work out the full use case in the Dashboard for the functionality of the feature, and then implement from there.

@npalaska
Copy link
Member Author

npalaska commented Mar 18, 2021

It isn't just sessions, it can be any metadata text blob that dashboard on behalf of a user would want to store on the server. However, you raised a valid points how long they need to be persisted and how many should a user be allowed to create. My understanding is that we do not have complete answer for it yet as we never discussed around it and that's why the PR implements these details unbounded.

We certainly need to discuss these more in detail since implementing it via cashing mechanism would completely change the design workflow.

@dbutenhof
Copy link
Member

The existing Session is not managed in any way, and unbounded, and while this PR provides the same functionality from GraphQL, we still don't have a mechanism to manage the sessions created.

The API Nikhil has implemented certainly provides for "management" in a manual sense, in that we can display all metadata keys and values, and delete any we want. An administrator (when we have such a think) should be able to do this on behalf of any user. We'd just need a management console view.

To that end, we need to convert these URL sharing sessions to something temporary in the short term (memcached or redis), until we can provide the user with an interface to the list of their shared URLs, where there is a boundary on either how many they can create and persist, or on how long they last.

So you're proposing that "saved views" (and "favorites") continue to be strictly a front-end concept that couldn't (easily) be used by another client and would be completely invisible to the backend? (Our current "saved view" wouldn't be useful to anything but our dashboard anyway since my understanding is that it's essentially an abridged Redux dump, but in principle, user's "favorites" sounds like a core server concept that ought to be more visible.)

All this mechanism for a 1-to-1 feature compatibility with GraphQL does not seem worth it. We might as well keep using the existing GraphQL and add a timeout for now.

A timeout is interesting, but seems to lessen the utility of the saved view concept. And to have "favorites" time out while the datasets still live seems... weird.

I actually think the concepts we discussed in the meeting would be a much better alternative for the "saved view" ... the dashboard (possibly in conjunction with the server) generates a URI that you can copy and paste anywhere; email, chat, embed in a document, and click to restore. The state in the current saved view metadata would be embedded in the URI as a query tag, along (possibly) with a role-based server authorization token (which may not be necessary if it only applies to "published" data visible to everyone anyway). That way neither the server nor the dashboard needs to store saved views at all.

But while this makes a lot of sense for "saved view", I don't think it makes sense for "favorites" ... and I like the idea of letting users "tag" their datasets to organize them. In fact, I think "favorites" is overly simplistic, and we should consider allowing users to add arbitrary tags on which they can search and sort in the dashboard. That implies persistent metadata with a long and reliable lifetime. Our user metadata would easily allow that without any additional changes.

Of course that's still user-specific tags, and if we want to add arbitrary metadata through the UI to the datasets, we can either use my Postgresql dataset_metadata, or we can add fields to the Elasticsearch run documents. Either way, this sort of persistent (and, yes, manageable, given the proper UI work on top of it) seems valuable.

Before we continue with this work, we really need to work out the full use case in the Dashboard for the functionality of the feature, and then implement from there.

Agreed. We've always been focusing on replacing what the dashboard does internally with server APIs, based on the implicit design criteria manifested by the existing implementation ... often without really pushing at the parameters to determine how much of this was really well-considered and deliberate vs what just "happened" and might not necessarily be the best way forward. The whole issue of "anti-users" falls into that bucket (because "anti-user" is cooler and easier to say than "non-logged-in user").

@portante
Copy link
Member

Filed distributed-system-analysis/pbench-dashboard#133, where we should really re-consider the implementation of the dashboard session sharing feature, re-implementing it as a URL for all pages that can be shared so that existing bookmarks / url shortening capabilities can be leveraged instead of a server side database.

@portante portante removed this from In progress in v0.71 Apr 27, 2021
@portante portante added this to To do in v0.72 via automation Apr 27, 2021
@portante portante moved this from To do to In progress in v0.72 Apr 27, 2021
@portante portante added this to To do in Sprint 000 Jul 26, 2021
@webbnh webbnh changed the title Graphql schema implementation in server Implement metadata support in server (to replace Graphql) Jul 26, 2021
@portante portante moved this from To do to In progress in Sprint 000 Jul 26, 2021
@portante portante removed this from In progress in Sprint 000 Jul 29, 2021
@dbutenhof dbutenhof added this to To do in Sprint 006 via automation Oct 18, 2021
@dbutenhof dbutenhof removed this from To do in Sprint 006 Nov 29, 2021
@dbutenhof dbutenhof added this to To do in Sprint 007 via automation Nov 29, 2021
@dbutenhof dbutenhof removed this from To do in Sprint 007 Dec 13, 2021
@dbutenhof dbutenhof added this to To do in Backlog via automation Dec 13, 2021
@portante portante removed this from In progress in Server May 10, 2022
@portante portante added Backlog and removed Backlog labels May 10, 2022
@portante portante removed this from To do in Backlog May 10, 2022
@portante
Copy link
Member

portante commented Aug 5, 2022

From @dbutenhof: "We'll handle server-side Dataset and User metadata completely differently."

@portante portante closed this as completed Aug 5, 2022
v0.72 automation moved this from In progress to Done Aug 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
v0.72
  
Done
Development

Successfully merging a pull request may close this issue.

4 participants