RFE: Option in pbench-move-results command to output results direct link #3120

Open
1 task done
pravins opened this issue Dec 19, 2022 · 16 comments
Labels
Agent API Of and relating to application programming interfaces to services and functions Backlog Server

Comments

@pravins
Member

pravins commented Dec 19, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Description

As a user, I want a direct link from which I can download results uploaded to the server, so that I can do further analysis on them to find the information I need.

As of now, we do not have any such option, and the end user needs to construct the direct link from the config-file + results + user + benchmark directory.

To Reproduce

Run pbench-move-results

Actual Results

It does not output a direct link to the results.

Expected Results

Some option to get a direct link to the uploaded results.

Additional information

None

@pravins pravins added the Agent label Dec 19, 2022
@dbutenhof
Member

It's worth noting here that with "ownership" and "curation" this becomes a bit more nuanced than it was for 0.69. Yes, if one knew the benchmark name, the config string, and the run timestamp, plus the hostname and port of the Pbench Server, one could generate a URL to pull the tarball or any artifact from the tarball via unauthenticated Apache file sharing.

We're moving away from all that. Unless the dataset is "public", only the authenticated user will be able to do this. And, of course, we're changing the structure of the URL as it'll go through http[s]://<hostname>:<port>/api/v1/inventory/<resource_id> with authorization via the Pbench Server APIs. (The resource_id is just the tarball MD5 at this time, but we're also talking about switching to SHA256 or something, and we don't want to lock ourselves in too tightly to a representation.)

So given pbench-results-move --access=public --token=<token>, then, yes, we could generate a URL to access the tarball. One possibility would be to return a Location: <URI> response header. This is commonly a REST GET URI to return the full REST resource just created. I'm not entirely sure what that would mean as we don't really visualize our datasets as a "full resource" at this time... so, possibly, we could "fudge" by returning the inventory API for the tarball. Knowing relative paths within the tarball, you could easily append that path to get individual artifacts, and that might be useful.

But, again, unless you use --access=public, the GET will require authentication as the owning user.

@pravins
Member Author

pravins commented Dec 20, 2022

Agree.
From a data protection perspective, we have to make sure that a user will only be able to access their own tarballs or public tarballs.

@jhutar
Contributor

jhutar commented Dec 22, 2022

Hello. I'm uploading with:

pbench-move-results --user jhutar --prefix spc

and data seems to be publicly available

@pravins
Member Author

pravins commented Dec 22, 2022

@jhutar The new server is not deployed yet. With the new implementation of the user notion, a user has to generate a token and use that token to upload the data.
Once it's uploaded, only the authenticated user will be able to see the data. Users will be able to mark their data public, so that anyone can access it using a direct link.
With the present 0.69 server, all uploaded data is public.

@portante portante added Backlog Server API Of and relating to application programming interfaces to services and functions labels Dec 23, 2022
@dbutenhof
Member

pbench-move-results --user jhutar --prefix spc

As Pravin said, the --user from 0.69 has nothing to do with dataset ownership: it's effectively metadata. (In fact, if you're accessing results on 0.69 you'll only find any trace of this if you open metadata.log; what it does is control a link under the user's results tree.) When you move to Pbench Server 1.0, you'll specify an API authentication token to pbench-results-move --token which identifies the user who will own the data. (You'll also be able to specify --access public if you want your data to initially be accessible to everyone rather than private.)

Also, note that --prefix goes away completely: it's not supported by pbench-results-move.

@webbnh
Member

webbnh commented Jan 3, 2023

@pravins, I'm not sure that this is a direction that we want to take.

My understanding of the model is that pbench-results-move "simply" takes an existing file tree from the current system and uploads it to the Pbench Server. Thus, the user already has access to their result on the local system, and doesn't need a URL from the Server in order to find it (locally). If a user wants to download a result somewhere else, they can (or will be able to) do that via the Pbench Dashboard (which will handle all of the authentication and access issues). I don't see how it will improve things to add generation of a URL to pbench-results-move.

Now, if the issue is that a user is having trouble locating their result locally, that seems like something we might consider creating a command for. (Such a task should be simple, well-bounded, and not subject to implementation details of the Pbench Server.)

@dbutenhof
Member

A little more nuanced. First, the default behavior of pbench-results-move deletes the local tarball after successfully uploading it. (Although there is a --no-delete approximating the behavior of the old pbench-results-copy.)

We also don't currently have a dashboard mechanism to return dataset inventory. (Either individual files, which might be displayed inline, or the full tarball, which would presumably be downloaded to the browser's local filesystem.)

There's no reason that anyone can't just use curl or equivalent to pull files from a Pbench Server. It might be perfectly reasonable for PUT, for example, to return a Location response header with the http://<host>/api/v1/inventory/<resource_id> URL; which, if used directly on a GET, would return the entire dataset tarball. And, if we did that, there's no reason that pbench-results-move couldn't display that header for whatever good it might do the user.

So ultimately, whether or not we do this, and however we prioritize it, I don't think there's anything inherently wrong with the request.
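
To make the Location idea concrete, here is a minimal Python sketch of how a client might report the header after an upload. It works on a plain status code and header mapping so that it does not depend on any particular HTTP library. The function names and message wording are hypothetical, and nothing here reflects what pbench-results-move or the PUT API currently does.

```python
from typing import Mapping, Optional


def location_of(headers: Mapping[str, str]) -> Optional[str]:
    """Case-insensitive lookup of the Location response header."""
    for name, value in headers.items():
        if name.lower() == "location":
            return value
    return None


def upload_summary(status: int, headers: Mapping[str, str]) -> str:
    """Build the message a client could print after an upload attempt."""
    location = location_of(headers)
    if 200 <= status < 300 and location:
        return f"Upload succeeded; dataset available at {location}"
    if 200 <= status < 300:
        return "Upload succeeded; server returned no Location header"
    return f"Upload failed with HTTP status {status}"
```

Because the header is optional in this sketch, a client written this way would keep working against a server that never returns it.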

@webbnh
Member

webbnh commented Jan 3, 2023

Diving into the nuance, the RFE requests the ability for a user to "do further analysis on the uploaded results to find required information." There are two ways for the user to engage with that: either locally or on some other system.

If the user wants to work locally, then there is no reason to involve pbench-results-move or the Pbench Server -- the user already has the results (assuming that pbench-results-move doesn't remove them) -- and the issue devolves to whether the user can readily find them (which I think it makes sense to help with, if needed).

However, if the user wants to work on another system, then having the result's URL made available on the system where they ran pbench-results-move is somewhat inconvenient, since that's not where they want the results to be downloaded to. Also, contrary to your assertion, a user won't necessarily be able to "just use curl", because there is (or will be, or should be!) an authentication and authorization process involved in accessing the result on the Pbench Server.

Thus, I think it would be better to refer the user to the Pbench Dashboard:

  • it will handle the authentication and authorization details
  • it can collaborate with the Pbench Server to hide any details around the URI, access, and download (particularly as these change over time)
  • and, for specific analysis tasks, it will evolve to offer support which frees the users from having to do it themselves.

That is, I think that using the Pbench Dashboard (in collaboration with the Pbench Server) should (and will eventually) be better than accessing the result directly...and so I don't see any value to trying to help the user to avoid using the Dashboard.

@dbutenhof
Member

Yes, obviously one needs an authorization header; but once we adopt the "API token" mechanism that's not onerous. We don't want to position our dashboard as the only gateway into the server APIs, for lots of reasons, including the fact that we don't expect our dashboard to be able to do everything someone might want. Making it easy for them to get at the raw archived data provides a hook for all sorts of expansions.

This is all way beyond "exposing the URI in pbench-move-results" (and, by extension, our PUT API), but all I'm arguing is that I can see viable uses.

@webbnh
Member

webbnh commented Jan 3, 2023

We don't want to position our dashboard as the only gateway into the server APIs

I concur -- we want to facilitate the development of other client programs, as well. But, equally, I don't think we should expect our (human) end-users to be using the Pbench Server APIs directly (e.g., via curl), and so handing out URLs (especially if they are returned on "the wrong node") isn't likely to produce as good a UX as the Dashboard would/should (and, using the Dashboard will facilitate covering the arcana like how to provide the API token, etc. in the access request to the Server).

we don't expect our dashboard to be able to do everything someone might want. Making it easy for them to get at the raw archived data provides a hook for all sorts of expansions.

No, we don't...but wouldn't it be cool if it did?! And, if the Dashboard is the first point of contact, then we're much more likely to find out when it doesn't do something which a user needs. (If the default is that we give users a URL for their result, then, instead of getting RFE's for the Dashboard, we'll be getting requests for help making curl work. 😛)

I would really like to see "access to the result" be a Pbench Server API request, e.g., for timeseries data, possibly facilitated by the Dashboard, and not a process where the user pulls the tarball, unpacks it, and then does something like manually ingest the guts into Elasticsearch.... 😢

@jhutar
Contributor

jhutar commented Jan 4, 2023

Hello @webbnh. I need that URL because my use-case is "run the test, upload results and clean up" and I need to add that URL to the metadata I store.

@jhutar
Contributor

jhutar commented Jan 4, 2023

Also, reading the discussion: note that I do not know the difference between the Pbench Server and the Pbench Dashboard (and I do not care too much). I just want a direct link to the results I just uploaded, so that I can record it together with my metadata and it is available long-term. Whether it is a tarball, an unpacked tarball directory, or some UI dashboard full of fancy graphs, I do not care at the moment. If this is not possible or does not fit your stack, feel free to close this - I can deal with it.

@webbnh
Member

webbnh commented Jan 4, 2023

Hi @jhutar,

The Pbench Dashboard is a web-based GUI for accessing uploaded Pbench results; the Pbench Server provides the RESTful API on which the Dashboard (and, someday, other clients) is built.

In v0.69, the Pbench Server deployment includes an Apache file service which is used to serve the results as a tarball unpacked into a file tree which mirrors the results tree on the system where the workload was run. As Pbench evolves, we would like to move in a direction which is more data-focused and less tied to the representation of that data -- e.g., to replace the tarball and original filesystem view with APIs which allow the user to query the results and receive configuration, summary, and timeseries data as, e.g., JSON blobs which are more amenable to visualization and data science investigation.

I don't know if such a transition is actually possible, since people seem to have a long and continuing tradition of treating the Pbench back end as a file server from which they pull bits of their results in order to do custom processing. But, the idea that pbench-move-results (or, its follow-on, pbench-results-move) would return a URL for the tarball would just further resist the transition, since it would continue to promulgate the notion that "a Pbench result" is a tarball; and, worse, it might result in people making assumptions about the format of the URL and how the result is stored (e.g., at the very least, we want to move away from the file system view to an object store).

That said, if what you are looking for is actually some sort of ID which would be used later to identify and retrieve a result, we could probably do something useful there. The Pbench Server has a notion of a "result ID" -- we could make that available from pbench-results-move, and you could store that with your metadata and use it later to request results from the Pbench Server. (@dbutenhof?)

@jhutar, the concept of having pbench-results-move report back to the user (on success...) some sort of identifier for the uploaded result makes very good sense. I just think that a URL which would allow the user to re-download it again somewhere else doesn't provide the appropriate value (e.g., if the user just wants to package up their result and move it, the Agent has support for that without involving the Pbench Server!); however, returning the Server's ID for the result might be a good feature.
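
As a sketch of the "store the ID with your metadata" idea, the Python below merges a result ID into a JSON metadata file kept by the caller. It assumes the ID has already been obtained somehow (pbench-results-move does not report it today), and the file layout and the `pbench_result_id` / `pbench_server` field names are invented for the illustration.

```python
import json
from pathlib import Path
from typing import Optional


def record_result(metadata_path: str, result_id: str,
                  server: Optional[str] = None) -> dict:
    """Add a Pbench result ID to a JSON metadata file, keeping other keys.

    The ID is stored as an opaque string so that the record stays usable
    even if the server changes how results are stored or addressed.
    """
    path = Path(metadata_path)
    metadata = json.loads(path.read_text()) if path.exists() else {}
    metadata["pbench_result_id"] = result_id  # hypothetical field name
    if server:
        metadata["pbench_server"] = server  # hypothetical field name
    path.write_text(json.dumps(metadata, indent=2, sort_keys=True) + "\n")
    return metadata
```

Recording the server alongside the ID leaves it to a later client (the Dashboard, or a script using the API) to decide how to turn the pair into a request.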

@jhutar
Contributor

jhutar commented Jan 4, 2023

Hello. I agree, except that I would like not just the ID but the whole URL (which I assume would contain that ID). Thank you!

@webbnh
Member

webbnh commented Jan 4, 2023

not just ID, but whole URL

But...what exactly do you expect that URL to produce for you? Should it produce a web page for your browser (which will enable you to explore your data interactively and possibly visualize it)? Should it produce some sort of CSV file? Or, are you expecting it to produce a tarfile? And, if it's the last of those, what are you expecting to find in it?

That is, what do you conceive of "a Pbench result" as being?? And, how/when/where do you expect to access it? And, what do you hope to do with it?

@jhutar
Contributor

jhutar commented Jan 4, 2023

I use pbench as a tool that allows me to record some metrics and system info, store it, and forget about it. I just need a URL that allows me to start digging into the result 3 months later in case I decide I have to. So any of what you named is most probably OK - each of them would give me some starting point.

By "a Pbench result" I mean: whatever was uploaded there. I assume (given what I have seen on e.g. http://pbench.perf.lab.eng.bos.redhat.com/results/f09-h26-b02-5039ms.rdu2.scalelab.redhat.com/spc/pbench-user-benchmark__2023.01.04T20.02.19/) it contains some monitoring data and some system info.

I hope that sounds like a reasonable use case.
