Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve pineappl plot Python interaction #249

Open
alecandido opened this issue Oct 24, 2023 · 6 comments
Open

Improve pineappl plot Python interaction #249

alecandido opened this issue Oct 24, 2023 · 6 comments

Comments

@alecandido
Copy link
Member

Currently, pineappl plot relies on Python and Matplotlib. This is actually quite nice, because people using PineAPPL are often familiar with Python, and know how to plot the information, once given.

A current mild pain-point is that there is a lot of string manipulation involved in the call, since a Python script is generated everytime, embedding the data:

print!(
include_str!("plot.py"),
inte = if bins == 1 { "" } else { "# " },
nint = if bins == 1 { "# " } else { "" },
pdfs = if pdfs == 1 || bins == 1 { "# " } else { "" },
xlabel = xlabel,
ylabel = ylabel,
xlog = xlog,
ylog = ylog,
title = title,
output = output.to_str().unwrap(),
data = data_string,
metadata = format_metadata(&vector),
);

However, dumping the script is a design featured, in such a way that people are able to customize it.

I see a few possible ways of improving:

  1. split the two pineappl plot operations: dump the script and the data separately (in a common data format, like CSV)
  2. leverage pineappl_py in the script, and directly extract the data from the grid (requires the installation of the PineAPPL python extension)
  3. move the plotting feature to pineappl_py itself (possibly with a pineappl-plot CLI, but it would be installed separately, at least until Distribute CLI as a Python package #176 will be completed)
  4. use PyO3 to make the call from the CLI (i.e. from Rust) - in this way people will be able to customize the script and pass it again, but there won't be any need of dumping the data any longer (most likely this is an overkill)
@alecandido
Copy link
Member Author

alecandido commented Oct 24, 2023

This is not urgent at all, it is just a consideration that came to my mind reading #248, nothing more.

@cschwan
Copy link
Contributor

cschwan commented Oct 24, 2023

I had similar considerations in the past, but

  1. I like having the data and the script in one file, because then I can easily move/copy/rename the single file. This will become more complicated if we split things into more files.
  2. Using pineappl_py in the Python plotting script is possible, but then we introduce two more dependencies: pineappl_py and LHAPDF. Especially the latter, particularly the Python version, is a pain in the neck to get working. Another downside is that the script will take much longer, which we probably want to avoid, given that we often work on the Python script to modify the plot, without having to regenerate the data.
  3. Having the functionality available from Python we should definitely consider, although my plan was actually to eventually expose the entire CLI functionality in another, higher-level, crate, for which we then can write Python bindings, of course.
  4. This suggestion I don't understand.

That being said I believe the plotting script(s) and functionality can be significantly improved. I was thinking of using a templating engine in such a way that the template is really kept in one file. This mostly is the case right now (in this file: https://github.com/NNPDF/pineappl/blob/master/pineappl_cli/src/plot.py), but writing out the data could be improved, for instance.

@alecandido
Copy link
Member Author

alecandido commented Oct 24, 2023

Regarding the data handling, I see your point. However, I prefer to keep things separate, and rather gather through other means (e.g. folders or archives), since it makes it easier to consume them even for further purposes.
Fine to cache the results anyhow, but it could be implemented really as cache: if this file exists, use it, otherwise recompute it. The file name would stem from the grid, with the option to manually pass a file.

The idea of moving to Python is really to avoid the template (with or without a suitable engine). This is because I appreciate more libraries than code generators (less copy-paste, version management, and similar benefits). They are even easier to test and debug, because they are native in the language they are written in (while you'd need to use Python to test the script generated from Rust).

Since this part of the CLI functionality is already mostly Python, I would expose the interface through Python.
We could add the bindings for the rest in the same package of the plotting CLI.

P.S.: regarding point 4., the proposal was to write a Python function main(data, metadata), and make the call to main from Rust, through PyO3, such that the data are produced by Rust (with its own native data types) and marshalled to Python by PyO3, without the need of dumps in any file (not the source, nor anything else) - as I said, it's not needed, and it's also against your desiderata (i.e. caching)

@cschwan
Copy link
Contributor

cschwan commented Oct 24, 2023

The idea of moving to Python is really to avoid the template (with or without a suitable engine). This is because I appreciate more libraries than code generators (less copy-paste, version management, and similar benefits). They are even easier to test and debug, because they are native in the language they are written in (while you'd need to use Python to test the script generated from Rust).

I think the crucial point that you dispute is whether we want to directly output the generated plot or a plotting script that in turn generates the plot (where the code sits is another question). I chose the latter for several reasons:

  1. if the generated script is fine, I can run pineappl plot ... | python and directly get the plot; I don't need to bother with any additional files
  2. however, very often, I want to modify the script a bit because because every plot has some quirks, and colleagues often have special wishes. In that case I can save the plotting script to a file, modify the file and run it to generate the plot. Alternatively, I can make the modifications automatically using tools that use the pipe (see @andreab1997's example here: https://github.com/NNPDF/papers/blob/master/nnpdf40mhou/plots/pheno/generate_pheno_plots.sh). Often, I have to iterate this several times until the plots look OK.

I don't think you can avoid having a code generator, unless you invent an API that sits on top of matplotlib, for instance (is that what you meant?). However, I don't see the added value of that - for me matplotlib IS the API to plot the data. Which advantages do you see?

@alecandido
Copy link
Member Author

I don't think you can avoid having a code generator, unless you invent an API that sits on top of matplotlib, for instance (is that what you meant?). However, I don't see the added value of that - for me matplotlib IS the API to plot the data. Which advantages do you see?

Actually, this is the whole point, and nothing else.

Consider that partly you're already making such an API:

def percent_diff(a, b):
return (a / b - 1.0) * 100.0
def set_ylim(axis, save, load, filename):
# extract the y limits *not* considering margins
margins = axis.margins()
axis.margins(y=0.0)
ymin, ymax = axis.get_ylim()
axis.margins(y=margins[1])
inc = 1.0
if (ymax - ymin) > 100.0:
ymin = -50.0
ymax = 50.0
inc = 25.0
elif (ymax - ymin) > 30.5:
inc = 10.0
elif (ymax - ymin) > 20.5:
inc = 5.0
elif (ymax - ymin) > 10.5:
inc = 2.0
elif (ymax - ymin) < 3.0:
inc = 0.5
ymin = math.floor(ymin / inc) * inc
ymax = math.ceil(ymax / inc) * inc
if save:
with open(filename, "wb") as f:
pickle.dump([ymin, ymax, inc], f)
if load:
resave = False
with open(filename, "rb") as f:
[saved_ymin, saved_ymax, saved_inc] = pickle.load(f)
if saved_ymin < ymin:
ymin = saved_ymin
resave = True
if saved_ymax > ymax:
ymax = saved_ymax
resave = True
if saved_inc > inc:
inc = saved_inc
resave = True
if resave:
with open(filename, "wb") as f:
pickle.dump([ymin, ymax, inc], f)
axis.set_yticks(np.arange(ymin, ymax + inc, inc))
space = 0.05 * (ymax - ymin)
axis.set_ylim((ymin - space, ymax + space))

and the functions present have already enough arguments to be very customizable, without the need of patching them
def plot_rel_ewonoff(axis, **kwargs):
x = kwargs["x"]
y = percent_diff(kwargs["y"], kwargs["qcd_y"])
qcd_y = percent_diff(kwargs["qcd_y"], kwargs["qcd_y"])
qcd_ymin = percent_diff(kwargs["qcd_min"], kwargs["qcd_y"])
qcd_ymax = percent_diff(kwargs["qcd_max"], kwargs["qcd_y"])
ymin = percent_diff(kwargs["ymin"], kwargs["qcd_y"])
ymax = percent_diff(kwargs["ymax"], kwargs["qcd_y"])
pdf_min = abs(percent_diff(kwargs["pdf_results"][0][2], kwargs["pdf_results"][0][1]))[:-1]
pdf_max = abs(percent_diff(kwargs["pdf_results"][0][3], kwargs["pdf_results"][0][1]))[:-1]

The idea is that, if you want to do something very basic, your script could be a regular Python script, possibly with 3-4 lines only, importing pineappl.plot and just stating what do you want.
Instead, if you want to do something more convolute, you can avoid seding a Python script, but rather make loops in Python and work with the data themselves (rather than their code representation). It's one layer less.

@alecandido
Copy link
Member Author

The API would be just the script, maybe slightly rearranged, but no more. I.e.:

  1. default constants
  2. helper functions: just percent_diff and set_ylim
  3. the main handler (that puts in the usual column the desired panels)
  4. the frequent plots definitions
  5. a data/metadata loader

Content of 2. and 3. are most likely always repeated.
Instead, 1. is intended as a default, but the customization requires code manipulation, rather than passing different values.
For 4. you would not be able to customize the code, but my guess is that you frequently don't need, since small modifications can be done by passing different values. If you instead need a quite different plot, you could make your own, even copying from the source one of the existing ones as an example (to me, it would be an improvement, but I expect this to rarely happen anyhow).

The last is just what we were discussing at the beginning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants