Improve `pineappl plot` Python interaction #249

alecandido · 2023-10-24T12:38:20Z

Currently, pineappl plot relies on Python and Matplotlib. This is actually quite nice, because people using PineAPPL are often familiar with Python, and know how to plot the information, once given.

A current mild pain-point is that there is a lot of string manipulation involved in the call, since a Python script is generated everytime, embedding the data:

pineappl/pineappl_cli/src/plot.rs

Lines 508 to 521 in 3476629

    
           print!( 
        
               include_str!("plot.py"), 
        
               inte = if bins == 1 { "" } else { "# " }, 
        
               nint = if bins == 1 { "# " } else { "" }, 
        
               pdfs = if pdfs == 1 || bins == 1 { "# " } else { "" }, 
        
               xlabel = xlabel, 
        
               ylabel = ylabel, 
        
               xlog = xlog, 
        
               ylog = ylog, 
        
               title = title, 
        
               output = output.to_str().unwrap(), 
        
               data = data_string, 
        
               metadata = format_metadata(&vector), 
        
           );

However, dumping the script is a design featured, in such a way that people are able to customize it.

I see a few possible ways of improving:

split the two pineappl plot operations: dump the script and the data separately (in a common data format, like CSV)
leverage pineappl_py in the script, and directly extract the data from the grid (requires the installation of the PineAPPL python extension)
move the plotting feature to pineappl_py itself (possibly with a pineappl-plot CLI, but it would be installed separately, at least until Distribute CLI as a Python package #176 will be completed)
use PyO3 to make the call from the CLI (i.e. from Rust) - in this way people will be able to customize the script and pass it again, but there won't be any need of dumping the data any longer (most likely this is an overkill)

The text was updated successfully, but these errors were encountered:

alecandido · 2023-10-24T12:38:49Z

This is not urgent at all, it is just a consideration that came to my mind reading #248, nothing more.

cschwan · 2023-10-24T13:09:28Z

I had similar considerations in the past, but

I like having the data and the script in one file, because then I can easily move/copy/rename the single file. This will become more complicated if we split things into more files.
Using pineappl_py in the Python plotting script is possible, but then we introduce two more dependencies: pineappl_py and LHAPDF. Especially the latter, particularly the Python version, is a pain in the neck to get working. Another downside is that the script will take much longer, which we probably want to avoid, given that we often work on the Python script to modify the plot, without having to regenerate the data.
Having the functionality available from Python we should definitely consider, although my plan was actually to eventually expose the entire CLI functionality in another, higher-level, crate, for which we then can write Python bindings, of course.
This suggestion I don't understand.

That being said I believe the plotting script(s) and functionality can be significantly improved. I was thinking of using a templating engine in such a way that the template is really kept in one file. This mostly is the case right now (in this file: https://github.com/NNPDF/pineappl/blob/master/pineappl_cli/src/plot.py), but writing out the data could be improved, for instance.

alecandido · 2023-10-24T14:15:05Z

Regarding the data handling, I see your point. However, I prefer to keep things separate, and rather gather through other means (e.g. folders or archives), since it makes it easier to consume them even for further purposes.
Fine to cache the results anyhow, but it could be implemented really as cache: if this file exists, use it, otherwise recompute it. The file name would stem from the grid, with the option to manually pass a file.

The idea of moving to Python is really to avoid the template (with or without a suitable engine). This is because I appreciate more libraries than code generators (less copy-paste, version management, and similar benefits). They are even easier to test and debug, because they are native in the language they are written in (while you'd need to use Python to test the script generated from Rust).

Since this part of the CLI functionality is already mostly Python, I would expose the interface through Python.
We could add the bindings for the rest in the same package of the plotting CLI.

P.S.: regarding point 4., the proposal was to write a Python function main(data, metadata), and make the call to main from Rust, through PyO3, such that the data are produced by Rust (with its own native data types) and marshalled to Python by PyO3, without the need of dumps in any file (not the source, nor anything else) - as I said, it's not needed, and it's also against your desiderata (i.e. caching)

cschwan · 2023-10-24T15:01:08Z

The idea of moving to Python is really to avoid the template (with or without a suitable engine). This is because I appreciate more libraries than code generators (less copy-paste, version management, and similar benefits). They are even easier to test and debug, because they are native in the language they are written in (while you'd need to use Python to test the script generated from Rust).

I think the crucial point that you dispute is whether we want to directly output the generated plot or a plotting script that in turn generates the plot (where the code sits is another question). I chose the latter for several reasons:

if the generated script is fine, I can run pineappl plot ... | python and directly get the plot; I don't need to bother with any additional files
however, very often, I want to modify the script a bit because because every plot has some quirks, and colleagues often have special wishes. In that case I can save the plotting script to a file, modify the file and run it to generate the plot. Alternatively, I can make the modifications automatically using tools that use the pipe (see @andreab1997's example here: https://github.com/NNPDF/papers/blob/master/nnpdf40mhou/plots/pheno/generate_pheno_plots.sh). Often, I have to iterate this several times until the plots look OK.

I don't think you can avoid having a code generator, unless you invent an API that sits on top of matplotlib, for instance (is that what you meant?). However, I don't see the added value of that - for me matplotlib IS the API to plot the data. Which advantages do you see?

alecandido · 2023-10-24T15:14:27Z

I don't think you can avoid having a code generator, unless you invent an API that sits on top of matplotlib, for instance (is that what you meant?). However, I don't see the added value of that - for me matplotlib IS the API to plot the data. Which advantages do you see?

Actually, this is the whole point, and nothing else.

Consider that partly you're already making such an API:

pineappl/pineappl_cli/src/plot.py

Lines 104 to 161 in e71b39a

    
           def percent_diff(a, b): 
        
               return (a / b - 1.0) * 100.0 
        
           def set_ylim(axis, save, load, filename): 
        
               # extract the y limits *not* considering margins 
        
               margins = axis.margins() 
        
               axis.margins(y=0.0) 
        
               ymin, ymax = axis.get_ylim() 
        
               axis.margins(y=margins[1]) 
        
               inc = 1.0 
        
               if (ymax - ymin) > 100.0: 
        
                   ymin = -50.0 
        
                   ymax = 50.0 
        
                   inc = 25.0 
        
               elif (ymax - ymin) > 30.5: 
        
                   inc = 10.0 
        
               elif (ymax - ymin) > 20.5: 
        
                   inc = 5.0 
        
               elif (ymax - ymin) > 10.5: 
        
                   inc = 2.0 
        
               elif (ymax - ymin) < 3.0: 
        
                   inc = 0.5 
        
               ymin = math.floor(ymin / inc) * inc 
        
               ymax = math.ceil(ymax / inc) * inc 
        
               if save: 
        
                   with open(filename, "wb") as f: 
        
                       pickle.dump([ymin, ymax, inc], f) 
        
               if load: 
        
                   resave = False 
        
                   with open(filename, "rb") as f: 
        
                       [saved_ymin, saved_ymax, saved_inc] = pickle.load(f) 
        
                   if saved_ymin < ymin: 
        
                       ymin = saved_ymin 
        
                       resave = True 
        
                   if saved_ymax > ymax: 
        
                       ymax = saved_ymax 
        
                       resave = True 
        
                   if saved_inc > inc: 
        
                       inc = saved_inc 
        
                       resave = True 
        
                   if resave: 
        
                       with open(filename, "wb") as f: 
        
                           pickle.dump([ymin, ymax, inc], f) 
        
               axis.set_yticks(np.arange(ymin, ymax + inc, inc)) 
        
               space = 0.05 * (ymax - ymin) 
        
               axis.set_ylim((ymin - space, ymax + space))

and the functions present have already enough arguments to be very customizable, without the need of patching them

pineappl/pineappl_cli/src/plot.py

Lines 275 to 284 in e71b39a

    
           def plot_rel_ewonoff(axis, **kwargs): 
        
               x = kwargs["x"] 
        
               y = percent_diff(kwargs["y"], kwargs["qcd_y"]) 
        
               qcd_y = percent_diff(kwargs["qcd_y"], kwargs["qcd_y"]) 
        
               qcd_ymin = percent_diff(kwargs["qcd_min"], kwargs["qcd_y"]) 
        
               qcd_ymax = percent_diff(kwargs["qcd_max"], kwargs["qcd_y"]) 
        
               ymin = percent_diff(kwargs["ymin"], kwargs["qcd_y"]) 
        
               ymax = percent_diff(kwargs["ymax"], kwargs["qcd_y"]) 
        
               pdf_min = abs(percent_diff(kwargs["pdf_results"][0][2], kwargs["pdf_results"][0][1]))[:-1] 
        
               pdf_max = abs(percent_diff(kwargs["pdf_results"][0][3], kwargs["pdf_results"][0][1]))[:-1]

The idea is that, if you want to do something very basic, your script could be a regular Python script, possibly with 3-4 lines only, importing pineappl.plot and just stating what do you want.
Instead, if you want to do something more convolute, you can avoid seding a Python script, but rather make loops in Python and work with the data themselves (rather than their code representation). It's one layer less.

alecandido · 2023-10-24T15:23:42Z

The API would be just the script, maybe slightly rearranged, but no more. I.e.:

default constants
helper functions: just percent_diff and set_ylim
the main handler (that puts in the usual column the desired panels)
the frequent plots definitions
a data/metadata loader

Content of 2. and 3. are most likely always repeated.
Instead, 1. is intended as a default, but the customization requires code manipulation, rather than passing different values.
For 4. you would not be able to customize the code, but my guess is that you frequently don't need, since small modifications can be done by passing different values. If you instead need a quite different plot, you could make your own, even copying from the source one of the existing ones as an example (to me, it would be an improvement, but I expect this to rarely happen anyhow).

The last is just what we were discussing at the beginning.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve `pineappl plot` Python interaction #249

Improve `pineappl plot` Python interaction #249

alecandido commented Oct 24, 2023

alecandido commented Oct 24, 2023 •

edited

cschwan commented Oct 24, 2023

alecandido commented Oct 24, 2023 •

edited

cschwan commented Oct 24, 2023

alecandido commented Oct 24, 2023

alecandido commented Oct 24, 2023

Improve pineappl plot Python interaction #249

Improve pineappl plot Python interaction #249

Comments

alecandido commented Oct 24, 2023

alecandido commented Oct 24, 2023 • edited

cschwan commented Oct 24, 2023

alecandido commented Oct 24, 2023 • edited

cschwan commented Oct 24, 2023

alecandido commented Oct 24, 2023

alecandido commented Oct 24, 2023

Improve `pineappl plot` Python interaction #249

Improve `pineappl plot` Python interaction #249

alecandido commented Oct 24, 2023 •

edited

alecandido commented Oct 24, 2023 •

edited