Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add automatically merging cells in tables #970

Open
stevenlis opened this issue Oct 19, 2023 · 10 comments
Open

Add automatically merging cells in tables #970

stevenlis opened this issue Oct 19, 2023 · 10 comments
Assignees

Comments

@stevenlis
Copy link

Please explain your intent
It is common to encounter merged cells in a table. Currently, you need to manually create a merged cell using colspan in row.cell(). It would be helpful if merged cells could be handled automatically, eliminating the need to loop over your data and add multiple nested if-else conditions.

Describe the solution you'd like
Assuming our table data appears as follows.

table_data = [
    ['0', '1', '2', '3'],
    ['A1', 'A2', '-', 'A4'],
    ['B1', '-', 'B3', 'B4'],
]

Here, we use '-' (or any other user-defined placeholder) to indicate an empty cell.

  • Empty cells can be merged with the cell before:
with pdf.table(width=120, auto_merging='before', empty_cell='-') as table

CS-zwUA0SLZ@2x

  • Or they can be merged with the cell after:
with pdf.table(width=120, auto_merging='after', empty_cell='-') as table

CS-hMdlh819@2x

@gmischler
Copy link
Collaborator

Welcome to fpdf2, @stevenlis !

Your suggestion is interesting, but raises a few tricky questions:

Is a specific text string really the optimal trigger? The same string might appear in the data as legitimate cell content, after all.

I expect that we'll eventually support row spans as well. How should we distinguish between the two cases in auto merging?

More generally speaking: Column and row spans are essentially metadata of the table. Is it really wise trying to derive those from the cell contents? While it might be possible to get it to work in a very specific set of circumstances, a system like that seems very prone to suffering from unintended consequences.

@stevenlis
Copy link
Author

@gmischler Thank you for response.

Indeed, there are definitely cases where you may want to leave an empty cell as it is, which is why I mention "any other user-defined placeholder" to provide users with flexibility.

Regarding row spans, perhaps we can use two different characters/strings to define the rules. For example, "-" for colspan and "|" for rowspan, or any other characters that users prefer.

I'm new to fpdf2 and my main challenge is that it requires looping over the table instead of taking the data as a whole. This approach provides flexibility in defining the style of each cell, but it would be easier if we had a function like write_table accompanied by a styler object for styling purposes only. An example of such API design can be found in pandas.

As for the table layout, I took inspiration from matplotlib's mosaic, which allows you to easily define the layout of a figure.

Maybe we can have a function that returns table_data and colspan mapping for each cell.

from fpdf import FPDF

pdf = FPDF()
pdf.set_font(family='helvetica')
pdf.add_page()

def remove_empty_map_colspan(table_data: list, empty_cell: str) -> tuple[list, list]:
    """
    Remove the table cells with the specified value for empty_cell,
    and maps the cleaned data based on the following rules:
    - If the item/cell is not empty, it is mapped to 1.
    - If the item/cell is empty, it is mapped to 2, and the next item in the row is skipped.

    Args:
        table_data (list): Nested list representing the table data.
        empty_cell (string): Value representing an empty cell.

    Returns:
        tuple: A tuple containing the mapped colspan data and the cleaned data.

    Example:
        table_data = [
            ['0', '1', '2', '3'],
            ['A1', 'A2', '', 'A4'],
            ['B1', '', 'B3', 'B4'],
        ]
        empty_removed_table, colspans = remove_empty_map_colspan(table_data)
        print(empty_removed_table)
        # Output: [['0', '1', '2', '3'], ['A1', 'A2', 'A4'], ['B1', 'B3', 'B4']]
        print(colspans)
        # Output: [[1, 1, 1, 1], [1, 1, 2], [1, 2, 1]]
    """

    colspans = []
    skip_next = False

    for row in table_data:
        new_row = []
        for item in row:
            if skip_next:
                skip_next = False
                continue
            if item != empty_cell:
                new_row.append(1)
            else:
                new_row.append(2)
                skip_next = True
        colspans.append(new_row)

    empty_removed_table = [
        [item for item in row if item != empty_cell] for row in table_data
    ]

    return empty_removed_table, colspans

table_data = [
    ['0', '1', '2', '3'],
    ['A1', 'A2', '-', 'A4'],
    ['B1', '-', 'B3', 'B4'],
]
table_data, colpans = remove_empty_map_colspan(table_data, empty_cell='-')

with pdf.table(width=120) as table:
    for data_row, colspan_row in zip(table_data, colpans):
        row = table.row()
        for cell, colspan in zip(data_row, colspan_row):
            row.cell(cell, colspan=colspan)

pdf.output('test-file.pdf')

CS-VHAWXuyd@2x

@gmischler
Copy link
Collaborator

The longer I think about it, the less comfortable I am with the idea of encoding structural information about a table in individual data items. This just looks loke bad coding and data management practise to me. Where do you encounter such data?

When fpdf2 started, it described itself as "simplistic". It has gained quite some additional functionality since then, so we have now arrived at "simple and straightforward", essentially following the Python motto of "make simple tasks easy, and complex tasks possible".

Tables are a very recent addition, and there's still room for improvement. But for simple data (without spans), you already can feed them in all at once. What you'd like to have though, is not simple. It's still possible, but you'll have to do some of the legwork yourself.

If there is no way to acquire your data in a more resonable form, then the best workaround is probably to write a wrapper for FPDF.table() which does more or less what your example above illustrates. This will make it straightforward for you to use, while keeping the data/metadata confusion away from the fpdf2 codebase

@stevenlis
Copy link
Author

Where do you encounter such data?

If you perform data analysis to generate reports or check simple statistics using crosstabs, it's a part of your daily routine. I understand that fpdf2 is not specifically designed for report generation but has a more general purpose. Nonetheless...

Thank you for the explanation. I don't think overwriting a class every time for simple things is very pythonic. Also, there's no need to overcomplicate things by using a context manager to create a table when it's already in a table structure. Anyway, that's just my opinion.

@gmischler
Copy link
Collaborator

Where do you encounter such data?

If you perform data analysis to generate reports or check simple statistics using crosstabs, it's a part of your daily routine.

Are you saying that data where column spans are defined through special cell contents are an inherent and necessary result of statistical analysis? Or are we really talking about different things here?

@stevenlis
Copy link
Author

To be honest, I'm not quite sure, but here's an example where I would like to generate summary statistics for different group breakdowns.

import seaborn as sns
df = sns.load_dataset('tips')
df_result = (
    df.groupby(['sex', 'smoker', 'time'])[['tip', 'size', 'total_bill']]
    .mean().round(1).reset_index()
    .pivot(index=['smoker', 'time'], columns=['sex'])
)
df_result

CS-wID7WoFX@2x

The one above is perhaps the most readable format, but I think we can also accept the following if rowspan is a challenge.

df_result.reset_index(inplace=True)
df_result.columns.names = [None, None]
df_result

CS-3s5YlqJM@2x

Then the issue is that how do we represent this structure, I think we one way is:

table_data = [
    ['smoker', 'time' , 'tip', '-', 'size', '-', 'total_bill', '-'],
    [' ', ' ', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'], 
    ['Yes', 'Lunch', 2.8, 2.9, 2.2, 2.3, 17.4, 17.4],
    ['Yes', 'Dinner', 3.1, 2.9, 2.6, 2.2, 23.6, 18.2],
    ['No', 'Lunch', 2.9, 2.5, 2.5, 2.5, 18.5, 15.9],
    ['No', 'Dinner', 3.2, 3.0, 2.8, 2.7, 20.1, 20.0]]

]

Each row represents a list with 8 items. An empty cell is denoted by a ' ' and a merging to the next cell is indicated by a '-'.

Of course, you can always convert it into a standard table structure. However, when including it in a PDF report, it becomes less readable.

df_result.columns = ['_'.join(col) for col in df_result.columns.values]
df_result

CS-K9qfsLf3@2x

I closed this because I don't want you to feel like I'm requesting something just for myself. However, it's a very common use case in my experience when you create any kind of report that includes a table.

@gmischler
Copy link
Collaborator

gmischler commented Oct 20, 2023

I guess I should have made the connection to #969 myself, eh? 🙄

So we're not talking about detecting spans in the actual grid data, but rather about how to get the labelling formatted correctly. That makes it a bit less weird.

Actually, in the context of a jupyter notebook, that data format makes perfect sense. The trick there is that jupyter simply doesn't draw any border lines. That way it doesn't have to do anything special with the empty cells, and they'll still give the visual appearance of a column or row span. What you're looking at there is essentially ASCII art...

When you try to put the same data in a PDF for print (and with border lines), the requirements change quite a bit. In an ideal world, pandas would offer an option to give you the data in a more explicitly structured form (maybe HTML?). I've never looked at it in any detail, so I have no idea if it does. If it doesn't though, then the next necessary step is "data cleanup", to transform the ASCII art hack into something more palatable. You were hoping for fpdf2 to handle this cleanup for you. I don't think that is the right place for it (I may not have the last word on it, so that is just my take).

From a software architecture perspective, I think this sort of problem would be most reasonably solved with a system of "data adapters". This could be a subclass as previously suggested, but a less tightly coupled approach might be even more flexible. Let's say we have a class that knows how to talk to a fpdf2 table. Then you create a subclass that can process and cleanup pandas output. Other subclasses can be added to handle any weird data produced by other packages. This keeps fpdf2 itself simple, while still enabling data exchange with pretty much anything out there.

Don't worry about bringing this up. The topic raises some interesting questions that are clearly of general interest. While I personally don't agree with your initial approach, we do have a strong interest in enabling data transfer from other software (there's a whole section about that in the docs). I'll reopen this for now to see if someone else has a better idea than what I've come up with so far.

@gmischler gmischler reopened this Oct 20, 2023
@afriedman412
Copy link

This is less about generalized formatting and more specifically formatting multi-level indexes, and that feels like a pretty widely applicable use case!

I'm going to take a crack at building out a more robust header/index functionality for tables, with an eye towards fixing this problem specifically.

@gmischler
Copy link
Collaborator

I'm going to take a crack at building out a more robust header/index functionality for tables, with an eye towards fixing this problem specifically.

Both row span and row labels are features that we could use in any case, as they are necessary to create more complex tables than is currently possible. If you can come up with an implementation that offers those and is still reasonably easy to use, more power to you! 👍

The question of allowing special data items (ie. empty strings) to control the table formatting probably needs some more input and consideration. I'd recomment to postpone that aspect to a later phase.

@afriedman412
Copy link

afriedman412 commented Oct 25, 2023

I added code that automatically formats a multi-index/multi-header dataframe so it looks like it does in pandas.

So this...

import seaborn as sns
from fpdf import FPDF
from fpdf.table import format_dataframe

data = sns.load_dataset('tips')
df = (
    data.groupby(['sex', 'smoker', 'time'])[['tip', 'size', 'total_bill']]
    .mean().round(1).reset_index()
    .pivot(index=['smoker', 'time'], columns=['sex'])
)

# new function
data = format_dataframe(df) # converts data to string and formats column and index labels

pdf = FPDF()
pdf.add_page()
pdf.set_font("Times", size=10)
with pdf.table() as table:
    for data_row in data:
        table.row(data_row)
pdf.output("table_from_pandas.pdf")

Now renders into this...

Screen Shot 2023-10-25 at 12 20 52 PM

Also if you pass include_index=False to format_dataframe() it will ignore the index.

The update also adds multi-index formatting that matches the current implementation of multi-header formatting.

Probably worth noting that I'm using "index" as the label for rows and "header" as label for the columns, and using "column" the way everyone seems to be using "span" to align with Pandas terminology. It's whatever but worth keeping an eye on.

Anyways lmk how this all looks and I'll add tests and better documentation...

(oh also I guess this got lumped in with the last pull request -- dunno if that matters though?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment