Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Table layout will break if the content of a cell requires more than one page #1142

Open
SMMousaviSP opened this issue Apr 4, 2024 · 14 comments

Comments

@SMMousaviSP
Copy link

When the content of one cell within a table is too long, the layout will break.
table.pdf

Minimal code

from fpdf import FPDF

WEIRD_LONG_TEXT = "Just generate a very long text with any lorem ipsum generator, make sure it is long enough to require more than one page"

TABLE_DATA = (
    ("First name", "Last name", "Age", "City"),
    ("Jules", "Smith", "34", "San Juan"),
    ("Mary", "Ramos", "45", WEIRD_LONG_TEXT),
    ("Carlson", "Banks", "19", "Los Angeles"),
)

pdf = FPDF()
pdf.add_page()
pdf.set_font("Times", size=16)
with pdf.table() as table:
    for data_row in TABLE_DATA:
        row = table.row()
        for datum in data_row:
            row.cell(datum)
pdf.output('table.pdf')

Environment
Please provide the following information:

  • Operating System: Fedora 39
  • Python version: 3.11.8
  • fpdf2 version used: 2.7.8
@Lucas-C
Copy link
Member

Lucas-C commented Apr 5, 2024

Hi @SMMousaviSP

When I tested your code with latest version of fpdf2 and long loremp ipsum text, I get this:
Capture d'écran 2024-04-05 141704

Given this size of your input text, it seems difficult to handle it "properly".
What did you expect? A cell break over a page break?
Could you please provide a bit more details on the behaviour you think would be best in this case?

@SMMousaviSP
Copy link
Author

Hello @Lucas-C
Thank you for your response.

What I had in mind was something like this. This is the default behavior of Microsoft Word.
table_in_word.pdf

I understand that if a text is too long, it may not make much sense to store it in a table. However, imagine a scenario where most of the time all cells of the table are going to be just one line or 2 to 3 lines and using the table would make sense and is a nice visual representation but once or twice a cell might become longer than expected. This will prevent the use of tables in automated tasks since we know it is possible for the table to break (even if it happens rarely).

As you saw in my first example, in the case of exceedingly long texts, it doesn't only break that specific row and the layout of the whole table gets broken.

@Lucas-C
Copy link
Member

Lucas-C commented Apr 9, 2024

Hi @SMMousaviSP

I had a closer look at the code, and I think this will be a difficult addition.

Basically, it would mean to A) detect this situation and B) split the row into several smaller rows, which is something I am bit relunctant to do, because a consequence would be for the table final rows count to be unpredictable for the user...

A first step / easier solution could be to A) detect this situation and B) raise an error in that case.
I opened PR #1143 to implement this.

What do you think about this strategy @SMMousaviSP?
Also pinging @mjasperse who recently worked on adding support for rowspan to tables.

@mjasperse
Copy link

I have some ideas for how to handle rendering a long cell across multiple pages, although I see the potential for a lot of edge cases where the desired behaviour is not necessarily clear. I'm happy to give it a go and see how far I get with a simplistic approach.

@SMMousaviSP
Copy link
Author

SMMousaviSP commented Apr 9, 2024

@Lucas-C Thank you for taking time and looking into this issue.

Would there be a reasonable solution to resolve this error when it happens? For example by somehow figuring out how much of the text is fitting to be able to either cut the rest or make a new row for that manually? Otherwise to be honest I prefer to have the broken table compared to not having any PDF at all.

Also I want to point out that I initially thought that the page break within a row is possible and this is just a bug cause I saw this comment in _render_table_cell function:

# draw the outer box separated by the gutter dimensions
# the top and bottom borders are one continuous line
# whereas the left and right borders are segments beause of possible pagebreaks

If there was not supposed to be any page break, why the left and right borders are segments and not continuous.

@Lucas-C
Copy link
Member

Lucas-C commented Apr 9, 2024

I have some ideas for how to handle rendering a long cell across multiple pages, although I see the potential for a lot of edge cases where the desired behaviour is not necessarily clear. I'm happy to give it a go and see how far I get with a simplistic approach.

You are very welcome to give it a try @mjasperse! 🙂
You can use PR #1143 as a starting point, or go your own way.

Would there be a reasonable solution to resolve this error when it happens? For example by somehow figuring out how much of the text is fitting to be able to either cut the rest or make a new row for that manually?

Good idea @SMMousaviSP, dropping extraneous text seems easier to implement than splitting the cell/row.
A first naive implementation i think of would be to use a dichotomic search in order to figure where in the text content to cut in order to maximize its length but still fit on the page.
If we adopt this approach, it would be nice to produce a log warning when such "cut" happens.

If there was not supposed to be any page break, why the left and right borders are segments and not continuous.

That's simply because we support page breaks inside tables, with rows spanning over several pages and we then repeat the heading row at the top of every page, but not inside cells 🙂

@mjasperse
Copy link

Since multi_cell already computes line breaks with word wrapping as part of the first layout pass, my idea was to use that to return the page break locations and recompute the table layout. I think it would be not so hard if there are no rowspans.

@Lucas-C
Copy link
Member

Lucas-C commented Apr 29, 2024

Since multi_cell already computes line breaks with word wrapping as part of the first layout pass, my idea was to use that to return the page break locations and recompute the table layout. I think it would be not so hard if there are no rowspans.

OK!
How do you suggest that fpdf2 should behave if rowspans are used then? A NotImplementedError would be raised?

@shoaib-moeen
Copy link

Hey @Lucas-C , Can I have a go at this one?

@Lucas-C
Copy link
Member

Lucas-C commented Apr 29, 2024

Hey @Lucas-C , Can I have a go at this one?

Well, it depends if @mjasperse wants to continue working further on this subject.

@mjasperse
Copy link

@shoaib-moeen you can give it a try! My simple idea broke rather quickly on the test suite because of scenarios like table cells with multiple "overflowing" cells, or images. If there is only one long cell, that is fairly easy to handle but I think the general-case complexity grows quickly.

Working on my initial implementation raised the following points that I would appreciate your thoughts on @Lucas-C

  1. When should a cell should be split versus when should the whole cell be moved to the next page? For example, if you had a cell that was 5 lines long where a pagebreak is anticipated after the first line, should it be broken? I think it would be better to move the whole cell to the next page in that case, so what is the cutoff? 3 lines? 5 lines? 50%? A user-specified parameter?
  2. Should the user have the ability to specify a cell cannot be split? This is relevant for cells containing images, which should never be broken and that has implications for whether the adjacent cells are split or moved.
  3. How should vertical alignment be handled? What does it mean to split a cell that is specified with vertical centering? Or a cell that now spans multiple pages because the adjacent cell is very large? Should as much as possible be put into the first page, or should it be centered in the "largest" part? I should check how Word handles this case.
  4. It think cells should not be split if a rowspan is involved because it adds too much complexity. If the combined rowspan exceeds the EPH I think throwing an exception might be OK. Can be annoying to debug since it doesn't generate any PDF output in that scenario - maybe it makes sense to truncate the table and throw the exception such that catching and calling pdf.output() will make an output that can be checked?

Maybe I'm overthinking the edge cases and the primary use-case here is handling one long cell and ensuring a sensible PDF is generated at all. However the first two points affect the existing test-suite cases.

Unfortunately I doubt I will have time to work on this implementation in the next two months because of work commitments, sorry that I wasn't able to come up with something faster.

@Lucas-C
Copy link
Member

Lucas-C commented May 2, 2024

  1. When should a cell should be split versus when should the whole cell be moved to the next page? For example, if you had a cell that was 5 lines long where a pagebreak is anticipated after the first line, should it be broken? I think it would be better to move the whole cell to the next page in that case, so what is the cutoff? 3 lines? 5 lines? 50%? A user-specified parameter?

IMHO, for now there is no need for fine-grained control there, but we should leave it open for further improvements later on:

  • for now, the behaviour could simply be naive, that is: in the case you mentioned, the 1st line would be rendered on the first page, and the other lines on the 2nd page, and that's it
  • the default behaviour should be the same as the current behaviour, i.e. NOT BREAK, to ensure backward compatibility
  • we could introduce a new optional parameter passed to Table constructor, maybe named break_cells with a default value of None and a valid enum value of IF_ROW_HEIGHT_REQUIRE_PAGE_BREAK
  • later on we will be able to add support more enum values, providing fine-grained control over cell-break logic
  1. Should the user have the ability to specify a cell cannot be split? This is relevant for cells containing images, which should never be broken and that has implications for whether the adjacent cells are split or moved.

No need for user-specified settings there, we could have a simple rule for now that cells containing images do not TRIGGER cell breaks, and if a cell break occurs on a row with images in other cells, those images will be rendered on the first line.
Again, finer control could be given later on to end users if need be, but aiming initially for a first simple implementation is often better IME.
What do you think of this plan @mjasperse?

  1. How should vertical alignment be handled? What does it mean to split a cell that is specified with vertical centering? Or a cell that now spans multiple pages because the adjacent cell is very large? Should as much as possible be put into the first page, or should it be centered in the "largest" part? I should check how Word handles this case.

Those are valid concerns, and I do not have a simple answer to them.
If a cell break occurs on a row, and some cells in that row have their content vertically aligned, we should consider two cases:

  • those cells are ALSO SPLITTED, which is probably the most complex case...
  • those cells are small enough to NOT REQUIRE SPLITTING, and in that case we could have simple rules: their content should 1) be rendered on the highest "cell portion", and 2) be vertically centered in the "cell portion" on the page where it is rendered.

The complexity of this scenario makes me wonder if we should really support this logic in fpdf2 or just A) detect the situation where a row cannot be rendered on a single page and B) raise an error in that case, like I suggested earlier and started implementing in PR #1143...
I'd be happy to have @gmischler opinion about his 🙂

  1. It think cells should not be split if a rowspan is involved because it adds too much complexity. If the combined rowspan exceeds the EPH I think throwing an exception might be OK. Can be annoying to debug since it doesn't generate any PDF output in that scenario - maybe it makes sense to truncate the table and throw the exception such that catching and calling pdf.output() will make an output that can be checked?

Agreed for throwing an exception, that's the best option.
To me, having a clear error message raised is the best solution to ease debugging!
The end user can than understand where the problem comes from, and change their own PDF rendereding logic to avoid those cases, for exemple by checking & splitting cell content prior to calling fpdf2.
I do not think that having pdf.output() still render a PDF document in that case is a good idea at all.

@gmischler
Copy link
Collaborator

  • When should a cell should be split versus when should the whole cell be moved to the next page? For example, if you had a cell that was 5 lines long where a pagebreak is anticipated after the first line, should it be broken? I think it would be better to move the whole cell to the next page in that case, so what is the cutoff? 3 lines? 5 lines? 50%? A user-specified parameter?

The typographical terms are "widows" (left-over line(s) after a page/column break) and "orphans" (left behind line(s) before a page/column break). High-end layout software usually allows to specify a minimum number for each (or no break if there are fewer lines than either limit).
This might become an issue for text_columns() as well, so we'll need a consistent solution in all relevant places. Given that table cells are also going to be converted into text regions, maybe we should postpone that question for after that has happened. In the end, it will probably rather apply to individual paragraph()s than columns or table cells.

  • Should the user have the ability to specify a cell cannot be split? This is relevant for cells containing images, which should never be broken and that has implications for whether the adjacent cells are split or moved.

This is an interesting idea, but will obviously only be enforcible if the cell contents fit on one page.
I don't think that images will really make any difference here though. The weird "image first, text after" rule currently in place will have to go away in text region based table cells anyway, since those will allow to add an arbitrary number of text chunks and images in any order.

  • How should vertical alignment be handled? What does it mean to split a cell that is specified with vertical centering? Or a cell that now spans multiple pages because the adjacent cell is very large? Should as much as possible be put into the first page, or should it be centered in the "largest" part? I should check how Word handles this case.

I would try to center content over the complete height of the column, no matter how many pages it spans. This makes the solution independent of whether the vertically centered content actually fits on any one of the involved pages. Top and bottom valigned content is obvious.

  • It think cells should not be split if a rowspan is involved because it adds too much complexity. If the combined rowspan exceeds the EPH I think throwing an exception might be OK. Can be annoying to debug since it doesn't generate any PDF output in that scenario - maybe it makes sense to truncate the table and throw the exception such that catching and calling pdf.output() will make an output that can be checked?

Let's cross that bridge when we get there... 😉
I'm not sure if that really needs to be a problem, though.

You may notice that I keep referring to a future conversion to text regions. This will involve pretty much a rewrite of the base mechanics of how tables are built, since text regions use different concepts than the currently used multi_cells. The new code will be more stringently structured, delegate more functionality to lower levels, and thus hopefully simplify the overall architecture of the module. I had started working on this conversion several months ago. But then the table code repeatedly sprouted new features left and right, which made it difficult to keep track and adapt the rewrite accordingly.
I think at some point we'll have to establish a feature-freeze on the table code until I had a chance to chatch up.

@Lucas-C
Copy link
Member

Lucas-C commented May 2, 2024

Thank you for your insight @gmischler 🙂

I think at some point we'll have to establish a feature-freeze on the table code until I had a chance to chatch up.

Sure, I agree with this idea, as long as it does not stall fpdf2 development, and that we still welcome contributions from GitHub developpers on other areas of fpdf2 source code 😊

You may notice that I keep referring to a future conversion to text regions. [...] I had started working on this conversion several months ago.

Maybe a next step regarding text regions could be to split the work and define clear GitHub issues that other contributors could pick up and work on?
This way the implementation of this feature would not only rely on your availability.

If I'm not mistaken, the initial design for text regions came from this discussion 2 years ago: #339
Maybe a new roadmap could be drafted based on the current state of things, in order to identify the steps required to get to the final set of features that you envision @gmischler?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants