Long HTML list entries are not correctly indented #1073

Open
dmail00 opened this issue Dec 24, 2023 · 24 comments · May be fixed by #1170

Comments

@dmail00

dmail00 commented Dec 24, 2023

Describe the bug
When list entries are longer than a line and wrap, the wrapped lines are not indented.

Minimal code

from fpdf import FPDF

pdf = FPDF()
pdf.add_page()
pdf.write_html(f"<ul><li>{'A ' * 200}</li></ul>")
pdf.output("test_long_list_entries.pdf")

Environment
Please provide the following information:

  • Operating System: Windows
  • Python version: 3.11.3
  • fpdf2 version used: 2.7.7
@dmail00 dmail00 added the bug label Dec 24, 2023
@Lucas-C Lucas-C added the html label Dec 30, 2023
@Lucas-C
Member

Lucas-C commented Dec 30, 2023

@gmischler: this seems related to the text regions / paragraph rendering logic: would you like to have a look at this? 🙂

@gmischler
Collaborator

@gmischler: this seems related to the text regions / paragraph rendering logic: would you like to have a look at this? 🙂

I think it has always behaved like this. Of course, that's no reason to keep it that way...

There are several possible solutions to this:

  • Create a dedicated ListParagraph() subclass of Paragraph() with a new_item(bullet="\u25cf") method. The bullet could be any Unicode character, a number, or a string (e.g. "22.7(5)"), or the int type for automatic numbering. That last feature would be the advantage of this approach. The disadvantage is the higher complexity of both the implementation and the API. It gets even hairier if we want automatic hierarchical numbering.

  • Add an indent argument to Paragraph(). We probably need that anyway for other purposes. If we combine that with a bullet argument (like above minus the automatic counting), then we already have the building blocks to create all kinds of lists. The disadvantage here is that the client code needs to handle the item numbering of ordered lists. The big advantage is the simplicity and flexibility. It might actually be possible to build an automatic numbering system on top of this later.
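To illustrate the second option, here is a minimal sketch that uses only the standard library rather than fpdf2. The function name render_item and its parameters are hypothetical; character columns stand in for real text widths. The client code does the numbering, and every wrapped line lines up with the first one.

```python
import textwrap


def render_item(text, bullet, indent=4, width=40):
    """Wrap one list item with a hanging indent (hypothetical helper).

    The bullet occupies the indent of the first line; every following
    line is padded by the same amount so the text stays aligned.
    """
    prefix = bullet.ljust(indent)
    return textwrap.wrap(
        text,
        width=width,
        initial_indent=prefix,
        subsequent_indent=" " * len(prefix),
    )


# The caller handles the numbering of an ordered list itself.
items = [("A " * 30).strip(), "short entry"]
lines = []
for number, item in enumerate(items, start=1):
    lines.extend(render_item(item, f"{number}."))

print("\n".join(lines))
```

The same building block covers unordered lists by passing a fixed bullet string instead of a counter.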

I'm currently still busy with other stuff, but I don't think this should be very hard if anyone wants to try.

@dmail00
Author

dmail00 commented Jan 3, 2024

There must be a reason for the current behaviour, but I am not aware of it; maybe one of you could enlighten me.

<ol>, <ul> and <blockquote> all create paragraphs and pdf has an l_margin. Why not keep a stack of l_margins and push onto it each time you see one of these tags and call pdf.set_left_margin with pdf.l_margin + (indent * modifier). Each time you see the end tag, pop the l_margin off the stack and again call pdf.set_left_margin with this value.

This solves all of these issues and I have been using this code without encountering any issues* so far. However, as I said, maybe there is something I am not aware of that makes this a bad idea.

*It does require modification of li elements so that it continues to use the current indent method when not inside a list.
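A rough sketch of the margin-stack idea described above. FakePDF is a hypothetical stand-in that exposes only l_margin and set_left_margin, so the snippet runs without fpdf2; MarginStack and its method names are also made up for illustration.

```python
class FakePDF:
    """Hypothetical stand-in with just the two members the idea relies on."""

    def __init__(self, l_margin=10.0):
        self.l_margin = l_margin

    def set_left_margin(self, margin):
        self.l_margin = margin


class MarginStack:
    """Push the left margin on an opening tag, pop it on the closing tag."""

    INDENTING_TAGS = {"ul", "ol", "blockquote"}

    def __init__(self, pdf, indent=5.0):
        self.pdf = pdf
        self.indent = indent
        self._saved = []

    def open_tag(self, tag, modifier=1.0):
        if tag in self.INDENTING_TAGS:
            # Remember the margin in effect before this tag was opened.
            self._saved.append(self.pdf.l_margin)
            self.pdf.set_left_margin(self.pdf.l_margin + self.indent * modifier)

    def close_tag(self, tag):
        if tag in self.INDENTING_TAGS and self._saved:
            self.pdf.set_left_margin(self._saved.pop())


pdf = FakePDF(l_margin=10.0)
margins = MarginStack(pdf, indent=5.0)
margins.open_tag("ul")   # left margin is now 15.0
margins.open_tag("ol")   # nested list: 20.0
margins.close_tag("ol")  # back to 15.0
margins.close_tag("ul")  # back to 10.0
```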

@gmischler
Collaborator

<ol>, <ul> and <blockquote> all create paragraphs and pdf has an l_margin. Why not keep a stack of l_margins and push onto it each time you see one of these tags and call pdf.set_left_margin with pdf.l_margin + (indent * modifier). Each time you see the end tag, pop the l_margin off the stack and again call pdf.set_left_margin with this value.

A cleaner solution than manipulating the global l_margin would be to give Paragraph()s an indent option. Other than that, i.e. on the HTML side, your idea makes a lot of sense.

@lcgeneralprojects

Hello. I want to start contributing to the project. Is the issue still open? If so, what needs to be done?

@andersonhc
Collaborator

Hello. I want to start contributing to the project. Is the issue still open? If so, what needs to be done?

At first glance you will need to modify Paragraph (to include an indent parameter) and TextRegion in text_region.py, and make some modifications in html.py to set the indent value on each paragraph.

On Paragraph, you will need to add the indent value to the margins when calling MultiLineBreak, and on TextRegion you need to apply the indent value to modify pdf.x before calling render_styled_text_line.
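The two steps can be sketched in isolation, without touching the real Paragraph, TextRegion or MultiLineBreak classes. The helper names break_lines and place_lines are hypothetical, and every character is assumed to have the same width: the indent first reduces the width available for line breaking, then shifts the x position at which each line is rendered.

```python
def break_lines(words, max_width, indent, char_width=1.0):
    """Greedy line breaking within the width left over after the indent."""
    usable = max_width - indent
    lines, current = [], []
    for word in words:
        candidate = current + [word]
        if current and len(" ".join(candidate)) * char_width > usable:
            lines.append(" ".join(current))
            current = [word]
        else:
            current = candidate
    if current:
        lines.append(" ".join(current))
    return lines


def place_lines(lines, left_margin, indent):
    """Pair each line with the x coordinate it should be rendered at."""
    return [(left_margin + indent, line) for line in lines]


wrapped = break_lines(["A"] * 10, max_width=10, indent=4)
placed = place_lines(wrapped, left_margin=10, indent=4)
for x, line in placed:
    print(x, line)
```

Because both steps use the same indent value, wrapped lines can neither overflow the right edge nor fall back to the left margin.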

@lcgeneralprojects

Apologies that it's taking me so long. I could not dedicate as much time as I intended on this project, but I seem to have finally internalised what the relevant parts of the code do. I have partially implemented the suggested changes, and now I have two questions:

  1. Do we want for each line to contain information about its intended indentation for rendering? I.e. when handling bodies of HTML tags, we are to check how the indentation (including leading whitespaces) works for the first line involving the tag, and have each subsequent line have matching indentation, correct?
  2. I have tried to experiment with MultiLineBreak margins, but, in the case of the code given in the original comment in this thread, that doesn't seem to change anything. Is that behaviour expected?

@Lucas-C
Member

Lucas-C commented Mar 29, 2024

Hi @lcgeneralprojects

Do not worry about the time it takes you; contributing to this project is volunteer work, and we all just do our best with the time we can dedicate to it 🙂

First, note that a first implementation of this tag indentation was added recently, in this PR: #1124

  1. [...] when handling bodies of HTML tags, we are to check how the indentation (including leading whitespaces) works for the first line involving the tag, and have each subsequent line have matching indentation, correct?

It also seems like the best behaviour to me!

  2. I have tried to experiment with MultiLineBreak margins, but, in the case of the code given in the original comment in this thread, that doesn't seem to change anything. Is that behaviour expected?

I'm not sure what exactly you experimented with. You could open a draft PR with the changes you attempted, so that we can better answer your question "Is that behaviour expected?" 🙂

@lcgeneralprojects

lcgeneralprojects commented Mar 30, 2024

First, note that a first implementation of this tag indentation was added recently, in this PR: #1124

I actually did forget that that should be accounted for. Thank you for reminding me.

I'm not sure what exactly you experimented with

I have tried both editing the values of MultiLineBreak.margins directly at runtime during debugging, as well as increasing Paragraph.indent (which is added to MultiLineBreak.margins[0] in Paragraph.build_lines(...) as I have implemented). I primarily did the testing using the code from the original post of this thread, but I don't see any difference when running the 'Hello, World!' code from the tutorial and the tag_indent example from the thread that you linked.

You could open a draft PR with the changes you attempted, so that we can better answer your question "Is that behaviour expected?"

That question is more general, as I don't see the difference with or without the changes introduced by me. Initially, I did the experimentation with the current fpdf2:master version of the library.

EDIT: Apologies for disappearing - I got sick. I am better now and will be resuming work on the project.

@lcgeneralprojects

lcgeneralprojects commented Apr 10, 2024

Current thoughts:

  1. I am going to have paragraphs keep information about indentation. In order to save time, I would like to ask how indentation, margins, and the x coordinate are currently handled when dealing with nested paragraphs, such as, for example, when making unordered lists within list items of other unordered lists. I want to know how I should handle paragraph.indents in these cases.
  2. I intend on modifying paragraph.build_lines by deducting paragraph.indent from multi_line_break.margins[0] after the first multi_line_break.get_line(). I don't see a particularly elegant way of doing this right now - will probably have to introduce a flag for this, - unless first_line = False can be put after multi_line_break.get_line(). I would like to ask for suggestions regarding how to handle this more elegantly. I think it will also be best if the lines are given an indent attribute.
  3. When dealing with TextRegion._render_column_lines(), I currently intend on changing self.pdf.x when text_line.indent changes with the new line. However, this seems very inelegant and I'd rather move s_start in TextRegion._render_styled_text_line() to the beginning of the method (or do something like x_coordinate = self.x and later initialise s_start = x_coordinate) and work with that, adding text_line.indent to it right away. However, when debugging before introducing any of my changes, I tried editing s_start for various lines, and couldn't notice any difference in the .pdf files that I got. Did I miss something?
  4. Currently, when calculating paragraph.indent, after writing a paragraph with code like self._write_paragraph("\u00a0" * self.tag_indents["blockquote"]) or self._write_paragraph(f"{indent_string}{bullet} "), I calculate the width of the last fragment of the paragraph, indent_width = self._paragraph._text_fragments[-1].get_width(). I intend on introducing a new method where such an indentation string is written into a paragraph and the width of the written fragment is used as the paragraph's indent value, but I fear that that might hurt readability. Tell me if I shouldn't do that.

I think I rubber-duckied myself out of the other questions and concerns that I had for now.

EDIT: And now I got problems with my OS becoming unresponsive on login. I will have to deal with that first.

@lcgeneralprojects

lcgeneralprojects commented Apr 22, 2024

EDIT2: apologies, this is me being dumb. I did not notice that the relevant files have multiple pages. I see the issues now.

I seem to have implemented the feature; however, some of the old tests are failing right now. Visually, I am not noticing any difference, but the hashes of the files do differ. qpdf is installed and its bin/ folders are in PATH, including for the current user, and it seems to run fine from the command line.

FAILED test/html/test_html.py::test_html_customize_ul - AssertionError: 42faf8eb2dba69a723f2eeeed4210fda != 17113d6d531ca70bc3b36e966a03387a       
FAILED test/html/test_html.py::test_html_ol_start_and_type - AssertionError: a3539c04323835c1f456d25c1245b93f != a185c6afa903dfa8d425e72f5e3c1d38  
FAILED test/html/test_html.py::test_html_ul_type - AssertionError: 54fd28b39dec3d75f65bde449bc73baa != a14d43290acc124877e63918bef64950
FAILED test/html/test_html.py::test_html_li_prefix_color - AssertionError: f0764680d6997107c40260259e29e577 != 412f8470aaa5abe6fb1de4d305a187a6    
FAILED test/html/test_html.py::test_html_ln_outside_p - AssertionError: 1f03de32572f3591e66229260c15f1d8 != a61515a838a16a6185876e0fb855c0fc       
FAILED test/html/test_html.py::test_html_li_tag_indent - AssertionError: 9d84fb22aae2a8de2c4fda65d67e35e2 != 1100de7dcbec06df8fd34e181ae99d9d      
FAILED test/html/test_html.py::test_html_ol_ul_line_height - AssertionError: 00b627bf3a30d0c63f3fb89094d52cbc != 88c018e1e06e5b152f7fec396df2ad26

There are also some jpype-related failed tests, but they fail without any changes done by me, as well.

I will try to examine what the issues are, but I would like to ask for help here. The relevant code is in the issue_1073 branch of the fork that I made.

Also, nested lists do not behave correctly, which seems to be caused by new paragraphs starting with a newline, meaning that a list nested within a previous one will be shifted by one line down from where it should be normally. I will be submitting a ticket shortly.

EDIT: qpdf seems to work now, but the issues persist. I am not noticing any differences so far:

FAILED test/html/test_html.py::test_html_features - AssertionError: assert [b'%PDF-1.3',...1 0 obj', ...] == [b'%PDF-1.3',...1 0 obj', ...]        
FAILED test/html/test_html.py::test_html_customize_ul - AssertionError: assert [b'%PDF-1.3',...1 0 obj', ...] == [b'%PDF-1.3',...1 0 obj', ...]    
FAILED test/html/test_html.py::test_html_ol_start_and_type - AssertionError: assert [b'%PDF-1.3',...1 0 obj', ...] == [b'%PDF-1.3',...1 0 obj', ...]
FAILED test/html/test_html.py::test_html_ul_type - AssertionError: assert [b'%PDF-1.3',...1 0 obj', ...] == [b'%PDF-1.3',...1 0 obj', ...]
FAILED test/html/test_html.py::test_html_li_prefix_color - AssertionError: assert [b'%PDF-1.3',...1 0 obj', ...] == [b'%PDF-1.3',...1 0 obj', ...] 
FAILED test/html/test_html.py::test_html_ln_outside_p - AssertionError: assert [b'%PDF-1.3',...1 0 obj', ...] == [b'%PDF-1.3',...1 0 obj', ...]    
FAILED test/html/test_html.py::test_html_li_tag_indent - AssertionError: assert [b'%PDF-1.3',...1 0 obj', ...] == [b'%PDF-1.3',...1 0 obj', ...]   
FAILED test/html/test_html.py::test_html_ol_ul_line_height - AssertionError: assert [b'%PDF-1.3',...1 0 obj', ...] == [b'%PDF-1.3',...1 0 obj', ...]

@Lucas-C
Member

Lucas-C commented Apr 23, 2024

I'm currently on holiday, but unless @gmischler or @andersonhc beats me to it, I'll get back to you on Monday 🙂

@lcgeneralprojects

I think I have a good idea of what the problem in my code is and will be solving it later tonight. I will likely need to design a way of checking if a relevant fragment of a paragraph is an 'indentation element', i.e. that it is a fragment which sets the indentation for the paragraph. The problem is that what I am checking seems to be a fragment with just a newline element, such as at the beginning of a paragraph, instead of things that precede HTML list items.

@gmischler
Collaborator

I think supporting different "indentation elements" within a single paragraph makes things unnecessarily complicated.
You probably chose to do it this way because the html module currently creates a paragraph for the <ul> and <ol> elements. But it doesn't have to remain that way. This arrangement is more of a shortcut I took when I was adapting the module to use text regions.

I recommend only setting a fixed indent and an optional bullet/numbering symbol when creating a paragraph, and never changing those while writing to it.
Then in the html module, don't open a paragraph for <ul> and <ol>, only set the appropriate internal state as is already the case. And then you can create a separate paragraph for each <li>, with the current indent width and symbol. Note that this gives you the additional possibility to vary the top and/or bottom margin of each list item, which may also help with #1148.

Btw: This is essentially also the way we're used to thinking about paragraphs on paper: a homogeneous block of text.
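As a rough illustration of that structure (not the actual html module), the following sketch uses the standard library's HTMLParser. The class name ListParagraphCollector, the indent_step parameter and the dict layout are all hypothetical. <ul> and <ol> only update internal state, while each <li> opens one paragraph whose indent and bullet are fixed at creation.

```python
from html.parser import HTMLParser


class ListParagraphCollector(HTMLParser):
    """Collect one paragraph record per <li> (illustrative only)."""

    def __init__(self, indent_step=8.0):
        super().__init__()
        self.indent_step = indent_step
        self.list_stack = []   # one [tag, item_counter] entry per open list
        self.paragraphs = []   # dicts with "indent", "bullet" and "text"
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            # No paragraph here: only remember that a list is open.
            self.list_stack.append([tag, 0])
        elif tag == "li" and self.list_stack:
            current_list = self.list_stack[-1]
            current_list[1] += 1
            bullet = "\u2022" if current_list[0] == "ul" else f"{current_list[1]}."
            self._current = {
                "indent": self.indent_step * len(self.list_stack),
                "bullet": bullet,
                "text": "",
            }
            self.paragraphs.append(self._current)

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self.list_stack:
            self.list_stack.pop()
        elif tag == "li":
            self._current = None

    def handle_data(self, data):
        # Text outside of an open <li> is ignored in this simplified sketch.
        if self._current is not None:
            self._current["text"] += data


collector = ListParagraphCollector()
collector.feed("<ul><li>one</li><li>two</li></ul><ol><li>first</li></ol>")
for paragraph in collector.paragraphs:
    print(paragraph)
```

Since indent and bullet never change after a paragraph is created, the rendering side has no need to tell "indentation fragments" apart from ordinary text.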

@lcgeneralprojects

I think supporting different "indentation elements" within a single paragraph makes things unnecessarily complicated

An 'indentation element' in this context would be something like this string f"{indent_string}{bullet} " in HTML2FPDF.handle_starttag() (I renamed indent to indent_string for greater clarity, as I initially wanted to also introduce indent_width in that same context, tell me if I should refactor it back) when processing a <li> starting tag.

When the relevant fragment is written into the paragraph, it is marked as an 'indentation element' via a new attribute of the Fragment class, and the paragraph.indent attribute is adjusted to match the sum of the widths of the fragments written as 'indentation elements' (in the current version of the code, only the last ones; I would like further input on this matter).

Later, in MultiLineBreak.get_line(), I intend for every line that doesn't start with a fragment that is an indentation element to be assigned an indent attribute value matching its corresponding paragraph.indent. There are some issues with this that I need to tackle.

My intention with having some fragments be identified as 'indentation elements' is to make it so that parts of the code that would dictate relevant indentation could be edited without the need to touch the code that would calculate indents for lines and paragraphs.

If I finish this code the way I envision it, it should be easy to generalise to the other HTML elements that feature indentation.

I recommend to only set a fixed indent and an optional bullet/numbering symbol when creating a paragraph, and never to change those while writing to it

A problem that I see with assigning indentation on paragraph creation in the cases of starting tags, such as <ol> and <ul> is that at that point we do not have information about the relevant fragments - that dictate indentation of the lines in <li> elements - that could be used without the introduction of a need to edit the things in multiple places. A way to solve this would be, of course, introducing constants (or variables) with the relevant fragments.
However, there would still need to be a way to determine which lines should be indented, and which lines should not be. I do not see a good way of systematically approaching this without marking some of the fragments as 'indentation elements' and excluding lines that start with those from being given additional indentation.

Then in the html module, don't open a paragraph for <ul> and <ol>, only set the appropriate internal state as is already the case. And then you can create a separate paragraph for each <li>, with the current indent width and symbol. Note that this gives you the additional possibility to vary the top and/or bottom margin of each list item, which may also help with #1148.

I did want to ask for a permission to at least make it so that each <li> gets a corresponding paragraph after I was finished with the current issues in my code. That does seem to be a more intuitive way of approaching this, and would mean that, in case of <li> elements, there would be only one 'indentation element' per paragraph.

@gmischler
Collaborator

An 'indentation element' in this context would be something like this string f"{indent_string}{bullet} "

This is exactly what we want to get rid of here. The paragraph should only receive the bullet and place that correctly during rendering, depending on the indent it has been given.

A problem that I see with assigning indentation on paragraph creation in the cases of starting tags, such as <ol> and <ul> is that at that point we do not have information about the relevant fragments

Firstly, I don't think there is any need to modify the Fragment code. That would only make things unnecessarily complicated.
Secondly, this is exactly why I recommended NOT to create a paragraph on <ol> and <ul>, but instead for each <li> separately. We know for certain that all the lines in a <li> will be indented the same.

I did want to ask for a permission to at least make it so that each <li> gets a corresponding paragraph after I was finished with the current issues in my code. That does seem to be a more intuitive way of approaching this, and would mean that, in case of <li> elements, there would be only one 'indentation element' per paragraph.

You don't need special permission to fix bad code. Turning the <ol> and <ul> into paragraphs was a quick-and-dirty hack on my side, which made it possible to use text regions with minimal changes to the legacy code. Now that you have identified the problems resulting from that, and are trying to use a more structured approach, you should choose the most elegant and simple solution right away.

@lcgeneralprojects

lcgeneralprojects commented May 2, 2024

I am implementing the indentation feature, with the input from the last reply taken into account. However, I want to ask for some advice:

  1. Should I instantiate a bullet fragment in order to render?
  2. If so, what should I take into account regarding the attributes of the fragment? In particular, how should the graphics_state and k attributes of the bullet fragment be determined?
  3. If not, how does one best determine the width of the bullet?
  4. How do I best determine the indentation between the bullet and the corresponding text line proper?

Currently, a piece of relevant code of mine in FPDF._render_styled_text_line() feels somewhat awkward

        if first_line and bullet:
            if text_line.align == Align.R:
                dx = w - l_c_margin - styled_txt_width
            elif text_line.align in [Align.C, Align.X]:
                dx = (w - styled_txt_width) / 2
            else:
                dx = l_c_margin
            bullet_normalized_string = self.normalize_text(bullet).replace("\r", "")
            # TODO: fix the following line
            bullet_fragment = self._preload_font_styles(bullet_normalized_string, markdown=False)
            bullet_width = bullet_fragment.get_width()
            sub_indent = 1
            bullet_text = bullet_fragment.render_pdf_text(
                0,
                current_ws,
                0,
                self.x - bullet_width - sub_indent + dx + s_width + indent,
                self.y + (0.5 * h + 0.3 * max_font_size),
                self.h,
            )
            if bullet_text:
                sl.append(bullet_text)

@lcgeneralprojects

I will try to make a pull request tomorrow or the day after. I will need to restructure some code, so that a bullet is a separate attribute of a paragraph, an instance of the Fragment class, and rendered separately from the other fragments.

Unless I get further input, I will be setting default tag indents and default indentation between bullets and corresponding lines to match the old indentation. I also have another question, but that falls under another topic.

@lcgeneralprojects

lcgeneralprojects commented May 12, 2024

Apologies for the delay. Had issues with both health and my primary IDE.

I have fixed the issue for <li> elements. The code can be seen at the 'issue_1073' branch of my fork. I'd like to ask for some advice.

  1. The way I implemented it, the indentation of <li> elements is now done using the same units that are used for Fragment.get_width(), instead of concatenating whitespace characters together. Also, tag_indent["li"] now denotes the distance not to the bullet, but to the line proper. This means that indentation of <li> elements is decoupled from font size, and multiple list items can predictably share the same indentation (ignoring their bullets). This behaviour is closer to how HTML seems to be usually rendered, but it conflicts with a few old tests. Tell me if this is unacceptable, or if I am allowed to remake the relevant tests.
  2. As a result of test_html_ol_ul_line_height, my code produces a bit less distance between some of the lines. I am yet to thoroughly delve into this, but the difference seems minor, visually.
  3. I have so far failed to consider HTML code like <p>this <dd>is</dd><blockquote>the</blockquote><li>same</li>paragraph, and testing shows that relevant code doesn't seem to be handled perfectly the way it is usually rendered. Unless somebody stops me, I will be making it so that FPDF objects remember that there are unclosed proper paragraphs before starting new ones dedicated to things like <dd> and <li>.
  4. My bullet-rendering code in FPDF._render_styled_text_line() is mostly a copy of how each fragment is handled, but extracting it into a method has thus far produced clunky results, as there are a lot of variables that need to be passed, and some of them are of immutable types, like last_used_color. I would like to ask how to best organise the relevant code into a method or a function.
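To make the first point concrete, here is a small sketch of why indentation built from whitespace characters depends on the font size. The constants are assumptions chosen for illustration (a space advance of 0.278 times the font size, and millimetres as the document unit); they are not values read from the fpdf2 code.

```python
# Assumed values, for illustration only.
SPACE_ADVANCE_EM = 0.278   # width of one space as a fraction of the font size
POINTS_PER_MM = 72 / 25.4  # PDF points per millimetre


def whitespace_indent_mm(n_spaces, font_size_pt):
    """Indent obtained by concatenating spaces: it scales with the font size."""
    return n_spaces * SPACE_ADVANCE_EM * font_size_pt / POINTS_PER_MM


FIXED_INDENT_MM = 8.0  # an indent given directly in document units

for font_size in (8, 12, 24):
    print(
        font_size,
        round(whitespace_indent_mm(8, font_size), 2),
        FIXED_INDENT_MM,
    )
```

With whitespace, list items set in different font sizes end up at different x positions; a fixed width keeps them aligned.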

edited by gmischler: fixed formatting

@gmischler
Collaborator

  1. The way I implemented it, the indentation of <li> elements is now done using the same units that are used for Fragment.get_width(), instead of concatenating whitespace characters together.

It took me a while to figure out what you mean by that, but I think you're doing the right thing here. Although I'm not sure if 7.831666666666665 is really a good default value. How did you arrive at that?

Also, tag_indent["li"] now denotes the distance not to the bullet, but to the line proper. This means that indentation of <li> elements is decoupled from font size, and multiple list items can predictably share the same indentation (ignoring their bullets).

That sounds like what we want.

This behaviour is closer to how HTML seems to be usually rendered, but it conflicts with a few old tests. Tell me if this is unacceptable, or if I am allowed to remake the relevant tests.

Your goal is to improve on the behaviour of the current code. If you manage that, then of course it is not only acceptable but necessary to adapt the test files to the new and more correct result.

  2. As a result of test_html_ol_ul_line_height, my code produces a bit less distance between some of the lines. I am yet to thoroughly delve into this, but the difference seems minor, visually.

The way that we currently handle line heights in HTML is in many places rather random. If you can make it more systematic and predictable, by all means do so.
For example, there's currently still a self._ln(2) in the code for the <li>, which for one is a rather arbitrary "magic number", and also doesn't take into account the document units.
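A unit-aware replacement for such a magic number could look roughly like the sketch below. The table and the function name are hypothetical; the factors are the usual number of PDF points (1/72 inch) per unit.

```python
# PDF points per document unit.
POINTS_PER_UNIT = {
    "pt": 1.0,
    "mm": 72 / 25.4,
    "cm": 72 / 2.54,
    "in": 72.0,
}


def mm_to_document_units(value_mm, unit):
    """Express a length given in millimetres in the document's own unit."""
    points = value_mm * POINTS_PER_UNIT["mm"]
    return points / POINTS_PER_UNIT[unit]


# A 2 mm gap stays 2 mm on paper whatever unit the document uses.
for unit in POINTS_PER_UNIT:
    print(unit, mm_to_document_units(2, unit))
```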

  3. I have so far failed to consider HTML code like <p>this <dd>is</dd><blockquote>the</blockquote><li>same</li>paragraph, and testing shows that relevant code doesn't seem to be handled perfectly the way it is usually rendered. Unless somebody stops me, I will be making it so that FPDF objects remember that there are unclosed proper paragraphs before starting new ones dedicated to things like <dd> and <li>.

This example is horribly invalid HTML. I recommend only worrying about the rendering of standards-conforming HTML.
Anyway, in detail:

  • <p>this - is one paragraph. The closing </p> is optional according to the standard.
  • <dd>is</dd> - is invalid HTML without a preceding <dt></dt>. We don't currently handle this correctly, and only insert a newline before each. The <dt><dd> combination should really start a new paragraph, since they are not allowed within a <p>.
  • <blockquote>the</blockquote> - This starts and ends a separate paragraph as usual.
  • <li>same</li> - This is invalid HTML outside of a <ul> or <ol>. But all the same, with your changes it will now start and end a paragraph of its own.

Despite its HTML defects, your snippet should result in 4 separate paragraphs (if dt/dd are handled correctly). In our code, _new_paragraph() already closes any previous ones that might still be open. So I'm not quite certain what you are worried about here, or whether this explanation answers all of your questions.

  4. My bullet-rendering code in FPDF._render_styled_text_line() is mostly a copy of how each fragment is handled, but extracting it into a method has thus far produced clunky results, as there are a lot of variables that need to be passed, and some of them are of immutable types, like last_used_color. I would like to ask how to best organise the relevant code into a method or a function.

You do NOT want to modify FPDF._render_styled_text_line() or anything in the line_break module for this AT ALL!
Those are general purpose text rendering methods, and there's nothing inherently different in "bullet" text than in any other text that would justify spreading special-case code all over the place.

Setting up the temporarily modified environment (current x/y etc.) correctly, creating a TextLine() with the Fragment()s for the bullet/numbering, and sending that to FPDF._render_styled_text_line() should all be handled within the text region code.
You need to keep these operations as separate from the handling of the normal text flow as humanly possible, or you're going to make things much more complicated than they need to be.

Consider for example that in the future, we might want an option to place the bullet vertically centered next to the paragraph instead of at the top, and the need to keep things localized and separate from each other will become much more obvious.
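One way to keep that "temporarily modified environment" local is a context manager, roughly as sketched here. Cursor is a hypothetical stand-in for the object holding the current x/y, and the draw callback stands in for whatever generic routine renders a line; nothing here is the actual text region code.

```python
from contextlib import contextmanager


class Cursor:
    """Hypothetical stand-in holding only the current position."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


@contextmanager
def temporary_position(pdf, x, y):
    """Move the cursor for the duration of the block, then restore it."""
    saved = (pdf.x, pdf.y)
    pdf.x, pdf.y = x, y
    try:
        yield pdf
    finally:
        pdf.x, pdf.y = saved


def render_bullet(pdf, bullet, indent, gap, bullet_width, draw):
    """Draw the bullet left of the indented text via the generic draw call."""
    bullet_x = pdf.x + indent - gap - bullet_width
    with temporary_position(pdf, bullet_x, pdf.y):
        draw(pdf.x, pdf.y, bullet)


drawn = []
cursor = Cursor(x=10.0, y=20.0)
render_bullet(
    cursor, "\u2022", indent=8.0, gap=1.0, bullet_width=2.0,
    draw=lambda x, y, text: drawn.append((x, y, text)),
)
```

The generic renderer never learns that it is drawing a bullet, and the normal text flow continues from the untouched cursor position afterwards.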

@lcgeneralprojects

Although I'm not sure if 7.831666666666665 is really a good default value. How did you arrive at that?

I got that from Fragment.get_width() of default indentation strings for unordered <li>.

This example is horribly invalid HTML

Seems to be rendered elsewhere, but I will adhere to the provided clarification.

You do NOT want to modify FPDF._render_styled_text_line() or anything in the line_break module for this AT ALL!

Alright. Will correct my code.

@lcgeneralprojects

Seems like I am mostly done. The current version of my code is in the issue_1073 branch of my fork.

What is left primarily is for me to go over the test conflicts again, and to make new ones.
However, there is one thing that I will need to deal with. In the current master branch of the actual repository, lists are intended to be singular paragraphs each, while, in preparation for dealing with the issue with nested paragraphs, I have made it so that every list item is a paragraph. This means that Paragraph.top_margin is added to pdf.y when rendering every individual list item, instead of at the beginning of every list. I will think of how to best handle this, but currently I am not seeing a more elegant solution than adding a flag attribute to every paragraph (at least, within lists) that would determine if a paragraph is the first one in a list.

Also, I did introduce a Bullet class for a Paragraph attribute to store the bullet's fragment, the bullet's text line, the information about the displacement of the bullet relative to the paragraph (rel_x_displacement for the distance from the rightmost point of the bullet to the lines of the paragraph, rel_y_displacement for the distance from the top of the paragraph to the top of the bullet), as well as the flag for whether the bullet has been rendered (as we only want to render one bullet per paragraph, and I couldn't find a more elegant solution). The Paragraph.bullet attribute is currently not protected, and neither are any of its own attributes, as they are currently used outside of the relevant classes and their methods. Please tell me if I should make any of the relevant attributes protected.
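For readers who have not looked at the branch, a simplified sketch of such a class follows. It is not the code from the fork: the fragment and text line are replaced by a plain string, and the position and claim_render methods are hypothetical additions. Only rel_x_displacement, rel_y_displacement and the rendered flag are taken from the description above.

```python
from dataclasses import dataclass


@dataclass
class Bullet:
    """Simplified stand-in for the Bullet class described above."""

    text: str
    rel_x_displacement: float = 2.0  # gap between bullet and paragraph text
    rel_y_displacement: float = 0.0  # offset from the top of the paragraph
    rendered: bool = False

    def position(self, paragraph_x, paragraph_top, indent, bullet_width):
        """Top-left corner of the bullet for a paragraph indented by indent."""
        x = paragraph_x + indent - self.rel_x_displacement - bullet_width
        y = paragraph_top + self.rel_y_displacement
        return x, y

    def claim_render(self):
        """Return True only the first time, so the bullet is drawn once."""
        if self.rendered:
            return False
        self.rendered = True
        return True


bullet = Bullet("\u2022", rel_x_displacement=2.0, rel_y_displacement=1.0)
print(bullet.position(paragraph_x=10.0, paragraph_top=50.0, indent=8.0, bullet_width=3.0))
```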

@gmischler
Collaborator

Although I'm not sure if 7.831666666666665 is really a good default value. How did you arrive at that?

I got that from Fragment.get_width() of default indentation strings for unordered <li>.

So it is based on arbitrary character widths of an arbitrary font. Can you use something more straightforward like 8 or 10 mm?

This example is horribly invalid HTML

Seems to be rendered elsewhere, but I will adhere to the provided clarification.

Browsers will make a best-effort attempt at rendering even the most horrible tag soup, which is one reason why the rendering engines have become such bloated monsters nowadays.
To some degree we should do the same, but for the sake of simplicity we'll have to be much more restrictive.
And our design decisions will clearly have to be based on the handling of standards conforming HTML as much as possible.

Seems like I am mostly done. The current version of my code is in the issue_1073 branch of my fork.

Cool!
Can you please create a (possibly draft-) PR from it? This will make it much easier for us to see what has actually changed relative to our HEAD, without having to sift through individual commits.

This means that Paragraph.top_margin is added to pdf.y when rendering every individual list item, instead of at the beginning of every list.

You'll have to do "something" with that when the <ul> or <ol> starts, instead of for each <li>. Either just insert a newline of the appropriate height, insert an empty paragraph (without a bullet), or whatever else makes sense and gives the desired effect.
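In layout terms, that amounts to something like the following sketch. The function layout_list and its parameters are hypothetical; item heights are passed in directly instead of being measured.

```python
def layout_list(item_heights, y, list_top_margin):
    """Return the y position of each list item, adding the margin only once."""
    y += list_top_margin  # applied when the <ul>/<ol> opens, not per <li>
    positions = []
    for height in item_heights:
        positions.append(y)
        y += height
    return positions, y


positions, end_y = layout_list([5.0, 10.0, 5.0], y=20.0, list_top_margin=3.0)
print(positions, end_y)
```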

The Paragraph.bullet attribute is currently not protected, and neither are any of its own attributes,

We don't currently have a very stringent concept of what stuff should be protected and what shouldn't.
For the moment it is fine to leave everything unprotected in your additions. That's a detail we can worry about at a later time and in a more general context.

@lcgeneralprojects lcgeneralprojects linked a pull request May 17, 2024 that will close this issue
5 tasks
@lcgeneralprojects

lcgeneralprojects commented May 19, 2024

Apologies - I got sick again. Will try to finish with the tests tomorrow.

Currently, I do see an issue with my code. If a <ul> or <ol> is the first element to be rendered, a 'pseudo' top margin will still be created for its first <li> child, in the form of a paragraph with a '\n' line of a certain height. This means that if the first visible thing to be rendered in a document is a <li> child of a list, it will be displaced downward the way lists normally are when there are preceding elements. This also applies to new pages.

This displacement is not particularly noticeable with small margins for lists, but this does not sit well with me, and it might cause issues with large margins.

I'd like to ask for some advice on the matter.

5 participants