Try and preserve the structure of the html during a diff #350

lonetwin · 2022-09-12T18:11:53Z

There exists a bug in the current htmldiff code, where by the generated diff
changes the structure of the html (notice that the <div id="middle">
appears at the beginning instead of the middle):

>>> from lxml.html import diff
>>> a = "<div id='first'>some old text</div><div id='last'>more old text</div>"
>>> b = "<div id='first'>some old text</div><div id='middle'>and new text</div><div id='last'>more old text</div>"
>>> diff.htmldiff(a, b)
('<div id="middle"> <div id="first"><ins>some old text</ins></div><ins>and new</ins> <del>some old</del> text</div><div id="last">more old text</div>')
>>>

This patchset is an attempt to fix that issue.

There exists a bug in the current `htmldiff` code, where by the generated diff changes the structure of the html: ```python >>> from lxml.html import diff >>> a = "<div id='first'>some old text</div><div id='last'>more old text</div>" >>> b = "<div id='first'>some old text</div><div id='middle'>and new text</div><div id='last'>more old text</div>" >>> diff.htmldiff(a, b) ('<div id="middle"> <div id="first"><ins>some old text</ins></div><ins>and new</ins> <del>some old</del> text</div><div id="last">more old text</div>') >>> ``` This patchset is an attempt to fix that issue.

…s and string interpolation.

…_start ...also some minor clean up.

scoder · 2022-09-26T06:37:38Z

src/lxml/html/tests/test_diff.txt

-    <a href="http://google.com">search <ins> Link: http://google.com</ins>
-    <del> Link: http://yahoo.com</del> </a>
+    <a href="http://google.com">search <ins> Link: http://google.com
+    </ins> <del> Link: http://yahoo.com</del> </a>


Was this a necessary change? It doesn't feel right.

tbh, didn't feel right to me either but this test breaks without it, and I didn't have much luck trying to figure out why. Initial instinct was that it has something to do with the re.sub in the pwrapped definition at the top of the test file (possibly the + in the re over matches ?). Apologies, I didn't try digging deeper but I'll give it a shot now.

So the underlying cause for this appears to be an extra trailing whitespace that appears to get added to the href_token which pushes the line length of the string to over 70 (the line length limit of textwrap, used by pdiff). Here are the diffs with the current lxml code and one taken with this patchset:

current lxml diff:

<a href="http://google.com">search <ins> Link: http://google.com</ins> <del> Link: http://yahoo.com</del> </a>

diff with this patchset:
<a href="http://google.com">search <ins> Link: http://google.com </ins> <del> Link: http://yahoo.com</del> </a>

What I'm not entirely sure about is why the trailing whitespace which ideally should also be present in the current lxml diff (according to call to href_token in fixup_chunks), isn't.

In any case, would you consider this as a regression in behavior worthy of a fix or does this fix of the test case suffice ?

What strikes me is that there is a space before the closing </ins> but not before the closing </del>. Given that there is a space after both of the opening <ins> and <del>, it's debatable whether the new behaviour isn't really better than the old. But I would expect both tag types to be consistent, at least.

I agree. I'll investigate and attempt to make them consistent.

scoder · 2022-09-26T06:38:10Z

src/lxml/html/tests/test_diff.txt

@@ -87,6 +87,13 @@ Whitespace is generally ignored for the diff but preserved during the diff:
    second
     <ins>third</ins> </pre>

+Ensure we don't preserve the structure of the html during the diff:


Ah sorry, that was a brain fart ! Good catch. I think I might have wanted to add something like 'Ensure we don't mess with the structure ...' and instead decided to go with the opposite phrasing 'Ensue we preserve ...' but obviously messed up. I'll fix in the next commit.

scoder · 2022-09-26T06:40:42Z

src/lxml/html/diff.py

+    for chunk in ins_chunks:
+        if chunk in balanced:
+            doc.append(chunk)


This seems risky. I'm not sure that we can assume chunks to be unique enough to just do an "in" test here.

Maybe it could help to put the three parts back together, but give each chunk a marker like (chunk, "balanced") and (chunk, "unbalanced"), and then do something based on the marker. Or look at the index boundaries where we switch between the unbalanced and balanced parts.

Hmm, agreed that we can't assume the chunks to be unique enough to just do an 'in' test here. I'll think about your suggestion and give this another go. Thanks for your review.

We ran into an issue rising out of precisely the problem you caught ! The visible effect being that the <ins> tags were not being added at places where they ought to be.

I've updated the approach by changing split_unbalanced to optionally return a list of tuples containing the (<index-in-the-input=chunks-list>, <chunk>). This then serves to uniquely identify the chunk across the balanced/unbalanced lists ...which then makes the in operator reliable.

I've defaulted to retain the existing behavior of split_unbalanced since it is also called from cleanup_delete and I didn't want to touch that function.

Hopefully, you find the changes suitable.

…merge_insert` In the recently revised implementation of `merge_insert` we were checking whether a tag exists in the `balanced/unbalanced` tags list by referring to just the tag itself. This is unreliable since the same tag might exist in both list. Incorrect identification leads to skipping of the `<ins>` tags in some cases, resulting in incorrect diff being rendered. This pathset fixes the issue described and adds a test to validate it.

scoder · 2022-12-15T12:08:30Z

src/lxml/html/tests/test_diff.txt

-    <a href="http://google.com">search <ins> Link: http://google.com</ins>
-    <del> Link: http://yahoo.com</del> </a>
+    <a href="http://google.com">search <ins> Link: http://google.com
+    </ins> <del> Link: http://yahoo.com</del> </a>


What strikes me is that there is a space before the closing </ins> but not before the closing </del>. Given that there is a space after both of the opening <ins> and <del>, it's debatable whether the new behaviour isn't really better than the old. But I would expect both tag types to be consistent, at least.

scoder · 2022-12-15T12:14:38Z

src/lxml/html/diff.py

            balanced.append(None)
    start.extend(
-        [chunk for name, pos, chunk in tag_stack])
+        [chunk_to_add for name, pos, chunk_to_add in tag_stack])
    balanced = [chunk for chunk in balanced if chunk is not None]


The condition in this line probably doesn't work any more if with_idx=True.

Umm, wouldn't it ? Here balanced would be something like [(1, '<div>'), None, ...] if with_idx==True, so the if chunk is not None check still does the right thing.

scoder · 2022-12-15T12:28:29Z

src/lxml/html/tests/test_diff.txt

+    >>> a = "<div><p>Some text that will change</p><p>Some tags will be added</p></div>"
+    >>> b = "<div><div><p>Some text that has changed a bit</p><p>All of this is new</p></div></div>"
+    >>> pdiff(a, b)
+    <div><div><p>Some text that <ins>has changed a bit</ins> </p>
+    <p><ins>All of this is new</ins></p> </div> <del>will
+    change</del><p><del>Some tags will be added</del></p> </div>


This diff looks weird. Is it getting confused by the nested <div>s ?

Yes, that's correct. In this specific case, the rendered diff looks 'ok' to me (with the insertions preceding the deletions) but yeah, I imagine in larger context it might be less than ideal. I'm going to take another stab at this.

btw, fwiw, this isn't very different from the diff generated by the current code for the same input. The only difference being the closing </p> tag of the first input paragraph is placed after the </ins> instead of at the end of the changes for that paragraph.

current lxml diff:

<div><div><p>Some text that <ins>has changed a bit</ins><p><ins>All of this is new</ins></p> </p></div> <del>will change</del><p><del>Some tags will be added</del></p> </div>

diff generated with this patchset

<div><div><p>Some text that <ins>has changed a bit</ins> </p> <p><ins>All of this is new</ins></p> </div> <del>will change</del><p><del>Some tags will be added</del></p> </div>

lonetwin and others added 5 commits September 12, 2022 19:11

Attempt to be more consistent with rest of the code in terms of quote…

c4c504c

…s and string interpolation.

Missed one instance of consistent quoting in the last commit.

b340f0a

Slight cleanup

66e89a8

Use the same approach for unbalanced_end tags as we do for unbalanced…

f4c38c6

…_start ...also some minor clean up.

scoder reviewed Sep 26, 2022

View reviewed changes

lonetwin requested a review from scoder December 15, 2022 11:57

scoder reviewed Dec 15, 2022

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Try and preserve the structure of the html during a diff #350

Try and preserve the structure of the html during a diff #350

lonetwin commented Sep 12, 2022 •

edited

scoder Sep 26, 2022

lonetwin Sep 26, 2022

lonetwin Dec 15, 2022 •

edited

scoder Dec 15, 2022

lonetwin Dec 15, 2022

scoder Sep 26, 2022

lonetwin Sep 26, 2022 •

edited

lonetwin Dec 15, 2022

scoder Sep 26, 2022

lonetwin Sep 26, 2022

lonetwin Dec 15, 2022

scoder Dec 15, 2022

scoder Dec 15, 2022

lonetwin Dec 15, 2022

scoder Dec 15, 2022

lonetwin Dec 15, 2022

lonetwin Dec 15, 2022 •

edited

Try and preserve the structure of the html during a diff #350

Are you sure you want to change the base?

Try and preserve the structure of the html during a diff #350

Conversation

lonetwin commented Sep 12, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lonetwin Dec 15, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lonetwin Sep 26, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lonetwin Dec 15, 2022 • edited

Choose a reason for hiding this comment

lonetwin commented Sep 12, 2022 •

edited

lonetwin Dec 15, 2022 •

edited

lonetwin Sep 26, 2022 •

edited

lonetwin Dec 15, 2022 •

edited