New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Try and preserve the structure of the html during a diff #350
base: master
Are you sure you want to change the base?
Conversation
There exists a bug in the current `htmldiff` code, where by the generated diff changes the structure of the html: ```python >>> from lxml.html import diff >>> a = "<div id='first'>some old text</div><div id='last'>more old text</div>" >>> b = "<div id='first'>some old text</div><div id='middle'>and new text</div><div id='last'>more old text</div>" >>> diff.htmldiff(a, b) ('<div id="middle"> <div id="first"><ins>some old text</ins></div><ins>and new</ins> <del>some old</del> text</div><div id="last">more old text</div>') >>> ``` This patchset is an attempt to fix that issue.
…s and string interpolation.
…_start ...also some minor clean up.
<a href="http://google.com">search <ins> Link: http://google.com</ins> | ||
<del> Link: http://yahoo.com</del> </a> | ||
<a href="http://google.com">search <ins> Link: http://google.com | ||
</ins> <del> Link: http://yahoo.com</del> </a> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this a necessary change? It doesn't feel right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tbh, didn't feel right to me either but this test breaks without it, and I didn't have much luck trying to figure out why. Initial instinct was that it has something to do with the re.sub
in the pwrapped
definition at the top of the test file (possibly the +
in the re over matches ?). Apologies, I didn't try digging deeper but I'll give it a shot now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the underlying cause for this appears to be an extra trailing whitespace that appears to get added to the href_token
which pushes the line length of the string to over 70 (the line length limit of textwrap
, used by pdiff
). Here are the diffs with the current lxml code and one taken with this patchset:
current lxml diff:
<a href="http://google.com">search <ins> Link: http://google.com</ins> <del> Link: http://yahoo.com</del> </a>
diff with this patchset:
<a href="http://google.com">search <ins> Link: http://google.com </ins> <del> Link: http://yahoo.com</del> </a>
What I'm not entirely sure about is why the trailing whitespace which ideally should also be present in the current lxml diff (according to call to href_token
in fixup_chunks
), isn't.
In any case, would you consider this as a regression in behavior worthy of a fix or does this fix of the test case suffice ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What strikes me is that there is a space before the closing </ins>
but not before the closing </del>
. Given that there is a space after both of the opening <ins>
and <del>
, it's debatable whether the new behaviour isn't really better than the old. But I would expect both tag types to be consistent, at least.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I'll investigate and attempt to make them consistent.
src/lxml/html/tests/test_diff.txt
Outdated
@@ -87,6 +87,13 @@ Whitespace is generally ignored for the diff but preserved during the diff: | |||
second | |||
<ins>third</ins> </pre> | |||
|
|||
Ensure we don't preserve the structure of the html during the diff: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"don't" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry, that was a brain fart ! Good catch. I think I might have wanted to add something like 'Ensure we don't mess with the structure ...' and instead decided to go with the opposite phrasing 'Ensue we preserve ...' but obviously messed up. I'll fix in the next commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/lxml/html/diff.py
Outdated
for chunk in ins_chunks: | ||
if chunk in balanced: | ||
doc.append(chunk) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems risky. I'm not sure that we can assume chunks to be unique enough to just do an "in" test here.
Maybe it could help to put the three parts back together, but give each chunk a marker like (chunk, "balanced")
and (chunk, "unbalanced")
, and then do something based on the marker. Or look at the index boundaries where we switch between the unbalanced and balanced parts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, agreed that we can't assume the chunks to be unique enough to just do an 'in' test here. I'll think about your suggestion and give this another go. Thanks for your review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We ran into an issue rising out of precisely the problem you caught ! The visible effect being that the <ins>
tags were not being added at places where they ought to be.
I've updated the approach by changing split_unbalanced
to optionally return a list of tuples containing the (<index-in-the-input=chunks-list>, <chunk>)
. This then serves to uniquely identify the chunk across the balanced/unbalanced lists ...which then makes the in
operator reliable.
I've defaulted to retain the existing behavior of split_unbalanced
since it is also called from cleanup_delete
and I didn't want to touch that function.
Hopefully, you find the changes suitable.
…merge_insert` In the recently revised implementation of `merge_insert` we were checking whether a tag exists in the `balanced/unbalanced` tags list by referring to just the tag itself. This is unreliable since the same tag might exist in both list. Incorrect identification leads to skipping of the `<ins>` tags in some cases, resulting in incorrect diff being rendered. This pathset fixes the issue described and adds a test to validate it.
<a href="http://google.com">search <ins> Link: http://google.com</ins> | ||
<del> Link: http://yahoo.com</del> </a> | ||
<a href="http://google.com">search <ins> Link: http://google.com | ||
</ins> <del> Link: http://yahoo.com</del> </a> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What strikes me is that there is a space before the closing </ins>
but not before the closing </del>
. Given that there is a space after both of the opening <ins>
and <del>
, it's debatable whether the new behaviour isn't really better than the old. But I would expect both tag types to be consistent, at least.
balanced.append(None) | ||
start.extend( | ||
[chunk for name, pos, chunk in tag_stack]) | ||
[chunk_to_add for name, pos, chunk_to_add in tag_stack]) | ||
balanced = [chunk for chunk in balanced if chunk is not None] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition in this line probably doesn't work any more if with_idx=True
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Umm, wouldn't it ? Here balanced
would be something like [(1, '<div>'), None, ...]
if with_idx==True
, so the if chunk is not None
check still does the right thing.
>>> a = "<div><p>Some text that will change</p><p>Some tags will be added</p></div>" | ||
>>> b = "<div><div><p>Some text that has changed a bit</p><p>All of this is new</p></div></div>" | ||
>>> pdiff(a, b) | ||
<div><div><p>Some text that <ins>has changed a bit</ins> </p> | ||
<p><ins>All of this is new</ins></p> </div> <del>will | ||
change</del><p><del>Some tags will be added</del></p> </div> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This diff looks weird. Is it getting confused by the nested <div>
s ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's correct. In this specific case, the rendered diff looks 'ok' to me (with the insertions preceding the deletions) but yeah, I imagine in larger context it might be less than ideal. I'm going to take another stab at this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw, fwiw, this isn't very different from the diff generated by the current code for the same input. The only difference being the closing </p>
tag of the first input paragraph is placed after the </ins>
instead of at the end of the changes for that paragraph.
current lxml diff:
<div><div><p>Some text that <ins>has changed a bit</ins><p><ins>All of
this is new</ins></p> </p></div> <del>will change</del><p><del>Some
tags will be added</del></p> </div>
diff generated with this patchset
<div><div><p>Some text that <ins>has changed a bit</ins> </p>
<p><ins>All of this is new</ins></p> </div> <del>will
change</del><p><del>Some tags will be added</del></p> </div>
There exists a bug in the current
htmldiff
code, where by the generated diffchanges the structure of the html (notice that the
<div id="middle">
appears at the beginning instead of the middle):
This patchset is an attempt to fix that issue.