
quotes not rendered correctly in excerpt #981


Closed
michaelburch opened this issue Feb 11, 2022 · 5 comments

@michaelburch

When using @Html.Raw(document.GetString("Excerpt")) to display excerpt content on an archive page, as in the simple-archive example, HTML quotes are displayed as their encoded value, &quot;.

This began with Statiq.Web 1.0.0-beta.35 and continues today with 1.0.0-beta.42.

Repro here: https://github.com/michaelburch/Statiq.Web

Example when using Statiq.Web 1.0.0-beta.34:

simple-archive-beta 34

Example when using Statiq.Web 1.0.0-beta.35+:

simple-archive-beta 35

@michaelburch changed the title from "HTML quotes not rendered correctly in excerpt" to "quotes not rendered correctly in excerpt" on Feb 11, 2022

@daveaglick
Member

daveaglick commented Feb 11, 2022

Thanks for the repro - I can replicate on the Statiq examples page too as you noted. Looking into this now.

Note to self for context: this corresponds with Statiq Framework 1.0.0-beta.50, which is when all the Statiq.Html modules were moved into core. It's likely some behavior of AngleSharp may have regressed at that point (not in AngleSharp directly, more like in how it's being used).

@daveaglick
Member

This one is getting interesting. I actually can't reproduce with the raw GenerateExcerpt module - it's leaving the HTML in the excerpt exactly as it sees it, including quotes and other escapable content. Likewise, Html.Raw() is still working as expected too (i.e. it's not doing the escaping).

One possibility is that Markdig is actually doing the encoding before the excerpt is even generated. It was updated around the time this problem started. But that doesn't make complete sense either because even if it were the case, I'd expect the entity encoding just to flow right through and be rendered correctly in the browser. It's like it's being double-encoded (or at least I'll guess the ampersand is).

Still investigating, but the easiest answer, that it's the excerpt module, appears to be out. It's likely some combination of modules in the Statiq Web pipeline, so I'll need to do some integration testing to get to the bottom of it. More to come.

@daveaglick
Member

So my first hunch was correct, and is at least partially responsible - the RenderMarkdown module (and thus Markdig) is encoding the quotes when it renders Markdown content:

image

So when the document gets to the GenerateExcerpt module it's okay and contains encoded quotes, but that's valid HTML:

image

But then by the time AngleSharp has parsed the HTML content inside GenerateExcerpt to find the excerpt content, we've double-escaped the ampersand:

image

Now that I know where the problem is, it should be fairly simple to fix.
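
For readers following along, the symptom can be reproduced outside of Statiq with a few lines of Python. This is only an illustrative sketch: `html.escape` and `html.unescape` from the Python standard library stand in for Markdig's encoding, the second encoding pass, and the browser's decoding. None of it is Statiq, Markdig, or AngleSharp code, and the sample text is made up.

```python
from html import escape, unescape

# Hypothetical source text containing quotes.
markdown_text = 'She said "hello"'

# Pass 1 (stand-in for Markdown rendering): quotes become &quot;, which is valid HTML.
encoded_once = escape(markdown_text, quote=True)

# Pass 2 (stand-in for the unwanted second encoding): the & of &quot; becomes &amp;.
encoded_twice = escape(encoded_once, quote=True)

# A browser decodes character references exactly once when displaying text.
shown_after_one_pass = unescape(encoded_once)
shown_after_two_passes = unescape(encoded_twice)

print(encoded_once)            # She said &quot;hello&quot;
print(encoded_twice)           # She said &amp;quot;hello&amp;quot;
print(shown_after_one_pass)    # She said "hello"
print(shown_after_two_passes)  # She said &quot;hello&quot;
```

The last line is what shows up on the archive page: the browser undoes one layer of encoding and is left with a literal `&quot;`.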

@daveaglick
Member

...and now I know why it's happening and what changed. This is an unfortunate regression caused by my attempts to deal with an annoying problem with @ encoding. The Razor engine uses @ as the delimiter for C# instructions. So sometimes we want @ to be a literal. But other times, like when I use @ inside a Markdown document for something like an email address or Twitter handle, we don't want @ to be a literal because when Razor gets it, it'll interpret that as an instruction delimiter. So in those cases we have to encode the @. And there's the problem: some @ are encoded and others aren't. To preserve which is which when we need to do DOM processing with AngleSharp (like getting an excerpt), I told AngleSharp not to "consume" character references and to treat them like text. But then AngleSharp gets all smart and sees the & of a character reference, says "oh, this was just text so I need to encode that &", and does so. And so we end up with double encoding.

(BTW - I know that was a lot, just wanted to document what's going on in case I ever end up back here)
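
The trade-off described above can be imitated with Python's standard-library `html.parser`, purely as an analogy. The sketch assumes nothing about AngleSharp's actual API: `convert_charrefs=False` plays the role of "don't consume character references", and `TextCollector` and `round_trip` are hypothetical helper names invented for this example.

```python
from html import escape
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text content; optionally keeps character references as literal text."""

    def __init__(self, keep_references_as_text):
        # convert_charrefs=True decodes references while parsing ("consumes" them).
        super().__init__(convert_charrefs=not keep_references_as_text)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_entityref(self, name):
        # Only called when references are not consumed: keep "&quot;" as plain text.
        self.chunks.append(f"&{name};")

    def handle_charref(self, name):
        # Same for numeric references such as "&#64;".
        self.chunks.append(f"&#{name};")


def round_trip(html_text, keep_references_as_text):
    """Parse text, then serialize it again, escaping as a serializer would."""
    parser = TextCollector(keep_references_as_text)
    parser.feed(html_text)
    parser.close()
    return escape("".join(parser.chunks), quote=True)


source = "&quot;Hi&quot; &#64;dave"

# Consuming references: quotes survive, but the deliberate @ encoding is lost.
print(round_trip(source, keep_references_as_text=False))  # &quot;Hi&quot; @dave

# Not consuming references: the @ encoding is kept, but every & is encoded again.
print(round_trip(source, keep_references_as_text=True))   # &amp;quot;Hi&amp;quot; &amp;#64;dave
```

Neither setting alone gives the desired output, which is why keeping the encoded @ intact ended up double-encoding everything else.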

@daveaglick
Member

Fix confirmed:

image

I'll get a release out sometime this weekend. Thanks again for reporting this, turns out to have been a pretty major bug lurking around in the background!
