Improve Interactivity API store JSON encoding #6520

Closed
wants to merge 12 commits into from

Conversation

sirreal
Member

@sirreal sirreal commented May 7, 2024

Improve JSON serialization of the Interactivity API store. Based on this conversation: #6433 (review)

There is a more detailed analysis below, but in summary:

  • This data is JSON encoded and printed in a <script type="application/json"> tag.
  • If we ensure that < is never printed inside the data, it should be impossible to break out of the script tag and the browser treats everything as the element's textContent.
  • All other escaping becomes unnecessary at that point, including unicode escaping if the page uses the UTF-8 charset (the same encoding as JSON).

By using as much utf-8 text without redundant escaping, the output generally becomes smaller and more readable.
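As a rough illustration of the size and readability difference, here is a minimal sketch using plain `json_encode()` rather than `wp_json_encode()`. The sample data is made up, and the "heavily escaped" flag combination is chosen here only for comparison.

```php
<?php
// Sample store data; keys and values are hypothetical.
$store = array(
	'state' => array(
		'myPlugin' => array(
			'label' => 'Café 🅰',
			'url'   => 'https://example.com/a/b',
			'html'  => '</script>',
		),
	),
);

// A more heavily escaped combination, used here only as a baseline.
$escaped = json_encode( $store, JSON_HEX_TAG | JSON_HEX_AMP );

// The combination discussed in this PR: only `<` and `>` are escaped.
$readable = json_encode(
	$store,
	JSON_HEX_TAG | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_LINE_TERMINATORS
);

echo $escaped, PHP_EOL;
echo $readable, PHP_EOL;
printf( "%d bytes vs. %d bytes\n", strlen( $escaped ), strlen( $readable ) );
```

Both strings decode to the same data; the second keeps the URL and the non-ASCII text legible while still containing no literal `<`.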

Props: @anomiex

Trac ticket: https://core.trac.wordpress.org/ticket/61170


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

@sirreal
Member Author

sirreal commented May 9, 2024

@westonruter brought up a point about JSON_UNESCAPED_SLASHES (#6433 (comment)). I'll reply here.

Thanks for flagging that! This is interesting feedback, do you have more information on what an attack might look like?

I decided to investigate more deeply and reviewed the standard. As far as I can tell, breaking out of a script tag requires a literal sequence like </script. Escaping either the < (as a JSON unicode escape) or the / (as \/) seems sufficient based on my interpretation, but scrutiny would be very welcome.

To close a script tag, we must find:

  • start in script data state (we could start from another of the script states, but we'd need to return here somehow)
  • U+003C LESS-THAN SIGN (<)
    Switch to the script data less-than sign state.
  • U+002F SOLIDUS (/)
    Set the temporary buffer to the empty string. Switch to the script data end tag open state.
  • ASCII alpha
    Create a new end tag token, set its tag name to the empty string. Reconsume in the script data end tag name state (anything else adds </ to the script data and continues)
    My interpretation: ASCII letters are lowercased and appended to the "end tag" name. When a character that is not an ASCII letter ends the tag name, the following step checks whether the name matches the opening tag (script).
  • 13.2.5.17 Script data end tag name state
    • ASCII upper and lower alpha are added to the tag name (lowercased)
    • Space, linefeed, formfeed, tab, / and > do what they do in normal tags if the tag name was matched (script in any upper/lowercase combination)
    • Anything else (including the spacing, /, > characters if the tag name wasn't matched) treats the characters as script data and continues to parse in script data state.

Based on the above, my interpretation is that we need to make sure that </script{{ non-ascii alpha }} cannot be produced. Minimally, ensuring </ cannot be produced seems sufficient.
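The states walked through above can be approximated with a single pattern. This is a minimal sketch, not a tokenizer: `sketch_contains_script_closer()` is a hypothetical helper, and it is deliberately conservative because it ignores the escaped states, so it may report a closer that a real parser would treat as script data.

```php
<?php
/**
 * Hypothetical helper: reports whether $text contains `</script` (in any
 * letter case) followed by a character that completes the tag name:
 * whitespace, `/`, or `>`.
 */
function sketch_contains_script_closer( string $text ): bool {
	return 1 === preg_match( '~</script[\t\n\f\r />]~i', $text );
}

var_dump( sketch_contains_script_closer( '{"a":"</script>"}' ) );      // bool(true)
var_dump( sketch_contains_script_closer( '{"a":"</SCRIPT foo>"}' ) );  // bool(true)
var_dump( sketch_contains_script_closer( '{"a":"<\/script>"}' ) );     // bool(false)
var_dump( sketch_contains_script_closer( '{"a":"\u003C/script>"}' ) ); // bool(false)
var_dump( sketch_contains_script_closer( '{"a":"</scripture>"}' ) );   // bool(false)
```

The last case shows why the tag name has to end right after `script`: extra ASCII letters make it a different, non-matching name.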

Here are the relevant JSON encoding flags:

If my analysis so far is correct, either omitting JSON_UNESCAPED_SLASHES (so / is escaped as \/) OR including JSON_HEX_TAG (so < is escaped as \u003c) is sufficient. These conditions should print the dangerous </script as <\/script or \u003c/script respectively, both of which should be harmless.

That is to say, either neither flag (0) or both flags (JSON_HEX_TAG | JSON_UNESCAPED_SLASHES) should be sufficient to ensure the dangerous </script is not produced.
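The four combinations of these two flags can be compared directly. A small sketch with plain `json_encode()`; the labels are only for display:

```php
<?php
$payload = '</script>';

$combinations = array(
	'0 (neither flag)'                      => 0,
	'JSON_UNESCAPED_SLASHES'                => JSON_UNESCAPED_SLASHES,
	'JSON_HEX_TAG'                          => JSON_HEX_TAG,
	'JSON_HEX_TAG | JSON_UNESCAPED_SLASHES' => JSON_HEX_TAG | JSON_UNESCAPED_SLASHES,
);

$results = array();
foreach ( $combinations as $label => $flags ) {
	$encoded           = json_encode( $payload, $flags );
	$results[ $label ] = $encoded;

	// The output is only a problem if the literal `</script` survives encoding.
	$safe = false === stripos( $encoded, '</script' );
	printf( "%-40s %-24s %s\n", $label, $encoded, $safe ? 'safe' : 'DANGEROUS' );
}
```

Only `JSON_UNESCAPED_SLASHES` on its own leaves `</script>` intact. Note that PHP emits the escape with uppercase hex digits (`\u003C`).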

My intuition is to use JSON_HEX_TAG | JSON_UNESCAPED_SLASHES. I suspect / is more common in encoded data like URLs than the < and > characters, so / can remain unencoded and < and > are encoded.

One thing that concerns me is a sequence of states in a script tag that begins <!--, this enters an "escaped" flow of states, but I can't find any difference with the normal flow, it still seems to try to match </script and close the script tag at that point. Nonetheless, escaping < alone seems sufficient to prevent entering this state because without < we can never enter Script data less-than sign state.


JSON_HEX_AMP doesn't seem to have any value, character references like &amp; are not decoded in a script tag. We can remove it. &lt;/script&gt; is not dangerous in a script tag.

JSON_UNESCAPED_UNICODE seems good to include if we're confident the page encoding is appropriate, so we can include it if blog_charset is "UTF-8".

JSON_UNESCAPED_LINE_TERMINATORS also seems good to include. \u2028 (LINE SEPARATOR) and \u2029 (PARAGRAPH SEPARATOR) do not need to be escaped. It's hard to find this documented, but these seem to be the only characters the flag affects.

@sirreal sirreal marked this pull request as ready for review May 9, 2024 10:23

github-actions bot commented May 9, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props jonsurrell, dmsnell, westonruter, bjorsch, sabernhardt.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@anomiex

anomiex commented May 9, 2024

If my analysis so far is correct, either omitting JSON_UNESCAPED_SLASHES (so / is escaped as \/) OR including JSON_HEX_TAG (so < is escaped as \u003c) is sufficient. These conditions should print the dangerous </script as <\/script or \u003c/script respectively, both of which should be harmless.

Exactly this.

My intuition is to use JSON_HEX_TAG | JSON_UNESCAPED_SLASHES. I suspect / is more common in encoded data like URLs than the < and > characters, so / can remain unencoded and < and > are encoded.

My intuition as well.

One thing that concerns me is a sequence of states in a script tag that begins <!--, this enters an "escaped" flow of states, but I can't find any difference with the normal flow, it still seems to try to match </script and close the script tag at that point. Nonetheless, escaping < alone seems sufficient to prevent entering this state because without < we can never enter Script data less-than sign state.

I've not looked too much into this either, but I've also had that in the back of my head when considering JSON_HEX_TAG | JSON_UNESCAPED_SLASHES over neither.

JSON_HEX_AMP doesn't seem to have any value, character references like &amp; are not decoded in a script tag. We can remove it. &lt;/script&gt; is not dangerous in a script tag.

I've heard that if the page is being interpreted as XHTML4 rather than as HTML5, the entities might get interpreted inside the <script> (unless you do the <![CDATA[ thing).

@sirreal
Member Author

sirreal commented May 9, 2024

I've heard that if the page is being interpreted as XHTML4 rather than as HTML5, the entities might get interpreted inside the <script> (unless you do the <![CDATA[ thing).

Good thought. I'm unable to get any modern browser to interpret a page as anything other than HTML5 so this may be a non-issue. There's a related ticket that suggests the same:

WordPress still officially supports HTML4 and XHTML, but the browsers it serves and the broader web effectively don't.

The appearance of serving HTML4 or XHTML stems from the fact that it's very rare to serve actual XHTML content, and perhaps impossible to serve HTML4 content, to any supported browser or environment.

@westonruter
Member

westonruter commented May 9, 2024

I've heard that if the page is being interpreted as XHTML4 rather than as HTML5, the entities might get interpreted inside the <script> (unless you do the <![CDATA[ thing).

Good thought. I'm unable to get any modern browser to interpret a page as anything other than HTML5 so this may be a non-issue. There's a related ticket that suggests the same:

@sirreal Did you try sending Content-Type: application/xhtml+xml? This is the only way to trigger it. Here's a test page which is served as actual XHTML with a link to trigger a parse error to prove whether the XML parser is in use: https://xhtml-test-page.glitch.me/

Chrome, Safari, and Firefox all show parse errors when going to https://xhtml-test-page.glitch.me/?breakxml

$expected = <<<"JSON"
{"config":{"myPlugin":{"chars":"&\\u003C\\u003E/"}},"state":{"myPlugin":{"ampersand":"&","less-than sign":"\\u003C","greater-than sign":"\\u003E","solidus":"/","line separator":"\\u2028","paragraph separator":"\\u2029","flag of england":"\\ud83c\\udff4\\udb40\\udc67\\udb40\\udc62\\udb40\\udc65\\udb40\\udc6e\\udb40\\udc67\\udb40\\udc7f","malicious script closer":"\\u003C/script\\u003E","entity-encoded malicious script closer":"&lt;/script&gt;"}}}
JSON;
$this->assertEquals( $expected, $interactivity_data_string[1] );
Member

Idea: This could use the HTML Processor to step over the tokens in $interactivity_data_markup as an additional check to ensure that the malicious script closer does not prematurely close the script. Maybe use the PHP's DOM API as well?

Member Author

I thought the same thing as I worked with these tests. However, in this case I think it makes sense to match and work with literal strings. I don't have much confidence in any of the parsers to do what I expect and not try to interpret any of this as HTML markup. I want to know exactly what characters are output and don't want any entities to be transformed.

Member Author

The HTML Processor doesn't support SCRIPT tags right now, and the tag processor is much more rudimentary. I'm not sure either is ready to handle what you suggest. @dmsnell thoughts on this?

Contributor

In this case the Tag Processor should be reliable for testing SCRIPT elements, because it consumes the entire element in one go. You can compare get_modifiable_text() to your expectation to see what was inside the script.

@anomiex

anomiex commented May 9, 2024

Did you try sending Content-Type: application/xhtml+xml? This is the only way to trigger it. Here's a test page which is served as actual XHTML with a link to trigger a parse error to prove whether the XML parser is in use: https://xhtml-test-page.glitch.me/

Based on that, here's a PHP file that illustrates a script injection that would be prevented by JSON_HEX_AMP:

<?php

header('Content-Type: application/xhtml+xml; charset=UTF-8');

$data = "&quot;;alert('eek');&quot;";

?><?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US">
  <head>
   <script>
    var data = <?php echo json_encode( $data, JSON_HEX_TAG | JSON_UNESCAPED_SLASHES ); ?>;
   </script>
  </head>
  <body>
  </body>
</html>

Attacking the <script type="application/json"> used here would be harder, of course, as you can't directly inject code there. But you might be able to override the data structure in some manner that exposes a problem in whatever uses the data. Or at least you could break the page.

@westonruter
Member

as you can't directly inject code there

But you could still if your malicious code looked like </script><img src="bad" onerror="alert('evil')"> or even have it open another non-JSON script tag.

@anomiex

anomiex commented May 9, 2024

If you can figure out a way to get that </script> past JSON_HEX_TAG there, please give an example.
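For reference, a quick sketch of what that payload looks like once encoded with the proposed flags. It uses plain `json_encode()` and the payload from the comment above:

```php
<?php
$payload = '</script><img src="bad" onerror="alert(\'evil\')">';
$flags   = JSON_HEX_TAG | JSON_UNESCAPED_SLASHES;

$encoded = json_encode( $payload, $flags );
echo $encoded, PHP_EOL;

// No literal angle bracket survives, so no tag can start inside the element.
$has_angle_brackets = 1 === preg_match( '/[<>]/', $encoded );
var_dump( $has_angle_brackets ); // bool(false)

// The data itself is intact once decoded.
$decoded = json_decode( $encoded );
var_dump( $decoded === $payload ); // bool(true)
```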

@sirreal
Member Author

sirreal commented May 9, 2024

Fascinating! This was helpful. I'll add JSON_HEX_AMP for XHTML, detailed reasoning below.


In XHTML the entities will be decoded in script tags, although this seems to be a source of data corruption more than anything else. I don't think they can be abused to break out of the script tag - that's their purpose: &lt; is not the start of a tag like < is 🙂

However, it can be dangerous not to escape & because:

  • it's easy to break the xhtml
  • html entities will be transformed

Here's the demo script I was playing with:

<?php
$xhtml = isset($_GET['xhtml']);
header( 'Content-Type: '
	. ( $xhtml ? 'application/xhtml+xml' : 'text/html' )
	. '; charset=UTF-8'
);

// PLAY WITH THE VALUE HERE
$s = "&lt;";

$flags = 0
	// always these
	| JSON_HEX_TAG
	| JSON_UNESCAPED_SLASHES

	// these with UTF-8
	| JSON_UNESCAPED_UNICODE
	| JSON_UNESCAPED_LINE_TERMINATORS
	| 0 ;

if ( isset( $_GET['amp'] ) ) {
	$flags |= JSON_HEX_AMP;
}

if ( $xhtml ) {
	echo '<!DOCTYPE html>'
	. PHP_EOL
	. '<html xmlns="http://www.w3.org/1999/xhtml">';
} else {
	echo '<!DOCTYPE html>'
	. PHP_EOL
	. '<html>';
}
?>
  <body>
	<script id='1' type="application/json"><?php echo json_encode($s,$flags); ?></script>
	<script>
		const j = JSON.parse(document.getElementById('1').textContent);
		console.log("%o", j);
	</script>
  </body>
</html>
  • HTML seems to work fine with these flags: JSON_HEX_TAG | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_LINE_TERMINATORS
  • XHTML will print < to the JavaScript console, we expected &lt;!
  • XHTML breaks if we JSON encode and print &lt - it doesn't like that character reference without the ; termination!
  • Using JSON_HEX_AMP in XHTML seems to fix both of these issues.
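The effect of `JSON_HEX_AMP` on the encoded string can be seen without a browser. A minimal sketch with plain `json_encode()`:

```php
<?php
$value = '&lt;';
$base  = JSON_HEX_TAG | JSON_UNESCAPED_SLASHES;

// Without JSON_HEX_AMP the ampersand is printed literally, so an XML parser
// would see a character reference inside the script element.
$without_amp = json_encode( $value, $base );
echo $without_amp, PHP_EOL; // "&lt;"

// With JSON_HEX_AMP the ampersand becomes a JSON escape, leaving nothing
// for an XML parser to decode.
$with_amp = json_encode( $value, $base | JSON_HEX_AMP );
echo $with_amp, PHP_EOL; // "\u0026lt;"

// Both forms decode to the same string when read as JSON.
var_dump( json_decode( $without_amp ) === json_decode( $with_amp ) ); // bool(true)
```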

@sirreal
Member Author

sirreal commented May 9, 2024

Does anyone know of themes to test that don't support html5 😅

I'm wondering if a check like this is sufficient:

if ( ! current_theme_supports( 'html5' ) ) {
	$json_encode_flags |= JSON_HEX_AMP;
}

Or does this need to check whether we're in WP Admin as well 🤔

@westonruter
Member

Given that pages written as XHTML are almost never served as actual XHTML (see GoogleChromeLabs/wpp-research#74), I think we should always assume HTML5.

@westonruter
Member

The WP Admin is considered HTML5, at least according to wp_get_inline_script_tag().

@dmsnell
Contributor

dmsnell commented May 9, 2024

This is a great discussion, and I think I'd need far more time to digest what's going on, but here are a few initial thoughts.

we need to make sure that </script{{ non-ascii alpha }} cannot be produced

I'm not sure where this comes from. It's hard to succinctly say other than to say we cannot produce a closing SCRIPT tag. </script{{ non-ascii alpha }} is safe to produce because it is not going to be a closing tag unless what follows is whitespace or / or >. What this rule is saying, is that once we start parsing a tag, if the tag name is SCRIPT and then we continue to complete the tag then it closes out of the SCRIPT state. However, if it's not a tag name match (e.g. SCRIPT😈 or DIV or SCRIPT:EVIL) then the entire tag parsing concludes and the consumed characters are flushed out as plaintext within the script.

You can see this with the Tag Processor, and note that it mirrors what you see in a browser.

$p = new WP_HTML_Tag_Processor( '<script>This is </script😄> and this is </script>' );
$p->next_token();
echo $p->get_modifiable_text();
// Output: "This is </script😄> and this is "

One final note is that once we match the SCRIPT tag name and enter the tag parsing, attributes may exist on the closing tag, even though they are ignored.

$p = new WP_HTML_Tag_Processor( '<script>This is a </script closing tag="</script>">' );
$p->next_token();
echo $p->get_modifiable_text();
// Output: "This is a "

The double-escaped script state can be very confusing, largely because of the wording around it. It's specifically there to allow printing SCRIPT elements from JavaScript. You not only need to open an HTML comment with <!-- but you also need to encounter a script opening tag.

These two discussions cover it better than the spec does IMO.

$p = new WP_HTML_Tag_Processor( '<script>This <!-- <script> </script> --> </script>' );
$p->next_token(); echo $p->get_modifiable_text();
// Output: "This <!-- <script> </script> --> "

$p = new WP_HTML_Tag_Processor( '<script>This <!-- <script> --> </script> --> </script>' );
$p->next_token(); echo $p->get_modifiable_text();
// Output: "This <!-- <script> --> "

You can see here how either a closing --> or a </script> tag exits the double-escaped state. The important thing that I see is that if we can prevent </script then we are guaranteed to prevent any escape regardless. It can be very challenging to try and reliably and safely preserve the ability to enter the literal </script>, though we should do that if we can. It's not a matter of entering the double-escaped state when we want that to come through, because double-escaped states do not nest.

$p = new WP_HTML_Tag_Processor( '<script>This <!-- <script> <!-- <script> </script> --> </script>' );
$p->next_token(); echo $p->get_modifiable_text();
// Output: "This <!-- <script> <!-- <script> </script> --> "

I've heard that if the page is being interpreted as XHTML4 rather than as HTML5, the entities might get interpreted inside the <script> (unless you do the <![CDATA[ thing).

Personally I'd rather see us move forward and focus on HTML5 because it's so extremely rare in practice to find XHTML. Yes it's possible to send the content type header, and yes if you serve a page as .xml the browser will follow, but that effectively doesn't happen in practice and for good reason. Even if WordPress claims to send XHTML the rest of the page will be broken for a variety of other reasons.

We can keep in mind that HTML cannot be fully expressed through XML and this is one of the failures of XHTML.

Why this is important to me is that once we start escaping content within a SCRIPT element we are knowingly corrupting that script and entering uncertain territory, possibly introducing vulnerabilities that wouldn't have been there through inaction.

This is one of the reasons I discourage use of DOMDocument, because even the act of loading and re-saving a document can introduce corruption and vulnerabilities.

$dom = new DOMDocument();
$dom->loadHTML( "<!DOCTYPE html><html><meta charset=utf8><body><script>This <!-- <script> </script> --> </script> alert(1);</body></html>" );
echo $dom->saveHTML();
// Output:
// <!DOCTYPE html>
// <html><head><meta charset="utf8"></head><body><script>This <!-- <script> </script> --&gt;  alert(1);</body></html>

In this case we instructed DOMDocument to make no changes to the document, and yet it has broken out of the SCRIPT. This causes the alert(1); to be executed, when properly it should have appeared as plaintext on the page.

When HTML5\DOMDocument appears it should not make these mistakes.

@sirreal
Member Author

sirreal commented May 9, 2024

we need to make sure that </script{{ non-ascii alpha }} cannot be produced

I'm not sure where this comes from.

This was perhaps a poor attempt at expressing that we need to find at least </script to close a script tag. This is not sufficient, but it is necessary. It implies that we must find </scr etc. Finally, if we can't find </ or even < then there's no way to close the script tag.

@sirreal
Member Author

sirreal commented May 9, 2024

Given all the discussion and just how difficult it is to serve XHTML, is the consensus that JSON_HEX_AMP should always be omitted?

It doesn't seem to be a security issue. There's only an edge case where the data may not be interpreted correctly, or the data may make the XHTML page invalid due to malformed character references. And that's all if the page is actually served correctly as XHTML.

@sabernhardt

sabernhardt commented May 9, 2024

Does anyone know of themes to test that don't support html5 😅

Themes that do not declare html5 support include Twenty Ten, Twenty Eleven and Twenty Twelve.

if ( ! current_theme_supports( 'html5' ) ) {

If you use a current_theme_supports() condition, you may need to check whether the theme supports HTML5 script specifically:
current_theme_supports( 'html5', 'script' )

* tag parsing.
*/
$json_encode_flags = JSON_HEX_TAG | JSON_UNESCAPED_SLASHES;
if ( 'UTF-8' === get_option( 'blog_charset' ) ) {
Contributor

sadly this check is insufficient, as there are common case variants and the hyphen may not be present. The common example is below, but I am so unnerved by this pattern that I'm currently prepping a patch to WordPress to add a new semantic check.

$charset = get_option( 'blog_charset' );
if ( in_array( $charset, array( 'utf8', 'utf-8', 'UTF8' ), true ) ) {
	$charset = 'UTF-8';
}

Member Author

Thanks! It looks like the _canonical_charset() function (intended to normalize to "UTF-8") is applied as a filter by default, so maybe this is sufficient as is? Your proposed patch would make that filter more robust without any changes here:

add_filter( 'option_blog_charset', '_canonical_charset' );

If the encoding is UTF-8 and we fail to detect it, it will do little harm. The only issue is that valid unicode would be escaped to its \u1234 form.
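A more tolerant comparison than a strict `'UTF-8' ===` check could look like the sketch below. `sketch_is_utf8_charset()` is a hypothetical name used only for illustration; it is not the WordPress function, and the set of accepted spellings is an assumption based on the variants listed above.

```php
<?php
/**
 * Hypothetical helper: treats the common spellings of UTF-8 as equivalent,
 * ignoring letter case and the optional hyphen.
 */
function sketch_is_utf8_charset( string $charset ): bool {
	return in_array( strtolower( $charset ), array( 'utf-8', 'utf8' ), true );
}

var_dump( sketch_is_utf8_charset( 'UTF-8' ) );      // bool(true)
var_dump( sketch_is_utf8_charset( 'utf8' ) );       // bool(true)
var_dump( sketch_is_utf8_charset( 'Utf-8' ) );      // bool(true)
var_dump( sketch_is_utf8_charset( 'ISO-8859-1' ) ); // bool(false)
```

A false negative here only costs readability, since the encoder falls back to `\uXXXX` escapes.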

*
* - JSON_UNESCAPED_UNICODE: Encode multibyte Unicode characters literally
* (default is to escape as \uXXXX).
* - JSON_UNESCAPED_LINE_TERMINATORS: The line terminators are kept unescaped when
Contributor

Why not escape line terminators? It seems like my normal expectation is that they be escaped.

Member Author

It's a poorly named and poorly documented flag, but it causes \u2028 (LINE SEPARATOR) and \u2029 (PARAGRAPH SEPARATOR) not to be escaped. This is included in the updated tests.

It has nothing to do with newlines or carriage returns or line feeds or anything like you might expect to be a "line terminator" 🤷. These two characters are not among the characters JSON requires to be escaped, so it's fine to print them.
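A short sketch of what the flag does and does not change, using plain `json_encode()`. Note that the flag only has an effect when combined with `JSON_UNESCAPED_UNICODE`:

```php
<?php
// U+2028 and U+2029, plus an ordinary newline for contrast.
$text = "before\u{2028}middle\u{2029}after\n";

// JSON_UNESCAPED_UNICODE alone still escapes the two separators.
$escaped = json_encode( $text, JSON_UNESCAPED_UNICODE );
echo $escaped, PHP_EOL; // "before\u2028middle\u2029after\n"

// Adding JSON_UNESCAPED_LINE_TERMINATORS prints them literally.
// The newline is still escaped as \n either way.
$literal = json_encode( $text, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_LINE_TERMINATORS );
echo $literal, PHP_EOL;

// Both encodings decode to the same string.
var_dump( json_decode( $escaped ) === json_decode( $literal ) ); // bool(true)
```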

Contributor

my thinking was that if a client renders these in a way that breaks the line it will look funny in the code or in a debug statement rather than having the escaped characters. so in this case it's less about resilience and more about making the strings human-readable.

I don't have any strong opinion on this

Member Author

That may be a good point, \u2028 is easier to spot than whitespace. At the same time, we won't be encoding \u00a0 or most of the other whitespace characters that are likely much more common.

I'd also argue that the only place we control this is where they're printed in the HTML. Assuming that something reads the string out of HTML and parses the JSON (probably JavaScript with JSON.parse), it doesn't matter how it was encoded when serialized.

I have a slight preference to keep this flag and leave these unescaped, but I don't feel strongly one way or the other.

Member Author

Here's what it looks like, this is "before\u2028middle\u2029after":

Browser   Elements panel                         View source
Chrome    [Screenshot 2024-05-10 at 21 01 48]    [Screenshot 2024-05-10 at 20 34 35]
Firefox   [Screenshot 2024-05-10 at 21 03 02]    [Screenshot 2024-05-10 at 21 05 33]

Safari looks like Firefox, the characters are invisible and it reads "beforemiddleafter".

@sirreal
Member Author

sirreal commented May 10, 2024

Thanks @sabernhardt. I did some testing with twentyten and it doesn't declare html5 support, but it certainly doesn't render an xhtml page. An unescaped & in script tags seems harmless and is interpreted correctly.

I'm fairly convinced at this point that we don't need JSON_HEX_AMP. Agree with @westonruter on this (#6520 (comment)).

I think most discussion is settled and this is ready for review.

@sabernhardt

If you want to test with a theme that uses an XHTML doctype in the header.php template, you could try the old "Default" (Kubrick) theme. Directory searches found others too. Note that some of these themes are not available for download anymore, and some of them only use the doctype for a documentation page (not in header.php).
XHTML
HTML 4

@sirreal
Member Author

sirreal commented May 10, 2024

Thanks again @sabernhardt. I tried with Kubrick and Colorway, and those themes still are not rendered as XHTML, no issues without JSON_HEX_AMP encoding.

Co-authored-by: Weston Ruter <westonruter@google.com>
@sirreal sirreal requested a review from westonruter May 10, 2024 17:39
sirreal added a commit to WordPress/gutenberg that referenced this pull request May 14, 2024
sirreal added a commit to sirreal/wordpress-develop that referenced this pull request May 14, 2024
sirreal added a commit to WordPress/gutenberg that referenced this pull request May 14, 2024
Previously, the flags were set as if UTF-8 were conditionally present,
but in most cases the blog character set will be UTF-8 and so flipping
the logic makes it clearer that this is a "happy path" and also removes
some logic that isn't necessary most of the time.

I've reworded and combined the comment explaining the flags to be more
specific about why they are needed, and removed some discussions
about encodings that I thought muddied the important details.
Contributor

@dmsnell dmsnell left a comment

@sirreal I merged trunk and made a commit to use the new is_utf8_charset()

Additionally I rearranged the comment and logic for swapping flags because I felt like it would be nicer if we expressed UTF-8 as the default and only removed the other flags if we're in a different charset. I removed some wording in the comment that I felt was more confusing about the encoding than helpful, since JSON is by spec required to be UTF-8 and by convention is universally so.

My commit is only to speed up the process, not demand the change. Please let me know how you feel. If you and @westonruter are still on board, I can merge this in soon.
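The rearranged logic described above can be sketched roughly as follows. `sketch_store_json_flags()` and its boolean parameter are hypothetical stand-ins for illustration; the real code relies on the charset check in WordPress rather than a parameter.

```php
<?php
/**
 * Hypothetical helper: UTF-8 is the default "happy path", and the
 * unescaped-Unicode flags are dropped only for other charsets.
 */
function sketch_store_json_flags( bool $is_utf8 ): int {
	$flags = JSON_HEX_TAG | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_LINE_TERMINATORS;

	if ( ! $is_utf8 ) {
		// Keep only the flags that do not depend on the page encoding.
		$flags = JSON_HEX_TAG | JSON_UNESCAPED_SLASHES;
	}

	return $flags;
}

echo json_encode( 'é</a>', sketch_store_json_flags( true ) ), PHP_EOL;  // "é\u003C/a\u003E"
echo json_encode( 'é</a>', sketch_store_json_flags( false ) ), PHP_EOL; // "\u00e9\u003C/a\u003E"
```

In both branches `<` and `>` are escaped, so the charset only affects readability, not whether the script element can be closed early.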

@sirreal
Member Author

sirreal commented May 15, 2024

Thanks @dmsnell, I'm happy with your changes 👍

sirreal added a commit to WordPress/gutenberg that referenced this pull request May 15, 2024
pento pushed a commit that referenced this pull request May 15, 2024
The Interactivity API has been rendering client data in a SCRIPT element with the
type `application/json` so that it's not executed as a script, but is available
to one. The data runs through `wp_json_encode()` and is encoded with some flags
to ensure that potentially-dangerous characters are escaped.

However, this can lead to some challenges. Eagerly escaping when not necessary
can make the data difficult to comprehend when reading the output HTML. For example,
all non-ASCII Unicode characters are escaped with their code point equivalent.
This results in `\ud83c\udd70` instead of `🅰`.

In this patch, the flags for JSON encoding are refined to ensure what's necessary
while relaxing other rules (leaving in those Unicode characters if the blog charset
is UTF-8). This makes for Interactivity API data that's quicker as a human reader
to decipher and diagnose.

In summary:

 - This data is JSON encoded and printed in a `<script type="application/json">` tag.

 - If we ensure that `<` is never printed inside the data, it should be impossible to
   break out of the script tag and the browser treats everything as the element's `textContent`.

 - All other escaping becomes unnecessary at that point, including unicode escaping 
   if the page uses the UTF-8 charset (the same encoding as JSON).

See #6433 (review)

Developed in #6520
Discussed in https://core.trac.wordpress.org/ticket/61170

Fixes: #61170
Follow-up to: [57563].
Props: bjorsch, dmsnell, jonsurrell, sabernhardt, westonruter.


git-svn-id: https://develop.svn.wordpress.org/trunk@58159 602fd350-edb4-49c9-b593-d223f7449a82
@dmsnell
Contributor

dmsnell commented May 15, 2024

Merged in [58159]
13d5244

@dmsnell dmsnell closed this May 15, 2024
markjaquith pushed a commit to markjaquith/WordPress that referenced this pull request May 15, 2024
Built from https://develop.svn.wordpress.org/trunk@58159


git-svn-id: http://core.svn.wordpress.org/trunk@57622 1a063a9b-81f0-0310-95a4-ce76da25c4cd
github-actions bot pushed a commit to platformsh/wordpress-performance that referenced this pull request May 15, 2024
Built from https://develop.svn.wordpress.org/trunk@58159


git-svn-id: https://core.svn.wordpress.org/trunk@57622 1a063a9b-81f0-0310-95a4-ce76da25c4cd
@dmsnell dmsnell deleted the add/iapi-better-json-encoding branch May 15, 2024 17:59
sirreal added a commit to WordPress/gutenberg that referenced this pull request May 16, 2024
sirreal added a commit to WordPress/gutenberg that referenced this pull request May 16, 2024
5 participants