Improve Interactivity API store JSON encoding #6520

sirreal · 2024-05-07T15:20:58Z

Improve JSON serialization of the Interactivity API store. Based on this conversation: #6433 (review)

There is a more detailed analysis below, but in summary:

This data is JSON encoded and printed in a <script type="application/json"> tag.
If we ensure that < is never printed inside the data, it should be impossible to break out of the script tag and the browser treats everything as the element's textContent.
All other escaping becomes unnecessary at that point, including unicode escaping if the page uses the UTF-8 charset (the same encoding as JSON).

By printing as much UTF-8 text as possible without redundant escaping, the output generally becomes smaller and more readable.
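As a rough illustration of the summary above, the following sketch compares `json_encode()` defaults with the flag set discussed in this PR. The sample data is hypothetical and not taken from the patch.

```php
<?php
// Hypothetical sample data; not taken from the PR.
$data = array(
	'label'  => 'Grade 🅰',
	'closer' => '</script>',
	'url'    => 'https://example.com/a/b',
);

// json_encode() defaults: non-ASCII characters and "/" are escaped.
$default = json_encode( $data );

// Flag set discussed in this PR for pages served as UTF-8.
$refined = json_encode(
	$data,
	JSON_HEX_TAG | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_LINE_TERMINATORS
);

echo $default, PHP_EOL;
// {"label":"Grade \ud83c\udd70","closer":"<\/script>","url":"https:\/\/example.com\/a\/b"}
echo $refined, PHP_EOL;
// {"label":"Grade 🅰","closer":"\u003C/script\u003E","url":"https://example.com/a/b"}
```

Both strings decode to the same data; the second contains no `<` and keeps the emoji and URL readable.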

Trac ticket: https://core.trac.wordpress.org/ticket/61170

This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

github-actions · 2024-05-07T15:35:42Z

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

The Plugin and Theme Directories cannot be accessed within Playground.
All changes will be lost when closing a tab with a Playground instance.
All changes will be lost when refreshing the page.
A fresh instance is created each time the link below is clicked.
Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance, it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

src/wp-includes/interactivity-api/class-wp-interactivity-api.php

sirreal · 2024-05-09T08:43:48Z

@westonruter brought up a point about JSON_UNESCAPED_SLASHES (#6433 (comment)). I'll reply here.

Thanks for flagging that! This is interesting feedback, do you have more information on what an attack might look like?

I decided to investigate more deeply, so I reviewed the standard. As far as I can tell, to break out of a script tag you need a literal sequence like </script. Either JSON unicode escaping the < or escaping the / (as \/) seems sufficient based on my interpretation, but scrutiny would be very welcome.

To close a script tag, we must find:

start in script data state (we could start from another of the script states, but we'd need to return here somehow)
U+003C LESS-THAN SIGN (<)
Switch to the script data less-than sign state.
U+002F SOLIDUS (/)
Set the temporary buffer to the empty string. Switch to the script data end tag open state.
ASCII alpha
Create a new end tag token, set its tag name to the empty string. Reconsume in the script data end tag name state (anything else adds </ to the script data and continues)
My interpretation: ASCII alpha characters are lowercased and added to the "end tag" name. When the end tag name is ended by a character that is not ASCII alpha, the following step checks if it matches the opening tag (script).
13.2.5.17 Script data end tag name state
- ASCII upper and lower alpha are added to the tag name (lowercased)
- Space, linefeed, formfeed, tab, / and > do what they do in normal tags if the tag name was matched (script in any upper/lowercase combination)
- Anything else (including the spacing, /, > characters if the tag name wasn't matched) treats the characters as script data and continues to parse in script data state.

Based on the above, my interpretation is that we need to make sure that </script{{ non-ascii alpha }} cannot be produced. Minimally, ensuring </ cannot be produced seems sufficient.
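The tokenizer steps above can be approximated with a small check. This is a sketch only: `could_close_script()` is a hypothetical helper, and it is deliberately conservative because it ignores the escaped and double-escaped script states (so it may report a closer that a real parser would treat as script data).

```php
<?php
/**
 * Hypothetical helper: reports whether $text contains a sequence that would
 * end a SCRIPT element when parsing starts in the "script data" state.
 */
function could_close_script( string $text ): bool {
	// "</", then "script" in any letter case, then tab, LF, FF, CR, space, "/" or ">".
	return 1 === preg_match( '~</script[\t\n\f\r />]~i', $text );
}

var_dump( could_close_script( '</script>' ) );             // bool(true)
var_dump( could_close_script( '</SCRIPT foo="bar">' ) );   // bool(true)
var_dump( could_close_script( '</scripts>' ) );            // bool(false): tag name does not match
var_dump( could_close_script( '<\/script>' ) );            // bool(false): "/" was escaped
var_dump( could_close_script( '\u003C/script\u003E' ) );   // bool(false): "<" was escaped
```

If no `<` can appear in the output at all, this check can never succeed, which is the minimal condition the comment arrives at.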

Here are the relevant JSON encoding flags:

JSON_HEX_TAG: All < and > are converted to \u003C and \u003E.
JSON_HEX_AMP: All & are converted to \u0026.
JSON_UNESCAPED_SLASHES: Don't escape /.
JSON_UNESCAPED_UNICODE: Encode multibyte Unicode characters literally (default is to escape as \uXXXX).
JSON_UNESCAPED_LINE_TERMINATORS: The line terminators are kept unescaped when JSON_UNESCAPED_UNICODE is supplied. It uses the same behaviour as it was before PHP 7.1 without this constant. Available as of PHP 7.1.0.

If my analysis so far is correct, either omitting JSON_UNESCAPED_SLASHES (so / is escaped as \/) OR including JSON_HEX_TAG (so < is escaped as \u003c) is sufficient. These conditions should print the dangerous </script as <\/script or \u003c/script respectively, both of which should be harmless.

That is to say, it seems that either neither flag (0) or both flags (JSON_HEX_TAG | JSON_UNESCAPED_SLASHES) is sufficient to ensure the dangerous </script is not produced.
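The four combinations of these two flags can be compared directly. The sketch below uses plain `json_encode()` and a single payload; it only looks for a literal `</` in the output, which is a simplification of the parsing rules discussed above.

```php
<?php
$payload = '</script>';

$results = array(
	'neither flag'                => json_encode( $payload, 0 ),
	'JSON_UNESCAPED_SLASHES only' => json_encode( $payload, JSON_UNESCAPED_SLASHES ),
	'JSON_HEX_TAG only'           => json_encode( $payload, JSON_HEX_TAG ),
	'both flags'                  => json_encode( $payload, JSON_HEX_TAG | JSON_UNESCAPED_SLASHES ),
);

foreach ( $results as $label => $json ) {
	$verdict = false === strpos( $json, '</' ) ? 'safe' : 'contains "</"';
	echo $label, ': ', $json, ' (', $verdict, ')', PHP_EOL;
}
// neither flag: "<\/script>" (safe)
// JSON_UNESCAPED_SLASHES only: "</script>" (contains "</")
// JSON_HEX_TAG only: "\u003C\/script\u003E" (safe)
// both flags: "\u003C/script\u003E" (safe)
```

Only JSON_UNESCAPED_SLASHES on its own lets the closing sequence through.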

My intuition is to use JSON_HEX_TAG | JSON_UNESCAPED_SLASHES. I suspect / is more common in encoded data like URLs than the < and > characters, so / can remain unencoded and < and > are encoded.

One thing that concerns me is a sequence of states in a script tag that begins <!--, this enters an "escaped" flow of states, but I can't find any difference with the normal flow, it still seems to try to match </script and close the script tag at that point. Nonetheless, escaping < alone seems sufficient to prevent entering this state because without < we can never enter Script data less-than sign state.

JSON_HEX_AMP doesn't seem to have any value, character references like &amp; are not decoded in a script tag. We can remove it. &lt;/script&gt; is not dangerous in a script tag.

JSON_UNESCAPED_UNICODE seems good to include if we're confident the page encoding is appropriate, so we can include it if blog_charset is "UTF-8".

JSON_UNESCAPED_LINE_TERMINATORS also seems good to include. \u2028 (LINE SEPARATOR) and \u2029 (PARAGRAPH SEPARATOR) do not need to be escaped. It's tricky to find documented, but it seems that these are the only characters that are handled.
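A short sketch of what JSON_UNESCAPED_LINE_TERMINATORS does in practice, assuming PHP 7.1 or later. The sample string is illustrative.

```php
<?php
// U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR between plain words.
$text = "before\u{2028}middle\u{2029}after";

// JSON_UNESCAPED_UNICODE alone still escapes these two characters.
$escaped = json_encode( $text, JSON_UNESCAPED_UNICODE );

// Adding JSON_UNESCAPED_LINE_TERMINATORS leaves them as literal UTF-8.
$literal = json_encode( $text, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_LINE_TERMINATORS );

echo $escaped, PHP_EOL;                     // "before\u2028middle\u2029after"
var_dump( $literal === '"' . $text . '"' ); // bool(true)

// An ordinary newline is still escaped as \n with both flags set.
$newline = json_encode( "a\nb", JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_LINE_TERMINATORS );
echo $newline, PHP_EOL;                     // "a\nb"
```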

github-actions · 2024-05-09T10:23:47Z

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props jonsurrell, dmsnell, westonruter, bjorsch, sabernhardt.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

anomiex · 2024-05-09T13:33:12Z

If my analysis so far is correct, either omitting JSON_UNESCAPED_SLASHES (so / is escaped as \/) OR including JSON_HEX_TAG (so < is escaped as \u003c) is sufficient. These conditions should print the dangerous </script as <\/script or \u003c/script respectively, both of which should be harmless.

Exactly this.

My intuition is to use JSON_HEX_TAG | JSON_UNESCAPED_SLASHES. I suspect / is more common in encoded data like URLs than the < and > characters, so / can remain unencoded and < and > are encoded.

My intuition as well.

One thing that concerns me is a sequence of states in a script tag that begins <!--, this enters an "escaped" flow of states, but I can't find any difference with the normal flow, it still seems to try to match </script and close the script tag at that point. Nonetheless, escaping < alone seems sufficient to prevent entering this state because without < we can never enter Script data less-than sign state.

I've not looked too much into this either, but I've also had that in the back of my head when considering JSON_HEX_TAG | JSON_UNESCAPED_SLASHES over neither.

JSON_HEX_AMP doesn't seem to have any value, character references like &amp; are not decoded in a script tag. We can remove it. &lt;/script&gt; is not dangerous in a script tag.

I've heard that if the page is being interpreted as XHTML4 rather than as HTML5, the entities might get interpreted inside the <script> (unless you do the <![CDATA[ thing).

sirreal · 2024-05-09T14:15:47Z

I've heard that if the page is being interpreted as XHTML4 rather than as HTML5, the entities might get interpreted inside the <script> (unless you do the <![CDATA[ thing).

Good thought. I'm unable to get any modern browser to interpret a page as anything other than HTML5 so this may be a non-issue. There's a related ticket that suggests the same:

WordPress still officially supports HTML4 and XHTML, but the browsers it serves and the broader web effectively don't.
…
The appearance of serving HTML4 or XHTML stems from the fact that it's very rare to serve actual XHTML content, and perhaps impossible to serve HTML4 content, to any supported browser or environment.

westonruter · 2024-05-09T16:02:25Z

I've heard that if the page is being interpreted as XHTML4 rather than as HTML5, the entities might get interpreted inside the <script> (unless you do the <![CDATA[ thing).

Good thought. I'm unable to get any modern browser to interpret a page as anything other than HTML5 so this may be a non-issue. There's a related ticket that suggests the same:

@sirreal Did you try sending Content-Type: application/xhtml+xml? This is the only way to trigger it. Here's a test page which is served as actual XHTML with a link to trigger a parse error to prove whether the XML parser is in use: https://xhtml-test-page.glitch.me/

Chrome, Safari, and Firefox all show parse errors when going to https://xhtml-test-page.glitch.me/?breakxml

westonruter · 2024-05-09T16:07:59Z

tests/phpunit/tests/interactivity-api/wpInteractivityAPI.php

+		$expected = <<<"JSON"
+{"config":{"myPlugin":{"chars":"&\\u003C\\u003E/"}},"state":{"myPlugin":{"ampersand":"&","less-than sign":"\\u003C","greater-than sign":"\\u003E","solidus":"/","line separator":"\\u2028","paragraph separator":"\\u2029","flag of england":"\\ud83c\\udff4\\udb40\\udc67\\udb40\\udc62\\udb40\\udc65\\udb40\\udc6e\\udb40\\udc67\\udb40\\udc7f","malicious script closer":"\\u003C/script\\u003E","entity-encoded malicious script closer":"&lt;/script&gt;"}}}
+JSON;
+		$this->assertEquals( $expected, $interactivity_data_string[1] );


Idea: This could use the HTML Processor to step over the tokens in $interactivity_data_markup as an additional check to ensure that the malicious script closer does not prematurely close the script. Maybe use the PHP's DOM API as well?

I thought the same thing as I worked with these tests. However, in this case I think it makes sense to match and work with literal strings. I don't have much confidence in any of the parsers to do what I expect and not try to interpret any of this as HTML markup. I want to know exactly what characters are output and don't want any entities to be transformed.

The HTML Processor doesn't support SCRIPT tags right now, and the tag processor is much more rudimentary. I'm not sure either is ready to handle what you suggest. @dmsnell thoughts on this?

in this case the Tag Processor should be inscrutable for testing SCRIPT elements, because it does consume the entire thing in one go. you can compare get_modifiable_text() to your expectation to see what was inside the script

anomiex · 2024-05-09T16:50:54Z

Did you try sending Content-Type: application/xhtml+xml? This is the only way to trigger it. Here's a test page which is served as actual XHTML with a link to trigger a parse error to prove whether the XML parser is in use: https://xhtml-test-page.glitch.me/

Based on that, here's a PHP file that illustrates a script injection that would be prevented by JSON_HEX_AMP:

<?php

header('Content-Type: application/xhtml+xml; charset=UTF-8');

$data = "&quot;;alert('eek');&quot;";

?><?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US">
  <head>
   <script>
    var data = <?php echo json_encode( $data, JSON_HEX_TAG | JSON_UNESCAPED_SLASHES ); ?>;
   </script>
  </head>
  <body>
  </body>
</html>

Attacking the <script type="application/json"> used here would be harder, of course, as you can't directly inject code there. But you might be able to override the data structure in some manner that exposes a problem in whatever uses the data. Or at least you could break the page.

westonruter · 2024-05-09T17:05:31Z

as you can't directly inject code there

But you could still if your malicious code looked like </script><img src="bad" onerror="alert('evil')"> or even have it open another non-JSON script tag.

anomiex · 2024-05-09T17:28:57Z

If you can figure out a way to get that </script> past JSON_HEX_TAG there, please give an example.
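For reference, here is what the payload from the previous comment looks like after encoding with JSON_HEX_TAG. This is only a sketch of the encoding step; it does not model how a consumer uses the decoded value.

```php
<?php
$payload = '</script><img src="bad" onerror="alert(\'evil\')">';

$json = json_encode( $payload, JSON_HEX_TAG | JSON_UNESCAPED_SLASHES );

echo $json, PHP_EOL;
// "\u003C/script\u003E\u003Cimg src=\"bad\" onerror=\"alert('evil')\"\u003E"

// No literal "<" or ">" survives, so no tag can start inside the script element.
var_dump( false === strpbrk( $json, '<>' ) );   // bool(true)

// The original string is still recovered intact by a JSON parser.
var_dump( json_decode( $json ) === $payload );  // bool(true)
```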

sirreal · 2024-05-09T17:37:26Z

Fascinating! This was helpful. I'll add JSON_HEX_AMP for XHTML, detailed reasoning below.

In XHTML the entities will be decoded in script tags, although this seems to be a source of data corruption more than anything else. I don't think they can be abused to break out of the script tag - that's their purpose: &lt; is not the start of a tag like < is 🙂

However, it can be dangerous not to escape & because:

it's easy to break the xhtml
html entities will be transformed

Here's the demo script I was playing with:

<?php
$xhtml = isset($_GET['xhtml']);
header( 'Content-Type: '
	. ( $xhtml ? 'application/xhtml+xml' : 'text/html' )
	. '; charset=UTF-8'
);

// PLAY WITH THE VALUE HERE
$s = "&lt;";

$flags = 0
	// always these
	| JSON_HEX_TAG
	| JSON_UNESCAPED_SLASHES

	// these with UTF-8
	| JSON_UNESCAPED_UNICODE
	| JSON_UNESCAPED_LINE_TERMINATORS
	| 0 ;

if ( isset( $_GET['amp'] ) ) {
	$flags |= JSON_HEX_AMP;
}

if ( $xhtml ) {
	echo '<!DOCTYPE html>'
	. PHP_EOL
	. '<html xmlns="http://www.w3.org/1999/xhtml">';
} else {
	echo '<!DOCTYPE html>'
	. PHP_EOL
	. '<html>';
}
?>
  <body>
	<script id='1' type="application/json"><?php echo json_encode($s,$flags); ?></script>
	<script>
		const j = JSON.parse(document.getElementById('1').textContent);
		console.log("%o", j);
	</script>
  </body>
</html>

HTML seems to work fine with these flags: JSON_HEX_TAG | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_LINE_TERMINATORS
XHTML will print < to the JavaScript console; we expected &lt;!
XHTML breaks if we JSON encode and print &lt - it doesn't like that character reference without the ; termination!
Using JSON_HEX_AMP in XHTML seems to fix both of these issues.
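The corruption described above can be imitated without a browser. In this sketch, `simulate_xhtml_text()` is a hypothetical stand-in for an XML parser reading the script element's text: it only decodes the predefined XML character references via `html_entity_decode()`, which is an assumption made for illustration rather than a faithful parser.

```php
<?php
/**
 * Hypothetical stand-in for an XML parser reading a script element's text:
 * it decodes the predefined XML character references and nothing more.
 */
function simulate_xhtml_text( string $script_body ): string {
	return html_entity_decode( $script_body, ENT_QUOTES | ENT_XML1, 'UTF-8' );
}

$value = '&lt;';
$base  = JSON_HEX_TAG | JSON_UNESCAPED_SLASHES;

$without_amp = json_encode( $value, $base );                // "&lt;"
$with_amp    = json_encode( $value, $base | JSON_HEX_AMP ); // "\u0026lt;"

// Without JSON_HEX_AMP the reference is decoded before JSON parsing: data is corrupted.
var_dump( json_decode( simulate_xhtml_text( $without_amp ) ) ); // string(1) "<"

// With JSON_HEX_AMP there is no "&" left to decode: the value round-trips.
var_dump( json_decode( simulate_xhtml_text( $with_amp ) ) );    // string(4) "&lt;"
```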

sirreal · 2024-05-09T17:52:36Z

Does anyone know of themes to test that don't support html5 😅

I'm wondering if a check like this is sufficient:

if ( ! current_theme_supports( 'html5' ) ) {
	$json_encode_flags |= JSON_HEX_AMP;
}

Or if this needs to check whether we're in WP Admin as well 🤔

westonruter · 2024-05-09T18:05:22Z

Given that pages written as XHTML are almost never served as actual XHTML (see GoogleChromeLabs/wpp-research#74), I think we should always assume HTML5.

westonruter · 2024-05-09T18:07:48Z

The WP Admin is considered HTML5, at least according to wp_get_inline_script_tag().

dmsnell · 2024-05-09T18:12:39Z

This is a great discussion, and I think I'd need far more time to digest what's going on, but here are a few initial thoughts.

we need to make sure that </script{{ non-ascii alpha }} cannot be produced

I'm not sure where this comes from. It's hard to succinctly say other than to say we cannot produce a closing SCRIPT tag. </script{{ non-ascii alpha }} is safe to produce because it is not going to be a closing tag unless what follows is whitespace or / or >. What this rule is saying, is that once we start parsing a tag, if the tag name is SCRIPT and then we continue to complete the tag then it closes out of the SCRIPT state. However, if it's not a tag name match (e.g. SCRIPT😈 or DIV or SCRIPT:EVIL) then the entire tag parsing concludes and the consumed characters are flushed out as plaintext within the script.

You can see this with the Tag Processor, and note that it mirrors what you see in a browser.

$p = new WP_HTML_Tag_Processor( '<script>This is </script😄> and this is </script>' );
$p->next_token();
echo $p->get_modifiable_text();
// Output: "This is </script😄> and this is "

One final note is that once we match the SCRIPT tag name and enter the tag parsing, attributes may exist on the closing tag, even though they are ignored.

$p = new WP_HTML_Tag_Processor( '<script>This is a </script closing tag="</script>">' );
$p->next_token();
echo $p->get_modifiable_text();
// Output: "This is a "

The double-escaped script state can be very confusing, largely because of the wording around it. It's specifically there to allow printing SCRIPT elements from JavaScript. You not only need to open an HTML comment with <!-- but you also need to encounter a script opening tag.

These two discussions cover it better than the spec does IMO.

$p = new WP_HTML_Tag_Processor( '<script>This <!-- <script> </script> --> </script>' );
$p->next_token(); echo $p->get_modifiable_text();
// Output: "This <!-- <script> </script> --> "

$p = new WP_HTML_Tag_Processor( '<script>This <!-- <script> --> </script> --> </script>' );
$p->next_token(); echo $p->get_modifiable_text();
// Output: "This <!-- <script> --> "

You can see here how either a closing --> or a </script> tag exits the double-escaped state. The important thing that I see is that if we can prevent </script then we are guaranteed to prevent any escape regardless. It can be very challenging to try and reliably and safely preserve the ability to enter the literal </script>, though we should do that if we can. It's not a matter of entering the double-escaped state when we want that to come through, because double-escaped states do not nest.

$p = new WP_HTML_Tag_Processor( '<script>This <!-- <script> <!-- <script> </script> --> </script>' );
$p->next_token(); echo $p->get_modifiable_text();
// Output: "This <!-- <script> <!-- <script> </script> --> "

I've heard that if the page is being interpreted as XHTML4 rather than as HTML5, the entities might get interpreted inside the <script> (unless you do the <![CDATA[ thing).

Personally I'd rather see us move forward and focus on HTML5 because it's so extremely rare in practice to find XHTML. Yes it's possible to send the content type header, and yes if you serve a page as .xml the browser will follow, but that effectively doesn't happen in practice and for good reason. Even if WordPress claims to send XHTML the rest of the page will be broken for a variety of other reasons.

We can keep in mind that HTML cannot be fully expressed through XML and this is one of the failures of XHTML.

Why this is important to me is that once we start escaping content within a SCRIPT element we are knowingly corrupting that script and entering uncertain territory, possibly introducing vulnerabilities that wouldn't have been there through inaction.

This is one of the reasons I discourage use of DOMDocument, because even the act of loading and re-saving a document can introduce corruption and vulnerabilities.

$dom = new DOMDocument();
$dom->loadHTML( "<!DOCTYPE html><html><meta charset=utf8><body><script>This <!-- <script> </script> --> </script> alert(1);</body></html>" );
echo $dom->saveHTML();
// Output:
// <!DOCTYPE html>
// <html><head><meta charset="utf8"></head><body><script>This <!-- <script> </script> --&gt;  alert(1);</body></html>

In this case we instructed DOMDocument to make no changes to the document, and yet it has broken out of the SCRIPT. This causes the alert(1); to be executed, when properly it should have appeared as plaintext on the page.

When HTML5\DOMDocument appears it should not make these mistakes.

sirreal · 2024-05-09T18:39:32Z

we need to make sure that </script{{ non-ascii alpha }} cannot be produced

I'm not sure where this comes from.

This was perhaps a poor attempt at expressing that we need to find at least </script to close a script tag. This is not sufficient, but it is necessary. It implies that we must find </scr etc. Finally, if we can't find </ or even < then there's no way to close the script tag.

sirreal · 2024-05-09T18:44:47Z

Given all the discussion and just how difficult it is to serve XHTML, is the consensus that JSON_HEX_AMP should always be omitted?

It doesn't seem to be a security issue. There's only an edge case where the data may not be interpreted correctly or the data may cause the XHTML page to be invalid due to malformed character references. And that's all if the page is actually served correctly as XHTML.

sabernhardt · 2024-05-09T19:31:41Z

Does anyone know of themes to test that don't support html5 😅

Themes that do not declare html5 support include Twenty Ten, Twenty Eleven and Twenty Twelve.

if ( ! current_theme_supports( 'html5' ) ) {

If you use a current_theme_supports() condition, you may need to check whether the theme supports HTML5 script specifically:
current_theme_supports( 'html5', 'script' )

dmsnell · 2024-05-09T19:56:58Z

src/wp-includes/interactivity-api/class-wp-interactivity-api.php

+			 *      tag parsing.
+			 */
+			$json_encode_flags = JSON_HEX_TAG | JSON_UNESCAPED_SLASHES;
+			if ( 'UTF-8' === get_option( 'blog_charset' ) ) {


sadly this check is insufficient, as there are common case variants and the hyphen may not be present. The common example is below, but I am so unnerved by this pattern that I'm currently prepping a patch to WordPress to add a new semantic check.

$charset = get_option( 'blog_charset' );
if ( in_array( $charset, array( 'utf8', 'utf-8', 'UTF8' ), true ) ) {
	$charset = 'UTF-8';
}

Thanks! It looks like the _canonical_charset() function (intended to normalize to "UTF-8") is applied as a filter by default, so maybe this is sufficient as is? Your proposed patch would make that filter more robust without any changes here:

wordpress-develop/src/wp-includes/default-filters.php

Line 292 in 12dc165

add_filter( 'option_blog_charset', '_canonical_charset' );

If the encoding is UTF-8 and we fail to detect it, it will do little harm. The only issue is that valid Unicode would be escaped to its \u1234 form.
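A tolerant charset comparison along the lines discussed here might look like the following. `looks_like_utf8()` is a hypothetical helper written for this illustration; it is not WordPress's implementation and only covers the spelling variants mentioned in this thread.

```php
<?php
/**
 * Hypothetical helper, not WordPress's implementation: accepts the common
 * spellings of UTF-8 by ignoring letter case and the optional hyphen.
 */
function looks_like_utf8( string $charset ): bool {
	return in_array( strtolower( trim( $charset ) ), array( 'utf-8', 'utf8' ), true );
}

var_dump( looks_like_utf8( 'UTF-8' ) );      // bool(true)
var_dump( looks_like_utf8( 'utf8' ) );       // bool(true)
var_dump( looks_like_utf8( 'Utf-8' ) );      // bool(true)
var_dump( looks_like_utf8( 'ISO-8859-1' ) ); // bool(false)
```

A false negative here is low risk, as noted above: the output stays valid and only falls back to \uXXXX escapes.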

dmsnell · 2024-05-09T23:08:19Z

src/wp-includes/interactivity-api/class-wp-interactivity-api.php

+				 *
+				 * - JSON_UNESCAPED_UNICODE: Encode multibyte Unicode characters literally
+				 *   (default is to escape as \uXXXX).
+				 * - JSON_UNESCAPED_LINE_TERMINATORS: The line terminators are kept unescaped when


Why not escape line terminators? It seems like my normal expectation is that they be escaped.

It's a poorly named and poorly documented flag, but it causes \u2028 (LINE SEPARATOR) and \u2029 (PARAGRAPH SEPARATOR) not to be escaped. This is included in the updated tests.

It has nothing to do with newlines or carriage returns or line feeds or anything like you might expect to be a "line terminator" 🤷. These two characters are not in the restricted JSON characters so it's fine to print them.

my thinking was that if a client renders these in a way that breaks the line it will look funny in the code or in a debug statement rather than having the escaped characters. so in this case it's less about resilience and more about making the strings human-readable.

I don't have any strong opinion on this

That may be a good point, \u2028 is easier to spot than whitespace. At the same time, we won't be encoding \u00a0 or most of the other whitespace characters that are likely much more common.

I'd also argue that the only place we control this is where they're printed in the HTML. Assuming that something reads the string out of HTML and parses the JSON (probably JavaScript with JSON.parse), it doesn't matter how it was encoded when serialized.

I have a slight preference to keep this flag and leave these unescaped, but I don't feel strongly one way or the other.
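The point that the serialized form does not matter to the consumer can be checked with a round trip. This sketch uses PHP's `json_decode()` in place of `JSON.parse`, and the sample value is illustrative.

```php
<?php
$value = array( 'text' => "before\u{2028}middle\u{2029}after é </p>" );

// Heavily escaped form: only JSON_HEX_TAG, everything else at defaults.
$escaped = json_encode( $value, JSON_HEX_TAG );

// Readable form: the flag set discussed in this PR.
$literal = json_encode(
	$value,
	JSON_HEX_TAG | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_LINE_TERMINATORS
);

var_dump( $escaped === $literal );                                           // bool(false)
var_dump( json_decode( $escaped, true ) === json_decode( $literal, true ) ); // bool(true)
var_dump( strlen( $literal ) < strlen( $escaped ) );                         // bool(true)
```

The two serializations differ byte for byte but decode to identical data, and the readable one is shorter.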

Here's what it looks like, this is "before\u2028middle\u2029after":

[Screenshots: Chrome and Firefox, each shown in the Elements panel and in view source]

Safari looks like Firefox, the characters are invisible and it reads "beforemiddleafter".

sirreal · 2024-05-10T08:35:59Z

Thanks @sabernhardt. I did some testing with twentyten and it doesn't declare html5 support, but it certainly doesn't render an xhtml page. An unescaped & in script tags seems harmless and is interpreted correctly.

I'm fairly convinced at this point that we don't need JSON_HEX_AMP. Agree with @westonruter on this (#6520 (comment)).

I think most discussion is settled and this is ready for review.

sabernhardt · 2024-05-10T14:56:27Z

If you want to test with a theme that uses an XHTML doctype in the header.php template, you could try the old "Default" (Kubrick) theme. Directory searches found others too. Note that some of these themes are not available for download anymore, and some of them only use the doctype for a documentation page (not in header.php).
XHTML
HTML 4

sirreal · 2024-05-10T16:52:05Z

Thanks again @sabernhardt. I tried with Kubrick and Colorway, and those themes still are not rendered as XHTML, no issues without JSON_HEX_AMP encoding.

Previously, the flags were set as if UTF-8 were conditionally present, but in most cases the blog character set will be UTF-8 and so flipping the logic makes it clearer that this is a "happy path" and also removes some logic that isn't necessary most of the time. I've reworded and combined the comment explaining the flags to be more specific about why they are needed, and removed some discussions about encodings that I thought muddied the important details.

dmsnell

@sirreal I merged trunk and made a commit to use the new is_utf8_charset()

Additionally I rearranged the comment and logic for swapping flags because I felt like it would be nicer if we expressed UTF-8 as the default and only removed the other flags if we're in a different charset. I removed some wording in the comment that I felt was more confusing about the encoding than helpful, since JSON is by spec required to be UTF-8 and by convention is universally so.

My commit is only to speed up the process, not demand the change. Please let me know how you feel. If you and @westonruter are still on board, I can merge this in soon.
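The "UTF-8 as the default" arrangement described in this comment could be sketched as follows. The function name and its boolean parameter are hypothetical and exist only for illustration; in the patch the decision is based on the site's charset.

```php
<?php
/**
 * Hypothetical sketch of the flag selection described above: start from the
 * UTF-8 "happy path" and drop the unescaped-Unicode flags for other charsets.
 */
function interactivity_json_flags( bool $is_utf8 ): int {
	// Happy path: escape "<" and ">", leave everything else readable.
	$flags = JSON_HEX_TAG | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_LINE_TERMINATORS;

	if ( ! $is_utf8 ) {
		// Other charsets: fall back to \uXXXX escapes for non-ASCII text.
		$flags = JSON_HEX_TAG | JSON_UNESCAPED_SLASHES;
	}

	return $flags;
}

echo json_encode( 'é</a>', interactivity_json_flags( true ) ), PHP_EOL;  // "é\u003C/a\u003E"
echo json_encode( 'é</a>', interactivity_json_flags( false ) ), PHP_EOL; // "\u00e9\u003C/a\u003E"
```

In both branches `<` and `>` are escaped, so the safety property does not depend on the charset check.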

sirreal · 2024-05-15T07:41:01Z

Thanks @dmsnell, I'm happy with your changes 👍

The Interactivity API has been rendering client data in a SCRIPT element with the type `application/json` so that it's not executed as a script, but is available to one. The data runs through `wp_json_encode()` and is encoded with some flags to ensure that potentially-dangerous characters are escaped.

However, this can lead to some challenges. Eagerly escaping when not necessary can make the data difficult to comprehend when reading the output HTML. For example, all non-ASCII Unicode characters are escaped with their code point equivalent. This results in `\ud83c\udd70` instead of `🅰`.

In this patch, the flags for JSON encoding are refined to ensure what's necessary while relaxing other rules (leaving in those Unicode characters if the blog charset is UTF-8). This makes for Interactivity API data that's quicker as a human reader to decipher and diagnose.

In summary:

- This data is JSON encoded and printed in a `<script type="application/json">` tag.
- If we ensure that `<` is never printed inside the data, it should be impossible to break out of the script tag and the browser treats everything as the element's `textContent`.
- All other escaping becomes unnecessary at that point, including unicode escaping if the page uses the UTF-8 charset (the same encoding as JSON).

See #6433 (review)
Developed in #6520
Discussed in https://core.trac.wordpress.org/ticket/61170

Fixes: #61170
Follow-up to: [57563].
Props: bjorsch, dmsnell, jonsurrell, sabernhardt, westonruter.

git-svn-id: https://develop.svn.wordpress.org/trunk@58159 602fd350-edb4-49c9-b593-d223f7449a82

dmsnell · 2024-05-15T17:43:13Z

Merged in [58159]
13d5244

Improve Interactivity API store JSON encoding

a99ec25

sirreal mentioned this pull request May 7, 2024

Script Modules: Add data server->client data passing #6433

Draft

sirreal commented May 8, 2024

sirreal added 3 commits May 8, 2024 14:39

Only set unescaped unicode with UTF-8

8a6d30f

Remove JSON_HEX_AMP

312d6ea

Add JSON_UNESCAPED_LINE_TERMINATORS

3e30117

sirreal added 3 commits May 9, 2024 11:21

Add comments

cf29aa6

Smaller config in test

76c7a5e

Fix heredoc end

f2b07eb

sirreal marked this pull request as ready for review May 9, 2024 10:23

sirreal added 2 commits May 9, 2024 12:32

Remove bad function arg trailing comma

5c2d273

Add more details to comments

b9cd850

westonruter reviewed May 9, 2024

dmsnell reviewed May 9, 2024

sirreal requested review from dmsnell and westonruter May 10, 2024 08:36

westonruter reviewed May 10, 2024

tests/phpunit/tests/interactivity-api/wpInteractivityAPI.php

Remove redundant tear_down method.

8a7e4e0

Co-authored-by: Weston Ruter <westonruter@google.com>

sirreal requested a review from westonruter May 10, 2024 17:39

westonruter approved these changes May 10, 2024

sirreal added a commit to WordPress/gutenberg that referenced this pull request May 14, 2024

Better JSON encoding

5c6c562

See WordPress/wordpress-develop#6520

sirreal added a commit to sirreal/wordpress-develop that referenced this pull request May 14, 2024

Improve data encoding

d185ac6

See WordPress#6520

sirreal added a commit to WordPress/gutenberg that referenced this pull request May 14, 2024

Better JSON encoding

f9f1610

See WordPress/wordpress-develop#6520

dmsnell added 2 commits May 14, 2024 17:23

Merge branch 'trunk' into add/iapi-better-json-encoding

67bd236

dmsnell approved these changes May 15, 2024

sirreal mentioned this pull request May 15, 2024

Add script module data implementation WordPress/gutenberg#61658

Open

sirreal added a commit to WordPress/gutenberg that referenced this pull request May 15, 2024

Update JSON flags and comments

72e8846

Aligns with WordPress/wordpress-develop#6520

dmsnell closed this May 15, 2024

dmsnell deleted the add/iapi-better-json-encoding branch May 15, 2024 17:59

sirreal added a commit to WordPress/gutenberg that referenced this pull request May 16, 2024

Better JSON encoding

5b6a21f

See WordPress/wordpress-develop#6520

sirreal added a commit to WordPress/gutenberg that referenced this pull request May 16, 2024

Update JSON flags and comments

4695f39

Aligns with WordPress/wordpress-develop#6520

sirreal commented May 7, 2024 • edited

github-actions bot commented May 7, 2024

sirreal commented May 9, 2024 • edited

github-actions bot commented May 9, 2024 • edited

anomiex commented May 9, 2024

sirreal commented May 9, 2024 • edited

westonruter commented May 9, 2024 • edited

anomiex commented May 9, 2024 • edited

westonruter commented May 9, 2024

anomiex commented May 9, 2024

sirreal commented May 9, 2024

sirreal commented May 9, 2024

westonruter commented May 9, 2024

westonruter commented May 9, 2024

dmsnell commented May 9, 2024

sirreal commented May 9, 2024 • edited

sirreal commented May 9, 2024

sabernhardt commented May 9, 2024 • edited

sirreal commented May 10, 2024

sabernhardt commented May 10, 2024

sirreal commented May 10, 2024

dmsnell left a comment

sirreal commented May 15, 2024

dmsnell commented May 15, 2024
