Normalize UTF-8 charset slug detection. #6535

dmsnell · 2024-05-09T23:03:52Z

There are several exist places in Core that attempt to detect if a blog charset is UTF-8. Each place attempts to perform the same check, except the logic is spread throughout and there's no single method provided to make this determination in a consistent way. The _canonical_charset() method exists, but is marked private for use.

In this patch the new unicode module provides is_utf8_charset() as a method taking an optional charset slug and indicating if it represents UTF-8, examining all of the allowable variants of that slug. Associated code is updated to use this new function, including _canonical_charset().

Finally, the test functions governing _canonical_charset() have been rewritten as a single test with a data provider instead of as separate test functions.

This patch raises a couple of questions:

Is there a need to normalize the ISO-8859-1 charsets since htmlspecialchars() does not need it?
Should WordPress normalize invalid variations of ISO-8859-1, such as "latin1", so that htmlspecialchars() does not choke?

One important question I'd like your help on, if you know the answer: is it still important to avoid calling get_option( 'blog_charset' ) multiple times in a row, or is that already cached? If it's already cached then I suspect further caching or use of static vars will only bloat the memory footprint of Core without speeding it up.

github-actions · 2024-05-09T23:04:05Z

Trac Ticket Missing

This pull request is missing a link to a Trac ticket. For a contribution to be considered, there must be a corresponding ticket in Trac.

To attach a pull request to a Trac ticket, please include the ticket's full URL in your pull request description. More information about contributing to WordPress on GitHub can be found in the Core Handbook.

github-actions · 2024-05-09T23:04:06Z

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

There are several exist places in Core that attempt to detect if a blog charset is UTF-8. Each place attempts to perform the same check, except the logic is spread throughout and there's no single method provided to make this determination in a consistent way. The `_canonical_charset()` method exists, but is marked private for use. In this patch the new `unicode` module provides `is_utf8_charset()` as a method taking an optional charset slug and indicating if it represents UTF-8, examining all of the allowable variants of that slug. Associated code is updated to use this new function, including `_canonical_charset()`. Finally, the test functions governing `_canonical_charset()` have been rewritten as a single test with a data provider instead of as separate test functions. This patch raises a couple of questions: - Is there a need to normalize the ISO-8859-1 charsets since `htmlspecialchars()` does not need it? - Should WordPress normalize invalid variations of ISO-8859-1, such as "latin1", so that `htmlspecialchars()` does not choke?

github-actions · 2024-05-09T23:20:21Z

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

The Plugin and Theme Directories cannot be accessed within Playground.
All changes will be lost when closing a tab with a Playground instance.
All changes will be lost when refreshing the page.
A fresh instance is created each time the link below is clicked.
Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

src/wp-includes/functions.php

sirreal

This seems like a nice improvement, I'd like to have it 👍

src/wp-includes/unicode.php

dmsnell · 2024-05-10T14:58:19Z

src/wp-includes/functions.php

+	if (
+		(
+			10 === $charset_length &&
+			0 === strcasecmp( 'iso-8859-1', $charset )


thanks @sirreal - this is better than substr_compare

sirreal · 2024-05-10T16:24:42Z

tests/phpunit/tests/functions/canonicalCharset.php

+			array( 'Iso-8859-1', 'ISO-8859-1' ),
+			array( 'ISO8859-1', 'ISO-8859-1' ),
+
+			// Other charset slugs should not be adjusted.


I confirmed this is working correctly, but some substring (we do have '' test) and superstring tests might be good:

Suggested change

// Other charset slugs should not be adjusted.

// Other charset slugs should not be adjusted.

array( 'ISO8859-128', 'ISO8859-128' ),

array( 'ISO', 'ISO' ),

array( 'UTF8080', 'UTF8080' ),

array( 'UTF', 'UTF' ),

Note that those would be a bug with the substr_compare implementation which returns the index of the string. strcasecmp returns non-0 to indicate the strings don't match.

those would be a bug with the substr_compare implementation which returns the index of the string

not sure I follow. substr_compare() doesn't return the index of a string. it starts at an offset in the haystack, and compares the given number of characters. if the string lengths are equal then this assures that the strings are identical, up to case sensitivity.

It's not that well behaved, especially before php 8.2. It's… very strange. But we're fine since we're not using substr_compare:

I think we're talking past each other, as I'm not sure what the code snippet is supposed to suggest.

We don't care about the value of the response, only if it's less than, greater than, or equal to 0. For our purposes, we only care about 0 and equality, which it does fine.

Missing from all your examples is the critical $length parameter. The length parameter establishes that it will stop comparing after that many bytes, or whenever one of the string ends, if one ends before the $length has been reached.

Does this make more sense?

Before, the implementation looked like this where $charset with extra characters would match:

$charset = 'iso-8859-1337'; var_dump( 0 === substr_compare( 'iso-8859-1', $charset, 0, 10, true ) ); // bool(true)

That's what had me thinking along this line, but again, it's not a problem because the final implementation is using functions that don't have that problem.

We can drop it 🙂

We can drop it 🙂

yeah I realize it's not important, but I did originally have a length check, and a note about that. more or less at this point I want to ensure I'm not overlooking something, and that we both reach a shared understanding, if we want.

if the string lengths are equal then this assures that the strings are identical, up to case sensitivity.

so there is no bug because if it compares 10 characters and the charset is only 10 characters I don't think it's possible to be wrong.

a7d7696#diff-5526e2d644a04f0c466d6c749542fdb7952021784e7b1583ff45f9d12601ef52R7497-R7498

if ( 10 === $charset_length && substr_compare( $charset, 'iso-8859-1', 0, 10, true ) )

You are correct 😆

sirreal · 2024-05-10T16:42:58Z

src/wp-includes/formatting.php

-	if ( in_array( $charset, array( 'utf8', 'utf-8', 'UTF8' ), true ) ) {
-		$charset = 'UTF-8';
-	}
+	$charset = _canonical_charset( $charset || get_option( 'blog_charset' ) );


There's an issue here. On a test site, before this change, this function starts with a $charset=false value, then processes options and has $charset = "UTF-8" value. After this change, the charset starts as false and becomes true after this line.

Warning: htmlspecialchars(): Charset "1" is not supported, assuming UTF-8 in /var/www/html/wp-includes/formatting.php on line 984

The convoluted way of getting options here may have been because there are default filters that may create cycles here:

wordpress-develop/src/wp-includes/default-filters.php

Lines 291 to 292 in 8d5e55a

add_filter( 'option_blog_charset', '_wp_specialchars' ); // IMPORTANT: This must not be wp_specialchars() or esc_html() or it'll cause an infinite loop.

add_filter( 'option_blog_charset', '_canonical_charset' );

_wp_specialchars calls get_option which filters option_blog_charset which calls _wp_specialchars.

Took me a minute to figure this out, but I've made this mistake many times before. The || expression returns a boolean value, not the first non-false value. I've updated in b9347d7 to use a proper ternary.

That fixed it for me 🙂

…et-normalization

There are several exist places in Core that attempt to detect if a blog charset is UTF-8. Each place attempts to perform the same check, except the logic is spread throughout and there's no single method provided to make this determination in a consistent way. The `_canonical_charset()` method exists, but is marked private for use. In this patch the new `unicode` module provides `is_utf8_charset()` as a method taking an optional charset slug and indicating if it represents UTF-8, examining all of the allowable variants of that slug. Associated code is updated to use this new function, including `_canonical_charset()`. If no slug is provided, it will look up the current `get_option( 'blog_charset' )`. Finally, the test functions governing `_canonical_charset()` have been rewritten as a single test with a data provider instead of as separate test functions. Developed in #6535 Discussed in https://core.trac.wordpress.org/ticket/61182 Fixes #61182. Props dmsnell, jonsurrell. git-svn-id: https://develop.svn.wordpress.org/trunk@58147 602fd350-edb4-49c9-b593-d223f7449a82

There are several exist places in Core that attempt to detect if a blog charset is UTF-8. Each place attempts to perform the same check, except the logic is spread throughout and there's no single method provided to make this determination in a consistent way. The `_canonical_charset()` method exists, but is marked private for use. In this patch the new `unicode` module provides `is_utf8_charset()` as a method taking an optional charset slug and indicating if it represents UTF-8, examining all of the allowable variants of that slug. Associated code is updated to use this new function, including `_canonical_charset()`. If no slug is provided, it will look up the current `get_option( 'blog_charset' )`. Finally, the test functions governing `_canonical_charset()` have been rewritten as a single test with a data provider instead of as separate test functions. Developed in WordPress/wordpress-develop#6535 Discussed in https://core.trac.wordpress.org/ticket/61182 Fixes #61182. Props dmsnell, jonsurrell. Built from https://develop.svn.wordpress.org/trunk@58147 git-svn-id: http://core.svn.wordpress.org/trunk@57612 1a063a9b-81f0-0310-95a4-ce76da25c4cd

In the merge of [58417] the new file was missed. This commit adds the missing required file. Developed in #6535 Discussed in https://core.trac.wordpress.org/ticket/61182 Fixes #61182. Follow-up to [58147]. Props dmsnell, jonsurrell. git-svn-id: https://develop.svn.wordpress.org/trunk@58148 602fd350-edb4-49c9-b593-d223f7449a82

There are several exist places in Core that attempt to detect if a blog charset is UTF-8. Each place attempts to perform the same check, except the logic is spread throughout and there's no single method provided to make this determination in a consistent way. The `_canonical_charset()` method exists, but is marked private for use. In this patch the new `unicode` module provides `is_utf8_charset()` as a method taking an optional charset slug and indicating if it represents UTF-8, examining all of the allowable variants of that slug. Associated code is updated to use this new function, including `_canonical_charset()`. If no slug is provided, it will look up the current `get_option( 'blog_charset' )`. Finally, the test functions governing `_canonical_charset()` have been rewritten as a single test with a data provider instead of as separate test functions. Developed in WordPress/wordpress-develop#6535 Discussed in https://core.trac.wordpress.org/ticket/61182 Fixes #61182. Props dmsnell, jonsurrell. Built from https://develop.svn.wordpress.org/trunk@58147 git-svn-id: https://core.svn.wordpress.org/trunk@57612 1a063a9b-81f0-0310-95a4-ce76da25c4cd

In the merge of [58417] the new file was missed. This commit adds the missing required file. Developed in WordPress/wordpress-develop#6535 Discussed in https://core.trac.wordpress.org/ticket/61182 Fixes #61182. Follow-up to [58147]. Props dmsnell, jonsurrell. Built from https://develop.svn.wordpress.org/trunk@58148 git-svn-id: http://core.svn.wordpress.org/trunk@57613 1a063a9b-81f0-0310-95a4-ce76da25c4cd

dmsnell · 2024-05-14T18:08:55Z

Merged in 58147 and 58148
d4967a3 and 7a031a9

In the merge of [58417] the new file was missed. This commit adds the missing required file. Developed in WordPress/wordpress-develop#6535 Discussed in https://core.trac.wordpress.org/ticket/61182 Fixes #61182. Follow-up to [58147]. Props dmsnell, jonsurrell. Built from https://develop.svn.wordpress.org/trunk@58148 git-svn-id: https://core.svn.wordpress.org/trunk@57613 1a063a9b-81f0-0310-95a4-ce76da25c4cd

dmsnell changed the title ~~Normalize UTF-8 charset detection.~~ Normalize UTF-8 charset slug detection. May 9, 2024

dmsnell force-pushed the add/standard-charset-normalization branch from 27cfcc3 to a7d7696 Compare May 9, 2024 23:06

dmsnell mentioned this pull request May 9, 2024

Improve Interactivity API store JSON encoding #6520

Closed

sirreal suggested changes May 10, 2024

View reviewed changes

src/wp-includes/functions.php Outdated Show resolved Hide resolved

sirreal added 2 commits May 10, 2024 10:06

Fix iso-8859-1 substr_compare check

db84f13

Use strcasecmp - strings are the same length and ASII expected

55f0558

sirreal approved these changes May 10, 2024

View reviewed changes

src/wp-includes/unicode.php Outdated Show resolved Hide resolved

dmsnell commented May 10, 2024

View reviewed changes

Rely on strcasecmp

937e884

sirreal reviewed May 10, 2024

View reviewed changes

sirreal suggested changes May 10, 2024

View reviewed changes

Don't change the type in an attempt to short-circuit a condition.

b9347d7

sirreal approved these changes May 10, 2024

View reviewed changes

dmsnell added 4 commits May 12, 2024 17:35

Appease the linting gods

1090a61

Merge remote-tracking branch 'upstream/trunk' into add/standard-chars…

e6c92f6

…et-normalization

Merge remote-tracking branch 'upstream/trunk' into add/standard-chars…

f095712

…et-normalization

A few more tests

f41ea23

dmsnell closed this May 14, 2024

dmsnell deleted the add/standard-charset-normalization branch May 14, 2024 18:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Normalize UTF-8 charset slug detection. #6535

Normalize UTF-8 charset slug detection. #6535

dmsnell commented May 9, 2024 •

edited

github-actions bot commented May 9, 2024

github-actions bot commented May 9, 2024 •

edited

github-actions bot commented May 9, 2024

sirreal left a comment

dmsnell May 10, 2024

sirreal May 10, 2024

dmsnell May 14, 2024

sirreal May 15, 2024

dmsnell May 15, 2024

sirreal May 15, 2024

dmsnell May 15, 2024

sirreal May 15, 2024

sirreal May 10, 2024

sirreal May 10, 2024

dmsnell May 10, 2024

sirreal May 10, 2024

dmsnell commented May 14, 2024

-			// Other charset slugs should not be adjusted.
+			// Other charset slugs should not be adjusted.
+			array( 'ISO8859-128', 'ISO8859-128' ),
+			array( 'ISO', 'ISO' ),
+			array( 'UTF8080', 'UTF8080' ),
+			array( 'UTF', 'UTF' ),

	add_filter( 'option_blog_charset', '_wp_specialchars' ); // IMPORTANT: This must not be wp_specialchars() or esc_html() or it'll cause an infinite loop.
	add_filter( 'option_blog_charset', '_canonical_charset' );

Normalize UTF-8 charset slug detection. #6535

Normalize UTF-8 charset slug detection. #6535

Conversation

dmsnell commented May 9, 2024 • edited

github-actions bot commented May 9, 2024

Trac Ticket Missing

github-actions bot commented May 9, 2024 • edited

github-actions bot commented May 9, 2024

Test using WordPress Playground

Some things to be aware of

sirreal left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dmsnell commented May 14, 2024

dmsnell commented May 9, 2024 •

edited

github-actions bot commented May 9, 2024 •

edited