Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalize UTF-8 charset slug detection. #6535

Closed

Conversation

dmsnell
Copy link
Contributor

@dmsnell dmsnell commented May 9, 2024

Trac ticket: Core-61182

There are several exist places in Core that attempt to detect if a blog charset is UTF-8. Each place attempts to perform the same check, except the logic is spread throughout and there's no single method provided to make this determination in a consistent way. The _canonical_charset() method exists, but is marked private for use.

In this patch the new unicode module provides is_utf8_charset() as a method taking an optional charset slug and indicating if it represents UTF-8, examining all of the allowable variants of that slug. Associated code is updated to use this new function, including _canonical_charset().

Finally, the test functions governing _canonical_charset() have been rewritten as a single test with a data provider instead of as separate test functions.

This patch raises a couple of questions:

  • Is there a need to normalize the ISO-8859-1 charsets since htmlspecialchars() does not need it?
  • Should WordPress normalize invalid variations of ISO-8859-1, such as "latin1", so that htmlspecialchars() does not choke?

One important question I'd like your help on, if you know the answer: is it still important to avoid calling get_option( 'blog_charset' ) multiple times in a row, or is that already cached? If it's already cached then I suspect further caching or use of static vars will only bloat the memory footprint of Core without speeding it up.

@dmsnell dmsnell changed the title Normalize UTF-8 charset detection. Normalize UTF-8 charset slug detection. May 9, 2024
Copy link

github-actions bot commented May 9, 2024

Trac Ticket Missing

This pull request is missing a link to a Trac ticket. For a contribution to be considered, there must be a corresponding ticket in Trac.

To attach a pull request to a Trac ticket, please include the ticket's full URL in your pull request description. More information about contributing to WordPress on GitHub can be found in the Core Handbook.

Copy link

github-actions bot commented May 9, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

There are several exist places in Core that attempt to detect if a blog
charset is UTF-8. Each place attempts to perform the same check, except
the logic is spread throughout and there's no single method provided to
make this determination in a consistent way. The `_canonical_charset()`
method exists, but is marked private for use.

In this patch the new `unicode` module provides `is_utf8_charset()` as
a method taking an optional charset slug and indicating if it represents
UTF-8, examining all of the allowable variants of that slug. Associated
code is updated to use this new function, including `_canonical_charset()`.

Finally, the test functions governing `_canonical_charset()` have been
rewritten as a single test with a data provider instead of as separate
test functions.

This patch raises a couple of questions:
 - Is there a need to normalize the ISO-8859-1 charsets since
   `htmlspecialchars()` does not need it?
 - Should WordPress normalize invalid variations of ISO-8859-1, such
   as "latin1", so that `htmlspecialchars()` does not choke?
Copy link

github-actions bot commented May 9, 2024

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

src/wp-includes/functions.php Outdated Show resolved Hide resolved
Copy link
Member

@sirreal sirreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a nice improvement, I'd like to have it 👍

src/wp-includes/unicode.php Outdated Show resolved Hide resolved
if (
(
10 === $charset_length &&
0 === strcasecmp( 'iso-8859-1', $charset )
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @sirreal - this is better than substr_compare

array( 'Iso-8859-1', 'ISO-8859-1' ),
array( 'ISO8859-1', 'ISO-8859-1' ),

// Other charset slugs should not be adjusted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirmed this is working correctly, but some substring (we do have '' test) and superstring tests might be good:

Suggested change
// Other charset slugs should not be adjusted.
// Other charset slugs should not be adjusted.
array( 'ISO8859-128', 'ISO8859-128' ),
array( 'ISO', 'ISO' ),
array( 'UTF8080', 'UTF8080' ),
array( 'UTF', 'UTF' ),

Note that those would be a bug with the substr_compare implementation which returns the index of the string. strcasecmp returns non-0 to indicate the strings don't match.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

those would be a bug with the substr_compare implementation which returns the index of the string

not sure I follow. substr_compare() doesn't return the index of a string. it starts at an offset in the haystack, and compares the given number of characters. if the string lengths are equal then this assures that the strings are identical, up to case sensitivity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not that well behaved, especially before php 8.2. It's… very strange. But we're fine since we're not using substr_compare:

Screenshot 2024-05-15 at 14 54 22

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're talking past each other, as I'm not sure what the code snippet is supposed to suggest.

We don't care about the value of the response, only if it's less than, greater than, or equal to 0. For our purposes, we only care about 0 and equality, which it does fine.

Missing from all your examples is the critical $length parameter. The length parameter establishes that it will stop comparing after that many bytes, or whenever one of the string ends, if one ends before the $length has been reached.

Screenshot 2024-05-15 at 10 19 39 AM Screenshot 2024-05-15 at 10 19 53 AM

Does this make more sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before, the implementation looked like this where $charset with extra characters would match:

$charset = 'iso-8859-1337';
var_dump( 0 === substr_compare( 'iso-8859-1', $charset, 0, 10, true ) ); // bool(true)

That's what had me thinking along this line, but again, it's not a problem because the final implementation is using functions that don't have that problem.

We can drop it 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can drop it 🙂

yeah I realize it's not important, but I did originally have a length check, and a note about that. more or less at this point I want to ensure I'm not overlooking something, and that we both reach a shared understanding, if we want.

if the string lengths are equal then this assures that the strings are identical, up to case sensitivity.

so there is no bug because if it compares 10 characters and the charset is only 10 characters I don't think it's possible to be wrong.

a7d7696#diff-5526e2d644a04f0c466d6c749542fdb7952021784e7b1583ff45f9d12601ef52R7497-R7498

if (
	10 === $charset_length &&
	substr_compare( $charset, 'iso-8859-1', 0, 10, true )
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct 😆

if ( in_array( $charset, array( 'utf8', 'utf-8', 'UTF8' ), true ) ) {
$charset = 'UTF-8';
}
$charset = _canonical_charset( $charset || get_option( 'blog_charset' ) );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an issue here. On a test site, before this change, this function starts with a $charset=false value, then processes options and has $charset = "UTF-8" value. After this change, the charset starts as false and becomes true after this line.

Warning: htmlspecialchars(): Charset "1" is not supported, assuming UTF-8 in /var/www/html/wp-includes/formatting.php on line 984

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The convoluted way of getting options here may have been because there are default filters that may create cycles here:

add_filter( 'option_blog_charset', '_wp_specialchars' ); // IMPORTANT: This must not be wp_specialchars() or esc_html() or it'll cause an infinite loop.
add_filter( 'option_blog_charset', '_canonical_charset' );

_wp_specialchars calls get_option which filters option_blog_charset which calls _wp_specialchars.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took me a minute to figure this out, but I've made this mistake many times before. The || expression returns a boolean value, not the first non-false value. I've updated in b9347d7 to use a proper ternary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That fixed it for me 🙂

pento pushed a commit that referenced this pull request May 14, 2024
There are several exist places in Core that attempt to detect if a blog charset
is UTF-8. Each place attempts to perform the same check, except the logic is
spread throughout and there's no single method provided to make this
determination in a consistent way. The `_canonical_charset()` method exists,
but is marked private for use.

In this patch the new `unicode` module provides `is_utf8_charset()` as a method
taking an optional charset slug and indicating if it represents UTF-8,
examining all of the allowable variants of that slug. Associated code is
updated to use this new function, including `_canonical_charset()`. If no slug
is provided, it will look up the current `get_option( 'blog_charset' )`.

Finally, the test functions governing `_canonical_charset()` have been
rewritten as a single test with a data provider instead of as separate test
functions.

Developed in #6535
Discussed in https://core.trac.wordpress.org/ticket/61182

Fixes #61182.
Props dmsnell, jonsurrell.


git-svn-id: https://develop.svn.wordpress.org/trunk@58147 602fd350-edb4-49c9-b593-d223f7449a82
markjaquith pushed a commit to markjaquith/WordPress that referenced this pull request May 14, 2024
There are several exist places in Core that attempt to detect if a blog charset
is UTF-8. Each place attempts to perform the same check, except the logic is
spread throughout and there's no single method provided to make this
determination in a consistent way. The `_canonical_charset()` method exists,
but is marked private for use.

In this patch the new `unicode` module provides `is_utf8_charset()` as a method
taking an optional charset slug and indicating if it represents UTF-8,
examining all of the allowable variants of that slug. Associated code is
updated to use this new function, including `_canonical_charset()`. If no slug
is provided, it will look up the current `get_option( 'blog_charset' )`.

Finally, the test functions governing `_canonical_charset()` have been
rewritten as a single test with a data provider instead of as separate test
functions.

Developed in WordPress/wordpress-develop#6535
Discussed in https://core.trac.wordpress.org/ticket/61182

Fixes #61182.
Props dmsnell, jonsurrell.

Built from https://develop.svn.wordpress.org/trunk@58147


git-svn-id: http://core.svn.wordpress.org/trunk@57612 1a063a9b-81f0-0310-95a4-ce76da25c4cd
pento pushed a commit that referenced this pull request May 14, 2024
In the merge of [58417] the new file was missed. This commit adds the missing required file.

Developed in #6535
Discussed in https://core.trac.wordpress.org/ticket/61182

Fixes #61182.
Follow-up to [58147].
Props dmsnell, jonsurrell.


git-svn-id: https://develop.svn.wordpress.org/trunk@58148 602fd350-edb4-49c9-b593-d223f7449a82
github-actions bot pushed a commit to gilzow/wordpress-performance that referenced this pull request May 14, 2024
There are several exist places in Core that attempt to detect if a blog charset
is UTF-8. Each place attempts to perform the same check, except the logic is
spread throughout and there's no single method provided to make this
determination in a consistent way. The `_canonical_charset()` method exists,
but is marked private for use.

In this patch the new `unicode` module provides `is_utf8_charset()` as a method
taking an optional charset slug and indicating if it represents UTF-8,
examining all of the allowable variants of that slug. Associated code is
updated to use this new function, including `_canonical_charset()`. If no slug
is provided, it will look up the current `get_option( 'blog_charset' )`.

Finally, the test functions governing `_canonical_charset()` have been
rewritten as a single test with a data provider instead of as separate test
functions.

Developed in WordPress/wordpress-develop#6535
Discussed in https://core.trac.wordpress.org/ticket/61182

Fixes #61182.
Props dmsnell, jonsurrell.

Built from https://develop.svn.wordpress.org/trunk@58147


git-svn-id: https://core.svn.wordpress.org/trunk@57612 1a063a9b-81f0-0310-95a4-ce76da25c4cd
markjaquith pushed a commit to markjaquith/WordPress that referenced this pull request May 14, 2024
In the merge of [58417] the new file was missed. This commit adds the missing required file.

Developed in WordPress/wordpress-develop#6535
Discussed in https://core.trac.wordpress.org/ticket/61182

Fixes #61182.
Follow-up to [58147].
Props dmsnell, jonsurrell.

Built from https://develop.svn.wordpress.org/trunk@58148


git-svn-id: http://core.svn.wordpress.org/trunk@57613 1a063a9b-81f0-0310-95a4-ce76da25c4cd
@dmsnell
Copy link
Contributor Author

dmsnell commented May 14, 2024

Merged in 58147 and 58148
d4967a3 and 7a031a9

@dmsnell dmsnell closed this May 14, 2024
@dmsnell dmsnell deleted the add/standard-charset-normalization branch May 14, 2024 18:09
github-actions bot pushed a commit to platformsh/wordpress-performance that referenced this pull request May 14, 2024
In the merge of [58417] the new file was missed. This commit adds the missing required file.

Developed in WordPress/wordpress-develop#6535
Discussed in https://core.trac.wordpress.org/ticket/61182

Fixes #61182.
Follow-up to [58147].
Props dmsnell, jonsurrell.

Built from https://develop.svn.wordpress.org/trunk@58148


git-svn-id: https://core.svn.wordpress.org/trunk@57613 1a063a9b-81f0-0310-95a4-ce76da25c4cd
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants