New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Normalize UTF-8 charset slug detection. #6535
Normalize UTF-8 charset slug detection. #6535
Conversation
Trac Ticket MissingThis pull request is missing a link to a Trac ticket. For a contribution to be considered, there must be a corresponding ticket in Trac. To attach a pull request to a Trac ticket, please include the ticket's full URL in your pull request description. More information about contributing to WordPress on GitHub can be found in the Core Handbook. |
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the Core Committers: Use this line as a base for the props when committing in SVN:
To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
There are several exist places in Core that attempt to detect if a blog charset is UTF-8. Each place attempts to perform the same check, except the logic is spread throughout and there's no single method provided to make this determination in a consistent way. The `_canonical_charset()` method exists, but is marked private for use. In this patch the new `unicode` module provides `is_utf8_charset()` as a method taking an optional charset slug and indicating if it represents UTF-8, examining all of the allowable variants of that slug. Associated code is updated to use this new function, including `_canonical_charset()`. Finally, the test functions governing `_canonical_charset()` have been rewritten as a single test with a data provider instead of as separate test functions. This patch raises a couple of questions: - Is there a need to normalize the ISO-8859-1 charsets since `htmlspecialchars()` does not need it? - Should WordPress normalize invalid variations of ISO-8859-1, such as "latin1", so that `htmlspecialchars()` does not choke?
27cfcc3
to
a7d7696
Compare
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a nice improvement, I'd like to have it 👍
src/wp-includes/functions.php
Outdated
if ( | ||
( | ||
10 === $charset_length && | ||
0 === strcasecmp( 'iso-8859-1', $charset ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @sirreal - this is better than substr_compare
array( 'Iso-8859-1', 'ISO-8859-1' ), | ||
array( 'ISO8859-1', 'ISO-8859-1' ), | ||
|
||
// Other charset slugs should not be adjusted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirmed this is working correctly, but some substring (we do have ''
test) and superstring tests might be good:
// Other charset slugs should not be adjusted. | |
// Other charset slugs should not be adjusted. | |
array( 'ISO8859-128', 'ISO8859-128' ), | |
array( 'ISO', 'ISO' ), | |
array( 'UTF8080', 'UTF8080' ), | |
array( 'UTF', 'UTF' ), |
Note that those would be a bug with the substr_compare
implementation which returns the index of the string. strcasecmp
returns non-0 to indicate the strings don't match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
those would be a bug with the substr_compare implementation which returns the index of the string
not sure I follow. substr_compare()
doesn't return the index of a string. it starts at an offset in the haystack, and compares the given number of characters. if the string lengths are equal then this assures that the strings are identical, up to case sensitivity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're talking past each other, as I'm not sure what the code snippet is supposed to suggest.
We don't care about the value of the response, only if it's less than, greater than, or equal to 0
. For our purposes, we only care about 0
and equality, which it does fine.
Missing from all your examples is the critical $length
parameter. The length parameter establishes that it will stop comparing after that many bytes, or whenever one of the string ends, if one ends before the $length
has been reached.
Does this make more sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before, the implementation looked like this where $charset
with extra characters would match:
$charset = 'iso-8859-1337';
var_dump( 0 === substr_compare( 'iso-8859-1', $charset, 0, 10, true ) ); // bool(true)
That's what had me thinking along this line, but again, it's not a problem because the final implementation is using functions that don't have that problem.
We can drop it 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can drop it 🙂
yeah I realize it's not important, but I did originally have a length check, and a note about that. more or less at this point I want to ensure I'm not overlooking something, and that we both reach a shared understanding, if we want.
if the string lengths are equal then this assures that the strings are identical, up to case sensitivity.
so there is no bug because if it compares 10 characters and the charset is only 10 characters I don't think it's possible to be wrong.
a7d7696#diff-5526e2d644a04f0c466d6c749542fdb7952021784e7b1583ff45f9d12601ef52R7497-R7498
if (
10 === $charset_length &&
substr_compare( $charset, 'iso-8859-1', 0, 10, true )
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct 😆
src/wp-includes/formatting.php
Outdated
if ( in_array( $charset, array( 'utf8', 'utf-8', 'UTF8' ), true ) ) { | ||
$charset = 'UTF-8'; | ||
} | ||
$charset = _canonical_charset( $charset || get_option( 'blog_charset' ) ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an issue here. On a test site, before this change, this function starts with a $charset=false
value, then processes options and has $charset = "UTF-8"
value. After this change, the charset starts as false
and becomes true
after this line.
Warning: htmlspecialchars(): Charset "1" is not supported, assuming UTF-8 in /var/www/html/wp-includes/formatting.php on line 984
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The convoluted way of getting options here may have been because there are default filters that may create cycles here:
wordpress-develop/src/wp-includes/default-filters.php
Lines 291 to 292 in 8d5e55a
add_filter( 'option_blog_charset', '_wp_specialchars' ); // IMPORTANT: This must not be wp_specialchars() or esc_html() or it'll cause an infinite loop. | |
add_filter( 'option_blog_charset', '_canonical_charset' ); |
_wp_specialchars
calls get_option
which filters option_blog_charset
which calls _wp_specialchars
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took me a minute to figure this out, but I've made this mistake many times before. The ||
expression returns a boolean value, not the first non-false
value. I've updated in b9347d7 to use a proper ternary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That fixed it for me 🙂
There are several exist places in Core that attempt to detect if a blog charset is UTF-8. Each place attempts to perform the same check, except the logic is spread throughout and there's no single method provided to make this determination in a consistent way. The `_canonical_charset()` method exists, but is marked private for use. In this patch the new `unicode` module provides `is_utf8_charset()` as a method taking an optional charset slug and indicating if it represents UTF-8, examining all of the allowable variants of that slug. Associated code is updated to use this new function, including `_canonical_charset()`. If no slug is provided, it will look up the current `get_option( 'blog_charset' )`. Finally, the test functions governing `_canonical_charset()` have been rewritten as a single test with a data provider instead of as separate test functions. Developed in #6535 Discussed in https://core.trac.wordpress.org/ticket/61182 Fixes #61182. Props dmsnell, jonsurrell. git-svn-id: https://develop.svn.wordpress.org/trunk@58147 602fd350-edb4-49c9-b593-d223f7449a82
There are several exist places in Core that attempt to detect if a blog charset is UTF-8. Each place attempts to perform the same check, except the logic is spread throughout and there's no single method provided to make this determination in a consistent way. The `_canonical_charset()` method exists, but is marked private for use. In this patch the new `unicode` module provides `is_utf8_charset()` as a method taking an optional charset slug and indicating if it represents UTF-8, examining all of the allowable variants of that slug. Associated code is updated to use this new function, including `_canonical_charset()`. If no slug is provided, it will look up the current `get_option( 'blog_charset' )`. Finally, the test functions governing `_canonical_charset()` have been rewritten as a single test with a data provider instead of as separate test functions. Developed in WordPress/wordpress-develop#6535 Discussed in https://core.trac.wordpress.org/ticket/61182 Fixes #61182. Props dmsnell, jonsurrell. Built from https://develop.svn.wordpress.org/trunk@58147 git-svn-id: http://core.svn.wordpress.org/trunk@57612 1a063a9b-81f0-0310-95a4-ce76da25c4cd
In the merge of [58417] the new file was missed. This commit adds the missing required file. Developed in #6535 Discussed in https://core.trac.wordpress.org/ticket/61182 Fixes #61182. Follow-up to [58147]. Props dmsnell, jonsurrell. git-svn-id: https://develop.svn.wordpress.org/trunk@58148 602fd350-edb4-49c9-b593-d223f7449a82
There are several exist places in Core that attempt to detect if a blog charset is UTF-8. Each place attempts to perform the same check, except the logic is spread throughout and there's no single method provided to make this determination in a consistent way. The `_canonical_charset()` method exists, but is marked private for use. In this patch the new `unicode` module provides `is_utf8_charset()` as a method taking an optional charset slug and indicating if it represents UTF-8, examining all of the allowable variants of that slug. Associated code is updated to use this new function, including `_canonical_charset()`. If no slug is provided, it will look up the current `get_option( 'blog_charset' )`. Finally, the test functions governing `_canonical_charset()` have been rewritten as a single test with a data provider instead of as separate test functions. Developed in WordPress/wordpress-develop#6535 Discussed in https://core.trac.wordpress.org/ticket/61182 Fixes #61182. Props dmsnell, jonsurrell. Built from https://develop.svn.wordpress.org/trunk@58147 git-svn-id: https://core.svn.wordpress.org/trunk@57612 1a063a9b-81f0-0310-95a4-ce76da25c4cd
In the merge of [58417] the new file was missed. This commit adds the missing required file. Developed in WordPress/wordpress-develop#6535 Discussed in https://core.trac.wordpress.org/ticket/61182 Fixes #61182. Follow-up to [58147]. Props dmsnell, jonsurrell. Built from https://develop.svn.wordpress.org/trunk@58148 git-svn-id: http://core.svn.wordpress.org/trunk@57613 1a063a9b-81f0-0310-95a4-ce76da25c4cd
In the merge of [58417] the new file was missed. This commit adds the missing required file. Developed in WordPress/wordpress-develop#6535 Discussed in https://core.trac.wordpress.org/ticket/61182 Fixes #61182. Follow-up to [58147]. Props dmsnell, jonsurrell. Built from https://develop.svn.wordpress.org/trunk@58148 git-svn-id: https://core.svn.wordpress.org/trunk@57613 1a063a9b-81f0-0310-95a4-ce76da25c4cd
Trac ticket: Core-61182
There are several exist places in Core that attempt to detect if a blog charset is UTF-8. Each place attempts to perform the same check, except the logic is spread throughout and there's no single method provided to make this determination in a consistent way. The
_canonical_charset()
method exists, but is marked private for use.In this patch the new
unicode
module providesis_utf8_charset()
as a method taking an optional charset slug and indicating if it represents UTF-8, examining all of the allowable variants of that slug. Associated code is updated to use this new function, including_canonical_charset()
.Finally, the test functions governing
_canonical_charset()
have been rewritten as a single test with a data provider instead of as separate test functions.This patch raises a couple of questions:
htmlspecialchars()
does not need it?htmlspecialchars()
does not choke?One important question I'd like your help on, if you know the answer: is it still important to avoid calling
get_option( 'blog_charset' )
multiple times in a row, or is that already cached? If it's already cached then I suspect further caching or use ofstatic
vars will only bloat the memory footprint of Core without speeding it up.