Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update String#trim #316

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Conversation

lewisje
Copy link

@lewisje lewisje commented Jun 26, 2015

Test for incorrect trimming of two whitespace characters instead of just one, reduce the number of string concatenations, and more quickly determine hasTrimWhitespaceBug to be false in the usual pre-ES5 case.

I would change if (typeof this === 'undefined' || this === null) to if (this == null) but I don't know whether that's allowed by this library's style guide.

@ljharb
Copy link
Member

ljharb commented Jun 26, 2015

Just like on paulmillr/es6-shim#354, please add a test that would fail without this change.

Can you also link to the ES5 spec and show where these characters should be included?

@lewisje
Copy link
Author

lewisje commented Jun 26, 2015

I addressed the references to the spec in paulmillr/es6-shim#354 (comment), but a key point from that is that I probably got the improper-test trim wrong, as this shim definitely does: The set of non-trimmable characters includes \x85 but maybe not \u200B (depending on the version of Unicode targeted), but this shim checks for \u200B and not \x85, and my PR at this point checks both, as the es6-shim should (and currently doesn't).

@ljharb
Copy link
Member

ljharb commented Jun 26, 2015

Thanks - we'll still need an automated test to prevent future regressions.

@lewisje
Copy link
Author

lewisje commented Jun 26, 2015

I'll get a test in later today, and then feel the frustration that comes with not being able to rebase from GitHub's Web-based interface.

@ljharb
Copy link
Member

ljharb commented Jun 26, 2015

👍

@lewisje
Copy link
Author

lewisje commented Jun 26, 2015

By the way, analogously to a comment by mathiasbynens in paulmillr/es6-shim#354 (comment), should \u200B be considered trimmable here (because it is in Zs in the oldest permissible version of Unicode to target, 3.0), even though this would mean that String#trim will be overwritten at least once if, as recommended, both es5-shim and es6-shim are loaded (because then es5-shim would trim \u200B, while es6-shim properly forbids trimming \u200B)?

@ljharb
Copy link
Member

ljharb commented Jun 26, 2015

It's totally fine to have es5-shim use one version, and es6-shim use another.

@ljharb
Copy link
Member

ljharb commented Jul 20, 2015

@lewisje Are you planning on completing this PR?

@lewisje
Copy link
Author

lewisje commented Jul 20, 2015

I am, but I got distracted 😞

@lewisje
Copy link
Author

lewisje commented Aug 23, 2015

I forgot, with this re-commit, to explicitly trim U+200B, which is in Zs in the oldest version of Unicode targeted by ES5, or at least to stop ensuring it doesn't get trimmed; I just got the idea that if as I mused, this would mean that es5-shim + es6-shim means String#trim gets patched no matter what, this could actually be a speed improvement over the native trim method.

I remember seeing, for example, that if you trim using a regex from the beginning, then use a loop to find the index to do String#substring, that's faster than native while not going into insane code-golfing territory; that's for a later perf-oriented PR though.

I think I'll be re-committing my PRs for both this and es6-shim now.

@ljharb
Copy link
Member

ljharb commented Aug 23, 2015

It's totally fine if es5-shim and es6-shim together end up double-patching a method. Definitely let's worry first about correctness, and later about perf.

@lewisje
Copy link
Author

lewisje commented Aug 23, 2015

Maybe you can't patch String#trim on Node after all; it looks like it's not trimming U+200B even though the patched version did just that. I'll try allowing but not requiring U+200B to be trimmed.

@ljharb
Copy link
Member

ljharb commented Aug 23, 2015

You can patch it on any engine. If the spec requires it, the tests need to.

@lewisje
Copy link
Author

lewisje commented Aug 23, 2015

I made that comment because, when I had String#trim explicitly trim U+200B, I got errors from Travis CI, and all of them appeared to come from that character not actually getting trimmed: https://travis-ci.org/es-shims/es5-shim/builds/76831557

@ljharb
Copy link
Member

ljharb commented Aug 23, 2015

To me, that suggests that the runtime check was broken, and the test was correct :-)

@lewisje
Copy link
Author

lewisje commented Aug 23, 2015

I don't see where the runtime check was broken: 40e46d1

@ljharb
Copy link
Member

ljharb commented Aug 23, 2015

Try being more explicit rather than relying on truthiness or falsiness of an empty string - if indeed it still seems correct, then let's investigate that!

@lewisje
Copy link
Author

lewisje commented Aug 23, 2015

I noticed that weird stuff happens in the Node REPL if I re-define String#trim:

> var origTrim = String.prototype.trim;
undefined
> String.prototype.trim = function () {return 'lol';};
[Function]
> 'rofl'.trim();
ReferenceError: lol is not defined
    at repl:1:1
    at REPLServer.defaultEval (repl.js:164:27)
    at bound (domain.js:250:14)
    at REPLServer.runBound [as eval] (domain.js:263:12)
    at REPLServer.<anonymous> (repl.js:392:12)
    at emitOne (events.js:82:20)
    at REPLServer.emit (events.js:169:7)
    at REPLServer.Interface._onLine (readline.js:210:10)
    at REPLServer.Interface._line (readline.js:546:8)
    at REPLServer.Interface._ttyWrite (readline.js:823:14)

It looks like String#trim can be re-defined only once in the REPL:

> var origTrim = String.prototype.trim;
var origTrim = String.prototype.trim;
undefined
> String.prototype.trim = function trim() {return '"lol";';};
[Function: trim]
> 'rofl'.trim();
'lol'
> 'omg'.trim();
'lol'
> String.prototype.trim = origTrim;
'lol'
> 'omg'.trim();
'lol'
> 'rofl'.trim();
'lol'

My guess is that the REPL relies on String#trim internally; I haven't tested something like replacing it with a legitimate trim operation yet.

With that said, I'll make the trim checks more explicit now.

@lewisje
Copy link
Author

lewisje commented Aug 23, 2015

Okay, I just tried patching String#trim in the Node REPL with something that kinda-sorta works but is defective, then swapping out the original String#trim, and that seems to work:

> var origTrim = String.prototype.trim;
var origTrim = String.prototype.trim;
undefined
> String.prototype.trim = function trim() {return this.replace(/^[\t\r\f\v\n\x20\xA0]+|[\t\r\f\v\n\x20\xA0]+$/g, '');};
[Function: trim]
> ' a \uFEFF '.trim();
'a '
> origTrim.call(' a \uFEFF ');
'a'
> String.prototype.trim = origTrim;
[Function: trim]
> ' a \uFEFF '.trim();
'a'

Test for incorrect trimming of NEL, trim ZWSP (which *is* trimmable in the oldest Unicode spec ES5 targets), reduce the number of string concatenations, and more quickly determine `hasTrimWhitespaceBug` to be false in the usual pre-ES5 case.

I would change `if (typeof this === 'undefined' || this === null)` to `if (this == null)` but I don't know whether that's allowed by this library's style guide.
it('does not trim the zero-width space', function () {
expect('\u200b'.trim()).toBe('\u200b');
expect('\u200b'.trim().length).toBe(1);
it('does not trim the next-line character', function () {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you link me to which part of the spec talks about this character?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This character is in the Unicode category Cc (Other, Control). See http://www.fileformat.info/info/unicode/char/0085/index.htm. The spec doesn't say anything about this category or this character.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

http://es5.github.io/#x7.2

ECMAScript implementations must recognize all of the white space characters defined in Unicode 3.0. Later editions of the Unicode Standard may define other white space characters. ECMAScript implementations may recognize white space characters from later editions of the Unicode Standard.

I once read through several editions of the Unicode Standard and found that U+0085 (NEL) has been in Cc ever since 1.1.5 at least, and in particular in 3.0 and all later versions, so although its purpose was a single-character replacement for \r\n, it's non-trimmable; also, as for actively trimming U+200B (ZWSP), that was in Zs before Unicode 4.0.1 (upon and after which it was in Cf), so ES5 must recognize it as trimmable (and ES6 must not): paulmillr/es6-shim#354 (comment)

(When I made that comment, I hadn't realized that even if an ES5 or ES6 implementation targets a later version of Unicode than the earliest permissible one, it must still recognize as whitespace all Zs characters from the earliest permissible version of Unicode, respectively 3.0 and 5.1.0.)

@lewisje
Copy link
Author

lewisje commented Aug 24, 2015

I copied the string from the Travis CI stack trace (latest io.js, but it was the same for a couple other versions of Node I looked at), showing the incorrectly trimmed test string, and ran it through rishida's Unicode converter:

'\u200B\u2028\u2029\u202F\u205F\u3000\uFEFFHello, World!\t\n  \u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B'

This makes it clear that the problem was the failure to trim U+200B even though String#trim should have been patched to do just that.

@@ -4,16 +4,18 @@ describe('String', function () {
'use strict';

describe('trim', function () {
var test = '\x09\x0A\x0B\x0C\x0D\x20\xA0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000\u2028\u2029\uFEFFHello, World!\x09\x0A\x0B\x0C\x0D\x20\xA0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000\u2028\u2029\uFEFF';
var wsp = '\t\n\v\f\r \xA0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005' +
'\u2006\u2007\u2008\u2009\u200A\u200B\u2028\u2029\u202F\u205F\u3000\uFEFF';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's leave this on one line - arbitrary line length limits are silly. it will show the diff better also.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this sentiment; the reason I had broken it up was that I thought the project had line-length limits.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it used to, it doesn't anymore :-D

@ljharb
Copy link
Member

ljharb commented Oct 19, 2015

Please rebase this on top of master - merge commits make bisect sad - note: this means git fetch ; git rebase origin/master not clicking the "merge" button on github.

@ljharb
Copy link
Member

ljharb commented Jan 7, 2020

@lewisje would you mind checking "allow edits" on the RHS of the PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants