use unicode-width instead of len() or grapheme cluster #7. #71

prataprc · 2018-05-23T06:36:15Z

I have refactored each_split_within() to follow what is, hopefully, simpler logic. The test cases are passing, and I have added a new test case for multi-width characters.
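To illustrate why byte length is the wrong measure for alignment, here is a minimal sketch comparing three ways of measuring a string. `approx_width` is a hypothetical and deliberately crude stand-in for `UnicodeWidthStr::width` from the unicode-width crate (it only knows a few East Asian ranges), used so that the example compiles with the standard library alone.

```rust
/// Hypothetical, simplified stand-in for `UnicodeWidthStr::width`:
/// a few East Asian ranges count as two columns, everything else as one.
/// The real crate follows the Unicode East Asian Width data instead.
fn approx_width(s: &str) -> usize {
    s.chars()
        .map(|c| match c as u32 {
            0x1100..=0x115F         // Hangul Jamo initial consonants
            | 0x3040..=0x30FF       // Hiragana and Katakana
            | 0x4E00..=0x9FFF       // CJK Unified Ideographs
            | 0xAC00..=0xD7A3       // Hangul syllables
            | 0xFF01..=0xFF60 => 2, // Fullwidth forms
            _ => 1,
        })
        .sum()
}

fn main() {
    let s = "日本語";
    println!("bytes   = {}", s.len());           // 9: three bytes per character in UTF-8
    println!("chars   = {}", s.chars().count()); // 3
    println!("columns = {}", approx_width(s));   // 6: each character is two columns wide
}
```

Padding a help column based on `len()` would treat this string as 9 columns wide, while a terminal renders it in 6.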

Let me know if this PR will be useful for this issue or needs modifications.

Thanks,

prataprc · 2018-05-23T08:54:55Z

The reason this PR is failing is #![feature(rustc_private)]. Please suggest an alternative to get past this issue. Thanks,

SimonSapin · 2018-05-23T13:01:36Z

Cargo.toml

@@ -15,3 +15,4 @@ categories = ["command-line-interface"]

 [dev-dependencies]
 log = "0.4"
+unicode-width = "0.1.5"


[dev-dependencies] are not available for the "main" crate. You want [dependencies].
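As a sketch of what that change amounts to, assuming the rest of Cargo.toml stays as it is, the two sections would look like this:

```toml
[dependencies]
unicode-width = "0.1.5"

[dev-dependencies]
log = "0.4"
```

Crates listed under [dev-dependencies] are only linked when building tests, examples, and benchmarks, so library code in src/lib.rs cannot use them.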

KodrAus

Thanks for taking this on @prataprc! I've left a few comments. Some of them are just little nitpicks.

KodrAus · 2018-05-27T23:05:09Z

src/lib.rs


 use std::error::Error;
 use std::ffi::OsStr;
 use std::fmt;
 use std::iter::{repeat, IntoIterator};
 use std::result;

+extern crate unicode_width;


nit: let's put this extern crate up with the extern crate log statement.

KodrAus · 2018-05-27T23:05:31Z

src/lib.rs

-                    } else {
-                        row.push_str("--");
-                    }
+                    row.push_str(if self.long_only { "-" } else { "--" });


KodrAus · 2018-05-27T23:06:19Z

src/lib.rs

+/// Note: Function was moved here from `std::str` because this module
+/// is the only place that uses it, and because it was too specific for
+/// a general string function.
+fn each_split_within(desc: &String, lim: usize) -> Vec<String> {


nit: This could be desc: &str
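A small sketch of why `&str` is the more flexible parameter type. `first_word` is a hypothetical helper with a placeholder body, not the PR's each_split_within; only the signature matters here.

```rust
// Hypothetical helper taking `&str`; the body is a placeholder,
// not the implementation from this PR.
fn first_word(desc: &str) -> &str {
    desc.split_whitespace().next().unwrap_or("")
}

fn main() {
    let owned = String::from("print help");

    // A `&String` coerces to `&str` through `Deref`, so existing callers still compile.
    assert_eq!(first_word(&owned), "print");

    // String literals and sub-slices also work, with no `String` allocation needed.
    assert_eq!(first_word("verbose output"), "verbose");
    assert_eq!(first_word(&owned[6..]), "help");
}
```

With a `&String` parameter, the last two calls would need an owned `String` to be created first.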

KodrAus · 2018-05-27T23:07:34Z

src/lib.rs

-                // A single word has gone over the limit.  In this
-                // case we just accept that the word will be too long.
-                B
+/// Note: Function was moved here from `std::str` because this module


This comment probably isn't accurate anymore.

KodrAus · 2018-05-28T00:04:49Z

src/lib.rs

+    let mut rows = Vec::new();
+    for line in desc.trim().lines() {
+        let mut words = Vec::new();
+        let mut word = String::new();


We've got a fair amount of allocation going on in this method that would be good to avoid if we can. It seems like we're processing the line multiple times to clear out excess whitespace. We could do this without the temporary strings by maintaining an index into the line that we're processing:

    // Add an additional whitespace to flush the last word
    let line_chars = line.chars().chain(Some(' '));

    let words = line_chars.fold((Vec::new(), 0, 0), |(mut words, word_start_idx, last_idx), c| {
        // Get the current byte offset
        let idx = last_idx + c.len_utf8();

        // If the char is whitespace, advance the word start and maybe push a word
        if c.is_whitespace() {
            if word_start_idx != last_idx {
                words.push(&line[word_start_idx..last_idx]);
            }

            (words, idx, idx)
        }
        // If the char is not whitespace, continue, retaining the current
        else {
            (words, word_start_idx, idx)
        }
    }).0;

The example uses the Iterator::fold method to let us thread state through our chars, so we can find the point at which a word starts, then keep that index until we hit the end. Here's a runnable version you can check out.
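For reference, a self-contained sketch of the same idea that compiles on its own. The function name `split_words` and the sample input are assumptions made for illustration; the fold body follows the snippet above.

```rust
/// Splits `line` into whitespace-separated words. Each word borrows from
/// `line`, so no per-word `String` is allocated.
fn split_words(line: &str) -> Vec<&str> {
    // Append a trailing space so the final word is flushed.
    let line_chars = line.chars().chain(Some(' '));

    line_chars
        .fold(
            (Vec::new(), 0usize, 0usize),
            |(mut words, word_start_idx, last_idx), c| {
                // Byte offset just past the current char.
                let idx = last_idx + c.len_utf8();

                if c.is_whitespace() {
                    // A non-empty span since the last whitespace is a word.
                    if word_start_idx != last_idx {
                        words.push(&line[word_start_idx..last_idx]);
                    }
                    (words, idx, idx)
                } else {
                    // Still inside a word: keep its start, advance the end.
                    (words, word_start_idx, idx)
                }
            },
        )
        .0
}

fn main() {
    let words = split_words("  hello   wide 日本語 ");
    assert_eq!(words, vec!["hello", "wide", "日本語"]);
}
```

Tracking byte offsets with `len_utf8()` rather than counting chars keeps the slice boundaries valid for multi-byte characters.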

Looks neat! 👍

KodrAus · 2018-06-03T04:38:17Z

src/lib.rs

-                C
+        }).0;
+
+        let mut row = String::new();


We could cut down some more allocations in this part of the function too. We don't need the filter anymore because the words we get from above are all greater than 0 in length. So we could do something like this:

    let mut current_row = String::new();
    for word in words.iter() {
        let sep = if current_row.len() > 0 { Some(" ") } else { None };
        let mut width = current_row.width() + word.width() + sep.map(UnicodeWidthStr::width).unwrap_or(0);

        if width <= lim {
            if let Some(sep) = sep {
                current_row.push_str(sep);
            }
            current_row.push_str(word);
            continue
        }

        if current_row.len() > 0 {
            rows.push(current_row.clone());
            current_row.clear();
        }
        current_row.push_str(word);
    }

    if current_row.len() > 0 {
        rows.push(current_row);
    }

So we re-use the same current_row with its capacity already set somewhere up around lim instead of creating a new string buffer each time.

We also don't need to filter and copy rows, we can just return it as-is at the end of the method because it's only got valid rows in it.
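A standalone sketch of this row-building loop, under a few assumptions: `wrap_words` is a hypothetical name, `approx_width` is a crude stand-in for `UnicodeWidthStr::width` so that no external crate is needed, and the separator is an empty or one-space `&str` rather than an `Option`.

```rust
/// Hypothetical, simplified stand-in for `UnicodeWidthStr::width`.
fn approx_width(s: &str) -> usize {
    s.chars()
        .map(|c| match c as u32 {
            0x1100..=0x115F | 0x3040..=0x30FF | 0x4E00..=0x9FFF
            | 0xAC00..=0xD7A3 | 0xFF01..=0xFF60 => 2,
            _ => 1,
        })
        .sum()
}

/// Greedily packs `words` into rows of at most `lim` columns.
/// A single word wider than `lim` gets a row of its own.
fn wrap_words(words: &[&str], lim: usize) -> Vec<String> {
    let mut rows = Vec::new();
    let mut current_row = String::new();

    for word in words {
        let sep = if current_row.is_empty() { "" } else { " " };
        let width = approx_width(&current_row) + approx_width(sep) + approx_width(word);

        // The word fits on the current row.
        if width <= lim {
            current_row.push_str(sep);
            current_row.push_str(word);
            continue;
        }

        // Flush the row, keeping the buffer's capacity for the next one.
        if !current_row.is_empty() {
            rows.push(current_row.clone());
            current_row.clear();
        }
        current_row.push_str(word);
    }

    if !current_row.is_empty() {
        rows.push(current_row);
    }
    rows
}

fn main() {
    let rows = wrap_words(&["print", "the", "日本語", "help"], 10);
    assert_eq!(rows, vec!["print the", "日本語", "help"]);
}
```

In the usage above, "日本語" and "help" end up on separate rows because together they need 11 columns, even though the two strings hold only 7 characters.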

What do you think?

Thanks for the suggestion. 👍 Amended the PR.

KodrAus

This looks good to me! I think the new version of each_split_within is definitely easier to follow.

Thanks for working on this @prataprc!

KodrAus · 2018-06-09T07:39:34Z

The AppVeyor failure is transient.

prataprc mentioned this pull request May 23, 2018

getopts should use grapheme clusters for text alignment #7

Closed

SimonSapin reviewed May 23, 2018


KodrAus reviewed May 28, 2018


KodrAus reviewed Jun 3, 2018


prataprc added 5 commits June 3, 2018 13:42

use unicode-width instead of len() or grapheme cluster #7.

dd7634c

move unicode-width from [dev-dependencies] to [dependencies] issue #7.

9973a6c

for_each is not implemented until 1.21, issue #7.

7e37bd7

optimize parsing words in each_split_within(), issue #7.

547e4d5

more optimizations for issue #7.

27df9b2

KodrAus approved these changes Jun 9, 2018


KodrAus merged commit 4976a82 into rust-lang:master Jun 28, 2018


