use unicode-width instead of len() or grapheme cluster #7. #71

Merged
merged 5 commits into from Jun 28, 2018

prataprc
Contributor

I have refactored each_split_within() to follow what is, hopefully, simpler logic. The existing test cases pass, and I have added a new test case for multi-width characters.

Let me know whether this PR is useful for this issue or needs modifications.

Thanks,

@prataprc
Contributor Author

This PR is failing because of #![feature(rustc_private)]. Please suggest an alternative to get past this issue. Thanks,

Cargo.toml Outdated
@@ -15,3 +15,4 @@ categories = ["command-line-interface"]

[dev-dependencies]
log = "0.4"
unicode-width = "0.1.5"
Contributor

[dev-dependencies] are not available for the "main" crate. You want [dependencies].

Contributor Author

My bad. +1

Contributor

KodrAus left a comment

Thanks for taking this on @prataprc! I've left a few comments. Some of them are just little nitpicks.

src/lib.rs Outdated

use std::error::Error;
use std::ffi::OsStr;
use std::fmt;
use std::iter::{repeat, IntoIterator};
use std::result;

extern crate unicode_width;
Contributor

nit: let's put this extern crate up with the extern crate log statement.

Contributor Author

👍

} else {
row.push_str("--");
}
row.push_str(if self.long_only { "-" } else { "--" });
Contributor

👍

src/lib.rs Outdated
/// Note: Function was moved here from `std::str` because this module
/// is the only place that uses it, and because it was too specific for
/// a general string function.
fn each_split_within(desc: &String, lim: usize) -> Vec<String> {
Contributor

nit: This could be desc: &str
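
As a minimal sketch of why &str is the more flexible parameter type: a &String coerces to &str automatically, so existing callers keep compiling, and string literals can be passed without building a String first. The helper count_words below is hypothetical and exists only to illustrate the signature; it is not part of getopts.

```rust
// Hypothetical helper (not from getopts), used only to show the `&str` signature.
fn count_words(desc: &str) -> usize {
    desc.split_whitespace().count()
}

// Callers holding a `String` are unaffected: `&String` coerces to `&str`.
fn count_owned(owned: &String) -> usize {
    count_words(owned)
}
```

A call such as count_words("verbose output") also works directly on a literal, which a &String parameter would not allow.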

Contributor Author

👍

src/lib.rs Outdated
// A single word has gone over the limit. In this
// case we just accept that the word will be too long.
/// Note: Function was moved here from `std::str` because this module
Contributor

This comment probably isn't accurate anymore.

Contributor Author

👍

src/lib.rs Outdated
let mut rows = Vec::new();
for line in desc.trim().lines() {
let mut words = Vec::new();
let mut word = String::new();
Contributor

We've got a fair amount of allocation going on in this method that would be good to avoid if we can. It seems like we're processing the line multiple times to clear out excess whitespace. We could do this without the temporary strings by maintaining an index into the line that we're processing:

// Add an additional whitespace to flush the last word
let line_chars = line.chars().chain(Some(' '));

let words = line_chars.fold((Vec::new(), 0, 0), |(mut words, word_start_idx, last_idx), c| {
    // Get the current byte offset
    let idx = last_idx + c.len_utf8();

    // If the char is whitespace, advance the word start and maybe push a word
    if c.is_whitespace() {
        if word_start_idx != last_idx {
            words.push(&line[word_start_idx..last_idx]);
        }

        (words, idx, idx)
    }
    // If the char is not whitespace, continue, retaining the current word start
    else {
        (words, word_start_idx, idx)
    }
}).0;

The example uses the Iterator::fold method to thread state through our chars, so we can find the point at which a word starts, then keep that index until we hit the end of the word. Here's a runnable version you can check out.
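
The linked runnable version is not reproduced in this thread. A self-contained sketch of the same idea might look like the following; the wrapper function split_words and the explicit usize annotations are assumptions added here so the snippet compiles on its own.

```rust
/// Split `line` into whitespace-separated words without allocating a
/// `String` per word. The returned slices borrow from `line`.
/// `split_words` is a hypothetical name chosen for this sketch.
fn split_words(line: &str) -> Vec<&str> {
    // Chain a trailing space so the last word is flushed.
    let line_chars = line.chars().chain(Some(' '));

    line_chars
        .fold(
            (Vec::new(), 0usize, 0usize),
            |(mut words, word_start_idx, last_idx), c| {
                // Byte offset just past the current char.
                let idx = last_idx + c.len_utf8();

                if c.is_whitespace() {
                    // A non-empty span since the last whitespace is a word.
                    if word_start_idx != last_idx {
                        words.push(&line[word_start_idx..last_idx]);
                    }
                    // The next word can start no earlier than after this char.
                    (words, idx, idx)
                } else {
                    // Still inside a word: keep its start, advance the end.
                    (words, word_start_idx, idx)
                }
            },
        )
        .0
}
```

Because the offsets advance by len_utf8(), every slice boundary falls on a char boundary, so multi-byte input does not panic.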

Contributor Author

Looks neat! 👍

src/lib.rs Outdated
}).0;

let mut row = String::new();
Contributor

We could cut down some more allocations in this part of the function too. We don't need the filter anymore because the words we get from above are all greater than 0 in length. So we could do something like this:

let mut current_row = String::new();
for word in words.iter() {
    let sep = if current_row.len() > 0 { Some(" ") } else { None };

    let width =
        current_row.width() + word.width() + sep.map(UnicodeWidthStr::width).unwrap_or(0);

    if width <= lim {
        if let Some(sep) = sep {
            current_row.push_str(sep);
        }
        current_row.push_str(word);

        continue
    }

    if current_row.len() > 0 {
        rows.push(current_row.clone());
        current_row.clear();
    }

    current_row.push_str(word);
}

if current_row.len() > 0 {
    rows.push(current_row);
}

So we re-use the same current_row with its capacity already set somewhere up around lim instead of creating a new string buffer each time.

We also don't need to filter and copy rows, we can just return it as-is at the end of the method because it's only got valid rows in it.
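
To make the row-building loop concrete, here is a rough standalone sketch. It assumes a hypothetical wrap_words function and uses display_width as a stand-in for UnicodeWidthStr::width so that it compiles with the standard library alone; the stand-in only knows a few wide ranges and is not a substitute for the unicode-width crate.

```rust
/// Stand-in for `UnicodeWidthStr::width` from the `unicode-width` crate.
/// ASSUMPTION: only kana, CJK unified ideographs, Hangul syllables and
/// fullwidth forms count as two columns; everything else counts as one.
fn display_width(s: &str) -> usize {
    s.chars()
        .map(|c| match c as u32 {
            0x3040..=0x30FF | 0x4E00..=0x9FFF | 0xAC00..=0xD7A3 | 0xFF01..=0xFF60 => 2,
            _ => 1,
        })
        .sum()
}

/// Greedily pack `words` into rows no wider than `lim` columns.
/// A single word wider than `lim` is accepted on a row of its own.
/// `wrap_words` is a hypothetical name chosen for this sketch.
fn wrap_words(words: &[&str], lim: usize) -> Vec<String> {
    let mut rows = Vec::new();
    let mut current_row = String::new();

    for word in words {
        // A separator is only needed when the row already has content.
        let sep = if current_row.is_empty() { "" } else { " " };
        let width = display_width(&current_row) + display_width(sep) + display_width(word);

        if width <= lim {
            current_row.push_str(sep);
            current_row.push_str(word);
            continue;
        }

        // The word does not fit: flush the row and reuse its buffer.
        if !current_row.is_empty() {
            rows.push(current_row.clone());
            current_row.clear();
        }
        current_row.push_str(word);
    }

    if !current_row.is_empty() {
        rows.push(current_row);
    }
    rows
}
```

Measuring with a display-width function rather than len() is what keeps rows of double-width characters from overflowing the limit.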

What do you think?

Contributor Author

prataprc Jun 3, 2018

Thanks for the suggestion. 👍 amended the PR.

Contributor

KodrAus left a comment

This looks good to me! I think the new version of each_split_within is definitely easier to follow.

Thanks for working on this @prataprc!

Contributor

KodrAus commented Jun 9, 2018

The AppVeyor failure is transient.

KodrAus merged commit 4976a82 into rust-lang:master Jun 28, 2018