Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No explicit control over CR LF line ending #453

Closed
koiuo opened this issue Jun 2, 2022 · 20 comments
Closed

No explicit control over CR LF line ending #453

koiuo opened this issue Jun 2, 2022 · 20 comments

Comments

@koiuo
Copy link

koiuo commented Jun 2, 2022

Helix editor recently added a feature to reflow a piece of text; helix uses textwrap for this

There's an issue, where reflow does not work correctly if CRLF sequence (\r\n) is used as a line terminator
helix-editor/helix#2645

helix does not support CR ending (pre macosx), but I guess, the issue would manifest with it as well.

As far as I can see, textwrap does not account for different line ending, there's no configuration option with which we could tune textwrap behavior.

Did I miss anything? Is there a way to support CRLF in textwrap?

If not, I'd be willing to contribute a fix.
My initial idea is to introduce another option into the Options struct, named line_ending (or line_terminator as a fancier alternative). To keep things tight, I'd express it as enum (we can always create something like Custom should we want to support an arbitrary byte sequence later, but I highly doubt that). How does that sound?

@mgeisler
Copy link
Owner

mgeisler commented Jun 3, 2022

Hey @edio, thanks for opening the issue. Super exciting that textwrap is used in the editor!

You're not missing anything, textwrap is hard-coded to split on \n in textwrap::wrap. I see you're using textwrap::refill and that is also hard-coded to use \n.

I've been thinking of \n as the logical way to separate lines and \r\n as more of a fileformat encoding of the logical \n characters. That is, my opinion is that programs should use \n throughout their processing and then encode the string to the target file format (e.g., do a .replace("\n", "\r\n") call at the point where the data leaves the program).

Searching a bit, it seems that this way of thinking comes from how languages like C++ handle text mode files. Rust doesn't have a text mode, so I think my way of thinking is outdated :-D

Now, I see a few different approaches for handling this:

  • We could either push the problem back to the caller: demand that you use \n delimited string with Textwrap. That matches my way of thinking above. Easiest solution to implement for Textwrap 😄
  • Alternatively, we could split on a custom line ending as you suggest.
  • Finally, we could perhaps also use .lines() internally and ask people to use textwrap::wrap (which return a Vec<&str>) instead of textwrap::fill if they want to use custom line endings.

Out of the options, I think the first two are the best. Since Textwrap tries to be useful for many use-cases, I will be happy to see a custom enum like you suggest. I agree that two enum variants ought to be enough for now. Since we split the input just once, I think the performance impact should be negligible, but please compare

% cargo criterion fill_optimal_fit_ascii/4800

before and after to make sure.

@mgeisler
Copy link
Owner

mgeisler commented Jun 3, 2022

Another thing: I think Helix is the first place which uses textwrap::refill for real. So if you run into problems with it, then please let me know. Most of the logic is in textwrap::unfill and it could use some real-world feedback since it's quite fuzzy what it should do on different input.

@koiuo
Copy link
Author

koiuo commented Jun 3, 2022

@mgeisler , thanks for the response.
I do not represent the helix dev team. I just had a couple of free evenings and wanted to find an interesting task to work on.

I considered transforming the input before feeding it to textwrap, but that seemed inefficient (and most of all cumbersome).

.lines() + textwrap::filltextwrap:wrap seems like a nice and clean workaround in terms of code. That still leaves us with the default \n in the output, however (so it is still cumbersome, even if reduced by half 😁)

Perhaps I should summon @kirawi here, who's been handling the issue in the helix repo and, I assume, familiar with the helix codebase (unlike me).
@kirawi, please read Martin's comment #453 (comment) and share your opinion about the proposed solutions. Thank you!

EDIT: I misread. The proposal was to use lines() + textwrap::wrap. Works better for the client, but introduces a slight behavoir inconsistency for this lib: it handles any line ending in the input, but only outputs \n. Not sure if it's a problem.

@mgeisler
Copy link
Owner

mgeisler commented Jun 5, 2022

Works better for the client, but introduces a slight behavoir inconsistency for this lib: it handles any line ending in the input, but only outputs \n. Not sure if it's a problem.

My thinking was that the client (the caller of wrap) would use \n or \r\n as desired to put the lines back together. Perhaps the caller would use neither: if the lines are meant to be displayed in an editor (:wink:), then perhaps the lines are just sent one by one to the display engine.

Thanks for the pull request, I'll go take a look now!

@koiuo
Copy link
Author

koiuo commented Jun 5, 2022

@mgeisler , I benchmarked wrap with lines() vs split('\n') on 4800 input, and consistently across multiple runs the lines() version is slower. lines() additionally removes \r in the end of the line after the split, so it being slower is expected.
My laptop isn't tuned for benchmarking (variable CPU frequency, etc.), but the best case of split('\n') is 5% faster than the best case of lines().

IMO, if there's no explicit control over the line ending, lines() should be the default behavior. I'm not sure 5% performance gain is worth the reduced portability (esp. considering Rust targets Windows too), and preprocessing of input will kill the 5% gain if input has CRLF.

Anyway switching to lines() conflicts with the line_ending option. So either we do lines() and call it a day (this will solve the issue for helix), or we introduce the LineEnding.

If we do lines() now, we can still introduce the LineEnding later (in this sense it is a safe option). We'll just want to have a special LineEnding::Auto value. This would be the default, and a specific value like CRLF would give a slight performance boost.

@mgeisler
Copy link
Owner

mgeisler commented Jun 6, 2022

@mgeisler , I benchmarked wrap with lines() vs split('\n') on 4800 input, and consistently across multiple runs the lines() version is slower. lines() additionally removes \r in the end of the line after the split, so it being slower is expected. My laptop isn't tuned for benchmarking (variable CPU frequency, etc.), but the best case of split('\n') is 5% faster than the best case of lines().

IMO, if there's no explicit control over the line ending, lines() should be the default behavior. I'm not sure 5% performance gain is worth the reduced portability (esp. considering Rust targets Windows too), and preprocessing of input will kill the 5% gain if input has CRLF.

Anyway switching to lines() conflicts with the line_ending option. So either we do lines() and call it a day (this will solve the issue for helix), or we introduce the LineEnding.

If we do lines() now, we can still introduce the LineEnding later (in this sense it is a safe option). We'll just want to have a special LineEnding::Auto value. This would be the default, and a specific value like CRLF would give a slight performance boost.

I like what you've done in #454: let the caller of wrap and friends specify what line ending to use both for splitting the lines and, in the case of fill, to use for joining the generated lines.

Let us continue discussing there!

@mgeisler
Copy link
Owner

Fixed by #454, thanks!

@Arnavion
Copy link

Arnavion commented Oct 18, 2022

This was a semver breaking change since textwrap::Options is an exhaustive struct with pub fields. Code like:

fn textwrap_options() -> textwrap::Options<'static> {
	textwrap::Options {
		width: textwrap::termwidth(),
		initial_indent: "    ",
		subsequent_indent: "    ",
		break_words: true,
		wrap_algorithm: textwrap::WrapAlgorithm::OptimalFit(Default::default()),
		word_separator: textwrap::WordSeparator::UnicodeBreakProperties,
		word_splitter: textwrap::WordSplitter::NoHyphenation,
	}
}

breaks when updating from v0.15.0 to v0.15.1

@mgeisler
Copy link
Owner

Hi @Arnavion, oh no, I didn't think of that... Thanks for noticing! I'll yank the release and redo the release on the weekend.

I'm curious through, do you have code like that? Why would you write it like that when Options come with a built-in builder?

@Arnavion
Copy link

I'm curious through, do you have code like that?

Yes. It failed to compile yesterday and that's why I started investigating and reached here.

Why would you write it like that when Options come with a built-in builder?

I wanted to initialize all the fields with values of my choice, so the builder would be more verbose than setting the fields directly. Also, having a struct with pub fields is an invitation to let users set those fields manually, ie it's an explicit signal that it's okay to set those fields.

@mgeisler
Copy link
Owner

an invitation to let users set those fields manually, ie it's an explicit signal that it's okay to set those fields.

Yeah, I get that — I made them public to make it easy to reassign a few fields when needed. I just didn't expect anybody to actually use the struct literal syntax itself.

Thanks for letting me know, I'll fix it as soon as I can.

@Arnavion
Copy link

Yeah, in that case you can make it #[non_exhaustive] to prevent this from happening in the future.

mgeisler added a commit that referenced this issue Oct 23, 2022
This allows us to add more fields in the future without breaking
backwards compatibility. This was pointed out in #453.
@mgeisler
Copy link
Owner

Thanks @Arnavion, I've yanked 0.15.1 and released 0.16.0 instead.

@nox
Copy link

nox commented Oct 24, 2022

Please unyank 0.15.1, clap 3.2.22 explicitly depends on it.

@nox
Copy link

nox commented Oct 24, 2022

clap-rs/clap#4418

@nox
Copy link

nox commented Oct 24, 2022

0.15.0 should have been rereleased as 0.15.2 to make sure downstream users explicitly depending on 0.15.1 don't get stuck.

@Arnavion
Copy link

If any crate depends on ^0.15.1 explicitly, it would normally be doing so because it depends on API newly added in 0.15.1, so it wouldn't work with 0.15.0 released as 0.15.2. So doing what you suggest would be pointless in general.

That said, clap 3 seems to have updated to 0.15.1 without depending on the change in 0.15.1 ( clap-rs/clap@2fd5507 ), so it would indeed work with 0.15.0 re-released as 0.15.2. Why it updated its dep for no reason is a mystery...

@nox
Copy link

nox commented Oct 24, 2022

If any crate depends on ^0.15.1 explicitly, it would normally be doing so because it depends on API newly added in 0.15.1, so it wouldn't work with 0.15.0 released as 0.15.2. So doing what you suggest would be pointless in general.

This is wrong. If I add a dependency to my system I always take the biggest version number, because then that enforces I don't rely on a bugfix without doing so explicitly, and it also circumvents issues with crates that don't believe in minor version bumps (e.g. serde). Anyway it's good practice to always rerelease a bigger version number whenever a version is yanked to restore whatever was before.

@Arnavion
Copy link

That's fine, but I hope you defensively bump every dep you have in every commit you make, no matter how unrelated it is. After all you might have added foo dep at its latest version last week, but your clean build today might be relying on a new bugfix it released yesterday. Otherwise you're not achieving the safety you desire :)

(The only way this can be solved for sure is with -Z minimal-versions.)

Anyway, this is not the place to argue about it, and I agree that there's no harm in releasing 0.15.2.

@mgeisler
Copy link
Owner

Hi @Arnavion and @nox Thanks for letting me know about this! I did not anticipate such a deadlock when I yanked the release... Let's discuss further fixes in #484.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants