No explicit control over CR LF line ending #453

koiuo · 2022-06-02T22:23:44Z

Helix editor recently added a feature to reflow a piece of text; helix uses textwrap for this

There's an issue, where reflow does not work correctly if CRLF sequence (\r\n) is used as a line terminator
helix-editor/helix#2645

helix does not support CR ending (pre macosx), but I guess, the issue would manifest with it as well.

As far as I can see, textwrap does not account for different line ending, there's no configuration option with which we could tune textwrap behavior.

Did I miss anything? Is there a way to support CRLF in textwrap?

If not, I'd be willing to contribute a fix.
My initial idea is to introduce another option into the Options struct, named line_ending (or line_terminator as a fancier alternative). To keep things tight, I'd express it as enum (we can always create something like Custom should we want to support an arbitrary byte sequence later, but I highly doubt that). How does that sound?

The text was updated successfully, but these errors were encountered:

mgeisler · 2022-06-03T16:46:42Z

Hey @edio, thanks for opening the issue. Super exciting that textwrap is used in the editor!

You're not missing anything, textwrap is hard-coded to split on \n in textwrap::wrap. I see you're using textwrap::refill and that is also hard-coded to use \n.

I've been thinking of \n as the logical way to separate lines and \r\n as more of a fileformat encoding of the logical \n characters. That is, my opinion is that programs should use \n throughout their processing and then encode the string to the target file format (e.g., do a .replace("\n", "\r\n") call at the point where the data leaves the program).

Searching a bit, it seems that this way of thinking comes from how languages like C++ handle text mode files. Rust doesn't have a text mode, so I think my way of thinking is outdated :-D

Now, I see a few different approaches for handling this:

We could either push the problem back to the caller: demand that you use \n delimited string with Textwrap. That matches my way of thinking above. Easiest solution to implement for Textwrap 😄
Alternatively, we could split on a custom line ending as you suggest.
Finally, we could perhaps also use .lines() internally and ask people to use textwrap::wrap (which return a Vec<&str>) instead of textwrap::fill if they want to use custom line endings.

Out of the options, I think the first two are the best. Since Textwrap tries to be useful for many use-cases, I will be happy to see a custom enum like you suggest. I agree that two enum variants ought to be enough for now. Since we split the input just once, I think the performance impact should be negligible, but please compare

% cargo criterion fill_optimal_fit_ascii/4800

before and after to make sure.

mgeisler · 2022-06-03T16:51:40Z

Another thing: I think Helix is the first place which uses textwrap::refill for real. So if you run into problems with it, then please let me know. Most of the logic is in textwrap::unfill and it could use some real-world feedback since it's quite fuzzy what it should do on different input.

koiuo · 2022-06-03T17:54:27Z

@mgeisler , thanks for the response.
I do not represent the helix dev team. I just had a couple of free evenings and wanted to find an interesting task to work on.

I considered transforming the input before feeding it to textwrap, but that seemed inefficient (and most of all cumbersome).

.lines() + ~~textwrap::fill~~textwrap:wrap seems like a nice and clean workaround in terms of code. ~~That still leaves us with the default \n in the output, however (so it is still cumbersome, even if reduced by half 😁)~~

Perhaps I should summon @kirawi here, who's been handling the issue in the helix repo and, I assume, familiar with the helix codebase (unlike me).
@kirawi, please read Martin's comment #453 (comment) and share your opinion about the proposed solutions. Thank you!

EDIT: I misread. The proposal was to use lines() + textwrap::wrap. Works better for the client, but introduces a slight behavoir inconsistency for this lib: it handles any line ending in the input, but only outputs \n. Not sure if it's a problem.

mgeisler · 2022-06-05T13:18:21Z

Works better for the client, but introduces a slight behavoir inconsistency for this lib: it handles any line ending in the input, but only outputs \n. Not sure if it's a problem.

My thinking was that the client (the caller of wrap) would use \n or \r\n as desired to put the lines back together. Perhaps the caller would use neither: if the lines are meant to be displayed in an editor (:wink:), then perhaps the lines are just sent one by one to the display engine.

Thanks for the pull request, I'll go take a look now!

koiuo · 2022-06-05T21:05:39Z

@mgeisler , I benchmarked wrap with lines() vs split('\n') on 4800 input, and consistently across multiple runs the lines() version is slower. lines() additionally removes \r in the end of the line after the split, so it being slower is expected.
My laptop isn't tuned for benchmarking (variable CPU frequency, etc.), but the best case of split('\n') is 5% faster than the best case of lines().

IMO, if there's no explicit control over the line ending, lines() should be the default behavior. I'm not sure 5% performance gain is worth the reduced portability (esp. considering Rust targets Windows too), and preprocessing of input will kill the 5% gain if input has CRLF.

Anyway switching to lines() conflicts with the line_ending option. So either we do lines() and call it a day (this will solve the issue for helix), or we introduce the LineEnding.

If we do lines() now, we can still introduce the LineEnding later (in this sense it is a safe option). We'll just want to have a special LineEnding::Auto value. This would be the default, and a specific value like CRLF would give a slight performance boost.

mgeisler · 2022-06-06T16:48:07Z

@mgeisler , I benchmarked wrap with lines() vs split('\n') on 4800 input, and consistently across multiple runs the lines() version is slower. lines() additionally removes \r in the end of the line after the split, so it being slower is expected. My laptop isn't tuned for benchmarking (variable CPU frequency, etc.), but the best case of split('\n') is 5% faster than the best case of lines().

IMO, if there's no explicit control over the line ending, lines() should be the default behavior. I'm not sure 5% performance gain is worth the reduced portability (esp. considering Rust targets Windows too), and preprocessing of input will kill the 5% gain if input has CRLF.

Anyway switching to lines() conflicts with the line_ending option. So either we do lines() and call it a day (this will solve the issue for helix), or we introduce the LineEnding.

If we do lines() now, we can still introduce the LineEnding later (in this sense it is a safe option). We'll just want to have a special LineEnding::Auto value. This would be the default, and a specific value like CRLF would give a slight performance boost.

I like what you've done in #454: let the caller of wrap and friends specify what line ending to use both for splitting the lines and, in the case of fill, to use for joining the generated lines.

Let us continue discussing there!

mgeisler · 2022-09-17T19:57:55Z

Fixed by #454, thanks!

Arnavion · 2022-10-18T02:26:53Z

This was a semver breaking change since textwrap::Options is an exhaustive struct with pub fields. Code like:

fn textwrap_options() -> textwrap::Options<'static> {
	textwrap::Options {
		width: textwrap::termwidth(),
		initial_indent: "    ",
		subsequent_indent: "    ",
		break_words: true,
		wrap_algorithm: textwrap::WrapAlgorithm::OptimalFit(Default::default()),
		word_separator: textwrap::WordSeparator::UnicodeBreakProperties,
		word_splitter: textwrap::WordSplitter::NoHyphenation,
	}
}

breaks when updating from v0.15.0 to v0.15.1

mgeisler · 2022-10-18T18:35:53Z

Hi @Arnavion, oh no, I didn't think of that... Thanks for noticing! I'll yank the release and redo the release on the weekend.

I'm curious through, do you have code like that? Why would you write it like that when Options come with a built-in builder?

Arnavion · 2022-10-18T18:43:17Z

I'm curious through, do you have code like that?

Yes. It failed to compile yesterday and that's why I started investigating and reached here.

Why would you write it like that when Options come with a built-in builder?

I wanted to initialize all the fields with values of my choice, so the builder would be more verbose than setting the fields directly. Also, having a struct with pub fields is an invitation to let users set those fields manually, ie it's an explicit signal that it's okay to set those fields.

mgeisler · 2022-10-18T18:51:25Z

an invitation to let users set those fields manually, ie it's an explicit signal that it's okay to set those fields.

Yeah, I get that — I made them public to make it easy to reassign a few fields when needed. I just didn't expect anybody to actually use the struct literal syntax itself.

Thanks for letting me know, I'll fix it as soon as I can.

Arnavion · 2022-10-18T18:56:46Z

Yeah, in that case you can make it #[non_exhaustive] to prevent this from happening in the future.

This allows us to add more fields in the future without breaking backwards compatibility. This was pointed out in #453.

mgeisler · 2022-10-23T19:59:00Z

Thanks @Arnavion, I've yanked 0.15.1 and released 0.16.0 instead.

nox · 2022-10-24T07:28:40Z

Please unyank 0.15.1, clap 3.2.22 explicitly depends on it.

nox · 2022-10-24T07:35:38Z

clap-rs/clap#4418

nox · 2022-10-24T07:47:23Z

0.15.0 should have been rereleased as 0.15.2 to make sure downstream users explicitly depending on 0.15.1 don't get stuck.

Arnavion · 2022-10-24T09:08:14Z

If any crate depends on ^0.15.1 explicitly, it would normally be doing so because it depends on API newly added in 0.15.1, so it wouldn't work with 0.15.0 released as 0.15.2. So doing what you suggest would be pointless in general.

That said, clap 3 seems to have updated to 0.15.1 without depending on the change in 0.15.1 ( clap-rs/clap@2fd5507 ), so it would indeed work with 0.15.0 re-released as 0.15.2. Why it updated its dep for no reason is a mystery...

nox · 2022-10-24T12:23:06Z

If any crate depends on ^0.15.1 explicitly, it would normally be doing so because it depends on API newly added in 0.15.1, so it wouldn't work with 0.15.0 released as 0.15.2. So doing what you suggest would be pointless in general.

This is wrong. If I add a dependency to my system I always take the biggest version number, because then that enforces I don't rely on a bugfix without doing so explicitly, and it also circumvents issues with crates that don't believe in minor version bumps (e.g. serde). Anyway it's good practice to always rerelease a bigger version number whenever a version is yanked to restore whatever was before.

Arnavion · 2022-10-24T12:50:21Z

That's fine, but I hope you defensively bump every dep you have in every commit you make, no matter how unrelated it is. After all you might have added foo dep at its latest version last week, but your clean build today might be relying on a new bugfix it released yesterday. Otherwise you're not achieving the safety you desire :)

(The only way this can be solved for sure is with -Z minimal-versions.)

Anyway, this is not the place to argue about it, and I agree that there's no harm in releasing 0.15.2.

mgeisler · 2022-10-24T14:59:37Z

Hi @Arnavion and @nox Thanks for letting me know about this! I did not anticipate such a deadlock when I yanked the release... Let's discuss further fixes in #484.

This was referenced Jun 4, 2022

Fix reflow for CRLF line ending (wip) helix-editor/helix#2672

Closed

Add configurable LineEnding #454

Merged

mgeisler closed this as completed Sep 17, 2022

mgeisler added a commit that referenced this issue Oct 23, 2022

Mark Options struct as non-exhaustive

5fe6800

This allows us to add more fields in the future without breaking backwards compatibility. This was pointed out in #453.

mgeisler mentioned this issue Oct 23, 2022

Mark Options as non_exhaustive #480

Merged

This was referenced Oct 24, 2022

cargo: downgrade off of yanked version of textwrap martinvonz/jj#667

Closed

crates.io version 0.15.1 #484

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

No explicit control over CR LF line ending #453

No explicit control over CR LF line ending #453

koiuo commented Jun 2, 2022 •

edited

mgeisler commented Jun 3, 2022 •

edited

mgeisler commented Jun 3, 2022

koiuo commented Jun 3, 2022 •

edited

mgeisler commented Jun 5, 2022

koiuo commented Jun 5, 2022 •

edited

mgeisler commented Jun 6, 2022

mgeisler commented Sep 17, 2022

Arnavion commented Oct 18, 2022 •

edited

mgeisler commented Oct 18, 2022

Arnavion commented Oct 18, 2022

mgeisler commented Oct 18, 2022

Arnavion commented Oct 18, 2022

mgeisler commented Oct 23, 2022

nox commented Oct 24, 2022

nox commented Oct 24, 2022

nox commented Oct 24, 2022

Arnavion commented Oct 24, 2022

nox commented Oct 24, 2022

Arnavion commented Oct 24, 2022

mgeisler commented Oct 24, 2022

No explicit control over CR LF line ending #453

No explicit control over CR LF line ending #453

Comments

koiuo commented Jun 2, 2022 • edited

mgeisler commented Jun 3, 2022 • edited

mgeisler commented Jun 3, 2022

koiuo commented Jun 3, 2022 • edited

mgeisler commented Jun 5, 2022

koiuo commented Jun 5, 2022 • edited

mgeisler commented Jun 6, 2022

mgeisler commented Sep 17, 2022

Arnavion commented Oct 18, 2022 • edited

mgeisler commented Oct 18, 2022

Arnavion commented Oct 18, 2022

mgeisler commented Oct 18, 2022

Arnavion commented Oct 18, 2022

mgeisler commented Oct 23, 2022

nox commented Oct 24, 2022

nox commented Oct 24, 2022

nox commented Oct 24, 2022

Arnavion commented Oct 24, 2022

nox commented Oct 24, 2022

Arnavion commented Oct 24, 2022

mgeisler commented Oct 24, 2022

koiuo commented Jun 2, 2022 •

edited

mgeisler commented Jun 3, 2022 •

edited

koiuo commented Jun 3, 2022 •

edited

koiuo commented Jun 5, 2022 •

edited

Arnavion commented Oct 18, 2022 •

edited